Added dataset split functionality (LRS) to use pre-configured train\validation\test data splits #1950
As far as I'm aware, there is no way in RecBole to use pre-configured train/validation/test data splits. Even though custom datasets can be used, custom splits cannot be adopted; only random split (RS) and leave-one-out (LS) are available.
As a result, RecBole cannot easily be used for reproducible experiments that apply the same data configuration as another study.
I have been using this modification of the Dataset class for a few months, and it works seamlessly with custom datasets.
A user can add 3 "atomic" files with extension ".train", ".validation", and ".test", which will be loaded and used to generate the respective dataloaders.
For instance, I used the dataset 'lastfm' and my dataset folder is like the following:
The files with extension ".train", ".validation", and ".test" are simple Pickle files containing a dictionary with two keys: uid_field and iid_field.
The main idea is that data splitting should only be based on the user ids and the item ids.
Let uid_field = "user_id" and iid_field = "item_id"; then the dictionary maps each key to a vector of ids. Each vector contains the ids of users and items, where each user interacted with the item at the corresponding index, exactly like the "inter" dataframe. So the first interaction is ("2a4", "9h6"), the second is ("8d5", "4v3"), and so on. Each vector can be encoded as a Python list, a numpy array, or a torch.Tensor.
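As a minimal sketch, this is how one of these split files could be produced (the ids and the "lastfm.train" filename are illustrative, not taken from a real dataset; the dictionary keys must match uid_field and iid_field from the config):

```python
import pickle

# Hypothetical training split: aligned id vectors, one interaction per index.
# Index 0 encodes the interaction ("2a4", "9h6"), index 1 encodes ("8d5", "4v3"), etc.
train_split = {
    "user_id": ["2a4", "8d5", "2a4"],
    "item_id": ["9h6", "4v3", "7k1"],
}

# Write the split as a plain Pickle file next to the other atomic files.
with open("lastfm.train", "wb") as f:
    pickle.dump(train_split, f)

# Reloading yields the same dictionary of aligned id vectors.
with open("lastfm.train", "rb") as f:
    loaded = pickle.load(f)

print(loaded["user_id"][0], loaded["item_id"][0])  # → 2a4 9h6
```

The same format applies to the ".validation" and ".test" files; lists could equally be numpy arrays or torch.Tensors.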
The function split_by_loaded_splits generates the splits on the basis of the interactions contained in each dictionary. The train sub-dataset will contain the pairs of user ids and item ids from the ".train" file, the validation sub-dataset those from the ".validation" file, and the test sub-dataset those from the ".test" file. The validation split is optional.
To use this split (LRS => load ready splits), the config file should include the following eval_args key:
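A minimal sketch of what such a config could look like, assuming LRS is selected through the same split sub-key that RecBole uses for RS and LS (the other eval_args entries shown are RecBole's usual defaults, included only for context):

```yaml
eval_args:
  split: {'LRS': None}   # load ready splits from the .train/.validation/.test files
  group_by: user
  order: RO
  mode: full
```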