Added dataset split functionality (LRS) to use pre-configured train\validation\test data splits #1950

Open
wants to merge 1 commit into
base: master

Conversation

jackmedda
Contributor

As far as I'm aware, there is no way in RecBole to use pre-configured train/validation/test data splits. Even though custom datasets can be used, custom splits cannot be adopted, and only ratio-based split (RS) and leave-one-out (LS) are available.
As a result, RecBole cannot easily be used to reproduce experiments that rely on the same data configuration as another study.

I have been using this modification of the Dataset class for a few months, and it works smoothly with custom datasets.
A user can add three "atomic" files with the extensions ".train", ".validation", and ".test", which will be loaded and used to generate the respective dataloaders.
For instance, I used the dataset 'lastfm', and my dataset folder looks like the following:

├── dataset
│   ├── lastfm
│   │   ├── lastfm.inter
│   │   ├── lastfm.user
│   │   ├── lastfm.item
│   │   ├── lastfm.train
│   │   ├── lastfm.validation
│   │   ├── lastfm.test

The files with the extensions ".train", ".validation", and ".test" are plain pickle files, each containing a dictionary with two keys: uid_field and iid_field.
The main idea is that data splitting should be based only on the user ids and the item ids.

With uid_field = "user_id" and iid_field = "item_id", the dictionary has the following format:

{
    "user_id": ["2a4", "8d5", "1b6", ...],
    "item_id": ["9h6", "4v3", "7m5", ...],
}

The two vectors contain the user ids and the item ids, respectively. The user at a given index interacted with the item at the same index, exactly as in the "inter" dataframe. So the first interaction is ("2a4", "9h6"), the second is ("8d5", "4v3"), and so on.
Each vector can be encoded as a Python list, a numpy array, or a torch.Tensor.
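As an illustration of the file format described above, here is a minimal sketch of how the three split files could be written with the standard pickle module. The helper name write_split_files and the toy ids are hypothetical and not part of this PR; the sketch assumes the vectors are plain Python lists and writes to a temporary directory rather than a real dataset folder.

```python
import pickle
import tempfile
from pathlib import Path


def write_split_files(dataset_dir, dataset_name, splits,
                      uid_field="user_id", iid_field="item_id"):
    """Pickle each split to <dataset_dir>/<dataset_name>.<split_name>.

    Hypothetical helper for illustration; not part of RecBole or this PR.
    """
    dataset_dir = Path(dataset_dir)
    dataset_dir.mkdir(parents=True, exist_ok=True)
    written = {}
    for split_name, pairs in splits.items():
        users, items = pairs[uid_field], pairs[iid_field]
        # The two vectors are index-aligned, so they must have equal length.
        if len(users) != len(items):
            raise ValueError(f"{split_name}: user and item vectors differ in length")
        path = dataset_dir / f"{dataset_name}.{split_name}"
        with open(path, "wb") as f:
            pickle.dump({uid_field: users, iid_field: items}, f)
        written[split_name] = path
    return written


# Toy ids, made up for illustration only.
toy_splits = {
    "train": {"user_id": ["2a4", "8d5", "1b6"], "item_id": ["9h6", "4v3", "7m5"]},
    "validation": {"user_id": ["2a4"], "item_id": ["4v3"]},
    "test": {"user_id": ["8d5"], "item_id": ["7m5"]},
}

with tempfile.TemporaryDirectory() as tmp:
    paths = write_split_files(Path(tmp) / "lastfm", "lastfm", toy_splits)
    # Read one file back to check that the round trip preserves the pairs.
    with open(paths["train"], "rb") as f:
        reloaded_train = pickle.load(f)
    written_names = sorted(p.name for p in paths.values())
```

In real use, dataset_dir would point to the dataset folder shown earlier (for example dataset/lastfm), next to the ".inter" file.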

The function split_by_loaded_splits generates the splits from the interactions contained in each dictionary. The train subdataset will contain the (user id, item id) pairs listed in the ".train" file, the validation subdataset the pairs listed in the ".validation" file, and the test subdataset the pairs listed in the ".test" file.
The validation split is optional.
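The selection logic can be sketched in plain Python as follows. This is not the code of split_by_loaded_splits from this PR, only a simplified illustration of the idea: each interaction row is assigned to a split when its (user id, item id) pair appears in that split's dictionary. The function name split_by_pairs and the toy data are hypothetical, and the sketch assumes interactions are a list of dicts and the id vectors are plain Python lists.

```python
def split_by_pairs(interactions, train, test, validation=None,
                   uid_field="user_id", iid_field="item_id"):
    """Select the interaction rows belonging to each pre-configured split.

    Hypothetical sketch: `interactions` is a list of dicts, and each split is
    a dict of two index-aligned lists (user ids and item ids).
    """
    splits = {"train": train, "validation": validation, "test": test}
    result = {}
    for name, pairs in splits.items():
        if pairs is None:
            # The validation split is optional, so it may be missing.
            continue
        wanted = set(zip(pairs[uid_field], pairs[iid_field]))
        result[name] = [
            row for row in interactions
            if (row[uid_field], row[iid_field]) in wanted
        ]
    return result


# Toy interactions, made up for illustration only.
interactions = [
    {"user_id": "2a4", "item_id": "9h6", "timestamp": 1},
    {"user_id": "8d5", "item_id": "4v3", "timestamp": 2},
    {"user_id": "1b6", "item_id": "7m5", "timestamp": 3},
]
train = {"user_id": ["2a4", "8d5"], "item_id": ["9h6", "4v3"]}
test = {"user_id": ["1b6"], "item_id": ["7m5"]}

# No validation split is passed, so only "train" and "test" are returned.
subsets = split_by_pairs(interactions, train, test)
```

Matching on the pair rather than on the user or the item alone keeps any extra columns of the interaction (such as a timestamp) attached to the selected rows.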

To use this split (LRS, "load ready splits"), the config file should include the following eval_args key:

eval_args:
    split: {'LRS': None}
    order: RO  # not relevant
    group_by: '-' # not relevant
    mode: 'full'
