Added dataset split functionality (LRS) to use pre-configured train\validation\test data splits #1950
As far as I'm aware, there is no way in RecBole to use pre-configured train/validation/test data splits. Even though custom datasets can be used, custom splits cannot be adopted; only random split (RS) and leave-one-out (LS) are available.
As a result, RecBole cannot easily be used for reproducible experiments that apply the same data configuration as another study.
I have been using this modification of the Dataset class for a few months, and it works seamlessly with custom datasets.
A user can add 3 "atomic" files with extension ".train", ".validation", and ".test", which will be loaded and used to generate the respective dataloaders.
For instance, I used the dataset 'lastfm' and my dataset folder is like the following:
The files with extension ".train", ".validation", and ".test" are simple Pickle files containing a dictionary with two keys: uid_field and iid_field.
The main idea is that data splitting should only be based on the user ids and the item ids.
Let uid_field = "user_id" and iid_field = "item_id"; then the dictionary maps each key to a vector of ids. Each vector contains the ids of users and items, where each user interacted with the item at the corresponding index, exactly like the "inter" dataframe. So the first interaction is ("2a4", "9h6"), the second is ("8d5", "4v3"), and so on. Each vector can be encoded as a Python list, a numpy array, or a torch.Tensor.
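As a minimal sketch, this is how one of these split files could be produced (the ids and the "lastfm.train" filename are illustrative, not taken from a real dataset; the dictionary keys must match uid_field and iid_field from the config):

```python
import pickle

# Hypothetical training split: aligned id vectors, one interaction per index.
# Index 0 encodes the interaction ("2a4", "9h6"), index 1 encodes ("8d5", "4v3"), etc.
train_split = {
    "user_id": ["2a4", "8d5", "2a4"],
    "item_id": ["9h6", "4v3", "7k1"],
}

# Write the split as a plain Pickle file next to the other atomic files.
with open("lastfm.train", "wb") as f:
    pickle.dump(train_split, f)

# Reloading yields the same dictionary of aligned id vectors.
with open("lastfm.train", "rb") as f:
    loaded = pickle.load(f)

print(loaded["user_id"][0], loaded["item_id"][0])  # → 2a4 9h6
```

The same format applies to the ".validation" and ".test" files; lists could equally be numpy arrays or torch.Tensors.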
The function split_by_loaded_splits generates the splits on the basis of the interactions contained in each dictionary. The train sub-dataset will contain the pairs of user ids and item ids from the ".train" file, the validation sub-dataset those from the ".validation" file, and the test sub-dataset those from the ".test" file. The validation split is optional.
To use this split (LRS => load ready splits), the config file should include the following eval_args key:
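A minimal sketch of what such a config could look like, assuming LRS is selected through the same split sub-key that RecBole uses for RS and LS (the other eval_args entries shown are RecBole's usual defaults, included only for context):

```yaml
eval_args:
  split: {'LRS': None}   # load ready splits from the .train/.validation/.test files
  group_by: user
  order: RO
  mode: full
```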