[Feature Request] Train a gradient-boosted decision tree #28

maxencefrenette · 2024-01-03T20:02:17Z

Although transformers would probably give the best performance with enough training and hyperparameter tuning, I suspect that a gradient-boosted decision tree ensemble might outperform FSRS with very little tuning, using a methodology similar to this: https://machinelearningmastery.com/xgboost-for-time-series-forecasting/. It would, however, be a much heavier model with many more parameters than even the LSTM that was attempted.

This is something I'd be interested in exploring if I could have access to the training data.
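
A minimal sketch of the sliding-window idea behind that methodology: each review becomes one fixed-width row built from the card's previous reviews, which a tree ensemble could then be trained on. The input layout (`card_id`, `delta_t`, `rating` tuples in chronological order), the helper name `build_lag_features`, the padding value, and the rule that a rating above 1 counts as a successful recall are all assumptions made for illustration, not something specified in this issue.

```python
from collections import defaultdict


def build_lag_features(reviews, n_lags=3, pad=-1):
    """Turn per-card review histories into fixed-width rows for a tree model.

    reviews: iterable of (card_id, delta_t, rating) tuples, chronological.
    Returns (X, y). Each row of X holds the previous n_lags (delta_t, rating)
    pairs, left-padded with `pad`, followed by the current delta_t.
    y is 1 when the current rating is above 1 (assumed to mean "recalled").
    """
    history = defaultdict(list)
    X, y = [], []
    for card_id, delta_t, rating in reviews:
        past = history[card_id]
        if past:  # a card's first review has no history to predict from
            lags = past[-n_lags:]
            padding = [(pad, pad)] * (n_lags - len(lags))
            row = [value for pair in padding + lags for value in pair]
            row.append(delta_t)
            X.append(row)
            y.append(1 if rating > 1 else 0)
        past.append((delta_t, rating))
    return X, y
```

The resulting `X` and `y` lists could be passed to any gradient-boosting library; which library and which hyperparameters to use is left open here.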

L-M-Sherlock · 2024-01-04T01:43:57Z

Here are 10 users' datasets: tiny_dataset.zip

You can use them for testing your model. PR is welcome. I can help you benchmark the model.

maxencefrenette · 2024-01-04T02:23:14Z

I'll see what sort of results I can get with this. Thanks for the data!

imrryr · 2024-01-11T21:51:51Z

So, I'm trying to run your script.py with this dataset, and it creates an evaluation directory, but it is empty. (I put the dataset in the dataset directory). Can you help me with the next steps, please? By the way, this is Pavlik, working with Hannah-Joy Simms

Expertium · 2024-01-11T21:53:49Z

Not sure if that helps, but I use cmd (Windows) and the following command: set DEV_MODE=1 && python script.py

imrryr · 2024-01-11T21:58:44Z

That doesn't produce changes. I think the problem is that it may not be finding the data, but I'm not sure how to check for that.

Expertium · 2024-01-11T22:03:25Z

Do you have the fsrs-optimizer repo downloaded too? script.py relies on fsrs_optimizer.py.

import os
import sys

if os.environ.get("DEV_MODE"):
    # for local development: import fsrs_optimizer from a sibling checkout
    sys.path.insert(0, os.path.abspath("../fsrs-optimizer/src/fsrs_optimizer/"))

from fsrs_optimizer import (
    Optimizer,
    Trainer,
    FSRS,
    Collection,
    power_forgetting_curve,
)

imrryr · 2024-01-11T22:10:40Z

I did it like this. Is that right?
PS C:\Users\ppavl\Dropbox\Active projects\fsrs-benchmark> python -m pip install fsrs-optimizer
Collecting fsrs-optimizer
Using cached FSRS_Optimizer-4.20.8-py3-none-any.whl.metadata (4.2 kB)
Requirement already satisfied: matplotlib>=3.7.0 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from fsrs-optimizer) (3.8.2)
Requirement already satisfied: numpy>=1.22.4 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from fsrs-optimizer) (1.26.3)
Requirement already satisfied: pandas>=1.5.3 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from fsrs-optimizer) (2.1.4)
Requirement already satisfied: pytz>=2022.7.1 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from fsrs-optimizer) (2023.3.post1)
Requirement already satisfied: scikit-learn>=1.2.2 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from fsrs-optimizer) (1.3.2)
Requirement already satisfied: torch>=1.13.1 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from fsrs-optimizer) (2.1.2)
Collecting tqdm>=4.64.1 (from fsrs-optimizer)
Using cached tqdm-4.66.1-py3-none-any.whl.metadata (57 kB)
Collecting statsmodels>=0.13.5 (from fsrs-optimizer)
Downloading statsmodels-0.14.1-cp311-cp311-win_amd64.whl.metadata (9.8 kB)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from matplotlib>=3.7.0->fsrs-optimizer) (1.2.0)
Requirement already satisfied: cycler>=0.10 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from matplotlib>=3.7.0->fsrs-optimizer) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from matplotlib>=3.7.0->fsrs-optimizer) (4.47.2)
Requirement already satisfied: kiwisolver>=1.3.1 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from matplotlib>=3.7.0->fsrs-optimizer) (1.4.5)
Requirement already satisfied: packaging>=20.0 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from matplotlib>=3.7.0->fsrs-optimizer) (23.2)
Requirement already satisfied: pillow>=8 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from matplotlib>=3.7.0->fsrs-optimizer) (10.2.0)
Requirement already satisfied: pyparsing>=2.3.1 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from matplotlib>=3.7.0->fsrs-optimizer) (3.1.1)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from matplotlib>=3.7.0->fsrs-optimizer) (2.8.2)
Requirement already satisfied: tzdata>=2022.1 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from pandas>=1.5.3->fsrs-optimizer) (2023.4)
Requirement already satisfied: scipy>=1.5.0 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from scikit-learn>=1.2.2->fsrs-optimizer) (1.11.4)
Requirement already satisfied: joblib>=1.1.1 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from scikit-learn>=1.2.2->fsrs-optimizer) (1.3.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from scikit-learn>=1.2.2->fsrs-optimizer) (3.2.0)
Collecting patsy>=0.5.4 (from statsmodels>=0.13.5->fsrs-optimizer)
Downloading patsy-0.5.6-py2.py3-none-any.whl.metadata (3.5 kB)
Requirement already satisfied: filelock in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from torch>=1.13.1->fsrs-optimizer) (3.13.1)
Requirement already satisfied: typing-extensions in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from torch>=1.13.1->fsrs-optimizer) (4.9.0)
Requirement already satisfied: sympy in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from torch>=1.13.1->fsrs-optimizer) (1.12)
Requirement already satisfied: networkx in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from torch>=1.13.1->fsrs-optimizer) (3.2.1)
Requirement already satisfied: jinja2 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from torch>=1.13.1->fsrs-optimizer) (3.1.3)
Requirement already satisfied: fsspec in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from torch>=1.13.1->fsrs-optimizer) (2023.12.2)
Collecting colorama (from tqdm>=4.64.1->fsrs-optimizer)
Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Requirement already satisfied: six in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from patsy>=0.5.4->statsmodels>=0.13.5->fsrs-optimizer) (1.16.0)
Requirement already satisfied: MarkupSafe>=2.0 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from jinja2->torch>=1.13.1->fsrs-optimizer) (2.1.3)
Requirement already satisfied: mpmath>=0.19 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from sympy->torch>=1.13.1->fsrs-optimizer) (1.3.0)
Downloading FSRS_Optimizer-4.20.8-py3-none-any.whl (25 kB)
Downloading statsmodels-0.14.1-cp311-cp311-win_amd64.whl (9.9 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.9/9.9 MB 19.1 MB/s eta 0:00:00
Using cached tqdm-4.66.1-py3-none-any.whl (78 kB)
Downloading patsy-0.5.6-py2.py3-none-any.whl (233 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 233.9/233.9 kB 14.0 MB/s eta 0:00:00
Installing collected packages: patsy, colorama, tqdm, statsmodels, fsrs-optimizer
Successfully installed colorama-0.4.6 fsrs-optimizer-4.20.8 patsy-0.5.6 statsmodels-0.14.1 tqdm-4.66.1

Expertium · 2024-01-11T22:16:41Z

Try running this line in cmd again (and make sure that fsrs-benchmark and fsrs-optimizer have the same parent folder, for example C:\Users\ppavl\Dropbox\Active projects\fsrs-benchmark and C:\Users\ppavl\Dropbox\Active projects\fsrs-optimizer): set DEV_MODE=1 && python script.py
If that doesn't work, then idk, you'll have to wait for LMSherlock to respond.

L-M-Sherlock · 2024-01-12T02:21:49Z

> So, I'm trying to run your script.py with this dataset, and it creates an evaluation directory, but it is empty. (I put the dataset in the dataset directory). Can you help me with the next steps, please? By the way, this is Pavlik, working with Hannah-Joy Simms

Did you see the result directory?

imrryr · 2024-01-12T13:40:42Z

Yes, it was there from the start. It is unchanged after running the script

L-M-Sherlock · 2024-01-12T14:07:18Z

Could you paste the output of script displayed in the terminal?

imrryr · 2024-01-12T15:09:14Z

Yes, but it is blank:

PS C:\Users\ppavl\Dropbox\Active projects\fsrs-benchmark> $env:DEV_MODE="1"; python script.py
PS C:\Users\ppavl\Dropbox\Active projects\fsrs-benchmark>

and

PS C:\Users\ppavl\Dropbox\Active projects\fsrs-benchmark> set DEV_MODE=1
PS C:\Users\ppavl\Dropbox\Active projects\fsrs-benchmark> python script.py
PS C:\Users\ppavl\Dropbox\Active projects\fsrs-benchmark>
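
One way to rule out the environment variable is to print what Python actually sees. In PowerShell, `set` is an alias for `Set-Variable`, so `set DEV_MODE=1` does not create an environment variable there, whereas `$env:DEV_MODE="1"` does. The sketch below mirrors the `os.environ.get("DEV_MODE")` check quoted earlier in the thread; the helper name `dev_mode_enabled` is hypothetical.

```python
import os


def dev_mode_enabled(environ=None):
    """Return True when DEV_MODE is set to a non-empty value.

    Mirrors the truthiness test `if os.environ.get("DEV_MODE"):`.
    """
    env = os.environ if environ is None else environ
    return bool(env.get("DEV_MODE"))


if __name__ == "__main__":
    # Show the raw value so an unset variable (None) is easy to spot.
    print("DEV_MODE =", repr(os.environ.get("DEV_MODE")))
    print("dev mode enabled:", dev_mode_enabled())
```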

L-M-Sherlock · 2024-01-12T15:36:57Z

Weird. Nothing happened after the execution? I'm sorry I can't help you because I don't have a Windows device.

L-M-Sherlock · 2024-01-12T15:37:43Z

Could you check the file path of your dataset?

imrryr · 2024-01-12T15:51:21Z

You can see it on the left. I wasn't sure of the format, so I offered the tiny dataset as csv, in the folder, and as a zip.

L-M-Sherlock · 2024-01-13T04:45:55Z

It's weird. Could you add print(os.getcwd()) below if __name__ == "__main__":? I guess it's a path-related problem.

imrryr · 2024-01-13T18:26:36Z

It says: C:\Users\ppavl\Dropbox\Active projects\fsrs-benchmark

L-M-Sherlock · 2024-01-14T04:28:51Z

Maybe you can print(unprocessed_files) to check whether the dataset has been read.

imrryr · 2024-01-15T18:14:44Z

So, for my configuration, it wasn't overwriting the old results directory that was there from GitHub. I renamed that directory to results2, and now it creates the results directory as expected. I'll likely have some questions, so I'll send you an email unless you prefer I post them here as new issues.
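
The behaviour described here would be consistent with a "skip files that already have a result" check. The sketch below is purely hypothetical: the helper name `find_unprocessed`, the directory arguments, and the `<name>.csv` to `<name>.json` matching rule are assumptions for illustration, and the real logic in script.py may differ.

```python
from pathlib import Path


def find_unprocessed(dataset_dir, result_dir):
    """List dataset CSVs that have no matching result file yet.

    Hypothetical rule: `<name>.csv` counts as processed when `<name>.json`
    exists in result_dir. A missing result_dir means nothing is processed.
    """
    dataset_dir, result_dir = Path(dataset_dir), Path(result_dir)
    if result_dir.is_dir():
        done = {p.stem for p in result_dir.glob("*.json")}
    else:
        done = set()
    return sorted(p.name for p in dataset_dir.glob("*.csv") if p.stem not in done)
```

Under a rule like this, a results directory that already contains an entry for every dataset would leave nothing to process, so the script would exit without printing anything.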

Expertium · 2024-01-20T15:00:47Z

@imrryr how's the progress?

imrryr · 2024-01-20T20:53:51Z

Well, pretty good. I'm trying to get some appropriate data to compare this with some of our methods (e.g. https://scholar.google.com/citations?view_op=view_citation&hl=en&user=Ye48zsYAAAAJ&sortby=pubdate&citation_for_view=Ye48zsYAAAAJ:iyewoVqAXLQC ). I contacted Dae and am also looking at the MaiMemo data. I'm a little confused now since I realize I don't know the formal relationship of FSRS 4.5 and SSP-MMC. I'd be happy if someone could explain that... @Expertium

Could one simply use the MaiMemo data with the FSRS 4.5 algorithm? @L-M-Sherlock

L-M-Sherlock · 2024-01-21T02:49:57Z

> I'm a little confused now since I realize I don't know the formal relationship of FSRS 4.5 and SSP-MMC

Both are based on the DSR model. But in SSP-MMC the difficulty of cards is predetermined, because we have millions of users learning the same set of vocabulary.

> Could one simply use the MaiMemo data with the FSRS 4.5 algorithm?

It's hard, because the MaiMemo data doesn't contain every user's entire review history.

imrryr · 2024-01-21T16:19:04Z

@L-M-Sherlock OK, got it. So, DSR = difficulty, stability, retrievability... So when I unpack the SSP-MMC notation in your paper, I will see that it corresponds closely with the FSRS model, except that the difficulties are fixed in the SSP-MMC method? Also, I got the full data, so I may have more questions as I move forward on this with Hannah.

Expertium · 2024-01-21T16:29:08Z

My bad, imrryr. All this time I thought you were the person who is implementing a decision tree algorithm.
@maxencefrenette any progress?

imrryr · 2024-01-21T18:25:26Z

@L-M-Sherlock I am looking at the revlog format in the data archive. Do you have existing code to convert it to your CSV format? I guess I need to do that.

L-M-Sherlock · 2024-01-22T01:54:01Z

> Do you have existing code to convert it to your CSV format? I guess I need to do that.

Do you mean this?

https://github.com/open-spaced-repetition/fsrs-optimizer/blob/8ce183629bdd56cf6a4eced66df121caecaef92e/src/fsrs_optimizer/fsrs_optimizer.py#L476-L693

imrryr · 2024-01-22T19:40:19Z

@L-M-Sherlock Maybe I do, but the format this code creates is different from the one in the dataset folder. Do you know how to convert it into the format needed for input? For example:

card_id,review_th,delta_t,rating
0,1,-1,3
0,2,0,3
0,3,4,3

Can you elaborate on how to get to this final format? I may be able to write the code from what you sent already, but help is appreciated.

Also:
review_th - is this the order in which the reviews occurred?
delta_t - is this the interval between a card's successive reviews (with 0 indicating less than a day)?
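
For anyone working with this format, a minimal sketch of reading the four-column CSV shown above and grouping the rows by card. It only parses the columns; it makes no claim about what `review_th` and `delta_t` mean. The helper name `load_reviews` and the choice to sort each card's rows by `review_th` are assumptions.

```python
import csv
import io
from collections import defaultdict

SAMPLE = """card_id,review_th,delta_t,rating
0,1,-1,3
0,2,0,3
0,3,4,3
"""


def load_reviews(text):
    """Parse the CSV text into {card_id: [(review_th, delta_t, rating), ...]}."""
    cards = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        cards[int(row["card_id"])].append(
            (int(row["review_th"]), int(row["delta_t"]), int(row["rating"]))
        )
    # Sort each card's rows by review_th so they are in sequence order.
    for reviews in cards.values():
        reviews.sort()
    return dict(cards)
```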

L-M-Sherlock · 2024-01-23T03:03:14Z

> Can you elaborate on how to get to this final format? I may be able to write the code from what you sent already, but help is appreciated.

The code used to generate data in that format is here: https://github.com/open-spaced-repetition/fsrs-benchmark/blob/main/revlogs2dataset.py

imrryr · 2024-01-26T20:24:39Z

So, this code seemed to work at first, but it doesn't produce the same results as the tiny dataset. It's weirdly similar, with the same number of card_id values and the same length, just with corrupted review_th and delta_t. For example, the correct file:
card_id,review_th,delta_t,rating
0,1,-1,3
0,2,0,3
0,3,4,3
0,163,6,4
0,237,1,2
0,380,11,4
1,4,-1,3
1,14,0,1
1,16,0,1
1,21,0,3
1,30,0,3
1,111,2,3
1,160,4,4
1,340,8,3

the output I get from revlogs2dataset.py:
card_id,review_th,delta_t,rating
card_id,review_th,delta_t,rating
0,4863,-1,3
0,4864,0,3
0,4997,4,3
0,5846,5,4
0,6105,2,2
0,6745,10,4
1,4998,-1,3
1,5008,0,1
1,5010,0,1
1,5015,0,3
1,5024,0,3
1,5276,1,3
1,5843,4,4
1,6371,9,3
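
To pin down which columns diverge, the two files can be compared row by row. This is a minimal sketch that assumes both CSV texts have the same header and the same row order; the helper name `column_mismatches` is hypothetical. A repeated header line, as in the output above, would need to be dropped first, since `csv.DictReader` would otherwise read it as a data row.

```python
import csv
import io


def column_mismatches(expected_csv, actual_csv):
    """Count, per column, how many aligned rows differ between two CSV texts."""
    expected = list(csv.DictReader(io.StringIO(expected_csv)))
    actual = list(csv.DictReader(io.StringIO(actual_csv)))
    if len(expected) != len(actual):
        raise ValueError(f"row counts differ: {len(expected)} vs {len(actual)}")
    if not expected:
        return {}
    counts = {name: 0 for name in expected[0]}
    for exp_row, act_row in zip(expected, actual):
        for name in counts:
            if exp_row[name] != act_row[name]:
                counts[name] += 1
    return counts
```

A column with zero mismatches (here, likely `card_id` and `rating`) can then be excluded from the search for the bug.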

L-M-Sherlock · 2024-01-27T06:25:16Z

> So, this code seemed to work at first, but doesn't produce the same results as the tiny dataset had.

Please open a new issue to report the details. I hope you can share the revlogs file and your script code.

Expertium · 2024-02-05T11:05:45Z

Well that's a bummer. Why did you close it?

L-M-Sherlock · 2024-02-05T12:10:04Z

Because I don't plan to implement the model and I have shared the dataset with the creator of this issue.

Expertium · 2024-02-05T12:13:30Z

Yeah, but did the creator of the issue himself say that he's not planning to work on it?

maxencefrenette · 2024-02-11T07:33:31Z

Hi all, I'm still working on this, but progress is slow since I don't have a ton of time to spend on this. I got what I wanted out of this issue, which is a public subset of the data, thanks a lot for that. I'm okay with closing this, I don't need the issue to be open to work on it.

Expertium · 2024-02-13T14:25:25Z

@maxencefrenette I think it's best to keep the number of trainable parameters around 500-600, since that's roughly how many parameters our LSTM and Transformer have. Ideally, we want to see how much architecture affects the results. If the number of parameters across different algorithms is similar, then we can clearly see which architecture is superior.
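
How to count the "parameters" of a tree ensemble is not defined in this thread. As a rough sketch under one possible convention (an assumption, not an agreed definition), each internal node of a full binary tree contributes one split threshold and each leaf contributes one value, while feature indices are not counted. The helper name `tree_parameter_count` is hypothetical.

```python
def tree_parameter_count(n_trees, depth):
    """Count split thresholds and leaf values in an ensemble of full binary trees.

    Assumption: one threshold per internal node and one value per leaf;
    the feature index chosen at each split is not counted.
    """
    internal_nodes = 2 ** depth - 1
    leaves = 2 ** depth
    return n_trees * (internal_nodes + leaves)
```

Under this convention, 40 trees of depth 3 give 40 × 15 = 600 parameters, which sits at the upper end of the 500-600 range mentioned above.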

Expertium · 2024-03-01T15:49:52Z

@maxencefrenette Hello again! LMSherlock and I have redefined RMSE and are finishing benchmarking the algorithms again. If you still want to participate (and I hope you do), now is a good time.

l3kn mentioned this issue Jan 28, 2024

[Question] A “raw” version of the tiny_dataset.zip #43

Closed

L-M-Sherlock closed this as not planned Feb 5, 2024
