
Cannot reproduce the classification result of TimesNet #494

Closed
ArmandXiao opened this issue Aug 11, 2024 · 4 comments

@ArmandXiao commented Aug 11, 2024

> Hi everyone, I did a preliminary comparison between our experimental code and TSLib. The main difference is the learning rate strategy: when organizing the algorithm library, we added an extra learning rate decay step. However, this design reduces the variability of model training, which is actually unfriendly to datasets with relatively little data. We have therefore removed this step in a recent commit (1c7f843). On my side the results can now be reproduced; please try again.

As mentioned in that issue, I added the two lines from commit 1c7f843.
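For context, here is a minimal sketch of what the change amounts to, assuming a TSLib-style `adjust_learning_rate` helper; the exact diff of 1c7f843 is not quoted in this thread, so the helper and the loop below are illustrative, not the repo's actual code:

```python
import torch
import torch.nn as nn

def adjust_learning_rate(optimizer, epoch, base_lr):
    # TSLib-style "type1" schedule: halve the learning rate every epoch
    # (modeled on utils/tools.py; treat the details as an assumption here).
    lr = base_lr * (0.5 ** (epoch - 1))
    for group in optimizer.param_groups:
        group["lr"] = lr

model = nn.Linear(8, 2)  # stand-in for TimesNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(1, 11):
    # ... one epoch of training would run here ...
    # Before the fix, the classification loop also decayed the LR each epoch:
    # adjust_learning_rate(optimizer, epoch, base_lr=1e-3)
    # After commit 1c7f843 the decay is skipped, so the LR stays constant,
    # which the maintainer reports suits the small UEA datasets better.
    pass
```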

However, I am still not able to reproduce the results. Moreover, the average accuracy dropped after applying the amendment from that commit.

Here are my results with TimesNet (accuracy, %):

Dataset                 Table 17   My reproduction   After commit 1c7f843
EthanolConcentration    35.7       28.9              28.1
FaceDetection           68.6       66.3              68.0
Handwriting             32.1       31.8              17.4
Heartbeat               78.0       77.1              76.1
JapaneseVowels          98.4       97.3              93.0
PEMS-SF                 89.6       86.7              75.7
SelfRegulationSCP1      91.8       89.8              90.1
SelfRegulationSCP2      57.2       51.1              52.2
SpokenArabicDigits      99.0       99.2              98.8
UWaveGestureLibrary     85.3       88.1              85.6
Avg                     73.6       71.6              68.5

Thank you for your help.

@wuhaixu2016 (Collaborator) commented Aug 11, 2024

Many thanks for your detailed reproduction and for pointing out the problem with the learning rate scheduling strategy.

(1) As I stated in the previous issue, some of the UEA datasets suffer from severely limited data, so their performance can be unstable. For example, in my experimental environment (without the learning rate decay strategy), the Handwriting accuracy is 0.33647058823529413. Here is the training log for this task.

Handwritting.log

(2) To clarify, I will make the training checkpoints from my experiments public within two weeks.

@eiriksteen commented Aug 17, 2024

I have the same problem: I am not able to reproduce the results. How can we get past this when training our own models? I have a model that surpasses the TimesNet results I have been able to reproduce, but not the ones in the paper. How can I be sure that my model is not trained in a suboptimal way, leading to underestimated metrics?

In general, why aren't the metrics computed over multiple runs, with the mean and standard deviation reported as the final values?

@wuhaixu2016 (Collaborator)

Many thanks for your question and valuable discussion. I have uploaded the checkpoint files and training log here: https://cloud.tsinghua.edu.cn/d/caefcdb63eee4adfad86/

Here is a summary of our experiments (see classification.log), accuracy in %:

Dataset                 Table 17   Our Exp
EthanolConcentration    35.7       31.94
FaceDetection           68.6       67.45
Handwriting             32.1       32.47
Heartbeat               78.0       80.97
JapaneseVowels          98.4       97.84
PEMS-SF                 89.6       88.44
SelfRegulationSCP1      91.8       91.46
SelfRegulationSCP2      57.2       60.00
SpokenArabicDigits      99.0       98.95
UWaveGestureLibrary     85.3       88.13
Avg                     73.6       73.76

(1) The inconsistency between Table 17 and Our Exp

As stated in the previous issue #321 (comment), our original experimental code is based on this repo: https://github.com/thuml/Flowformer. To make the open-sourced code easy to read, I spent two weeks reorganizing it, unifying the five tasks in a shared code base, namely TSLib. During that reorganization I may have lost some details, such as the learning rate strategy, which is fixed in commit 1c7f843 (although I do remember that before making this repo public, I verified all the results could be reproduced).

In my current experiments, the average accuracy can be reproduced (slightly better than in the original paper). The only task that fails is EthanolConcentration (35.7 vs. 31.94). I plan to go back to my original code base and compare the training in every detail; if I get new results, I will update them here, which may take some time.

(2) About the performance variance.

I have run multiple seeds and reported the standard deviation in our paper, which is around 0.1% for the average performance. The small subsets are affected by random seeds in different ways, but those effects largely cancel out across datasets, yielding a fairly stable final average.
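As a concrete illustration of this multi-seed protocol, here is a minimal sketch; `run_experiment` and its accuracy spread are made-up placeholders, not numbers from the paper:

```python
import random
import statistics

def run_experiment(seed: int) -> float:
    # Stand-in for one full train/eval run; returns a test accuracy (%).
    # Replace with a real training run; the spread here is illustrative.
    random.seed(seed)
    return 32.0 + random.uniform(-2.0, 2.0)

accs = [run_experiment(s) for s in range(5)]
mean, std = statistics.mean(accs), statistics.stdev(accs)
print(f"accuracy = {mean:.2f} ± {std:.2f} over {len(accs)} seeds")
```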

To avoid the high-variance tasks, I would suggest omitting EthanolConcentration, Handwriting, and UWaveGestureLibrary, and trying some EEG datasets, which we experimented with in this paper: https://arxiv.org/abs/2402.02475.
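For reference, a quick way to recompute the reported average while dropping those three tasks (values copied from the "Our Exp" column of the table above):

```python
# "Our Exp" accuracies copied from the table above.
our_exp = {
    "EthanolConcentration": 31.94, "FaceDetection": 67.45,
    "Handwriting": 32.47, "Heartbeat": 80.97, "JapaneseVowels": 97.84,
    "PEMS-SF": 88.44, "SelfRegulationSCP1": 91.46,
    "SelfRegulationSCP2": 60.00, "SpokenArabicDigits": 98.95,
    "UWaveGestureLibrary": 88.13,
}
high_variance = {"EthanolConcentration", "Handwriting", "UWaveGestureLibrary"}
kept = [acc for name, acc in our_exp.items() if name not in high_variance]
print(f"average over the remaining {len(kept)} tasks: {sum(kept) / len(kept):.2f}")
```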

Sorry for the inconvenience. If you have any questions, please email me or open an issue in the repo.

@eiriksteen commented Aug 19, 2024

Thank you for the thorough response!
