
Fix train state and modification time for unfinished project training #722

Merged
merged 2 commits into main from fix-train-state-and-mtime-interrupted-train
Aug 8, 2023

Conversation

juhoinkinen
Member

@juhoinkinen juhoinkinen commented Aug 4, 2023

If the initial training of a project is interrupted after some files have already been created in the project's data directory, the train state and modification time turn out incorrect: the project appears to be (fully) trained when it is not, and it reports a modification time. Similarly, when retraining a project is interrupted, the modification time is falsely updated.

This problem would occur more commonly when/if the --prepare-only option to the train command is implemented.

This PR makes the methods that query the train state and modification time ignore files in the project's datadir matching the patterns *-train*, tmp-* and vectorizer (a minimal sketch of the idea follows at the end of this description). The tmp- prefix is now added to all temporary files, because some backends use a tempfile for the model file during training, which can remain behind after unfinished training, e.g. stwfsa_predictor1mz8z4im.zip.

The train and temp files should definitely be ignored, but the vectorizer file case is not so clear:

  • a project is not trained when only the vectorizer file exists (this PR implements this)
  • when the vectorizer file (of a fully trained project) changes, the results of the model change, so the modification time should also be based on the vectorizer file's mtime (this PR does not implement this)

Instead of using global ignore patterns, this functionality could use the actual model file names/patterns per backend, but the field storing them varies a bit (MODEL_FILE, INDEX_FILE, MODEL_FILE_PREFIX).
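For illustration, here is a minimal sketch of the pattern-based filtering, with hypothetical helper names (not the actual Annif code):

```python
import os
from fnmatch import fnmatch
from glob import glob

# the ignore patterns described above
IGNORE_PATTERNS = ("*-train*", "tmp-*", "vectorizer")

def _model_files(datadir):
    """Yield paths in datadir that are not matched by any ignore pattern."""
    for path in glob(os.path.join(datadir, "*")):
        fname = os.path.basename(path)
        if not any(fnmatch(fname, pat) for pat in IGNORE_PATTERNS):
            yield path

def is_trained(datadir):
    # a project counts as trained only if some non-ignored file exists
    return any(_model_files(datadir))

def modification_time(datadir):
    # latest mtime over non-ignored files; None when nothing qualifies
    mtimes = [os.path.getmtime(p) for p in _model_files(datadir)]
    return max(mtimes, default=None)
```

With filtering like this, a datadir containing only a vectorizer file and tmp-* leftovers reports the project as untrained, and an interrupted retraining no longer bumps the modification time.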

@juhoinkinen juhoinkinen added the bug label Aug 4, 2023
@codecov

codecov bot commented Aug 4, 2023

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (02f1533) 99.67% compared to head (54f4136) 99.67%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #722   +/-   ##
=======================================
  Coverage   99.67%   99.67%           
=======================================
  Files          89       89           
  Lines        6380     6397   +17     
=======================================
+ Hits         6359     6376   +17     
  Misses         21       21           
Files Changed              Coverage Δ
annif/backend/backend.py   100.00% <100.00%> (ø)
annif/util.py               98.48% <100.00%> (+0.02%) ⬆️
tests/test_project.py      100.00% <100.00%> (ø)

☔ View full report in Codecov by Sentry.

@sonarcloud

sonarcloud bot commented Aug 4, 2023

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
1 Code Smell (rating A)

No coverage information
0.0% Duplication

@juhoinkinen
Member Author

To make the state of projects more coherent, the first step of training could be to remove old data files. Then there would be no problem with the modification time of the vectorizer file (updating it without updating the rest of the model could also leave the project broken, either silently or completely, because of a wrong index): it would be clear that the vectorizer file affects neither the modification time nor the train state of the project.

This would also keep the project's datadir clean of any temp files hanging around.

However, this would also mean less graceful behavior: if retraining a working project fails, the project is gone, whereas now the old project can remain working (depending on which step fails).

@osma osma left a comment
Member

I think the implemented approach looks sensible.

Regarding this idea:

To make the state of projects more coherent, the first step of training could be to remove old data files. Then there would be no problem with the modification time of the vectorizer file (updating it without updating the rest of the model could also leave the project broken, either silently or completely, because of a wrong index): it would be clear that the vectorizer file affects neither the modification time nor the train state of the project.

This would also keep the project's datadir clean of any temp files hanging around.

However, this would also mean less graceful behavior: if retraining a working project fails, the project is gone, whereas now the old project can remain working (depending on which step fails).

I think that this would be a step in the wrong direction. It would be better to make train operations more atomic, so that vectorizers, train files, models etc. are first written into temporary files and only renamed at the very end, replacing the existing files (if any). This is how annif.util.atomic_save is intended to work, but it only applies to individual files, not to sets of files that need to be kept in sync (e.g. vectorizer + model).
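As a rough sketch of what a multi-file variant of that idea could look like (hypothetical helper reusing the tmp- naming from this PR; not part of the actual annif.util API):

```python
import os
from contextlib import contextmanager

@contextmanager
def atomic_save_set(final_paths):
    """Yield temporary paths to write to; rename them over the final
    paths only if the whole block succeeds, otherwise discard them."""
    tmp_paths = [
        os.path.join(os.path.dirname(p) or ".", "tmp-" + os.path.basename(p))
        for p in final_paths
    ]
    try:
        yield tmp_paths
    except Exception:
        for tmp in tmp_paths:  # remove partial output on failure
            if os.path.exists(tmp):
                os.remove(tmp)
        raise
    for tmp, final in zip(tmp_paths, final_paths):
        # each os.replace is atomic; the loop as a whole is not, but the
        # window of inconsistency shrinks to a few renames
        os.replace(tmp, final)
```

Training could then write the vectorizer and the model inside one `with atomic_save_set([vec_path, model_path]) as (tmp_vec, tmp_model):` block, so an interrupted run leaves the old files untouched.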

@juhoinkinen juhoinkinen marked this pull request as ready for review August 8, 2023 08:43
@juhoinkinen juhoinkinen added this to the 1.0 milestone Aug 8, 2023
@juhoinkinen juhoinkinen merged commit 2c5ee1f into main Aug 8, 2023
13 of 14 checks passed
@juhoinkinen juhoinkinen deleted the fix-train-state-and-mtime-interrupted-train branch August 8, 2023 08:44