
Fix train state and modification time for unfinished project training #722

Merged
merged 2 commits into main from fix-train-state-and-mtime-interrupted-train
Aug 8, 2023

Conversation

juhoinkinen
Member

@juhoinkinen juhoinkinen commented Aug 4, 2023

If the initial training of a project is interrupted after some files have already been created in the project's data directory, the train state and modification time turn out incorrect: the project appears to be (fully) trained when it is not, and it reports a modification time. Similarly, when retraining a project is interrupted, the modification time is falsely updated.

This problem would occur more commonly when/if the --prepare-only option to the train command is implemented.

This PR makes the methods that query the train state and modification time ignore files in the project's datadir matching the patterns *-train*, tmp-* and vectorizer (a minimal sketch of the idea follows at the end of this description). The tmp- prefix is now added to all temporary files, because some backends use a tempfile for the model file during training, which can remain behind after unfinished training, e.g. stwfsa_predictor1mz8z4im.zip.

The train and temp files should definitely be ignored, but the vectorizer file case is not so clear:

  • a project is not trained when only the vectorizer file exists (this PR implements this)
  • when the vectorizer file (of a fully trained project) changes, the results of the model change, so the modification time should also be based on the vectorizer file's mtime (this PR does not implement this)

Instead of using global ignore patterns, this functionality could use the actual model file names/patterns per backend, but the field storing them varies a bit (MODEL_FILE, INDEX_FILE, MODEL_FILE_PREFIX).
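For illustration, here is a minimal sketch of the pattern-based filtering, with hypothetical helper names (not the actual Annif code):

```python
import os
from fnmatch import fnmatch
from glob import glob

# the ignore patterns described above
IGNORE_PATTERNS = ("*-train*", "tmp-*", "vectorizer")

def _model_files(datadir):
    """Yield paths in datadir that are not matched by any ignore pattern."""
    for path in glob(os.path.join(datadir, "*")):
        fname = os.path.basename(path)
        if not any(fnmatch(fname, pat) for pat in IGNORE_PATTERNS):
            yield path

def is_trained(datadir):
    # a project counts as trained only if some non-ignored file exists
    return any(_model_files(datadir))

def modification_time(datadir):
    # latest mtime over non-ignored files; None when nothing qualifies
    mtimes = [os.path.getmtime(p) for p in _model_files(datadir)]
    return max(mtimes, default=None)
```

With filtering like this, a datadir containing only a vectorizer file and tmp-* leftovers reports the project as untrained, and an interrupted retraining no longer bumps the modification time.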

@juhoinkinen juhoinkinen added the bug label Aug 4, 2023
@codecov

codecov bot commented Aug 4, 2023

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (02f1533) 99.67% compared to head (54f4136) 99.67%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #722   +/-   ##
=======================================
  Coverage   99.67%   99.67%           
=======================================
  Files          89       89           
  Lines        6380     6397   +17     
=======================================
+ Hits         6359     6376   +17     
  Misses         21       21           
Files Changed              Coverage Δ
annif/backend/backend.py   100.00% <100.00%> (ø)
annif/util.py               98.48% <100.00%> (+0.02%) ⬆️
tests/test_project.py      100.00% <100.00%> (ø)

☔ View full report in Codecov by Sentry.

@sonarcloud

sonarcloud bot commented Aug 4, 2023

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
1 Code Smell (rating A)

No coverage information
0.0% Duplication

@juhoinkinen
Member Author

To make the state of projects more coherent, the first step of training could be to remove old data files. Then there would be no problem with the modification time of the vectorizer file (updating it without updating the rest of the model could also leave the project broken, either silently or completely, because of a wrong index): it would be clear that the vectorizer file affects neither the modification time nor the train state of the project.

This would also keep the project's datadir clean of any temp files hanging around.

However, this would also mean less graceful behavior: if retraining a working project fails, the project is gone, whereas now the old project can remain working (depending on which step fails).

@osma osma left a comment
Member

I think the implemented approach looks sensible.

Regarding this idea:

To make the state of projects more coherent, the first step of training could be to remove old data files. Then there would be no problem with the modification time of the vectorizer file (updating it without updating the rest of the model could also leave the project broken, either silently or completely, because of a wrong index): it would be clear that the vectorizer file affects neither the modification time nor the train state of the project.

This would also keep the project's datadir clean of any temp files hanging around.

However, this would also mean less graceful behavior: if retraining a working project fails, the project is gone, whereas now the old project can remain working (depending on which step fails).

I think that this would be a step in the wrong direction. It would be better to make train operations more atomic, so that vectorizers, train files, models etc. are first written into temporary files and only renamed at the very end, replacing the existing files (if any). This is how annif.util.atomic_save is intended to work, but it only applies to individual files, not to sets of files that need to be kept in sync (e.g. vectorizer + model).
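As a rough sketch of what a multi-file variant of that idea could look like (hypothetical helper reusing the tmp- naming from this PR; not part of the actual annif.util API):

```python
import os
from contextlib import contextmanager

@contextmanager
def atomic_save_set(final_paths):
    """Yield temporary paths to write to; rename them over the final
    paths only if the whole block succeeds, otherwise discard them."""
    tmp_paths = [
        os.path.join(os.path.dirname(p) or ".", "tmp-" + os.path.basename(p))
        for p in final_paths
    ]
    try:
        yield tmp_paths
    except Exception:
        for tmp in tmp_paths:  # remove partial output on failure
            if os.path.exists(tmp):
                os.remove(tmp)
        raise
    for tmp, final in zip(tmp_paths, final_paths):
        # each os.replace is atomic; the loop as a whole is not, but the
        # window of inconsistency shrinks to a few renames
        os.replace(tmp, final)
```

Training could then write the vectorizer and the model inside one `with atomic_save_set([vec_path, model_path]) as (tmp_vec, tmp_model):` block, so an interrupted run leaves the old files untouched.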

@juhoinkinen juhoinkinen marked this pull request as ready for review August 8, 2023 08:43
@juhoinkinen juhoinkinen added this to the 1.0 milestone Aug 8, 2023
@juhoinkinen juhoinkinen merged commit 2c5ee1f into main Aug 8, 2023
13 of 14 checks passed
@juhoinkinen juhoinkinen deleted the fix-train-state-and-mtime-interrupted-train branch August 8, 2023 08:44