
lhsm_archive - number of errors is accumulating until it reaches max #106

Open
geraldhofer opened this issue Jul 4, 2019 · 4 comments

geraldhofer commented Jul 4, 2019

It looks like the Lustre changelog is currently leaking records: when a file is deleted while it is still open, the UNLINK record never reports the actual deletion (the UNLINK_LAST flag is not set).

Apparently a possible reproducer is:

  • open file
  • delete file
  • sleep 1h
  • close file

We end up with orphans in the database that have all their fields populated except the path.

Apparently we need to fix the underlying issue in Lustre and I am working on that.

But some user application apparently triggers that issue now to an extent that it starts to impact our ability to migrate files in a reasonable time.
We already set suspend_error_min=10000000 quite high, anticipating that we would never hit it - but we still eventually got to a stage where we end up with too many errors and a very long runtime on the archive:

Policy 'lhsm_archive':
    Current run started on 2019/07/04 17:13:51: trigger: scheduled (daemon), target: all
    Last complete run: trigger: scheduled (daemon), target: all
        - Started on 2019/07/02 19:53:22
        - Finished on 2019/07/04 16:34:14 (duration: 1d 20h 40min 52s)
        - Summary: 48 successful actions, volume: 96.75 GB; 0 entries skipped; 10000002 errors

These are the errors we see in the log files:

2019/07/04 03:19:35 [32655/16] lhsm_archive | Warning: cannot determine if entry  is whitelisted: skipping it.
2019/07/04 03:19:35 [32655/16] Policy | [0x200128cc4:0x1709:0x0]: attribute is missing for checking fileset 'scratch'
2019/07/04 03:19:35 [32655/20] Policy | Missing attribute 'fullpath' for evaluating boolean expression on [0x200128cc4:0x1715:0x0]
2019/07/04 03:19:35 [32655/20] Policy | [0x200128cc4:0x1715:0x0]: attribute is missing for checking ignore_fileclass rule

The reason these orphans affect us is that we use a fileclass that requires the path information to determine whether we want to migrate a file or not:

FileClass scratch {
    definition {
        tree == "/lustre/scratch"
    }
}
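
For context, such a fileclass only comes into play when a policy rule targets it. A minimal sketch of how it might be referenced in the archive policy (the rule name and condition below are illustrative, not our actual configuration):

lhsm_archive_rules {
    # hypothetical rule: archive files under /lustre/scratch once they are old enough
    rule archive_scratch {
        target_fileclass = scratch;
        condition { last_mod > 6h }
    }
}

Matching target_fileclass = scratch relies on the path ('tree' is a path-based criterion), which is exactly the attribute the orphaned entries are missing.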

So at the time of that error, the entry has already been deleted from the Lustre file system. Every subsequent archive run has to go through all the old errors again, so the runtime increases, and when suspend_error_min is reached before all the entries have been archived, we miss files that should have been migrated.

It looks like only a scan can remove these entries from the database. A scan takes more than a day on this system, maxes out the database and increases the load on the MDS, so we don't want to run it that frequently.

In the end this is a database corruption issue. We basically have some entries that are corrupted/inconsistent (in this case because of the Lustre bug) and that cause errors during the migration. I think it makes sense to try to rectify these database errors by reading the entries again from the file system at the time of the archive. That would avoid the need for a scan in a more general way, as database inconsistencies get rectified as they are discovered by the migration.

I am in the lucky position that we have upgraded the hardware and that I was able to optimise the scan down to about a day (from a week); I would not be able to deal with this at all if my scan times were still in the range of a week. But it would help to avoid a scan in a more general way if errors triggered a rescan of that FID.


dtcray commented Jul 8, 2019

A possible option would be to set an "invalid" flag in the DB for these types of entries; most policies ignore entries with the invalid flag set, and a subsequent scan would either purge them from the DB or set all the required attributes on those entries.


tl-cea commented Jul 8, 2019

Gerald,

Your analysis is absolutely correct. This issue with open-deleted files is something we noticed too.
As you mention, the best way to solve this is to make Lustre raise an UNLINK record with the UNLINK_LAST flag when the entry is actually deleted.

If the error limitation annoys you, you can set "suspend_error_min = 0" (which is the default) to disable this limit.

As mentioned by dtcray, the policy run should set the entry as invalid if it no longer exists.
The check is done before or after the rate limiting, depending on the policy parameters
(see https://github.com/cea-hpc/robinhood/wiki/robinhood_v3_admin_doc#policy-parameters).
If it is not the case, try:
pre_sched_match = auto_update;
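
For reference, a minimal sketch of where these run-time parameters could be set, assuming both are accepted in the policy's <policy>_parameters block of the Robinhood configuration (values are illustrative):

lhsm_archive_parameters {
    # re-check the entry against the filesystem right before scheduling the action;
    # entries that no longer exist are flagged invalid instead of producing an error
    pre_sched_match = auto_update;

    # 0 (the default) disables suspending the run on accumulated errors
    suspend_error_min = 0;
}

With auto_update, attributes are refreshed from the filesystem before the condition is evaluated, so deleted entries are detected rather than being counted as errors on every run.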

Regards,
Thomas

geraldhofer (Author) commented

It looks like the option:
pre_sched_match = auto_update;
has worked around this issue successfully.

Setting suspend_error_min = 0 does not really help, as the runtime gets far too long when too many errors accumulate. In my example it was already running for more than a day; usually it runs in a few minutes (not counting the database query).


tl-cea commented Jul 9, 2019

Good.
@geraldhofer Do you have an open LU about the UNLINK_LAST flag for open-deleted files?
