HDFS targets show in confirmation even if not needed #15
Pastebin content:

```
alexr@dev101:~/src/resolve/src/resolve/ml$ drake
--- 0. Skipped (up-to-date): hdfs://user/alexr/resolve-ml/gold-annotations <- data/gold-annotations
--- 1. Skipped (up-to-date): hdfs://user/alexr/resolve-ml/good-pairs <- hdfs://user/alexr/resolve-ml/gold-annotations
--- 2. Skipped (up-to-date): hdfs://user/alexr/resolve-ml/uuid-and-attrs <- hdfs://user/alexr/resolve-ml/gold-annotations
--- 3. Skipped (up-to-date): hdfs://user/alexr/resolve-ml/all-pairs <- hdfs://user/alexr/resolve-ml/gold-annotations
--- 4. Skipped (up-to-date): data/good-pairs <- hdfs://user/alexr/resolve-ml/good-pairs
--- 5. Skipped (up-to-date): data/all-pairs <- hdfs://user/alexr/resolve-ml/all-pairs
--- 6. Skipped (up-to-date): hdfs://user/alexr/resolve-ml/bad-pairs <- data/all-pairs, data/good-pairs
--- 7. Skipped (up-to-date): hdfs://user/alexr/resolve-ml/good-pairs-with-features <- hdfs://user/alexr/resolve-ml/good-pairs, hdfs://user/alexr/resolve-ml/uuid-and-attrs
--- 8. Skipped (up-to-date): hdfs://user/alexr/resolve-ml/bad-pairs-with-features <- hdfs://user/alexr/resolve-ml/bad-pairs, hdfs://user/alexr/resolve-ml/uuid-and-attrs
Done (0 steps run).
```
This is extremely strange. How is it possible that "gold-annotations" is listed with a "missing output" reason, but then not built, with an "up-to-date" reason? Does the file really exist or not? One of those is definitely wrong, but which one? Can you dig into it a little further, i.e. ls -l the inputs and outputs of the first target? Thanks!
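Something like this, with the paths from the transcript above (`hadoop fs -ls` being the usual way to stat an HDFS path):

```
$ ls -l data/gold-annotations
$ hadoop fs -ls hdfs://user/alexr/resolve-ml/gold-annotations
```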
So, after further investigation: this issue is intermittent, and seems to occur when I'm moving from local to HDFS, or back. I wonder if it is a synchronization issue? Sometimes the HDFS last-modified time ends up slightly different from the local one. If I rerun the workflow, sometimes it is a problem again, and sometimes not.
Yes, it is most likely the synchronization issue. I ran into this issue before and reported it to Philip. After he fixed it, it was gone. It probably happened again. Even one second of desynchronization could create this issue. I won't close the bug just yet - but let me know once you talk to Philip.
Either one of you guys up for adding details on this to the wiki? Sounds like an annoying gotcha that we should warn folks about.
When you start the FAQ, I'll add it there :)
So, is there a way we could make drake more tolerant of this? Right now, if the servers get out of sync by even a millisecond, we'll have a problem. The problem only surfaces when the dependencies go the wrong way across the divide.

Could we add some sort of configurable delay between steps? If we waited one second after each step, the servers would have to be off by more than one second before we'd see the problem. That seems much less likely than being one millisecond off. In most data workflows an extra second or two per step is not going to be a big deal.

Alternately, could we just configure a "fuzz" factor into the out-of-date calculation that calls things OK if their timestamp is after their dependency's, once the fuzz factor is added?

I think the problem shows up mostly when using dummy commands that don't take very long. If the commands were longer, they'd overcome the server time difference. That said, I'm seeing the issue around hdfs -copyToLocal and -put, which I'm going to keep using in the future. Adding a 30 second sleep to each command does fix the problem (sketch below).
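For anyone hitting this in the meantime, the sleep workaround looks roughly like this in a Drakefile - a hypothetical step adapted from the transcript above, assuming Drake's usual $INPUT/$OUTPUT step variables:

```
; hypothetical step adapted from the transcript above
data/all-pairs <- hdfs://user/alexr/resolve-ml/all-pairs
  hadoop fs -copyToLocal $INPUT $OUTPUT
  # pad the step so sub-second clock skew can't invert the timestamps
  sleep 30
```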
Maybe we could control the scope of the forced delay? The rule would be something like: if the step uses HDFS and ran fast, then artificially pause. Or is there some elegant way to detect the problem beforehand?
AFAICT, the issue only comes up when moving between filesystems, so we could add the (configurable?) delay before any step that uses both the local and HDFS filesystems.

Also, we could try to detect the desync between the two systems and correct for it: make a tmp file on both systems at the same time, compare the resulting timestamps, then adjust later comparisons by the difference (rough sketch below). This might work, except that network lag may not be consistent, so the adjustment we derive may only be accurate for that single point in time. I don't think there's an easy fix, because distributed systems have to deal with network lag that can make time comparisons tough. The delay method should work as long as the delay is bigger than the time difference between the two systems.
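A rough sketch of that probe, in Python for illustration - the HDFS temp directory and the use of the hadoop CLI are assumptions, and the CLI's startup latency gets folded into the estimate, which is exactly the lag problem described above:

```python
import os
import subprocess
import tempfile

def estimate_skew(hdfs_tmp_dir):
    """Create a file locally and on HDFS at (nearly) the same moment,
    then compare the mtimes each filesystem reports.  The hadoop CLI's
    startup latency is included, so treat this as a point-in-time
    estimate, not a stable correction."""
    local = tempfile.NamedTemporaryFile(delete=False)
    local.close()
    probe = hdfs_tmp_dir + "/.skew-probe"
    subprocess.check_call(["hadoop", "fs", "-touchz", probe])
    local_mtime = os.path.getmtime(local.name)
    # 'hadoop fs -stat %Y' prints mtime in milliseconds since the epoch
    hdfs_ms = int(subprocess.check_output(["hadoop", "fs", "-stat", "%Y", probe]))
    os.unlink(local.name)
    subprocess.check_call(["hadoop", "fs", "-rm", probe])
    return hdfs_ms / 1000.0 - local_mtime  # positive: HDFS clock runs ahead
```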
Alex, thank you very much for your thoughts. There isn't an easy fix. Ideally, every system would have its time synchronized over NTP.

I like both your ideas. We can add a configurable delay (say, 300-400 ms) before every step that uses more than one filesystem. It would not solve all possible problems, but probably a big chunk.

I'm a little more worried about "fuzzy" timestamp evaluation. If it relaxed the requirements (i.e. more targets would be evaluated than otherwise), it would be OK. But it tightens the requirements, which can be problematic. Consider, for example, a user that runs a script which modifies an input and immediately reruns Drake: if the modification falls inside the fuzz window, the target would be treated as up-to-date and silently skipped (see the sketch below).

I also like the idea to test the filesystem delay. We can have a special flag (fs_test) that would create files on each filesystem and compare the timestamps they report. But if the desynchronization is seconds, nothing will really help.

I'm not sure all of the above is the highest priority - would you like to help us with the code? I'd be more than happy to review or point you to the right place.
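To make that concern concrete, here's a minimal sketch (not Drake's actual check) of the relaxed comparison and the kind of edit it would hide:

```python
FUZZ = 1.0  # hypothetical tolerance for cross-filesystem clock skew, in seconds

def needs_rebuild(input_mtime, output_mtime, fuzz=FUZZ):
    # Relaxed rule: only rebuild when the input is newer by MORE than
    # the fuzz window.  This absorbs clock skew, but it also absorbs
    # real edits made within `fuzz` seconds of the last build.
    return input_mtime > output_mtime + fuzz

# An input touched 0.5 s after its output was built is genuinely newer,
# yet the fuzzy check reports the output as still up to date:
print(needs_rebuild(input_mtime=100.5, output_mtime=100.0))  # False -> skipped
```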
Added --step-delay flag in feature/vvv: ee833c5 |
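For reference, usage would look something like this - assuming the delay argument is given in milliseconds (worth confirming via `drake --help` on that branch):

```
$ drake --step-delay 1000
```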
The --step-delay flag delays every step. While that would work, we really only need the delay on steps that cross two or more filesystems.

I wonder if the fs_test you mention should run as a precondition whenever a workflow touches multiple filesystems, so that we fail fast instead of silently making wrong skip decisions.

"But if the desynchronization is seconds, nothing will really help." Yeah, agreed - no reasonable delay would cover that.
Yes, I remember your suggestion to implement it only for steps crossing over two or more filesystems, and it's a fine one. But I had to implement it this way, because the problem seems to be fundamental on certain filesystems - see #36. We could have another flag that controls the behavior of --step-delay, e.g. restricting it to cross-filesystem steps.

I agree we should fail fast, and I think we can be even smarter with fs_test. We can put a marker file under the .drake/ directory (where Drake keeps all temporary files, including logs and script files) indicating whether the filesystem testing has happened for this workflow. We can repeat it every week (day?), if needed, and, of course, one needs to be able to disable it completely. I agree we can easily detect whether the workflow uses multiple filesystems (rough sketch below).

I have to say this issue is quite low on my priority list for now. But I'd be more than happy to review anyone's code contributions and provide direction and guidance.
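To make the marker-file idea concrete, a rough sketch in Python - the marker path, the weekly interval, and the function names are all hypothetical, not Drake's current behavior:

```python
import os
import time

MARKER = ".drake/fs-test-passed"   # hypothetical marker under Drake's tmp dir
MAX_AGE = 7 * 24 * 3600            # re-run the test weekly, as suggested above

def fs_test_needed(uses_multiple_filesystems, disabled=False):
    if disabled or not uses_multiple_filesystems:
        return False               # single filesystem, or user opted out
    if not os.path.exists(MARKER):
        return True                # this workflow has never been tested
    return time.time() - os.path.getmtime(MARKER) > MAX_AGE

def record_fs_test_passed():
    os.makedirs(".drake", exist_ok=True)
    with open(MARKER, "w") as f:
        f.write(str(int(time.time())))
```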
Oh, I didn't realize HDFS was limited to 1s resolution. Your change makes sense in light of that. |
I'm actually not sure what timestamp resolution HDFS has. If you could run any workflow that uses HDFS with the --debug flag and see what timestamps it reports, that would be helpful. @larsyencken was talking about HFS+, which is the filesystem OS X uses.
Alex, we should probably close this bug, since it is related to Factual's HDFS/NFS desynchronization, and if Philip fixed it, this problem should go away. I liked all your other ideas, however, and I was wondering if you could file a feature request for what you think we could do to make things even better (i.e. detection of multiple filesystems, automated tests, etc.)?
Terminal output here: http://pastebin.com/J08GAk1Y
drake has already run once, to completion. No files have been modified. drake correctly notices this and skips all the steps. Why does it still say it is going to do the steps?
All the steps involve at least one HDFS location. A very similar workflow that was all local didn't exhibit this same behavior.