Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition in star_imports cache #13826

Closed
vbraun opened this issue Dec 12, 2012 · 28 comments
Closed

Race condition in star_imports cache #13826

vbraun opened this issue Dec 12, 2012 · 28 comments

Comments

@vbraun
Copy link
Member

vbraun commented Dec 12, 2012

See https://groups.google.com/d/topic/sage-devel/SN88f9qEIV8/discussion

The patchbot there sporadically fails on various tests with errors of
the type

Traceback (most recent call last): 
  File "/mnt/storage2TB/patchbot/.sage/tmp/ 
volker_desktop.stp.dias.ie-14095/multireplace_29004.py", line 6, in 
<module> 
    from sage.all_cmdline import *; 
  File "/mnt/storage2TB/patchbot/Sage/sage-5.5.rc0/local/lib/python/ 
site-packages/sage/all_cmdline.py", line 14, in <module> 
    from sage.all import * 
  File "/mnt/storage2TB/patchbot/Sage/sage-5.5.rc0/local/lib/python/ 
site-packages/sage/all.py", line 72, in <module> 
    from sage.rings.all      import * 
  File "/mnt/storage2TB/patchbot/Sage/sage-5.5.rc0/local/lib/python/ 
site-packages/sage/rings/all.py", line 169, in <module> 
    lazy_import("sage.rings.universal_cyclotomic_field.all","*") 
  File "lazy_import.pyx", line 850, in 
sage.misc.lazy_import.lazy_import (sage/misc/lazy_import.c:5168) 
    names[ix:ix+1] = get_star_imports(module) 
  File "lazy_import.pyx", line 900, in 
sage.misc.lazy_import.get_star_imports (sage/misc/lazy_import.c:5924) 
    star_imports = pickle.load(open(cache_file)) 
EOFError 

The patchbot is on a separate harddisk, mounted under /mnt/storage2TB. But my temp directory is tmpfs. So the pickle is moved across block devices, which is of course not atomic. Hence the patchbot sometimes dies here while opening a half-written file. The temporary file should be created in the target directory, only then can we be sure that the move is atomic.

Apply attachment: 13826_star_imports_race_v2.patch

Depends on #14292

CC: @nbruin @robertwb

Component: build

Author: Volker Braun, Jeroen Demeyer

Reviewer: Volker Braun

Merged: sage-5.9.rc0

Issue created by migration from https://trac.sagemath.org/ticket/13826

@vbraun
Copy link
Member Author

vbraun commented Dec 12, 2012

Author: Volker Braun

@nbruin
Copy link
Contributor

nbruin commented Dec 12, 2012

comment:3

While we're nitpicking on possible race conditions anyway, isn't this unsafe?

    if star_imports is None:
        cache_file = get_cache_file()
        if os.path.exists(cache_file):
            star_imports = pickle.load(open(cache_file))
        else:
            star_imports = {}

The file could get deleted between the existence check and the open. Shouldn't one do

    if star_imports is None:
        cache_file = get_cache_file()
        try:
            opened_cache_file=open(cache_file)
        except IOError, <whatever other errors should be caught>:
            opened_cache_file=None
        if opened_cache_file:
            star_imports = pickle.load(opened_cache_file)
            opened_cache_file.close()
        else:
            star_imports = {}

This is of course assuming that either the cachefile is valid or does not exist. A "mv" on the filename shouldn't affect an already opened file.

@vbraun
Copy link
Member Author

vbraun commented Dec 12, 2012

comment:4

I agree that one should always open() and then ask for forgiveness. Fixed, too.

@nbruin
Copy link
Contributor

nbruin commented Dec 12, 2012

comment:5

I think you're guarding a little too much in the try: ... except now. Wouldn't we want to know if unpickling went wrong? We have a good reason to catch errors from open: That's how we catch the file being legitimately non-existent. I think it's a real error we want reported if the file exists but doesn't contain a valid pickle. With your proposed code one could end up with a corrupted cache and consistently, silently, lose the efficiency of having the cache.

To some extent it's a matter of taste: Do you want to continue on a best-effort basis as much as possible or do you want the user to know that something didn't go as expected? I think the latter leads to an easier to debug program and therefore is more suitable for sage.

@vbraun
Copy link
Member Author

vbraun commented Dec 13, 2012

Updated patch

@vbraun
Copy link
Member Author

vbraun commented Dec 13, 2012

comment:6

Attachment: trac_13826_star_imports_race.patch.gz

I've added a warning if the pickle is corrupted and converted sage.sandpiles to lazily import so we actually have something to test. Though neither of these changes is anywhere near as important as fixing the race condition.

@nbruin
Copy link
Contributor

nbruin commented Dec 14, 2012

Reviewer: Nils Bruin

@nbruin
Copy link
Contributor

nbruin commented Dec 14, 2012

comment:7

Looks good to me (mostly). Few small issues:

  • whitespace plugin complains about introduced whitespace
  • the code writing the cache file first deletes the cachefile, writes the new one in a temporary file, and then moves it in place. Why not leave the old cache file in place while the new one gets written?
    I'm setting review to positive anyway.

@vbraun
Copy link
Member Author

vbraun commented Dec 14, 2012

comment:8

Replying to @nbruin:

  • whitespace plugin complains about introduced whitespace

So? Go ahead and design a workflow that takes care of it automatically if it bothers you.

  • the code writing the cache file first deletes the cachefile, writes the new one in a temporary file, and then moves it in place.

Because os.rename() will fail if the destination exists on Windows (thanks for the well-thought out filesystem semantics, Bill G.!)

@nbruin
Copy link
Contributor

nbruin commented Dec 14, 2012

comment:9

OK, it seems the issues I pointed out are a little less straightforward than I thought, so they should probably be cleared up before this ticket gets merged.

So? Go ahead and design a workflow that takes care of it automatically if it bothers you.

It's a space after a colon on a line you introduce, not some whitespace on an empty line left by an auto-indenting editor.

Because os.rename() will fail if the destination exists on Windows (thanks for the well-thought out filesystem semantics, Bill G.!)

That's not what I meant. You can first write the temp file and then unlink the old one and move the new one in place. Presently the old cache file is already gone while you're writing the new one. An old cache file is better than none at all, it seems in this setting?

@vbraun
Copy link
Member Author

vbraun commented Dec 14, 2012

comment:10

Replying to @nbruin:

It's a space after a colon on a line you introduce, not some whitespace on an empty line left by an auto-indenting editor.

So?

That's not what I meant. You can first write the temp file and then unlink the old one and move the new one in place. Presently the old cache file is already gone while you're writing the new one. An old cache file is better than none at all, it seems in this setting?

Or one could argue that a potentially stale cache is worse than a non-existent one. Doesn't really matter either way, serializing a dict takes no time.

@nbruin
Copy link
Contributor

nbruin commented Dec 14, 2012

comment:11

OK, I don't know what we're supposed to do with whitespace anymore. I hope someone else can sign off on that.

Concerning the caching strategy: I'm not so sure the current code is particularly safe in the face of now-current development practices: If a cache file is present and contains an entry for a module then a lazy star import of that module will use that entry to initialize the namespace. I don't think the cache ever gets updated should it be discovered the data is out-of-sync.

The cache-file is only specific to the "branch" of sage. Now that people mostly use mercurial queues, the branch name hardly identifies the particular sage version. In particular, if I apply a patch that changes a lazily imported module, the cache will not be marked as stale and I will continue working with that. The cache will be resaved upon startup (see all.py), but only the entries of modules that weren't in the cache before will be updated, because the cache is blindly trusted for the other ones.

I don't quite know what the best solution is. Perhaps sage -b should just always mark the cache as invalid (i.e., delete it)? In that case, sage -b should perhaps create the cache as well. In that case, there is no need to rewrite the cache every time and there's no race condition either.

It means that the cache would only reflect the sage "system library" and would need to be located somewhere in the tree rather than in a user's .sage. The problem is that in general the only way to know if the cache is stale is by looking at the modification dates of the module files and once we're statting all of those, we've lost what we were trying to gain. For the sage library we can reasonably assume that only sage -b changes that, so there we can cache state without having to stat.

I realize that this is escalating the scope of this ticket a bit, but if lazy_import lies about objects available in the to-be-imported namespace (due to using a stale cache), we have a rather critical problem. I'm including Robert on the CC here in the hope he can explain why the current approach is safe.

Or perhaps we should simply not do lazy star imports, only specific ones. Then the whole cache is not necessary.

@nbruin
Copy link
Contributor

nbruin commented Dec 14, 2012

Changed reviewer from Nils Bruin to none

@vbraun
Copy link
Member Author

vbraun commented Dec 14, 2012

comment:12

sage -b deletes the lazy import cache already. Are you saying that it doesn't work on your system?

[vbraun@volker-desktop ~]$ ll .sage/cache/
total 4
-rw-------. 1 vbraun vbraun 464 Dec 13 18:11 _home_vbraun_opt_sage-5.5.rc0_devel_sage-main-lazy_import_cache.pickle
[vbraun@volker-desktop ~]$ sage -b >& /dev/null
[vbraun@volker-desktop ~]$ ll .sage/cache/
total 0
[vbraun@volker-desktop ~]$ 

Its better to be specific about what you import if you know what you need. On the other hand, sage.all of course star-imports other .all files.

@robertwb
Copy link
Contributor

comment:13

Yes, sage -b deletes the cache file, so the cache should never be stale. Making the replacement atomic should be fixed though, and this patch does that.

The code looks good, except that I think the explicit unlink of the cache file before writing a new one is unnecessary (and could cause issues with multiple stage startups at the same time). We've decided to ignore whitespace, so modulo that one issue, positive review.

@vbraun
Copy link
Member Author

vbraun commented Dec 14, 2012

comment:14

As I said before, os.rename() will fail on Windows if the target exists. We could count on os.rename() fail on Windows and succeed on Unix, but thats also a bit wonky. Explicitly deleting the cache ensures that save_cache_file() does what you would think it does.

@nbruin
Copy link
Contributor

nbruin commented Dec 14, 2012

comment:15

OK, code is probably less ambiguous than trying to explain in words. Why do you do

+    try:
+        os.unlink(cache_file)
+    except OSError:
+        pass   # cache_file does not exist, fine
+    # important: create temp dir in cache_dir (trac #13826) to move atomically
+    _, tmp_file = tempfile.mkstemp(dir=cache_dir)
+    with open(tmp_file, "w") as tmp:
+        pickle.dump(star_imports, tmp)
+    try:
+        os.rename(tmp_file, cache_file)
+    except OSError:
+        pass   # Windows can end up here, ignore

instead of

+    # important: create temp dir in cache_dir (trac #13826) to move atomically
+    _, tmp_file = tempfile.mkstemp(dir=cache_dir)
+    with open(tmp_file, "w") as tmp:
+        pickle.dump(star_imports, tmp)
+    try:
+        os.unlink(cache_file)
+    except OSError:
+        pass   # cache_file does not exist, fine
+    try:
+        os.rename(tmp_file, cache_file)
+    except OSError:
+        remove tmp_file if rename failed.

Replying to @vbraun:

sage -b deletes the lazy import cache already. Are you saying that it doesn't work on your system?

Ah, ok, it does. That alleviates the problem to some degree. I haven't been able to locate the code that does it, but it will only delete the cache in the .sage/cache of the UID running sage -b. So other UIDs are still out of luck. It seems to me that if sage -b knows when the cache needs to be refreshed then it should do so and the cache should probably be somewhere in sage_root rather than in a user directory.

Perhaps rationalizing where the cache is held is something for a follow-up ticket.

@vbraun
Copy link
Member Author

vbraun commented Dec 14, 2012

comment:16

The cache is deleted in setup.py, so yes this will cause problem if you run a Sage version that somebody else built.

I don't care about where to put the unlink, I put it right after the mkdir since it is part of preparing the directory. Of course other people like their bike shed pink. Or are you saying that you want to clean up after a potential race on Windows? Feel free to implement that if you want. Erasing can again fail on Windows for existing files. Doesn't strike me as the most urgent problem either.

@vbraun
Copy link
Member Author

vbraun commented Jan 24, 2013

comment:17

I think we should plug this obvious race with the patch I posted. So please review if you care about patchbot results :-P

@jdemeyer
Copy link

jdemeyer commented Apr 4, 2013

comment:18

See #14292 for a general approach to solving the problem of race conditions when writing to a file, which should probably be used here.

@jdemeyer
Copy link

jdemeyer commented Apr 4, 2013

Changed author from Volker Braun to Volker Braun, Jeroen Demeyer

@jdemeyer

This comment has been minimized.

@jdemeyer
Copy link

jdemeyer commented Apr 4, 2013

Dependencies: #14292

@vbraun
Copy link
Member Author

vbraun commented Apr 13, 2013

Reviewer: Volker Braun

@jdemeyer
Copy link

comment:22

Added

diff --git a/sage/misc/lazy_import.pyx b/sage/misc/lazy_import.pyx
--- a/sage/misc/lazy_import.pyx
+++ b/sage/misc/lazy_import.pyx
@@ -909,6 +909,7 @@
     """
     global star_imports
     if star_imports is None:
+        star_imports = {}
         try:
             with open(get_cache_file()) as cache_file:
                 star_imports = pickle.load(cache_file)
@@ -917,7 +918,6 @@
         except Exception:  # unpickling failed
             import warnings
             warnings.warn('star_imports cache is corrupted')
-        star_imports = {}
     try:
         return star_imports[module_name]
     except KeyError:

@jdemeyer
Copy link

comment:23

Attachment: 13826_star_imports_race_v2.patch.gz

@jdemeyer
Copy link

Merged: sage-5.10.beta0

@jdemeyer
Copy link

Changed merged from sage-5.10.beta0 to sage-5.9.rc0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants