-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce unnecessary file downloading in archive/file states #43518
Conversation
26b7f6b
to
34fd89c
Compare
34fd89c
to
32e117b
Compare
I have had to use cp to cache a dir then had to use cp.is_cached to get a path to a file. Is there a need for cp.cache_local_path (_extrn_path) so can get file name translation from source to local cache? With regards to caching http and http, can we use If-Modified-Since or keep a copy of the header and compare key values in the header to determine of the file needs to be downloaded again. Maybe cache/.../extern_meta which contains the http/https headers for extrn_files Other thing which might help is keep the time stamp on the file the same as the time stamp on the source if possible. I can raise these idea's as an issue if you like. |
I honestly have no idea what you're talking about here, sorry. The source_hash will be used to drive whether or not the file is downloaded. If the locally-cached file matches the configured source_hash, we don't download. Before, we'd download every time, and then compare the hash to the source_hash. If it didn't match, we'd fail the state anyway, so there would be a lot of unnecessary downloading for what result in a failed state. This would allow the state to continue to work if the remote file changes, and if you want the new copy of the file, you just update the source_hash. |
The follow are two example of http response headers
From inkscape.
For example if https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/If-Modified-Since Or using the etag Or send both. Or use HEAD first and compare and then GET (fetch) if needed, which might be easier to code.
By keeping the header response from GET, it allows minion to determine of the file is valid in the cache or not. cache/.../extern_meta/...../putty-64bit-0.70-installer.msi would contain the Header from the original GET request.
If the HEAD request does not match the GET Header response in cache/.../extern_meta/...../putty-64bit-0.70-installer.msi when comparing Last-Modified, ETag, Content-Length, and Content-Type then fetch the file again, instead of using the cache version. But it must contain Last-Modified or ETag or both to be able to do a valid comparison. If the header missing Last-Modified and ETag and or Content-Length general means the contents is dynamic. |
My feedback is that this change is unnecessarily complex and the documentation very lacking.
|
@terminalmage I'm wondering here about 2017.7 versus develop. Seems like the latter might be a better target but let me know your thoughts. |
It's not an arbitrary change. The new caching behavior doesn't apply just to the I'm more than willing to have my fellow long-time contributors weigh in on this though, and use
Using a The new With regard to the
Since it's designed to be used internally within state functions, and not "in conjunction" with them, I don't know how critical adding a bunch of documentation is. I can by all means add a couple code examples, but this state will rarely be useful in an SLS file, and I have already mentioned in the docstring for |
I've added some code examples to the docstring for the @cachedout The unnecessary re-downloading started happening because of changes to |
2d75df1
to
c745ea9
Compare
@terminalmage I suppose it depends on our view of performance regressions. My view here is that changes to the file client are so fundamental to the behavior and stability of Salt that I like to take the more conservative route and defer them to major releases. I'm not necessarily opposed to merging it here but I'd feel more comfortable in the develop branch. |
@cachedout I don't have an objection to rebasing against develop. |
Thank you for all your work on this project. It has been something very important to me for the last two years. I come from the perspective of a user of salt, who struggled to first learn and understand its concepts and live in fear of every upgrade and what part of my configuration will be broken. I love the concepts in salt, but hate its fluid nature and lack of documentation, particularly examples of best practice. With that out of the way, I get that you feel 'cache_source' is clearer, but to me it isn't. That's just one person so make of it what you will. When I read that I see "cache" as a noun, not a verb, so I expect to put a path as the value, not a boolean. "keep" is always a verb. But more importantly, breaking my working states with each upgrade because of a better name isn't very helpful to your users. It might be that "ls" isn't the most intuitive way to get a directory listing in Unix, there is no discussion about changing it now. Remember that every time you change some naming you break not only everyone's working system (yes, everyone should read the release notes in detail and understand every implication, but they don't), but also you invalidate all documentation, forum posts and other information already out there. Perhaps this is a little painting the bike shed, but for something as business critical as a configuration management system, clarity and consistency are so so important. Thanks for listening, and thanks for your detailed explanation. I hope that you could put even a fraction of what you explained to me in the official documentation. Your description of archive_hash could go into the docs almost as is. Everyone would be much the wiser. |
The
Buy why? The choice of where we store the hash files tracked by |
c745ea9
to
50c1b8e
Compare
I've modified the PR so that @cachedout I've also rebased onto develop. I think I have a way of incorporating the non-fileclient fixes from this PR against 2017.7, which I can do in a separate PR. |
be9bdb6
to
9d749af
Compare
9d749af
to
c8ba934
Compare
The file.managed state, which is used by the archive.extracted state to download the source archive, at some point recently was modified to clear the file from the minion cache. This caused unnecessary re-downloading on subsequent runs, which slows down states considerably when dealing with larger archives. This commit makes the following changes to improve this: 1. The fileclient now accepts a `source_hash` argument, which will cause the client's get_url function to skip downloading http(s) and ftp files if the file is already cached, and its hash matches the passed hash. This argument has also been added to the `cp.get_url` and `cp.cache_file` function. 2. We no longer try to download the file when it's an http(s) or ftp URL when running `file.source_list`. 3. Where `cp.cache_file` is used, we pass the `source_hash` if it is available. 4. A `cache_source` argument has been added to the `file.managed` state, defaulting to `True`. This is now used to control whether or not the source file is cleared from the minion cache when the state completes. 5. Two new states (`file.cached` and `file.not_cached`) have been added to managed files in the minion cache. In addition, the `archive.extracted` state has been modified in the following ways: 1. For consistency with `file.managed`, a `cache_source` argument has been added. This also deprecates `keep`. If `keep` is used, `cache_source` assumes its value, and a warning is added to the state return to let the user know to update their SLS. 2. The variable name `cached_source` (used internally in the `archive.extracted` state) has been renamed to `cached` to reduce confusion with the new `cache_source` argument. 3. The new `file.cached` and `file.not_cached` states are now used to manage the source tarball instead of `file.managed`. This improves disk usage and reduces unnecessary complexity in the state as we no longer keep a copy of the archive in a separate location within the cachedir. We now only use the copy downloaded using `cp.cache_file` within the `file.cached` state. This change has also necessitated a new home for hash files tracked by the `source_hash_update` argument, in a subdirectory of the minion cachedir called `archive_hash`.
…e.extracted The code used to have a salt.state.HighState instance call the state. That method pre-dated availability of __states__ to use for executing the state function. The HighState instance handles exception catching and produces a list as output if there were errors which arose before the state was executed. Running the state function using __states__ does not give you any such protection. This commit removes the type check on the return data, as it will never be a list when run via __states__, and wraps the state function in a try/except to catch any exceptions that may be raised by invoking the file.cached state.
This new argument name is ambiguous as "cache" can either be interpreted as a noun or a verb.
c8ba934
to
8b237c1
Compare
The
file.managed
state, which is used by thearchive.extracted
state to download the source archive, at some point recently was modified to clear the file from the minion cache. This caused unnecessary re-downloading on subsequent runs, which slows down states considerably when dealing with larger archives.This PR makes the following changes to improve this:
The fileclient now accepts a
source_hash
argument, which will cause the client'sget_url
function to skip downloading http(s) and ftp files if the file is already cached, and its hash matches the passed hash. This argument has also been added to thecp.get_url
andcp.cache_file
functions.We no longer try to download the file when it's an http(s) or ftp URL when running
file.source_list
.Where
cp.cache_file
is used, we pass thesource_hash
if it is available.A
cache_source
keep_source
argument has been added to thefile.managed
state, defaulting toTrue
. This is now used to control whether or not the source file is cleared from the minion cache when the state completes.Two new states (
file.cached
andfile.not_cached
) have been added to manage files in the minion cache.In addition, the
archive.extracted
state has been modified in the following ways:For consistency with
file.managed
, acache_source
keep_source
argument has been added. This also deprecateskeep
. Ifkeep
is used,keep_source
assumes its value, and a warning is added to the state return to let the user know to update their SLS.The variable name
cached_source
(used internally in thearchive.extracted
state) has been renamed tocached
to reduce confusion with the newcache_source
keep_source
argument. (Note: whilecache_source
is nowkeep_source
, I'm not going to go back and rename these variables back tocached_source
. They will stay as-is.)The new
file.cached
andfile.not_cached
states are now used to manage the source tarball instead offile.managed
. This improves disk usage and reduces unnecessary complexity in the state as we no longer keep a copy of the archive in a separate location within the cachedir. We now only use the copy downloaded usingcp.cache_file
within thefile.cached
state. This change has also necessitated a new home for hash files tracked by thesource_hash_update
argument, in a subdirectory of the minion cachedir calledarchive_hash
.Resolves #38971.