
file.managed and archive.extracted don't cache remote files #42681

Closed
ghost opened this issue Aug 2, 2017 · 17 comments
@ghost commented Aug 2, 2017

Hello,
In our SaltStack configuration we use file.managed or archive.extracted with an https:// source, and we noticed that Salt downloads the source files regardless of whether they have changed on the source or the minion.
Quite large files (e.g. a JDK binary), deployed as states to tens of machines, caused unexpected network problems in our infrastructure. After investigating, I found that Salt downloads those files every time.
I worked around it by adding:
```yaml
- unless:
  - curl -m 15 -Ls {{ file_source }}.sha1 > /tmp/test.sha1
  - echo " {{ file_downloaded }}" >> /tmp/test.sha1
  - sha1sum -c /tmp/test.sha1
```
But it shouldn't have to work like that: the first step should download only the .sha1 file, and only if it differs from the local copy should the source itself be downloaded.
Versions: salt-call 2017.7.0 (Nitrogen) on many systems (CentOS 7, Ubuntu 16.04, Debian 8, etc.)
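
For illustration, a minimal sketch of how such a workaround can be wired into a complete state. The state ID, URL, and target path are placeholders, not taken from the original report; skip_verify is used because the unless check already compares the published .sha1 against the local copy.

```yaml
{% set file_source = 'https://example.com/files/jdk-8u141-linux-x64.tar.gz' %}
{% set file_downloaded = '/opt/downloads/jdk-8u141-linux-x64.tar.gz' %}

deploy_jdk_binary:
  file.managed:
    - name: {{ file_downloaded }}
    - source: {{ file_source }}
    - skip_verify: True
    - makedirs: True
    # These checks fetch only the small .sha1 file and verify it against
    # the file already on the minion; when the verification succeeds, the
    # state (and therefore the large download) is skipped.
    - unless:
      - curl -m 15 -Ls {{ file_source }}.sha1 > /tmp/test.sha1
      - echo " {{ file_downloaded }}" >> /tmp/test.sha1
      - sha1sum -c /tmp/test.sha1
```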

@gtmanfred gtmanfred added the Feature and Core labels Aug 2, 2017
@gtmanfred gtmanfred added this to the Approved milestone Aug 2, 2017
@gtmanfred (Contributor)

I do not believe that this is possible right now, but I can see the purpose behind it.

@terminalmage is there a reason that we do not check the remote hash sum against the already cached file?

Thanks,
Daniel

@terminalmage (Contributor)

Well, the main reason is because we dispose of the archive after downloading it, by default. This is because if someone downloads a large archive to unpack it, they don't necessarily want that archive to be taking up space in the minion file cache.

@ghost (Author) commented Aug 8, 2017

Well, you could keep the file's checksum in the cache - it isn't large, but it would still be enough to avoid downloading large files to tens of minions.

@morganwillcock (Contributor)

I'm also struggling with this (as described in #38971). I've got quite a few large archives to extract and the default behavior means the master and minion get put under quite a lot of load, whereas previously it was a one time process and subsequent extractions were just skipped.

@terminalmage (Contributor)

@morganwillcock I don't understand, extraction should still be skipped when there is no change internally with regard to the paths in the archive.

The best way to skip downloading when the hash hasn't changed is probably to add an additional argument to the file.managed and archive.extracted states, which we can pass through to the fileclient so that it knows not to download if the hash matches a given value.

@morganwillcock (Contributor)

@terminalmage the load is coming purely from the transfer (Salt file server and file client) and the hashing rather than the extraction, and in some cases the transfers may be over Wi-Fi or VPN. Previously I had no issues with if_missing, but now it seems this is checked too late to skip all the other actions. There is also the complication that the default minion cache on Windows is deleted when the minion is upgraded, so if_missing was also stopping the archives from being re-downloaded after each upgrade.
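
For reference, a minimal sketch of the kind of state being discussed; the URL, hash location, and paths are placeholders. As described above, if_missing short-circuits the extraction, but the archive is still fetched and hashed before that check applies.

```yaml
extract_build_tools:
  archive.extracted:
    - name: 'C:\tools'
    - source: https://example.com/archives/build-tools.zip
    - source_hash: https://example.com/archives/build-tools.zip.sha256
    # Extraction is skipped when this path already exists, but the
    # download and hash check still run on every state execution.
    - if_missing: 'C:\tools\build-tools'
```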

@terminalmage (Contributor)

@morganwillcock OK thanks for the followup. I'll get something worked out.

@anitakrueger (Contributor)

I happen to have a state file that uses both file.managed and archive.extracted with files hosted on a remote web server. I have control over all of them, but was very surprised to see that on subsequent state runs all of them got downloaded again and again.
I was expecting the default behavior to be that once the file is managed and hasn't changed, it wouldn't be downloaded again. Adding keep=True to the archive.extracted state did not make a difference. It feels wrong to have to use a cmd.run state to achieve this.

Is there a path to resolution for this issue?

Thanks!
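
A hedged sketch of the combination described above, with placeholder URLs and hash locations; as reported in this thread, keep: True only preserves the cached archive and does not stop the remote file from being fetched again on each run.

```yaml
app_config:
  file.managed:
    - name: /etc/myapp/app.conf
    - source: https://example.com/configs/app.conf
    - source_hash: https://example.com/configs/app.conf.sha256

app_archive:
  archive.extracted:
    - name: /opt/myapp
    - source: https://example.com/archives/myapp.tar.gz
    - source_hash: https://example.com/archives/myapp.tar.gz.sha256
    # Keeps the downloaded archive in the minion cache, but does not
    # prevent it from being re-downloaded on subsequent runs.
    - keep: True
```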

@terminalmage (Contributor)

@anitakrueger I believe the reason for the current behavior had something to do with large archives filling up space in /var, but I'm not sure as I wasn't involved in writing that code.

The way I see it, what needs to be done is something like this:

  1. Change the caching behavior for archive.extracted, adding a clean argument and setting its default to False. Continued use of keep will produce a warning in the state return, and keep=False will be equivalent to clean=True.
  2. Add an option to file.managed and archive.extracted (no idea what to call it yet) which tells the state not to download the file again if one is in the cache which matches the source_hash. The current behavior is to always download, so that we can tell if the remote file has changed. Note that this means that use of skip_verify=True will result in the archive being downloaded each time the state is run.

@thatch45 thoughts? Also, do you think it's reasonable to change the default behavior to skip downloading the remote file if one exists in the minion cache and it matches the source_hash? I suggested in item 2 above that we add an argument to skip the download when we have a cached copy matching the hash, but I can also see an argument for changing the default behavior and requiring an argument to force downloading when the source_hash matches. My reasoning is that the source_hash can then be used to control whether or not we attempt to download the remote file. If the file has changed on the remote end, the user would just update the source_hash. We would then alter the file module to check for a cached copy before the point where we cache the file, and if the hash differs, we would re-download.
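
To make the proposal concrete, here is a purely hypothetical sketch of what items 1 and 2 above might look like in an SLS once implemented; clean and the hash-driven download skip are proposals from this thread, not arguments that exist in 2017.7, and the URL and hash are placeholders.

```yaml
deploy_jdk:
  archive.extracted:
    - name: /opt/jdk
    - source: https://example.com/files/jdk-8u141-linux-x64.tar.gz
    # Updating this value is what would trigger a fresh download; if a
    # cached copy already matches it, the download would be skipped.
    - source_hash: sha256=<hash of the current remote archive>
    # Proposed in item 1 above: clean: False (the proposed default) keeps
    # the cached archive, replacing the old keep=True behavior.
    - clean: False
```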

@terminalmage (Contributor)

The only potential problem I can see with this is that a minion that has run the state in the past and already has a cached copy, but the remote file's hash has changed since it downloaded the file, will now behave differently from one which doesn't have the file cached. Assuming identical SLS for both minions, the minion without the cached file would download it, try to verify the hash, and fail because the remote hash has changed, while the minion with the older cached copy would proceed and the state would succeed.

@morganwillcock (Contributor)

@terminalmage I think that if there were to be some sort of intelligent cache control, it should probably be implemented in the file client / transfer method rather than in individual modules (otherwise you are reliant on reading the docs to clarify how hashes are handled in each case). Having to update source_hash on the remote end seems in line with how everything else operates, so I would vote for that.

@terminalmage (Contributor)

What do you mean by "updating source_hash on the remote end"? I just want to make sure we are speaking about the same thing. When I say "remote" here I am speaking of the http(s) server from which we download the file. Essentially, what I am proposing is that we let the source_hash configured in the SLS drive whether or not we download. What this means however is that if the remote file changes, you wouldn't download the file unless the source_hash in the SLS is updated. But, since we would ultimately fail the state if the remote file's hash doesn't match the SLS, I don't see a problem with this.

Ultimately however, you are correct that the logic on whether or not to download would be done in the fileclient. The call to cp.cache_file would initiate a fileclient request that would decide whether or not to download the file. If it is determined that the file shouldn't be downloaded because we have a cached copy with the specified source_hash, then the fileclient would skip the download and simply return the locally-cached path. We would, however, need to pass the existing hash to cp.cache_file so that the fileclient has the hash to compare against the cached location.

@morganwillcock (Contributor)

By remote I was thinking of the SLS file on the master, so that all sounds fine to me.

jdelic added a commit to jdelic/saltshaker that referenced this issue Sep 6, 2017
@jdelic (Contributor) commented Sep 6, 2017

This ticket should probably be merged with #38971. In my opinion this is not a feature request; I explain why here: #38971 (comment).

@anitakrueger (Contributor)

I have to say I agree with @jdelic. I was using Salt 1.5 years ago for all my configurations and never ran into this problem. Now, being back on the Salt train, this is the first thing I noticed: remote files were being downloaded again despite my having just run the same state 5 minutes ago, and there was no property to "fix" this issue.

@terminalmage terminalmage added the Duplicate label and removed the Core and Feature labels Sep 7, 2017
@terminalmage terminalmage self-assigned this Sep 7, 2017
@terminalmage (Contributor)

I agree, this should not be seen as a feature request. And yes, it appears to be a duplicate of #38971. I'm working on this now.

@terminalmage (Contributor)

#43518 has been opened to fix this.
