
file.managed and archive.extracted don't cache remote files #42681

Closed
ghost opened this issue Aug 2, 2017 · 17 comments
@ghost commented Aug 2, 2017

Hello,
In our SaltStack configuration we use file.managed or archive.extracted with an https:// source, and we noticed that Salt downloads the source files regardless of whether they have changed on the source or the minion.
Quite large files (e.g. a JDK binary), deployed as states to tens of machines, caused unexpected network problems in our infrastructure. After investigating, I found that Salt downloads those files every time.
I worked around it by adding:
```yaml
- unless:
  - curl -m 15 -Ls {{ file_source }}.sha1 > /tmp/test.sha1
  - echo " {{ file_downloaded }}" >> /tmp/test.sha1
  - sha1sum -c /tmp/test.sha1
```
But it shouldn't have to work like that: the first step should download only the .sha1 file, and only if it differs from the local copy should the source itself be downloaded.
Versions: salt-call 2017.7.0 (Nitrogen) on many systems (CentOS 7, Ubuntu 16.04, Debian 8, etc.)
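
For illustration, a minimal sketch of how such a workaround can be wired into a complete state. The state ID, URL, and target path are placeholders, not taken from the original report; skip_verify is used because the unless check already compares the published .sha1 against the local copy.

```yaml
{% set file_source = 'https://example.com/files/jdk-8u141-linux-x64.tar.gz' %}
{% set file_downloaded = '/opt/downloads/jdk-8u141-linux-x64.tar.gz' %}

deploy_jdk_binary:
  file.managed:
    - name: {{ file_downloaded }}
    - source: {{ file_source }}
    - skip_verify: True
    - makedirs: True
    # These checks fetch only the small .sha1 file and verify it against
    # the file already on the minion; when the verification succeeds, the
    # state (and therefore the large download) is skipped.
    - unless:
      - curl -m 15 -Ls {{ file_source }}.sha1 > /tmp/test.sha1
      - echo " {{ file_downloaded }}" >> /tmp/test.sha1
      - sha1sum -c /tmp/test.sha1
```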

@gtmanfred gtmanfred added the Feature and Core labels Aug 2, 2017
@gtmanfred gtmanfred added this to the Approved milestone Aug 2, 2017
@gtmanfred (Contributor)

I do not believe that this is possible right now, but I can see the purpose behind it.

@terminalmage is there a reason that we do not check the remote hash sum against the already cached file?

Thanks,
Daniel

@terminalmage (Contributor)

Well, the main reason is because we dispose of the archive after downloading it, by default. This is because if someone downloads a large archive to unpack it, they don't necessarily want that archive to be taking up space in the minion file cache.

@ghost (Author) commented Aug 8, 2017

Well, you could keep the file's checksum in the cache - it isn't large, but it would still be enough to avoid downloading large files to tens of minions.

@morganwillcock (Contributor)

I'm also struggling with this (as described in #38971). I've got quite a few large archives to extract and the default behavior means the master and minion get put under quite a lot of load, whereas previously it was a one time process and subsequent extractions were just skipped.

@terminalmage (Contributor)

@morganwillcock I don't understand, extraction should still be skipped when there is no change internally with regard to the paths in the archive.

The best way to skip downloading when the hash hasn't changed is probably to add an additional argument to the file.managed and archive.extracted states, which we can pass through to the fileclient so that it knows not to download if the hash matches a given value.

@morganwillcock (Contributor)

@terminalmage the load is coming purely from the transfer (Salt file server and file client) and the hashing rather than the extraction, and in some cases the transfers may be over Wi-Fi or VPN. Previously I had no issues with if_missing, but now it seems this is checked too late to skip all the other actions. There is also the complication that the default minion cache on Windows is deleted when the minion is upgraded, so if_missing was also stopping the archives from being re-downloaded after each upgrade.
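
For reference, a minimal sketch of the kind of state being discussed; the URL, hash location, and paths are placeholders. As described above, if_missing short-circuits the extraction, but the archive is still fetched and hashed before that check applies.

```yaml
extract_build_tools:
  archive.extracted:
    - name: 'C:\tools'
    - source: https://example.com/archives/build-tools.zip
    - source_hash: https://example.com/archives/build-tools.zip.sha256
    # Extraction is skipped when this path already exists, but the
    # download and hash check still run on every state execution.
    - if_missing: 'C:\tools\build-tools'
```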

@terminalmage (Contributor)

@morganwillcock OK thanks for the followup. I'll get something worked out.

@anitakrueger (Contributor)

I happen to have a state file that uses both file.managed and archive.extracted with files hosted on a remote web server. I have control over all of them, but was very surprised to see that on subsequent state runs all of them got downloaded again and again.
I was expecting the default behavior to be that once the file is managed and hasn't changed, it wouldn't be downloaded again. Adding keep=True to the archive.extracted state did not make a difference. It feels wrong to have to use a cmd.run state to achieve this.

Is there a path to resolution for this issue?

Thanks!
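
A hedged sketch of the combination described above, with placeholder URLs and hash locations; as reported in this thread, keep: True only preserves the cached archive and does not stop the remote file from being fetched again on each run.

```yaml
app_config:
  file.managed:
    - name: /etc/myapp/app.conf
    - source: https://example.com/configs/app.conf
    - source_hash: https://example.com/configs/app.conf.sha256

app_archive:
  archive.extracted:
    - name: /opt/myapp
    - source: https://example.com/archives/myapp.tar.gz
    - source_hash: https://example.com/archives/myapp.tar.gz.sha256
    # Keeps the downloaded archive in the minion cache, but does not
    # prevent it from being re-downloaded on subsequent runs.
    - keep: True
```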

@terminalmage (Contributor)

@anitakrueger I believe the reason for the current behavior had something to do with large archives filling up space in /var, but I'm not sure as I wasn't involved in writing that code.

The way I see it, what needs to be done is something like this:

  1. Change the caching behavior for archive.extracted, adding a clean argument and setting its default to False. Continued use of keep will produce a warning in the state return, and keep=False will be equivalent to clean=True.
  2. Add an option to file.managed and archive.extracted (no idea what to call it yet) which tells the state not to download the file again if one is in the cache which matches the source_hash. The current behavior is to always download, so that we can tell if the remote file has changed. Note that this means that use of skip_verify=True will result in the archive being downloaded each time the state is run.

@thatch45 thoughts? Also, do you think it's reasonable to change the default behavior to skip downloading the remote file if one exists in the minion cache and it matches the source_hash? I suggested in item 2 above that we add an argument to skip the download when we have a cached copy matching the hash, but I can also see an argument for changing the default behavior and requiring an argument to force downloading when the source_hash matches. My reasoning is that the source_hash can then be used to control whether or not we attempt to download the remote file. If the file has changed on the remote end, the user would just update the source_hash. We would then alter the file module to check for a cached copy before the point where we cache the file, and if the hash differs, we would re-download.
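
To make the proposal concrete, here is a purely hypothetical sketch of what items 1 and 2 above might look like in an SLS once implemented; clean and the hash-driven download skip are proposals from this thread, not arguments that exist in 2017.7, and the URL and hash are placeholders.

```yaml
deploy_jdk:
  archive.extracted:
    - name: /opt/jdk
    - source: https://example.com/files/jdk-8u141-linux-x64.tar.gz
    # Updating this value is what would trigger a fresh download; if a
    # cached copy already matches it, the download would be skipped.
    - source_hash: sha256=<hash of the current remote archive>
    # Proposed in item 1 above: clean: False (the proposed default) keeps
    # the cached archive, replacing the old keep=True behavior.
    - clean: False
```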

@terminalmage (Contributor)

The only potential problem I can see with this is that a minion that has run the state in the past and already has a cached copy, but the remote file's hash has changed since it downloaded the file, will now behave differently from one which doesn't have the file cached. Assuming identical SLS for both minions, the minion without the cached file would download it, try to verify the hash, and fail because the remote hash has changed, while the minion with the older cached copy would proceed and the state would succeed.

@morganwillcock (Contributor)

@terminalmage I think that if there were to be some sort of intelligent cache control, it should probably be implemented in the file client / transfer method rather than in individual modules (otherwise you are reliant on reading the docs to clarify how hashes are handled in each case). Having to update source_hash on the remote end seems in line with how everything else operates, so I would vote for that.

@terminalmage (Contributor)

What do you mean by "updating source_hash on the remote end"? I just want to make sure we are speaking about the same thing. When I say "remote" here I am speaking of the http(s) server from which we download the file. Essentially, what I am proposing is that we let the source_hash configured in the SLS drive whether or not we download. What this means however is that if the remote file changes, you wouldn't download the file unless the source_hash in the SLS is updated. But, since we would ultimately fail the state if the remote file's hash doesn't match the SLS, I don't see a problem with this.

Ultimately however, you are correct that the logic on whether or not to download would be done in the fileclient. The call to cp.cache_file would initiate a fileclient request that would decide whether or not to download the file. If it is determined that the file shouldn't be downloaded because we have a cached copy with the specified source_hash, then the fileclient would skip the download and simply return the locally-cached path. We would, however, need to pass the existing hash to cp.cache_file so that the fileclient has the hash to compare against the cached location.

@morganwillcock (Contributor)

By remote I was thinking of the SLS file on the master, so that all sounds fine to me.

jdelic added a commit to jdelic/saltshaker that referenced this issue Sep 6, 2017
@jdelic (Contributor) commented Sep 6, 2017

This ticket should probably be merged with #38971. In my opinion this is not a feature request; I explain why here: #38971 (comment).

@anitakrueger (Contributor)

I have to say I agree with @jdelic. I was using Salt 1.5 years ago for all my configurations and never ran into this problem. Now, being back on the Salt train, this is the first thing I noticed: remote files were being downloaded again despite my having just run the same state 5 minutes ago, and there was no property to "fix" this issue.

@terminalmage terminalmage added the Duplicate label and removed the Core and Feature labels Sep 7, 2017
@terminalmage terminalmage self-assigned this Sep 7, 2017
@terminalmage (Contributor)

I agree, this should not be seen as a feature request. And yes, it appears to be a duplicate of #38971. I'm working on this now.

@terminalmage (Contributor)

#43518 has been opened to fix this.
