feat: upload DB dump to AWS S3 #10863
Conversation
```diff
@@ -21,6 +21,16 @@ for export in en.openfoodfacts.org.products.csv fr.openfoodfacts.org.products.cs
     mv -f new.$export.gz $export.gz
 done

+# Copy CSV and RDF files to AWS S3 using MinIO client
+mc cp \
```
Some of those files are quite big and could take a long time to upload. What happens in the meantime? Is there a temporary file on S3, with the existing file replaced once the full file has been received, or could someone download a file that is only partially uploaded?
Also, why not do it in the background and let the script continue, by adding a `&` at the end of the command?

> If a command is terminated by the control operator &, the shell executes the command in the background in a subshell.
@CharlesNepote the problem is that if you use a `&`, it's harder to know whether there has been an error. If we do this, we have to join all children at the end and ensure they all ended correctly (or fail).
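For illustration, a minimal sketch of that join-the-children pattern in bash (the file list and error handling here are assumptions for the example, not the script's actual code):

```bash
#!/bin/bash
# Sketch only: start each upload in the background, remember its PID,
# then wait for every child and fail the script if any of them failed.
pids=()

mc cp en.openfoodfacts.org.products.csv.gz s3/openfoodfacts-ds &
pids+=($!)
mc cp "${PREFIX}-mongodbdump.gz" s3/openfoodfacts-ds &
pids+=($!)

status=0
for pid in "${pids[@]}"; do
    wait "$pid" || status=1   # propagate the child's failure
done
exit "$status"
```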
CSV+RDF files took 5 minutes to upload (Total: 30.18 GiB, Transferred: 30.18 GiB, Speed: 107.80 MiB/s)
Other dumps took 3 minutes to upload (Total: 16.99 GiB, Transferred: 16.99 GiB, Speed: 106.60 MiB/s)
From what I can see in the doc (and what I've observed during upload), PUT operations on an existing key are atomic: https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html#ConsistencyModel
So either the file is fully uploaded and it's available, or the old version is served.
```diff
@@ -21,6 +21,16 @@ for export in en.openfoodfacts.org.products.csv fr.openfoodfacts.org.products.cs
     mv -f new.$export.gz $export.gz
 done

+# Copy CSV and RDF files to AWS S3 using MinIO client
+mc cp \
```
```diff
+    ${PREFIX}-products.jsonl.gz \
+    ${PREFIX}_recent_changes.jsonl.gz \
+    ${PREFIX}-mongodbdump.gz \
+    s3/openfoodfacts-ds
```
Same comment: you can add a `&` if you want to do it in the background.
Since the subject of the time taken by each command during the daily dump has come up, here is a quick analysis I did using the latest logs. It's not 100% accurate, as some tasks such as `product_countries.js` don't log anything, but it's useful to get the big numbers:
Almost 6 hours for CSV and RDF seems like a lot to me. Ideally, we would have a single source of truth (JSONL or MongoDB), and all the other dumps would be generated on another server, where we have plenty of CPU/IO to do this. Another advantage of this approach is that generation could be parallelized, as in the sketch below.
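A hedged sketch of that idea (`gen_export.sh` is a hypothetical generator script, not part of this PR): each export format could be derived from the JSONL dump in parallel, for example with xargs, which exits non-zero if any invocation fails:

```bash
# Sketch: generate the CSV and RDF exports from the JSONL source of
# truth, up to 4 generators at a time. xargs exits with a non-zero
# status (123) if any of the invoked commands fails.
printf '%s\n' csv rdf | \
    xargs -P 4 -I{} ./gen_export.sh {} en.openfoodfacts.org.products.jsonl.gz
```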
Some processes seem to crash in the script, by the way, probably because the `python` command is missing (replaced by `python3`):
🤖 I have created a release *beep* *boop*

---

## [2.46.0](v2.45.0...v2.46.0) (2024-10-18)

### Features

* upload DB dump to AWS S3 ([#10863](#10863)) ([34ae5e4](34ae5e4))

### Bug Fixes

* docs (paragraph 24) ([#10849](#10849)) ([354c22c](354c22c))
* docs fix a broken Internal URL to Open Prices ([#10852](#10852)) ([d318472](d318472))
* docs Syntax issues ([#10851](#10851)) ([56275c4](56275c4))
* downgrade jquery-ui ([#10877](#10877)) ([2cd6fd5](2cd6fd5))
* In the Folksonomy Engine table, property and value headers were not at the right place ([#10857](#10857)) ([7547657](7547657))
* remove off days banner ([#10908](#10908)) ([855ae0c](855ae0c))
* update paths for EAN8 and short barcodes (padding with zeroes) - DO NOT MERGE ([#10472](#10472)) ([3c18781](3c18781))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
Upload the following dump files to AWS S3, just after they are created in the gen_feed_daily script:
Also add redirects (HTTP 302) to AWS S3 in the off nginx configuration so that we save I/O.
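A possible shape for such a redirect in the nginx configuration (a sketch: the location path, S3 host, and exact file are assumptions here; only the openfoodfacts-ds bucket name comes from the diff above):

```nginx
# Sketch: send dump downloads to S3 with a temporary redirect instead of
# serving the large file from local disk, saving I/O on the off server.
location = /data/en.openfoodfacts.org.products.csv.gz {
    return 302 https://openfoodfacts-ds.s3.amazonaws.com/en.openfoodfacts.org.products.csv.gz;
}
```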
We use the MinIO client (mc) for synchronization. We expect /home/off/.mc/config.json to contain the AWS credentials.
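For reference, a minimal alias entry of the kind mc stores in that file (a sketch; the endpoint and keys are placeholders, and the exact layout may vary across mc versions):

```json
{
    "version": "10",
    "aliases": {
        "s3": {
            "url": "https://s3.amazonaws.com",
            "accessKey": "REPLACE_WITH_ACCESS_KEY",
            "secretKey": "REPLACE_WITH_SECRET_KEY",
            "api": "S3v4",
            "path": "auto"
        }
    }
}
```

Such an entry is normally written with `mc alias set` rather than by editing the file by hand.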