
feat: upload DB dump to AWS S3 #10863

Merged: 3 commits merged into main from aws-s3-upload on Oct 7, 2024

Conversation

raphael0202 (Contributor):

Upload the following dump files to AWS S3, just after they are created in the gen_feed_daily script:

  • en.openfoodfacts.org.products.csv
  • en.openfoodfacts.org.products.csv.gz
  • fr.openfoodfacts.org.products.csv
  • fr.openfoodfacts.org.products.csv.gz
  • en.openfoodfacts.org.products.rdf
  • fr.openfoodfacts.org.products.rdf
  • openfoodfacts-products.jsonl.gz
  • openfoodfacts-mongodbdump.gz
  • openfoodfacts_recent_changes.jsonl.gz

Also add HTTP 302 redirects to AWS S3 in the off nginx configuration, so that we save I/O on the server.
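
To make the intent concrete, here is a minimal sketch of what such a redirect could look like; the location path and the bucket URL are assumptions for illustration, not the actual configuration added in this PR (only the bucket name openfoodfacts-ds appears in the diff):

```nginx
# Hypothetical example: redirect a dump download to the S3 bucket instead of serving it locally.
location = /data/en.openfoodfacts.org.products.csv.gz {
    # Bucket URL assumed; the real endpoint/region may differ.
    return 302 https://openfoodfacts-ds.s3.amazonaws.com/en.openfoodfacts.org.products.csv.gz;
}
```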

We use the MinIO client (mc) for the synchronization, and expect /home/off/.mc/config.json to contain the AWS credentials.
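
For reference, the upload boils down to an mc cp invocation roughly along these lines (a sketch reconstructed from the description and the diff below; the alias s3 and the bucket openfoodfacts-ds are taken from the diff, the exact file list and options may differ):

```bash
# Copy the CSV and RDF exports to the AWS S3 bucket using the MinIO client.
# "s3" is the mc alias configured in /home/off/.mc/config.json,
# "openfoodfacts-ds" is the destination bucket (both appear in the diff below).
mc cp \
    en.openfoodfacts.org.products.csv \
    en.openfoodfacts.org.products.csv.gz \
    fr.openfoodfacts.org.products.csv \
    fr.openfoodfacts.org.products.csv.gz \
    en.openfoodfacts.org.products.rdf \
    fr.openfoodfacts.org.products.rdf \
    s3/openfoodfacts-ds
```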

raphael0202 requested a review from a team as a code owner on October 3, 2024, 12:15
github-actions bot added the export, MongoDB, NGINX and Data exports labels on Oct 3, 2024

@@ -21,6 +21,16 @@ for export in en.openfoodfacts.org.products.csv fr.openfoodfacts.org.products.cs
mv -f new.$export.gz $export.gz
done

# Copy CSV and RDF files to AWS S3 using MinIO client
mc cp \
Contributor:

Some of those files are quite big and could take a long time to upload. What happens in the meantime? Is there a temporary file on S3, with the existing file replaced only once the full file has been received, or could someone download a file that is only partially uploaded?

Member:

Also, why not do it in the background to let the script continue, by adding a & at the end of the command?

If a command is terminated by the control operator &, the shell executes the command in the background in a subshell.

Member:

@CharlesNepote the problem is that if you use an '&', it's harder to know whether there has been an error. If we do this, we have to join all the children at the end and ensure they completed correctly (or fail otherwise).
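
A minimal sketch of what that could look like (illustrative only, not code from this PR; the single background job and the error handling are assumptions):

```bash
# Start the upload in the background so the script can continue.
mc cp en.openfoodfacts.org.products.csv.gz s3/openfoodfacts-ds &
upload_pid=$!

# ... other export steps run here while the upload proceeds ...

# Join the background job at the end and abort if the upload failed.
if ! wait "$upload_pid"; then
    echo "S3 upload failed" >&2
    exit 1
fi
```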

Contributor Author:

CSV+RDF files took 5 minutes to upload (Total: 30.18 GiB, Transferred: 30.18 GiB, Speed: 107.80 MiB/s)
Other dumps took 3 minutes to upload (Total: 16.99 GiB, Transferred: 16.99 GiB, Speed: 106.60 MiB/s)

From what I can see in the doc (and what I've observed during upload), PUT operations on an existing key are atomic: https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html#ConsistencyModel

So either the file is fully uploaded and it's available, or the old version is served.


${PREFIX}-products.jsonl.gz \
${PREFIX}_recent_changes.jsonl.gz \
${PREFIX}-mongodbdump.gz \
s3/openfoodfacts-ds
Member:

Same comment: you can add & if you want to do it in the background.

raphael0202 (Contributor Author) commented on Oct 4, 2024:

As the subject of the time taken by each command during the daily dump comes up, here is a quick analysis I did using the latest logs. It's not 100% accurate, as some tasks such as product_countries.js don't log anything, but it's useful to have the big numbers:

2.15AM: start of the job
2.15AM->2.17AM: removing empty products (<1 min)
2.17AM->3.29AM: completed products
3.29AM->3.30AM: product stats per country (<1 min)
3.30AM->6.01AM: CSV and RDF files for English (2h40)
6.01AM->9.11AM: CSV and RDF files for French (3h10)
9.11AM->9.16AM: transfer of the CSV+RDF files to AWS S3
9.16AM->12.11PM: MongoDB and JSONL exports (~3h)
12.11PM->12.14PM: transfer of the MongoDB+JSONL dumps to AWS S3
12.16PM->12.19PM: export_products_data_and_images.pl (3 min)
12.19PM->12.21PM: diff files (the command actually crashed: "Oct 04 12:21:21 off gen_feeds_daily_off.sh[1013528]: [Fri Oct  4 12:21:21 2024] export_csv_file.pl: MongoDB::DatabaseError: Error in $cursor stage :: caused by :: operation exceeded time limit")

Almost 6 hours for CSV and RDF seems like a lot to me.

Ideally, we would have a single source of truth (JSONL or MongoDB), and all the other dumps would be generated on another server, where we have plenty of CPUs/IO to do this. Another advantage of this approach is that generation could be parallelized.

raphael0202 (Contributor Author):

By the way, some processes in the script seem to crash, probably because the python command is missing (it has been replaced by python3):

Oct 04 12:19:41 off gen_feeds_daily_off.sh[1013487]: /srv/off/scripts/gen_feeds_daily_off.sh: ./generate_dump_for_offline_apps_off.py: /usr/bin/python: bad interpreter: No such file or directory
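
A side note, not something changed in this PR: the usual fix would be to stop depending on /usr/bin/python, for example by rewriting the script's shebang (the original shebang is inferred from the "bad interpreter" error above):

```bash
# Hypothetical fix: point the script at python3 instead of the removed /usr/bin/python.
sed -i '1s|^#!/usr/bin/python$|#!/usr/bin/env python3|' ./generate_dump_for_offline_apps_off.py
```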

raphael0202 merged commit 34ae5e4 into main on Oct 7, 2024
12 checks passed
raphael0202 deleted the aws-s3-upload branch on October 7, 2024, 13:07
stephanegigandet pushed a commit that referenced this pull request Oct 18, 2024
🤖 I have created a release *beep* *boop*
---


## [2.46.0](v2.45.0...v2.46.0) (2024-10-18)


### Features

* upload DB dump to AWS S3 ([#10863](#10863)) ([34ae5e4](34ae5e4))


### Bug Fixes

* docs (paragraph 24) ([#10849](#10849)) ([354c22c](354c22c))
* docs fix a broken Internal URL to Open Prices ([#10852](#10852)) ([d318472](d318472))
* docs Syntax issues ([#10851](#10851)) ([56275c4](56275c4))
* downgrade jquery-ui ([#10877](#10877)) ([2cd6fd5](2cd6fd5))
* In the Folksonomy Engine table, property and value headers were not at the right place ([#10857](#10857)) ([7547657](7547657))
* remove off days banner ([#10908](#10908)) ([855ae0c](855ae0c))
* update paths for EAN8 and short barcodes (padding with zeroes) - DO NOT MERGE ([#10472](#10472)) ([3c18781](3c18781))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See the [documentation](https://github.com/googleapis/release-please#release-please).