
feat: upload DB dump to AWS S3 #10863

Merged: 3 commits merged into main from aws-s3-upload on Oct 7, 2024

Conversation

raphael0202 (Contributor):

Upload the following dump files to AWS S3, just after they are created in the gen_feed_daily script:

  • en.openfoodfacts.org.products.csv
  • en.openfoodfacts.org.products.csv.gz
  • fr.openfoodfacts.org.products.csv
  • fr.openfoodfacts.org.products.csv.gz
  • en.openfoodfacts.org.products.rdf
  • fr.openfoodfacts.org.products.rdf
  • openfoodfacts-products.jsonl.gz
  • openfoodfacts-mongodbdump.gz
  • openfoodfacts_recent_changes.jsonl.gz

Also add HTTP 302 redirects to AWS S3 in the off nginx configuration, so that we save I/O on the server.
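
To make the intent concrete, here is a minimal sketch of what such a redirect could look like; the location path and the bucket URL are assumptions for illustration, not the actual configuration added in this PR (only the bucket name openfoodfacts-ds appears in the diff):

```nginx
# Hypothetical example: redirect a dump download to the S3 bucket instead of serving it locally.
location = /data/en.openfoodfacts.org.products.csv.gz {
    # Bucket URL assumed; the real endpoint/region may differ.
    return 302 https://openfoodfacts-ds.s3.amazonaws.com/en.openfoodfacts.org.products.csv.gz;
}
```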

We use the MinIO client (mc) for the synchronization, and expect /home/off/.mc/config.json to contain the AWS credentials.
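
For reference, the upload boils down to an mc cp invocation roughly along these lines (a sketch reconstructed from the description and the diff below; the alias s3 and the bucket openfoodfacts-ds are taken from the diff, the exact file list and options may differ):

```bash
# Copy the CSV and RDF exports to the AWS S3 bucket using the MinIO client.
# "s3" is the mc alias configured in /home/off/.mc/config.json,
# "openfoodfacts-ds" is the destination bucket (both appear in the diff below).
mc cp \
    en.openfoodfacts.org.products.csv \
    en.openfoodfacts.org.products.csv.gz \
    fr.openfoodfacts.org.products.csv \
    fr.openfoodfacts.org.products.csv.gz \
    en.openfoodfacts.org.products.rdf \
    fr.openfoodfacts.org.products.rdf \
    s3/openfoodfacts-ds
```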

raphael0202 requested a review from a team as a code owner on October 3, 2024, 12:15
github-actions bot added the export, MongoDB, NGINX and Data exports labels on Oct 3, 2024

@@ -21,6 +21,16 @@ for export in en.openfoodfacts.org.products.csv fr.openfoodfacts.org.products.cs
mv -f new.$export.gz $export.gz
done

# Copy CSV and RDF files to AWS S3 using MinIO client
mc cp \
Contributor:

Some of those files are quite big and could take a long time to upload. What happens in the meantime? Is there a temporary file on S3, with the existing file replaced only once the full file has been received, or could someone download a file that is only partially uploaded?

Member:

Also, why not do it in the background to let the script continue, by adding a & at the end of the command?

If a command is terminated by the control operator &, the shell executes the command in the background in a subshell.

Member:

@CharlesNepote the problem is that if you use an '&', it's harder to know whether there has been an error. If we do this, we have to join all the children at the end and ensure they completed correctly (or fail otherwise).
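
A minimal sketch of what that could look like (illustrative only, not code from this PR; the single background job and the error handling are assumptions):

```bash
# Start the upload in the background so the script can continue.
mc cp en.openfoodfacts.org.products.csv.gz s3/openfoodfacts-ds &
upload_pid=$!

# ... other export steps run here while the upload proceeds ...

# Join the background job at the end and abort if the upload failed.
if ! wait "$upload_pid"; then
    echo "S3 upload failed" >&2
    exit 1
fi
```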

Contributor Author:

CSV+RDF files took 5 minutes to upload (Total: 30.18 GiB, Transferred: 30.18 GiB, Speed: 107.80 MiB/s)
Other dumps took 3 minutes to upload (Total: 16.99 GiB, Transferred: 16.99 GiB, Speed: 106.60 MiB/s)

From what I can see in the doc (and what I've observed during upload), PUT operations on an existing key are atomic: https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html#ConsistencyModel

So either the file is fully uploaded and it's available, or the old version is served.


${PREFIX}-products.jsonl.gz \
${PREFIX}_recent_changes.jsonl.gz \
${PREFIX}-mongodbdump.gz \
s3/openfoodfacts-ds
Member:

Same comment: you can add & if you want to do it in the background.

raphael0202 (Contributor Author) commented on Oct 4, 2024:

As the subject of the time taken by each command during the daily dump comes up, here is a quick analysis I did using the latest logs. It's not 100% accurate, as some tasks such as product_countries.js don't log anything, but it's useful to have the big numbers:

2.15AM: start of the job
2.15AM->2.17AM: removing empty products (<1 min)
2.17AM->3.29AM: completed products
3.29AM->3.30AM: product stats per country (<1 min)
3.30AM->6.01AM: CSV and RDF files for English (2h40)
6.01AM->9.11AM: CSV and RDF files for French (3h10)
9.11AM->9.16AM: transfer of the CSV+RDF files to AWS S3
9.16AM->12.11PM: MongoDB and JSONL exports (~3h)
12.11PM->12.14PM: transfer of the MongoDB+JSONL dumps to AWS S3
12.16PM->12.19PM: export_products_data_and_images.pl (3 min)
12.19PM->12.21PM: diff files (the command actually crashed: "Oct 04 12:21:21 off gen_feeds_daily_off.sh[1013528]: [Fri Oct  4 12:21:21 2024] export_csv_file.pl: MongoDB::DatabaseError: Error in $cursor stage :: caused by :: operation exceeded time limit")

Almost 6 hours for CSV and RDF seems like a lot to me.

Ideally, we would have a single source of truth (JSONL or MongoDB), and all the other dumps would be generated on another server, where we have plenty of CPUs/IO to do this. Another advantage of this approach is that generation could be parallelized.

raphael0202 (Contributor Author):

By the way, some processes in the script seem to crash, probably because the python command is missing (it has been replaced by python3):

Oct 04 12:19:41 off gen_feeds_daily_off.sh[1013487]: /srv/off/scripts/gen_feeds_daily_off.sh: ./generate_dump_for_offline_apps_off.py: /usr/bin/python: bad interpreter: No such file or directory
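
A side note, not something changed in this PR: the usual fix would be to stop depending on /usr/bin/python, for example by rewriting the script's shebang (the original shebang is inferred from the "bad interpreter" error above):

```bash
# Hypothetical fix: point the script at python3 instead of the removed /usr/bin/python.
sed -i '1s|^#!/usr/bin/python$|#!/usr/bin/env python3|' ./generate_dump_for_offline_apps_off.py
```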

raphael0202 merged commit 34ae5e4 into main on Oct 7, 2024
12 checks passed
raphael0202 deleted the aws-s3-upload branch on October 7, 2024, 13:07
stephanegigandet pushed a commit that referenced this pull request Oct 18, 2024
🤖 I have created a release *beep* *boop*
---


## [2.46.0](v2.45.0...v2.46.0) (2024-10-18)


### Features

* upload DB dump to AWS S3 ([#10863](#10863)) ([34ae5e4](34ae5e4))


### Bug Fixes

* docs (paragraph 24) ([#10849](#10849)) ([354c22c](354c22c))
* docs fix a broken Internal URL to Open Prices ([#10852](#10852)) ([d318472](d318472))
* docs Syntax issues ([#10851](#10851)) ([56275c4](56275c4))
* downgrade jquery-ui ([#10877](#10877)) ([2cd6fd5](2cd6fd5))
* In the Folksonomy Engine table, property and value headers were not at the right place ([#10857](#10857)) ([7547657](7547657))
* remove off days banner ([#10908](#10908)) ([855ae0c](855ae0c))
* update paths for EAN8 and short barcodes (padding with zeroes) - DO NOT MERGE ([#10472](#10472)) ([3c18781](3c18781))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See the [documentation](https://github.com/googleapis/release-please#release-please).