Move products_tags to Postgres to improve performance #8676

Closed
8 tasks done
Tracked by #8764
john-gom opened this issue Jul 10, 2023 · 6 comments · Fixed by #8947

Comments


john-gom commented Jul 10, 2023

Description

Many aggregate queries time out, even when using the products_tags collection in MongoDB. Initial tests indicate that Postgres would offer much better performance for this type of query.

Acceptance criteria

  • Aggregate queries run more quickly
  • Results are the same as when using MongoDB

What would a demo look like

Show improved query performance

Notes

The strategy is to create a new openfoodfacts-data repo to wrap the data storage functionality. This would periodically query MongoDB for changed products (it needs to check products_obsolete too) and update a cache in Postgres.
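
As a rough illustration of the refresh step, here is a minimal sketch assuming a plain tag table keyed by product; the table, columns and barcode below are hypothetical, not a final schema:

-- Hypothetical sketch of one refresh cycle for a single changed product.
-- Table and column names are illustrative only.
begin;

-- Drop the stale cached tags for a product that changed in MongoDB...
delete from product_tag where product_id = '0123456789012';

-- ...then re-insert the current tags taken from the MongoDB document.
insert into product_tag (product_id, tag_type, "sequence", value) values
  ('0123456789012', 'nutrition_grades_tags', 0, 'a'),
  ('0123456789012', 'ecoscore_tags', 0, 'b');

commit;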

We can hopefully use the minion database for this, but might want to consider something separate.

Future work could include joining to taxonomies in the query to avoid the need for tag matching in Perl.
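
For example, a future aggregate query might look something like the sketch below; the taxonomy table and its columns are purely hypothetical:

-- Hypothetical: resolve tag values against a taxonomy table in SQL so that
-- canonicalisation no longer requires tag matching in Perl.
select t.canonical_value, count(*)
from product_tag pt
join taxonomy t on t.tag_type = pt.tag_type and t.value = pt.value
where pt.tag_type = 'categories_tags'
group by t.canonical_value
order by 1;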

Tasks

  • Create off-data repo
  • Create dockerised version of off-data with a dev dependency on MongoDB and Postgres
  • Perform one-off import of product data from MongoDB
  • Update off-server to pull in the off-data image as a dependency
  • Verify that off-server can call through to off-data for an aggregate query
  • Implement query and verify improved performance
  • Implement remaining queries
  • Remove references to products_tags

Part of epic

@john-gom

May want to consider using Citus: https://github.com/citusdata/citus


john-gom commented Jul 30, 2023

Evaluating Citus vs DuckDB. First, export all the tags into a CSV file. To import into Postgres (Citus columnar):

CREATE TABLE product_tag_columnar (
    product_id text NULL,
    tag_type text NULL,
    "sequence" int4 NULL,
    value text NULL
) USING columnar;

COPY product_tag_columnar FROM '/mnt/data/product_tags.csv' (FORMAT 'csv');

Import took 10 minutes, 52 seconds.

DuckDB:

CREATE TABLE product_tag_columnar (
    product_id varchar,
    tag_type varchar,
    "sequence" integer,
    value varchar
);

COPY product_tag_columnar FROM 'product_tags.csv';

Import took 2 minutes, 38 seconds.

Queries:

select value, count(*)
from product_tag_columnar
where tag_type = 'nutrition_grades_tags'
group by value
order by 1;

First run: DuckDB: 16s, Citus: 39s
Second run: DuckDB: 16s, Citus: 39s

select value, count(*)
from product_tag_columnar pt
where tag_type = 'nutrition_grades_tags'
  and exists (
    select *
    from product_tag_columnar pt2
    where pt2.product_id = pt.product_id
      and pt2.tag_type = 'ecoscore_tags'
      and pt2.value = 'd'
  )
group by value
order by 1;

First run: DuckDB: 50s, Citus: 87s
Second run: DuckDB: 50s, Citus: 84s

The Citus figures are a lot slower than before. Investigating...

@john-gom

Tried inserting into Citus in a different order:

insert into product_tag_2 select * from product_tag_columnar order by 2, 4, 1;

The insert took 7 minutes, 37 seconds. Statement 1 now took 0.8s and statement 2 took 1.6s, so the structure of the index (the physical ordering of the data) makes a big difference.
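
One way to confirm where the speed-up comes from (a suggestion only, not something run here): if the table uses Citus columnar storage, EXPLAIN ANALYZE reports how much data the filter lets the scan skip, so comparing plans before and after the ordered insert should show the difference (exact output varies by Citus version):

-- Assumed diagnostic: inspect the plan on the re-ordered table.
explain (analyze)
select value, count(*)
from product_tag_2
where tag_type = 'nutrition_grades_tags'
group by value
order by 1;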

@john-gom

For DuckDB, I re-exported the data sorted by tag_type and value and re-imported it into a fresh database. The import took 2 minutes, 56 seconds.
Statement 1: 220ms
Statement 2: 270ms
So the order of the data also makes a big difference for DuckDB.
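
The re-export and re-import could look roughly like this in DuckDB (the sorted file name is illustrative):

-- Export the tags so that rows with equal tag_type/value are contiguous.
COPY (
    SELECT product_id, tag_type, "sequence", value
    FROM product_tag_columnar
    ORDER BY tag_type, value
) TO 'product_tags_sorted.csv' (FORMAT csv);

-- Then, in a fresh database (after re-creating the table as above),
-- re-import the sorted file.
COPY product_tag_columnar FROM 'product_tags_sorted.csv';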


john-gom commented Jul 31, 2023

Tried another approach with plain Postgres: create a table for each tag type. Just created the two being used:

create table off.product_nutrition_grades_tags (
    product_id text,
    value text,
    constraint "product_nutrition_grades_tags_pkey" primary key (product_id, value)
);

insert into off.product_nutrition_grades_tags
select distinct product_id, value
from product_tag_columnar
where tag_type = 'nutrition_grades_tags' and product_id is not null and value is not null;

create table off.product_ecoscore_tags (
    product_id text,
    value text,
    constraint "product_ecoscore_tags_pkey" primary key (product_id, value)
);

insert into off.product_ecoscore_tags
select distinct product_id, value
from product_tag_columnar
where tag_type = 'ecoscore_tags' and product_id is not null and value is not null;

Query 1 is now:

select value, count(*)
from product_nutrition_grades_tags
group by value
order by 1;

Took 500ms.

Query 2:

select value, count(*)
from product_nutrition_grades_tags pt
where exists (
    select *
    from product_ecoscore_tags pt2
    where pt2.product_id = pt.product_id and pt2.value = 'd'
)
group by value
order by 1;

Took 760ms.
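
If this per-tag-type layout is adopted, the same pattern could be generated for every tag type instead of hand-writing each table. A minimal sketch, assuming the off schema and the staging table above, and assuming tag_type values are safe to embed in identifiers:

-- Sketch: create and populate one table per distinct tag type.
do $$
declare
    t text;
begin
    for t in select distinct tag_type from product_tag_columnar where tag_type is not null loop
        execute format(
            'create table if not exists off.%I (product_id text, value text, primary key (product_id, value))',
            'product_' || t);
        execute format(
            'insert into off.%I select distinct product_id, value from product_tag_columnar where tag_type = %L and product_id is not null and value is not null',
            'product_' || t, t);
    end loop;
end $$;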


john-gom commented Aug 21, 2023

Things to do:

  • Update readme
  • Split code into services
  • Add tests
  • Return -1 on count if not supported
  • Return special code on aggregate if not supported
  • Readme on architecture
  • Tidy up local vs PO profiles
  • Add github workflows to run tests on PR
  • Add branch protection
  • Add github workflows to build image
  • Update PO deploy workflows to set QUERY_URL (not needed for staging, as query joins the po_webnet network)
  • Deploy scripts for staging
  • PR for off-server to use the query service
  • Deploy scripts for production
