Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance performance when exporting data on endpoint api/v1/data/<form_id>.<format> #2460

Merged
merged 35 commits into from
Aug 23, 2023

Conversation

kelvin-muchiri
Copy link
Contributor

@kelvin-muchiri kelvin-muchiri commented Aug 1, 2023

Changes / Features implemented

  • Remove the default ordering by id on endpoint api/v1/data/<form_id>.<format is making queries run extremely slow when exporting large amounts of data. To sort data by id in ascending order, the query parameter sort={"_id":1} will be applied to the endpoint while to sort data by id in descending order the query parameter sort={"_id"-1}. For further reference read Sort submitted data of a specific form using existing fields
  • Remove duplicate database queries when exporting CSV. Remove CSVDataFrameBuilder._query_data method which was querying the database twice despite data being passed to CSVDataFrameBuilder.to_flat_csv_export. Make use of the data passed to ensure lose coupling between the utility class CSVDataFrameBuilder and the code responsible for querying the data.
  • Remove test case onadata.apps.api.tests.viewsets.test_note_viewset.TestNoteViewSet.test_csv_export_form_w_notes. Notes are excluded when exporting the csv. The column _notes is usually added in the CSV but the value in each record is always overriden to be blank as per the implementation in the CSVDataFrameBuilder class. So this test case is futile
  • Refactor code to optimize performance

Steps taken to verify this change does what is intended

  • QA

Side effects of implementing this change

Faster when exporting data on endpoint api/v1/data/<form_id>.csv

Before submitting this PR for review, please make sure you have:

  • Included tests
  • Updated documentation

Closes #

@kelvin-muchiri kelvin-muchiri marked this pull request as draft August 1, 2023 06:45
@kelvin-muchiri kelvin-muchiri changed the title remove default ordering by id when exporting data from Data ViewSet Remove default ordering by id on endpoint api/v1/data/<form_id> Aug 1, 2023
@kelvin-muchiri kelvin-muchiri changed the title Remove default ordering by id on endpoint api/v1/data/<form_id> Remove default ordering by id on endpoint api/v1/data/<form_id> Aug 1, 2023
@kelvin-muchiri kelvin-muchiri force-pushed the data-viewset-export-performance branch from f843300 to 93d9368 Compare August 8, 2023 09:21
@kelvin-muchiri kelvin-muchiri requested review from FrankApiyo, ukanga, KipSigei and ciremusyoka and removed request for FrankApiyo and ukanga August 8, 2023 09:21
@kelvin-muchiri kelvin-muchiri marked this pull request as ready for review August 8, 2023 09:22
@kelvin-muchiri kelvin-muchiri marked this pull request as draft August 8, 2023 11:59
onadata/apps/viewer/models/parsed_instance.py Dismissed Show dismissed Hide dismissed
onadata/apps/viewer/models/parsed_instance.py Dismissed Show dismissed Hide dismissed
@kelvin-muchiri kelvin-muchiri changed the title Remove default ordering by id on endpoint api/v1/data/<form_id> Enhance performance when exporting data on endpoint api/v1/data/<form_id>.<format> Aug 10, 2023
@kelvin-muchiri kelvin-muchiri marked this pull request as ready for review August 10, 2023 14:42
the default ordering by id is making queries for run extremely slow when exporting large amounts of data.
to sort data by id in ascending order, the query parameter sort={"_id":1} will be used. For more info read https://github.com/onaio/onadata/blob/main/docs/data.rst#sort-submitted-data-of-a-specific-form-using-existing-fields
Notes are excluded when exporting the csv. The column _notes is usually added in the CSV but its always overriden to be blank as per the implmentation in the CSVDataFrameBuilder class. So this test case is futile
fix flaky tests by ensuring queryset is always ordered
@@ -815,6 +844,7 @@ def generate_external_export( # noqa C901
filter_query = options.get("query")
meta = options.get("meta")
token = options.get("token")
sort = options.get("sort")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kelvin-muchiri Could we add documentation for this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -653,76 +658,114 @@ def list(self, request, *args, **kwargs):

return Response(serializer.data)

return custom_response_handler(request, xform, query, export_type)
return custom_response_handler(request, xform, query, export_type, sort=sort)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think its necessary to have a new sort param since you can get it from request

@kelvin-muchiri kelvin-muchiri merged commit 3fa23c5 into main Aug 23, 2023
8 checks passed
@kelvin-muchiri kelvin-muchiri deleted the data-viewset-export-performance branch August 23, 2023 08:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants