Enforce datatype constraints on CSV imports #1716
Conversation
onadata/libs/utils/csv_import.py
Outdated
```python
try:
    decimal = float(row.get('key', ''))
except ValueError:
    raise Exception(
```
Why are we raising an exception here, whereas below we are passing on the `ValueError`?
We were passing it before on `date` and `datetime` columns since we assumed that if a `ValueError` was raised, the string was already in the correct format, that being `yyyy-MM-dd` or the respective `datetime` format.
We'll no longer be passing on the `date` and `datetime` checks below either, as they allow strings like `asdasd` to be passed in `datetime` and `date` columns.
Cool, I think we should abstract a function for all of these conditionals then, so we're forced to maintain the same behavior
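A hypothetical sketch of such a shared helper (the name `validate_cell`, the supported datatypes, and the error-message format are all assumptions, not the PR's actual code):

```python
from datetime import datetime

def validate_cell(value, datatype, column):
    """Validate one CSV cell against its datatype.

    Returns an error string, or None when the value is valid.
    Hypothetical helper, not the PR's actual implementation.
    """
    if value in ('', None):
        return None  # emptiness is the 'required' check's concern
    try:
        if datatype == 'integer':
            int(value)
        elif datatype == 'decimal':
            float(value)
        elif datatype == 'date':
            datetime.strptime(value, '%Y-%m-%d')
        elif datatype == 'datetime':
            # accepts ISO-8601-style values such as 2019-01-01T12:00:00
            datetime.strptime(value[:19], '%Y-%m-%dT%H:%M:%S')
    except ValueError:
        return 'Unknown {} format(s): {} (column: {})'.format(
            datatype, value, column)
    return None
```

Every datatype branch then shares one behavior, returning an error string, instead of some branches raising and others passing.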
The way we're raising exceptions shows a more general problem with the import procedure. If I'm a user importing a CSV, and my CSV has multiple errors in multiple columns and rows, every time I fix an error I'd have to upload the CSV again to see the next error, then fix the next error, and so on.
This seems inconvenient for the user and for the system: inconvenient for the user since I'd have to modify and re-upload my CSV multiple times, and inconvenient for the system because it's putting more load on the system, causing more network IO to and from it.
What if we split this into a validation step, which would collect all errors and raise a single exception, and an upload step, which only occurs after the validation step passes?
We would need to optimize for a minimal memory footprint at the same time.
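The suggested split could look roughly like this (the row/check structures here are illustrative assumptions, not the PR's code):

```python
def validate_rows(rows, checks):
    """First pass: run every check on every row and collect ALL errors,
    so a single upload surfaces the full list (illustrative sketch)."""
    errors = []
    for line_no, row in enumerate(rows, start=1):
        for column, check in checks.items():
            problem = check(row.get(column, ''))
            if problem:
                errors.append('row {}, column {}: {}'.format(
                    line_no, column, problem))
    return errors

def import_rows(rows, checks, save):
    """Second pass: only saves anything once validation is clean."""
    errors = validate_rows(rows, checks)
    if errors:
        raise ValueError('; '.join(errors))  # one exception, all errors
    for row in rows:
        save(row)
```

The memory-footprint concern would then be about how `rows` is produced, e.g. re-reading the file for the second pass rather than keeping every row resident.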
I have a number of comments but did not do a thorough review yet; I think it's best for you to address those and do another pass first.
Did you throw a large CSV file at this to see how it performs? @ukanga raised a valid question about performance
onadata/libs/utils/csv_import.py
Outdated
```python
row_uuid = row.get('meta/instanceID') or 'uuid:{}'.format(
    row.get('_uuid')) if row.get('_uuid') else None
```
Would `row.get('_uuid') or None` work here?
also please replace string here with this constant, https://github.com/onaio/onadata/blob/master/onadata/libs/utils/common_tags.py
I don't think `row.get('_uuid') or None` would work here, as we are trying to format the value of `row.get('_uuid')` if present; if it's not, we should set the entire `row_uuid` variable to `None`.
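As a side note, the unparenthesized expression quoted above actually parses as `(a or b) if cond else None`, because Python's conditional expression binds more loosely than `or`. A small sketch of the presumed intent, parenthesized explicitly (the helper name is hypothetical):

```python
def make_row_uuid(row):
    # Prefer an explicit meta/instanceID; otherwise prefix the bare
    # _uuid with 'uuid:'; otherwise fall back to None.
    return row.get('meta/instanceID') or (
        'uuid:{}'.format(row.get('_uuid')) if row.get('_uuid') else None)
```

A plain `row.get('_uuid') or None` would return the bare value without the `'uuid:'` prefix, which is why it can't replace the conditional.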
also please replace string here with this constant, https://github.com/onaio/onadata/blob/master/onadata/libs/utils/common_tags.py
Replaced in the latest commits
onadata/libs/utils/csv_import.py
Outdated
```python
if first_sheet.cell_type(1, index) == xlrd.XL_CELL_DATE:
    row = 1

# In some cases where the field is not required the first row may have
```
can you rewrite this comment in response here? I don't understand it
With the above-mentioned xls file, the `Date` column is malformed / not converted into ISO-formatted dates, because we do not convert the Excel date (the date currently shown in the form above).
We currently don't convert them because we collect columns that contain `date` values only if the first row has such a value. In the example above, the first row had an empty cell under the `Date` column, so we didn't convert the whole column.
ok cool, that makes sense, can you
- change "In some cases where" to "If" on line 339
- add a comma at the end of line 341
- change line 342 to "therefore we find the first non-empty row."
- remove line 343
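The fix under discussion, scanning for the first non-empty row before sniffing a column's cell type, might look roughly like this (the list-of-cell-values interface is a simplified stand-in for an xlrd sheet column):

```python
def first_nonempty_row(column_values, start=1):
    """Return the index of the first non-empty cell in a column,
    skipping the header at index 0, or None if the column is empty.
    Simplified stand-in for scanning an xlrd sheet."""
    for index in range(start, len(column_values)):
        if column_values[index] not in ('', None):
            return index
    return None
```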
also, another question, XLS Dates count as floats?
also, another question, XLS Dates count as floats?
Yes, their stored format is float.
ok cool, that makes sense, can you

- change "In some cases where" to "If" on line 339
- add a comma at the end of line 341
- change line 342 to "therefore we find the first non-empty row."
- remove line 343
Changed in the latest commits.
onadata/libs/utils/csv_import.py
Outdated
```python
except UnicodeDecodeError:
    return async_status(
        FAILED, 'CSV file must be utf-8 encoded')
except Exception as e:
```
We should catch something more specific here, if we need to catch anything else at all.
onadata/libs/utils/csv_import.py
Outdated
```python
if isinstance(csv_file, str):
    csv_file = BytesIO(csv_file)
elif csv_file is None or not hasattr(csv_file, 'read'):
    raise Exception(
```
I'm not convinced this should raise an exception versus returning an error. Raising incurs more overhead; instead, let's return an error.
If we do raise, and I'm suggesting we do not, it should certainly not be a generic `Exception`: https://stackoverflow.com/questions/2052390/manually-raising-throwing-an-exception-in-python/24065533#24065533
This comment applies throughout this function.
@ukanga do you have thoughts on exception versus returning an error, in situations where we can properly handle a returned error, which this appears to be?
This is currently changed in the PR. We now return errors and handle them within the `submit_csv` function.
@pld Still haven't tested how performant this is. I'll deploy this onto a staging server and see how it fares. I'll also share the findings on here.
Good progress; performance tests and removing the `Exception` raises are still left to work on.
Can you run a performance test directly against the code? You can do this with a unit test and a large CSV file.
Yes, I'll add the unit test and change how I'm currently handling
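A minimal shape for that kind of test, generating a large CSV in memory and timing a validation pass over it (the row count, columns, and `validate_row` callback are arbitrary placeholders):

```python
import csv
import io
import time

def generate_csv(n_rows):
    """Build an in-memory CSV with a header and n_rows data rows."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(['name', 'age', 'date'])
    for i in range(n_rows):
        writer.writerow(['person{}'.format(i), str(i % 90), '2019-01-01'])
    buf.seek(0)
    return buf

def time_validation(n_rows, validate_row):
    """Return (elapsed_seconds, errors) for one validation pass."""
    start = time.time()
    reader = csv.DictReader(generate_csv(n_rows))
    errors = [e for row in reader for e in validate_row(row)]
    return time.time() - start, errors
```

A unit test could then assert that, say, 100k rows validate within an agreed time budget.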
Awesome. A couple of quite small comments, but after that it will be good to go in my opinion. Going to share; it would be good to get another set of eyes on this review too.
Looks good. My worry is that the second pass through the data is entirely in memory. I believe this should not be the case; we easily get CSV files that could utilize all available memory.
onadata/libs/utils/csv_import.py
Outdated
```python
if overwrite:
    xform.instances.filter(deleted_at__isnull=True)\
        .update(deleted_at=timezone.now(),
                deleted_by=User.objects.get(username=username))

validated_rows = validated_data.get('data')
```
Are we holding all these rows in memory? I see the likelihood of a memory-heavy implementation; perhaps reading from a buffer, or reading through the file again once we know all the records are valid, could ensure we have a small memory footprint.
There are a few things we do to the data during the validation process, like making sure the `date` and `dateTime` datatypes are ISO-formatted; I'm not quite sure how we can do that without in some way storing the data in memory.
I believe @ukanga's point is to limit the amount of data in memory at any one point in time, not to keep it all out of memory
Changed this to now create the instance immediately after validation.
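That incremental approach, validate a row, create its instance immediately, and roll back on failure, could be sketched as follows (all names are hypothetical):

```python
def import_csv_streaming(reader, validate_row, create_instance, rollback):
    """Process rows one at a time so only the current row is in memory.
    On the first invalid row, roll back whatever was created (sketch)."""
    created = []
    for line_no, row in enumerate(reader, start=1):
        errors = validate_row(row)
        if errors:
            rollback(created)
            return {'error': 'row {}: {}'.format(line_no, '; '.join(errors))}
        created.append(create_instance(row))
    return {'imported': len(created)}
```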
I think this looks good and is an improvement on the current state, but there's still more work to be done towards purer functions. We can do that incrementally. lgtm
- Add tests for CSV Datatype constraint enforcement
- Update and add test fixtures
- Fix failing tests
- Update tests
- Fix issue where we wouldn't convert xls dates if the first row was null
- Utilize common tag NA_REP instead of magic string 'n/a'
- Create instances immediately after validating instance data is valid
- Rollback instances in case of an error
- Modify tests
- Minor code cleanup and commenting
Changes implemented by this PR:

- Enforce the `integer` datatype constraint
- Enforce the `decimal` datatype constraint
- Enforce the `datetime` datatype constraint
- Split the `submit_csv` process into two steps, `validation` then `upload` (Validation and Import), with a new function `validate_csv`

Fix #1669