Enforce datatype constraints on CSV imports #1716

Merged
12 commits merged on Jan 9, 2020

Conversation

Contributor

@DavisRayM DavisRayM commented Nov 18, 2019

  • Enforce integer datatype constraint
  • Enforce decimal datatype constraint
  • Enforce datetime datatype constraint
  • Process CSV imports in two steps: validation, then upload

Changes implemented by this PR

  • Split the CSV import / submit_csv process into two steps, validation and import, with a new function validate_csv
  • We now validate that the imported CSV data doesn't violate a form's date, datetime, integer & decimal datatype constraints
  • We now validate the entire CSV and return all errors found in the data to the user
  • Fix an issue where we wouldn't convert Excel dates (floats) if the first row of data had the field as null

Fix #1669

@DavisRayM DavisRayM force-pushed the 1669-enforce-csv-datatype-constraint branch from 04ddf58 to e83ff00 Compare November 18, 2019 11:06
onadata/libs/utils/csv_import.py
try:
    decimal = float(row.get('key', ''))
except ValueError:
    raise Exception(
Member

why are we raising an exception here, whereas below we are passing on the ValueError?

Contributor Author

@DavisRayM DavisRayM Nov 18, 2019

We previously passed on the ValueError for date and datetime columns because we assumed that, if a ValueError was raised, the string was already in the correct format, that is, yyyy-MM-dd or the respective datetime format.

Contributor Author

We'll no longer pass on the ValueError in the date and datetime checks below either, since doing so allows strings like asdasd into date and datetime columns.

Member

Cool, I think we should abstract a function for all of these conditionals then, so we're forced to maintain the same behavior
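
One way to read this suggestion is a single helper that every typed column goes through, so that integer, decimal, date and datetime values all fail in the same way. The sketch below is a minimal illustration and not the code merged in this PR: `validate_value`, `PARSERS`, the error message and the accepted date formats are all assumptions.

```python
from datetime import datetime

# Hypothetical parsers keyed by question type; each raises ValueError on bad input.
PARSERS = {
    'integer': int,
    'decimal': float,
    'date': lambda raw: datetime.strptime(raw, '%Y-%m-%d').date().isoformat(),
    'dateTime': lambda raw: datetime.fromisoformat(raw).isoformat(),
}


def validate_value(question_type, raw):
    """Return (parsed_value, error_message); error_message is None on success."""
    parser = PARSERS.get(question_type)
    if parser is None:
        # Columns without a typed constraint pass through unchanged.
        return raw, None
    try:
        return parser(raw), None
    except (TypeError, ValueError):
        # Every typed column reports failure the same way.
        return None, 'Unknown {} format: {!r}'.format(question_type, raw)
```

Because all four types share one `except` clause, a string such as `asdasd` is rejected for date columns in exactly the same way as for integer columns.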

Member

@pld pld left a comment

The way we're raising exceptions shows a more general problem with the import procedure. If I'm a user importing a CSV, and my CSV has multiple errors in multiple columns and rows, every time I fix an error, I'd have to upload the CSV again to see the next error, then fix the next error, etc.

This seems inconvenient for the user and for the system. Inconvenient for the user since I'd have to modify and re-upload my CSV multiple times, and inconvenient for the system because it's putting more load on the system, causing more network IO to and from the system.

What if we split this into a validation step, which would collect all errors and raise a single exception, and an upload step, which only runs after the validation step passes?
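
The proposed split could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's implementation: `validate_rows`, `import_csv`, the `integer_columns` parameter and the error wording are hypothetical, and only integer columns are checked to keep it short.

```python
import csv
import io


def validate_rows(csv_text, integer_columns):
    """Validation step: scan every row and collect all errors."""
    errors = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # start=2 because line 1 of the file is the header row.
    for line_no, row in enumerate(reader, start=2):
        for column in integer_columns:
            value = row.get(column, '')
            try:
                int(value)
            except (TypeError, ValueError):
                errors.append(
                    'line {}: column {!r} has non-integer value {!r}'.format(
                        line_no, column, value))
    return errors


def import_csv(csv_text, integer_columns):
    """Upload step: runs only when validation found no errors."""
    errors = validate_rows(csv_text, integer_columns)
    if errors:
        return {'error': errors}
    # Stand-in for creating submissions: just count the rows.
    imported = sum(1 for _ in csv.DictReader(io.StringIO(csv_text)))
    return {'imported': imported}
```

With this shape, a CSV with two bad rows produces two error messages from a single upload, so the user can fix both before trying again.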

Member

ukanga commented Nov 18, 2019

We would need to optimize for minimal memory footprint at the same time.

@DavisRayM DavisRayM force-pushed the 1669-enforce-csv-datatype-constraint branch from 7703687 to 519a45e Compare November 22, 2019 09:18
@DavisRayM DavisRayM force-pushed the 1669-enforce-csv-datatype-constraint branch 12 times, most recently from 80b6c7e to 1917a0c Compare December 5, 2019 10:22
@DavisRayM DavisRayM changed the title [WIP] Enforce datatype constraints on CSV import Enforce datatype constraints on CSV import Dec 5, 2019
@DavisRayM DavisRayM changed the title Enforce datatype constraints on CSV import Enforce datatype constraints on CSV imports Dec 5, 2019
Member

@pld pld left a comment

I have a number of comments, but have not done a thorough review yet. I think it's best for you to address those and do another pass first.

Did you throw a large CSV file at this to see how it performs? @ukanga raised a valid question about performance.

row_uuid = row.get('meta/instanceID') or 'uuid:{}'.format(
    row.get('_uuid')) if row.get('_uuid') else None
Member

would row.get('_uuid') or None work here?

Contributor Author

I don't think row.get('_uuid') or None would work here, as we want to format the value of row.get('_uuid') if it is present; if it is not, row_uuid as a whole should be set to None.
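
For readers following this exchange, it may help to see how Python groups the expression under discussion. A conditional expression binds more loosely than `or`, so the snippet is parsed as `(A or B) if C else None`. The sketch below only demonstrates that grouping; the function names are hypothetical, and the thread does not say which of the two behaviours the import relies on.

```python
def row_uuid_as_written(row):
    # Parsed by Python as: (A or B) if C else None
    return row.get('meta/instanceID') or 'uuid:{}'.format(
        row.get('_uuid')) if row.get('_uuid') else None


def row_uuid_with_fallback(row):
    # Parentheses restrict the condition to the formatted fallback only.
    return row.get('meta/instanceID') or (
        'uuid:{}'.format(row.get('_uuid')) if row.get('_uuid') else None)


# A row that has an instance ID but no '_uuid' column.
only_instance_id = {'meta/instanceID': 'uuid:abc'}
```

The two versions agree whenever `_uuid` is present. They differ for a row like `only_instance_id`: the expression as written yields `None`, while the parenthesised version keeps the `meta/instanceID` value.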

Contributor Author

> also please replace string here with this constant, https://github.com/onaio/onadata/blob/master/onadata/libs/utils/common_tags.py

Replaced in the latest commits

if first_sheet.cell_type(1, index) == xlrd.XL_CELL_DATE:
row = 1

# In some cases where the field is not required the first row may have
Member

can you rewrite this comment in response here? I don't understand it

Contributor Author

While testing out the datatype constraint enforcement, I came across a bug while importing xls files. If you take this xls file here and import it into a form matching its columns, we get this.

Contributor Author

@DavisRayM DavisRayM Dec 6, 2019

With the above-mentioned xls file, the values in the Date column are malformed, i.e. not converted into ISO-formatted dates, because we do not convert the Excel date (the value currently shown in the form above).

We don't convert them currently because we only collect a column as a date column if its first row contains a date value. In the example above, the first row had an empty cell under the Date column, so we didn't convert the column at all.
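
The fix described here amounts to looking at the first non-empty cell of each column rather than only the first data row. A minimal sketch of that idea follows, using plain lists instead of a real worksheet; `EMPTY`, `TEXT`, `DATE` and `date_column_indexes` are hypothetical stand-ins and not xlrd's API.

```python
# Hypothetical stand-ins for spreadsheet cell types (not xlrd's constants).
EMPTY, TEXT, DATE = 'empty', 'text', 'date'


def date_column_indexes(cell_types):
    """cell_types[r][c] is the type of the cell at data row r, column c.

    A column counts as a date column when its first non-empty cell is a
    date, so a blank cell in the first data row no longer hides the column.
    """
    if not cell_types:
        return []
    date_columns = []
    for col in range(len(cell_types[0])):
        for row in cell_types:
            if row[col] != EMPTY:
                if row[col] == DATE:
                    date_columns.append(col)
                # Only the first non-empty cell decides.
                break
    return date_columns
```

For a grid whose second column is empty in the first row and a date in the second row, this returns `[1]`, whereas a check limited to the first row would miss the column.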

Member

ok cool, that makes sense, can you

  • change "In some cases where" to "If" on line 339
  • add a comma at the end of line 341
  • change line 342 to "therefore we find the first non-empty row."
  • remove line 343

Member

also, another question, XLS Dates count as floats?

Contributor Author

@DavisRayM DavisRayM Dec 10, 2019

> also, another question, XLS Dates count as floats?

Yes, their stored format is float.
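
To make the float representation concrete, here is a small standard-library sketch of the conversion. It assumes the workbook uses Excel's 1900 date system (workbooks can also use a 1904 system), and `excel_serial_to_iso` is a hypothetical name. Code that reads real workbooks would more likely rely on xlrd's own date helpers, such as `xlrd.xldate_as_tuple`, which take the workbook's date mode into account.

```python
from datetime import datetime, timedelta

# Day zero of the 1900 date system. Using 1899-12-30 absorbs Excel's
# fictitious 1900-02-29, so this is only correct from 1900-03-01 onwards.
EXCEL_EPOCH = datetime(1899, 12, 30)


def excel_serial_to_iso(serial):
    """Convert an Excel serial date (days since the epoch) to ISO 8601."""
    moment = EXCEL_EPOCH + timedelta(days=serial)
    if float(serial).is_integer():
        # Whole numbers carry no time-of-day component.
        return moment.date().isoformat()
    # Truncate sub-second noise from the fractional day.
    return moment.replace(microsecond=0).isoformat()
```

For example, the serial value `43787.0` corresponds to `2019-11-18`, and `43787.5` to noon on the same day.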

Contributor Author

> ok cool, that makes sense, can you
>
> * change "In some cases where" to "If" on line 339
> * add a comma at the end of line 341
> * change line 342 to "therefore we find the first non-empty row."
> * remove line 343

Changed in the latest commits.

except UnicodeDecodeError:
return async_status(
FAILED, 'CSV file must be utf-8 encoded')
except Exception as e:
Member

We should catch something more specific here, if we catch anything else at all.

if isinstance(csv_file, str):
csv_file = BytesIO(csv_file)
elif csv_file is None or not hasattr(csv_file, 'read'):
raise Exception(
Member

I'm not convinced this should raise an exception versus returning an error. Raising incurs more overhead; let's return an error instead.

If we do raise, and I'm suggesting we do not, it should certainly not be a generic Exception; see https://stackoverflow.com/questions/2052390/manually-raising-throwing-an-exception-in-python/24065533#24065533

This comment applies throughout this function.

@ukanga do you have thoughts on exception versus returning an error, in situations where we can properly handle a returned error, which this appears to be?

Contributor Author

This is currently changed in the PR. We now return errors and handle them within the submit_csv function.
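
The pattern described in this reply, returning an error and letting the caller handle it, might look like the sketch below. It is not the merged code: `open_csv`, `submit`, the error message and the returned dictionary keys are assumptions made for illustration.

```python
import io


def open_csv(csv_file):
    """Return (file_object, error); error is None when the input is usable."""
    if isinstance(csv_file, str):
        # BytesIO needs bytes on Python 3, so encode text first.
        return io.BytesIO(csv_file.encode('utf-8')), None
    if isinstance(csv_file, bytes):
        return io.BytesIO(csv_file), None
    if csv_file is None or not hasattr(csv_file, 'read'):
        # Hypothetical message; returned to the caller instead of raised.
        return None, 'Invalid param type for csv_file. Expected a file or string.'
    return csv_file, None


def submit(csv_file):
    """The caller checks the returned error; no generic Exception is involved."""
    handle, error = open_csv(csv_file)
    if error:
        return {'error': error}
    # Stand-in for the real import: report how much data was read.
    return {'bytes_read': len(handle.read())}
```

Returning the error keeps the failure path explicit at the call site, which is the behaviour the reviewer asked for in cases where the caller can handle the problem directly.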

@DavisRayM DavisRayM force-pushed the 1669-enforce-csv-datatype-constraint branch 2 times, most recently from 407793a to 076aec3 Compare December 6, 2019 11:20
@DavisRayM
Contributor Author

@pld Still haven't tested how performant this is. I'll deploy this onto a staging server and see how it fares. I'll also share the findings here.

Member

@pld pld left a comment

Good progress. Performance tests and removing the Exception raises are still left to work on.

@pld
Member

pld commented Dec 9, 2019

> @pld Still haven't tested how performant this is. I'll deploy this onto a staging server and see how it fares. I'll also share the findings here.

Can you run a performance test directly against the code? You can do this with a unit test and a large CSV file

@DavisRayM DavisRayM force-pushed the 1669-enforce-csv-datatype-constraint branch from 076aec3 to 9c5b072 Compare December 10, 2019 14:46
@DavisRayM
Contributor Author

> > @pld Still haven't tested how performant this is. I'll deploy this onto a staging server and see how it fares. I'll also share the findings here.
>
> Can you run a performance test directly against the code? You can do this with a unit test and a large CSV file

Yes, I'll add the unit test and change how I'm currently handling Exceptions.
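
A unit test of the kind requested could be structured as below. This is a sketch under assumptions: `count_invalid_ages` is a hypothetical stand-in for the validation under test, and the row count and time budget are arbitrary values chosen for illustration rather than figures from this PR.

```python
import csv
import io
import time
import unittest


def build_large_csv(row_count):
    """Generate an in-memory CSV with an integer column and a date column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(['age', 'visit_date'])
    for i in range(row_count):
        writer.writerow([i % 90, '2019-12-{:02d}'.format(i % 28 + 1)])
    buffer.seek(0)
    return buffer


def count_invalid_ages(csv_file):
    """Hypothetical stand-in for the validation being measured."""
    invalid = 0
    for row in csv.DictReader(csv_file):
        try:
            int(row['age'])
        except ValueError:
            invalid += 1
    return invalid


class LargeCSVPerformanceTest(unittest.TestCase):
    ROW_COUNT = 100000   # assumed file size
    TIME_BUDGET = 30.0   # assumed budget, in seconds

    def test_validation_stays_within_budget(self):
        csv_file = build_large_csv(self.ROW_COUNT)
        started = time.perf_counter()
        invalid = count_invalid_ages(csv_file)
        elapsed = time.perf_counter() - started
        self.assertEqual(invalid, 0)
        self.assertLess(elapsed, self.TIME_BUDGET)
```

Generating the file inside the test avoids committing a large fixture, at the cost of a little setup time that is deliberately excluded from the measured interval.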

@DavisRayM DavisRayM force-pushed the 1669-enforce-csv-datatype-constraint branch 2 times, most recently from 0d7bd60 to 04d1289 Compare December 11, 2019 07:59
@DavisRayM DavisRayM force-pushed the 1669-enforce-csv-datatype-constraint branch 5 times, most recently from 0d4f2fc to de5dd6f Compare December 18, 2019 12:43
Member

@pld pld left a comment

Awesome. A couple of quite small comments, but after that it will be good to go in my opinion. I'm going to share this; it would be good to get another set of eyes on this review too.

Member

@ukanga ukanga left a comment

Looks good. My worry is that the second pass through the data is entirely in memory. I believe this should not be the case; we easily get CSV files that could use all available memory.


if overwrite:
    xform.instances.filter(deleted_at__isnull=True)\
        .update(deleted_at=timezone.now(),
                deleted_by=User.objects.get(username=username))

validated_rows = validated_data.get('data')
Member

Are we holding all these rows in memory? I see the likelihood of a memory-heavy implementation. Perhaps reading from a buffer, or reading through the file again once we know all the records are valid, could ensure we have a small memory footprint.

Contributor Author

There are a few things we do to the data during the validation process, like making sure the date and dateTime values are ISO-formatted. I'm not quite sure how we can do that without storing the data in memory in some way.

Member

I believe @ukanga's point is to limit the amount of data in memory at any one point in time, not to keep it all out of memory

Contributor Author

Changed this to now create the instance immediately after validation.
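
Creating each instance right after its row validates keeps only one row in memory at a time, and the commit list for this PR also mentions rolling back instances when an error occurs. The sketch below illustrates that combination under assumptions: `import_streaming` is a hypothetical name, a plain list stands in for the database, and the column names are invented.

```python
import csv
import io


def import_streaming(csv_file, store):
    """Validate and create one row at a time; undo this import on a bad row.

    `store` is a plain list standing in for the database.
    """
    created = 0
    for line_no, row in enumerate(csv.DictReader(csv_file), start=2):
        try:
            record = {'name': row['name'], 'age': int(row['age'])}
        except (KeyError, TypeError, ValueError):
            # Roll back only the records created by this import.
            del store[len(store) - created:]
            return {'error': 'line {}: invalid row'.format(line_no)}
        store.append(record)
        created += 1
    return {'imported': created}
```

The trade-off in this sketch is that it stops at the first invalid row instead of reporting every error, in exchange for never holding the whole file in memory.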

pld previously approved these changes Dec 19, 2019
@DavisRayM DavisRayM force-pushed the 1669-enforce-csv-datatype-constraint branch 2 times, most recently from 43b25b5 to 9224d06 Compare January 6, 2020 06:12
pld previously approved these changes Jan 7, 2020
Member

@pld pld left a comment

I think this looks good and is an improvement on the current state, but there's still more work to be done towards purer functions. We can do that incrementally. LGTM.

- Add tests for CSV Datatype constraint enforcement
- Update and add test fixtures
- Fix failing tests
- Update tests
- Fix issue where we wouldn't convert xls dates if the first row
  was null
- Utilize common tag NA_REP instead of magic string 'n/a'
- Create instances immediately after validating instance data is valid
- Roll back instances in case of an error
- Modify tests
- Minor code cleanup and commenting
@ukanga ukanga merged commit fdd91d7 into master Jan 9, 2020
@ukanga ukanga deleted the 1669-enforce-csv-datatype-constraint branch January 9, 2020 13:04
Development

Successfully merging this pull request may close these issues.

CSV upload to ONA forms not enforcing datatype constraints
3 participants