Enforce datatype constraints on CSV imports #1716
Conversation
onadata/libs/utils/csv_import.py
Outdated
```python
try:
    decimal = float(row.get('key', ''))
except ValueError:
    raise Exception(
```
Why are we raising an exception here, whereas below we are passing on the `ValueError`?
We were passing it before on `date` and `datetime` columns since we assumed that if a `ValueError` was raised, the string was already in the correct format, that being `yyyy-MM-dd` or the respective `datetime` format.
We'll no longer be passing on the `date` and `datetime` checks below either, as they allow strings like `asdasd` to be passed in `datetime` and `date` columns.
Cool, I think we should abstract a function for all of these conditionals then, so we're forced to maintain the same behavior
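A hypothetical sketch of such a shared helper (the name `validate_cell`, the supported datatypes, and the error-message format are all assumptions, not the PR's actual code):

```python
from datetime import datetime

def validate_cell(value, datatype, column):
    """Validate one CSV cell against its datatype.

    Returns an error string, or None when the value is valid.
    Hypothetical helper, not the PR's actual implementation.
    """
    if value in ('', None):
        return None  # emptiness is the 'required' check's concern
    try:
        if datatype == 'integer':
            int(value)
        elif datatype == 'decimal':
            float(value)
        elif datatype == 'date':
            datetime.strptime(value, '%Y-%m-%d')
        elif datatype == 'datetime':
            # accepts ISO-8601-style values such as 2019-01-01T12:00:00
            datetime.strptime(value[:19], '%Y-%m-%dT%H:%M:%S')
    except ValueError:
        return 'Unknown {} format(s): {} (column: {})'.format(
            datatype, value, column)
    return None
```

Every datatype branch then shares one behavior, returning an error string, instead of some branches raising and others passing.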
The way we're raising exceptions shows a more general problem with the import procedure. If I'm a user importing a CSV, and my CSV has multiple errors in multiple columns and rows, every time I fix an error I'd have to upload the CSV again to see the next error, then fix the next error, and so on.
This seems inconvenient for the user and for the system: inconvenient for the user since I'd have to modify and re-upload my CSV multiple times, and inconvenient for the system because it's putting more load on the system, causing more network IO to and from it.
What if we split this into a validation step, which would collect all errors and raise a single exception, and an upload step, which only occurs after the validation step passes?
We would need to optimize for a minimal memory footprint at the same time.
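The suggested split could look roughly like this (the row/check structures here are illustrative assumptions, not the PR's code):

```python
def validate_rows(rows, checks):
    """First pass: run every check on every row and collect ALL errors,
    so a single upload surfaces the full list (illustrative sketch)."""
    errors = []
    for line_no, row in enumerate(rows, start=1):
        for column, check in checks.items():
            problem = check(row.get(column, ''))
            if problem:
                errors.append('row {}, column {}: {}'.format(
                    line_no, column, problem))
    return errors

def import_rows(rows, checks, save):
    """Second pass: only saves anything once validation is clean."""
    errors = validate_rows(rows, checks)
    if errors:
        raise ValueError('; '.join(errors))  # one exception, all errors
    for row in rows:
        save(row)
```

The memory-footprint concern would then be about how `rows` is produced, e.g. re-reading the file for the second pass rather than keeping every row resident.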
I have a number of comments but did not do a thorough review yet; I think it's best for you to address those and do another pass first.
Did you throw a large CSV file at this to see how it performs? @ukanga raised a valid question about performance
onadata/libs/utils/csv_import.py
Outdated
```python
row_uuid = row.get('meta/instanceID') or 'uuid:{}'.format(
    row.get('_uuid')) if row.get('_uuid') else None
```
Would `row.get('_uuid') or None` work here?
also please replace string here with this constant, https://github.com/onaio/onadata/blob/master/onadata/libs/utils/common_tags.py
I don't think `row.get('_uuid') or None` would work here, as we are trying to format the value of `row.get('_uuid')` if present; if it's not, we should set the entire `row_uuid` variable to `None`.
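As a side note, the unparenthesized expression quoted above actually parses as `(a or b) if cond else None`, because Python's conditional expression binds more loosely than `or`. A small sketch of the presumed intent, parenthesized explicitly (the helper name is hypothetical):

```python
def make_row_uuid(row):
    # Prefer an explicit meta/instanceID; otherwise prefix the bare
    # _uuid with 'uuid:'; otherwise fall back to None.
    return row.get('meta/instanceID') or (
        'uuid:{}'.format(row.get('_uuid')) if row.get('_uuid') else None)
```

A plain `row.get('_uuid') or None` would return the bare value without the `'uuid:'` prefix, which is why it can't replace the conditional.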
also please replace string here with this constant, https://github.com/onaio/onadata/blob/master/onadata/libs/utils/common_tags.py
Replaced in the latest commits
onadata/libs/utils/csv_import.py
Outdated
```python
if first_sheet.cell_type(1, index) == xlrd.XL_CELL_DATE:
    row = 1

# In some cases where the field is not required the first row may have
```
can you rewrite this comment in response here? I don't understand it
With the above-mentioned xls file, the `Date` column is malformed / not converted into ISO-formatted dates, because we do not convert the Excel date (the date currently shown in the form above).
We currently don't convert them because we collect columns that contain `date` values only if the first row has such a value. In the example above, the first row had an empty cell under the `Date` column, so we didn't convert the whole column.
ok cool, that makes sense, can you
- change "In some cases where" to "If" on line 339
- add a comma at the end of line 341
- change line 342 to "therefore we find the first non-empty row."
- remove line 343
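The fix under discussion, scanning for the first non-empty row before sniffing a column's cell type, might look roughly like this (the list-of-cell-values interface is a simplified stand-in for an xlrd sheet column):

```python
def first_nonempty_row(column_values, start=1):
    """Return the index of the first non-empty cell in a column,
    skipping the header at index 0, or None if the column is empty.
    Simplified stand-in for scanning an xlrd sheet."""
    for index in range(start, len(column_values)):
        if column_values[index] not in ('', None):
            return index
    return None
```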
also, another question, XLS Dates count as floats?
also, another question, XLS Dates count as floats?
Yes, their stored format is float.
ok cool, that makes sense, can you

- change "In some cases where" to "If" on line 339
- add a comma at the end of line 341
- change line 342 to "therefore we find the first non-empty row."
- remove line 343
Changed in the latest commits.
onadata/libs/utils/csv_import.py
Outdated
```python
except UnicodeDecodeError:
    return async_status(
        FAILED, 'CSV file must be utf-8 encoded')
except Exception as e:
```
We should catch something more specific here, if we need to catch anything else at all.
onadata/libs/utils/csv_import.py
Outdated
```python
if isinstance(csv_file, str):
    csv_file = BytesIO(csv_file)
elif csv_file is None or not hasattr(csv_file, 'read'):
    raise Exception(
```
I'm not convinced this should raise an exception versus returning an error. Raising incurs more overhead; instead, let's return an error.
If we do raise, and I'm suggesting we do not, it should certainly not be a generic `Exception`: https://stackoverflow.com/questions/2052390/manually-raising-throwing-an-exception-in-python/24065533#24065533
This comment applies throughout this function.
@ukanga do you have thoughts on exception versus returning an error, in situations where we can properly handle a returned error, which this appears to be?
This is currently changed in the PR. We now return errors and handle them within the `submit_csv` function.
@pld Still haven't tested how performant this is. I'll deploy this onto a staging server and see how it fares. I'll also share the findings on here.
Good progress; performance tests and removing the `Exception` raises are still left to work on.
Can you run a performance test directly against the code? You can do this with a unit test and a large CSV file.
Yes, I'll add the unit test and change how I'm currently handling
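A minimal shape for that kind of test, generating a large CSV in memory and timing a validation pass over it (the row count, columns, and `validate_row` callback are arbitrary placeholders):

```python
import csv
import io
import time

def generate_csv(n_rows):
    """Build an in-memory CSV with a header and n_rows data rows."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(['name', 'age', 'date'])
    for i in range(n_rows):
        writer.writerow(['person{}'.format(i), str(i % 90), '2019-01-01'])
    buf.seek(0)
    return buf

def time_validation(n_rows, validate_row):
    """Return (elapsed_seconds, errors) for one validation pass."""
    start = time.time()
    reader = csv.DictReader(generate_csv(n_rows))
    errors = [e for row in reader for e in validate_row(row)]
    return time.time() - start, errors
```

A unit test could then assert that, say, 100k rows validate within an agreed time budget.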
Awesome. A couple of quite small comments, but after that it will be good to go in my opinion. Going to share; it would be good to get another set of eyes on this review too.
Looks good. My worry is that the second pass through the data is entirely in memory. I believe this should not be the case; we easily get CSV files that could utilize all available memory.
onadata/libs/utils/csv_import.py
Outdated
```python
if overwrite:
    xform.instances.filter(deleted_at__isnull=True)\
        .update(deleted_at=timezone.now(),
                deleted_by=User.objects.get(username=username))

validated_rows = validated_data.get('data')
```
Are we holding all these rows in memory? I see the likelihood of a memory-heavy implementation; perhaps reading from a buffer, or reading through the file again once we know all the records are valid, could ensure we have a small memory footprint.
There are a few things we do to the data during the validation process, like making sure the `date` and `dateTime` datatypes are ISO-formatted; I'm not quite sure how we can do that without in some way storing the data in memory.
I believe @ukanga's point is to limit the amount of data in memory at any one point in time, not to keep it all out of memory
Changed this to now create the instance immediately after validation.
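That incremental approach, validate a row, create its instance immediately, and roll back on failure, could be sketched as follows (all names are hypothetical):

```python
def import_csv_streaming(reader, validate_row, create_instance, rollback):
    """Process rows one at a time so only the current row is in memory.
    On the first invalid row, roll back whatever was created (sketch)."""
    created = []
    for line_no, row in enumerate(reader, start=1):
        errors = validate_row(row)
        if errors:
            rollback(created)
            return {'error': 'row {}: {}'.format(line_no, '; '.join(errors))}
        created.append(create_instance(row))
    return {'imported': len(created)}
```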
I think this looks good and is an improvement on the current state, but there's still more work to be done towards purer functions. We can do that incrementally. lgtm
- Add tests for CSV Datatype constraint enforcement
- Update and add test fixtures
- Fix failing tests
- Update tests
- Fix issue where we wouldn't convert xls dates if the first row was null
- Utilize common tag NA_REP instead of magic string 'n/a'
- Create instances immediately after validating instance data is valid
- Rollback instances in case of an error
- Modify tests
- Minor code cleanup and commenting
Changes implemented by this PR:

- Enforce the `integer` datatype constraint
- Enforce the `decimal` datatype constraint
- Enforce the `datetime` datatype constraint
- Split the `submit_csv` process into two steps, `validation` then `upload` (Validation and Import), with a new function `validate_csv`

Fix #1669