avoid processing vast empty areas of buggy / strange workbooks #612
Conversation
- Somehow it happens that spreadsheets sometimes think they have data for hundreds or thousands of rows/columns even though all cells are empty.
- xls / xlsx processing refactored to stop processing after 20 adjacent empty columns (when getting the headers) or rows (when reading rows); see the sketch below.
- Add tests for xls / xlsx to show bad input is processed as expected.
- The change to not output empty rows seems to have broken the "flatxlsformtest" case, which needs further investigation - it seems to just be a question numbering issue and a different itext lang order.
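A minimal sketch of the row-reading side of that change, assuming the sheet rows arrive as header-to-value dicts (the helper and variable names below are illustrative, not the exact pyxform code):

MAX_ADJACENT_EMPTY = 20  # arbitrary cut-off, matching the 20 used in this PR

def read_rows(sheet_rows):
    """Collect row dicts, stopping after a long run of adjacent empty rows."""
    result_rows = []
    adjacent_empty_rows = 0
    for row in sheet_rows:
        # Drop blank cells so an all-blank row becomes an empty dict.
        row_dict = {k: v for k, v in row.items() if v not in (None, "")}
        # Keep empty rows for now so warnings can reference accurate row numbers.
        result_rows.append(row_dict)
        if not row_dict:
            adjacent_empty_rows += 1
            # After a run of empty rows, assume we've reached the end of the data.
            if adjacent_empty_rows > MAX_ADJACENT_EMPTY:
                break
        else:
            adjacent_empty_rows = 0
    # Always trim the trailing run of empty rows before returning.
    if adjacent_empty_rows:
        del result_rows[-adjacent_empty_rows:]
    return result_rows

For example, read_rows([{"name": "q1", "type": "text"}, {}, {}]) returns only the first row: the two trailing blanks are trimmed, and a sheet that claims thousands of further empty rows is never iterated past the cut-off.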
Thanks, @lindsay-stevens! I think this is a reasonable approach. Did you also consider @yanokwa's suggestion at the bottom of #604 to reset the dimensions above a certain reported column count?
@lognaturel thanks for taking a look! The source for calculate_dimensions iterates through rows until a single break in truthiness, and uses the largest column index from the cells indexed in those rows. A cell object can represent formatting info only (no value data). In the example UCL file, the re-calculated column dimension at column AMJ (the 1024th) comes from a break in the grey background colour formatting, which then extends on forever from column AMK; the relevant XLSForm data only goes to column P (the 16th). So if a user inserts an empty row or column for formatting but doesn't colour or style it, and the worksheet dimensions are still saved incorrectly somehow, then the recalculation might cut off some data. In other words, it would mean an extra iteration over the worksheet data (the second pass reads the values), and that iteration might go too far or not far enough. The approach in this PR also guards against accidentally adding irrelevant data or styling way off at the edges of the worksheet. Maybe it's not as much of an issue for XLS (max 256 cols x 65536 rows), but there doesn't seem to be an equivalent method in xlrd.
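For reference, a rough sketch of what that dimension-reset alternative might look like against openpyxl's read-only API (the threshold, file name, and sheet name are assumptions, and this is not what the PR implements):

from openpyxl import load_workbook

MAX_SANE_COLS = 256  # hypothetical threshold for a "suspicious" reported width

workbook = load_workbook("form.xlsx", read_only=True)
sheet = workbook["survey"]
if sheet.max_column and sheet.max_column > MAX_SANE_COLS:
    # Discard the stored dimensions so openpyxl recomputes them while reading.
    # As noted above, formatting-only cells still count towards the recomputed
    # range, so it may not shrink to just the cells holding XLSForm data.
    sheet.reset_dimensions()
rows = list(sheet.iter_rows(values_only=True))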
It might be nice to add a warning when we do this? Just in case someone really does have that many empty rows/columns?
Thanks for the additional explanation, @lindsay-stevens, the approach sounds good. I don't feel a strong need for a warning and I imagine it'd be a fair amount of plumbing so doesn't seem worth it. I'm not sure about the logic and have commented inline. Otherwise the rest of the implementation and the tests look good. I verified whether it might address #611 and unfortunately it does not. I verified that it does address all 3 forms from the thread in #604. 👍 If at all possible, it would be great to get this released by the end of the week!
Thanks for the review, @lognaturel. I've fixed up the algo so this could be merged. I'll have a look at #611 and open a docs PR for the blank col/row processing behaviour.
pyxform/xls2json_backends.py
Outdated
if is_empty(column_header):
    # Preserve column order (will filter later)
    column_header_list.append(None)
    if last_col_empty:
The rows logic looks good, but this still doesn't look quite right! I don't think you need last_col_empty at all. You can always increment in this branch and set it to 0 in the other.
No worries, updated both cols/rows algos to just use adjacent_empty_* instead of flag variables.
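A corresponding sketch of the header-reading side with the counter alone driving the early exit (again with assumed names); None placeholders preserve the column order for later filtering:

MAX_ADJACENT_EMPTY = 20  # same arbitrary cut-off as for rows

def get_column_headers(first_row_values):
    """Collect headers, stopping after a long run of adjacent empty columns."""
    column_header_list = []
    adjacent_empty_cols = 0
    for column_header in first_row_values:
        if column_header in (None, ""):
            # Preserve column order (will filter later).
            column_header_list.append(None)
            adjacent_empty_cols += 1
            # After a run of empty cols, assume we've reached the end of the data.
            if adjacent_empty_cols > MAX_ADJACENT_EMPTY:
                break
        else:
            column_header_list.append(column_header)
            adjacent_empty_cols = 0
    # Always drop the trailing run of empty headers.
    if adjacent_empty_cols:
        del column_header_list[-adjacent_empty_cols:]
    return column_header_list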
pyxform/xls2json_backends.py
Outdated
# so that any warning messages that mention row numbers are accurate.
result_rows.append(row_dict)

if trim_trailing_empty_rows:
I don't think you need trim_trailing_empty_rows either. Instead, it should always be appropriate to trim adjacent_empty_rows. That will also handle the case in which there are, e.g., 2 empty columns before the end.
Updated to always trim cols/rows, based on adjacent_empty_*.
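As a small, made-up illustration of why the unconditional trim also covers a short run of trailing blanks:

headers = ["type", "name", None, "label", None, None]
adjacent_empty_cols = 2  # two empty columns at the very end of the header row
del headers[-adjacent_empty_cols:]
assert headers == ["type", "name", None, "label"]  # the interior None is filtered later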
Thanks! There's a slightly confusing off-by-one-from-intent, I think, but it doesn't affect users so I'm fine to leave it and will let you decide whether it bothers you enough to patch it with a future PR now that I've mentioned it! 😄
# Preserve column order (will filter later)
column_header_list.append(None)
# After a run of empty cols, assume we've reached the end of the data.
if max_adjacent_empty < adjacent_empty_cols:
OBOB (off-by-one bug)! This will stop after 21 empty columns are identified; the increment should really be before the test. But since 20 is arbitrary anyway, I suppose it doesn't really matter.
Well yes now it has been bothering me! 😄 Will add to a later PR. Thanks
😁
if 0 == len(row_dict):
    # After a run of empty rows, assume we've reached the end of the data.
    if max_adjacent_empty < adjacent_empty_rows:
Same as above: it will only stop after reaching 21 empty rows.
Closes #604
Might also address #611 but haven't checked yet.
Why is this the best possible solution? Were any other approaches considered?
It's not clear how or why these buggy spreadsheets come to be. From the associated tickets / forum threads, it seems like re-saving with Excel may or may not fix them. It's possible a bug in xlrd or openpyxl is to blame, or perhaps these libraries could handle such files better, but in either case an upstream fix may be hard to get given the poor reproducibility. The most practical solution seemed to be to have pyxform only process the workbook sheets for as long as there appears to be data in the columns / rows.
What are the regression risks?
As mentioned above, there's a broken test which seems, at least in part, to involve a mismatch from using the row number for an automatically generated question name. But if that's no good, then we can still output the empty rows.
Does this change require updates to documentation? If so, please file an issue here and include the link below.
It doesn't seem so at this stage.
Before submitting this PR, please make sure you have:
- included tests
- run nosetests and verified all tests pass
- run black pyxform tests to format code