-
Notifications
You must be signed in to change notification settings - Fork 136
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use read-only openpyxl representation to reduce memory usage #596
Conversation
try: | ||
sheet = wb.get_sheet_by_name(sheet_name) | ||
except KeyError: | ||
return False | ||
if sheet.max_row < 2: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this was a no-op. I added a test for a case with just the header at https://github.com/XLSForm/pyxform/pull/596/files#diff-56abab35a27949c49dd2950913ccd05332e683da4a81894108a4df6da67fb653 and https://github.com/XLSForm/pyxform/pull/596/files#diff-e5efd53ccd5cb96180d000bb4cd9157844e81d6b367d1253a2d8f23d7341ddcfR619 tests the empty sheet case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK 👍 Since we can create XLSX from code now with openpyxl I would prefer to create test fixtures from code where possible. Even if it's using openpyxl directly for edge cases that the Markdown methods don't like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you were saying I shouldn't have created the empty_sheets.xlsx
file and instead built it programmatically. I've removed that test because it actually wasn't relevant. I ended up thinking this was about any empty sheet but it's only for select_one_external
and the external_choices
sheet.
This code is never reached if the external_choices
sheet is either empty or has only a header. So I think it's ok not to build a real XLSX file to check those two cases. It seems very strange to me that this re-opens the XLSX again for parsing but I didn't look any deeper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right, but the empty sheet case is not supported by md to json. So I'll have the two cases use a real xlsx so it's really what users experience.
@@ -433,7 +448,7 @@ def test_external_other_extension_instances(self): | |||
| | select_multiple_from_file neighbourhoods.pdf | neighbourhoods | Neighbourhoods | | |||
""", # noqa | |||
errored=True, | |||
error_contains=["should be a choices sheet in this xlsform"], | |||
error__contains=["should be a choices sheet in this xlsform"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A not so great finding: error_contains
passes no matter what the expected error message is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looks like this particular test assertion has been that way for some years. This sort of thing happened a few months ago as well (for warnings__contains, I think). It could be avoided if assertPyxformXform
had explicit keyword arguments instead of accepting **kwargs
only. Would that refactor be OK with you? I could put it in backlog ticket for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a big priority but agreed an issue would be good! 🙏
I'm planning to do a |
Sadly I don't understand why Windows tests are failing.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there should be a test with the problem case to demonstrates the memory consumption improvement. For example 1) generate the XLSX, 2) record resident memory (see below), 3) run xls2xform_convert, 4) assert resident memory not grown by some amount or percentage.
# Not perfect but good indication
import resource
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
if not is_empty(value): | ||
row_dict[key] = xlsx_value_to_str(value) | ||
except IndexError: | ||
pass # rows may not have values for every column |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't able to reproduce this IndexError. It seems that openpyxl emits the same length row tuple even if some cells are empty. It would be problematic if it did emit different length tuples, since the columns are accessed by index. Can this try/except be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends on the structure of the XLSX document. I assume it's more likely to happen with machine-written files (e.g. the one from test_itemset_csv_generated_from_external_choices
). I'll add a test explicitly targeting it.
try: | ||
sheet = wb.get_sheet_by_name(sheet_name) | ||
except KeyError: | ||
return False | ||
if sheet.max_row < 2: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK 👍 Since we can create XLSX from code now with openpyxl I would prefer to create test fixtures from code where possible. Even if it's using openpyxl directly for edge cases that the Markdown methods don't like.
@@ -402,9 +402,6 @@ def test_itemset_csv_generated_from_external_choices(self): | |||
xls2xform_convert( | |||
xlsform_path=wb_path, | |||
xform_path=get_xml_path(wb_path), | |||
validate=True, | |||
pretty_print=False, | |||
enketo=False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a dealbreaker but I am wondering why these arguments are removed from the test? I specified them so that the test was explicit about how the file should be generated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mostly wanted to make sure to remove the validate=True
because it slows the tests a lot. We always run with Validate before release. It seemed to me the others didn't really matter but I'm happy to bring them back. How strongly do you feel? In this case the pretty_print
and enketo
values don't change the output, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We always run with Validate before release.
Ah, except that these aren't pyxform test cases so the flag change won't affect them. I still think that's better because it runs the tests faster. I also happen to know that the entire "fast external itemset" feature is ignored by JavaRosa/Validate.
@@ -433,7 +448,7 @@ def test_external_other_extension_instances(self): | |||
| | select_multiple_from_file neighbourhoods.pdf | neighbourhoods | Neighbourhoods | | |||
""", # noqa | |||
errored=True, | |||
error_contains=["should be a choices sheet in this xlsform"], | |||
error__contains=["should be a choices sheet in this xlsform"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looks like this particular test assertion has been that way for some years. This sort of thing happened a few months ago as well (for warnings__contains, I think). It could be avoided if assertPyxformXform
had explicit keyword arguments instead of accepting **kwargs
only. Would that refactor be OK with you? I could put it in backlog ticket for now.
I haven't tested but it could be that the workbook is still open, as mentioned here: https://openpyxl.readthedocs.io/en/latest/optimized.html |
@@ -601,6 +603,26 @@ def test_workbook_to_json__optional_sheets_ok(self): | |||
warnings_count=0, | |||
) | |||
|
|||
def test_xls2xform_convert__e2e_row_with_no_column_value(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you remove the code you commented on here, this test should fail.
a1c859f
to
0054045
Compare
Thanks so much for helping me read the docs. That seemed like a likely culprit but either I'm closing in the wrong spot or something else is going on. Unfortunately I'm out of time for this tonight. Having chatted with @yanokwa and Team Central, I think this fix should go in the pyxform release so I haven't released yet. We don't know how the extra columns get generated and it seems like the kind of thing that others would run into so best to patch it now. If you have a chance to review during your day, great. Ideas on the open files issue much appreciated! I'll come back to it tomorrow. |
I came up with a workaround for the Windows test failures, PR here to your branch. Otherwise, this PR seems good to go. |
- presumably, windows antivirus scanning is being activated by 1) a new file created and 2) that file being read. Seems to take a while to release the file lock so tests fail on permission error. - approach here is truncate files and clear them up on a subsequent run. - possibly a nicer approach is to not use files at all, but refactor so that we can pass around an io.BytesIO object or similar.
Nice working with you on this, @lindsay-stevens 🤝 |
From @lindsay-stevens on the PR to my fork here:
|
Closes #595
Why is this the best possible solution? Were any other approaches considered?
Even though #595 is not critical, this seems like a good change. We don't ever modify an Excel file so there's no point in opening it for writing.
There are slight differences between the writeable and read-only workbook representations I had to handle. The biggest decision I made was not to worry about the difference in dealing with trailing empty rows. That is, if there are empty rows in between some rows with contents, the two representations are the same. If there are empty rows after which there are no more rows with content, the read-only representation includes those empty rows whereas the writeable one does not. One of the equivalency tests between XLS and XLSX picked up on that difference. I tried dropping all empty rows but this would change the output for a rogue
table-list
feature that uses row number to name nodes (!). I verified that empty rows don't affect the output and decided to change the test XLSX instead of changing the code. That seems like the less risky option. I changed the XLSX and XLS files the same way for thegroup
andspecify_other
forms. I deleted the rows with formatting but no contents.What are the regression risks?
There could be some side effects to the different handling of empty rows and columns. I am not thinking of any but that would be good to get a second opinion on.
Before submitting this PR, please make sure you have:
tests
nosetests
and verified all tests passblack pyxform tests
to format code