Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use read-only openpyxl representation to reduce memory usage #596

Merged
merged 6 commits into from
Mar 16, 2022

Conversation

lognaturel
Copy link
Contributor

@lognaturel lognaturel commented Mar 14, 2022

Closes #595

Why is this the best possible solution? Were any other approaches considered?

Even though #595 is not critical, this seems like a good change. We don't ever modify an Excel file so there's no point in opening it for writing.

There are slight differences between the writeable and read-only workbook representations I had to handle. The biggest decision I made was not to worry about the difference in dealing with trailing empty rows. That is, if there are empty rows in between some rows with contents, the two representations are the same. If there are empty rows after which there are no more rows with content, the read-only representation includes those empty rows whereas the writeable one does not. One of the equivalency tests between XLS and XLSX picked up on that difference. I tried dropping all empty rows but this would change the output for a rogue table-list feature that uses row number to name nodes (!). I verified that empty rows don't affect the output and decided to change the test XLSX instead of changing the code. That seems like the less risky option. I changed the XLSX and XLS files the same way for the group and specify_other forms. I deleted the rows with formatting but no contents.

What are the regression risks?

There could be some side effects to the different handling of empty rows and columns. I am not thinking of any but that would be good to get a second opinion on.

Before submitting this PR, please make sure you have:

  • included test cases for core behavior and edge cases in tests
  • run nosetests and verified all tests pass
  • run black pyxform tests to format code
  • verified that any code or assets from external sources are properly credited in comments

try:
sheet = wb.get_sheet_by_name(sheet_name)
except KeyError:
return False
if sheet.max_row < 2:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK 👍 Since we can create XLSX from code now with openpyxl I would prefer to create test fixtures from code where possible. Even if it's using openpyxl directly for edge cases that the Markdown methods don't like.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you were saying I shouldn't have created the empty_sheets.xlsx file and instead built it programmatically. I've removed that test because it actually wasn't relevant. I ended up thinking this was about any empty sheet but it's only for select_one_external and the external_choices sheet.

This code is never reached if the external_choices sheet is either empty or has only a header. So I think it's ok not to build a real XLSX file to check those two cases. It seems very strange to me that this re-opens the XLSX again for parsing but I didn't look any deeper.

Copy link
Contributor Author

@lognaturel lognaturel Mar 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right, but the empty sheet case is not supported by md to json. So I'll have the two cases use a real xlsx so it's really what users experience.

@@ -433,7 +448,7 @@ def test_external_other_extension_instances(self):
| | select_multiple_from_file neighbourhoods.pdf | neighbourhoods | Neighbourhoods |
""", # noqa
errored=True,
error_contains=["should be a choices sheet in this xlsform"],
error__contains=["should be a choices sheet in this xlsform"],
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A not so great finding: error_contains passes no matter what the expected error message is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looks like this particular test assertion has been that way for some years. This sort of thing happened a few months ago as well (for warnings__contains, I think). It could be avoided if assertPyxformXform had explicit keyword arguments instead of accepting **kwargs only. Would that refactor be OK with you? I could put it in backlog ticket for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a big priority but agreed an issue would be good! 🙏

@lognaturel
Copy link
Contributor Author

I'm planning to do a pyxform release first thing tomorrow (GMT-7) and would like to include this if possible. @lindsay-stevens would be fantastic if you have time for a quick review. If you don't have time or think it's riskier than I believe, let me know and we can release without!

@lognaturel lognaturel changed the title Perf Use read-only openpyxl representation to reduce memory usage Mar 14, 2022
@lognaturel lognaturel marked this pull request as ready for review March 14, 2022 22:11
@lognaturel
Copy link
Contributor Author

Sadly I don't understand why Windows tests are failing.

ERROR: Should find that XLSForm conversion produces itemsets.csv from external_choices.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\a\pyxform\pyxform\tests\test_external_instances_for_selects.py", line 418, in test_itemset_csv_generated_from_external_choices
    self.assertEqual('"suburb","Footscray","vic","melbourne"\n', rows[-1])
  File "C:\hostedtoolcache\windows\Python\3.7.9\x64\lib\contextlib.py", line 119, in __exit__
    next(self.gen)
  File "D:\a\pyxform\pyxform\tests\utils.py", line 74, in get_temp_dir
    shutil.rmtree(temp_dir)
  File "C:\hostedtoolcache\windows\Python\3.7.9\x64\lib\shutil.py", line 516, in rmtree
    return _rmtree_unsafe(path, onerror)
  File "C:\hostedtoolcache\windows\Python\3.7.9\x64\lib\shutil.py", line 400, in _rmtree_unsafe
    onerror(os.unlink, fullname, sys.exc_info())
  File "C:\hostedtoolcache\windows\Python\3.7.9\x64\lib\shutil.py", line 398, in _rmtree_unsafe
    os.unlink(fullname)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\tmp9ms014r1\\select_one_external.xlsx'

Copy link
Contributor

@lindsay-stevens lindsay-stevens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there should be a test with the problem case to demonstrates the memory consumption improvement. For example 1) generate the XLSX, 2) record resident memory (see below), 3) run xls2xform_convert, 4) assert resident memory not grown by some amount or percentage.

# Not perfect but good indication
import resource
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

if not is_empty(value):
row_dict[key] = xlsx_value_to_str(value)
except IndexError:
pass # rows may not have values for every column
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't able to reproduce this IndexError. It seems that openpyxl emits the same length row tuple even if some cells are empty. It would be problematic if it did emit different length tuples, since the columns are accessed by index. Can this try/except be removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends on the structure of the XLSX document. I assume it's more likely to happen with machine-written files (e.g. the one from test_itemset_csv_generated_from_external_choices). I'll add a test explicitly targeting it.

pyxform/xls2json_backends.py Outdated Show resolved Hide resolved
try:
sheet = wb.get_sheet_by_name(sheet_name)
except KeyError:
return False
if sheet.max_row < 2:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK 👍 Since we can create XLSX from code now with openpyxl I would prefer to create test fixtures from code where possible. Even if it's using openpyxl directly for edge cases that the Markdown methods don't like.

@@ -402,9 +402,6 @@ def test_itemset_csv_generated_from_external_choices(self):
xls2xform_convert(
xlsform_path=wb_path,
xform_path=get_xml_path(wb_path),
validate=True,
pretty_print=False,
enketo=False,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a dealbreaker but I am wondering why these arguments are removed from the test? I specified them so that the test was explicit about how the file should be generated.

Copy link
Contributor Author

@lognaturel lognaturel Mar 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mostly wanted to make sure to remove the validate=True because it slows the tests a lot. We always run with Validate before release. It seemed to me the others didn't really matter but I'm happy to bring them back. How strongly do you feel? In this case the pretty_print and enketo values don't change the output, right?

Copy link
Contributor Author

@lognaturel lognaturel Mar 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We always run with Validate before release.

Ah, except that these aren't pyxform test cases so the flag change won't affect them. I still think that's better because it runs the tests faster. I also happen to know that the entire "fast external itemset" feature is ignored by JavaRosa/Validate.

@@ -433,7 +448,7 @@ def test_external_other_extension_instances(self):
| | select_multiple_from_file neighbourhoods.pdf | neighbourhoods | Neighbourhoods |
""", # noqa
errored=True,
error_contains=["should be a choices sheet in this xlsform"],
error__contains=["should be a choices sheet in this xlsform"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looks like this particular test assertion has been that way for some years. This sort of thing happened a few months ago as well (for warnings__contains, I think). It could be avoided if assertPyxformXform had explicit keyword arguments instead of accepting **kwargs only. Would that refactor be OK with you? I could put it in backlog ticket for now.

@lindsay-stevens
Copy link
Contributor

Sadly I don't understand why Windows tests are failing.

I haven't tested but it could be that the workbook is still open, as mentioned here: https://openpyxl.readthedocs.io/en/latest/optimized.html

image

@@ -601,6 +603,26 @@ def test_workbook_to_json__optional_sheets_ok(self):
warnings_count=0,
)

def test_xls2xform_convert__e2e_row_with_no_column_value(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you remove the code you commented on here, this test should fail.

@lognaturel
Copy link
Contributor Author

lognaturel commented Mar 16, 2022

it could be that the workbook is still open

Thanks so much for helping me read the docs. That seemed like a likely culprit but either I'm closing in the wrong spot or something else is going on. Unfortunately I'm out of time for this tonight.

Having chatted with @yanokwa and Team Central, I think this fix should go in the pyxform release so I haven't released yet. We don't know how the extra columns get generated and it seems like the kind of thing that others would run into so best to patch it now. If you have a chance to review during your day, great. Ideas on the open files issue much appreciated! I'll come back to it tomorrow.

@lindsay-stevens
Copy link
Contributor

I came up with a workaround for the Windows test failures, PR here to your branch. Otherwise, this PR seems good to go.

- presumably, windows antivirus scanning is being activated by 1)
  a new file created and 2) that file being read. Seems to take a while
  to release the file lock so tests fail on permission error.
- approach here is truncate files and clear them up on a subsequent run.
- possibly a nicer approach is to not use files at all, but refactor
  so that we can pass around an io.BytesIO object or similar.
@lognaturel lognaturel merged commit f49cdae into XLSForm:master Mar 16, 2022
@lognaturel lognaturel deleted the perf branch March 16, 2022 16:22
@lognaturel
Copy link
Contributor Author

Nice working with you on this, @lindsay-stevens 🤝

@lognaturel
Copy link
Contributor Author

From @lindsay-stevens on the PR to my fork here:

Just a note for posterity about my guess for antivirus. Main clue was the Windows "Resource Monitor" tab for Disk activity. This activity list shows active file paths and the PID and image name doing read/write on them. When running the tests, the file was being read by a process with image name "System" and description "NT Kernel & System". Not sure what else the system would be doing with files except to scan them - but whatever it's doing we can't exactly terminate the system process 💥

The same thing happened when I changed the temp root to pyxform/tests/test_output so it doesn't seem to be related to using the system/user temporary directory.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

For some forms, v1.8 uses an unacceptable amount of memory
2 participants