Use read-only openpyxl representation to reduce memory usage #596

lognaturel · 2022-03-14T21:57:50Z

Closes #595

Why is this the best possible solution? Were any other approaches considered?

Even though #595 is not critical, this seems like a good change. We don't ever modify an Excel file so there's no point in opening it for writing.

There are slight differences between the writeable and read-only workbook representations I had to handle. The biggest decision I made was not to worry about the difference in dealing with trailing empty rows. That is, if there are empty rows in between some rows with contents, the two representations are the same. If there are empty rows after which there are no more rows with content, the read-only representation includes those empty rows whereas the writeable one does not. One of the equivalency tests between XLS and XLSX picked up on that difference. I tried dropping all empty rows but this would change the output for a rogue table-list feature that uses row number to name nodes (!). I verified that empty rows don't affect the output and decided to change the test XLSX instead of changing the code. That seems like the less risky option. I changed the XLSX and XLS files the same way for the group and specify_other forms. I deleted the rows with formatting but no contents.
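To make the interior versus trailing distinction concrete, here is a minimal sketch. It is not code from this PR (which changed the test fixtures instead of the parser), and `strip_trailing_empty_rows` is a hypothetical helper. It assumes rows are tuples of cell values with `None` for empty cells.

```python
def strip_trailing_empty_rows(rows):
    """Drop empty rows at the end only; interior empty rows keep their position."""
    rows = list(rows)
    # Pop from the end while the last row has no cell with a value.
    while rows and all(cell is None for cell in rows[-1]):
        rows.pop()
    return rows


rows = [("a",), (None,), ("b",), (None,), (None,)]
trimmed = strip_trailing_empty_rows(rows)
```

Because interior rows are left in place, a feature that names nodes by row number would see the same numbers for every row that has content.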

What are the regression risks?

There could be some side effects from the different handling of empty rows and columns. I can't think of any, but it would be good to get a second opinion.

Before submitting this PR, please make sure you have:

included test cases for core behavior and edge cases in tests
run nosetests and verified all tests pass
run black pyxform tests to format code
verified that any code or assets from external sources are properly credited in comments

lognaturel · 2022-03-14T21:59:12Z

pyxform/utils.py

    try:
        sheet = wb.get_sheet_by_name(sheet_name)
    except KeyError:
        return False
-    if sheet.max_row < 2:


I believe this was a no-op. I added a test for a case with just the header at https://github.com/XLSForm/pyxform/pull/596/files#diff-56abab35a27949c49dd2950913ccd05332e683da4a81894108a4df6da67fb653, and https://github.com/XLSForm/pyxform/pull/596/files#diff-e5efd53ccd5cb96180d000bb4cd9157844e81d6b367d1253a2d8f23d7341ddcfR619 tests the empty sheet case.

OK 👍 Since we can create XLSX from code now with openpyxl I would prefer to create test fixtures from code where possible. Even if it's using openpyxl directly for edge cases that the Markdown methods don't like.

I think you were saying I shouldn't have created the empty_sheets.xlsx file and instead built it programmatically. I've removed that test because it actually wasn't relevant. I ended up thinking this was about any empty sheet but it's only for select_one_external and the external_choices sheet.

This code is never reached if the external_choices sheet is either empty or has only a header. So I think it's ok not to build a real XLSX file to check those two cases. It seems very strange to me that this re-opens the XLSX again for parsing but I didn't look any deeper.

Ah, right, but the empty sheet case is not supported by md to json. So I'll have the two cases use a real xlsx so it's really what users experience.

lognaturel · 2022-03-14T22:00:23Z

tests/test_external_instances_for_selects.py

@@ -433,7 +448,7 @@ def test_external_other_extension_instances(self):
            |        | select_multiple_from_file neighbourhoods.pdf | neighbourhoods | Neighbourhoods |
            """,  # noqa
            errored=True,
-            error_contains=["should be a choices sheet in this xlsform"],
+            error__contains=["should be a choices sheet in this xlsform"],


A not so great finding: error_contains passes no matter what the expected error message is.

It looks like this particular test assertion has been that way for some years. This sort of thing happened a few months ago as well (for warnings__contains, I think). It could be avoided if assertPyxformXform had explicit keyword arguments instead of accepting **kwargs only. Would that refactor be OK with you? I could put it in a backlog ticket for now.

Not a big priority but agreed an issue would be good! 🙏
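The failure mode discussed in this thread can be sketched in isolation. The two functions below are hypothetical stand-ins, not the real assertPyxformXform. They only illustrate why a `**kwargs`-only signature lets a misspelled key pass silently while keyword-only parameters reject it.

```python
def check_loose(**kwargs):
    """Reads only the keys it knows about; unknown keys are silently ignored."""
    return list(kwargs.get("error__contains", []))


def check_strict(*, errored=False, error__contains=None):
    """Keyword-only parameters: a misspelled name raises TypeError at call time."""
    return list(error__contains or [])


# The misspelled key (single underscore) is never read, so nothing is checked.
silently_ignored = check_loose(errored=True, error_contains=["any message"])

try:
    check_strict(errored=True, error_contains=["any message"])
    strict_rejected = False
except TypeError:
    strict_rejected = True
```

With the strict signature the typo surfaces as an error when the test runs, so the assertion cannot pass by accident.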

lognaturel · 2022-03-14T22:10:46Z

I'm planning to do a pyxform release first thing tomorrow (GMT-7) and would like to include this if possible. @lindsay-stevens would be fantastic if you have time for a quick review. If you don't have time or think it's riskier than I believe, let me know and we can release without!

lognaturel · 2022-03-14T22:25:18Z

Sadly I don't understand why Windows tests are failing.

ERROR: Should find that XLSForm conversion produces itemsets.csv from external_choices.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\a\pyxform\pyxform\tests\test_external_instances_for_selects.py", line 418, in test_itemset_csv_generated_from_external_choices
    self.assertEqual('"suburb","Footscray","vic","melbourne"\n', rows[-1])
  File "C:\hostedtoolcache\windows\Python\3.7.9\x64\lib\contextlib.py", line 119, in __exit__
    next(self.gen)
  File "D:\a\pyxform\pyxform\tests\utils.py", line 74, in get_temp_dir
    shutil.rmtree(temp_dir)
  File "C:\hostedtoolcache\windows\Python\3.7.9\x64\lib\shutil.py", line 516, in rmtree
    return _rmtree_unsafe(path, onerror)
  File "C:\hostedtoolcache\windows\Python\3.7.9\x64\lib\shutil.py", line 400, in _rmtree_unsafe
    onerror(os.unlink, fullname, sys.exc_info())
  File "C:\hostedtoolcache\windows\Python\3.7.9\x64\lib\shutil.py", line 398, in _rmtree_unsafe
    os.unlink(fullname)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\tmp9ms014r1\\select_one_external.xlsx'

lindsay-stevens

I think there should be a test with the problem case to demonstrate the memory consumption improvement. For example 1) generate the XLSX, 2) record resident memory (see below), 3) run xls2xform_convert, 4) assert resident memory has not grown by some amount or percentage.

# Not perfect but good indication
import resource
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
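A minimal sketch of steps 2 to 4, assuming a Unix-like platform (the `resource` module is not available on Windows). `peak_rss_growth` is a hypothetical helper name. Note that `ru_maxrss` is a high-water mark reported in kilobytes on Linux and bytes on macOS, so any threshold has to be chosen per platform.

```python
import resource


def peak_rss_growth(func, *args, **kwargs):
    """Return how much the process's peak resident set size grew while func ran."""
    before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    func(*args, **kwargs)
    after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss never decreases, so this is zero when func stayed under the old peak.
    return after - before


growth = peak_rss_growth(lambda: [0] * 1000)
```

In a test, the callable would be the conversion under test and the assertion would compare `growth` against a chosen limit. Because the value is a peak, the measurement is most meaningful when the conversion is the largest allocation the process has made so far.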

lindsay-stevens · 2022-03-15T05:38:48Z

pyxform/xls2json_backends.py

+                    if not is_empty(value):
+                        row_dict[key] = xlsx_value_to_str(value)
+                except IndexError:
+                    pass  # rows may not have values for every column


I wasn't able to reproduce this IndexError. It seems that openpyxl emits the same length row tuple even if some cells are empty. It would be problematic if it did emit different length tuples, since the columns are accessed by index. Can this try/except be removed?

It depends on the structure of the XLSX document. I assume it's more likely to happen with machine-written files (e.g. the one from test_itemset_csv_generated_from_external_choices). I'll add a test explicitly targeting it.
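The concern about rows of differing length can be sketched with plain tuples. This is an illustration only, not the code in xls2json_backends.py. `is_empty` here is a simplified stand-in for the helper of the same name in the diff, and `row_to_dict` is a hypothetical function.

```python
def is_empty(value):
    """Simplified stand-in: treat None and blank strings as empty."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def row_to_dict(headers, row):
    """Map header names to cell values, tolerating rows shorter than the header."""
    row_dict = {}
    for index, key in enumerate(headers):
        try:
            value = row[index]
        except IndexError:
            break  # the row has no further columns
        if not is_empty(value):
            row_dict[key] = str(value)
    return row_dict


headers = ("list_name", "name", "label")
short_row = row_to_dict(headers, ("suburb", "Footscray"))
sparse_row = row_to_dict(headers, ("suburb", None, "Footscray"))
```

Without the `IndexError` handling, the two-cell row would raise as soon as the third header was looked up.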

lindsay-stevens · 2022-03-15T05:48:29Z

pyxform/utils.py

    try:
        sheet = wb.get_sheet_by_name(sheet_name)
    except KeyError:
        return False
-    if sheet.max_row < 2:


OK 👍 Since we can create XLSX from code now with openpyxl I would prefer to create test fixtures from code where possible. Even if it's using openpyxl directly for edge cases that the Markdown methods don't like.

lindsay-stevens · 2022-03-15T05:50:56Z

tests/test_external_instances_for_selects.py

@@ -402,9 +402,6 @@ def test_itemset_csv_generated_from_external_choices(self):
                xls2xform_convert(
                    xlsform_path=wb_path,
                    xform_path=get_xml_path(wb_path),
-                    validate=True,
-                    pretty_print=False,
-                    enketo=False,


Not a dealbreaker but I am wondering why these arguments are removed from the test? I specified them so that the test was explicit about how the file should be generated.

I mostly wanted to make sure to remove the validate=True because it slows the tests a lot. We always run with Validate before release. It seemed to me the others didn't really matter but I'm happy to bring them back. How strongly do you feel? In this case the pretty_print and enketo values don't change the output, right?

We always run with Validate before release.

Ah, except that these aren't pyxform test cases so the flag change won't affect them. I still think that's better because it runs the tests faster. I also happen to know that the entire "fast external itemset" feature is ignored by JavaRosa/Validate.

lindsay-stevens · 2022-03-15T05:56:45Z

tests/test_external_instances_for_selects.py

@@ -433,7 +448,7 @@ def test_external_other_extension_instances(self):
            |        | select_multiple_from_file neighbourhoods.pdf | neighbourhoods | Neighbourhoods |
            """,  # noqa
            errored=True,
-            error_contains=["should be a choices sheet in this xlsform"],
+            error__contains=["should be a choices sheet in this xlsform"],


It looks like this particular test assertion has been that way for some years. This sort of thing happened a few months ago as well (for warnings__contains, I think). It could be avoided if assertPyxformXform had explicit keyword arguments instead of accepting **kwargs only. Would that refactor be OK with you? I could put it in a backlog ticket for now.

lindsay-stevens · 2022-03-15T06:05:45Z

Sadly I don't understand why Windows tests are failing.

I haven't tested but it could be that the workbook is still open, as mentioned here: https://openpyxl.readthedocs.io/en/latest/optimized.html
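The linked openpyxl page notes that a read-only workbook loads lazily and must be closed explicitly with `close()`. A minimal sketch of that pattern follows. `read_sheet_rows` and its `loader` parameter are hypothetical, introduced only so the close-on-exit behaviour can be exercised without a real file. Whether this alone resolves the Windows failure is not established here.

```python
from contextlib import closing


def _default_loader(path):
    # Imported lazily so openpyxl is only required when a real file is read.
    from openpyxl import load_workbook

    return load_workbook(filename=path, read_only=True, data_only=True)


def read_sheet_rows(path, sheet_name, loader=_default_loader):
    """Return every row of one sheet as a tuple of values, then close the workbook."""
    # closing() calls wb.close() on exit, even if reading the sheet raises.
    with closing(loader(path)) as wb:
        return [tuple(row) for row in wb[sheet_name].iter_rows(values_only=True)]
```

Materialising the rows into a list before leaving the `with` block matters: in read-only mode the rows are streamed from the file, so they have to be consumed while the workbook is still open.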

lognaturel · 2022-03-16T03:35:23Z

tests/test_xls2json.py

@@ -601,6 +603,26 @@ def test_workbook_to_json__optional_sheets_ok(self):
            warnings_count=0,
        )

+    def test_xls2xform_convert__e2e_row_with_no_column_value(self):


If you remove the code you commented on here, this test should fail.

lognaturel · 2022-03-16T04:27:11Z

it could be that the workbook is still open

Thanks so much for helping me read the docs. That seemed like a likely culprit but either I'm closing in the wrong spot or something else is going on. Unfortunately I'm out of time for this tonight.

Having chatted with @yanokwa and Team Central, I think this fix should go in the pyxform release so I haven't released yet. We don't know how the extra columns get generated and it seems like the kind of thing that others would run into so best to patch it now. If you have a chance to review during your day, great. Ideas on the open files issue much appreciated! I'll come back to it tomorrow.

lindsay-stevens · 2022-03-16T12:05:51Z

I came up with a workaround for the Windows test failures, PR here to your branch. Otherwise, this PR seems good to go.

- presumably, Windows antivirus scanning is being activated by 1) a new file created and 2) that file being read. Seems to take a while to release the file lock so tests fail on permission error.
- approach here is truncate files and clear them up on a subsequent run.
- possibly a nicer approach is to not use files at all, but refactor so that we can pass around an io.BytesIO object or similar.

lognaturel · 2022-03-16T16:24:35Z

Nice working with you on this, @lindsay-stevens 🤝

lognaturel · 2022-03-17T15:37:23Z

From @lindsay-stevens on the PR to my fork here:

Just a note for posterity about my guess for antivirus. Main clue was the Windows "Resource Monitor" tab for Disk activity. This activity list shows active file paths and the PID and image name doing read/write on them. When running the tests, the file was being read by a process with image name "System" and description "NT Kernel & System". Not sure what else the system would be doing with files except to scan them - but whatever it's doing we can't exactly terminate the system process 💥

The same thing happened when I changed the temp root to pyxform/tests/test_output so it doesn't seem to be related to using the system/user temporary directory.

Use read-only openpyxl workbook to produce itemsets.csv

548f058

lognaturel requested a review from lindsay-stevens March 14, 2022 22:01

Use read-only openpyxl workbook to read form definition

b85ac18

lognaturel force-pushed the perf branch from 9703b8e to b85ac18 Compare March 14, 2022 22:08

lognaturel changed the title ~~Perf~~ Use read-only openpyxl representation to reduce memory usage Mar 14, 2022

lognaturel marked this pull request as ready for review March 14, 2022 22:11

lognaturel mentioned this pull request Mar 14, 2022

For some forms, v1.8 uses an unacceptable amount of memory #595

Closed

2 tasks

lindsay-stevens reviewed Mar 15, 2022

Improve tests, close workbooks

efe22a2

lognaturel force-pushed the perf branch 2 times, most recently from a1c859f to 0054045 Compare March 16, 2022 04:18

lognaturel force-pushed the perf branch from 0054045 to 02250e2 Compare March 16, 2022 04:19

Add test to document memory usage regression

0fa6096

lognaturel force-pushed the perf branch from 02250e2 to 0fa6096 Compare March 16, 2022 04:25

lognaturel requested a review from lindsay-stevens March 16, 2022 04:28

lindsay-stevens mentioned this pull request Mar 16, 2022

596: Use read-only openpyxl representation to reduce memory usage lognaturel/pyxform#1

Merged

lindsay-stevens approved these changes Mar 16, 2022

lindsay-stevens mentioned this pull request Mar 16, 2022

Refactor PyxformTestCase.assertPyxformXform to use explicit keyword arguments instead of kwargs to avoid tests passing by mistake #597

Closed

dev: linter / formatting

be969c1

lognaturel merged commit f49cdae into XLSForm:master Mar 16, 2022

lognaturel deleted the perf branch March 16, 2022 16:22

yanokwa mentioned this pull request Apr 12, 2022

Form conversion is slow on form with lots of blank rows and columns #604

Closed
