Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Injecting mandatory FMTM fields into custom XLSForms #1722

Closed
spwoodcock opened this issue Jul 31, 2024 · 8 comments · Fixed by #1792
Closed

Injecting mandatory FMTM fields into custom XLSForms #1722

spwoodcock opened this issue Jul 31, 2024 · 8 comments · Fixed by #1792
Assignees
Labels
backend Related to backend code effort:medium Likely a day or two enhancement New feature or request priority:high Should be addressed as a priority testing:ready Ready for testing

Comments

@spwoodcock
Copy link
Member

spwoodcock commented Jul 31, 2024

Is your feature request related to a problem? Please describe.

  • Currently our XLSForm has hidden fields that are mandatory: feature / task selection, digitization_correct questions.
  • This is prone to breakage if the user uploads a custom XLSForm and accidentally deletes fields (despite being hidden).
  • We need a more resilient mechanism to allow for custom XLSForm upload flexbility, but also impose our restrictions / requirements that FMTM needs to function well.

Describe the solution you'd like

  • Option 1: append the fields in the XLSForm, either via pyxlsx or possibly via Pandas, prior to XLS --> XML.

    • Load the XLSForm as a dataframe.
    • Have pre-prepared separate XLSForms with building questions & digitization_correct questions.
    • Read the building selection questions into a dataframe and append to start.
    • Read the digitization_correct questions into a dataframe and append to end.
    • However, this will introduce a big dependency: Pandas, which is not ideal.
  • Option 2: use the current XML parsing / manipulation method to inject all required params, after XLS --> XML.

    • This method is possible, but may have many fields we need to insert into specific places.
    • Needs testing and is more prone to breakage.

Additional context

  • This first option would probably require we update all form handling to use Pandas, which would be quite an improvement to XLS wrangling.

Note

See comment below. We should not use Pandas, but instead python-calamine.

@spwoodcock spwoodcock added enhancement New feature or request priority:low Backlog of tasks that will be addressed in time backend Related to backend code effort:medium Likely a day or two labels Jul 31, 2024
@spwoodcock
Copy link
Member Author

This would make the XLSForms a whole lot more maintainable.

Currently if we have say 10 categories = 10 XLSForms, we need to keep the extra fields we use in FMTM in sync for all of them.

Ideally we have:

  • 1 XLSForm with mandatory fields for FMTM.
  • 10 XLSForms with questions only related to the specific mapping category.
  • On project creation we merge the selected form with the 'FMTM mandatory fields' XLSForm.

@manjitapandey manjitapandey added priority:high Should be addressed as a priority and removed priority:low Backlog of tasks that will be addressed in time labels Aug 2, 2024
@spwoodcock
Copy link
Member Author

spwoodcock commented Aug 3, 2024

I just did a small assessment of package size between various XLS and XLSX parser libraries:

image

  • Pandas is nice to work with, but pulls in so many unnecessary libraries it's not worth it.
  • openpyxl and xlrd would be required together to parse XLSX and XLS files respectively.
  • python-calamine seems like the most sustainable choice (it was recently integrated into Pandas, so should have good community support). At the low level, it's actually using a compiled rust binary, making it up to 5x faster than openpyxl and 1.5x faster than xlrd. It's also the smallest, adding only 12MB to the build!

We should use python-calamine for XLS and XLSX parsing.

@Sujanadh
Copy link
Collaborator

I found that python-calamine is only for reading xls files but can't use it to manipulate the form fields. We would need another package, for example, openpyxl to modify the form in order to inject fmtm mandatory fields.

@spwoodcock
Copy link
Member Author

Oh no! That's a pain 😅

I considered Pandas but it's a huge library (200MB).
Also Polars is similar.

Openpyxl is ancient now and unmaintained - but the xlsx spec probably hasn't changed and perhaps it's just feature complete.

It would mean we need to use it alongside xlrd to read xls though!

Two other options:

  • Use Pandas or Polars for the simplicity?
  • Just deal with the XML instead of XLS(X)?
    (This isn't ideal, but perhaps we can make it easier by parsing the XML to dict)

@spwoodcock
Copy link
Member Author

Acknowledgement I have been a bit silly

  • Pandas is large, but it's also going to be supported for a long time & is easy to use.
  • While it takes up a lot of space on a fresh install, I tested adding it to our current FMTM stack & it adds very little...
  • I assume this is because our other dependencies already install the large subdependencies
  • We can safely add pandas to the backend dependencies - sorry for the mess around!
  • While python-calamine is definitely a lot faster, it's overkill for our small xlsx files (I was mainly looking at it as an alternative for pandas)

If we ever need to revisit this

If we ever want to remove pandas for any reason, we can consider calamine again for writing, in parallel with XlsxWriter for writing.

We would need to iterate through the survey and choices sheets to combine our defaults with the custom uploaded XLSX.

from xlsxwriter import Workbook
from python_calamine import CalamineWorkbook

workbook = CalamineWorkbook.from_path("form.xlsx")
xlsx_data = workbook.get_sheet_by_name("survey").to_python()
# [
# ["1",  "2",  "3",  "4",  "5",  "6",  "7"],
# ["1",  "2",  "3",  "4",  "5",  "6",  "7"],
# ["1",  "2",  "3",  "4",  "5",  "6",  "7"],
# ]

workbook = Workbook('combined.xlsx')
worksheet = workbook.add_worksheet("survey")
for row_index, row_data in enumerate(xlsx_data):
    for column_index, column_data in enumerate(row_data):
        # Get the column letter from column_index
        column_letter = xxx
        worksheet.write(f'{column_letter}{row_index}', column_data)

@spwoodcock spwoodcock changed the title Assess possibility of injecting mandatory FMTM fields into XLSForm Injecting mandatory FMTM fields into custom XLSForms Aug 21, 2024
@spwoodcock
Copy link
Member Author

Possible issue with translations

Interesting issue I just encountered!

For the choice sheet we have a placeholder:
task_filter 1 1 1 1 1

The 1 values are for list_name name and label
But they are also required in our workflow for label::English(en) and the other 3 languages.
If they are missing, then the value is displayed at a null value in ODK Collect!

I'm raising this because it may cause issues with merging!
If our mandatory fields form has the fields: label::English(en) label::Spanish(es) etc
But the user uploaded form doesn't have these fields, and just uses a default label field, then it looks like the value for specific languages will take priority: in this case they would be null / empty.

Possible solution

  • It's quite hard to automatically merge translations, as the user may use the label field with a language other than English.
  • We probably should address this complexity later for now, and assume the default label will be English, with other languages being in fields like label::Spanish(es).

Scenario 1:

  • The user uploads a form with label field only.
  • We remove our label::English(en) label::Spanish(es) etc fields from the mandatory fields form.
  • We move label::English(en) contents to the label field for consistency.
  • There is only one language for the form - all users will see this.

Scenario 2:

  • The user uploads a form with ``label::Spanish(es)` etc fields.
  • We try to match these fields with the mandatory field form fields.
  • We remove any translation fields from the mandatory field form that are not present in the user uploaded form.
  • Example:
    • User uploads a form with label::Spanish(es) label::French(fr) fields
    • We match and keep the label::Spanish(es) label::French(fr) fields form our mandatory fields form
    • We remove the additional label::English(en) label::Swahili(sw) fields.

Note

In future we should translation our mandatory field questions into all available languages in ODK.
Then we can handle merging for every possible field.

@spwoodcock
Copy link
Member Author

spwoodcock commented Aug 22, 2024

Current mandatory fields in survey sheet:

image

image

image

Current mandatory fields in choices sheet:

image

Current mandatory fields in entities sheet:

image

@spwoodcock
Copy link
Member Author

spwoodcock commented Sep 4, 2024

For future reference, openpyxl is still maintained, but is available here instead! https://foss.heptapod.net/openpyxl/openpyxl

openpyxl can be used to read AND write XLSX...

My bad for missing that. Although it makes no difference now, as we are using higher level Pandas anyway.

This may be useful for the future though, if one day we remove our reliance on Pandas in osm-fieldwork.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend Related to backend code effort:medium Likely a day or two enhancement New feature or request priority:high Should be addressed as a priority testing:ready Ready for testing
Projects
Development

Successfully merging a pull request may close this issue.

4 participants