Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

two test failures affecting Bio.SeqIO.SeqXmlIO in Debian sid #4640

Closed
emollier opened this issue Feb 28, 2024 · 18 comments · Fixed by #4683
Closed

two test failures affecting Bio.SeqIO.SeqXmlIO in Debian sid #4640

emollier opened this issue Feb 28, 2024 · 18 comments · Fixed by #4683

Comments

@emollier
Copy link
Contributor

Greetings,

This issue may not be an easy one, and even risks being Debian
specific, so if you are busy somewhere else, then I won't mind.
:)

I'm having great difficulties pinpointing a change between
Debian testing and sid which is causing Debian bug #1064147,
affecting both Biopython 1.81 and 1.83 in sid, but none of them
in testing (as of today). The error output comes from the test
suite, and gives in Debian sid:

======================================================================
ERROR: test_embl7 (test_SeqIO.TestSeqIO.test_embl7)
----------------------------------------------------------------------
Traceback (most recent call last):
 File "/tmp/autopkgtest.OslsYr/autopkgtest_tmp/Tests/test_SeqIO.py", line 3406, in test_embl7
   self.perform_test(
 File "/tmp/autopkgtest.OslsYr/autopkgtest_tmp/Tests/test_SeqIO.py", line 625, in perform_test
   self.check_simple_write_read(
 File "/tmp/autopkgtest.OslsYr/autopkgtest_tmp/Tests/test_SeqIO.py", line 362, in check_simple_write_read
   records2 = list(SeqIO.parse(handle=handle, format=fmt))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/lib/python3/dist-packages/Bio/SeqIO/Interfaces.py", line 85, in __next__
   return next(self.records)
          ^^^^^^^^^^^^^^^^^^
 File "/usr/lib/python3/dist-packages/Bio/SeqIO/SeqXmlIO.py", line 482, in iterate
   parser.close()
 File "/usr/lib/python3.11/xml/sax/expatreader.py", line 240, in close
   self.feed(b"", isFinal=True)
 File "/usr/lib/python3.11/xml/sax/expatreader.py", line 217, in feed
   self._parser.Parse(data, isFinal)
 File "../Modules/pyexpat.c", line 416, in StartElement
 File "/usr/lib/python3.11/xml/sax/expatreader.py", line 369, in start_element_ns
   self._cont_handler.startElementNS(pair, None,
 File "/usr/lib/python3/dist-packages/Bio/SeqIO/SeqXmlIO.py", line 184, in startEntryFieldElement
   return self.startPropertyElement(attrs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/lib/python3/dist-packages/Bio/SeqIO/SeqXmlIO.py", line 374, in startPropertyElement
   record = self.records[-1]
            ~~~~~~~~~~~~^^^^
IndexError: list index out of range

======================================================================
ERROR: test_genbank8 (test_SeqIO.TestSeqIO.test_genbank8)
----------------------------------------------------------------------
Traceback (most recent call last):
 File "/tmp/autopkgtest.OslsYr/autopkgtest_tmp/Tests/test_SeqIO.py", line 2803, in test_genbank8
   self.perform_test(
 File "/tmp/autopkgtest.OslsYr/autopkgtest_tmp/Tests/test_SeqIO.py", line 625, in perform_test
   self.check_simple_write_read(
 File "/tmp/autopkgtest.OslsYr/autopkgtest_tmp/Tests/test_SeqIO.py", line 362, in check_simple_write_read
   records2 = list(SeqIO.parse(handle=handle, format=fmt))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/lib/python3/dist-packages/Bio/SeqIO/Interfaces.py", line 85, in __next__
   return next(self.records)
          ^^^^^^^^^^^^^^^^^^
 File "/usr/lib/python3/dist-packages/Bio/SeqIO/SeqXmlIO.py", line 482, in iterate
   parser.close()
 File "/usr/lib/python3.11/xml/sax/expatreader.py", line 240, in close
   self.feed(b"", isFinal=True)
 File "/usr/lib/python3.11/xml/sax/expatreader.py", line 217, in feed
   self._parser.Parse(data, isFinal)
 File "../Modules/pyexpat.c", line 416, in StartElement
 File "/usr/lib/python3.11/xml/sax/expatreader.py", line 369, in start_element_ns
   self._cont_handler.startElementNS(pair, None,
 File "/usr/lib/python3/dist-packages/Bio/SeqIO/SeqXmlIO.py", line 184, in startEntryFieldElement
   return self.startPropertyElement(attrs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/lib/python3/dist-packages/Bio/SeqIO/SeqXmlIO.py", line 374, in startPropertyElement
   record = self.records[-1]
            ~~~~~~~~~~~~^^^^
IndexError: list index out of range

Version comparison of Biopython direct dependencies suggest they
are not involved in the test regression I observe:

  • python3-numpy version in sid: 1:1.24.2-2
  • python3-numpy version in testing: 1:1.24.2-2
  • python3 version in sid: 3.11.6-1
  • python3 version in testing: 3.11.6-1
  • libc6 version in sid: 2.37-15
  • libc6 version in testing: 2.37-15
  • python3-reportlab version in sid: 4.1.0-1
  • python3-reportlab version in testing: 4.1.0-1
  • w3c-sgml-lib version in sid: 1.3-3
  • w3c-sgml-lib version in testing: 1.3-3

This means that the issue is caused by a transitive dependency
(which I have not managed to identify yet), or something else
entirely. I also reproduced the problem on Debian Salsa CI,
which strongly hints that I haven't mishandled my build
environment. Besides, if one of the direct dependencies had
triggered the bug, then I would have expected the test run
triggered by the package migration CI to have tripped, which
has not happened (which hints that the situation did not appear
before 2024 February 3rd in sid).

I wouldn't be surprised to learn that this isn't be a problem in
Biopython per se, but I'm running out of options without your
thoughts upstream (maybe the combination of these two particular
tests failing and the others passing hints to something in
particular?). Do you per chance have an idea of what I am
missing that could cause the SeqXmlIO records to be empty and
cause these test failures?

Thank you for your time,
Étienne.

PS: here below, the mandatory form for your convenience:

Setup

I am reporting a problem with Biopython version, Python version,
and operating system as follows. In Debian sid:

>>> import sys; print(sys.version)
3.11.8 (main, Feb  7 2024, 21:52:08) [GCC 13.2.0]
>>> import platform; print(platform.python_implementation()); print(platform.platform())
CPython
Linux-6.6.15-amd64-x86_64-with-glibc2.37
>>> import Bio; print(Bio.__version__)
1.83

In Debian testing (no changes visibly):

>>> import sys; print(sys.version)
3.11.8 (main, Feb  7 2024, 21:52:08) [GCC 13.2.0]
>>> import platform; print(platform.python_implementation()); print(platform.platform())
CPython
Linux-6.6.15-amd64-x86_64-with-glibc2.37
>>> import Bio; print(Bio.__version__)
1.83

This is also valid with Biopython 1.81.

Expected behaviour

I would like to make all test items of the test suite pass on
Debian sid.

Actual behaviour

All test items pass on Debian testing. However on Debian sid, I
observe the following test items failing:

======================================================================
ERROR: test_embl7 (test_SeqIO.TestSeqIO.test_embl7)
----------------------------------------------------------------------
Traceback (most recent call last):
 File "/tmp/autopkgtest.OslsYr/autopkgtest_tmp/Tests/test_SeqIO.py", line 3406, in test_embl7
   self.perform_test(
 File "/tmp/autopkgtest.OslsYr/autopkgtest_tmp/Tests/test_SeqIO.py", line 625, in perform_test
   self.check_simple_write_read(
 File "/tmp/autopkgtest.OslsYr/autopkgtest_tmp/Tests/test_SeqIO.py", line 362, in check_simple_write_read
   records2 = list(SeqIO.parse(handle=handle, format=fmt))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/lib/python3/dist-packages/Bio/SeqIO/Interfaces.py", line 85, in __next__
   return next(self.records)
          ^^^^^^^^^^^^^^^^^^
 File "/usr/lib/python3/dist-packages/Bio/SeqIO/SeqXmlIO.py", line 482, in iterate
   parser.close()
 File "/usr/lib/python3.11/xml/sax/expatreader.py", line 240, in close
   self.feed(b"", isFinal=True)
 File "/usr/lib/python3.11/xml/sax/expatreader.py", line 217, in feed
   self._parser.Parse(data, isFinal)
 File "../Modules/pyexpat.c", line 416, in StartElement
 File "/usr/lib/python3.11/xml/sax/expatreader.py", line 369, in start_element_ns
   self._cont_handler.startElementNS(pair, None,
 File "/usr/lib/python3/dist-packages/Bio/SeqIO/SeqXmlIO.py", line 184, in startEntryFieldElement
   return self.startPropertyElement(attrs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/lib/python3/dist-packages/Bio/SeqIO/SeqXmlIO.py", line 374, in startPropertyElement
   record = self.records[-1]
            ~~~~~~~~~~~~^^^^
IndexError: list index out of range

======================================================================
ERROR: test_genbank8 (test_SeqIO.TestSeqIO.test_genbank8)
----------------------------------------------------------------------
Traceback (most recent call last):
 File "/tmp/autopkgtest.OslsYr/autopkgtest_tmp/Tests/test_SeqIO.py", line 2803, in test_genbank8
   self.perform_test(
 File "/tmp/autopkgtest.OslsYr/autopkgtest_tmp/Tests/test_SeqIO.py", line 625, in perform_test
   self.check_simple_write_read(
 File "/tmp/autopkgtest.OslsYr/autopkgtest_tmp/Tests/test_SeqIO.py", line 362, in check_simple_write_read
   records2 = list(SeqIO.parse(handle=handle, format=fmt))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/lib/python3/dist-packages/Bio/SeqIO/Interfaces.py", line 85, in __next__
   return next(self.records)
          ^^^^^^^^^^^^^^^^^^
 File "/usr/lib/python3/dist-packages/Bio/SeqIO/SeqXmlIO.py", line 482, in iterate
   parser.close()
 File "/usr/lib/python3.11/xml/sax/expatreader.py", line 240, in close
   self.feed(b"", isFinal=True)
 File "/usr/lib/python3.11/xml/sax/expatreader.py", line 217, in feed
   self._parser.Parse(data, isFinal)
 File "../Modules/pyexpat.c", line 416, in StartElement
 File "/usr/lib/python3.11/xml/sax/expatreader.py", line 369, in start_element_ns
   self._cont_handler.startElementNS(pair, None,
 File "/usr/lib/python3/dist-packages/Bio/SeqIO/SeqXmlIO.py", line 184, in startEntryFieldElement
   return self.startPropertyElement(attrs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/usr/lib/python3/dist-packages/Bio/SeqIO/SeqXmlIO.py", line 374, in startPropertyElement
   record = self.records[-1]
            ~~~~~~~~~~~~^^^^
IndexError: list index out of range

Steps to reproduce

Run the test suite.

@peterjc
Copy link
Member

peterjc commented Feb 28, 2024

That is odd. I've only skimmed the tracebacks, however I can give some context.

Both the failures are within check_simple_write_read which takes an existing sequence file (here one in EMBL format, and one in the related GenBank format), exports it as another format, and confirms we can read it back again with no unexpected data loss.

This suggests to me the issue is either in writing out the SeqXML format, or in reading SeqXML format. Libraries this code imports to do that:

from xml import sax
from xml.sax import handler
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl

Are there any likely looking XML SAX libraries whose version changed?

It might be helpful to try the SeqXML output directly, and validate the XML file. e.g.

$ python -c "from Bio import SeqIO; SeqIO.convert('Tests/EMBL/SC10H5.embl', 'embl', 'check_me.xml', 'seqxml')"

The other thing I would double check is the system locale (just because that has caused oddities before).

@emollier
Copy link
Contributor Author

Hi Peter,

Are there any likely looking XML SAX libraries whose version changed?

I looked there first, but the xml.sax looks to be part of the standard library, which is tightly coupled to the python3 interpreter:

$ apt-file search /xml/sax/xmlreader.py
libpython3.10-stdlib: /usr/lib/python3.10/xml/sax/xmlreader.py
libpython3.11-stdlib: /usr/lib/python3.11/xml/sax/xmlreader.py
libpython3.12-stdlib: /usr/lib/python3.12/xml/sax/xmlreader.py
[... snipped various irrelevant stdlib embeddings ...]

So the issue would have been in an external dependency of the module. Skimming through python3.11 shows nothing standing out; dependencies are merely the compiler and libc.

The libc hasn't changed, so, urgh, I'm hoping I'm not running into a long winded compiler bug, the only difference I see that might affect xml.sax operation between testing and sid is gcc (through libgcc_s.so.1):

  • testing has gcc 14-20240201 (beginning of the month)
  • sid has gcc 14-20240221 (end of the month)

Possibly relevant milestone would be gcc 14-20240207, as I know for sure I witnessed the issue before Berlin sprint on 16th and 17th. If further testing prove this happens to be the relevant change, then I will move back the discussion in Debian and close the present issue.

It might be helpful to try the SeqXML output directly, and validate the XML file. e.g.

$ python -c "from Bio import SeqIO; SeqIO.convert('Tests/EMBL/SC10H5.embl', 'embl', 'check_me.xml', 'seqxml')"

Thanks for the snippet, for what it's worth I obtained the same output from either testing or sid. I guess this might point towards a problem with reading back the SeqXML? The file obtained out of testing is in attachment if need be.

testing.xml.gz

The other thing I would double check is the system locale (just because that has caused oddities before).

I think we've been bitten by that in the past. Nowadays we run the test suite enforcing LC_ALL=C.UTF-8. This is probably not the issue.

Thank you for your time,
Étienne.

@emollier
Copy link
Contributor Author

Re,

If further testing prove [libgcc_s.so.1] happens to be the relevant change, then I will move back the discussion in Debian and close the present issue.

I reran the test suite in Debian testing with a targeted upgrade of the libgcc_s.so.1 from sid, and tests passed, so this is not related to the compiler library.

If you have a snippet to reread the output of the snippet that produced the xml, I would be taker.

Have a nice day, :)
Étienne.

@peterjc
Copy link
Member

peterjc commented Feb 29, 2024

Well a one line to parse the XML file back again could be:

$ python -c "from Bio import SeqIO; print(SeqIO.read('check_me.xml', 'seqxml').id)"
AL031232

Also worth a try is:

$ cd Tests/
$ python test_SeqIO.py

If that works (outside of run_tests.py) that is rather puzzling. If it fails, we might start cutting test_SeqIO.py down to a smaller and smaller test.

@emollier
Copy link
Contributor Author

emollier commented Mar 1, 2024

Thanks for the reader, it looks like the issue stems solely from the reading, I see on sid:

$ python3 -c "from Bio import SeqIO; print(SeqIO.read('check_me.xml', 'seqxml').id)"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/Bio/SeqIO/__init__.py", line 658, in read
    next(iterator)
  File "/usr/lib/python3/dist-packages/Bio/SeqIO/Interfaces.py", line 72, in __next__
    return next(self.records)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/Bio/SeqIO/SeqXmlIO.py", line 447, in iterate
    parser.close()
  File "/usr/lib/python3.11/xml/sax/expatreader.py", line 240, in close
    self.feed(b"", isFinal=True)
  File "/usr/lib/python3.11/xml/sax/expatreader.py", line 217, in feed
    self._parser.Parse(data, isFinal)
  File "../Modules/pyexpat.c", line 416, in StartElement
  File "/usr/lib/python3.11/xml/sax/expatreader.py", line 369, in start_element_ns
    self._cont_handler.startElementNS(pair, None,
  File "/usr/lib/python3/dist-packages/Bio/SeqIO/SeqXmlIO.py", line 163, in startEntryFieldElement
    return self.startPropertyElement(attrs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/Bio/SeqIO/SeqXmlIO.py", line 339, in startPropertyElement
    record = self.records[-1]
             ~~~~~~~~~~~~^^^^
IndexError: list index out of range

while on testing I have the expected result:

$ python3 -c "from Bio import SeqIO; print(SeqIO.read('check_me.xml', 'seqxml').id)"
AL031232

@emollier
Copy link
Contributor Author

emollier commented Mar 1, 2024

Ouch, by the way, I had a look back at my build log from yesterday, and I realized my gcc upgrade attempt didn't apply cleanly, so this is still a variable to consider.

@peterjc
Copy link
Member

peterjc commented Mar 1, 2024

Progress for sure - and the SAX expat C code is involved somehow.

@mr-c
Copy link
Contributor

mr-c commented Mar 5, 2024

Looks like the NixOS people ran into this as well: NixOS/nixpkgs#292138

@emollier
Copy link
Contributor Author

emollier commented Mar 8, 2024

It took me a while, but I finally managed to run the test in a Debian testing environment with upgraded libgcc_s.so.1. The recipe was to upgrade the following packages in a lock step:

  • libgcc-s1
  • gcc-14-base
  • libstdc++6
  • libgomp1
  • libgfortran5

They all stem from Gcc 14 version 14-20240303. In the end, the SeqIO test passed in testing with upgraded Gcc libraries.

Just in case, I rechecked the status in Debian sid, in which I still witness the failure. Now we know the issue is not from Gcc 14-20240303.

@emollier
Copy link
Contributor Author

emollier commented Mar 8, 2024

I must be very tired for not having looked at libexpat1 earlier: the package looks pulled by the "essential set" and thus does not need explicit dependency set (which is why I completely missed it); also Python source code embeds a copy of expat, but Debian builds Python against the operating system's libexpat.so.1.

I do reproduce the issue by upgrading testing's libexpat.so.1 from 2.5.0 in testing to sid's 2.6.1. I believe the issue is not in Biopython, but in the interaction between Python and the newer expat library.

Debian CI is tripping on Python test errors following libexpat 2.6.1 upgrade in sid, suggesting it won't make it to testing until tests are resolved in Python, fixing possible the Biopython issue at the same time. From Python test attempt against the newer expat:

2444s 0:25:01 load avg: 0.96 [414/427/1] test_xml_etree failed (2 failures)
2444s test test_xml_etree failed -- multiple errors occurred; run in verbose mode for details
2446s 0:25:03 load avg: 0.97 [415/427/2] test_xml_etree_c failed (2 failures)
2446s test test_xml_etree_c failed -- multiple errors occurred; run in verbose mode for details

I think it is safe to close the present issue as it is not a problem in Biopython per se.

@emollier emollier closed this as completed Mar 8, 2024
@peterjc
Copy link
Member

peterjc commented Mar 9, 2024

Impressive detective work there. We can hope that Python's libexpat code gets fixed before too long (given you said the Python tests are failing), so hopefully nothing for the Biopython to do here. Thank you all!

@peterjc
Copy link
Member

peterjc commented Mar 28, 2024

This is now hitting our Python 3.10 Linux runs on GitHub actions

@peterjc peterjc reopened this Mar 28, 2024
peterjc added a commit to peterjc/biopython that referenced this issue Mar 28, 2024
Hopefully this will fix biopython#4640.

If it does, this is due to CPython issue 115133
and is a side effect of a CVE security fix.
@emollier
Copy link
Contributor Author

emollier commented Mar 28, 2024 via email

@peterjc
Copy link
Member

peterjc commented Mar 28, 2024

Thanks Étienne. After some searches on the Debian log error messages, I was also looking at CPython issue 115133, which suggests parsing with small chunk size is no longer viable due to the security fix.

Perhaps the block size relative to the size of the lines of data matters, since 1024 is hardly a small block:

https://github.com/biopython/biopython/blob/biopython-183/Bio/SeqIO/SeqXmlIO.py#L387

You may be right about feeding whole lines to the parser being a more complete fix - but it is also more work. Anyway, doubling the chunk size to 2048 bytes seems to help #4683 and might suffice...

@emollier
Copy link
Contributor Author

emollier commented Mar 29, 2024 via email

@mdehoon
Copy link
Contributor

mdehoon commented Mar 29, 2024

The XML parser in Bio.Entrez also uses a block size of 1024 bytes. If the problem occurs only in Bio.SeqIO.SeqXmlIO and not in Bio.Entrez, then the bug may actually be in Bio.SeqIO.SeqXmlIO.

peterjc added a commit to peterjc/biopython that referenced this issue Mar 29, 2024
@peterjc
Copy link
Member

peterjc commented Mar 29, 2024

The Entrez XML parser do not use the problematic .feed(...) method, so perhaps the SeqXML parser could be remodeled to follow that approach?

peterjc added a commit that referenced this issue Mar 29, 2024
Hopefully this will fix #4640.

If it does, this is due to CPython issue 115133
and is a side effect of a CVE security fix.
peterjc added a commit that referenced this issue Mar 29, 2024
See discussion on #4640
@peterjc
Copy link
Member

peterjc commented Mar 29, 2024

(Hopefully changing the chunk size is enough, if not it will be a less simple fix)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants