-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
two test failures affecting Bio.SeqIO.SeqXmlIO in Debian sid #4640
Comments
That is odd. I've only skimmed the tracebacks, however I can give some context. Both the failures are within This suggests to me the issue is either in writing out the SeqXML format, or in reading SeqXML format. Libraries this code imports to do that: from xml import sax
from xml.sax import handler
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl Are there any likely looking XML SAX libraries whose version changed? It might be helpful to try the SeqXML output directly, and validate the XML file. e.g.
The other thing I would double check is the system locale (just because that has caused oddities before). |
Hi Peter,
I looked there first, but the xml.sax looks to be part of the standard library, which is tightly coupled to the python3 interpreter:
So the issue would have been in an external dependency of the module. Skimming through python3.11 shows nothing standing out; dependencies are merely the compiler and libc. The libc hasn't changed, so, urgh, I'm hoping I'm not running into a long winded compiler bug, the only difference I see that might affect xml.sax operation between testing and sid is gcc (through libgcc_s.so.1):
Possibly relevant milestone would be gcc 14-20240207, as I know for sure I witnessed the issue before Berlin sprint on 16th and 17th. If further testing prove this happens to be the relevant change, then I will move back the discussion in Debian and close the present issue.
Thanks for the snippet, for what it's worth I obtained the same output from either testing or sid. I guess this might point towards a problem with reading back the SeqXML? The file obtained out of testing is in attachment if need be.
I think we've been bitten by that in the past. Nowadays we run the test suite enforcing LC_ALL=C.UTF-8. This is probably not the issue. Thank you for your time, |
Re,
I reran the test suite in Debian testing with a targeted upgrade of the libgcc_s.so.1 from sid, and tests passed, so this is not related to the compiler library. If you have a snippet to reread the output of the snippet that produced the xml, I would be taker. Have a nice day, :) |
Well a one line to parse the XML file back again could be:
Also worth a try is:
If that works (outside of |
Thanks for the reader, it looks like the issue stems solely from the reading, I see on sid:
while on testing I have the expected result:
|
Ouch, by the way, I had a look back at my build log from yesterday, and I realized my gcc upgrade attempt didn't apply cleanly, so this is still a variable to consider. |
Progress for sure - and the SAX expat C code is involved somehow. |
Looks like the NixOS people ran into this as well: NixOS/nixpkgs#292138 |
It took me a while, but I finally managed to run the test in a Debian testing environment with upgraded libgcc_s.so.1. The recipe was to upgrade the following packages in a lock step:
They all stem from Gcc 14 version 14-20240303. In the end, the SeqIO test passed in testing with upgraded Gcc libraries. Just in case, I rechecked the status in Debian sid, in which I still witness the failure. Now we know the issue is not from Gcc 14-20240303. |
I must be very tired for not having looked at libexpat1 earlier: the package looks pulled by the "essential set" and thus does not need explicit dependency set (which is why I completely missed it); also Python source code embeds a copy of expat, but Debian builds Python against the operating system's libexpat.so.1. I do reproduce the issue by upgrading testing's libexpat.so.1 from 2.5.0 in testing to sid's 2.6.1. I believe the issue is not in Biopython, but in the interaction between Python and the newer expat library. Debian CI is tripping on Python test errors following libexpat 2.6.1 upgrade in sid, suggesting it won't make it to testing until tests are resolved in Python, fixing possible the Biopython issue at the same time. From Python test attempt against the newer expat:
I think it is safe to close the present issue as it is not a problem in Biopython per se. |
Impressive detective work there. We can hope that Python's libexpat code gets fixed before too long (given you said the Python tests are failing), so hopefully nothing for the Biopython to do here. Thank you all! |
This is now hitting our Python 3.10 Linux runs on GitHub actions |
Hopefully this will fix biopython#4640. If it does, this is due to CPython issue 115133 and is a side effect of a CVE security fix.
Hi Peter,
I had a fresher look at the corresponding Python issue [115133],
and it seems that it is possible to repair some of the tests by
feeding whole lines to the parser. Looking at the corresponding
merge requests [115289], some of the affected tests do just that
and it seems to have been determined that other tests could be
acceptably skipped. Maybe it is possible to just handle the
Biopython tests similarly?
It looks delicate to bring a solution fitting all cases, because
the root issue that caused the change in behavior was a security
vulnerability in Expat.
Hope this helps,
Étienne.
[115133]: python/cpython#115133
[115289]: https://github.com/python/cpython/pull/115289/files
|
Thanks Étienne. After some searches on the Debian log error messages, I was also looking at CPython issue 115133, which suggests parsing with small chunk size is no longer viable due to the security fix. Perhaps the block size relative to the size of the lines of data matters, since 1024 is hardly a small block: https://github.com/biopython/biopython/blob/biopython-183/Bio/SeqIO/SeqXmlIO.py#L387 You may be right about feeding whole lines to the parser being a more complete fix - but it is also more work. Anyway, doubling the chunk size to 2048 bytes seems to help #4683 and might suffice... |
Anyway, doubling the chunk size to 2048 bytes seems to help #4683 and might suffice...
Fair enough, I would even go as far as working with 4k bytes, as it is a common size for memory pages.
|
The XML parser in Bio.Entrez also uses a block size of 1024 bytes. If the problem occurs only in Bio.SeqIO.SeqXmlIO and not in Bio.Entrez, then the bug may actually be in Bio.SeqIO.SeqXmlIO. |
See discussion on biopython#4640
The Entrez XML parser do not use the problematic |
Hopefully this will fix #4640. If it does, this is due to CPython issue 115133 and is a side effect of a CVE security fix.
(Hopefully changing the chunk size is enough, if not it will be a less simple fix) |
Greetings,
This issue may not be an easy one, and even risks being Debian
specific, so if you are busy somewhere else, then I won't mind.
:)
I'm having great difficulties pinpointing a change between
Debian testing and sid which is causing Debian bug #1064147,
affecting both Biopython 1.81 and 1.83 in sid, but none of them
in testing (as of today). The error output comes from the test
suite, and gives in Debian sid:
Version comparison of Biopython direct dependencies suggest they
are not involved in the test regression I observe:
This means that the issue is caused by a transitive dependency
(which I have not managed to identify yet), or something else
entirely. I also reproduced the problem on Debian Salsa CI,
which strongly hints that I haven't mishandled my build
environment. Besides, if one of the direct dependencies had
triggered the bug, then I would have expected the test run
triggered by the package migration CI to have tripped, which
has not happened (which hints that the situation did not appear
before 2024 February 3rd in sid).
I wouldn't be surprised to learn that this isn't be a problem in
Biopython per se, but I'm running out of options without your
thoughts upstream (maybe the combination of these two particular
tests failing and the others passing hints to something in
particular?). Do you per chance have an idea of what I am
missing that could cause the SeqXmlIO records to be empty and
cause these test failures?
Thank you for your time,
Étienne.
PS: here below, the mandatory form for your convenience:
Setup
I am reporting a problem with Biopython version, Python version,
and operating system as follows. In Debian sid:
In Debian testing (no changes visibly):
This is also valid with Biopython 1.81.
Expected behaviour
I would like to make all test items of the test suite pass on
Debian sid.
Actual behaviour
All test items pass on Debian testing. However on Debian sid, I
observe the following test items failing:
Steps to reproduce
Run the test suite.
The text was updated successfully, but these errors were encountered: