Bleach seems to be significantly slower than lxml in 7.1.x+ #1892
Comments
Found the relevant PR here: #1854 |
Interestingly, we (GitHub) looked at using bleach as a second sanitization step on this project and decided against it because it caused too many notebooks to time out and increased render times by more than 75%. |
Along the same lines of The two could then be pitted against each other during test, probably after some further homogenization.
|
Hi, author of #1854 here. Didn't imagine I'd open such a can of worms (#1863, #1849...)... The issue is |
No worries. We can't make everybody happy all of the time. Given "simple" and "fast," the former should win as a minimum-testable-product, but having a documented approach for the latter is important.... so many different use cases.
I guess with free software on non-free operating systems, you get what you pay for.
Of note, the |
@bollwyvl Meh, the "get what you pay for" argument is a bit naff, since There's also a bit of discussion (can't find a link right now) around whether or not libxml, libxslt and the other C friends required by lxml should be linked statically into the lxml module shipped by a PyPI wheel – this is of extra interest when there are (security) bugs in those libraries, see e.g. GHSA-wrxv-2j5q-m38w, and if the wheels were dynamically linked, you'd easily bump into ABI incompatibilities (I think this happened with xmlsec). |
For the time being, a workaround, if you're willing to try it, is to patch
This essentially reverts the change made to the filters dict in b40bb13. It should be noted, though, that |
We could help by setting up a devcontainer for codespaces development of this repo, then folks on a mac could just use a pre-configured linux environment for this. |
Not to be too blunt, but containers and SaaS solutions sweep these problems under the rug, and are not viable in a number of settings or for a number of users. Feel free to propose and maintain such a solution in perpetuity. But I feel the issue at hand is not ease of development on this repo, but balancing performance and deployment simplicity on millions of devices: |
You know far more about how this is used than I do. I just thought it may help in some capacity. |
I would imagine that Bleach is painfully slow on a Raspberry PI if it is this slow on a powerful machine 😓 |
But can be easily installed, works, and requires no porting to new platforms where python works. As suggested,
But even if that worked gangbusters, and generated identical output to |
As an update, It looks like a maximally secure approach would still have to use Further, as the stdlib XML parser is used in a number of parsing-related places, one might still want to prefer having So there's certainly an opportunity to garner a number of performance gains, but before doing so, having an actual benchmark suite with e.g. |
That makes sense to me- I imagine if
On board with this, benchmarking the uses of each library and comparing and contrasting their tradeoffs would be great for confidence and documenting the decision made for future contributors and library consumers that may want to try out different sanitizers. |
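To make the benchmarking idea concrete, here is a minimal sketch of a timing harness. It is not part of nbconvert; the function names `collect_sanitizers` and `benchmark` are made up for illustration, and `html.escape` is only a stdlib baseline, not a real sanitizer. `bleach` and `lxml` are picked up only if they happen to be importable (newer lxml releases no longer ship `lxml.html.clean` by default, so that import may fail).

```python
import html
import timeit
from typing import Callable, Dict


def collect_sanitizers() -> Dict[str, Callable[[str], str]]:
    """Return whichever candidate sanitizers are importable here (hypothetical helper)."""
    # Baseline only: escaping everything is a speed floor, not an equivalent sanitizer.
    found: Dict[str, Callable[[str], str]] = {"html.escape": html.escape}
    try:
        import bleach  # optional third-party dependency

        found["bleach"] = bleach.clean
    except ImportError:
        pass
    try:
        from lxml.html.clean import clean_html  # optional; may be absent in newer lxml

        found["lxml"] = clean_html
    except ImportError:
        pass
    return found


def benchmark(
    sanitizers: Dict[str, Callable[[str], str]],
    sample: str,
    repeat: int = 3,
    number: int = 20,
) -> Dict[str, float]:
    """Return the best observed seconds-per-call for each sanitizer on `sample`."""
    results: Dict[str, float] = {}
    for name, func in sanitizers.items():
        # Bind func as a default argument so each timer calls the right sanitizer.
        timings = timeit.repeat(lambda f=func: f(sample), repeat=repeat, number=number)
        results[name] = min(timings) / number
    return results


if __name__ == "__main__":
    sample = "<p>hello <script>alert(1)</script> <b>world</b></p>" * 200
    for name, seconds in sorted(benchmark(collect_sanitizers(), sample).items()):
        print(f"{name:12s} {seconds * 1e3:8.3f} ms/call")
```

Timing alone says nothing about whether the outputs are equivalent, so a real suite would also need to diff the sanitized results on a corpus of notebooks.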
Of note: bleach 6.0 is out (with some breaking changes, the impact of which I haven't yet assessed), but it is also now officially deprecated. We'll need to do something within the next year or so. |
FWIW the underlying Ammonia library allows essentially arbitrary rewriting of attributes, though it has a fair amount of special cases which don't use this engine (e.g. classes rewriting), and no special support for For nbconvert two options would be to either have the nh3 API expanded for this (and possibly other things, nh3 is currently quite limited), or to depend on Ammonia directly and build its own sanitizer with the additional build and distribution complexity that entails as nbviewer would now be a non-native wheel. Though this could also solve distribution issues related to nh3: distros which provide nbconvert would need to add nh3 to their packages to support an nh3-based nbconvert. I'm not sure what the policies are for build-time dependencies but e.g. ammonia is already in debian testing. A third option would be to get lxml to use ammonia for cleaning internally, but
|
pygments emits line numbers via tables, with tr and td elements (https://pygments.org/docs/formatters/#HtmlFormatter). lxml's clean_html considers table, tr and td as safe elements, but with jupyter#1854 they are now considered unsafe. So instead of displaying line numbers, the table, tr and td elements are escaped, and show up as literal HTML if trying to enable line numbers via the method introduced in jupyter#1683. This PR adds table, tr and td as safe elements so that line numbers can continue to work. I know that there are probably plans to move away from bleach (jupyter#1892), but this is a small and focused change so hopefully doesn't need to block on
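The escaping behaviour described here can be illustrated with a toy, stdlib-only allowlist filter. This is a sketch of the general "keep allowed tags, escape the rest" idea, not bleach's or nbconvert's actual implementation; `AllowlistEscaper` and `sanitize` are hypothetical names, attributes on allowed tags are simply dropped, and the code is not suitable for real sanitization.

```python
from html import escape
from html.parser import HTMLParser


class AllowlistEscaper(HTMLParser):
    """Toy filter: keep allowlisted tags, turn every other tag into literal text."""

    def __init__(self, allowed_tags):
        super().__init__()  # convert_charrefs=True: text arrives unescaped
        self.allowed_tags = frozenset(allowed_tags)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.allowed_tags:
            self.parts.append(f"<{tag}>")  # attributes dropped in this sketch
        else:
            self.parts.append(escape(self.get_starttag_text()))

    def handle_endtag(self, tag):
        raw = f"</{tag}>"
        self.parts.append(raw if tag in self.allowed_tags else escape(raw))

    def handle_data(self, data):
        # Re-escape text so it cannot be reinterpreted as markup.
        self.parts.append(escape(data, quote=False))


def sanitize(markup, allowed_tags):
    parser = AllowlistEscaper(allowed_tags)
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


# A line-number table similar in shape to what a highlighter might emit.
markup = "<table><tr><td>1</td><td>print(1)</td></tr></table>"

# Without table/tr/td on the allowlist, the tags show up as literal text.
strict = sanitize(markup, {"div", "span", "pre"})

# With them added, the table structure survives.
relaxed = sanitize(markup, {"div", "span", "pre", "table", "tr", "td"})
```

With the strict allowlist the reader sees `&lt;table&gt;...` rendered as text, which matches the symptom reported for line numbers.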
Description
After updating nbconvert from 6.5.0 to 7.2.2 for use here at GitHub, we discovered the notebook rendering service degraded in render time (almost twice as long to render). We looked into what might have caused this, and it looks as though the switch to bleach for sanitization is the culprit here. I'm wondering if we could provide a configuration for users to pass that would allow them to choose the cleaning library they would prefer to use, considering it could have a major performance impact.
In the meantime, our team should be able to safely upgrade from 6.5.0 to 7.0.0, though moving forward we'd love to keep in close sync with the latest releases: a couple of steps behind, if not fully up to date.
I recognize this may have been a deliberate design decision, and I'm interested in looking into making a PR myself. Any pointers/recommendations would be appreciated 🙇🏾 .
Relevant PR: #1854
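As a rough illustration of the kind of configuration being asked for, a pluggable sanitizer could look something like the sketch below. Everything in it is hypothetical: `register_sanitizer`, `get_sanitizer`, and `render_cell` are invented names, not nbconvert options, and the only registered backend is a stdlib escape used as a stand-in.

```python
import html
from typing import Callable, Dict

Sanitizer = Callable[[str], str]

# Hypothetical registry mapping a config value to a cleaning function.
_SANITIZERS: Dict[str, Sanitizer] = {}


def register_sanitizer(name: str, func: Sanitizer) -> None:
    """Make a cleaning function selectable by name."""
    _SANITIZERS[name] = func


def get_sanitizer(name: str) -> Sanitizer:
    """Look up a cleaning function, failing loudly on unknown names."""
    try:
        return _SANITIZERS[name]
    except KeyError:
        known = ", ".join(sorted(_SANITIZERS)) or "<none>"
        raise ValueError(f"unknown sanitizer {name!r}; known: {known}") from None


# Stand-in backend; a real setup would register bleach- or lxml-based cleaners here.
register_sanitizer("escape", html.escape)


def render_cell(markup: str, sanitizer: str = "escape") -> str:
    """Clean one chunk of output with the sanitizer chosen by the user."""
    return get_sanitizer(sanitizer)(markup)
```

A user who prefers a faster or stricter backend would then register it once and select it by name, leaving the default untouched for everyone else.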
Screenshots
7.0.0 Profile
7.0.0 Flamegraph
7.2.2 Profile
7.2.2 Flamegraph