SECURITY: api.load() recklessly downloads & runs arbitrary python code #2283
Comments
-1 from me on all points:
I agree we could try to devise a better scheme, where editing an existing release file triggers some warning / is impossible. Just to prevent accidents, not to stop malicious actors with repo write access (that's not possible).
This is far more subtle and dangerous than that typical risk. With a normal package – like gensim itself – you only get fresh code when you explicitly choose to install or upgrade it, and that code has passed through version control, review, and a formal release that many other users are also running. None of those mitigating factors help with this risky practice. The gensim `api.load()` call can pull brand-new executable code at any time.

This isn't: "I'm trusting the maintainers of the project to get their major releases right, or for the larger community of users to eventually notice any problem." This is, "I'm trusting the maintainers (or anyone else who gets even temporary access to their github accounts) to inject new executable code onto my system at undisclosed later points in time, without any formal review or release process, or logging." That's very different from the threat of a bad-actor sending version-controlled code into the normal review/release/publish/install process.
That's not at all clear in the `api.load()` docs. I'm relatively sure the NLTK loading mechanism only loads non-executable data. The expectation is not that a dataset wouldn't need custom code. It's that a mere "load" will only bring inert data, and that executable code is only installed & run at explicit, chosen software-installation boundaries. And, there's no need for this recklessness! There could easily be a properly-versioned, `pip`-installable home for any dataset-specific code.
That logic may make sense for large, non-executable data when facing certain implementation limits. It makes no sense for short executable source files. Also, part of the general reason for a release to be "immutable" is to allow reasoning about it as a frozen artifact. If every gensim release can later load new arbitrary executable code from the network, whenever a new dataset shim appears, then no release can really be reasoned about as frozen.
I can't find anything in the github interface which lists who could change the files at https://github.com/RaRe-Technologies/gensim-data/releases/download/*. I believe that info is private to the project owner(s).
We could adopt the baseline standard that executable code is only delivered through explicit installs, and all executable code is version-controlled, as per my recommendation. That limits the damage a bad-actor with repo access can do. Then, damaging gensim users requires either compromising the full release process to PyPI & other repos, or tricking knowledgeable users (who consciously choose to install pre-release code) to load from a branch that's been visibly compromised with logged/attributed bad commits.

As it stands, a bad actor who even temporarily compromises the Github accounts of @menshikh-iv, @piskvorky, @chaitaliSaini, and perhaps others (whose admin privileges I can't see) can make it so that any gensim user running an older gensim release, but executing a fresh `api.load()`, silently runs the attacker's code.

I'm a bit surprised to have to make this case. If you aren't convinced by my examples, please check with others whose opinions you'd trust on what sort of dynamic code load-and-execution practices are safe or commonly expected for open-source libraries, without prominent disclosure.
If I understand you correctly, you'd like the code delivered separately (installed via `pip`), with only inert data fetched at load time. As long as the releases are immutable, this difference is cosmetic / documentation. So the real question is, how can we ensure that releases stay immutable?
I don't know what you mean by "revved" – there's no revving of releases allowed in `gensim-data`. I'm not opposed to separating the code from data. But it'd not be for versioning reasons (there's no versioning in immutable releases), nor to limit "repository write access for malicious users" (completely wrong resolution for this attack vector). It'd be more to catch accidents and to ensure all changes stay above board.

May be worth asking Github support whether they support "immutable release files" = the simplest solution. Another option would be to inform users to review the code associated with the dataset they're installing, the same way you'd ask them to review any other code they're about to run.

Pull requests welcome! Unfortunately we won't have time to get to this in the near future ourselves.
I partially agree with @gojomo, this really is a security risk (exactly because a "GitHub release" has no guarantee of immutability – that guarantee exists only by our agreement). I'm +1 for:
I'm not sure about "gensim-data" as a python package (that would store the code for loading). On one hand – good idea (consistent, fully tracked); on the other hand – making a release for each new dataset is pretty annoying, plus the user always has to upgrade the package to see new datasets. BTW, "immutable release files" on GitHub wouldn't help much: you'd have no chance to fix a mistake in that case (in the current scheme, it's hard to test a dataset properly before you make a github release – we'd need that first).
With regard to the potential addition of HuggingfaceHub-specific downloading functionality in #3206, @piskvorky writes:
I'm against integrating the PR in #3206, for the reasons outlined in a comment there.
Given that, it shouldn't have to wait for anyone to "architect a better system". It should just be discontinued ASAP – especially if there are no current maintainers eager to fix its issues. A wiki page full of download links can replace it: "If you want to work with data X, here's a link to a bundle: x.gz. Download, expand, view & work-with the files inside. Remember that if you unpickle a data-file, or run a script from inside it, you're executing downloaded code."

I'd even argue that explicit process is better for users' understanding of what data files are now on their local drive, and what code/classes/calls they might use to work with it. Compare the current approach, of encouraging extremely-novice users to run code like:
What do you think `api.load("fasttext-wiki-news-subwords-300")` hands back? There's zero documentation of what kind of object this opaque magic call just delivered. Here and here are two examples of users in just the last few days who quite reasonably thought it might be a full FastText model. But it's not, and nothing at https://github.com/RaRe-Technologies/gensim-data/releases/tag/fasttext-wiki-news-subwords-300 tells you what the loading shim actually does. You have to download it & view it locally to know what it's doing. (It turns out it's only using a static set of full-word output vectors from FastText, and loading them as a plain `KeyedVectors` object.)

And while the release is attributed to @menshikh-iv, I believe any one of the (nowhere-publicly-displayed) maintainers of this project can replace any of those 'asset' files via a Github upload interface at any time. That is, the normal processes which create confidence in the code that's shipped as part of the core Gensim repo, or officially packaged releases – attributed commit logs, test cycles, formal releases – are not active for this source code file.

And what if someone wanted to improve this confusing user experience through the usual open-source channels – a pull request, a review, a changelog entry? So this idiosyncratic delivery-method is also resistant to the normal mechanisms by which an open-source project can improve over time.
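The snippet originally quoted here isn't preserved in this copy, but the standard `gensim.downloader` invocation for the dataset discussed below is presumably along these lines:

```python
import gensim.downloader as api

# one opaque call: downloads data *and* a code shim, runs the shim, returns... something
model = api.load("fasttext-wiki-news-subwords-300")
```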
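For contrast, a minimal sketch of the explicit alternative for this particular dataset – assuming (as the shim suggests) that the downloaded asset is a gzipped word2vec-format text file; the local filename below is illustrative:

```python
from gensim.models import KeyedVectors

# Assumed: the release asset was downloaded by hand (browser/wget) and saved
# locally under this illustrative name, in gzipped word2vec text format.
vectors = KeyedVectors.load_word2vec_format(
    "fasttext-wiki-news-subwords-300.gz", binary=False
)
print(vectors.most_similar("computer", topn=3))
```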
That's option 3), I guess. I also mentioned that option (with a replacement) in the part of the comment you didn't quote here:
To be clear, I'm also not a fan of the current design of `gensim-data`.
Indeed, we should 'replace' if possible – but an easy way to do that is to set up a replacement web page, with all the same info as in the current `gensim-data` release pages. The key value of the collection – reliable download URLs for known useful datasets – will remain, needing just a few extra clicks from users to use. And those extra explicit steps would be a positive improvement over the opaque magic of `api.load()`.
The difference between such a version and the current release pages seems mostly cosmetic to me. But either way could work.
re: <<add some API / process to load those "static packages" in a standardized way. Possibly as wheels, spacy-style.>>

Is this likely to ever happen? No one's touched this hairball in years. No users seem to be clamoring for a Gensim-specific dataset library, either in the current `api.load()` form or a new wheel-based one. And I still haven't seen any conjectured benefits that can't be matched (or exceeded!) by a web page that simply says, "here's a download link, here's 2-3 lines of code to load those files into Gensim".
The datasets have hundreds of thousands of downloads, according to Github stats. They are useful to many people, myself included. I wish I had the time to properly prepare and add more datasets (esp. the requested ones). Don't let the fact that some users complain fool you – that's always the case. Your vitriol toward the downloader seems out of proportion.
The iterator classes are far from 2-3 lines of code. It's 2-3 lines only if the data accessor / parser is already built into Python and/or Gensim. So we'd have to let people download the Python sources (~ the current `__init__.py` shims), and then let them manage these ad-hoc modules locally, place them under their path, import (or copy-paste?) them into their own code. And fluff up our own tutorials & docs with such instructions. That doesn't sound like an improvement to me.
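For a sense of what such a shim carries, here is a deliberately simplified sketch – not the actual `gensim-data` code; the real iterators handle more formats and edge cases – of a streaming reader over a gzipped, one-document-per-line corpus:

```python
import gzip


class LineCorpus:
    """Stream a gzipped plain-text corpus, yielding one token list per line."""

    def __init__(self, path):
        # assumed layout: one document per line, whitespace-separated tokens
        self.path = path

    def __iter__(self):
        with gzip.open(self.path, "rt", encoding="utf-8") as fin:
            for line in fin:
                yield line.split()


# usage sketch:
#   from gensim.models import Word2Vec
#   model = Word2Vec(sentences=LineCorpus("some-corpus.txt.gz"))
```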
That's useful info; I'd been looking for stats & couldn't find them. Are they only available to project maintainers? Are they broken down by 'asset' downloaded? Which one is most popular?
It's not so much that they complain - but they are confused by the underdocumented behavior, & stay ignorant of what file-formats, & APIs, they're actually using.
There are lengthy custom iterator classes inside some of the `__init__.py` files. How could one review this code, or report bugs or propose/discuss improvements? If it's useful code, shouldn't it be in real version-control and/or an official release?
I googled this up. The stats differ quite a bit between the individual datasets.
Staying ignorant is a desirable state for many :-) That users don't have to care about file formats and APIs and locations is a victory. That's the main appeal of `api.load()`. Although there should be a path for the occasional power-user to dig deeper, we shouldn't confuse users of course.
Yes.
Very interesting, thanks! But I think most of those large numbers may be artifacts of automated processes. For example, the top download counts could easily be dominated by scripted, repeated fetches rather than distinct users.

Your testimony, & user reports, & the variety of other tallies all imply there's some real usage. But even thousands of those numbers could easily be web crawlers, or people finding the 'assets' download URLs & downloading directly from a browser/`wget`.
Indeed, and that's great if you take users the whole way to new capabilities, smoothly, with irrelevant details hidden. But it's a disservice if it hides relevant details, leaving people confused as to what's actually happening, or what capabilities have just been imported. (This is of course especially the case with the 'fasttext-subwords' model that's really just a plain set of full-word `KeyedVectors`.)
Then I'd suggest some good incremental steps could be:
The code & docs would then be amenable to further improvement/extension using normal contribution processes. Users would receive fair warning of what & where is landing in their local storage, and that they're choosing to run new code as if from a notebook or cut & pasted online example. |
I hope not! That's a >1 GB download. For the more niche models, I consider 10k, or even "mere" thousands of downloads a success. That's what I'd expect from such published corpora or models – useful to a niche community of hundreds-to-thousands of researchers, most of whom download one particular dataset a few times, during the course of their project.
Relentless :) And absurd. I think I understand your strategy – keep pushing and nagging until I rewrite the downloader. I'm pretty close TBH, it's working.
What do you mean by this? At which point are assets downloaded, if not at `load()` time?
If I understand correctly, your proposal is adding a new extra optional step between downloading the data and loading it. I totally agree with the explicitness in output / progress. Breaking down the single `api.load()` call into more explicit steps seems reasonable.
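For concreteness, one sketch of such a two-step flow, using the downloader's existing `return_path` keyword (whether that path-only route avoids fetching the shim at all is a separate question; the gzipped word2vec-text format of the asset is assumed here):

```python
import gensim.downloader as api
from gensim.models import KeyedVectors

# Step 1: fetch the data and make the location explicit to the user.
path = api.load("glove-wiki-gigaword-50", return_path=True)
print("data file:", path)

# Step 2: the user explicitly chooses the loading code for that file
# (assumed: a gzipped word2vec-format text file).
vectors = KeyedVectors.load_word2vec_format(path)
```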
A few thousand real users would be plenty. Even a few dozen, especially if it's key work/learning that wouldn't happen otherwise! I just know that any hit-count could be entirely artificial, without other confirmation of real usage. In particular, regarding:
Isn't doc-comment example code like this auto-run in testing? That makes me think the ~54k figure could be largely CI runs rather than people. Similarly, the other large counts may be inflated the same way.
I'd expect that still, the single call to `api.load()` would handle everything for the user. But I mean, as a quick fix: it stops pulling individual executable, instructive source code (like the per-dataset `__init__.py` shim) from mutable release 'assets'. If it needs to still pull that file separately, it might as well target the repo at a certain tag. E.g. an URL more like https://raw.githubusercontent.com/RaRe-Technologies/gensim-data/fasttext-wiki-news-subwords-300/generate_table.py

At least such an URL would allow the source to be viewed in a browser. (Though, I'm not sure if Github's commitment to serving such URLs indefinitely is as strong as their 'assets' mechanism.)

Better would be: the code + data gets packaged as part of a formal, version-control-tagged release into one bundle. That frozen artifact would be fine to download from the 'assets' mechanism.

Best would be: users get all executable code only through explicit, versioned `pip` installs. When users need an extra big supplementary optional blob of data, that would then come via the (out-of-band wrt normal development) 'assets' mechanism. For normal data that's truly "just data" (plain-text stuff), all's good. (If and when we start putting actual pickled models in that mechanism, a token warning reminding users that particular new data they've downloaded includes pickled objects, and that to unpickle is equivalent to running arbitrary code, would be nice.)
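As a sketch of that quick fix – fetching the file from a tag-pinned raw URL rather than a mutable release asset (the URL/path is taken from the example above and may not match the repo's actual layout):

```python
import urllib.request

# Pinned to a tag, so the fetched bytes correspond to a commit that is
# visible (and reviewable) in version control.
url = (
    "https://raw.githubusercontent.com/RaRe-Technologies/gensim-data/"
    "fasttext-wiki-news-subwords-300/generate_table.py"
)
urllib.request.urlretrieve(url, "generate_table.py")
# Downloading is the harmless part; the point above is that the file should
# only ever be *executed* via an explicit, user-chosen install/run step.
```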
Roughly, yes, to import the function to where it's callable.
Yes: the threshold of executing new code that you've just downloaded should be vividly disclosed, and crossed by explicit user choice to import & run some new code file – rather than hiding behind what looks like a mere static-data-load. (I don't think this hypothetical extra step needs to be onerous.)
Sure, most people remain in a "it won't hurt for me to gloss over this extra risk this one more time, everyone's doing it" mindset unless/until they or someone close to them gets burned. Then, their customer data is exfiltrated & sold on the dark web; their systems are ransomwared; their online accounts are drained. After that, they may join the power-user minority. Note that in what I consider the best case – all dataset-specific load-shim code released as part of an explicit `pip`-installed package – the typical user experience stays just as simple.
And, the method provides a place to hang (& improve over time) docs on what you're getting – or warn/redirect people if a model is ever withdrawn/usefully-updated/etc.
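A tiny self-contained demonstration of why a seeming data-load can silently cross that threshold – this is generic Python `pickle` behaviour, not anything gensim-specific:

```python
import pickle


class Payload:
    def __reduce__(self):
        # whatever __reduce__ returns gets *called* at unpickling time
        return (print, ("arbitrary code just ran during a 'data load'",))


blob = pickle.dumps(Payload())  # pretend this blob arrived over the network
pickle.loads(blob)              # looks like loading static data, but executes code
```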
@mpenkov are we downloading these files regularly as part of our CI / testing?
Hm, what's the difference to the current system? If a malicious person takes over these accounts or the repo, they can still swap the files just the same.
At that point, we could also make that code a part of `gensim` proper.
Frankly, I see PyPI as a bigger risk vector than any of this `gensim-data` machinery.
OK. I'll go back and review why we went with a separate repo in the current style. I'm sure there were reasons, even if its execution was lacking. Unless the reasons were compelling, we could switch (transition) to an all-downloads-code-installed-as-one-package style, whether as part of `gensim` itself or as a separate `gensim-data` package on PyPI.
Sometimes. Not all of the gensim doctest code actually gets run. From what I understand, most of it actually does not, because of large data dependencies like this.
And what determines whether something gets run or not?
It's about as bad, from a security-risk standpoint, if the bundle is still downloaded/expanded/executed inside a single `api.load()` call. It's slightly better in that the Python code is under real version-control before being bundled, & would be slightly better again if the user has to explicitly download/expand the bundle & then choose to run a new code file inside it. But putting the code in an explicit release-via-PyPI would be better still.
That's a good question. I can understand the original hope may have been that datasets might arrive at a much more frequent cadence than `gensim` releases. I know that despite automation, there's still a bunch of manual effort involved in pushing out a full `gensim` release.

Personally, for these tiny swatches of example code backed by larger data, I'd still prefer a web-based gallery of examples/notebooks, where users click/wget to download data bundles, then copy & paste a few lines of code showing usual loading. That's vividly instructive, and easy to maintain/improve, constantly, as wiki- or repo- edits, without any release formalisms. (Though, something like a 10+ line iterable class customized for certain formats would still go into the related `gensim` release.) This could take the form of a wiki page or a gallery page in the project docs.
I agree that's also a big risk – I suspect it's only a matter of time before PyPI, or similar repos for other languages, is implicated in some major headline-grabbing malware/ransomware incident. There've been a bunch of issues this year: https://www.theregister.com/2021/03/02/python_pypi_purges/

The best we can do is protect our credentials/processes as well as possible so that we're not the early vector, and our users not the early victims – as other lower-hanging targets gradually get exploited & overall habits improve in response. Barring compromise of the PyPI administrators or infrastructure, there is some safety in the number of users serially using a release. When you're the 10,000th person to install a given release, any problem has had thousands of chances to be noticed before it reaches you.
I could potentially do some 'axe-with-minimalist-stopgap' work. I don't think that's your preferred path, but it'd be: deprecate `api.load()`, and point users at a plain page of download links plus short loading snippets. Actually designing/implementing something better than that, or designing/maintaining release processes, would be more than I could sign up for. If it were easy enough to load a new dataset/model into some new repo/gallery, I might at some point like to get some pretrained models added there myself.
I had a closer look at the repo and it doesn't look like anything is getting automatically doctested. My previous reply ("sometimes") was due to confusion with smart_open. Over there, we do have doctests, which we run prior to each release.
OK, thanks. That's what I thought too – we check docs for syntax, at most. We don't actually run the code there.
That is of course possible. We have no control / reliable stats over who downloads from those release URLs, or why. The relative numbers check out though: from thousands for fringe datasets, to hundreds of thousands for the "superstar" ones. Of which I'm personally responsible for at least dozens, for sure, even without automation.
The `api.load()` utility will grab fresh Python code, in the form of an `__init__.py` file inside the Github project `gensim-data` download area, and then run it at the user's machine.

This form of dynamic code loading & execution:

- isn't disclosed in the `api.load()` docs (that this could occur).
- means that anyone who can edit the `gensim-data` Github "releases" files can cause arbitrary new code to run on gensim users' machines, when those users use `api.load()`. The users would think, "I'm still using this old, assumed-safe-through-widespread-use gensim-X.Y.Z version" – but they'd be running arbitrary all-new code from over the network. It's hard to tell who has `gensim-data` project rights. It's also not clear that anyone would quickly notice edits/changes there.

Further, these `__init__.py` files in the `gensim-data` releases aren't even in version-control – instead, they're uploaded as 'assets' via the Github releases interface. (There's no file-size need to do this; the `__init__.py` I reviewed, for `wiki-english-20171001` at https://github.com/RaRe-Technologies/gensim-data/releases/download/wiki-english-20171001/__init__.py, is just a tiny shim and I imagine most other such files are, as well. It's code, it should be versioned.)

That they are not in version-control makes them hard to review through normal means (such as browsing the Github website), and raises the possibility they could be changed to something malicious, and then back again, without anyone noticing or it being visible in any persistent logs.

Recommendations:

- The `api.load()` mechanism should be immediately redesigned to not load any new code over the network – and the developer guidelines for gensim & associated projects should make it clear such dynamic loading of code outside normal package-installation processes (like `pip`) is unacceptable.
- If supporting new datasets requires dataset-specific code, that code should go through normal collaboration/version-control/release procedures, waiting for a new `pip`-installable `gensim` (or other new supporting project) explicit release before running on users' computers.

ATTN: @menshikh-iv @piskvorky @chaitaliSaini
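Schematically, the pattern described above amounts to something like the following sketch – an illustration of the general download-and-execute anti-pattern, not gensim's literal implementation; the URL is the one cited in the report:

```python
import importlib.util
import urllib.request

# Fetch whatever Python file happens to sit at the release URL *today*...
url = (
    "https://github.com/RaRe-Technologies/gensim-data/releases/download/"
    "wiki-english-20171001/__init__.py"
)
urllib.request.urlretrieve(url, "dataset_shim.py")

# ...and execute it on the user's machine, with no install or review step.
spec = importlib.util.spec_from_file_location("dataset_shim", "dataset_shim.py")
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
```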