SECURITY: api.load() recklessly downloads & runs arbitrary python code #2283

Open

gojomo opened this issue Dec 3, 2018 · 23 comments
Labels
difficulty medium (requires good gensim understanding & Python skills) · feature (describes a new feature)

Comments

@gojomo
Collaborator

gojomo commented Dec 3, 2018

The api.load() utility grabs fresh Python code, in the form of an __init__.py file from the Github gensim-data project's download area, and then runs it on the user's machine.
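For concreteness, here is a minimal sketch of the pattern being described – an illustration only, not gensim's actual implementation; the base URL, temp path, and load_data() entry point are all assumptions:

    # Illustrative sketch of the dynamic-load pattern -- NOT gensim's actual code.
    # The base URL, temp path, and load_data() entry point are assumptions.
    import importlib.util
    import urllib.request

    def fetch_and_run(name, base_url="https://example.invalid/releases"):
        """Download a dataset's __init__.py shim, then execute it locally."""
        path = f"/tmp/{name}__init__.py"
        urllib.request.urlretrieve(f"{base_url}/{name}/__init__.py", path)
        spec = importlib.util.spec_from_file_location(name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # whatever the remote file contains runs here
        return module.load_data()        # assumed entry point inside the shim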

This form of dynamic code loading & execution:

  • violates user expectations: requesting a dataset should not run arbitrary new code that was never explicitly installed on the user's machine. Further, there's no indication in the api.load() docs that this could occur.
  • creates a severe security risk: if a bad-faith actor obtains the ability to edit the gensim-data Github "releases" files, they can cause arbitrary new code to run on gensim users' machines, when those users use api.load(). The users would think, "I'm still using this old, assumed-safe-through-widespread-use gensim-X.Y.Z version" – but they'd be running arbitrary all-new code from over the network. It's hard to tell who has gensim-data project rights. It's also not clear that anyone would quickly notice edits/changes there.

Further, these __init__.py files in the gensim-data releases aren't even in version-control – instead, they're uploaded as 'assets' via the Github releases interface. (There's no file-size need to do this; the __init__.py I reviewed, for wiki-english-20171001 at https://github.com/RaRe-Technologies/gensim-data/releases/download/wiki-english-20171001/__init__.py, is just a tiny shim and I imagine most other such files are, as well. It's code, it should be versioned.)

That they are not in version-control makes them hard to review through normal means (such as browsing the Github website), and raises the possibility they could be changed to something malicious, and then back again, without anyone noticing or it being visible in any persistent logs.

Recommendations:

The api.load() mechanism should be immediately redesigned to not load any new code over the network – and the developer guidelines for gensim & associated projects should make it clear such dynamic loading of code outside normal package-installation processes (like pip) is unacceptable.

If supporting new datasets requires dataset-specific code, that code should go through normal collaboration/version-control/release procedures, waiting for a new pip-installable gensim (or other new supporting project) explicit release before running on users' computers.

ATTN: @menshikh-iv @piskvorky @chaitaliSaini

@piskvorky
Owner

piskvorky commented Dec 4, 2018

-1 from me on all points:

  • If a bad actor gains access to the repo, we're screwed anyway (python packages always execute arbitrary code)
  • Why do you think there's an expectation of not running any code with a dataset? AFAIK the opposite is true: our contract is that the dataset comes with sane code to use it (as opposed to only raw data).
  • Releases are meant to be immutable (never changed), so version controlling them makes no sense. Dataset files are published as Github releases (as opposed to tracked inside the repo itself) due to severe file size limitation on Github.
  • Seeing who has gensim or gensim-data access rights is completely standard.

I agree we could try to devise a better scheme, where editing an existing release file triggers some warning / is impossible. Just to prevent accidents, not to stop malicious actors with repo write access (that's not possible).

@gojomo
Collaborator Author

gojomo commented Dec 4, 2018

  • If a bad actor gains access to the repo, we're screwed anyway (python packages always execute arbitrary code)

This is far more subtle and dangerous than that typical risk. With a normal package - like gensim itself - you only get fresh code when you pip install. As a result, there are more eyes on code as it enters the release branch, then gets approved for release, then winds up at a package repository. Even if all those steps fail, you're getting the same code as all others who install the same-versioned package, so if/when there's malicious code found, it's more likely to be discovered, and a loud alert sounded, and community remedial actions taken.

None of those mitigating factors help with this risky practice. The gensim api.load() code is loading and executing a small bit of code that's not in version control and was reviewed by no one but whomever uploaded it to the Github releases file area. (Further, from a quick glance, it appears release files can be added, changed, and removed without necessarily leaving any persistent public log.)

This isn't: "I'm trusting the maintainers of the project to get their major releases right, or for the larger community of users to eventually notice any problem." This is, "I'm trusting the maintainers (or anyone else who gets even temporary access to their github accounts) to inject new executable code onto my system at undisclosed later points in time, without any formal review or release process, or logging." That's very different from the threat of a bad-actor sending version-controlled code into the normal review/release/publish/install process.

  • Why do you think there's an expectation of not running any code with a dataset? AFAIK the opposite is true: our contract is that the dataset comes with sane code to use it (as opposed to only raw data).

That's not at all clear in the api.load() documentation or examples. And perhaps this is a generational/cultural gap, but I personally would not expect any software or library, by security-competent developers, to blind-download code from a network location, without clear disclosure of that behavior, and much stronger processes around what could appear at that location.

I'm relatively sure the NLTK loading mechanism only loads non-executable data.

The expectation is not that a dataset wouldn't need custom code. It's that a mere "load" will only bring inert data, and that executed-code is only installed & run at explicit, chosen software-installation boundaries.

And, there's no need for this recklessness! There could easily be a properly-versioned gensim-datasets package. When a dataset's custom code must be revved, that package would get normal review/release procedures... and when a user wanted to use that new code, they'd just pip install gensim-datasets --upgrade. (And still, the bulk data itself would exist somewhere else to be fetched separately.)

  • Releases are meant to be immutable (never changed), so version controlling them makes no sense. Dataset files are published as Github releases (as opposed to tracked inside the repo itself) due to severe file size limitation on Github.

That logic may make sense for large, non-executable data when facing certain implementation limits. It makes no sense for short executable __init__.py files, which could arbitrarily compromise a user's system. Wouldn't you agree that having often-run code, with potentially unbounded effects, completely outside normal source viewing/versioning is a bad practice?

Also, part of the general reason for a release to be "immutable" is to allow reasoning about it as a frozen artifact. If every gensim release can later load new arbitrary executable code from the network, whenever a new api.load() is run, those releases have lost an important aspect of their 'immutability' for analytical/trust purposes.

  • Seeing who has gensim or gensim-data access rights is completely standard.

I can't find anything in the github interface which lists who could change the files at https://github.com/RaRe-Technologies/gensim-data/releases/download/*. I believe that info is private to the project owner(s).

I agree we could try to devise a better scheme, where editing an existing release file triggers some warning / is impossible. Just to prevent accidents, not to stop malicious actors with repo write access (that's not possible).

We could adopt the baseline standard that executable code is only delivered through explicit installs, and all executable code is version-controlled, as per my recommendation. That limits the damage a bad-actor with repo access can do. Then, damaging gensim users requires either compromising the full release process to PyPI & other repos, or tricking knowledgeable users (who consciously choose to install pre-release code) to load from a branch that's been visibly compromised with logged/attributed bad commits.

As it stands, a bad actor who even temporarily compromises the Github accounts of @menshikh-iv, @piskvorky, @chaitaliSaini, or perhaps others (whose admin privileges I can't see) can make it so that any gensim user using an older gensim release, but executing a fresh api.load(), will run their new malicious code. I'm not sure any of you would even get a notification that one of the files under https://github.com/RaRe-Technologies/gensim-data/releases/download/* has changed. The bad actor wouldn't have made any noticeable commits, nor would they have had to separately compromise any PyPI release credentials, nor create any sort of noticed package release/version-increment.

I'm a bit surprised to have to make this case. If you aren't convinced by my examples, please check with others whose opinions you'd trust on what sort of dynamic code load-and-execution practices are safe or commonly expected for open-source libraries, without prominent disclosure.

@piskvorky
Owner

piskvorky commented Dec 4, 2018

If I understand you correctly, you'd like the code delivered separately (installed via pip install, followed by api.load()) as opposed to together (current, just api.load()).

As long as the releases are immutable, this difference is cosmetic / documentation. So the real question is, how can we ensure that releases stay immutable?

When a dataset's custom code must be revved

I don't know what you mean by "revved" – there's no release revising allowed in gensim-data.

I'm not opposed to separating the code from data. But it'd not be for versioning reasons (there's no versioning in immutable releases), nor to limit "repository write access for malicious users" (completely the wrong resolution for this attack vector). It'd be more to catch accidents and to ensure all changes stay above board.

May be worth asking Github support whether they support "immutable release files" = the simplest solution.

Another option would be to inform users to review the code associated with the dataset they're installing. The same way you'd ask them to review setup.py or the rest of the package. This may actually be both easier to manage as well as easier for the users to accomplish (less code to review, no irrelevant scaffolding).

Pull requests welcome! Unfortunately we won't have time to get to this in the near future ourselves.

@menshikh-iv
Contributor

I partially agree with @gojomo – this is a real security risk (precisely because a GitHub "release" carries no immutability guarantee; that guarantee exists only by our agreement).

I'm +1 for:

  • move the "load" code into git (for better tracking & real guarantees), keeping the "data files" in releases
  • clarify the documentation of api.load (describe in detail how the process works)

I'm not sure about making "gensim-data" a Python package (to hold the loading code): on one hand it's a good idea (consistent, fully tracked); on the other, making a release for each new dataset is pretty annoying, and users would always have to upgrade gensim-data before downloading new data/models.

BTW, "immutable release files" on GitHub wouldn't help much: you'd have no chance to fix a mistake in that case (in the current scheme it's hard to test everything properly before you make a GitHub release; we'd need that ability first).

@gojomo
Collaborator Author

gojomo commented Aug 2, 2021

With regard to the potential addition of HuggingfaceHub-specific downloading functionality in #3206, @piskvorky writes:

Changing how the gensim-data models and corpora are uploaded / maintained is definitely possible – pending a strong contributor to actually implement it.

If I understand correctly, the options now are:

  1. Rethink and reimplement the whole gensim-data API first, before considering integration with HFH.
    Pros: We can architect a better system.
    Cons: Non-trivial work; keep in mind no one has stepped up to do it for years.
  2. Accept this PR.
    Pros: Useful, to at least N users, with N >= 3 :) (real user base for this HFH-Gensim integration unclear)
    Cons: Adding a bit of complexity we may eventually replace via 1) or 3).
  3. Other?

I'm against integrating the PR in #3206, for the reasons outlined in a comment there.

gensim.downloader violates a bunch of good project-hygiene practices that should be rarely if ever violated, such as:

  • all code is in version control attributed to a contributor
  • all code sent to non-expert users goes through a formal & standard release/packaging process
  • novel unversioned/unreviewed/hard-to-review code isn't fetched & executed from over the internet without prominent warning

Given that, it shouldn't have to wait for anyone to "architect a better system". It should just be discontinued ASAP - especially if there are no current maintainers eager to fix its issues.

A wiki page full of download links can replace it: "If you want to work with data X, here's a link to a bundle: x.gz. Download, expand, view & work-with the files inside. Remember that if you unpickle a data-file, or run a .py file, you'll be running new arbitrary code from the internet."
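For instance, such a page's entry for a word-vector bundle might boil down to the following – a sketch only; the exact URL and the binary word2vec file format are assumptions to verify against the actual asset:

    import urllib.request
    from gensim.models import KeyedVectors

    # Explicit step 1: download -- the user sees exactly which file arrives where.
    url = ("https://github.com/RaRe-Technologies/gensim-data/releases/"
           "download/word2vec-google-news-300/word2vec-google-news-300.gz")
    urllib.request.urlretrieve(url, "word2vec-google-news-300.gz")

    # Explicit step 2: load as inert data -- no downloaded code is executed.
    kv = KeyedVectors.load_word2vec_format("word2vec-google-news-300.gz", binary=True)
    print(kv.most_similar("king"))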

I'd even argue that explicit process is better for users' understanding of what data files are now on their local drive, and what code/classes/calls they might use to work with it.

Compare the current approach, of encouraging extremely-novice users to run code like:

import gensim.downloader as api
model = api.load("fasttext-wiki-news-subwords-300")

What do you think model is, after that code?

There's zero documentation of what kind of object this opaque magic call just delivered.

Here and here are two examples of users in just the last few days who quite reasonably thought it might be a FastText model that could offer OOV word lookup.

But it's not a FastText object. After searching and finding no docs declaring what it is, you might be tempted to try to browse the project source code to figure out what it is. But the __init__.py source code – that's downloaded & executed to create that model – isn't in any git repo that I can find. It's only in the 'release' assets download area here:

https://github.com/RaRe-Technologies/gensim-data/releases/tag/fasttext-wiki-news-subwords-300

You have to download it & view it locally to know what it's doing. (It turns out it's only using a static set of full-word output vectors from FastText, and loading them as a plain KeyedVectors - so it's somewhat of a misnomer to even call this a 'FastText' model at all, especially one which claims to be leveraging 'subwords'.)
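The mismatch is easy to demonstrate. Note this snippet uses the real downloader, so it performs exactly the large download-and-execute step this thread criticizes; the probe word is an arbitrary assumption:

    import gensim.downloader as api

    model = api.load("fasttext-wiki-news-subwords-300")
    print(type(model))  # gensim.models.keyedvectors.KeyedVectors -- not a FastText model
    try:
        model["definitelynotinthevocab123"]  # a true FastText model would synthesize this
    except KeyError:
        print("plain KeyedVectors: no subword/OOV support")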

And while the release is attributed to @menshikh-iv, I believe any one of the (nowhere-publicly-displayed) maintainers of this project can replace any of those 'asset' files via a Github upload interface at any time. That is, the normal processes which create confidence in the code that's shipped as part of the core Gensim repo, or officially packaged releases – attributed commit logs, test cycles, formal releases – are not active for this source code file.

(Even the gensim.downloader code's token effort to checksum-verify the bulk data is kind of a joke. It's using MD5 – a broken algorithm that cryptographers have been warning about since the mid-1990s, with engineered hash-collisions published back in 2004. It was already obsolete for this purpose when this code was written in 2017.)
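By contrast, verifying the bulk data with a current hash is only a few lines. A sketch, where the expected digest is a placeholder – it would have to come from somewhere more trustworthy than the asset host itself, e.g. a versioned manifest in the repo:

    import hashlib

    def sha256_of(path, chunk_size=1 << 20):
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    EXPECTED = "placeholder-digest-from-a-versioned-manifest"
    if sha256_of("word2vec-google-news-300.gz") != EXPECTED:
        raise RuntimeError("checksum mismatch: refusing to use this download")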

And what if someone wanted to improve this confusing user experience with the fasttext-wiki-news-subwords-300 bundle? There's no code repo against which to propose changes to that __init__.py, to use a more appropriate dataset/model-class, or to add extra docs, or to point people to a better example of a true FastText model with OOV functionality.

So this idiosyncratic delivery-method is also resistant to the normal mechanisms by which an open-source project can improve over time.

@piskvorky
Owner

piskvorky commented Aug 2, 2021

It should just be discontinued ASAP - especially if there are no current maintainers eager to fix its issues.

That's the option 3) I guess.

That option (with a replacement) I also mentioned in the part of the comment you didn't quote here:

Or possibly outsource the whole "models/corpora management" problem altogether, if it turns out something sufficient already exists.

To be clear, I'm also not a fan of the current design of gensim-data. It should go the way of the infamous #1777 (same ppl).
But I do think removing the functionality without a replacement is inferior to keeping it: axe < keep < replace < re-architect.

@gojomo
Collaborator Author

gojomo commented Aug 3, 2021

Indeed, we should 'replace' if possible - but an easy way to do that is to set up a replacement web page, with all the same info as in the gensim-data list.json, including direct-download links for the datasets.

The key value of the collection – reliable download URLs for known useful datasets – will remain, needing just a few extra clicks from users to use.

And those extra explicit steps would be a positive improvement over the opaque magic of api.load(), that hides which files arrive, & what Gensim code/classes are being used to work on what data formats. A user of a dataset from the cloud should know which filepath it's at, and which line-of-code is needed to load it, and what Gensim class it gets loaded as. Hiding those details is a disservice to users, especially the beginners enticed by those singular api.load() lines.

@piskvorky
Owner

piskvorky commented Aug 3, 2021

The difference between such a version of replace and actually re-architecting seems sufficiently small that it probably makes sense to go the full way: add some API / process to load those "static packages" in a standardized way. Possibly as wheels, spacy-style.

But either replace or re-architect is an improvement over the current keep. My objection is to axe.

@gojomo
Collaborator Author

gojomo commented Aug 3, 2021

re: <<add some API / process to load those "static packages" in a standardized way. Possibly as wheels, spacy-style.>>

Is this likely to ever happen? No one's touched this hairball in years. No users seem to be clamoring for a Gensim-specific dataset library, either in the form of gensim.downloader as it exists, or some vaguely-specified future version. It's unclear whether anyone is actually getting any benefit from the existing grab-bag of downloads... but it is generating user confusion.

And I still haven't seen any conjectured benefits that can't be matched (or exceeded!) by a web page that simply says, "here's a download link, here's 2-3 lines of code to load those files into Gensim".

@piskvorky
Owner

piskvorky commented Aug 3, 2021

No users seem to be clamoring for a Gensim-specific dataset library, either in the form of gensim.downloader as it exists, or some vaguely-specified future version. It's unclear whether anyone is actually getting any benefit from the existing grab-bag of downloads... but it is generating user confusion.

The datasets have hundreds of thousands of downloads, according to Github stats. They are useful to many people, myself included. I wish I had the time to properly prepare and add more datasets (esp. the requested ones).

Don't let the fact that some users complain fool you – that's always the case.

Your vitriol toward gensim-data is well established. But so is the field's appetite for simple data sharing and ease-of-(re)use. Your point that the release process could/should be improved is gratefully accepted, but that doesn't mean what we have is useless.

"here's a download link, here's 2-3 lines of code to load those files into Gensim"

The iterator classes are far from 2-3 lines of code. It's 2-3 lines only if the data accessor / parser is already built into Python and/or Gensim. So we'd have to let people download the Python sources (~ the current __init__.py files within each release) anyway.

And then let them manage these ad-hoc modules locally, place them on their path, import (or copy-paste?) them into their own code. And fluff up our own tutorials & docs with such instructions.

That doesn't sound like an improvement to me.

@gojomo
Collaborator Author

gojomo commented Aug 4, 2021

The datasets have hundreds of thousands of downloads, according to Github stats. They are useful to many people, myself included. I wish I had the time to properly prepare and add more datasets (esp. the requested ones).

That's useful info; I'd been looking for stats & couldn't find them. Are they only available to project maintainers? Are they broken down by 'asset' downloaded? Which one is most popular?

Don't let the fact that some users complain fool you – that's always the case.

It's not so much that they complain - but they are confused by the underdocumented behavior, & stay ignorant of what file-formats, & APIs, they're actually using.

The iterator classes are far from 2-3 lines of code. It's 2-3 lines only if the data accessor / parser is already built into Python and/or Gensim. So we'd have to let people download the Python sources (~ the current __init__.py files within each release) anyway.

And then let them manage these ad-hoc modules locally, place them on their path, import (or copy-paste?) them into their own code. And fluff up our own tutorials & docs with such instructions.

There are lengthy custom iterator classes inside some of the __init__.py files?!

How could one review this code, or report bugs or propose/discuss improvements? If it's useful code, shouldn't it be in real version-control and/or an official release?

@piskvorky
Owner

piskvorky commented Aug 4, 2021

That's useful info; I'd been looking for stats & couldn't find them.

I googled up this.

The stats difference between __init__.py and its data asset file is interesting. I interpret that as "user cancelled the download mid-way" (almost half of Wiki corpus downloads).

It's not so much that they complain - but they are confused by the underdocumented behavior, & stay ignorant of what file-formats, & APIs, they're actually using.

Staying ignorant is a desirable state for many :-)

That users don't have to care about file formats and APIs and locations is a victory. That's the main appeal of gensim-data and similar storages.

Although there should be a path for the occasional power-user to dig deeper, we shouldn't confuse users of course.

If it's useful code, shouldn't it be in real version-control and/or an official release?

Yes.

@gojomo
Collaborator Author

gojomo commented Aug 4, 2021

That's useful info; I'd been looking for stats & couldn't find them.

I googled up this.

The stats difference between __init__.py and its data asset file is interesting. I interpret that as "user cancelled the download mid-way" (almost half of Wiki corpus downloads).

Very interesting, thanks!

But, I think most of those large numbers may be artifacts of automated processes. For example, the __testing* bundles have over 50k downloads, but aren't even listed on the gensim-data README or of real interest to users. They're just downloaded as part of our automated testing.

The top downloads, like glove-wiki-gigaword-100.gz (209k downloads) or word2vec-google-news-300.gz (220k), are also referenced in Gensim's code, like doc-comment examples & auto_examples notebooks – which, IIUC, are auto-run regularly by our build/test infrastructure.

Your testimony, & user reports, & the variety of other tallies all imply there's some real usage. But even thousands of those numbers could easily be web crawlers, or people finding the 'assets' download URLs & downloading directly from a browser/curl/wget rather than api.load().

It's not so much that they complain - but they are confused by the underdocumented behavior, & stay ignorant of what file-formats, & APIs, they're actually using.

Staying ignorant is a desirable state for many :-)

That users don't have to care about file formats and APIs and locations is a victory. That's the main appeal of gensim-data and similar storages.

Indeed, and that's great if you take users the whole way to new capabilities, smoothly, with irrelevant details hidden.

But it's a disservice if it hides relevant details, leaving people confused as to what's actually happening, or what capabilities have just been imported. (This is of course especially the case with the 'fasttext-subwords' model that's really just a KeyedVectors.)

If it's useful code, shouldn't it be in real version-control and/or an official release?

Yes.

Then I'd suggest some good incremental steps could be:

  • For each named dataset, make a directory with that name under gensim-data. In that directory, put the code that currently lives in each __init__.py, perhaps renamed as load_data.py – then at least the code is browsable/reviewable/version-controlled. Probably also, make a README inside each dir, to be the canonical place for explanation of that dataset – instead of that text living just on the https://github.com/RaRe-Technologies/gensim-data/releases page.
  • Sunset downloader.load() (with its hidden filenames, paths, automatic code execution, and unclear return-types), replacing it with a stub that recommends a new downloader.download() method instead.
  • In this new downloader.download(), perform the following
    • download all the same data, but now pulling the small stub code files from the repo, not the 'assets'
    • print vividly for the user the list of files that have arrived, & their local paths/sizes
    • print an advisory message like:
The gensim-data bundle 'word2vec-google-news-300' with supporting files downloaded to:
`/BLAH/BLAH/BLAH/word2vec-google-news-300/`. 

For more about this dataset, see:
`/BLAH/BLAH/BLAH/word2vec-google-news-300/README`.

For example code to load the data into an appropriate Gensim model, see:
`/BLAH/BLAH/BLAH/word2vec-google-news-300/load_data.py`. 

Review it there & adapt to your needs, or import directly with:

    downloader.monkey_import('/BLAH/BLAH/BLAH/word2vec-google-news-300/load_data.py')
    model = load_data('/BLAH/BLAH/BLAH/word2vec-google-news-300/word2vec-google-news-300.gz')

The code & docs would then be amenable to further improvement/extension using normal contribution processes. Users would receive fair warning of what & where is landing in their local storage, and that they're choosing to run new code as if from a notebook or cut & pasted online example.
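For what it's worth, a rough sketch of the hypothetical monkey_import() helper named in the message above – essentially today's hidden load step, made explicit and opt-in (the function and module name are placeholders):

    import importlib.util

    def monkey_import(path, module_name="gensim_data_shim"):
        """Execute a load_data.py file the user has had a chance to review."""
        spec = importlib.util.spec_from_file_location(module_name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # runs only by explicit user choice
        return module

The mechanics are the same as today's hidden load; the difference is that the user has already seen the file land on disk and has explicitly chosen to execute it.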

@piskvorky
Owner

piskvorky commented Aug 5, 2021

The top downloads, like glove-wiki-gigaword-100.gz (209k downloads) or word2vec-google-news-300.gz (220k), are also referenced in Gensim's code, like doc-comment examples & auto_examples notebooks – which, IIUC, are auto-run regularly by our build/test infrastructure.

I hope not! That's a >1 GB download.

For the more niche models, I consider 10k, or even "mere" thousands of downloads a success. That's what I'd expect from such published corpora or models – useful to a niche community of hundreds-to-thousands of researchers, most of whom download one particular dataset a few times, during the course of their project.

Your testimony, & user reports, & the variety of other tallies all imply there's some real usage. But even thousands of those numbers could easily be web crawlers, or people finding the 'assets' download URLs & downloading directly from a browser/curl/wget rather than api.load().

Relentless :) And absurd.

I think I understand your strategy – keep pushing and nagging until I rewrite the downloader. I'm pretty close TBH, it's working.

  • download all the same data, but now pulling the small stub code files from the repo, not the 'assets'

What do you mean by this? At which point are assets downloaded, if not at download()?

Review it there & adapt to your needs, or import directly with:

downloader.monkey_import('/BLAH/BLAH/BLAH/word2vec-google-news-300/load_data.py')

This monkey_import  – that's our current load, right?

If I understand correctly, your proposal is adding a new extra optional download step, and being more explicit about what is happening overall.

I totally agree with the explicitness in output / progress. Breaking down load() into two separate download() and monkey_import() steps is also possible, but I suspect most users don't care and will continue to call both steps combined for convenience, as load(). I know I will. I'm afraid you're in a power-user minority @gojomo.

@gojomo
Collaborator Author

gojomo commented Aug 5, 2021

A few thousand real users would be plenty. Even a few dozen, especially if it's key work/learning that wouldn't happen otherwise!

I just know that any hit-count could be entirely artificial, without other confirmation of real usage. In particular, regarding:

I hope not! That's a >1 GB download.

Isn't doc-comment example code like this...

https://github.com/RaRe-Technologies/gensim/blob/b287fd841c31d0dfa899d784da0bd5b3669e104d/gensim/models/word2vec.py#L163

...auto-run in testing? That makes me think the ~54k glove-twitter-25 downloads are mostly driven by the same process that generated the 55k __testing_word2vec-matrix-synopsis downloads.

Similarly, the word2vec-google-news-300 dataset is used by a bunch of the auto_examples demos/tutorials, which I also thought were being auto-run at regular intervals.

  • download all the same data, but now pulling the small stub code files from the repo, not the 'assets'

What do you mean by this? At which point are assets downloaded, if not at download()?

I'd expect that still, the single call to .download() gets everything in one call.

But I mean, as a quick fix: it stops pulling individual executable, instructive source code (like the __init__.py special-file) raw from the weird Github 'assets' hosting URL, like the URL https://github.com/RaRe-Technologies/gensim-data/releases/download/fasttext-wiki-news-subwords-300/__init__.py that you can see on the page https://github.com/RaRe-Technologies/gensim-data/releases/tag/fasttext-wiki-news-subwords-300.

If it needs to still pull that file separately, it might as well target the repo at a certain tag, e.g. a URL more like https://raw.githubusercontent.com/RaRe-Technologies/gensim-data/fasttext-wiki-news-subwords-300/generate_table.py. At least such a URL would allow the source to be viewed in a browser. (Though, I'm not sure if Github's commitment to serving such URLs indefinitely is as strong as their 'assets' mechanism.)

Better would be: the code + data gets packaged as part of a formal, version-control-tagged release into one bundle. That frozen artifact would be fine to download from the 'assets' mechanism.

Best would be: users get all gensim-data-related code & metadata via a bog-standard pip install gensim-data operation. Then all utility reader code comes via a normal release/version/package-manager process, with all the higher-assurance review/PR/etc habits that entails.

When users need an extra big supplementary optional blob of data, that would then come via the (out-of-band wrt normal development) 'assets' mechanism. For normal data that's truly "just data" (plain-text stuff), all's good. (If and when we start putting actual pickled models in that mechanism, a token warning reminding users that particular new data they've downloaded includes pickled objects, and that to unpickle is equivalent to running arbitrary code, would be nice.)

Review it there & adapt to your needs, or import directly with:

downloader.monkey_import('/BLAH/BLAH/BLAH/word2vec-google-news-300/load_data.py')

This monkey_import  – that's our current load, right?

Roughly, yes, to import the function to where it's callable.

If I understand correctly, your proposal is adding a new extra optional download step, and being more explicit about what is happening overall.

Yes: the threshold of executing new code that you've just downloaded should be vividly disclosed, and crossed by explicit user choice to import & run some new code file – rather than hiding behind what looks like a mere static-data-load.

(I don't think this hypothetical monkey_import() is a great way to do this, if designing from scratch, but it's a minimal-delta with what exists.)

I totally agree with the explicitness in output / progress. Breaking down load() into two separate download() and monkey_import() steps is also possible, but I suspect most users don't care and will continue to call both steps combined for convenience, as load(). I know I will. I'm afraid you're in a power-user minority @gojomo.

Sure, most people remain in a "it won't hurt for me to gloss over this extra risk this one more time, everyone's doing it" mindset unless/until they or someone close to them gets burned. Then, their customer data is exfiltrated & sold on the dark web; their systems are ransomwared; their online accounts are drained. After that, they may join the power-user minority.

Note that in what I consider the best case – all dataset-specific load-shim code is released as part of an explicit gensim-data package, or a gensim.data module of gensim – it's easy to make it one smooth step again, with no kludgey monkey_import-like step. It's just something like:

from gensim_data import CuratedModels
model = CuratedModels.get_fasttext_wiki_news_subwords_300()

And, the method provides a place to hang (& improve over time) docs on what you're getting – or warn/redirect people if a model is ever withdrawn/usefully-updated/etc.
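To make that concrete, a sketch of what one such wrapper could look like in a pip-installed gensim_data package – entirely hypothetical; the class, method name, data path, and the exact loading call are assumptions about a package that doesn't exist yet:

    from gensim.models import KeyedVectors

    class CuratedModels:
        """Versioned, pip-installed shims; only the bulk data is fetched separately."""

        @staticmethod
        def get_fasttext_wiki_news_subwords_300(data_path="fasttext-wiki-news-subwords-300.gz"):
            # Full-word vectors exported from a FastText model: loads as a plain
            # KeyedVectors, with no subword/OOV support. This comment/docstring is
            # exactly where that caveat (and any withdrawal notice) would live.
            return KeyedVectors.load_word2vec_format(data_path)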

@piskvorky
Owner

piskvorky commented Aug 6, 2021

I hope not! That's a >1 GB download.

Isn't doc-comment example code like this...

https://github.com/RaRe-Technologies/gensim/blob/b287fd841c31d0dfa899d784da0bd5b3669e104d/gensim/models/word2vec.py#L163

...auto-run in testing? That makes me think the ~54k glove-twitter-25 downloads are mostly driven by the same process that generated the 55k __testing_word2vec-matrix-synopsis downloads.

Similarly, the word2vec-google-news-300 dataset is used by a bunch of the auto_examples demos/tutorials, which I also thought were being auto-run at regular intervals.

@mpenkov are we downloading these files regularly as part of our CI / testing?

Better would be: the code + data gets packaged as part of a formal, version-control-tagged release into one bundle. That frozen artifact would be fine to download from the 'assets' mechanism.

Hm, what's the difference to the current system? If a malicious person takes over these assets… can't they still do the same thing?

Best would be: users get all gensim-data-related code & metadata via a bog-standard pip install gensim-data operation. Then all utility reader code comes via a normal release/version/package-manager process, with all the higher-assurance review/PR/etc habits that entails.

At that point, we could also make that code a part of gensim proper – why use gensim-data at all.

Sure, most people remain in a "it won't hurt for me to gloss over this extra risk this one more time, everyone's doing it" mindset unless/until they or someone close to them gets burned. Then, their customer data is exfiltrated & sold on the dark web; their systems are ransomwared; their online accounts are drained. After that, they may join the power-user minority.

Frankly, I see PyPI as a bigger risk vector than any of this gensim-data / github stuff. It's a miracle no major PyPI package has been taken over yet (or the entire PyPI system!)… or at least I haven't heard of it. If I were to do damage, I'd go straight to the point of distribution, protected by half-assed passwords on some half-assed web system.

from gensim_data import CuratedModels
model = CuratedModels.get_fasttext_wiki_news_subwords_300()

OK. I'll go back and review why we went with a separate repo in the current style. I'm sure there were reasons, even if its execution was lacking. Unless the reasons were compelling, we could switch (transition) to an all-downloads-code-installed-as-one-package style, whether as gensim_data or gensim.data, with the large data files in assets. Or alternatively, with releases as separate wheels. Thanks for your inputs @gojomo . Do you feel like checking & implementing something yourself?

@mpenkov
Collaborator

mpenkov commented Aug 6, 2021

@mpenkov are we downloading these files regularly as part of our CI / testing?

Sometimes. Not all of the gensim doctests code actually gets run. From what I understand, most of it actually does not, because of large data dependencies like this.

@piskvorky
Owner

And what determines whether something gets run or not?

@gojomo
Collaborator Author

gojomo commented Aug 6, 2021

Better would be: the code + data gets packaged as part of a formal, version-control-tagged release into one bundle. That frozen artifact would be fine to download from the 'assets' mechanism.

Hm, what's the difference to the current system? If a malicious person takes over these assets… can't they still do the same thing?

It's about as bad, from a security standpoint, if the bundle is still downloaded/expanded/executed inside a .load() call.

It's slightly better in that the Python code is under real version-control before being bundled, & would be slightly better again if the user has to explicitly download/expand the bundle & then choose to run a new .py file that they know has arrived.

But putting the code in an explicit release-via-PyPI would be better still.

Best would be: users get all gensim-data-related code & metadata via a bog-standard pip install gensim-data operation. Then all utility reader code comes via a normal release/version/package-manager process, with all the higher-assurance review/PR/etc habits that entails.

At that point, we could also make that code a part of gensim proper – why use gensim-data at all.

That's a good question. I can understand the original hope may have been that datasets might arrive at a much more frequent cadence than gensim releases. But it hasn't seemed that way in practice.

I know that despite automation, there's still a bunch of manual effort involved in pushing out a full gensim release. So I see why you might not want to add any more overhead to that process, or be dependent on that process when a new dataset would otherwise just need to be a little JSON, & a few lines of code. Perhaps having a thinner project, where it's easier to push tiny new versions, is still helpful. And it does allow leveraging the Github 'assets' mechanism for bulk downloads, albeit in the odd manner where each 'release' only contains one latest dataset. (But: Github's size limit forces an extra split-and-reassemble step.)

Personally, for these tiny swatches of example code backed by larger data, I'd still prefer a web-based gallery of examples/notebooks, where users click/wget to download data bundles, then copy & paste a few lines of code showing usual loading. That's vividly instructive, and easy to maintain/improve, constantly, as wiki- or repo- edits, without any release formalisms. (Though, something like a 10+ line iterable class customized for certain formats would still go into the related gensim modules.)

This could take the form of a gensim-examples Github project that doesn't ever even have 'releases' - users just visit the web-viewable files.

Sure, most people remain in a "it won't hurt for me to gloss over this extra risk this one more time, everyone's doing it" mindset unless/until they or someone close to them gets burned. Then, their customer data is exfiltrated & sold on the dark web; their systems are ransomwared; their online accounts are drained. After that, they may join the power-user minority.

Frankly, I see PyPI as a bigger risk vector than any of this gensim-data / github stuff. It's a miracle no major PyPI package has been taken over yet (or the entire PyPI system!)… or at least I haven't heard of it. If I were to do damage, I'd go straight to the point of distribution, protected by half-assed passwords on some half-assed web system.

I agree that's also a big risk – I suspect it's only a matter of time before PyPI, or similar repos for other languages, is implicated in some major headline-grabbing malware/ransomware incident. There've been a bunch of issues this year:

https://www.theregister.com/2021/03/02/python_pypi_purges/

https://arstechnica.com/gadgets/2021/07/malicious-pypi-packages-caught-stealing-developer-data-and-injecting-code/

The best we can do is protect our credentials/processes as well as possible so that we're not the early vector, and our users not the early victims - as other lower-hanging targets gradually get exploited & overall habits improve in response.

Barring compromise of the PyPI administrators or infrastructure, there is some safety in the number of users serially using a release. When you're the 10,000th person to pip install a specific version that's been up for weeks or months, there's a good chance someone else will have already hit (and forced the correction of) really dastardly problems.

from gensim_data import CuratedModels
model = CuratedModels.get_fasttext_wiki_news_subwords_300()

OK. I'll go back and review why we went with a separate repo in the current style. I'm sure there were reasons, even if its execution was lacking. Unless the reasons were compelling, we could switch (transition) to an all-downloads-code-installed-as-one-package style, whether as gensim_data or gensim.data, with the large data files in assets. Or alternatively, with releases as separate wheels. Thanks for your inputs @gojomo . Do you feel like checking & implementing something yourself?

I could potentially do some 'axe-with-minimalist-stopgap' work. I don't think that's your preferred path, but it'd be: deprecate .load(), create a very-verbose .download(), & instructions for how people can consciously opt to import/run the __init__.py shims – because that'd quickly remedy the practices that most concern me, without waiting indefinitely for some best-case future capability. Of course, there'd be some loss of the current slick magic ease, & transition costs as old online examples stop working without changes.

Actually designing/implementing something better than that, or designing/maintaining release processes, would be more than I could sign up for.

If it were easy enough to load a new dataset/model into some new repo/gallery, I might at some point like to get some pretrained Doc2Vec models based on Wikipedia into the new repo.

@mpenkov
Collaborator

mpenkov commented Aug 7, 2021

And what determines whether something gets run or not?

I had a closer look at the repo and it doesn't look like anything is getting automatically doctested.

My previous reply ("sometimes") was due to confusion with smart_open. Over there, we do have doctests, which we run prior to each release.

@piskvorky
Owner

piskvorky commented Aug 9, 2021

OK, thanks. That's what I thought too – we check docs for syntax, at most. We don't actually run the code there.
(invalidating the theory of "bogus download counts driven by our own CI")

@gojomo
Collaborator Author

gojomo commented Aug 9, 2021

OK, thanks. That's what I thought too – we check docs for syntax, at most. We don't actually run the code there.
(invalidating the theory of "bogus download counts driven by our own CI")

The __testing assets still have 54K+ downloads which are CI-driven - & even if it's not our own doc-comments, other projects' automated fetches may be the overwhelming bulk of other downloads.

@piskvorky
Owner

piskvorky commented Aug 9, 2021

other projects' automated fetches may be the overwhelming bulk of other downloads.

That is of course possible. We have no control over, or reliable stats on, who downloads from gensim-data, or why. Or from gensim itself, for that matter – Gensim's sustained "7M+ monthly downloads" figure is clearly nonsense.

The relative numbers check out though: from thousands for fringe datasets, to hundreds of thousands for the "superstar" ones. Of which I'm personally responsible for at least dozens, for sure, even without automation.
