PyPi package #913

Open
mbollmann opened this issue Jul 9, 2020 · 4 comments
Assignees
Labels
enhancement triaged Next on the docket
Milestone

Comments

@mbollmann
Member

There was some discussion on whether we should turn our anthology library into a PyPI package. This would make it easier for people to use our Python interface to the Anthology, e.g., to build external tools or run analyses. It might even encourage people to contribute and add functionality to the library itself.

Requirements to achieve this (off the top of my head):

  1. A mechanism to download/update the Anthology XML data from within the Python package.
    Many Python packages download external data as part of their functionality (e.g., NLTK, torchtext), and I've personally used GitPython to do exactly this with the ACL Anthology for my recent Anthology analysis paper. I believe this is completely solvable.

  2. Proper documentation. If we want to promote our Python API in this way, we should have at least succinct, user-friendly documentation that gets people started on how to use it. I believe that might be a good thing to have anyway, to help future volunteers for the Anthology who might work on the Python API. I'd also be happy to help prepare it.

  3. Faster loading, as discussed in #835 ("Faster loading of Anthology class"), could be a major factor for usability. I have more ideas in this direction that I want to look into at some point, but maybe it's more of a "nice-to-have" than an actual blocker?
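To make requirement 1 concrete, here is a minimal sketch of a clone-or-update helper. It is only an illustration under stated assumptions: it shells out to plain `git` through `subprocess` rather than using GitPython, and the function names, the default URL constant, and the injectable `run` parameter are all hypothetical, not part of any existing API.

```python
import subprocess
from pathlib import Path

# Assumption: the public repository URL; adjust if the data lives elsewhere.
DEFAULT_REPO_URL = "https://github.com/acl-org/acl-anthology.git"


def git_command_for(repo_url: str, target: Path) -> list[str]:
    """Return the git command that brings `target` up to date.

    Makes a shallow clone if `target` is not yet a git checkout,
    otherwise fast-forwards the existing checkout.
    """
    if (target / ".git").is_dir():
        return ["git", "-C", str(target), "pull", "--ff-only"]
    return ["git", "clone", "--depth", "1", repo_url, str(target)]


def sync_anthology_data(
    target: Path,
    repo_url: str = DEFAULT_REPO_URL,
    run=subprocess.run,  # injectable so the logic can be tested offline
) -> Path:
    """Clone or update the repository and return the path to its data/ directory."""
    target.parent.mkdir(parents=True, exist_ok=True)
    run(git_command_for(repo_url, target), check=True)
    return target / "data"
```

A caller might use `sync_anthology_data(Path.home() / ".cache" / "acl-anthology")` on first use and again whenever an update is requested; the cache location is likewise just an example.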

Most importantly, I think it would be great to gauge the community's interest in this. If you'd be interested in and see value in working with Anthology data through a pip-installable library, give a thumbs up here!

@akoehn
Member

akoehn commented Jul 9, 2020

I think most of the work lies in proper versioning and releases. Right now code and data are automatically synchronized because they are in the same repository, but we cannot guarantee that an old version of the library works with new data (and we should not try to change that), and we currently have no versioning at all.

Extracting the code into its own git repo and embedding it here creates a lot of overhead (speaking from experience with these setups in an academic setting) and I don't know how we would have version numbers & releases while keeping the code in here.

@mbollmann
Member Author

Great points, @akoehn.

Versioning would indeed require more thought. We could have a file in data/ indicating the minimum version of the library needed to work with it, so the library could warn its users when it's outdated and no longer compatible with the latest XML. But it'd certainly be more work.
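As a rough sketch of that idea, the check below reads a minimum-version marker from the data directory and warns when the installed library is older. The file name `MIN_LIBRARY_VERSION`, the `LIBRARY_VERSION` constant, and both function names are hypothetical, and the comparison assumes plain dotted numeric versions such as `1.2.0`.

```python
import warnings
from pathlib import Path

LIBRARY_VERSION = "0.3.0"  # hypothetical version of the installed library


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted numeric version like '1.10.0' into (1, 10, 0)."""
    return tuple(int(part) for part in text.strip().split("."))


def check_data_compatibility(
    data_dir: Path, library_version: str = LIBRARY_VERSION
) -> bool:
    """Warn and return False if the data requires a newer library version."""
    marker = data_dir / "MIN_LIBRARY_VERSION"  # hypothetical file name
    if not marker.is_file():
        return True  # no marker present: nothing to check
    required = marker.read_text(encoding="utf-8").strip()
    if parse_version(library_version) < parse_version(required):
        warnings.warn(
            f"Anthology data requires library version >= {required}, "
            f"but version {library_version} is installed; please upgrade.",
            stacklevel=2,
        )
        return False
    return True
```

Comparing integer tuples rather than raw strings keeps `1.10.0` ordered after `1.2.0`, which a plain string comparison would get wrong.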

Conversely, though, you could say that the lack of versioning currently makes it less attractive for people to build on our API, since it could change at any moment without clear documentation. That's why I'm wondering how many people would even be interested in this, to see if it makes sense to think about this.

> Extracting the code into its own git repo and embedding it here creates a lot of overhead

Are you thinking of the git submodule approach here? I don't see a lot of problems with just adding our package to this repo's requirements.txt instead, but maybe I haven't fully thought this through.

> I don't know how we would have version numbers & releases while keeping the code in here.

I'm not sure what problems you foresee here; version numbers for the Python package could be kept in a subdirectory where the package lives (say lib/), and releases to PyPI could be triggered manually by us when appropriate.

@akoehn
Member

akoehn commented Jul 10, 2020

> Are you thinking of the git submodule approach here?

No, I meant another repo. The thing is that fixing a bug is straightforward now. With a separate repository, you would need to check out acl-anthology and the anthology code, make changes to the code, publish it locally (or otherwise make sure it is used by acl-anthology), test whether your fix worked, and repeat.

The easiest way would probably be to generate a PyPI package from the current setup, where the core anthology code base lives together with the library part in one repository and we don't have to think about versioning all the time.

@mbollmann self-assigned this Aug 8, 2021
@mjpost added the enhancement, help wanted, and good first project labels Dec 17, 2022
@mjpost added this to the 2023Q1 milestone Dec 17, 2022
@mjpost pinned this issue Dec 26, 2022
@mjpost modified the milestones: 2023Q1, 2023Q3 Jul 13, 2023
@mbollmann
Member Author

There's a first usable version of a PyPI library now: https://pypi.org/project/acl-anthology-py/

I'm currently developing this in a separate repo, but I've thought about the versioning issues and think it should probably be moved into this repo, as keeping it in sync with the data format here (XML schema etc.) does seem like a headache otherwise. I don't see a big problem with having version numbers & releases within this repo, though.

Over the coming weeks, I'll prepare a feature branch here that merges in this library, so that we can continue the discussion here.

@mbollmann added the triaged label and removed the help wanted and good first project labels Oct 21, 2023
@mjpost modified the milestones: 2023Q3, 2023Q4, 2024Q1 Jan 23, 2024
@mjpost modified the milestones: 2024Q1, 2024Q2 May 11, 2024
3 participants