Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use matroid-database package #37231

Merged
merged 13 commits into from
Feb 25, 2024
Merged

Use matroid-database package #37231

merged 13 commits into from
Feb 25, 2024

Conversation

gmou3
Copy link
Contributor

@gmou3 gmou3 commented Feb 3, 2024

Remove matroid database files (ext_data/matroids) and use newly created matroid-database pypi package instead.

Links to matroid-database: pypi, github

Outdated:

Compress (via xz -e) the matroid database files (~75MB to <1MB). The files are now opened with lzma.open() and the file paths are located through sage.ext_data and not relative to SAGE_SRC or SAGE_EXTCODE. These changes attempt to address the issues raised by @tornaria.

Also, the addition of the matroid database files inside the folder ext_data needs to be noted in the file setup.cfg.m4. This issue was brought to my attention by @antonio-rojas.

These follow the merging of #37140.

I also took the initiative of reordering alphabetically the list of paths inside setup.cfg.m4 and of removing the line ext_data/images/* which seemed redundant.

📝 Checklist

  • The title is concise, informative, and self-explanatory.
  • The description explains in detail what this PR is about.
  • I have linked a relevant issue or discussion.

@antonio-rojas
Copy link
Contributor

Perfect, thanks

@tornaria
Copy link
Contributor

tornaria commented Feb 4, 2024

I copy below my comment from #37140. IMHO that PR should be reverted, and replaced by an external database package. Also, I understand that SAGE_EXTCODE is deprecated (#33037).


This PR almost doubles the size of src/sage:

$ du -sh src/sage
180M	src/sage
$ du -sh src/sage/ext_data/matroids/
75M	src/sage/ext_data/matroids/

This is not the way databases are handled in sagemath. The normal way would be to make an optional spkg containing the database. The nicer/easier way would be to create a pypi package similar to https://github.com/sagemath/conway-polynomials (btw, in conway-polynomials the actual data files are xz compressed and transparently decompressed as needed)

Also, as @antonio-rojas pointed out, this broke sagemath-standard.

@tscrim
Copy link
Collaborator

tscrim commented Feb 5, 2024

@tornaria You mean #37140, correct?

However, I am not sure how much I favor creating additional optional as they are a pain to install unless you have a source build IIRC. Putting things on pypi currently means it goes out into the either and nobody really knows it exists. We (the Sage community) don't really have a good policy for how deal with small databases. (The pypi issue is a bigger thing for me. I am not aware of any good way to find what's available, and I don't have a lot of trust with that code to stay current with changes to Sage.)

In this case, we have a database that is available online but it isn't packaged in any standard form. I agree that this is not the best solution, but without a good way to install optional packages (which would be preference otherwise), I don't see a better option for a somewhat small database like this. Or has something changed with optional package installation?

Lastly, I wasn't aware of the deprecation. It does not really seem like a formal deprecation, much less have a plan to replace it.

@gmou3 gmou3 changed the title Update src/setup.cfg.m4 to include matroid database and reorder alphabetically Compress matroid database and update src/setup.cfg.m4 accordingly Feb 5, 2024
@gmou3 gmou3 changed the title Compress matroid database and update src/setup.cfg.m4 accordingly Compress matroid database and update src/setup.cfg.m4 Feb 5, 2024
@gmou3
Copy link
Contributor Author

gmou3 commented Feb 5, 2024

Please have a look at the changes.

@tornaria
Copy link
Contributor

tornaria commented Feb 5, 2024

@tornaria You mean #37140, correct?

Indeed

However, I am not sure how much I favor creating additional optional as they are a pain to install unless you have a source build IIRC.

It should be as simple as sage -pip install matroid-database or something like that. Alternatively, you could make a vote in sage-devel to make the package standard instead of optional. This way is only simpler because the vote on sage-devel was omitted.

Putting things on pypi currently means it goes out into the either and nobody really knows it exists.

For sage-the-distro is just a (standard or optional) spkg as always. In general, if it's optional it can be handled by "features" which means when you want to use the feature, if it's not available, it will explain what is missing and what is the suggested way to install the feature depending on your installation.

We (the Sage community) don't really have a good policy for how deal with small databases.

We do. We have a number of databases which are external spkgs, and some work has been started to move them to python packages (I mentioned conway-polynomials, which is a core database which is used to create finite fields, very non-optional I think). The python package is very simple, easy to install in varying circumstances (via pip, as a spkg, as a distro package, conda, etc).

Here's the data packages I have installed (I think they are all the standard data packages):

$ du -sch /usr/share/sagemath/* | sort -h
$ du -sch /usr/share/sagemath/* | sort -h
36K	/usr/share/sagemath/combinatorial_designs
188K	/usr/share/sagemath/jones
332K	/usr/share/sagemath/reflexive_polytopes
668K	/usr/share/sagemath/threejs-sage
732K	/usr/share/sagemath/conway_polynomials
2.2M	/usr/share/sagemath/ellcurves
3.5M	/usr/share/sagemath/graphs
8.1M	/usr/share/sagemath/cremona
16M	total

These are all smaller than the matroid database, and still they were deemed to be external packages.

(The pypi issue is a bigger thing for me. I am not aware of any good way to find what's available, and I don't have a lot of trust with that code to stay current with changes to Sage.)

The code would be as a sagemath project in github (e.g. https://github.com/sagemath/conway-polynomials). Having it in pypi means they are easy to install using pip, etc. And sage-the-distro would still have a spkg for this which should be very easy to write and maintain (as are distro packages). Moreover, the databases change very seldom which means all of that data is installed once and doesn't need to change when updating sagemath, etc.

In summary, having a database as a spkg is the usual policy for sagemath. Having a database as a "proper" python package in pypi instead of just a custom spkg has some advantages and there would still be a spkg (based on the pypi package instead of custom spkg).

Anyway, if you prefer to have a "custom spkg" version of the database, that should be fine as well. The current way is that databases are searched for in a few paths given by SAGE_DATA_PATH which defaults to something like ~/.sage/db/<NAME>, $SAGE_LOCAL/share/sagemath/<NAME>, $SAGE_LOCAL/share/<NAME>, which means a user can easily just put the files in its own home directory if they are not installed in the system.

In this case, we have a database that is available online but it isn't packaged in any standard form. I agree that this is not the best solution, but without a good way to install optional packages (which would be preference otherwise), I don't see a better option for a somewhat small database like this. Or has something changed with optional package installation?

If you want this database to be standard instead of optional, then a vote has to be held in sage-devel. But the policy for databases (either standard or optional) is to have them as separate packages.

Lastly, I wasn't aware of the deprecation. It does not really seem like a formal deprecation, much less have a plan to replace it.

No, the idea is that package data should go inside packages and use importlib.resources to access the data. This is for data tied to sage source code.

For databases, the policy AFAICT is to have them as separate spkgs.

@tornaria
Copy link
Contributor

tornaria commented Feb 5, 2024

Please have a look at the changes.

I already stated my opinion: I don't think it makes sense to add database files to the repo, even if they are compressed. These files seem to be 10+ years old, we don't need to update them every release (otoh, if they had some bug or were extended, it's nice if they can be updated without having to change sage!)

Having an external spkg is the current policy. Making the database "standard" AFAIK requires a vote in sage-devel.

Note: even if the spkg is only optional, if the database files are searched via SAGE_DATA_PATH as done with several other databases (see #37024) it should be possible for a user to place the files in its home dir if not installed in system.

Regarding the compression:

  • for files smaller thank 4k (most of them) it may be better to not compress since data is usually read in 4k blocks anyway.
  • For the large files I think the best is to use something like xz -T 2 -e which gives around half the size than gz.
  • Using xz should not be a problem since lzma is included in python standard library since python 3.3.

With this the largest files are:

20K	all_matroids/allr3n10.txt.xz
24K	unorientable_matroids/unorientabler3n11.txt.xz
28K	unorientable_matroids/unorientabler4n09.txt.xz
244K	all_matroids/allr4n09.txt.xz
360K	all_matroids/allr3n11.txt.xz

Every other file fits in 4k and the total is:

$ du -sch all_matroids/ unorientable_matroids/
768K	all_matroids/
80K	unorientable_matroids/
848K	total

@@ -195,24 +197,20 @@ def AllMatroids(n, r=None, type="all"):
rp = min(r, n - r) if (type != "unorientable") else r
type_file = "all" if (type != "unorientable") else "unorientable"
file = os.path.join(
str(SAGE_EXTCODE), "matroids", "database",
str(SAGE_SRC), "sage", "ext_data", "matroids", "database",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using SAGE_SRC is not the way to avoid using SAGE_EXTCODE. The idea of #33037 is that one should not use neither SAGE_LIB nor SAGE_SRC, and SAGE_EXTCODE.

Instead, one is supposed to use importlib.resources to access package data (which I guess should be placed in the package that owns the data instead of the generic src/sage/ext_data although I guess you could still access files in there via sage.ext_data).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right, using SAGE_SRC should also be avoided completely because this variable does not have a meaningful definition in the presence of PEP 420 namespace packages.

@gmou3
Copy link
Contributor Author

gmou3 commented Feb 5, 2024

I already stated my opinion: I don't think it makes sense to add database files to the repo, even if they are compressed. These files seem to be 10+ years old, we don't need to update them every release (otoh, if they had some bug or were extended, it's nice if they can be updated without having to change sage!)

Having an external spkg is the current policy. Making the database "standard" AFAIK requires a vote in sage-devel.

Fair enough. If there is consensus on this or you are confident enough of this requirement let me know and I'll try to make it into a package.

Regarding the compression:

  • for files smaller thank 4k (most of them) it may be better to not compress since data is usually read in 4k blocks anyway.
  • For the large files I think the best is to use something like xz -T 2 -e which gives around half the size than gz.
  • Using xz should not be a problem since lzma is included in python standard library since python 3.3.

With this the largest files are:

20K	all_matroids/allr3n10.txt.xz
24K	unorientable_matroids/unorientabler3n11.txt.xz
28K	unorientable_matroids/unorientabler4n09.txt.xz
244K	all_matroids/allr4n09.txt.xz
360K	all_matroids/allr3n11.txt.xz

Every other file fits in 4k and the total is:

$ du -sch all_matroids/ unorientable_matroids/
768K	all_matroids/
80K	unorientable_matroids/
848K	total

That's quite cool. I compressed them all for consistency/simplicity.

@videlec
Copy link
Contributor

videlec commented Feb 5, 2024

@tornaria You mean #37140, correct?

However, I am not sure how much I favor creating additional optional as they are a pain to install unless you have a source build IIRC. Putting things on pypi currently means it goes out into the either and nobody really knows it exists. We (the Sage community) don't really have a good policy for how deal with small databases. (The pypi issue is a bigger thing for me. I am not aware of any good way to find what's available, and I don't have a lot of trust with that code to stay current with changes to Sage.)

In this case, we have a database that is available online but it isn't packaged in any standard form. I agree that this is not the best solution, but without a good way to install optional packages (which would be preference otherwise), I don't see a better option for a somewhat small database like this. Or has something changed with optional package installation?

Lastly, I wasn't aware of the deprecation. It does not really seem like a formal deprecation, much less have a plan to replace it.

I do not quite understand your objection. Any package on pypi can be installed on any system with python. The package might have a build-dependency or run-dependency on sage, and this is handled by pip. For a purely data+python package such as the one discussed here it would only be run-dependency. As an example, you can try

$ pip install admcycles

which is a pure python library with run-dependency on sage.

What is a pain in the ass are the external packages that requires a recompilation of the sage source code in order to work. For this not only you need a compiled version of sage but also to recompile part of sage to get it work. And this is a nonsense to package.

About advertisement, good projects (as admcycles) tend to survive if people use it and developers maintain it. And I think that dead code in sage is not better than a dead project.

@tscrim
Copy link
Collaborator

tscrim commented Feb 6, 2024

Let me first start with my proposal going forward. A pypi package for the database could work provided we have a clear mechanism for telling a user what they need to do to use the feature (both in code when trying to do it without installing the database and in the documentation with a link to the database). This is assuming that we don't have an auto-download/install feature, but perhaps that becomes a sysadmin security concern?

I am upgrading this to a blocker as I think we all agree that we need to settle this before the next full release.


Now let me explain my positions a bit more.

@videlec @tornaria Note there are many differences between Sage optional packages and pypi package.

Optional Sage packages:

  • Explicitly listed as part of Sage.
  • We say that they should work with the latest version of Sage.
  • They are somewhat regularly tested.
  • There is no easy way to install them on non-source-builds (AFAIK).

pypi packages:

  • Easy enough to install.
  • No quality control: there isn't any check or promise that they should work when Sage upgrades.
  • No list of packages for optional Sage features or downstream code.
  • Similarly, no way to know what is needed for a particular optional feature.

It isn't about the package run-time dependency on Sage, but the other way around: Sage's run-time dependency on a pypi package. This case might be a good place to start setting a policy about this, or at least one reasonable attempt at one.

These files seem to be 10+ years old, we don't need to update them every release (otoh, if they had some bug or were extended, it's nice if they can be updated without having to change sage!)

I find the argument to be a weak one in general, but a bit stronger for databases. We generally change Sage to push the (minimum) version numbers to not include buggy code. I don't see how we are updating them every release. There are some mild benefits for upgrading one small part independent of our release cycle. Yet we release somewhat frequently and I would tell people just to upgrade Sage.

About advertisement, good projects (as admcycles) tend to survive if people use it and developers maintain it. And I think that dead code in sage is not better than a dead project.

Who is to judge what counts as dead code? If someone is trying to use it but it is broken, then is it really dead? Frankly, I find this argument completely unconvincing as this doesn't tell someone (who is not in-the-know) what is available.

@tornaria
Copy link
Contributor

tornaria commented Feb 6, 2024

Some quick comments:

  • sage already depends on a lot of pypi packages, both standard and optional
  • any pypi package cam be easily added as a spkg, in that sense having the database in pypi is more versatile than having it only as a spkg
  • having databases as separate packages has been the policy forever afaik
  • moving packages (and databases) from optional to standard requires a vote; adding the database to the source should have the same requirement.

I believe it was a mistake to merge #37140, at least without a wider discussion. The way forward IMO is to revert it and have the discussion or just add the database as a spkg (pypi or not). Even having the database as a standard package would be way, way better than having it in the source.

@mkoeppe
Copy link
Contributor

mkoeppe commented Feb 6, 2024

The pypi issue is a bigger thing for me. I am not aware of any good way to find what's available, and I don't have a lot of trust with that code to stay current with changes to Sage.

@tscrim This is a solved problem. We advertise packages that are on PyPI by creating an SPKG for it. This puts them in our package index https://doc.sagemath.org/html/en/reference/spkg/index.html#optional-packages
We also run automatic tests on them in each release.

See also:

@mkoeppe
Copy link
Contributor

mkoeppe commented Feb 6, 2024

For @videlec's example package "admcycles", we also have a corresponding SPKG: https://doc.sagemath.org/html/en/reference/spkg/admcycles.html#spkg-admcycles
Again having it there not only advertises it as part of our documentation, but it also includes it in our automatic tests on GH Actions.

@mkoeppe
Copy link
Contributor

mkoeppe commented Feb 6, 2024

@tscrim
Copy link
Collaborator

tscrim commented Feb 6, 2024

The pypi issue is a bigger thing for me. I am not aware of any good way to find what's available, and I don't have a lot of trust with that code to stay current with changes to Sage.

@tscrim This is a solved problem. We advertise packages that are on PyPI by creating an SPKG for it. This puts them in our package index https://doc.sagemath.org/html/en/reference/spkg/index.html#optional-packages We also run automatic tests on them in each release.

A solution with duct-tape and glue at best. I don't see any documentation saying someone should do that (or even encouraging it). An unmarked list interspersed within a huge list contained in an even bigger technical-looking list isn't really advertising. Nothing is said about any automated testing; the only way I am aware of testing on optional packages is by us in the community testing in various combinations. I feel like you are trying to sell me a hamburger without cooking it.

@mkoeppe
Copy link
Contributor

mkoeppe commented Feb 6, 2024

A solution with duct-tape and glue at best.

Really now?

@tscrim
Copy link
Collaborator

tscrim commented Feb 6, 2024

What about my points is wrong?

@tornaria
Copy link
Contributor

tornaria commented Feb 8, 2024

I think you need to add PythonModule('matroid_database') to the list at the end of src/sage/features/databases.py. I think that's also necessary for the # needs matroid_database tag to work.

@mkoeppe
Copy link
Contributor

mkoeppe commented Feb 8, 2024

I was able to handle this by using a combination of sage: # optional - matroid_database and # needs matroid_database

We would use # optional here because matroid_database is an optional package / feature.
The tag # needs is used for standard packages / features.

@mkoeppe
Copy link
Contributor

mkoeppe commented Feb 8, 2024

I think you need to add PythonModule('matroid_database') to the list at the end of src/sage/features/databases.py. I think that's also necessary for the # needs matroid_database tag to work.

That's right. Without declaring the feature, the # optional - matroid_database would only work when matroid_database is installed as an SPKG, i.e., using make matroid_database or sage -i matroid_database. But we want it to work also in other deployment situations.

Add PythonModule('matroid_database')
Use # optional - matroid_database
src/setup.cfg.m4 Outdated
@@ -107,27 +107,26 @@ sage.repl.rich_output =

sage =
ext_data/*
ext_data/kenzo/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to back out the cosmetic change (alphabetic sorting) from this PR because it will conflict with #36951

@gmou3 gmou3 changed the title Compress matroid database and update src/setup.cfg.m4 Use matroid-database package Feb 8, 2024

License
-------

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add license and more detailed description -- for example by fixing the metadata of the package on PyPI and re-running sage --package create matroid_database --pypi

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did so manually.

@@ -262,7 +262,8 @@ def __init__(self, name='polytopes_db'):


def all_features():
return [PythonModule('conway_polynomials'),
return [PythonModule('conway_polynomials', spkg='conway_polynomials', type='standard'),
PythonModule('matroid_database', spkg='matroid_database'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest to make this a separate feature class DatabaseMatroids.

Then, just before you import matroid_database, use DatabaseMatroids().require(). This will give instructions to users how to install it.

Copy link

github-actions bot commented Feb 9, 2024

Documentation preview for this PR (built with commit 7a0a502; changes) is ready! 🎉

@mkoeppe
Copy link
Contributor

mkoeppe commented Feb 13, 2024

This is working well. Let's merge it.

vbraun pushed a commit to vbraun/sage that referenced this pull request Feb 18, 2024
    
Remove matroid database files (`ext_data/matroids`) and use newly
created `matroid-database` pypi package instead.

Links to `matroid-database`: [pypi](https://pypi.org/project/matroid-
database/), [github](https://github.com/gmou3/matroid-database)

Outdated:
> Compress (via `xz -e`) the matroid database files (~75MB to <1MB). The
files are now opened with `lzma.open()` and the file paths are located
through `sage.ext_data` and not relative to `SAGE_SRC` or
`SAGE_EXTCODE`. These changes attempt to address the issues raised by
@tornaria.
>
> Also, the addition of the matroid database files inside the folder
`ext_data` needs to be noted in the file `setup.cfg.m4`. This issue was
brought to my attention by @antonio-rojas.
>
> These follow the merging of sagemath#37140.
>
> I also took the initiative of reordering alphabetically the list of
paths inside `setup.cfg.m4` and of removing the line `ext_data/images/*`
which seemed redundant.

<!-- Why is this change required? What problem does it solve? -->
<!-- If this PR resolves an open issue, please link to it here. For
example "Fixes sagemath#12345". -->
<!-- If your change requires a documentation PR, please link it
appropriately. -->

### 📝 Checklist

- [x] The title is concise, informative, and self-explanatory.
- [x] The description explains in detail what this PR is about.
- [x] I have linked a relevant issue or discussion.
    
URL: sagemath#37231
Reported by: gmou3
Reviewer(s): gmou3, Gonzalo Tornaría, Matthias Köppe
vbraun pushed a commit to vbraun/sage that referenced this pull request Feb 19, 2024
    
Remove matroid database files (`ext_data/matroids`) and use newly
created `matroid-database` pypi package instead.

Links to `matroid-database`: [pypi](https://pypi.org/project/matroid-
database/), [github](https://github.com/gmou3/matroid-database)

Outdated:
> Compress (via `xz -e`) the matroid database files (~75MB to <1MB). The
files are now opened with `lzma.open()` and the file paths are located
through `sage.ext_data` and not relative to `SAGE_SRC` or
`SAGE_EXTCODE`. These changes attempt to address the issues raised by
@tornaria.
>
> Also, the addition of the matroid database files inside the folder
`ext_data` needs to be noted in the file `setup.cfg.m4`. This issue was
brought to my attention by @antonio-rojas.
>
> These follow the merging of sagemath#37140.
>
> I also took the initiative of reordering alphabetically the list of
paths inside `setup.cfg.m4` and of removing the line `ext_data/images/*`
which seemed redundant.

<!-- Why is this change required? What problem does it solve? -->
<!-- If this PR resolves an open issue, please link to it here. For
example "Fixes sagemath#12345". -->
<!-- If your change requires a documentation PR, please link it
appropriately. -->

### 📝 Checklist

- [x] The title is concise, informative, and self-explanatory.
- [x] The description explains in detail what this PR is about.
- [x] I have linked a relevant issue or discussion.
    
URL: sagemath#37231
Reported by: gmou3
Reviewer(s): gmou3, Gonzalo Tornaría, Matthias Köppe
@vbraun vbraun merged commit 5430775 into sagemath:develop Feb 25, 2024
30 of 71 checks passed
@mkoeppe mkoeppe added this to the sage-10.3 milestone Mar 7, 2024
@gmou3 gmou3 deleted the ext_data_matroids branch April 9, 2024 11:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants