
Compiling Python packages with LPython #992

Open · Tracked by #1600
czgdp1807 opened this issue Aug 19, 2022 · 33 comments

Comments

@czgdp1807 (Collaborator)

Following is the plan to achieve the goal mentioned in the issue title:

  1. Write down small, type-annotated test packages in Python syntax. The nesting can vary from just 1 folder/package having an __init__.py file to 3-4 folders/subpackages nested one inside the other, each level having an __init__.py. We can write packages for different sorting algorithms, graph algorithms, and some practical stuff like that. We should try not to use workarounds but write code in a way that feels natural to us (see the sketch after this list).
  2. Try to compile those test packages, starting from the easy ones (with just 1 folder) and moving to the difficult ones. Do sprints to quickly make them work, then send PRs, each having a single fix.
  3. Then proceed with small packages from PyPI (there must be many). Try to compile them and fix LPython as described in step 2.
  4. Once all of the above is done, move on to advanced Python packages.
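
For step 1, a minimal single-folder test package might look like this (all names here are hypothetical):

+-  test_sorts.py
+-  sorts (directory)
    |
    +- __init__.py
    +- bubble.py

with sorts/bubble.py containing natural, type-annotated code along these lines (a sketch, assuming the i32 annotation type from LPython's runtime module):

from lpython import i32

def bubble_sort(x: list[i32]) -> list[i32]:
    # Plain O(n^2) bubble sort written in the typed subset.
    n: i32 = len(x)
    i: i32
    j: i32
    tmp: i32
    for i in range(n):
        for j in range(n - i - 1):
            if x[j] > x[j + 1]:
                tmp = x[j]
                x[j] = x[j + 1]
                x[j + 1] = tmp
    return x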

Alongside this, we should also try to interface with CPython code (#703).

@certik What do you say?

czgdp1807 self-assigned this Aug 19, 2022
@certik (Contributor) commented Aug 19, 2022

Yes, we should get LPython working with actual Python packages that we write ourselves, to ensure we stick to the subset that LPython supports. The goal here is to fix the inevitable bugs that we either know about or that we will hit. After most things work, it will allow people to upload packages to PyPI, to have dependencies, etc. We can use regular Python tooling to install dependencies and download from / upload to PyPI. We have to improve LPython to be able to compile such a package with dependencies.

We probably also need to add a mode that compiles the library and all dependencies to some kind of a "mod" file that stores the ASR in it, for quick compilation, so that when you modify a script, only that script has to be parsed and compiled to ASR; all other modules can simply be deserialized. LFortran works this way, and it works great.

Finally, we should then start to write libraries of reusable Python code:

  • NumPy (we'll ship with LPython)
  • All the packages from the Python standard library (we'll ship with LPython)
  • Matplotlib
  • SciPy

Effectively we need "lite" versions of such libraries, written in pure LPython. We might later venture into web servers too (Flask, etc.), but I would start with numerical / scientific libraries.

We can start creating packages in the Python standard library as external 3rd-party packages, and then we'll simply download them / pull them into the LPython distribution, since we need to ship them by default. The same with NumPy.

Ideally LPython itself (AST->ASR) has support for all the features that are needed, including a small subset of NumPy, such that all of NumPy and all of the Python standard library can be implemented as a regular LPython package, with no magic or help from the compiler; equivalent to user code.
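
As a rough illustration, a "lite" NumPy written as a regular LPython package could contain functions like the following (a sketch only; the exact supported subset is an assumption):

from lpython import i32, f64

def zeros(n: i32) -> list[f64]:
    # Pure-LPython stand-in for numpy.zeros on 1-D float arrays.
    res: list[f64] = []
    i: i32
    for i in range(n):
        res.append(0.0)
    return res

def dot(a: list[f64], b: list[f64]) -> f64:
    # Pure-LPython stand-in for numpy.dot on 1-D arrays.
    s: f64 = 0.0
    i: i32
    for i in range(len(a)):
        s += a[i] * b[i]
    return s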

certik mentioned this issue Aug 19, 2022 (9 tasks)
Shaikh-Ubaid self-assigned this Aug 19, 2022
@czgdp1807 (Collaborator Author)

I think we can reuse the serialization-deserialization mechanism of LFortran here to compile Python modules to the ASR level. In fact, modules written by us right now compile (parse + AST->ASR) only when we import them, whereas LFortran pre-compiles them to ASR and loads the result on import. LPython should, I think, behave in a similar manner. Also, the question is how to integrate LPython with a build system (writing build files, in simple words) so that LPython compiles all the intrinsic modules defined in it when it is being built from source.

@certik (Contributor) commented Aug 20, 2022

Yes. I think CPython compiles every module into .pyc, so LPython would compile every module to .lpy (or even to .pyc, to reuse the extension), and the compiled module would just be serialized ASR, just like LFortran works with .mod files.

@czgdp1807 (Collaborator Author)

I see. Let me try implementing this idea then. Let's see how far we can get.

@czgdp1807 (Collaborator Author)

In order to achieve the first and second points in #992 (comment), I think we need to write a CMakeLists.txt for the small test package. For now we can use add_custom_command to generate LLVM object files and then link those object files together. So at the end, a static/shared binary will be generated for a single sub-package. Does this make sense?

@certik (Contributor) commented Aug 24, 2022

For end users, the main mode that we want to support is automatic compilation, just like CPython works.

Is the CMake-based mode just for testing that we can compile things by hand?

@czgdp1807 (Collaborator Author)

I think LPython won't be detected as a compiler by CMake, right? So add_custom_command will help in compiling those .py files to .o. Basically I am saying that the pipeline should be the same as for C++/C projects, because we are generating binaries for Python files, the same as what Clang/GCC do for C/C++ files.

> For end users, the main mode that we want to support is automatic compilation, just like CPython works.

By automatic compilation, do you mean that a module will be compiled to a .pyc file only when it is imported? AFAIK, CPython creates .pyc files only when we run a file that imports those modules. I am not sure about this though.

@certik (Contributor) commented Aug 24, 2022

We will probably support both modes:

  • use cmake
  • not use cmake: just do lpython a.py, and if a.py imports a package b, then package b gets automatically compiled (and we can support some command-line flag to compile just package b into a list of .pyc files)

I think most users would prefer the second approach.

Or to rephrase it: the cmake build system can be automatically generated by LPython, so why bother with it, rather than just compiling things directly? We know all the information.

@czgdp1807 (Collaborator Author)

I see. Makes sense. The second approach would be much easier. We will have to implement a timestamp strategy to check whether a file has been changed or not. Do you know how to fetch the timestamp of the last time a file was modified on disk, a.k.a. its modification time? https://stackoverflow.com/a/40504396 seems to be a promising way to get the modification time of a file, but if you know of anything better then we can go for that as well.
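
For illustration, the check itself is simple; in Python it would look something like this (LPython would do the equivalent in its C++ sources, e.g. via the approach in the linked answer):

import os

def needs_recompile(source: str, cached: str) -> bool:
    # Recompile if there is no cached artifact, or if the source
    # was modified after the cached artifact was written.
    if not os.path.exists(cached):
        return True
    return os.path.getmtime(source) > os.path.getmtime(cached)

print(needs_recompile("a/b.py", "a/b.pyc"))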

@czgdp1807 (Collaborator Author)

So I am planning to implement the strategy in my above comment. However, the problem is that when you auto-compile and reuse the auto-compiled modules, the symbol table IDs depend on the order in which you compile those modules. This has consequences for python run_tests.py -u, since it doesn't follow any particular order of compilation. Say tests a.py and b.py both use a module m.py; if a.py gets compiled first, then m.py will be different for b.py, and hence the ASR output of b.py will not match. Something similar happens if b.py gets compiled first. So how do we deal with this situation? Because once a module is compiled, it changes the symbol table IDs of the rest of the program as well.

@certik (Contributor) commented Aug 29, 2022

This is dealt with in the Fortran deserialization by constructing a new symbol table and resolving it correctly. So everything should just work, as long as we reuse all of that code, which I strongly recommend. :)

@certik (Contributor) commented Oct 4, 2022

To move forward:

  • This is a big issue and we have to split it into smaller tasks and implement them
  • I don't know all the details of what needs to be done, so we simply have to start working towards the goal, discover issues as we go, and work on fixing them by creating a good design

Issues that we need to tackle:

  • Looks like there are two modes, one compiles everything in memory (no saving to disk), the other mode saves to .pyc
  • In the mode that saves to .pyc, we simply save every module to .pyc, then load it

@certik (Contributor) commented Oct 4, 2022

Here is our goal:

  • Imagine a project with 10 dependencies, each of which has 10 dependencies, so a total of 200 packages have to be built
  • The first time, we build everything (say it takes 20s)
  • Then, when we modify one file in our project, the dependencies are not built again; only the file that changed is built, and our project is rebuilt/linked (ideally this would be immediate, a couple of milliseconds, or as fast as possible)

@certik (Contributor) commented Oct 4, 2022

So what pieces do we need to get to the goal?

  • We must be able to build 200 dependencies, and then reuse the build.

What does that mean?

  • We need to reuse as much as possible.
  • Consequently, we have several options:
    • compile each file to .pyc (serialized ASR), or (even better) compile each package to one file that contains serialized ASR for the whole package
    • compile each file, or (even better) each package, into both serialized ASR and object file(s)

The second option is more work for us, and we can tackle it later. We need something similar for LFortran as well, so we can work towards it as we go.

So far, the idea I like the most is: compile each package to one file that contains serialized ASR for the whole package.

@certik (Contributor) commented Oct 4, 2022

So the first task can be nested modules. Say we have:

+-  test_a.py
+-  a (directory)
    |
    +- __init__.py
    +- b.py
    +- c.py

And we need to get the following working in test_a.py:

from a.b import x
import a.b
from a import b

plus the relative import syntax in Python.
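
For example, inside a/c.py, the standard Python relative forms of the same imports would be:

from . import b
from .b import x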

We need to get lpython test_a.py working, and it needs to correctly import a etc.

One issue is how to tell LPython where to look for the "a" package if you do import a. There are several options. One clean option seems to be a -I compiler flag, so that lpython -I/some/path test_a.py would look into /some/path to find the package a when test_a.py does import a.
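
A sketch of the lookup this implies, in Python (a hypothetical helper for illustration, not LPython's actual code):

import os

def find_package(name: str, include_dirs: list[str]) -> str:
    # Return the path for package or module `name`, searching the
    # -I include directories in order.
    for d in include_dirs:
        candidate = os.path.join(d, name)
        if os.path.isfile(os.path.join(candidate, "__init__.py")):
            return candidate  # a package directory
        if os.path.isfile(candidate + ".py"):
            return candidate + ".py"  # a plain module file
    raise ImportError("package or module not found: " + name)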

@certik (Contributor) commented Oct 4, 2022

Once the above works, we can think about how to build dependencies. I suggest we use pyproject.toml to specify the dependencies (Cargo style), see here for more background and links to other resources: https://stackoverflow.com/questions/62983756/what-is-pyproject-toml-file-for

@Thirumalai-Shaktivel (Collaborator)

@Shaikh-Ubaid are you working on this issue?

@certik (Contributor) commented Oct 6, 2022

Yes, @Shaikh-Ubaid is working on this issue, but if you are interested, @Thirumalai-Shaktivel, you can go ahead and also work on it; there are plenty of tasks to do.

@certik (Contributor) commented Nov 16, 2022

I just tested it; here is the latest status on this:

#1305


certik mentioned this issue Mar 21, 2023 (38 tasks)
@certik (Contributor) commented Mar 21, 2023

It's time to start writing LPython packages, upload them to PyPI, make them pip-installable, and see what breaks; see #992 (comment). Report the issues that break and we'll fix them.

@certik (Contributor) commented Mar 21, 2023

Here are three ideas that we can do right away:

  • Make it pip installable (from GitHub).

@Smit-create (Collaborator)

I'll start with the first one.

@certik (Contributor) commented May 12, 2023

@Shaikh-Ubaid I figured out the instructions for how to install a package using pip and use (compile) it with LPython:

conda create -n test1 python=3.11
conda activate test1
pip install lpynn lpdraw
lpython -I$CONDA_PREFIX/lib/python3.11/site-packages/ test_pkg_lnn_01.py

where test_pkg_lnn_01.py is taken from integration_tests, but with the import changed from lnn to lpynn. To run this via CPython, do:

conda install numpy
python test_pkg_lnn_01.py

It looks like we can do:

  • lpynn
  • lpdraw
  • nrp

Let's get a few more packages working.

Also let's iterate, perhaps something like this: lpython --include-conda-env a.py, or an even shorter option, perhaps just lpython --conda a.py. I would not allow environment variables by default yet; I want to get more experience using this first.

@Shaikh-Ubaid (Collaborator) commented May 12, 2023

How many packages do we currently support?

> It looks like we can do:

There is also nrp (https://github.com/Shaikh-Ubaid/lpython_packages/tree/main/nrp_pkg, https://pypi.org/project/nrp/).

> Let's get a few more packages working.
>
> Also let's iterate, perhaps something like this: lpython --include-conda-env a.py, or an even shorter option, perhaps just lpython --conda a.py. I would not allow environment variables by default yet; I want to get more experience using this first.

Sure.

@certik (Contributor) commented May 12, 2023

It looks like the lpython emulation works perfectly!

@certik (Contributor) commented May 12, 2023

I would version lpython-emulation with exactly the same version as the released LPython version. That way it is clear which one you have to use, depending on which features of LPython the script uses, and it avoids any backwards incompatibilities.

@certik (Contributor) commented May 12, 2023

I am upgrading this package: certik/mathfn#5. I can see a few things:

  • We should finish the lpython conda package to make it easier to install in CI
  • I want to depend on a few packages (maybe some terminal package, plus this mathfn package) and then create a full end-user application (binary)
  • Polish the experience, make everything very easy

@Shaikh-Ubaid (Collaborator)

> I would version lpython-emulation with exactly the same version as the released LPython version.

We recently released LPython 0.14.0. Do you mean we should version lpython_emulation as 0.14.0? Do we update lpython_emulation only when LPython has a new release?

@certik (Contributor) commented May 12, 2023

That's what I would do. Do you see a problem with it?

@Shaikh-Ubaid (Collaborator)

> That's what I would do. Do you see a problem with it?

Got it. Sure, we can do that. It seems lpython does not actually use/depend on src/runtime/lpython/lpython.py; lpython handles the types using if-else logic in AST->ASR. I wonder if there is a possibility of distributing src/runtime/lpython/lpython.py (that is, lpython_emulation) independently. For example, a use case could be a user who likes the lpython_emulation package and wants to use it to add typing information to his CPython code.
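
For instance (a sketch, assuming lpython_emulation exposes the same names as src/runtime/lpython/lpython.py), such a user could annotate ordinary CPython code with LPython's types:

from lpython import i32, f64

def total(x: list[f64]) -> f64:
    # Runs unchanged under CPython via the emulation module,
    # and the same file is compilable by LPython.
    s: f64 = 0.0
    i: i32
    for i in range(len(x)):
        s += x[i]
    return s

print(total([1.0, 2.0, 3.0]))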

@certik (Contributor) commented May 13, 2023

LPython depends on lpython.py via the integration_tests. We often upgrade a test, and then we change lpython.py and AST->ASR. If we separate lpython.py from LPython, then we have to keep them in sync, which I think is more tedious than the current approach. Maybe later, once LPython stabilizes, we can do that.

@Shaikh-Ubaid (Collaborator)

> LPython depends on lpython.py via the integration_tests. We often upgrade a test, and then we change lpython.py and AST->ASR. If we separate lpython.py from LPython, then we have to keep them in sync, which I think is more tedious than the current approach. Maybe later, once LPython stabilizes, we can do that.

Got it, let's focus on delivering lpython first.
