Avoid marshal for creating code objects from serialized data. #566

Open
markshannon opened this issue Mar 17, 2023 · 0 comments

Once the work to allow any object as the "code" of a frame is done, we can take advantage of that to speed up creation of code objects from serialized data.

The idea is that the serialized data will consist of two parts:

  1. A sequence of immutable bytecode
  2. Supporting binary data.

Creation of the top-level (module) code object would be done as follows:

  1. Create a "module initializer" object, consisting of a pointer to the binary data and debug info like the name and filename.
  2. Create a frame, setting the "code" field to the module initializer and setting the instruction pointer to the start of the bytecode sequence.
  3. Start executing in the interpreter.

What are the advantages of this?

  • Marshal is slow
  • There is no need for a secondary interpreter (marshal)
  • It allows partial deep-freezing, meaning that the names and consts arrays can be deep frozen without requiring that the code object is deep frozen. The resulting constant can be loaded with LOAD_COMMON_CONST.
  • It allows further improvements, e.g. we could skip creating a code object for the module, just creating them for functions.
  • It decouples the pyc format from marshal, allowing them to be improved separately.
  • Common objects can be shared very efficiently, by leaving them on the stack and using COPY instead of MAKE_...

Creating the instruction sequence

We can create the instruction sequence in much the same way as marshal serializes: recursively emitting code for sub-objects until the entire object is complete.
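A minimal sketch of such a recursive emitter, in Python for readability. The instruction names come from this issue; the `emit` function, the `(opcode, operand)` tuples, the `COMMON_NAMES` table and the `SMALL_INT_LIMIT` cutoff are all assumptions made for illustration, not a proposed implementation.

```python
# Hypothetical emitter: walks a constant and appends the instructions
# that would rebuild it. Sub-objects are emitted before their container.
COMMON_NAMES = ("a", "self", "name")   # stand-in for the shared string array
SMALL_INT_LIMIT = 256                  # assumed cutoff for LOAD_INT


def emit(obj, out):
    """Append instructions that rebuild `obj` to the list `out`."""
    if obj is None or isinstance(obj, bool):
        out.append(("LOAD_COMMON_CONST", obj))
    elif isinstance(obj, int):
        # Assumption: only small non-negative ints fit in LOAD_INT.
        op = "LOAD_INT" if 0 <= obj < SMALL_INT_LIMIT else "MAKE_LONG"
        out.append((op, obj))
    elif isinstance(obj, float):
        out.append(("MAKE_FLOAT", obj))
    elif isinstance(obj, str):
        op = "LOAD_COMMON_NAME" if obj in COMMON_NAMES else "MAKE_STRING"
        out.append((op, obj))
    elif isinstance(obj, bytes):
        out.append(("MAKE_BYTES", obj))
    elif isinstance(obj, tuple):
        for item in obj:          # recurse into sub-objects first
            emit(item, out)
        out.append(("BUILD_TUPLE", len(obj)))
    else:
        raise TypeError(f"cannot emit {type(obj).__name__}")
    return out


program = emit((7, "a", (2.5, "foo")), [])
for op, arg in program:
    print(op, repr(arg))
```

In a real encoding the operands of the MAKE_* instructions would live in the supporting binary data rather than inline; they are kept inline here only to keep the sketch short.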

To do this, we will need some new instructions and a few new intrinsics.

New general purpose instructions:

  • LOAD_COMMON_CONST: Loads a constant from the global array containing None, True, etc., plus assorted common constants
  • LOAD_COMMON_NAME: Like LOAD_COMMON_CONST, but from an array of strings
  • LOAD_INT: Loads a small int

Instructions to create objects from binary data

These instructions will create an object from the binary data, advancing the pointer.

  • MAKE_FLOAT
  • MAKE_STRING
  • MAKE_LONG (we could build large ints from small ints, but that would be quadratic)
  • MAKE_BYTES
  • MAKE_CODE: Creates a code object from values on the stack (name, qualname, names, consts) and binary data
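To illustrate "create an object from the binary data, advancing the pointer", here is a rough Python sketch. The byte layout (little-endian doubles, 4-byte length prefixes) and the names `BinaryData`, `take`, `make_float`, `make_bytes`, `make_string` and `make_long` are assumptions for illustration; the issue does not specify a format.

```python
import struct


class BinaryData:
    """Supporting binary data plus a read pointer (hypothetical layout)."""

    def __init__(self, data):
        self.data = data
        self.pos = 0

    def take(self, n):
        """Return the next n bytes and advance the pointer past them."""
        chunk = self.data[self.pos:self.pos + n]
        if len(chunk) != n:
            raise ValueError("binary data exhausted")
        self.pos += n
        return chunk


def make_float(blob):
    # Assumed encoding: 8-byte little-endian IEEE 754 double.
    return struct.unpack("<d", blob.take(8))[0]


def make_bytes(blob):
    # Assumed encoding: 4-byte little-endian length, then the payload.
    (length,) = struct.unpack("<I", blob.take(4))
    return blob.take(length)


def make_string(blob):
    return make_bytes(blob).decode("utf-8")


def make_long(blob):
    # The whole integer is decoded from its bytes in one step,
    # rather than being assembled from small ints.
    return int.from_bytes(make_bytes(blob), "little", signed=True)


payload = (
    struct.pack("<d", 37.0)
    + struct.pack("<I", 3) + b"foo"
    + struct.pack("<I", 9) + (2**64).to_bytes(9, "little", signed=True)
)
blob = BinaryData(payload)
values = [make_float(blob), make_string(blob), make_long(blob)]
```

Because each helper consumes exactly what it needs, the instructions themselves need no operands describing where their data lives; the order of instructions fixes the order of the binary data.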

New intrinsic functions

  • make_complex (2)
  • make_frozenset (1)

We already have an instruction for making tuples.

The instruction sequence would finish with MAKE_CODE; RETURN_VALUE, returning the completed code object from the top of the stack.
Alternatively, we could add another instruction, START_CODE, at the end to execute the code object and return the completed module.

Examples

Creation of the tuple (1, "a", 37.0, (2, "foo"))

LOAD_INT 1
LOAD_COMMON_NAME "a"
MAKE_FLOAT 37.0
LOAD_INT 2
MAKE_STRING "foo"
BUILD_TUPLE 2
BUILD_TUPLE 4
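A toy stack machine can be used to check a sequence like this. The sketch below is an assumption-laden model, not interpreter code: operands are carried inline instead of being read from binary data, and `run` is a hypothetical helper. Note that the outer tuple has four elements, so the final BUILD_TUPLE takes 4.

```python
def run(instructions):
    """Execute (opcode, operand) pairs on a toy value stack."""
    stack = []
    push_ops = {
        "LOAD_INT", "LOAD_COMMON_NAME", "LOAD_COMMON_CONST",
        "MAKE_FLOAT", "MAKE_STRING",
    }
    for op, arg in instructions:
        if op in push_ops:
            # In this model every load/make instruction just pushes its operand.
            stack.append(arg)
        elif op == "BUILD_TUPLE":
            # Pop the top `arg` values and push them back as one tuple.
            start = len(stack) - arg
            items = tuple(stack[start:])
            del stack[start:]
            stack.append(items)
        elif op == "RETURN_VALUE":
            return stack.pop()
        else:
            raise ValueError(f"unknown instruction {op}")
    raise RuntimeError("sequence ended without RETURN_VALUE")


program = [
    ("LOAD_INT", 1),
    ("LOAD_COMMON_NAME", "a"),
    ("MAKE_FLOAT", 37.0),
    ("LOAD_INT", 2),
    ("MAKE_STRING", "foo"),
    ("BUILD_TUPLE", 2),
    ("BUILD_TUPLE", 4),
    ("RETURN_VALUE", None),
]
result = run(program)
print(result)
```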

Creation of a code object would look something like this:

(Code to create names tuple)
(Code to create consts tuple)
MAKE_STRING name 
MAKE_STRING qualname
COPY n (filename will be shared for all code objects in module)
MAKE_CODE
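The sharing of the filename via COPY can be modelled roughly as follows. This is only a sketch: `FakeCode`, `make_code` and `build_code` are hypothetical stand-ins, the order in which MAKE_CODE pops its inputs is assumed, and COPY is modelled with its existing meaning (push the n-th item from the top without removing it).

```python
from collections import namedtuple

# Stand-in for a real code object; only the fields named in this issue.
FakeCode = namedtuple("FakeCode", "names consts name qualname filename")


def make_code(stack):
    """Model of MAKE_CODE: pop five inputs, push one code object."""
    filename = stack.pop()
    qualname = stack.pop()
    name = stack.pop()
    consts = stack.pop()
    names = stack.pop()
    stack.append(FakeCode(names, consts, name, qualname, filename))


def build_code(stack, names, consts, name, qualname):
    """Push the per-function values, COPY the shared filename, MAKE_CODE."""
    stack.extend([names, consts, name, qualname])
    # The filename sits at the bottom of the stack, so COPY n uses
    # n == current stack depth to reach it.
    n = len(stack)
    stack.append(stack[-n])   # COPY n
    make_code(stack)


stack = ["example.py"]        # filename is created once and left on the stack
build_code(stack, ("x",), (None,), "f", "f")
build_code(stack, ("y",), (None, 1), "g", "C.g")
shared_filename, code_f, code_g = stack
```

Both code objects end up referring to the same string object, so the filename is built once per module rather than once per code object.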