Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Typos, and possible bug/question #36

Open
PallHaraldsson opened this issue Jul 17, 2023 · 3 comments
Open

Typos, and possible bug/question #36

PallHaraldsson opened this issue Jul 17, 2023 · 3 comments

Comments

@PallHaraldsson
Copy link
Contributor

PallHaraldsson commented Jul 17, 2023

It's in- not "imcompatible":

8f0d8c0

and in the README "from/for pytorch", should be "from/to PyTorch"?

It's great to see this package (while seemingly you can always use PythonCall.jl, i.e. Python to load/store all pickles), i.e. not needing Python.

I tried to dig into the source code to check if an ordered Dict is used. I.e. Python at some point changed from unordered to ordered, so when you deserialize, an ordered (also unordered from older Python/pickle?) I think you want to return such a type, not Julia's (unordered) Dict, from Base. I see you depend on DataStructures.jl, maybe for that. An ordered type should be there, it's also in OrderedCollection.jl

It's a question what to do when you serialize (unordered) Julia Dict. Can and does/should it go to now default Python's ordered dict?

I don't know that I will ever need to serialize to pickle/PyTorch, while you want to support that, it seems maybe the main reason for this package is to deserialize Pickle, or rather PyTorch's .bin (that I suppose has all the same requirements and more).

I was sort of surprised to see "opcode". I realize in Python you can pickle anything even programs, so it's meant for, not the source code, rather Python's generated VM code? I'm not sure if it's used or useful for PyTorch, I only know the possibility makes it a security risk in Python that someone can serialize a supposed NN, with arbitrary Python executable code.

I suppose if you deserialize this code in Julia, you get the opcodes in some form, but no way to actually run them (yet), i.e. actually without the security risk? Is there ever a downside and you actually want to distribute code with and have it run on deserialization (in Python)? Are you planning then on supporting it (I guess not)? You might want to document this somehow either way, in a positive light... :)

@chengchingwen
Copy link
Owner

I tried to dig into the source code to check if an ordered Dict is used. I.e. Python at some point changed from unordered to ordered, so when you deserialize, an ordered (also unordered from older Python/pickle?) I think you want to return such a type, not Julia's (unordered) Dict, from Base. I see you depend on Datastructures.jl, maybe for that. An ordered type should be there, it's also in OrderedCollection.jl

It's a question what to do when you serialize (unordered) Julia Dict. Can and does/should it go to now default Python's ordered dict?

I'm not sure if the internal ordering is important here. I would expect python people using collections.OrderedDict if the ordered property is required. OTOH, even for the old unordered python implementation, the Julia Dict is still using an different implementation (i.e data structure, hash algo, etc.). That is to say, the iteration order of an unordered python dict and unordered julia Dict is different. I don't think it is ever possible to reconstruct the order of an python dict without an explicitly ordered dict implementation in julia. The main usage of OrderedDict in Pickle is for the torch pickle which use ordered dict to store the model states.

I don't know that I will ever need to serialize to pickle/PyTorch, while you want to support that, it seems maybe the mean reason for this package is to deserialize Pickle, or rather PyTorch's .bin (that I suppose has all the same requirements and more).

Personally, it is am important feature so that I can save the Transformer model trained in Julia and distributed the model file for people to use in python. Another possible usage of that is you can passing an object to a python process by serializing the object with pickle format.

I was sort of surprised to see "opcode". I realize in Python you can pickle anything even programs, so it's meant for, not the source code, rather Python's generated VM code? I'm not sure if it's used or useful for PyTorch, I only know the possibility makes it a security risk in Python that someone can serialize a supposed NN, with arbitrary Python executable code.

First of all, the python pickle format is actually the instructions for a stack machine, which is different to python VM and implemented in python language. The "opcode" is for that stack machine. The stack machine will read the instructions and try to "build" the python object back. The stack machine itself is not that powerful, but it is allow to call any python functions (since it's implemented in python), even the eval. So by saying "you can pickle anything even programs", it is usually referring to the instructions that "build" the function object, or an instruction secretly calling eval. That's way pickle is known for arbitrary code execution.

PyTorch pickle is slightly different for two parts. First, the format is actually containing multiple stuffs. The newer torch pickle is an zip file containing: 1. the object hierarchy in python pickle format, and 2. the parameters in "subfiles" of the zip (not if what the actually term is). Second, Pytorch have their own stack machine implemented in c++ for loading their modified pickle format and they seems to fallback to the stack machine in python.

I suppose if you deserialize this code in Julia, you get the opcodes in some form, but no way to actually run them (yet), i.e. actually without the security risk?

So in Julia, or say this package, we implemented the stack machine with plain julia. You can see that the pickle format is deeply integrated with the python language since it is actually allowing to call python function. Here, we usage a method table to mapping the python function to julia function. If it can find a mapping of a python function on the method table, the function is called to build the object, otherwise the Defer object is created. So yes, our Pickle.jl can be free from the security risk as long as no one register the mapping for eval, or even they do, the mapped function need to understand python code somehow.

Is there ever a downside and you actually want to distribute code with and have it run on deserialization (in Python)? Are you planning then on supporting it (I guess not)? You might want to document this somehow either way, in a positive light... :)

I'm not sure what you mean. Could you elaborate more on that?

@PallHaraldsson
Copy link
Contributor Author

I would expect python people using collections.OrderedDict if the ordered property is required.

No, so we're on the same page, I'll explain my understanding.

This isn't just about it:

https://docs.python.org/3/library/collections.html?highlight=ordered#collections.OrderedDict

also the regular dict:

https://docs.python.org/3.7/library/stdtypes.html#dict

Changed in version 3.7: Dictionary order is guaranteed to be insertion order. This behavior was an implementation detail of CPython from 3.6.

So in all currently supported Python versions, dicts are ordered. So if you read them (from a pickle file) into Julia you would want an ordered dict type too(?).

But I suppose you want to be able to read even older pickle files than made since before 3.7 (not a huge priority?).

I suppose new Python has the same issue with reading old Python/pickle files. I'm guessing since the dicts from then were not ordered they can't be ordered in the pickle file either, so you just get that random order when read in on newer Python (and should too on Julia). But I suppose it still should be read into an ordered dict type (i.e. the same new dict, with the same name, just new semantics).

One other difference between Python and Julia seems to be that Python has a default value, I assume that's why you read into DataStructure.DefaultDict, but you would want to change that to DataStructure.DefaultOrderedDict (I think this would not be a breaking change according to semver). I was going to propose OrderedCollections.OrderedDict that it builds on (also that package is a cheaper dependency, and this might break semver? Not because of Ordered or different package, but because of the lack of Default), but it seems that the Default part matters.

I suppose you have (way) more experience with Python (or both languages) than me. I wanted OrderedDict to be a new default for Julia, and tried to make a PR, but it wasn't accepted. At least for now it doesn't seem like a wanted new default. It could be slower, but I think the ordered part important for the same reasons it is thought to be in Python (and thus was changed). Julia Base only has the one unordered Dict, and they don't want more than one type since it's all Julia itself needs. What would be the most ideal one (or one additional) type there for you or most users? I overlooked the Default part, so I think it might be the DefaultOrderedDict and I think it could be added to OrderedCollections (since it's a much cheaper dependency). However most users would rather I think still use Julia's default dict out of ignorance. Possibly OrderedCollections could be made a stdlib, but it would still be opt in...

the python pickle format is actually the instructions for a stack machine, which is different to python VM

That's very intriguing, not what I assumed.

You state your package is experimental, and it seems to me supporting the Pickle VM fully would likely never work, nor even too useful because of the security issue, and that it is intimately tied to [c]Python, amounting to reimplementing it?

I'm not sure if your main goal was fully supporting pickle or supporting PyTorch, both deserializing and serializing? Would you've achieved deserializing PyTorch fully? And that that is a harder goal than for (a superset of) Pickle? And that you've also achieved serializing for PyTorch use?

I believe Pickle supports all of Python types, i.e. you can serialize arbitrary data, even programs, and user defined types, I guess that means classes. But only supporting the base types would go a long way (plus e.g. BFloat16 that you have), and documenting some usuful application might help, e.g. "fully supports PyTorch; and most Pickle files, but for arbitrary files use PythonCall.jl and CPython with it".

https://github.com/JuliaCollections/DataStructures.jl/blob/ca157a1ddef3057d74d68e267c16892e560f90a7/src/default_dict.jl#L80C1-L80C1

@chengchingwen
Copy link
Owner

I kinda agree the situation of python dict and it is problematic. But instead of changing the default type, I am thinking of some ways to change the behavior of those instructions. So maybe in v0.4.

It is not possible to support PyTorch fully because they can fallback to python pickle, so the situation is the same as normal python pickle.

Personally, this package is a framework for analyzing pickle and building custom file loader/saver for different pickle files without the python dependencies (By different pickle files, I am referring to the pickle containing different kinds of custom type or python package like numpy or pandas.). Fully support pickle without cpython seems impossible for me, unless someone is willing and able to reimplement the python language and probably handling the package interfaces like pybind or cython in julia.

But it is actually possible to fallback to cpython also, by allowing those Defer object to call into PyCall.jl.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants