Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Register .pkl in FileIO? #8

Open
findmyway opened this issue Jun 8, 2020 · 15 comments
Open

Register .pkl in FileIO? #8

findmyway opened this issue Jun 8, 2020 · 15 comments

Comments

@findmyway
Copy link
Contributor

This is a low priority issue.

Ref: https://github.com/JuliaIO/FileIO.jl#adding-new-formats

@chengchingwen
Copy link
Owner

Is .pkl the standard file extension for pickle format?

@johnnychen94
Copy link

johnnychen94 commented Apr 29, 2022

From https://fileinfo.com/extension/pkl it seems that .pkl, .p and .pickle are all valid extensions.

To avoid conflict with other file formats, FileIO requires you to write a small detector to check the first few bytes, for instance, https://github.com/JuliaIO/FileIO.jl/blob/21f435dc3ea63c4786e4cb894f855aef0692c056/src/registry.jl#L108

Thus it might be

function is_pickle_format(...)
...
end

add_format(format"PyPickle", detect_pkl_format, [".pkl", ".p", ".pickle"], [idPickle])

By doing this, when FileIO sees a file with extension in ".pkl", ".p", or ".pickle", it will call detect_pkl_format to check if it's really a valid PICKLE file, and if it is, call Pickle.jl to load the data.

If pickle format has fixed bytes at the beginning of the file, say pickle_magic_bytes, then this can be simplified to

# pickle_magic_bytes can be a UInt8 vector as bytes, or a String
add_format(format"PyPickle", pickle_magic_bytes, [".pkl", ".p", ".pickle"], [idPickle])

The format"PyPickle" name is somehow arbitrary as long as Pickle.jl provides the corrresponding save(::File{DataFormat{:PyPickle}, ...) and optionally save(::Stream{DataFormat{:PyPickle}}, ...) functions.

You can check Netpbm.jl out as a simple example.

@chengchingwen
Copy link
Owner

Unfortunately regular python pickle don't have magic bytes. We would need to scan through the whole file to see if it is potentially loadable by the pickler. Would that be acceptable?

@johnnychen94
Copy link

Googling .pkl it seems that Python pickle is the only format that uses this extension.

add_format(format"PyPickle", "", [".pkl", ".p", ".pickle"], [idPickle])

If .p and .pickle extensions are not used commonly, we can register ".pkl" only.


BTW, it says pickle format is unsafe: https://docs.python.org/3/library/pickle.html. Does this package handle this security issue?

@chengchingwen
Copy link
Owner

Pickle format is unsafe because it can execute arbitrary python functions by explicitly storing instructions to call those functions. But since we don't/can't map every python function into equivalent Julia ones, we won't encounter that security issue. You can safely "load" any legal pickle format file, even the malicious one, but they would just return a Pickle.Defer.

This brings up another issue. If we want to get the stored data correctly (i.e. without any Defer in it), it can only contain data of basic types (otherwise we need to manually add those function mapping with Pickle.jl). I'm not sure if this match the expectation when someone use FileIO.jl.

@johnnychen94
Copy link

it can only contain data of basic types (otherwise we need to manually add those function mapping with Pickle.jl). I'm not sure if this match the expectation when someone use FileIO.jl.

As long as it's clearly documented somewhere in the Pickle README or raises a friendly error message, I think it already improves the usability. BSON as another Julia serialization format is also registered in FileIO and I haven't seen people complaining about this.

@CarloLucibello
Copy link
Contributor

We could also register ".npy", ".npz" and pytorch's ".pt" or ".pth" support

@johnnychen94
Copy link

I thought ".npy" and ".npz" are already supported now JuliaIO/FileIO.jl#358?

@CarloLucibello
Copy link
Contributor

Yes by NPZ.jl, but we can register Pickle.jl as an alternative

@chengchingwen
Copy link
Owner

NPZ and Pickle are different. ".npy" and ".npz" are special format defined by numpy. You cannot load a numpy file by pickle and vice versa.

Usually there are no pickle format in numpy format, unless you're storing ndarray with object dtype. This means there're actually 2 different ways (format, to be precise) to serialize a ndarray in numpy. One is numpy's own format npy/npz, another one is numpy's extension of pickle format.

@chengchingwen
Copy link
Owner

Some explanation about each format:

  • Pickle format: a stack machine bytecodes that help to build the serialized data by executing the instructions. Has instruction to call any function by function name, so most object are/can be restored by calling their constructor.

  • Pickle in python: The stack machine. Each python object has an extensible method that return the underlying data of that object. The function call instruction are always evaluated, which leads to the security issue.

  • Pickle.jl: An Julia implementation of the stack machine. We have a lookup table for those function call instructions, or return a Pickle.Defer if the mapping doesn't exist (which bypass the security issue).

  • Numpy:

    • numpy pickle: extending those methods for ndarray so it can be stored/load with pickle.
    • .npy format w/o object dtype: basically a direct write of the array data with some header (magic number, shape, size, dtype, etc.).
    • .npy format w/ object dtype: a mixture of pickle and .npy. The object is encode by pickle and serialized with .npy header.
    • .npz format: a zip of multiple .npy file.
  • Torch Pickle: A modified pickle format

    • legacy format: Containing multiple pickle and a large chunk of array data. Those pickle are:
      1. dict of system config (magic, endian, word size, ...)
      2. Stored object. Tensors only contain descriptor without the actually data. Each Tensor descriptor have a storage key
      3. a dict of storage key to chuck offset
    • new format: a zip with a pickle file for the stored object and multiple storage files. Each tensor storage is in a single file, so the file name is the storage key.

So npy/npz and torch pickle are not real pickle file, but some custom formats that use pickle format as part of its definition.

@CarloLucibello
Copy link
Contributor

Thanks for the explanation, these formats were very confusing to me.
But still, Pickle.jl is able to load .npy file (in both formats?) and torch pickle files (in both formats?), right?

@chengchingwen
Copy link
Owner

torch pickle? yes.

npy file? no, I didn't handle that, but shouldn't be too hard to add.

@CarloLucibello
Copy link
Contributor

But I have been using with success Pickle.npyload(path) ...

@chengchingwen
Copy link
Owner

I wonder if it has the wrong file extension? Could you try loading with NPZ.npzread?

julia> using PyCall, NPZ

julia> np = pyimport("numpy")
PyObject <module 'numpy' from '/home/peter/pyenv/lib/python3.6/site-packages/numpy/__init__.py'>

julia> np.save("npsave.npy", np.random.randn(10))

julia> Pickle.npyload("npsave.npy")
ERROR: ArgumentError: Deque must be non-empty
Stacktrace:
[...]

julia> NPZ.npzread("npsave.npy")
10-element Vector{Float64}:
 -1.064963962117236
  0.5587883282776522
 -1.8944936182698036
  1.2505924109230577
  0.36063115028026194
  3.3166609146251327
 -2.31110326775469
 -1.0430525977835379
  1.2946367099663125
 -1.441233594994848

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants