Switch to Arrow storage format for TestData #381

dmbates · 2020-09-19T00:51:30Z

The experience with #380 makes me more convinced that it would be good to switch from the Feather storage format, which brings in DataFrames and CategoricalArrays when reading the file, to the new Arrow format as implemented in https://github.com/JuliaData/Arrow.jl (note that this is not the currently registered repository for Arrow). On the Slack data channel, Jacob indicated that he hopes to release the new Arrow implementation in a week or so.

It will take us a while to switch formats because all the datasets must be saved in the new format and I haven't worked out a way of having both Feather and the new Arrow loaded at the same time.

palday · 2020-09-19T07:52:13Z

I'm happy to do the conversion, but I still can't figure out how to read the produced files into Python. The documentation for the relevant packages there treats Arrow as a memory format and not as a disk format, and none of the various disk formats listed seems to match the output of Arrow.jl.
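
One way to narrow down which on-disk format a given file uses is to look at its magic bytes. The sketch below relies on two facts from the format specifications: an Arrow IPC file (the random-access format, also called Feather V2) begins and ends with the bytes ARROW1, and a legacy Feather V1 file begins and ends with FEA1. An Arrow IPC stream carries no such marker. The name sniff_format is hypothetical, and the code uses only the Python standard library.

```python
import os

ARROW_FILE_MAGIC = b"ARROW1"  # Arrow IPC file format: present at both ends of the file
FEATHER_V1_MAGIC = b"FEA1"    # legacy Feather V1: present at both ends of the file


def sniff_format(path):
    """Guess the on-disk format of `path` from its first and last bytes.

    Hypothetical helper: returns "arrow-file", "feather-v1", or "other".
    "other" includes the Arrow IPC streaming format, which has no magic string.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(8)
        f.seek(max(size - 6, 0))
        tail = f.read(6)
    if head.startswith(ARROW_FILE_MAGIC) and tail == ARROW_FILE_MAGIC:
        return "arrow-file"
    if head.startswith(FEATHER_V1_MAGIC) and tail.endswith(FEATHER_V1_MAGIC):
        return "feather-v1"
    return "other"
```

If this reports "other" for a file written by Arrow.jl, the file may be in the streaming format, in which case a stream reader rather than a file reader would be the thing to try on the Python side.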

dmbates · 2020-09-19T17:14:39Z

I am trying out the conversion now. Did you see Jacob's answer on https://julialang.slack.com/archives/C674VR0HH/p1600454109147800?

I wasn't quite sure what arguments could be used to open_file as in

import pyarrow as pa
df = pa.ipc.open_file(buf).read_pandas()

dmbates · 2020-09-20T03:41:40Z

I have added the Arrow files to the osf.io repo. If you add the master branch of https://github.com/JuliaData/Arrow.jl (which also requires the master branch of Tables.jl), you can read these files with, e.g., Arrow.Table("cbpp.arrow").

dmbates · 2020-09-20T15:12:08Z

This issue may come to the fore earlier than we had anticipated. I just installed a prerelease version of julia-1.5.2 and was unable to test MixedModels because compilation of the release version of Arrow.jl (from https://github.com/ExpandingMan/Arrow.jl) segfaulted. The development version in https://github.com/JuliaData/Arrow.jl did not segfault.

In the discourse.julialang.org discussion on julia-1.5.2, the conclusion seems to be that the compilation failure is in CategoricalArrays.jl and is a problem for any version in the 1.5 series. It does not show up in 1.5.1 because assertions are not turned on in the distributed version, whereas they are in the 1.5.2 test version.

julia: /buildworker/worker/package_linux64/build/src/subtype.c:1978: jl_types_equal: Assertion `subtype_ab == 3 || subtype_ab == subtype || jl_has_free_typevars(a) || jl_has_free_typevars(b)' failed.

signal (6): Aborted
in expression starting at /home/bates/.julia/packages/Arrow/q3tEJ/src/Arrow.jl:3
gsignal at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7f52176df728)
__assert_fail at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
jl_types_equal at /buildworker/worker/package_linux64/build/src/subtype.c:1978
jl_typemap_entry_lookup_by_type at /buildworker/worker/package_linux64/build/src/typemap.c:537
jl_typemap_assoc_by_type at /buildworker/worker/package_linux64/build/src/typemap.c:599
check_ambiguous_visitor at /buildworker/worker/package_linux64/build/src/gf.c:1302
jl_typemap_intersection_node_visitor at /buildworker/worker/package_linux64/build/src/typemap.c:312
jl_typemap_intersection_visitor at /buildworker/worker/package_linux64/build/src/typemap.c:408
jl_typemap_intersection_visitor at /buildworker/worker/package_linux64/build/src/typemap.c:399
check_ambiguous_matches at /buildworker/worker/package_linux64/build/src/gf.c:1394
jl_method_table_insert at /buildworker/worker/package_linux64/build/src/gf.c:1709
jl_insert_methods at /buildworker/worker/package_linux64/build/src/dump.c:2292 [inlined]
_jl_restore_incremental at /buildworker/worker/package_linux64/build/src/dump.c:3248
jl_restore_incremental at /buildworker/worker/package_linux64/build/src/dump.c:3299
_include_from_serialized at ./loading.jl:681
_require_search_from_serialized at ./loading.jl:782
_require at ./loading.jl:1007
require at ./loading.jl:928
require at ./loading.jl:923
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1690 [inlined]
call_require at /buildworker/worker/package_linux64/build/src/toplevel.c:425 [inlined]
eval_import_path at /buildworker/worker/package_linux64/build/src/toplevel.c:462
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:682
jl_eval_module_expr at /buildworker/worker/package_linux64/build/src/toplevel.c:197
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:666
jl_parse_eval_all at /buildworker/worker/package_linux64/build/src/ast.c:913
jl_load_rewrite at /buildworker/worker/package_linux64/build/src/toplevel.c:914
include at ./Base.jl:380
include at ./Base.jl:368
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1690 [inlined]
do_call at /buildworker/worker/package_linux64/build/src/interpreter.c:117
eval_value at /buildworker/worker/package_linux64/build/src/interpreter.c:206
eval_stmt_value at /buildworker/worker/package_linux64/build/src/interpreter.c:157 [inlined]
eval_body at /buildworker/worker/package_linux64/build/src/interpreter.c:552
eval_body at /buildworker/worker/package_linux64/build/src/interpreter.c:492
jl_interpret_toplevel_thunk at /buildworker/worker/package_linux64/build/src/interpreter.c:660
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:840
jl_toplevel_eval_in at /buildworker/worker/package_linux64/build/src/toplevel.c:883
eval at ./boot.jl:331 [inlined]
eval at ./client.jl:467
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
top-level scope at ./none:3
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2231 [inlined]
jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2238
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:834
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:790
jl_toplevel_eval_in at /buildworker/worker/package_linux64/build/src/toplevel.c:883
eval at ./boot.jl:331
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
exec_options at ./client.jl:272
_start at ./client.jl:506
jfptr__start_52252.clone_1 at /home/bates/src/julia-1.5.2-DEV/lib/julia/sys.so (unknown line)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2214 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2398
jl_apply at /buildworker/worker/package_linux64/build/ui/../src/julia.h:1690 [inlined]
true_main at /buildworker/worker/package_linux64/build/ui/repl.c:106
main at /buildworker/worker/package_linux64/build/ui/repl.c:227
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
_start at /home/bates/src/julia-1.5.2-DEV/bin/julia (unknown line)
Allocations: 2545 (Pool: 2535; Big: 10); GC: 0
ERROR: LoadError: Failed to precompile Arrow [69666777-d1a9-59fb-9406-91d4454c9d45] to /home/bates/.julia/compiled/v1.5/Arrow/QnF3w_WvzKc.ji.
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] compilecache(::Base.PkgId, ::String) at ./loading.jl:1305
 [3] _require(::Base.PkgId) at ./loading.jl:1030
 [4] require(::Base.PkgId) at ./loading.jl:928
 [5] require(::Module, ::Symbol) at ./loading.jl:923
 [6] include(::Function, ::Module, ::String) at ./Base.jl:380
 [7] include(::Module, ::String) at ./Base.jl:368
 [8] top-level scope at none:2
 [9] eval at ./boot.jl:331 [inlined]
 [10] eval(::Expr) at ./client.jl:467
 [11] top-level scope at ./none:3
in expression starting at /home/bates/.julia/packages/Feather/pbm3o/src/Feather.jl:3
ERROR: LoadError: Failed to precompile Feather [becb17da-46f6-5d3c-ad1b-1c5fe96bc73c] to /home/bates/.julia/compiled/v1.5/Feather/RgcL0_WvzKc.ji.
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] compilecache(::Base.PkgId, ::String) at ./loading.jl:1305
 [3] _require(::Base.PkgId) at ./loading.jl:1030
 [4] require(::Base.PkgId) at ./loading.jl:928
 [5] require(::Module, ::Symbol) at ./loading.jl:923
 [6] include(::Function, ::Module, ::String) at ./Base.jl:380
 [7] include(::Module, ::String) at ./Base.jl:368
 [8] top-level scope at none:2
 [9] eval at ./boot.jl:331 [inlined]
 [10] eval(::Expr) at ./client.jl:467
 [11] top-level scope at ./none:3
in expression starting at /home/bates/.julia/dev/MixedModels/src/MixedModels.jl:5
ERROR: LoadError: Failed to precompile MixedModels [ff71e718-51f3-5ec2-a782-8ffcbfa3c316] to /home/bates/.julia/compiled/v1.5/MixedModels/tBiYK_WvzKc.ji.
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] compilecache(::Base.PkgId, ::String) at ./loading.jl:1305
 [3] _require(::Base.PkgId) at ./loading.jl:1030
 [4] require(::Base.PkgId) at ./loading.jl:928
 [5] require(::Module, ::Symbol) at ./loading.jl:923
 [6] include(::String) at ./client.jl:457
 [7] top-level scope at none:6
in expression starting at /home/bates/.julia/dev/MixedModels/test/runtests.jl:1
ERROR: Package MixedModels errored during testing

dmbates mentioned this issue Sep 20, 2020

Use arrow format for datasets [ci skip] #382

Merged

dmbates mentioned this issue Sep 28, 2020

Testing on julia-v1.6.0-DEV #387

Closed

dmbates added this to the MixedModels 3.0 milestone Oct 4, 2020

dmbates closed this as completed in #382 Oct 5, 2020
