Distributed IOError: write: bad address in system call argument (EFAULT) #33899

Open
oxinabox opened this issue Nov 20, 2019 · 2 comments · Fixed by #33942
Labels
io Involving the I/O subsystem: libuv, read, write, etc. parallelism Parallel or distributed computation

Comments

@oxinabox
Contributor

Sending a closure that captures some data to worker processes sometimes fails with EFAULT.
I don't have a truly minimal reproducer, but this one is heavily cut down from my real code.
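
For context, a closure carries its captured data with it when it is serialized. The following is a minimal sketch for estimating how many bytes such a closure serializes to, assuming the `Serialization` stdlib gives a rough indication of what is sent to a worker. The names `payload` and `nbytes` are illustrative and not part of the report, and the array here is a small stand-in, not the data from the reproducer.

```julia
using Serialization

# Same shape as the closure factory in the reproducer: the returned
# anonymous function captures `data`.
mkfun(data) = ii -> (ii, sizeof(data))

# Stand-in payload (assumption: any plain array is enough to show the effect).
payload = rand(Float64, 1_000)
f = mkfun(payload)

# Serialize the closure into an in-memory buffer and count the bytes.
buf = IOBuffer()
serialize(buf, f)
nbytes = length(take!(buf))

println("closure serializes to ", nbytes, " bytes; captured data is ", sizeof(payload), " bytes")
```

The serialized size is at least the size of the captured array, so a closure over a large collection means a correspondingly large write to each worker.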

MVE

My full Project.toml and Manifest.toml can be found in this gist.

using Distributed
addprocs(3)
@everywhere using Pkg
@everywhere Pkg.activate(".")
@everywhere using Dates, TimeZones, Intervals

const dts = DateTime(2000,11,10) .+ Hour.(0:24*60)
const zdts = ZonedDateTime.(dts, localzone())
const ivals = Interval.(zdts, zdts.+Hour(1))
const many = rand(ivals, 297_944)  # Need quite a few of these

mkfun(data) =  ii->(ii, Distributed.myid(), sizeof(data))
const fun2 = mkfun(many)

# One of these will likely fail
pmap(fun2, CachingPool(workers()), 1:10)
pmap(fun2, CachingPool(workers()), 1:10)
pmap(fun2, CachingPool(workers()), 1:10)
pmap(fun2, CachingPool(workers()), 1:10)
pmap(fun2, CachingPool(workers()), 1:10)
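
Since the failure is intermittent, it can help to record which attempts fail instead of eyeballing five separate calls. A minimal sketch, assuming a hypothetical helper `failing_attempts` (not part of Distributed or of the original report):

```julia
# Hypothetical helper: call `f` `n` times and collect the attempt numbers
# that threw, together with the exception that was caught.
function failing_attempts(f, n)
    failures = Tuple{Int,Any}[]
    for attempt in 1:n
        try
            f()
        catch err
            push!(failures, (attempt, err))
        end
    end
    return failures
end

# Usage with the definitions from the MVE above (needs the workers and `fun2`):
# failing_attempts(() -> pmap(fun2, CachingPool(workers()), 1:10), 5)
```

An empty result means none of the attempts threw; otherwise each entry pairs the attempt number with the error, for example the `IOError`.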

Error:

ERROR: IOError: write: bad address in system call argument (EFAULT)
Stacktrace:
 [1] (::getfield(Base, Symbol("##684#686")))(::Task) at ./asyncmap.jl:178
 [2] foreach(::getfield(Base, Symbol("##684#686")), ::Array{Any,1}) at ./abstractarray.jl:1835
 [3] maptwice(::Function, ::Channel{Any}, ::Array{Any,1}, ::UnitRange{Int64}) at ./asyncmap.jl:178
 [4] #async_usemap#669 at ./asyncmap.jl:154 [inlined]
 [5] #async_usemap at ./none:0 [inlined]
 [6] #asyncmap#668 at ./asyncmap.jl:81 [inlined]
 [7] #asyncmap at ./none:0 [inlined]
 [8] #pmap#213(::Bool, ::Int64, ::Nothing, ::Array{Any,1}, ::Nothing, ::Function, ::Function, ::CachingPool, ::UnitRange{Int64}) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.0/Distributed/src/pmap.jl:126
 [9] pmap(::Function, ::CachingPool, ::UnitRange{Int64}) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.0/Distributed/src/pmap.jl:101
 [10] top-level scope at none:0

Versions

So far I have only reproduced this on the LTS release.
It might also happen on other versions.
I see it on both Mac and Linux.

julia> versioninfo()
Julia Version 1.0.5
Commit 3af96bcefc (2019-09-09 19:06 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.6.0)
  CPU: Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)
@oxinabox
Contributor Author

I have now also reproduced this on Julia 1.3-RC5.

vtjnash added a commit that referenced this issue Nov 25, 2019
More cases similar to those identified in #23914
fix #33899
@oxinabox
Contributor Author

oxinabox commented Nov 26, 2019

I am still seeing this on current Julia master.
It seems harder to trigger, but it is still possible:

LoadError: IOError: write: bad address in system call argument (EFAULT)
Stacktrace:
 [1] (::Base.var"#730#732")(::Task) at ./asyncmap.jl:178
 [2] foreach(::Base.var"#730#732", ::Array{Any,1}) at ./abstractarray.jl:1920
 [3] maptwice(::Function, ::Channel{Any}, ::Array{Any,1}, ::StepRange{Date,Day}) at ./asyncmap.jl:178
 [4] wrap_n_exec_twice(::Channel{Any}, ::Array{Any,1}, ::Distributed.var"#208#211"{CachingPool}, ::Function, ::StepRange{Date,Day}) at ./asyncmap.jl:154
 [5] #async_usemap#715(::Function, ::Nothing, ::typeof(Base.async_usemap), ::Distributed.var"#192#194"{Distributed.var"#192#193#195"{CachingPool,var"#5#6"{typeof(financial_day),ForecastAgent{MISO},S3DB.TestClient}}}, ::StepRange{Date,Day}) at ./asyncmap.jl:103
 [6] (::Base.var"#kw##async_usemap")(::NamedTuple{(:ntasks, :batch_size),Tuple{Distributed.var"#208#211"{CachingPool},Nothing}}, ::typeof(Base.async_usemap), ::Function, ::StepRange{Date,Day}) at ./none:0
 [7] #asyncmap#714 at ./asyncmap.jl:81 [inlined]
 [8] #asyncmap at ./none:0 [inlined]
 [9] #pmap#207(::Bool, ::Int64, ::Nothing, ::Array{Any,1}, ::Nothing, ::typeof(pmap), ::Function, ::CachingPool, ::StepRange{Date,Day}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/pmap.jl:126
 [10] pmap at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/pmap.jl:101 [inlined]

I haven't yet managed to reproduce it with a relatively simple example like the one above.
But it still happens in my real use case,
with exactly the same message as before.

@vtjnash vtjnash reopened this Nov 26, 2019
KristofferC pushed a commit that referenced this issue Nov 29, 2019
More cases similar to those identified in #23914
fix #33899

(cherry picked from commit e1086fe)
KristofferC pushed a commit that referenced this issue Dec 4, 2019
More cases similar to those identified in #23914
fix #33899

(cherry picked from commit e1086fe)
KristofferC pushed a commit that referenced this issue Apr 11, 2020
More cases similar to those identified in #23914
fix #33899
@brenhinkeller added the parallelism, filesystem, and io labels and removed the filesystem label on Nov 20, 2022