Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert "Do not cache the type before we finish constructing it." #11632

Merged
merged 1 commit into from
Jun 10, 2015

Conversation

yuyichao
Copy link
Contributor

@yuyichao yuyichao commented Jun 9, 2015

Reverts #11606

As @vtjnash points out #11606 (comment) #11606 doesn't solve all the problems (it only solves some cases that has been brought up.) and there's a wierd CI failure that seems to be related = = ....

Sorry about that....

@JeffBezanson
Copy link
Sponsor Member

Perhaps, but did this cause the CI failure? There's not much point in just frobbing the system until intermittent failures temporarily go away.

Also, solving only some problems is allowed!

@JeffBezanson
Copy link
Sponsor Member

It looks like all those recent failures are on 32-bit, which is suspicious.

@yuyichao
Copy link
Contributor Author

yuyichao commented Jun 9, 2015

Yeah. No idea why 32bit... (It makes some sense for GC (higher pressure) but not so much for the type system............)

@JeffBezanson
Copy link
Sponsor Member

The 32-bit Array{Any,1} failure actually seems quite reliable. Maybe this is exposing another bug; we should take advantage of it. cc @carnaval

@yuyichao
Copy link
Contributor Author

yuyichao commented Jun 9, 2015

Another one of my PR (partially reverted by #11621) seems to be exposing sth else as well (mainly because it's been passing on linux and only cause issue on windows after merging with other commits).

I was going to investigate with another PR later but maybe some windows/llvm expert can make sense of it in its current state....

@tkelman
Copy link
Contributor

tkelman commented Jun 9, 2015

I have no idea, but this is wasting appveyor time on a bunch of other open PR's. I'd rather not make them fail pointlessly.

@tkelman
Copy link
Contributor

tkelman commented Jun 9, 2015

We've seen many of these errors before. On a local win32 build of latest master I get

     * replcompletions     exception on 1: ERROR: LoadError: test failed: 0:-1 == 18:22
 in expression: r == endof(s) - 4:endof(s)
 in default_handler at test.jl:29
 in do_test at test.jl:52
 in anonymous at D:\cygwin64\home\Tony\julia32\test\replcompletions.jl:405
 in open at iostream.jl:113
 in anonymous at D:\cygwin64\home\Tony\julia32\test\replcompletions.jl:402
 in anonymous at no file:401
 in runtests at D:\cygwin64\home\Tony\julia32\test\testdefs.jl:197
 in anonymous at multi.jl:644
 in run_work_thunk at multi.jl:605
 in remotecall_fetch at multi.jl:693
 in anonymous at task.jl:1422
while loading D:\cygwin64\home\Tony\julia32\test\replcompletions.jl, in expression starting on line 395
ERROR: LoadError: LoadError: test failed: 0:-1 == 18:22
 in expression: r == endof(s) - 4:endof(s)
 in default_handler at test.jl:29
 in do_test at test.jl:52
 in anonymous at D:\cygwin64\home\Tony\julia32\test\replcompletions.jl:405
 in open at iostream.jl:113
 in anonymous at D:\cygwin64\home\Tony\julia32\test\replcompletions.jl:402
 in anonymous at no file:401
 in runtests at D:\cygwin64\home\Tony\julia32\test\testdefs.jl:197
 in anonymous at multi.jl:644
 in run_work_thunk at multi.jl:605
 in remotecall_fetch at multi.jl:693
 in anonymous at task.jl:1422
while loading D:\cygwin64\home\Tony\julia32\test\replcompletions.jl, in expression starting on line 395
while loading D:\cygwin64\home\Tony\julia32\test\runtests.jl, in expression starting on line 5

This is #10875, which was fallout from #10380. I don't pretend to understand this type cache stuff, but it's awfully fragile. We sure could use a more deterministic set of tests for it.

For now I think we should merge this.

@carnaval
Copy link
Contributor

Tried a 32b VM with the same ubuntu version as travis, no luck reproducing.

The problem is during the evaluation of typeassert(something,Array{Any,1}). The something is allocated by C code (jl_matching_methods) which does not even go through the type cache (using the array_any_type "cache").

The Array{Any,1} should be a pointer inlined in the ast by inference. It is created by the apply_type tfunc, stored into the .typ (wrapped in a Type{}) and then replaces the expression during the inlining pass.

If for any reason (OOM ? the type cache has 3/2 geom growth but still this looks like a tiny thing) the apply_type fails in inference it would go through static_eval in codegen on what I believe is a codepath which is not much used since for constant arguments inference should have figured it out already (and this does not check for exceptions!).

So either something wrong happens in one of those two cases, or the type cache logic itself is wrong in an almost-right way (since everything mostly works).

@yuyichao
Copy link
Contributor Author

I was also trying to reproduce it and hit #11642 ...

@yuyichao
Copy link
Contributor Author

If for any reason (OOM ? the type cache has 3/2 geom growth but still this looks like a tiny thing) the apply_type fails in inference it would go through static_eval in codegen on what I believe is a codepath which is not much used since for constant arguments inference should have figured it out already (and this does not check for exceptions!).

Is this related to why D(1.1) is allowed in #11597 (comment) ?

@JeffBezanson
Copy link
Sponsor Member

Maybe it's related to the somewhat-special treatment of jl_array_any_type in dump.c?

@carnaval
Copy link
Contributor

Now that I think of it, it cannot even be a caching problem since subtype falls back on jl_egal of parameters if pointer eq. fails. For example :

julia> abstract K
julia> VK = Vector{K}
Array{K,1}
julia> VK.parameters = Base.svec(Any,1); VK
Array{Any,1}
julia> []::VK
0-element Array{Any,1}
julia> 

works fine (one could argue it should not but thats beside the point), as well as in generated code.

So the only way for the error to happen is for those two types to have a different ->name pointer. I don't think we can assume those are corrupted or something since the printing of the error works fine. It's not like we create typenames in a lot of places either so it pretty much has to be during deserialization IMO.

@yuyichao
Copy link
Contributor Author

So

julia> function f()
       VK = Vector{K}
       VK.parameters = Base.svec(Any, 1)
       []::VK
       end
f (generic function with 1 method)

julia> f()
ERROR: TypeError: f: in typeassert, expected Array{Any,1}, got Array{Any,1}
 in f at ./none:4

@carnaval
Copy link
Contributor

Ha. I was wrong indeed, we do codegen pointer equality for leaf typeasserts (emit_typecheck), so it is likely to be a caching problem. The typetag of the object is surely the "right" one since it comes straight from alloc_cell_1d/an_empty_cell. The pointer we are putting in the generated code has probably avoided the cache in some way.

tkelman added a commit that referenced this pull request Jun 10, 2015
Revert "Do not cache the type before we finish constructing it."
@tkelman tkelman merged commit 4e5767e into JuliaLang:master Jun 10, 2015
@tkelman
Copy link
Contributor

tkelman commented Jun 10, 2015

This has made a day's worth of CI builds fail. Not worth it. Please consider it a little more urgent to keep master functioning, or it disrupts everyone who tries to contribute even in completely unrelated areas.

Whatever you do to deal with this, it needs a deterministic way of testing it, or this is going to keep happening over and over again.

@carnaval
Copy link
Contributor

Ok found the issue, thanks to the Travis folks giving ssh into a vm.

In the core test we create a type Node2{T}; v::Vector{Node2{T}}; end. It is uniqued correctly but Vector{Node2{Int}} is inserted into the Array typecache before having be assigned an uid, so it goes into the wrong spot (based on object_id) and makes the binary search fail sometimes when we pick it as the middle later when it now has one. I think we could put this patch back, you just have to assign the uid before instantiating the fields (but cache it at the end).

I'm not sure it solves the original "error while instantiating" but at least we know why it failed even without it :-)

@yuyichao @JeffBezanson

@carnaval
Copy link
Contributor

By the way I think it wouldn't hurt to have a quick look at all the places that touches those uid since everything can "mostly work" when only a few types are compared wrong.

@yuyichao
Copy link
Contributor Author

@carnaval I would have to read the code before I can fully understand what you meant but from what I understand, shouldn't delaying cache insertion make it less likely to happen?

@carnaval
Copy link
Contributor

The ordering of A{T} in the cache is given by either T->uid or, if it is 0, object_id(T). The way (I think) it went was :
apply_type(Node2,Int)
apply_type(Vector,Node2{Int})
insert_into_cache(Vector, Node2{Int}) => based of object_id(Node2{Int})
insert_into_cache(Node2, Int)
setup_uid_for(Node2{Int}) => next time we index into Vector.cache we might cross path with Vector{Node2{Int}} which will now compare with the new uid, so the cache is not sorted anymore

@carnaval
Copy link
Contributor

(then when the cache is not sorted, you can have a false cache miss, say for Vector{Any}, which makes you create a new instance and crash)

@yuyichao
Copy link
Contributor Author

Shouldn't

    if (!jl_is_abstracttype(type) && ((jl_datatype_t*)type)->uid==0)
        ((jl_datatype_t*)type)->uid = jl_assign_type_uid();

in cache_insert_type make sure a uid is assigned before the cache insertion?

@JeffBezanson
Copy link
Sponsor Member

The UID needs to be assigned before instantiating fields.

@carnaval
Copy link
Contributor

yup, its the uid of a type parameter (of Vector) which is missing for an element being inserted in the Vector cache, not the Vector type itself

@yuyichao
Copy link
Contributor Author

Ahh. I see that the is of the type parameters is used in typekey_compare.

@yuyichao
Copy link
Contributor Author

Thanks for the explanation. That part of the code (and the problem) makes much more sense for me now.

I'll try to come up with a better PR if you didn't fix it before me =).

@carnaval
Copy link
Contributor

Feel free to go ahead, I had my dose of type cache today ;)

@yuyichao
Copy link
Contributor Author

@carnaval Do you think this (and this) are related?

@tkelman
Copy link
Contributor

tkelman commented Jun 12, 2015

@yuyichao pretty sure I've seen those on Travis too, those do look quite similar in that type comparison to Vector{Any} is messed up.

@JeffBezanson
Copy link
Sponsor Member

Bump. @yuyichao are you planning to write a patch for this or should I do it?

@tkelman
Copy link
Contributor

tkelman commented Jun 15, 2015

@yuyichao
Copy link
Contributor Author

@JeffBezanson please go ahead if you want it to be fixed. Sorry if I stopped you from doing it.

@JeffBezanson
Copy link
Sponsor Member

Done.

@IainNZ
Copy link
Member

IainNZ commented Jun 16, 2015

fecadcb

@yuyichao yuyichao deleted the revert-11606-delay-cache-type branch July 25, 2015 21:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants