Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce libjuliarepl to break dependence on runtime libraries #36588

Merged
merged 2 commits into from
Sep 24, 2020

Conversation

staticfloat
Copy link
Sponsor Member

This changes ui/repl.c to compile into a new library, libjuliarepl.
This library is loaded by a very simple loader executable, defined in
loader.c. This loader can have paths to dependent libraries embedded
wtihin it in order to ensure that certain libraries have been loaded
within Julia by the time libjuliarepl and libjulia itself are
attempted to be loaded within Julia.

src/julia.expmap Outdated Show resolved Hide resolved
ui/loader.c Outdated Show resolved Hide resolved
@JeffBezanson
Copy link
Sponsor Member

The names ui and repl have been inaccurate for a while now. I would take this opportunity to rename them to cli (command line interface). cli/cli.c, libjuliacli, etc.

Also repl.c is so small we could consider just moving its code into libjulia.

@davidanthoff
Copy link
Contributor

This looks great!

I think the only drawback I can think of right now is for the scenario that @GunnarFarneback mentioned in the other PR: if a third party executable wants to load libjulia, that won't work as easily as today, right? Because that binary would have to go through the same manual LoadLibrary calls that the new loader performs? The only solution (in the spirit of this PR here) that I can think of is to have a structure where julia.exe loads some libjulia.dll that is a new layer. The calls to LoadLibrary that are in julia.exe in this PR here are moved to libjulia.dll, and then libjulia.dll loads something like truelibjulia.dll, which contains what is today called libjulia.dll. The new libjulia.dll would then have to re-export everything from truelibjulia.dll. A third party exe could then just load this new libjulia.dll, and it would correctly load all the deps of truelibjulia.dll before loading truelibjulia.dll itself. But not clear to me how important that scenario is.

@StefanKarpinski
Copy link
Sponsor Member

Also repl.c is so small we could consider just moving its code into libjulia.

Even if it isn't part of libjulia.so it might make sense to move it into src/ and just have the libraries be separate compilation targets. Having a separate cli directory for a single source file and compilation target seems a bit silly. Historically, there were other UIs (there was a whole C++ webserver interface at one point!).

@GunnarFarneback
Copy link
Contributor

But not clear to me how important that scenario is.

Very important if you ask me. It's a major feature that the third party executable can load multiple versions of Julia without recompilation (it doesn't even include julia.h) and without needing big helper libraries.

@davidanthoff
Copy link
Contributor

The libjulia question probably affects pyjulia also? CC @tkf.

@tkf
Copy link
Member

tkf commented Jul 9, 2020

Thanks for the ping. Yes, something like libjulia.dll/truelibjulia.dll would be great (or maybe libjulialoader.dll/libjulia.dll to match the C-API to other platforms?).

At least, I think this DEP_LIBS should be accessible through some API since it sounds like it's tricky to get the path and ordering right:

julia/cli/loader.c

Lines 27 to 33 in fbcbac7

/*
* DEP_LIBS is our list of dependent libraries that must be loaded before `libjulia`.
* Note that order matters, as each entry will be opened in-order. We define here a
* dummy value just so this file compiles on its own, and also so that developers can
* see what this value should look like. Note that the last entry must always be
* `libjuliarepl`, and that all paths should be relative to this loader `.exe` path.
*/

It'd be great if we can have something like jl_load_deps(bindir) in the shim/loader DLL.

For PyJulia, it's OK if DEP_LIBS is only available through a Julia function (say). But I imagine having jl_load_deps(bindir) would be beneficial/required for other people using the C API.

@staticfloat
Copy link
Sponsor Member Author

So what I'm hearing is the following:

  • Let's get rid of the extra repl.c/cli.c code, and just bundle that into libjulia.

  • We need "no-special-treatment" entrypoints that are both an executable and a library. So let's have julia.exe load libjulia directly, and let's have libjulialoader do the setup that julia.exe would have done, and load libjulia. In this manner, we can have embedding situations just load libjulialoader first, and it will take care of anything that needs to happen.

Does that satisfy all usecases?

@tkf
Copy link
Member

tkf commented Jul 9, 2020

Sounds great to me!

@GunnarFarneback
Copy link
Contributor

It sounds like it would be fine.

Does this PR have any consequences for the jl_init_with_image story in #32614?

@staticfloat
Copy link
Sponsor Member Author

I think the concerns within #32614 are the same with or without this PR.

@staticfloat
Copy link
Sponsor Member Author

@vtjnash I've updated this to work much better on Windows, unfortunately I have to provide my own, worse, definitions for many library functions but that's fine. I'm obfuscating the names away so they don't conflict with anything that is imported in the future that actually wants to find a performant memcpy, for instance.

Unfortunately, when I attempt to build this on Windows, it never completes bootstrap. Looking at stack traces taken while it's running, it looks to me like everything is chugging along properly, it's compiling code and whatnot, but it just never finishes. I can't figure out what here would cause it to never finish. The only thing I can think of is that something has caused a massive slowdown here such that the initial compiler/compiler.jl bootstrap is taking hours instead of seconds.

Do you have any ideas on what might be causing this?

@staticfloat
Copy link
Sponsor Member Author

Ah, looking into it a wee bit more, I see that something is getting hung on libuv_tty_reset_mode()....

image

@staticfloat
Copy link
Sponsor Member Author

I finally, FINALLY, figured it out. And it's oh so simple. I just wasn't calling exit() at the end of my main() method, and because we're running outside of the CRT on windows there, threads that are hung in an infinite loop (such as the libuv TTY resizing message handling loop) keep us hung forever. Hilarious. Check your assumptions people; return ret; at the end of your main() won't actually end your program if you don't have a C runtime library calling exit() for you.

Now we can move on to more typical issues; I now get halfway through bootstrap on windows, but ccall(:fesetround, ...) fails to find fesetround, presumably because we need to pass the explicit libc handle into ccall() because we're no longer linking the libc into julia.exe with the system's dynamic linker, but instead via LoadLibraryEx(). One way to solve this would be to record the libc name in Libc.libc_name or something (which might be useful down the road) and pass that into ccall(), but I also see this note in jl_dlfind_win32() which mentions the possibility of automatically searching the appropriate libc for symbols.

@vtjnash what do you suggest here? I agree with your comment in that function that explicit is better than implicit, but I also don't want to open ourselves up to issues here where we don't get the libc name just right, or certain functions are put into libgcc_s on some platforms and in some other name on other platforms, etc... In this case, fesetround() is probably pretty easy, but I'm sure we'll run into more of these after I fix this one.

@staticfloat
Copy link
Sponsor Member Author

@vtjnash: Attempting to statically link libopenlibm.a into libjulia.dll results in:

ccache i686-w64-mingw32-g++ -march=pentium4 -m32 -shared -Wl,--out-implib,/cygdrive/d/buildbot/worker/package_win32/build/usr/lib/libjulia.dll.a -pipe  -fno-rtti -mincoming-stack-boundary=2  -O3 -ggdb2 -falign-functions -momit-leaf-frame-pointer -D_GNU_SOURCE -I. -I/cygdrive/d/buildbot/worker/package_win32/build/src -I/cygdrive/d/buildbot/worker/package_win32/build/src/flisp -I/cygdrive/d/buildbot/worker/package_win32/build/src/support -I/cygdrive/d/buildbot/worker/package_win32/build/usr/include -I/cygdrive/d/buildbot/worker/package_win32/build/usr/include -DLIBRARY_EXPORTS -I/cygdrive/d/buildbot/worker/package_win32/build/deps/valgrind -Wall -Wno-strict-aliasing -fno-omit-frame-pointer -fvisibility=hidden -fno-common -Wno-comment -Wpointer-arith -Wundef -Wno-unused-result -DJL_BUILD_ARCH='"i686"' -DJL_BUILD_UNAME='"NT"' -ID:\buildbot\worker\package_win32\build\usr/include -DLLVM_SHLIB "-DJL_SYSTEM_IMAGE_PATH=\"../lib/julia/sys.dll\"" ./jloptions.o ./runtime_ccall.o ./rtutils.o ./codegen.o ./llvm-ptls.o ./jltypes.o ./gf.o ./typemap.o ./smallintset.o ./ast.o ./builtins.o ./module.o ./interpreter.o ./symbol.o ./dlload.o ./sys.o ./init.o ./task.o ./array.o ./dump.o ./staticdata.o ./toplevel.o ./jl_uv.o ./datatype.o ./simplevector.o ./runtime_intrinsics.o ./precompile.o ./threading.o ./partr.o ./stackwalk.o ./gc.o ./gc-debug.o ./gc-pages.o ./gc-stacks.o ./method.o ./jlapi.o ./signal-handling.o ./safepoint.o ./timing.o ./subtype.o ./crc32c.o ./APInt-C.o ./processor.o ./ircode.o ./jitlayers.o ./aotcompile.o ./debuginfo.o ./disasm.o ./llvm-simdloop.o ./llvm-muladd.o ./llvm-final-gc-lowering.o ./llvm-pass-helpers.o ./llvm-late-gc-lowering.o ./llvm-lower-handlers.o ./llvm-gc-invariant-verifier.o ./llvm-propagate-addrspaces.o ./llvm-multiversioning.o ./llvm-alloc-opt.o ./cgmemmgr.o ./llvm-api.o ./llvm-remove-addrspaces.o ./llvm-remove-ni.o ./llvm-julia-licm.o  -o /cygdrive/d/buildbot/worker/package_win32/build/usr/bin/libjulia.dll -Wl,--stack,8388608 -Wl,--large-address-aware  -Wl,--whole-archive ./flisp/libflisp.a -Wl,--whole-archive ./support/libsupport.a -L/cygdrive/d/buildbot/worker/package_win32/build/usr/bin -L/cygdrive/d/buildbot/worker/package_win32/build/usr/lib /cygdrive/d/buildbot/worker/package_win32/build/usr/lib/libuv.a /cygdrive/d/buildbot/worker/package_win32/build/usr/lib/libutf8proc.a -Wl,--no-whole-archive  -LD:\buildbot\worker\package_win32\build\usr/lib  -lLLVM -Wl,--export-all-symbols -Wl,--version-script=/cygdrive/d/buildbot/worker/package_win32/build/src/julia.expmap -Wl,--no-whole-archive -lpsapi -lkernel32 -lws2_32 -liphlpapi -lwinmm -ldbghelp -luserenv -lsecur32 -lssp -Wl,--whole-archive /cygdrive/d/buildbot/worker/package_win32/build/usr/lib/libopenlibm.a -Wl,--no-whole-archive 
make[1]: Leaving directory '/cygdrive/d/buildbot/worker/package_win32/build/src'
/usr/lib/gcc/i686-w64-mingw32/9.2.0/../../../../i686-w64-mingw32/bin/ld: /usr/i686-w64-mingw32/sys-root/mingw/lib/../lib/libmingwex.a(lib32_libmingwex_a-isnan.o): in function `_isnan':
/usr/src/debug/mingw64-i686-runtime-7.0.0-1/math/isnan.c:23: multiple definition of `isnan'; /cygdrive/d/buildbot/worker/package_win32/build/usr/lib/libopenlibm.a(s_isnan.c.o):s_isnan.c:(.text+0x0): first defined here
collect2: error: ld returned 1 exit status

Do we need to pass in an option to suppress linking of libm when UNTRUSTED_SYSTEM_LIBM=1? I don't understand why this worked before, unless julia.exe just so happens to not link against libm but libjulia.dll does?

@staticfloat staticfloat force-pushed the sf/libjuliarepl_loader branch 2 times, most recently from acd6ed3 to cd17173 Compare September 1, 2020 23:36
@staticfloat
Copy link
Sponsor Member Author

It passed on all platforms! Hallelujah! If this rebased version passes, let's merge.

@staticfloat staticfloat merged commit 7e8f2c0 into master Sep 24, 2020
@staticfloat staticfloat deleted the sf/libjuliarepl_loader branch September 24, 2020 00:04
# * debug builds must link against libjuliadebug, not libjulia
# * install time relative paths are not equal to build time relative paths (../lib vs. ../lib/julia)
# That second point will no longer be true for most deps once they are placed within Artifacts directories.
LOADER_BUILD_DEP_LIBS = $(LIBGCC_BUILD_DEPLIB):$(LIBOPENLIBM_BUILD_DEPLIB):$(LIBJULIA_BUILD_DEPLIB)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make it easier on system where the system library is working!

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you referring to USE_SYSTEM_LIBM=1?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure which option causes it but, it's trying to open usr/lib/libgcc_s.so.1 or sth like that which was not created.

https://build.archlinuxcn.org/~imlonghao/log/julia-git/2020-09-25T01%3A17%3A01.html

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the openlibm path it is trying to dlopen is indeed wrong as well. The symlink to system lib is at usr/lib/julia/libopenlibm.so whereas this is trying to open usr/lib/libopenlibm.so.3.

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we'll need to change LIBOPENLIBM_NAME if USE_SYSTEM_LIBM=1.

}
#elif defined(_OS_LINUX_)
// On Linux, we read from /proc/self/exe
int num_bytes = readlink("/proc/self/exe", exe_dir, PATH_MAX);
Copy link
Contributor

@yuyichao yuyichao Sep 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this should be fixed. I have definately used julia in chroot environments that does not /proc mounted.
It was usually a mistake of course but it was previously working. (edit: just checked that this is actually broken due to libuv bug. Still, this is another failure that previously doesn't happen.)

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a good way to find the location of the current process without using /proc?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, but the system linker behavior is to fallback to default.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And again, this should just be linked in.

@yuyichao
Copy link
Contributor

In general, for the libraries used by Base, I don't see why not having the libraries linked in directly. With RPATH it ensures the correct one being loaded and also enables better runtime linker optimization. It would have been OK if the linker behavior can be accurately mimicked but this is clearly not doing that (i.e. /proc/self/exe).

Since nothing was broken, AFAICT, this was only done to "in anticipation of JLL stdlibs". If that is the only argument against directly linking the libraries, especially libjulia and it's dependencies (other ones are loaded by julia code and it's only a plus to also have them linked in but isn't as much a requirement) like how normal C/C++ code works, then this part of the change should be reverted.

@staticfloat
Copy link
Sponsor Member Author

In general, for the libraries used by Base, I don't see why not having the libraries linked in directly.

The entire reason this PR exists is to allow us to run code before dynamic link time, because there is no RPATH facility on windows. This PR allows us to emulate RPATH on all systems.

It would have been OK if the linker behavior can be accurately mimicked but this is clearly not doing that (i.e. /proc/self/exe).

/proc/self/exe is just a way to find the location of the currently-running executable, so that we can do relative-path based dynamic library loads. If you have another simple way of finding the path, I'm happy to use that, but honestly, supporting linux distros that don't have /proc mounted is pretty far down my priorities list.

If that is the only argument against directly linking the libraries, especially libjulia and it's dependencies (other ones are loaded by julia code and it's only a plus to also have them linked in but isn't as much a requirement) like how normal C/C++ code works, then this part of the change should be reverted.

Since that's the entire point of this PR, no, it's not going to be reverted. The ideal situation is that this loader depends on as few libraries as possible, and manually dlopen()'s everything, so that we are relying on as few pieces of the system's dynamic linker as possible. This will enable us to do things like not have all the .dll's sitting in bin (and hence be on the PATH) on windows, similarly it will allow us to have libraries like libgcc_s and libLLVM hidden away in an stdlib artifacts directory, so that they are not on LD_LIBRARY_PATH, even for binaries spawned from Julia, etc...

@yuyichao
Copy link
Contributor

yuyichao commented Sep 24, 2020

The entire reason this PR exists is to allow us to run code before dynamic link time, because there is no RPATH facility on windows. This PR allows us to emulate RPATH on all systems.

Yes, but the solution should NOT be throwing away RPATH on platform that has it and then trying to bring back a version that's worse. I'm saying that when it is natively supported it should be used. On platforms that doesn't have it doing whatever needed to simulate it is totally fine.

The ideal situation is that this loader depends on as few libraries as possible, and manually dlopen()'s everything, so that we are relying on as few pieces of the system's dynamic linker as possible.

Again, this is basically saying that since some platforms are broken, let's just break it for everyone and then make everyone as broken as the natively broken one. It's penalizing people that has proper system setup to allow correct lookup. The fix should be ONLY applied to broken systems rather than the other way around.

@yuyichao
Copy link
Contributor

This will enable us to do things like not have all the .dll's sitting in bin (and hence be on the PATH) on windows

And I also don't see why that's a problem. It's only needed for libraries that are needed by libjulia, all the other libraries that want to have all sort of fancy lookup logic can still sit in wherever they want to live and be openned by dlopen. That's how the platform works. It's a bit dirty for sure and is annoy to deal with for other reasons but I just don't see why having a library in bin shipped by the installer rather than some artifact path is a big deal.

Also I don't think it needs to be on the global PATH? I think just being in the same directory of the executable is enough (unless it was cmake applying some magic to make that work but that's all what I needed to do go get a cmake compiled c++ program find the library it needs).

yuyichao added a commit to yuyichao/julia that referenced this pull request Sep 25, 2020
yuyichao added a commit to yuyichao/julia that referenced this pull request Sep 25, 2020
yuyichao added a commit to archlinuxcn/repo that referenced this pull request Sep 25, 2020
@staticfloat
Copy link
Sponsor Member Author

Again, this is basically saying that since some platforms are broken, let's just break it for everyone and then make everyone as broken as the natively broken one. It's penalizing people that has proper system setup to allow correct lookup. The fix should be ONLY applied to broken systems rather than the other way around.

On the other hand, if we use certain mechanisms on one platform and different mechanisms on the other, we get very hard to track down bugs like inconsistent libm symbol resolution. Many months ago I already announced my intention to do this in #33973 and #35193, and through discussion with other members of the compiler team, we decided that using a consistent approach across all platforms is better than having a patchwork approach where platform-specific differences are more likely to occur.

It's penalizing people that has proper system setup to allow correct lookup.

In what concrete ways is this penalizing any users?

And I also don't see why that's a problem. It's only needed for libraries that are needed by libjulia, all the other libraries that want to have all sort of fancy lookup logic can still sit in wherever they want to live and be openned by dlopen.

That's correct, but unfortunately, that list includes libLLVM, libgmp, libmpfr and libpcre. If a user wants to be able to type julia in their shell on windows, all of those libraries will then interfere with anything else that would otherwise load their own copy of e.g. libLLVM, because the entire bin directory of julia will be on the PATH.

@yuyichao
Copy link
Contributor

On the other hand, if we use certain mechanisms on one platform and different mechanisms on the other, we get very hard to track down bugs like inconsistent libm symbol resolution.

Well, you ARE already using different mechanism on different platforms. This is not something you can possibly avoid. Even dlopen doesn't work all the same on different platforms. If something can break, then add tests to catch them. Make sure the libm symbol address you get is correct, make sure the result is correct. Also AFAICT, on linux it was working so trying to get windows to work as similar to linux is basically the goal.

In what concrete ways is this penalizing any users?

Slower startup. Breaking linker optimization. Harder embedding (can't just link to the library anymore). Breaking build. Breaking tools inspecting linkage/symbols/rpaths.

If a user wants to be able to type julia in their shell on windows, all of those libraries will then interfere with anything else that would otherwise load their own copy of e.g. libLLVM, because the entire bin directory of julia will be on the PATH.

If the reason is that julia should not be on PATH then a startup script/exe will fix it.

@staticfloat
Copy link
Sponsor Member Author

staticfloat commented Sep 26, 2020

Well, you ARE already using different mechanism on different platforms. This is not something you can possibly avoid. Even dlopen doesn't work all the same on different platforms.

While you are technically correct, I believe my statement stands in that using the same type of mechanism (e.g. dlopen() on all platforms) is close enough it can be considered doing the same thing.

If something can break, then add tests to catch them.

I agree; any shortcomings in the "dlopen() all libraries" approach can be fixed in the same way.

In what concrete ways is this penalizing any users?

Slower startup.

  1. On my machine, Julia built from 7e8f2c0 is consistently 20% faster to start up than Julia built from f26a8c352fee56b845396f79152ccd01505d8973, the commit before it. What slower startup are you seeing?

Breaking linker optimization.

Concretely, what linker optimizations are being broken, and what performance impact is it having?

Harder embedding (can't just link to the library anymore).

That's precisely why we have libjulialoader; that's the library that an embedding tool can link to, which will load all the libraries necessary to get libjulia happy. The workflow is very easy; first load libjulialoader, next open libjulia.

Breaking build.

Yes, we will address the openlibm linking issue so that it's easier for users to continue to use USE_SYSTEM_LIBOPENLIBM.

Breaking tools inspecting linkage/symbols/rpaths.

This is no more broken than all the other libraries we dlopen().

If the reason is that julia should not be on PATH then a startup script/exe will fix it.

We already discussed this and decided against that because it is not clean enough.

@yuyichao
Copy link
Contributor

yuyichao commented Sep 26, 2020

If something can break, then add tests to catch them.
I agree;

Cool.

Julia built from 7e8f2c0 is consistently 20% faster to start up than Julia built from f26a8c3, the commit before it. What slower startup are you seeing?

There are many other unrelated changes in this commit which is probably the cause of the startup time difference. I never said every single changes here are bad. Compared to a directly linked one, what I see is a ~1% slow down and this is not including more libraries that you are going to dlopen in the future. And this is also a general comment about openning everything up front (i.e. the JLL package approach in general) it can easily cost miliseconds per library on a very fast system and much more on a slower one.

Concretely, what linker optimizations are being broken, and what performance impact is it having?

The thread local variable allocation. It's about as much an impact as the static tls optimization for the JIT code. Using the benchmark I did back then #14083 (comment) , it's ~70-80% slow down on the access of the variable itself. (i.e. almost 2x, not a minus 70-80% change in time, which would be 3x-5x).

The workflow is very easy; first load libjulialoader, next open libjulia.

And that's exactly what's broken about it. The application is now totally on it's own to figure out where the library is, rather than simply relying on the platform linker to find the library. The get_exe_path provided is useless for embedding, as it does not have to be at the same location as the julia executable. All the standard tricks of setting up the system to do link to a library at stardard/unstandard location will fail. And this virtually put a dependency of embedding on the (path of) the julia executable in additional to the library itself.

This is no more broken than all the other libraries we dlopen().

No it's much more broken than that. The way we use dlopen previously was never done using this mechanism to construct a full path without any fallback. Putting the library in the system path/setting LD_LIBRARY_PATH just works. Not so for a hard coded absolute path.

We already discussed this and decided against that because it is not clean enough.

Well, it's no dirtier than what's done here. Both require a relative hardcoded relative path in the user-facing executable to the library/real executable. It has minimum difference between platforms (the real executable would be exactly the same). And it's cleaner in the sense that it doesn't try to use any home baked mechanism to replace the system one.

@StefanKarpinski
Copy link
Sponsor Member

And this is also a general comment about openning everything up front (i.e. the JLL package approach in general)

JLLs no longer open everything up front. That’s one of the points of this whole endeavor. They now load libraries on demand instead, massively speeding up loading of JLLs.

At a high level, Yichao, this feedback should have come months ago. This PR has been here for months. There was another PR before that which did the script wrapper thing you’re talking about and that was open for a month or so and ultimately rejected in favor of this approach.

@staticfloat
Copy link
Sponsor Member Author

JLLs no longer open everything up front. That’s one of the points of this whole endeavor. They now load libraries on demand instead, massively speeding up loading of JLLs.

While that’s not true of most JLLs today, it will happen eventually, thanks to Jeff’s work on allowing non-const expressions as the library name.

@yuyichao
Copy link
Contributor

yuyichao commented Sep 26, 2020

At a high level, Yichao, this feedback should have come months ago. This PR has been here for months.

Well, this is the least useful comment of all time though I did expect exactly this comment from you. Nothing is set in stone yet so it is not a bad time to change things. This is also not the first time something is broken for real users in name of "improving" external library handling only to be caught later.

As for why I didn't comment earlier, it was not clear at all from the title what this was doing. The title reads like it's talking about embedding but is apparently not. There's no time for me to test every single PR on every single revision in the off chance that the title doesn't completely and ambiguiously reflect the change and so some of them have to rely on testing after merging to see the impact.

That’s one of the points of this whole endeavor.

No, I believe this PR has nothing to do with that. Allowing non-const ccall library path was the ONLY thing needed.

While that’s not true of most JLLs today, it will happen eventually, thanks to Jeff’s work on allowing non-const expressions as the library name.

Well, the list you give includes ones (gmp, mpfr, pcre) that are libraries used in julia code, not libjulia. In another word, none of those should be included in the list ever.


And all the concerns about dlopen vs direct linking and breaking other things are still valid. #36588 (comment)

@yuyichao
Copy link
Contributor

And if the other PR was #35629. AFAICT the only concern for that was modifying PATH which is unnecessary using basically the same approach here to figure out the path to the real executable (i.e. store a relative path to the real executable and use the current executable path to find the absolute path, no need to alter PATH). The discussion there also never reaches a conclusion.

@phykos
Copy link

phykos commented Sep 26, 2020

This changes ui/repl.c to compile into a new library, libjuliarepl.
This library is loaded by a very simple loader executable, defined in
loader.c. This loader can have paths to dependent libraries embedded
wtihin it in order to ensure that certain libraries have been loaded
within Julia by the time libjuliarepl and libjulia itself are
attempted to be loaded within Julia.

When running Julia in embed, libjuliarepl needs to be loaded?

@StefanKarpinski
Copy link
Sponsor Member

Well, this is the least useful comment of all time though I did expect exactly this comment from you.

Happy to meet expectations.

@staticfloat
Copy link
Sponsor Member Author

staticfloat commented Sep 27, 2020

Yichao, I understand that you're upset that this got through and you don't like the approach. But frankly, unless you can give explicit, concrete proof of the downsides you are claiming, the basic thrust of this PR still seems the best path forward.

There are many other unrelated changes in this commit which is probably the cause of the startup time difference. I never said every single changes here are bad. Compared to a directly linked one, what I see is a ~1% slow down and this is not including more libraries that you are going to dlopen in the future.

I'm totally fine with a 1% slowdown. Having the security of knowing precisely what library I'm loading (by using full relative path instead of whatever happens to be on the path is a benefit, not a downside.

And this is also a general comment about openning everything up front (i.e. the JLL package approach in general) it can easily cost miliseconds per library on a very fast system and much more on a slower one.

If the library is guaranteed to be needed anyway (like LLVM, GMP, MPFR, and PCRE, which are all required by Base julia) there is no slowdown, because the work must be done no matter what. Once we have the capability to do things like strip regex parsing out of Julia, yes, then we can tackle the issue of lazily loading libpcre, but there's no point in debating it now.

And besides, in this PR, we're only front loading LLVM, libgcc_s and openlibm precisely because this fixes issues in platforms that have problems because of the library loading strategy you are arguing for. If we let the system load its default libgcc_s on FreeBSD, the system will always link the wrong one. We have to manually copy our own over so that it is compatible with libgfortran, and we have to ensure that it is loaded before any other copy could be. On Windows, the system libm calculates things in slightly the wrong way, so we must ensure that openlibm is loaded before any Julia code runs, otherwise LLVM may look up the wrong symbol for things like pow().

These are perfect usecases for why it is useful to have a relative-path RPATH loading mechanism on all platforms. And the added maintainability of not needing to determine the differences between dyld loading, ld loading, and our own loader on Windows is a big maintainability win.

No it's much more broken than that. The way we use dlopen previously was never done using this mechanism to construct a full path without any fallback. Putting the library in the system path/setting LD_LIBRARY_PATH just works. Not so for a hard coded absolute path.

As mentioned above, it does not just work on all platforms, and using our own mechanism can work in all cases. There are large upsides to being able to guarantee that exactly the version of libLLVM that I want to load is exactly the version that we will attempt to load.

As I said above, the USE_SYSTEM_XX options should disable this, but for 99% of people who are using the default binaries, this is a big win for us, and it seems to not have any performance issues at all (quite the opposite, in fact).

That’s one of the points of this whole endeavor.

No, I believe this PR has nothing to do with that. Allowing non-const ccall library path was the ONLY thing needed.

It's all interconnected; it's best for non-const ccall to be able to use full paths so you have guarantees on which library you're loading, which will be serviced by using the paths from JLLs, which will be loaded from the artifacts folder, which is only possible if we have RPATH capabilities on all platforms.

And all the concerns about dlopen vs direct linking and breaking other things are still valid. #36588 (comment)

I see one concrete concern, which is that TLS can be slower. I attempted to recreate your benchmark and found no performance difference at all between this PR and the commit before it. At this point, I believe the burden of proof is upon you with regards to performance concerns; both of the concerns you have raised have turned out to be in favor of this PR, so if you have a performance concern please do a benchmark and show the impact.

This is also not the first time something is broken for real users in name of "improving" external library handling only to be caught later.

Let's not bring old issues into this, I don't think that's helpful. To be clear; that change did improve things, drastically, for real-world users. It sped up GMP and MPFR by a factor of two on older compilers, however, once newer compilers came out, we were able to roll back the march requirements and still retain most of the speed.

When running Julia in embed, libjuliarepl needs to be loaded?

@phykos It depends on what your application is doing, but yes, in most cases. For now, the libraries libjulia will attempt to load are still in the same directory as it, but very soon (like within the next week) we are going to be re-arranging the layout of libraries on-disk, and libjulia will no longer be able to find those libraries as easily. An embedding scenario can load them manually, if they so wish, or they can load libjulialoader which will load the dependent libraries first, then the libjulia library itself. You do so by loading libjulialoader and then calling load_repl(exe_dir, argc, argv). If you have concerns or different needs, please open a new issue and we can work through them together. :)

@yuyichao
Copy link
Contributor

yuyichao commented Sep 27, 2020

unless you can give explicit, concrete proof of the downsides you are claiming, the basic thrust of this PR still seems the best path forward.

I've given you answer every single time you asked for a concrete example so I'm not sure what you are talking about here.

I'm totally fine with a 1% slowdown. Having the security of knowing precisely what library I'm loading (by using full relative path instead of whatever happens to be on the path is a benefit, not a downside.

Well, the issue is that you want to do more of this.

These are perfect usecases for why it is useful to have a relative-path RPATH loading mechanism on all platforms. And the added maintainability of not needing to determine the differences between dyld loading, ld loading, and our own loader on Windows is a big maintainability win.

And I'm complaining exactly because the version you have breaks the maintainability for everyone downstream that expect the standard mechanism to be used. In another word, that's an extremely selfish choice.

loaded from the artifacts folder, which is only possible if we have RPATH capabilities on all platforms.

Errr, no. With full path there's never a need for "RPATH" capability anywhere else. RPATH is only useful if you don't have full path.

both of the concerns you have raised have turned out to be in favor of this PR

So 1% slow down is in favor I guess. Again, given the limited number of library you are loading now, I don't expect there to be much of a difference right now. Trying to load more and more is the problem.

It sped up GMP and MPFR by a factor of two on older compilers

Well, that wasn't what I was talking about. Maybe I should have changed the link but I was talking about going back to old compiler in the first place. There was even a PR (I think) to revert the use of binary builder that you refuse to take for months and instead took the path that break it for other people.

@phykos It depends on what your application is doing, but yes, in most cases. For now, the libraries libjulia will attempt to load are still in the same directory as it, but very soon (like within the next week) we are going to be re-arranging the layout of libraries on-disk, and libjulia will no longer be able to find those libraries as easily. An embedding scenario can load them manually, if they so wish, or they can load libjulialoader which will load the dependent libraries first, then the libjulia library itself. You do so by loading libjulialoader and then calling load_repl(exe_dir, argc, argv). If you have concerns or different needs, please open a new issue and we can work through them together. :)

And again, that's exactly what's broken about it and was the main concern that keep getting ignored, i.e. the machanism you invented is unexpected to users and is broken (and the secondary concern being the performance). You are now requiring the embedding use case to supply the exe_dir, in additional to the path to any library it is directly loading. As I said, the get_exe_dir is totally useless. The API being REPL centric is a fixable issue, but the fact that the embedding user can't just dlopen a library and let it take care of everything, even in case where the libraries are all at the right place, is a fundamental issue.


Edit:

And to be even more explicit.

The workflow is very easy; first load libjulialoader, next open libjulia.

The new workflow is,

  1. Load libjulialoader
  2. Find julia, which may or may not exist, may or may not be called julia and may or may not be the right one.
  3. Call load_repl (currently a bad API but fixable)
  4. Load libjulia (currently has the same issues as finding julia but fixable with a better API from libjulialoader)

Compared to previously on systems with working RPATH

  1. Load libjulia (and done)

I have little problem adding a libjulialoader library to make this work on systems that doesn't have RPATH, which was exactly what I thought this was doing, as long as it doesn't break anything about systems that currently works. It'll even be reasonable to require more information to be passed to the loader library that should have been available to the loader most of the time like the library path or the library handle, as long as a default NULL value is also accepted where no other manual loading steps will be performed. Requiring passing the path to another file and not allowing any fallback are what's functionally broken about this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants