
KNL planning #391

Closed
andreasnoack opened this issue Nov 3, 2016 · 33 comments

@andreasnoack
Collaborator

andreasnoack commented Nov 3, 2016

UPDATED: I'll try to maintain KNL builds. You can use

/global/cscratch1/sd/noack/julia/julia      # built with gcc 4.8.5
/global/cscratch1/sd/noack/juliaintel/julia # built with icc 17.0.0 20160721

Some adjustments are necessary before Celeste can run on KNL. My thought was that this issue could serve as a source for the relevant KNL information, i.e. updated build information for Julia and a list of issues related to KNL runs.

The main issue is that we'll need LLVM's development version to run on KNL. To use LLVM-svn, as the development version is called, we'll need the development version of Julia. Hence, Celeste will have two moving targets, which might cause some headaches because LLVM changes might break Julia. To make this easier to manage, I suggest that we keep commit SHAs for a pair of Julia and LLVM-svn that work together. I'll try to keep them updated.

So the build info right now is:

Build instructions

In Make.user put

override LLVM_VER=svn
override USE_INTEL_MKL=1

and execute make -C deps get-llvm. Then check out

Julia df62c9922c320bff0f6d32bdf3faf87337925e2f
LLVM  584a5d174f2d0776a9390b12799abe8262ffc2ba
Updated 6 March 2017
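
Taken together, the steps above amount to something like the following shell session. This is just a sketch: the directory layout (deps/srccache/llvm-svn, as mentioned later in this thread) is assumed, and the SHAs are the ones listed above.

```shell
# Run from the root of a Julia source clone.
# Write the required Make.user settings:
cat > Make.user <<'EOF'
override LLVM_VER=svn
override USE_INTEL_MKL=1
EOF
cat Make.user

# Then (not executed here):
#   make -C deps get-llvm
#   git checkout df62c9922c320bff0f6d32bdf3faf87337925e2f       # Julia
#   (cd deps/srccache/llvm-svn && \
#    git checkout 584a5d174f2d0776a9390b12799abe8262ffc2ba)     # LLVM
#   make
```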

Intel specific

If building with the Intel compilers, it is also necessary to apply this patch to LLVM.
https://reviews.llvm.org/D27610

Right now, Julia's master branch is not much different from Julia 0.5, but that might change. If it does, we'll need to decide what to do. We might need fixes from master, so the best approach might be to keep Celeste up to date with master, but that could be demanding. Alternatively, we could branch off and only cherry-pick the fixes we need, but I suggest we take that discussion only when it becomes relevant.

Finally, there are some performance considerations for KNL. To do well on KNL we'd need threading and vectorization to work well. Some of the code base might not vectorize in the form it's written now (I don't know yet) but we should be aware that some refactoring for vectorization might be needed.

@andreasnoack andreasnoack self-assigned this Nov 3, 2016
@andreasnoack
Collaborator Author

andreasnoack commented Nov 3, 2016

Celeste is running on KNL. See screenshot below. It seems to work fine except that much of the code is single threaded so the runtime is pretty slow for benchmark_infer.
[Screenshot: Celeste running on a KNL node, 3 Nov 2016]

@andreasnoack
Collaborator Author

@kpamnany I've updated the SHAs. Yichao's patch to LLVM has been merged, so the latest svn version works with Julia master.

@Keno
Collaborator

Keno commented Nov 28, 2016

I've ported rr to Cori/KNL (rr-debugger/rr#1904). For any segfaults, etc. capturing such a problem in rr and keeping the trace would allow me to easily diagnose. I suspect it won't work with MPI out of the box, but looking at that is on my todo list for the end of next week.

@andreasnoack
Collaborator Author

I've updated the SHAs to versions with Keno's fixes included. I've also rebuilt /global/cscratch1/sd/noack/julia/julia such that it includes the fixes. I'd recommend that you wait about 24 hours before trying anything with this binary, since many packages need new tags after some changes to Base's macro handling.

@andreasnoack
Collaborator Author

@kpamnany I've now managed to build with icc. See the top post for a link to the binary.

@andreasnoack
Collaborator Author

andreasnoack commented Dec 16, 2016

On KNL, the memory allocation required by Julia's threads and LAPACK (eigenvalue problem in Newton's method) through OMP can collide such that Julia crashes. We should probably set OMP_NUM_THREADS=1 when running on KNL.
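
In a job script or interactive shell, that looks like the following. The JULIA_NUM_THREADS value below is purely illustrative (thread counts depend on the node and the run), not a recommendation from this thread.

```shell
# Keep OpenMP (used by MKL/LAPACK) single-threaded so it doesn't
# collide with Julia's own threads:
export OMP_NUM_THREADS=1
# Julia threading is controlled separately (value here is illustrative):
export JULIA_NUM_THREADS=68
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS JULIA_NUM_THREADS=$JULIA_NUM_THREADS"
```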

@jeff-regier
Owner

I always set OMP_NUM_THREADS=1 in the slurm scripts: https://github.com/jeff-regier/Celeste.jl/blob/master/nersc/infer.sl#L16 . I think @kpamnany does this too for runs on the supercomputer.

@Keno
Collaborator

Keno commented Dec 18, 2016

Linking JuliaLang/julia#19640

@andreasnoack
Collaborator Author

I've updated the top comment with the extra info for building with the Intel compilers.

@kpamnany
Collaborator

kpamnany commented Jan 4, 2017

Building with Intel compilers, I get:

    JULIA usr/lib/julia/inference.ji
essentials.jl
ctypes.jl
generator.jl
reflection.jl
options.jl
promotion.jl
tuple.jl
range.jl
expr.jl
error.jl
bool.jl
number.jl
int.jl
A method error occurred before the base MethodError type was defined. Aborting...
Core.Inference.#convert() world 1007
(UInt128, 0)
while loading int.jl, in expression starting on line 415
rec_backtrace at /scratch/kpamnany/julia.intel/src/stackwalk.c:84
...

Any ideas?

@andreasnoack
Collaborator Author

This must be a version of Julia with JuliaLang/julia#17057. I don't know why it wouldn't work on KNL but we are still adjusting to that change. If you want something running now, I'd try to build 374c3d6b3c3e15c00f7d3df8b5cb7c8a763aa746 instead of latest master.

@andreasnoack
Collaborator Author

I've updated and tagged most packages to work and not show warnings on Julia master. The main exception is StaticArrays where my PR is still under review. Therefore you should use the master branch of my fork https://github.com/andreasnoack/StaticArrays.jl. Hopefully, we'll be able to get it reviewed and merged today but I don't have commit access to that repo.

A minor problem is DataFrames and DataArrays. I don't think they are critical in the computations, but they still throw a lot of warnings, and it is not likely that DataArrays will be fixed. We should check that we don't call any code that uses DataArrays or DataFrames when we benchmark, because the deprecation warnings slow things down a lot.

Finally, the Intel build in the juliaintel directory segfaults in the benchmark_infer.jl tests but the gcc build doesn't. This only happens on KNL and I only realized this yesterday after all the packages had been fixed. @Keno is looking into this.

@hsseung

hsseung commented Feb 21, 2017

Regarding the build info of 26 January 2017: the Julia checkout returns the error message "fatal: reference is not a tree:". The LLVM checkout works fine.

@andreasnoack
Collaborator Author

I've just updated the commit hashes to the versions we use now. Could you try them instead?

@hsseung

hsseung commented Feb 21, 2017

Thanks! This time I have the opposite problem. Julia checks out fine but LLVM does not.

@andreasnoack
Collaborator Author

Hm. Just checked and it looks right. It is this commit llvm-mirror/llvm@72258b4

@hsseung

hsseung commented Feb 21, 2017

I tried again with a virgin git clone of the repo, but the second checkout doesn't work. Am I doing something wrong?

$ make -C deps get-llvm

[output omitted]

$ git checkout 3181500e361991b25a0ed8d63a821eb3c7a2e4bf
Note: checking out '3181500e361991b25a0ed8d63a821eb3c7a2e4bf'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

git checkout -b new_branch_name

HEAD is now at 3181500... Fix missing root across safe-point
$ git checkout 72258b42b0805337ea2d0d042454d2a8c173fcf4
fatal: reference is not a tree: 72258b42b0805337ea2d0d042454d2a8c173fcf4

@andreasnoack
Collaborator Author

Maybe there is an issue with the make target. Try to cd into deps/srccache/llvm-svn and do a git fetch origin. If that doesn't work, try checking the status of the LLVM repo.
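
For anyone hitting the same error: "fatal: reference is not a tree" just means the clone has never fetched that commit object, and a fetch fixes it. A throwaway demonstration (purely illustrative repos under /tmp, not the real LLVM checkout):

```shell
set -e
rm -rf /tmp/knl-origin /tmp/knl-clone
git init -q /tmp/knl-origin
git -C /tmp/knl-origin -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "commit 1"
git clone -q /tmp/knl-origin /tmp/knl-clone
# Origin moves ahead after the clone was made:
git -C /tmp/knl-origin -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "commit 2"
NEW=$(git -C /tmp/knl-origin rev-parse HEAD)
# The clone does not yet have the new commit object:
git -C /tmp/knl-clone checkout -q "$NEW" 2>/dev/null \
    && echo "unexpected success" || echo "checkout fails before fetch"
# After fetching, the checkout works:
git -C /tmp/knl-clone fetch -q origin
git -C /tmp/knl-clone checkout -q "$NEW" 2>/dev/null \
    && echo "checkout succeeds after fetch"
```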

@hsseung

hsseung commented Feb 22, 2017

The LLVM repo was two commits behind, so I did a git pull and was then able to check out. Compilation worked. Thanks!

peakflops(10000) is almost 1.5 teraflops, about 20x faster than my MacBook Pro. This is encouraging! Performance is presumably limited by MKL for KNL? Theoretical max is 6 teraflops for my KNL, I believe.

Transcendental functions are 4x slower though...

@kpamnany
Collaborator

You're now getting the same performance with Julia on KNL that we are. Andreas and Keno have been working to improve things (see the issue Keno linked above), but it's going to be a moving target for some time.

@andreasnoack
Collaborator Author

I'd expect MKL's gemms to be close to what you can possibly get. Isn't the 6 teraflops for Float32s? Julia's peakflops measures Float64 performance.

If you are not already doing it, you should try to make vectorized calls to VML for transcendental functions. I see a 50x difference between using Julia's exp and VML's exp for vectors of Float32s.

@hsseung

hsseung commented Feb 22, 2017

Ah, you're right. The analog of peakflops for Float32 yields 3.6 teraflops, which is getting close to the theoretical max.

I'm puzzled by your 50x number. I tried

julia> a=rand(Float32, 100000000);
julia> @time exp.(a);

VML.jl gave me a 4x speedup on my MacBook but only a few percent on KNL. Only a single thread is used on both machines.

@andreasnoack
Collaborator Author

Did you change the path to the library as described in https://github.com/JuliaMath/VML.jl#using-vmljl? It should point to the avx512_mic version. I guess that should give you 4x because the default is just AVX. Also, I timed the non-allocating version, i.e. AVX.exp!(y,x).
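
For reference, the allocating vs non-allocating shapes look like this in plain Julia using only Base exp; the idea is to substitute the VML call (the AVX.exp! mentioned above) for the in-place broadcast when timing, so allocation doesn't dominate the measurement:

```julia
x = rand(Float32, 10_000_000)
y = similar(x)

# Allocating version: creates a fresh output array on every call.
y1 = exp.(x)

# Non-allocating version: reuses the preallocated buffer. This is the
# shape to compare against a vendor in-place exp such as AVX.exp!(y, x).
y .= exp.(x)

@assert y == y1
```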

@hsseung

hsseung commented Apr 23, 2017

Oops forgot to say that this did solve my problem. Thanks!

@hayatoikoma

Is there any status update on this topic after the release of Julia v0.6? I am trying to build Julia v0.6 on KNL with Intel's compiler, but I still haven't been able to build it. It would be great if you could update the information provided above.

@andreasnoack
Collaborator Author

Julia 0.6 doesn't support KNL out of the box. You can use Julia 0.6 on KNL, but you'll need a more recent LLVM. I think building Julia with the Intel compilers works fine, but I gave up on building LLVM with the Intel compilers; GCC worked fine. Notice that you can use MKL without building Julia with the Intel compilers.

@hsseung

hsseung commented Jul 12, 2017

I did not yet build 0.6 for KNL, but I did manage an MKL build for a regular Intel CPU. As Andreas says, LLVM does not build properly with Intel's icc; I think I got an error message about linking to libirc.a. Building with gcc worked, but I had to manually set the environment variables MKLROOT and LD_LIBRARY_PATH. I couldn't use the Intel script compilervars.sh because it evidently sets some incorrect variables for this build. I found the right setting of LD_LIBRARY_PATH by attempting the build a few times and examining the error messages.

@hayatoikoma

Thank you, @andreasnoack and @hsseung.

I have managed to compile Julia and LLVM with GCC. I used the commits suggested in Andreas's first comment. However, I haven't been able to compile the released v0.6 even with the GCC compiler.

@andreasnoack
Collaborator Author

However, I haven't been able to compile the released v0.6 even with the GCC compiler.

Which version of LLVM are you using? The problem is that LLVM's API changes over time so Julia has to be adjusted a bit every time the LLVM version changes.

@hayatoikoma

I have checked out the commit you suggested in the first comment for LLVM and checked out release-0.6 of Julia. Which version would you suggest for LLVM?

@hayatoikoma

After trying out different versions of LLVM, I was able to compile release-0.6 of Julia with release-40 of LLVM! Thank you for your help!

@aramirezreyes

aramirezreyes commented Nov 27, 2018

Hi. I am using Julia on Cori, currently on Haswell (downloaded binaries for Julia 1.0.2). Is there a way to get your compile script for Julia on KNL?

@andreasnoack
Collaborator Author

Have you tried just running Julia 1.0.2 on a KNL node? We have upgraded LLVM in the meantime, so I believe Julia should just run on KNL now. You might want to build a custom system image for KNL (see PackageCompiler.jl), and you most likely would like to use MKL (see what I'm doing in https://github.com/JuliaComputing/MKL.jl). The current trick is to adjust the paths in build_h.jl to point to libmkl_rt instead of openblas64_ and then build a native system image with PackageCompiler.
