Ability to set CPU configuration at runtime #451

Open
decandia50 opened this issue Oct 9, 2020 · 6 comments
@decandia50

I see that you can use BLIS_ARCH_DEBUG=1 to see which CPU configuration was selected at runtime, but it would be handy to be able to set the CPU configuration at runtime instead of recompiling. The goal is to get behavior like MKL's MKL_CBWR environment variable, which lets you specify the instruction set at runtime. This is useful when trying to produce reproducible results across different machine types. For instance, a Haswell machine can use AVX2 but not AVX-512. If you wanted a program to produce the same result across a heterogeneous set of Haswell and Skylake machines, the software built on the Skylake nodes would need to use the haswell configuration. I would like to be able to specify that at runtime, so the program uses only AVX2 instructions while the software stays configured for auto. This would allow me to run with AVX2 for some experiments and AVX-512 for others. I went through the documentation and did not find a way to change this setting at runtime, so if this already exists, please point me to the proper documentation.

@decandia50
Author

Also #351 was merged, so you can update

// NOTE: Change this usage of getenv() to bli_env_get_var() after

@devinamatthews
Member

@decandia50 if you configure using e.g. configure intel then it will compile in all the Intel architectures and select the proper one at runtime. While this isn't exactly the feature you asked for, it sounds like it would solve part of your problem. What you wouldn't get is e.g. using AVX2 on SkylakeX instead of AVX-512, but it's not clear why you would want to do that.

@decandia50
Author

@devinamatthews thanks for the response, but that's exactly what I'm trying to avoid doing. As you note, I can recompile down to a known lowest-common-denominator set of CPU instructions, but what I'm trying to accomplish is to use a specific set of instructions at runtime without needing to recompile and redistribute my software (the code I need is already in there, but I can't select it myself).

As a scenario: let's say I care about numerical reproducibility with respect to floating point, and that I have a large HPC cluster with heterogeneous host/CPU types. Some support AVX2, some support AVX-512. In a common scenario I will have tools like numpy linked against BLIS, and I will make a calculation like np.linalg.norm(A@B) as part of some regression test suite. What I would expect is that, given a known A and B, each host in the cluster would be able to reproduce the same result. However, because BLIS will autodetect the CPU and use AVX2 in some cases and AVX-512 in others, there is no way to specify at runtime that I care more about numerical reproducibility than performance without recompiling and redistributing the code, dependencies, and libraries to all hosts. In a large batch-like HPC system you may see many workloads. Some will desire absolute performance and will want AVX-512; others will require reproducibility and only need the lowest common instruction set. For folks who care about floating point reproducibility this is fairly important. As was once told to me, "diff is a wonderful debugging tool".
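
To make the reproducibility concern concrete, here is a minimal sketch of why the accumulation width can change a floating-point result. It is not BLIS's kernel code: `lane_sum` is a hypothetical helper that only imitates a vectorized reduction by keeping one partial sum per lane, and the input values are chosen by hand to expose the rounding difference.

```python
def lane_sum(values, lanes):
    """Sum `values` with `lanes` independent partial accumulators.

    Hypothetical stand-in for a SIMD reduction: element i is added to
    accumulator i % lanes, then the accumulators are combined in order.
    """
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    total = 0.0
    for p in partial:
        total += p
    return total


# 1e16 + 1.0 rounds back to 1e16 in double precision, so the order in
# which these four numbers are combined changes the final answer.
data = [1e16, 1.0, -1e16, 1.0]

print(lane_sum(data, 1))  # 1.0  (strictly sequential accumulation)
print(lane_sum(data, 2))  # 2.0  (two lanes: large terms cancel first)
```

The same inputs give two different sums purely because of how the additions are grouped, which is the kind of difference that can appear between kernels written for different vector widths.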

@devinamatthews
Member

Oh, I did not see that reproducibility is the main issue. I think this feature should be relatively easy to add, but I can't hazard a guess on a timeline. Do note that, for a reproducible answer, you will also need to run with the same number of threads on each machine.

@decandia50
Author

decandia50 commented Oct 9, 2020

Do note that, for a reproducible answer, you will also need to run with the same number of threads on each machine.

Indeed. The thread count is very important for reproducibility. In many cases where reproducibility is required the BLAS functions are often run single threaded to simplify things; e.g. BLIS_NUM_THREADS=1 BLIS_JC_NT=1 BLIS_IC_NT=1 BLIS_JR_NT=1 BLIS_IR_NT=1
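
For anyone launching such runs from a script, the settings above can be applied to a child process environment. This is only a sketch: `single_threaded_blis_env` is a hypothetical helper name, and the variable names are exactly the ones listed in this comment.

```python
import os

# Thread-related variables named in this thread, all forced to 1.
BLIS_THREAD_VARS = (
    "BLIS_NUM_THREADS",
    "BLIS_JC_NT",
    "BLIS_IC_NT",
    "BLIS_JR_NT",
    "BLIS_IR_NT",
)


def single_threaded_blis_env(base=None):
    """Return a copy of `base` (default: os.environ) with BLIS pinned to one thread."""
    env = dict(os.environ if base is None else base)
    for name in BLIS_THREAD_VARS:
        env[name] = "1"
    return env
```

The returned dictionary can be passed as `env=` to `subprocess.run` so that only the child process is affected and the parent environment stays unchanged.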

@fgvanzee fgvanzee self-assigned this Oct 17, 2020
fgvanzee added a commit that referenced this issue Oct 18, 2020
Details:
- Implemented support for the user manually overriding the automatic
  subconfiguration selection that happens at runtime. This override
  can be requested by setting the BLIS_ARCH_TYPE environment variable.
  The variable must be set to the arch_t id (as enumerated in
  bli_type_defs.h) corresponding to the desired subconfiguration. If a
  value outside this enumerated range is given, BLIS will abort with an
  error message. If the value is in the valid range but corresponds to a
  subconfiguration that was not activated at configure-time/compile-time,
  BLIS will abort with a (different) error message. Thanks to decandia50
  for suggesting this feature via issue #451.
- Defined a new function bli_gks_lookup_id to return the address of an
  internal data structure within the gks. If this address is NULL, then
  it indicates that the subconfig corresponding to the arch_t id passed
  into the function was not compiled into BLIS. This function is used
  in the second of the two abort scenarios described above.
- Defined the enumerated error code BLIS_UNINITIALIZED_GKS_CNTX, which
  is returned for the latter of the two abort scenarios mentioned above,
  along with a corresponding error message and a function to perform
  the error check.
- Added cpp macro branching to bli_env.c to support compilation of the
  auto-detect.x executable during configure-time. This cpp branch is
  similar to the cpp code already found in bli_arch.c and bli_cpuid.c.
- Cleaned up the auto_detect() function to facilitate easier maintenance
  going forward. Also added a convenient debug switch that outputs the
  compilation command for the auto-detect.x executable and exits.
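
The override logic described in the commit message can be sketched roughly as follows. This is an illustration in Python, not the C code from the commit: `NUM_ARCHS`, `COMPILED_IN`, the subconfiguration names, and `select_arch_id` are all hypothetical stand-ins, and the real ids come from the arch_t enumeration in bli_type_defs.h.

```python
NUM_ARCHS = 4  # hypothetical stand-in for the size of the arch_t enumeration

# Hypothetical build: only ids 1 and 3 were enabled at configure-time.
COMPILED_IN = {1: "subconfig_b", 3: "subconfig_d"}


class ArchOverrideError(Exception):
    """Raised where, per the commit message, BLIS would abort with an error."""


def select_arch_id(env, autodetected_id):
    """Return the arch id to use, honoring a BLIS_ARCH_TYPE override if set."""
    raw = env.get("BLIS_ARCH_TYPE")
    if raw is None:
        return autodetected_id  # no override: keep the automatic selection

    # Assumption: the value is a plain decimal integer. How BLIS treats
    # non-numeric input is not stated in this thread.
    arch_id = int(raw)

    # First abort scenario: id outside the enumerated range.
    if not 0 <= arch_id < NUM_ARCHS:
        raise ArchOverrideError(f"arch id {arch_id} is outside the valid range")

    # Second abort scenario: valid id, but that subconfiguration was not built.
    if arch_id not in COMPILED_IN:
        raise ArchOverrideError(f"arch id {arch_id} was not compiled into the library")

    return arch_id
```

The two checks are kept separate so that each failure mode gets its own message, mirroring the two distinct error messages the commit describes.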
@fgvanzee
Member

fgvanzee commented Oct 19, 2020

@decandia50 I've added support for observing the BLIS_ARCH_TYPE environment variable to manually override the automatic subconfiguration selection mechanism. (See commit 2a0682f.) Please note that it must be set to (1) an arch_t id value within the defined range, 0 to BLIS_NUM_ARCHS-1, as defined in frame/include/bli_type_defs.h (note that these enum values may change in future commits!), and (2) an arch_t id value that corresponds to a subconfiguration that is actually compiled into the library to which the executable was linked. If either condition is not met, BLIS will abort with an error message.

Note that you can still use BLIS_ARCH_DEBUG to confirm the subconfiguration selected, whether it is configure-determined, automatic at runtime, or manually overridden.
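
A regression-test launcher could combine the override with the debug switch along these lines. This is a sketch under assumptions: `pinned_arch_env` is a hypothetical helper, and the numeric id must be looked up in frame/include/bli_type_defs.h for the exact BLIS build in use, since the enum values may change between commits.

```python
def pinned_arch_env(arch_id, base=None, debug=True):
    """Return an environment dict that requests a specific arch_t id.

    `arch_id` is the integer id for the desired subconfiguration; it is
    not validated against the library here, BLIS itself does that at runtime.
    """
    if arch_id < 0:
        raise ValueError("arch_id must be non-negative")
    env = dict(base or {})
    env["BLIS_ARCH_TYPE"] = str(arch_id)
    if debug:
        # Ask BLIS to report which subconfiguration it actually selected.
        env["BLIS_ARCH_DEBUG"] = "1"
    return env
```

Keeping the debug switch on by default makes it easy to confirm in the job log that every host ended up on the same subconfiguration.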

Hopefully this feature, as implemented, is satisfactory for your purposes. Please test it out and pass along your feedback.

I appreciate the detail you included in your initial issue post and follow-up messages. Ultimately, this is what caught my attention and prompted me to spend part of my weekend on this. Happy early Hanukkah/Christmas/Festivus/birthday/whatever. :)

Also #351 was merged, so you can update

// NOTE: Change this usage of getenv() to bli_env_get_var() after

Thanks for this reminder. I folded that change into 2a0682f.
