[ci] [R-package] macOS clang CMake jobs failing with segfaults #6628

Closed
jameslamb opened this issue Aug 29, 2024 · 7 comments · Fixed by #6629

Comments

@jameslamb
Collaborator

jameslamb commented Aug 29, 2024

Description

The r-package (macos-13, clang, R 4.3, cmake) CI jobs are failing with a segfault like this:

* checking examples ... ERROR
Running examples in ‘lightgbm-Ex.R’ failed
The error most likely occurred in:

> base::assign(".ptime", proc.time(), pos = "CheckExEnv")
> ### Name: lgb.cv
> ### Title: Main CV logic for LightGBM
> ### Aliases: lgb.cv
> 
> ### ** Examples
> 
> ## No test: 
> ## Don't show: 
> setLGBMthreads(2L)
> ## End(Don't show)
> ## Don't show: 
> data.table::setDTthreads(1L)
> ## End(Don't show)
> data(agaricus.train, package = "lightgbm")

 *** caught segfault ***
address 0x540, cause 'memory not mapped'

Traceback:
 1: load(zfile, envir = tmp_env)
 2: data(agaricus.train, package = "lightgbm")
An irrecoverable exception occurred. R is aborting now ...
* checking for unstated dependencies in ‘tests’ ... OK
* checking tests ...
  Running ‘testthat.R’/Library/Frameworks/R.framework/Resources/bin/BATCH: line 60: 12462 Segmentation fault: 11  ${R_HOME}/bin/R -f ${in} ${opts} ${R_BATCH_OPTIONS} > ${out} 2>&1
 [20s/12s]
 [20s/12s] ERROR
Running the tests in ‘tests/testthat.R’ failed.
Last 13 lines of output:
  
   *** caught segfault ***
  address 0x7fb41c200034, cause 'memory not mapped'
  Warning: stack imbalance in 'lazyLoadDBfetch', 15 then 17
  Warning: stack imbalance in 'c', 40 then 38
  Warning: stack imbalance in 'lapply', 16 then 17
  
  Traceback:
   1: (function () expr)()
   2: test_files_serial(test_dir = test_dir, test_package = test_package,     test_paths = test_paths, load_helpers = load_helpers, reporter = reporter,     env = env, stop_on_failure = stop_on_failure, stop_on_warning = stop_on_warning,     desc = desc, load_package = load_package, error_call = error_call)
   3: test_files(test_dir = path, test_paths = test_paths, test_package = package,     reporter = reporter, load_helpers = load_helpers, env = env,     stop_on_failure = stop_on_failure, stop_on_warning = stop_on_warning,     load_package = load_package, parallel = parallel)
   4: test_dir("testthat", package = package, reporter = reporter,     ..., load_package = "installed")
   5: test_check(package = "lightgbm", stop_on_failure = TRUE, stop_on_warning = FALSE,     reporter = testthat::SummaryReporter$new())
  An irrecoverable exception occurred. R is aborting now ...
  Execution halted
* checking for unstated dependencies in vignettes ... OK
* checking package vignettes in ‘inst/doc’ ... OK
sh: line 1: 12492 Segmentation fault: 11  R_LIBS=/var/folders/zf/5wcpcvh91p9g4jn5srhsmzv40000gn/T//Rtmpr6e2Sh/RLIBS_29c22c8c2289 R_ENVIRON_USER='' R_LIBS_USER='NULL' R_LIBS_SITE='NULL' '/Library/Frameworks/R.framework/Resources/bin/R' --vanilla --no-echo > '/Users/runner/work/LightGBM/LightGBM/lightgbm.Rcheck/build_vignettes.log' 2>&1 < '/var/folders/zf/5wcpcvh91p9g4jn5srhsmzv40000gn/T//Rtmpr6e2Sh/file29c23693a3f9'
* checking re-building of vignette outputs ... ERROR
Error(s) in re-building vignettes:
  ...
--- re-building ‘basic_walkthrough.Rmd’ using knitr

Reproducible example

This is happening on all PRs. For example, see this build from #6625: https://github.com/microsoft/LightGBM/actions/runs/10581950637/job/29392431659?pr=6625.

On that PR, I manually re-triggered that job 3 times over the last 24 hours.

Additional Comments

It's worth noting that:

  • all other R jobs are passing (including multiple on macOS, across multiple R versions)
  • all non-R CI jobs (including swig, Python, and C++ tests) are passing, many also with CMake + clang
  • this job uses a fixed version of R (4.3.1) ... new changes in R-devel couldn't cause this
  • there are 0 issues reported in the CRAN checks on {lightgbm} v4.5.0: https://cran.r-project.org/web/checks/check_results_lightgbm.html
@jameslamb
Collaborator Author

I just tried re-running ALL R jobs on #6625, let's see if any others fail: https://github.com/microsoft/LightGBM/actions/runs/10581950637?pr=6625

@jameslamb
Collaborator Author

jameslamb commented Aug 29, 2024

I strongly suspect this is related to a release of one of {lightgbm}'s dependencies. I looked through the list from CI logs and checked their releases on CRAN, here's what I found:

@jameslamb
Collaborator Author

jameslamb commented Aug 29, 2024

So I think the new {data.table} release is a suspect. And experience tells me it's probably that release + something related to OpenMP 😭

On my Mac (M2, Sonoma 14.4.1), I built the latest {lightgbm} (fde0157) from source.

Rscript build_r.R --no-build-vignettes -j4

I found that, in combination with the latest {data.table}, the following is enough to reproduce the segfault:

cat > test.R <<EOF
library(lightgbm)
data(agaricus.train, package = "lightgbm")
lgb.Dataset(
    data = agaricus.train\$data
    , label = agaricus.train\$label
)\$construct()
EOF

# fails
Rscript test.R

The error does not occur if I disable OpenMP parallelism.

# succeeds
OMP_NUM_THREADS=1 Rscript test.R

Downgrading to the prior release of {data.table} also resolves it.

Rscript -e "remove.packages('data.table')"
Rscript --vanilla -e "install.packages(c('https://cran.r-project.org/src/contrib/Archive/data.table/data.table_1.15.4.tar.gz'), repos = NULL)"

# succeeds
Rscript test.R

# also succeeds
OMP_NUM_THREADS=1 Rscript test.R
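
The with/without-OpenMP comparison above can be wrapped in a small helper. This is only a sketch: `compare_omp` is a hypothetical name, not something in the LightGBM repo, and it simply runs whatever command it is given twice and prints each exit status.

```shell
# Run a command twice, once as-is and once with OMP_NUM_THREADS=1, and
# print the exit status of each run. Output of the command is discarded.
# compare_omp is a hypothetical helper name used for illustration.
compare_omp() {
    "$@" >/dev/null 2>&1
    echo "default: $?"
    OMP_NUM_THREADS=1 "$@" >/dev/null 2>&1
    echo "OMP_NUM_THREADS=1: $?"
}
```

For example, `compare_omp Rscript test.R` would show a non-zero status for the first run and 0 for the second if the crash only happens with multiple OpenMP threads.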

So it does look like it's something related to the latest {data.table} release. And since this is only happening on macOS, with clang, for CMake-based builds, I suspect it's related to the changes from #6391 and #6489 as well.

@jameslamb
Collaborator Author

I noticed that when I build {data.table} 1.15.4 from source, it isn't passing OpenMP flags.

Building 1.16.0 from source, it does. I see lines like this:

clang -arch arm64 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG   -I/opt/R/arm64/include   -I/opt/homebrew/opt/libomp/include -Xclang -fopenmp  -DNOZLIB -fPIC  -falign-functions=64 -Wall -g -O2  -c wrappers.c -o wrappers.o

It looks like {data.table}'s shared library is linking to R's OpenMP.

R_LIB=/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library

otool -L "${R_LIB}/data.table/libs/data_table.so"
data_table.so (compatibility version 0.0.0, current version 0.0.0)
/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libomp.dylib (compatibility version 5.0.0, current version 5.0.0)
/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libR.dylib (compatibility version 4.3.0, current version 4.3.3)
/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation (compatibility version 150.0.0, current version 2420.0.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1345.100.2)

{lightgbm}, by contrast, references libomp.dylib through an @rpath entry:

otool -L "${R_LIB}/lightgbm/libs/lightgbm.so"
@rpath/lightgbm.so (compatibility version 0.0.0, current version 0.0.0)
@rpath/libomp.dylib (compatibility version 5.0.0, current version 5.0.0)
/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libR.dylib (compatibility version 4.3.0, current version 4.3.3)
/usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 1700.255.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1345.100.2)

Which will search in this order:

LightGBM/CMakeLists.txt

Lines 829 to 833 in fde0157

# add RPATH entries to ensure the loader looks in the following, in the following order:
#
# - ${OpenMP_LIBRARY_DIR} (wherever find_package(OpenMP) found OpenMP at build time)
# - /opt/homebrew/opt/libomp/lib (where 'brew install' / 'brew link' puts libomp.dylib)
# - /opt/local/lib/libomp (where 'port install' puts libomp.dylib)
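
To see which RPATH entries actually ended up in an installed library, and in what order, the load commands printed by `otool -l` can be filtered. A minimal sketch, assuming the usual `cmd LC_RPATH` / `path ... (offset N)` layout of that output; `list_rpaths` is a hypothetical helper name.

```shell
# Print the RPATH entries found in `otool -l` output read from stdin,
# one per line, in load-command order.
# list_rpaths is a hypothetical helper name used for illustration.
list_rpaths() {
    awk '
        # Track whether the current load command is an LC_RPATH.
        $1 == "cmd" { in_rpath = ($2 == "LC_RPATH") }
        # Inside an LC_RPATH command, strip the "path" label and the
        # trailing "(offset N)" so only the directory remains.
        in_rpath && $1 == "path" {
            line = $0
            sub(/^[ \t]*path[ \t]+/, "", line)
            sub(/[ \t]+\(offset [0-9]+\)$/, "", line)
            print line
        }
    '
}
```

Usage would be something like `otool -l "${R_LIB}/lightgbm/libs/lightgbm.so" | list_rpaths`.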

So I suspect that this is our old friend, the "multiple versions of OpenMP loaded in the same session" problem.
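
A quick way to spot this situation is to collect the libomp references from several libraries at once. The sketch below only filters `otool -L` output; `libomp_refs` is a hypothetical helper name.

```shell
# Read `otool -L` output (for one or more libraries) on stdin and print
# each distinct libomp.dylib reference once.
# libomp_refs is a hypothetical helper name used for illustration.
libomp_refs() {
    awk '$1 ~ /libomp\.dylib$/ { print $1 }' | LC_ALL=C sort -u
}
```

For example, `otool -L "${R_LIB}/data.table/libs/data_table.so" "${R_LIB}/lightgbm/libs/lightgbm.so" | libomp_refs`. More than one line of output means the libraries refer to libomp differently. That alone does not prove two copies get loaded, since an `@rpath` reference may still resolve to the same file, but it shows where the two can diverge.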

Things we could try:

  • making {data.table} a Depends dependency of {lightgbm}, so it'll be loaded first (and then {lightgbm} will just find that via its @rpath/libomp.dylib entry)
  • ensuring that R's library directories are added to the list of libomp RPATH entries for CMake-based R builds
  • ensuring that R's library directories are added to e.g. CMAKE_PREFIX_PATH (docs), so that find_package() / find_library() will check there
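
For the RPATH option, the intended search order can be sketched in shell before touching CMakeLists.txt. This is an illustration only, not the project's build logic: `rpath_order` is a hypothetical helper, and it assumes R's vendored libomp.dylib lives in `$(R RHOME)/lib`, as the `otool -L` output for {data.table} suggests.

```shell
# Print RPATH candidates one per line, with R's lib directory first and
# any later duplicate of an earlier entry dropped.
# rpath_order is a hypothetical helper name used for illustration.
rpath_order() {
    r_lib_dir="$1"
    shift
    printf '%s\n' "$r_lib_dir" "$@" | awk '!seen[$0]++'
}
```

For example, `rpath_order "$(R RHOME)/lib" /opt/homebrew/opt/libomp/lib /opt/local/lib/libomp` prints R's directory ahead of the Homebrew and MacPorts locations, so a loader searching in that order would find the libomp.dylib that R ships before any other copy.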

@jameslamb
Collaborator Author

Adding a Depends entry with {data.table} in DESCRIPTION did not solve this.

But I found that adding R's main library directory at the beginning of the OpenMP RPATH list did! 🎉

Opened #6629 proposing that.

Summary

  1. R for macOS (from CRAN) vendors libomp.dylib
  2. CRAN's pre-compiled binaries for macOS embed an absolute-path install name pointing at that vendored library when compiled with -fopenmp
  3. CMake-based builds of {lightgbm} do not find that libomp.dylib at build time
     • because they use CMake's find_package(), and R ships just the library, not CMake config files and possibly not even headers
  4. {data.table}'s newest release, v1.16.0, fixes its OpenMP detection and now CRAN's macOS binaries of that library load the R-vendored libomp.dylib at runtime
  5. macOS CMake builds of {lightgbm} use RPATH-based search for libomp.dylib... and R's library directory is not included in its list
  6. library(data.table) and library(lightgbm) therefore load 2 different libomp.dylib into the process, leading to segfaults 🙃

Impact

A {lightgbm} package built from source on macOS with clang and OpenMP support enabled, using Rscript build_r.R, will probably segfault at runtime as soon as it is used together with {data.table} >= 1.16.0. Upgrading to a version which contains the changes in #6629 fixes that.

The {lightgbm} distributed via CRAN is unaffected (it uses autotools).

Windows and Linux users are unaffected.

Building with gcc is unaffected.

@jameslamb
Collaborator Author

Just for awareness, tagging some folks who might be interested (no action required.... this is a LightGBM problem, not a {data.table} problem): @hcho3 @kevinushey @MichaelChirico

@MichaelChirico

Messy! Glad you've found a fix. Linking our recent updates about configuring OpenMP on macOS since they're probably related & I don't see them here yet:

Rdatatable/data.table#6034
Rdatatable/data.table#6283
Rdatatable/data.table#6418

#6418 is in dev only, but we'll probably put it in a patch release soon.
