Releases · JuliaGPU/CUDA.jl

18 Sep 14:28

github-actions

v5.5.0

1fe8838

v5.5.0 Latest

CUDA v5.5.0

Blog post

Diff since v5.4.3

Merged pull requests:

Add support for arbitrary group sizes in gemm_grouped_batched! (#2334) (@lpawela)
Add kernel compilation requirements to docs (#2416) (@termi-official)
Enzyme: reverse mode kernels (#2422) (@wsmoses)
CUFFT: Support Float16 (#2430) (@eschnett)
Updated compute-sanitizer documentation (#2440) (@alexp616)
Add troubleshooting section for NSight Compute (#2442) (@efaulhaber)
Correct typo in documentation (#2445) (@eschnett)
Bump minimal Julia requirement to v1.10. (#2447) (@maleadt)
fix compute-sanitizer typo (#2448) (@alexp616)
Address a corner case when establishing p2p access (#2457) (@findmyway)
Implementation of spdiagm for CUSPARSE (#2458) (@walexaindre)
Update to CUDA 12.6. (#2461) (@maleadt)
CompatHelper: bump compat for GPUCompiler to 0.27, (keep existing compat) (#2462) (@github-actions[bot])
Bump CUDA driver JLL. (#2463) (@maleadt)
CUSOLVER (dense): cache workspace in fat handle (#2465) (@bjarthur)
Revert "Run full GC when under very high memory pressure." (#2469) (@maleadt)
Fix a method deprecation. (#2470) (@maleadt)
Add Enzyme sum derivatives (#2471) (@wsmoses)
Re-use pre-converted kernel arguments when launching kernels. (#2472) (@maleadt)
Bump LLVM compat (#2473) (@maleadt)
Bump subpackage compat. (#2475) (@maleadt)
Enzyme: Reversemode cudaconvert (#2476) (@wsmoses)
Ignore Enzyme.jl CI failures (#2479) (@maleadt)
Re-enable enzyme testing (#2480) (@wsmoses)
Add missing GC.@preserves. (#2487) (@maleadt)
[CUSPARSE] Implement a sparse GEMV for CuSparseMatrixCSC * CuSparseVector (#2488) (@amontoison)
[CUSPARSE] Add conversions between CuSparseVector and CuSparseMatrices (#2489) (@amontoison)
Update to LLVM 9.1. (#2491) (@maleadt)
Use at-consistent_overlay for 1.11 compatibility. (#2492) (@maleadt)
Rework NNlib CI. (#2493) (@maleadt)
CUSPARSE: Fix sparse constructor with duplicate elements. (#2495) (@maleadt)

Closed issues:

LinearAlgebra.norm(x) falls back to generic implementation for x::Transpose and x::Adjoint (#1782)
dlclose'ing the compatibility driver can fail (#1848)
Creating a sparse diagonal matrix of CuArray(u) (#1857)
Support for Julia 1.11 (#2241)
CUDA 12.4 Update 1: CUPTI does not trace kernels anymore (#2328)
Adding CUDA to a PackageCompiler sysimage causes segfault (#2428)
Error using CUDA on Julia 1.10: Number of threads per block exceeds kernel limit (#2438)
Error when I load my model (#2439)
Driver JLL improvements (#2446)
Deadlock when calling CUDA.jl in an adopted thread while blocking the main thread (#2449)
CUDA.Mem.unregister fails with CUDA.jl 5.4 (not with 5.3) (#2452)
Segmentation Fault on Loading CUDA (#2453)
Invalid instruction error when using CUDA (#2454)
Missing adapt for sparse and CUDABackend (#2459)
CUDA precompile cannot find/load "cupti64_2024.2.1.dll" during precompilation (juliaup 1.10.4, Windows 11) (#2466)
Request: Option to disable the "full GC when under very high memory pressure". (#2467)
copyto! ambiguous (#2477)
NeuralODE training failed on GPU with Enzyme (#2478)
issue with atomic - when running standard test, @atomic modify expression missing field access (#2483)
Support for creating a CuSparseMatrixCSC from a CuSparseVector (#2484)
Issue with compiling CUDA and cuTENSOR using local libraries (#2486)
Memory Access error in sparse array constructor (#2494)
Forwards-compatible driver breaks CURAND (#2496)
CUDA 12.6 Update 1 (#2497)

Contributors

eschnett, maleadt, and 11 other contributors

09 Jul 08:09

github-actions

v5.4.3

71311af

v5.4.3

CUDA v5.4.3

Diff since v5.4.2

Merged pull requests:

add cublasgetrsBatched (#2385) (@bjarthur)
add two quirks for rationals (#2403) (@lanceXwq)
Bump cuDNN (#2404) (@maleadt)
Add convert method for ScaledPlan (#2409) (@david-macmahon)
Conditionalize a quirk. (#2411) (@maleadt)
Relax signature of generic matvecmul! (#2414) (@dkarrasch)
Fix kron launch configuration. (#2418) (@maleadt)
Run full GC when under very high memory pressure. (#2421) (@maleadt)
Enzyme: Fix cuarray return type (#2425) (@wsmoses)
CompatHelper: bump compat for LLVM to 8, (keep existing compat) (#2426) (@github-actions[bot])
pre-allocated pivot and info buffers for getrf_batched (#2431) (@bjarthur)
Profiler tweaks. (#2432) (@maleadt)
Update the Julia wrappers for CUDA v12.5.1 (#2436) (@amontoison)
Correct workspace handling (#2437) (@maleadt)

Closed issues:

Legacy cuIpc* APIs incompatible with stream-ordered allocator (#1053)
Broadcasted multiplication with a rational doesn't work (#1926)
Incorrect grid size in kron (#2410)
GEMM of non-contiguous inputs should dispatch to fallback implementation (#2412)
Failure of Eigenvalue Decomposition for Large Matrices. (#2413)
CUDA_Driver_jll's lazy artifacts cause a precompilation-time warning (#2415)
Recurrence of integer overflow bug (#1880) for a large matrix (#2427)
CUDA kernel crash very occasionally when MPI.jl is just loaded. (#2429)
CUDA_Runtime_Discovery Did not find cupti on Arm system with nvhpc (#2433)
CUDA.jl won't install/run on Jetson Orin NX (#2435)

Contributors

maleadt, david-macmahon, and 5 other contributors

29 May 07:35

github-actions

v5.4.2

7e6a57a

v5.4.2

CUDA v5.4.2

Diff since v5.4.1

Merged pull requests:

Fix and test the legacy memory pool. (#2402) (@maleadt)

Contributors

maleadt

28 May 18:53

github-actions

v5.4.1

5bbd9a7

v5.4.1

CUDA v5.4.1

Diff since v5.4.0

Merged pull requests:

Fixup Enzyme: Mark CuArray as noalias (#2401) (@wsmoses)

Contributors

wsmoses

28 May 06:45

github-actions

v5.4.0

f2062a5

v5.4.0

CUDA v5.4.0

Blog post

Diff since v5.3.5

Merged pull requests:

Support CUDA 12.5 (#2392) (@maleadt)
Mark cuarray as noalias (#2395) (@wsmoses)
Update Julia wrappers for CUDA v12.5 (#2396) (@amontoison)
Enable correct pool access for cublasXt. (#2398) (@maleadt)
More fine-grained CUPTI version checks. (#2399) (@maleadt)

Closed issues:

CUTENSOR breaks after device_reset! (#2319)
cuBLASXt's xt_gemm! incompatible with stream-ordered allocated memory (#2320)
Add helper function to recompile CUDA stack (#2364)

Contributors

maleadt, wsmoses, and amontoison

24 May 13:29

github-actions

v5.3.5

7232f85

v5.3.5

CUDA v5.3.5

Diff since v5.3.4

Merged pull requests:

Avoid constructing MulAddMuls on Julia v1.12+ (#2277) (@dkarrasch)
CompatHelper: bump compat for LLVM to 7, (keep existing compat) (#2365) (@github-actions[bot])
Enzyme: allocation functions (#2386) (@wsmoses)
Tweaks to prevent context construction on some operations (#2387) (@maleadt)
Fixes for Julia 1.12 / LLVM 17 (#2390) (@maleadt)
CUBLAS: Make sure CUBLASLt wrappers use the correct library. (#2391) (@maleadt)
Backport: Enzyme allocation fns (#2393) (@wsmoses)

Closed issues:

Indexing a view uses scalar indexing (#1472)
EnzymeCore is an unconditional dependency. (#2380)
cuBLASLt wrappers ccall into cuBLAS (#2388)
generic_trimatmul! error (#2389)

Contributors

maleadt, wsmoses, and dkarrasch

15 May 19:28

github-actions

v5.3.4

c373258

v5.3.4

CUDA v5.3.4

Diff since v5.3.3

Merged pull requests:

Add Enzyme Forward mode custom rule (#1869) (@wsmoses)
Handle cache improvements (#2352) (@maleadt)
Fix cuTensorNet compat (#2354) (@maleadt)
Optimize array allocation. (#2355) (@maleadt)
Change type restrictions in cuTENSOR operations (#2356) (@lkdvos)
Bump julia-actions/setup-julia from 1 to 2 (#2357) (@dependabot[bot])
Suggest use of 32 bit types over 64 instead of just Float32 over Float64 [skip ci] (#2358) (@Zentrik)
Make generic_trimatmul more specific (#2359) (@tgymnich)
Return the correct memory type when wrapping system memory. (#2363) (@maleadt)
Mark cublas version/handle as non-differentiable (#2368) (@wsmoses)
Enzyme: Forward mode sync (#2369) (@wsmoses)
Enzyme: support fill (#2371) (@wsmoses)
unsafe_wrap: unconditionally use the memory type provided by the user. (#2372) (@maleadt)
Remove external_gvars. (#2373) (@maleadt)
Tegra support with artifacts (#2374) (@maleadt)
Backport Enzyme extension (#2375) (@wsmoses)
Add note about --check-bounds=yes (#2378) (@Zinoex)
Test Enzyme in a separate CI job. (#2379) (@maleadt)
Fix tests for Tegra. (#2381) (@maleadt)
Update Project.toml [remove EnzymeCore unconditional dep] (#2382) (@wsmoses)

Closed issues:

Native Softmax (#175)
CUSOLVER: support eigendecomposition (#173)
backslash with gpu matrices crashes julia (#161)
at-benchmark captures GPU arrays (#156)
Support kernels returning Union{} (#62)
mul! falls back to generic implementation (#148)
\ on qr factorization objects gives a method error (#138)
Compiler failure if dependent module only contains a japi1 function (#49)
copy!(dst, src) and copyto!(dst, src) are significantly slower and allocate more memory than copyto!(dest, do, src, so[, N]) (#126)
Calling Flux.gpu on a view dumps core (#125)
Creating CuArray{Tracker.TrackedReal{Float64},1} a few times causes segfaults (#121)
Guard against exceeding maximum kernel parameter size (#32)
Detect common API misuse in error handlers (#31)
rand and friends default to Float64 (#108)
\ does not work for least squares (#104)
ERROR_ILLEGAL_ADDRESS when broadcasting modular arithmetic (#94)
CuIterator assumes batches to consist of multiple arrays (#86)
Algebra with UniformScaling Uses Generic Fallback Scalar Indexing (#85)
Document (un)supported language features for kernel programming (#13)
Missing dispatch for indexing of reshaped arrays (#556)
Track array ownership to avoid illegal memory accesses (#763)
NVPTX i128 support broken on LLVM 11 / Julia 1.6 (#793)
Support for sm_80 cp.async: asynchronous on-device copies (#850)
Profiling Julia with Nsight Systems on Windows results in blank window (#862)
sort! and partialsort! are considerably slower than CPU versions (#937)
mul! does not dispatch on Adjoint (#1363)
Cross-device copy of wrapped arrays fails (#1377)
Memory allocation becomes very slow when reserved bytes is large (#1540)
Cannot reclaim GPU Memory; CUDA.reclaim() (#1562)
Add eigen for general purpose computation of eigenvectors/eigenvalues (#1572)
device_reset! does not seem to work anymore (#1579)
device-side rand() are not random between successive kernel launches (#1633)
Add EnzymeRules support for CUDA.jl (for forward mode here) (#1811)
cusparseSetStream_v2 not defined (#1820)
Feature request: Integrating the latest CUDA library "cuLitho" into CUDA.jl (#1821)
KernelAbstractions.jl-related issues (#1838)
lock failing in multithreaded plan_fft() (#1921)
CUSolver finalizer tries to take ReentrantLock (#1923)
Testsuite could be more careful about parallel testing (#2192)
Opportunistic GC collection (#2303)
Unable to use local CUDA runtime toolkit (#2367)
Enzyme prevents testing on 1.11 (#2376)

Contributors

maleadt, wsmoses, and 5 other contributors

27 Apr 10:11

github-actions

v5.3.3

50137ae

v5.3.3

CUDA v5.3.3

Diff since v5.3.2

Merged pull requests:

Rework context handling (#2346) (@maleadt)
fix kernel launch logic (#2353) (@xaellison)

Closed issues:

Excessive allocations when running on multiple threads (#1429)
Fix and test multigpu support (#2218)
Bitonic sort exceeds launch resources (#2331)

Contributors

maleadt and xaellison

26 Apr 13:59

github-actions

v5.3.2

e2e7b57

v5.3.2

CUDA v5.3.2

Diff since v5.3.1

Merged pull requests:

Add EnzymeCore extension for parent_job (#2281) (@vchuravy)
Consider running GC when allocating and synchronizing (#2304) (@maleadt)
Refactor memory wrappers (#2335) (@maleadt)
Auto-detect external profilers. (#2339) (@maleadt)
Fix performance of indexing unified memory. (#2340) (@maleadt)
Improve exception output (#2342) (@maleadt)
Test multigpu on CI (#2348) (@maleadt)
cuQuantum 24.3: Bump cuTensorNet. (#2350) (@maleadt)
cuQuantum 24.3: Bump cuStateVec. (#2351) (@maleadt)

Closed issues:

CuArrays don't seem to display correctly in VS code (#875)
Task scheduling can result in delays when synchronizing (#1525)
Docs: add example on task-based parallelism with explicit synchronization (#1566)
Exception output from many threads is not helpful (#1780)
Autodetect external profiler (#2176)
LazyInitialized is not GC-safe (#2216)
Track CuArray stream usage (#2236)
Improve cross-device usage (#2323)
CUBLASLt wrapper for cublasLtMatmulDescSetAttribute can have device buffers as input (#2337)
Improve error message when assigning real valued array with complex numbers (#2341)
@device_code_sass broken (#2343)
Readme says Cuda 11 is supported but also the last version to support it is v4.4 (#2345)
@gcsafe_ccall breaks inlining of ccall wrappers (#2347)

Contributors

vchuravy and maleadt

19 Apr 07:16

github-actions

v5.3.1

9c9a05f

v5.3.1

CUDA v5.3.1

Diff since v5.3.0

Merged pull requests:

[CUSOLVER] Fix the dispatch for syevd! and heevd! (#2309) (@amontoison)
Regenerate headers (#2324) (@maleadt)
Add some installation tips to docs/README.md (#2326) (@jlchan)
fix broadcast defaulting to Mem.Unified() (#2327) (@vpuri3)
Diagnose kernel limits on launch failure. (#2329) (@maleadt)
Work around a CUPTI bug in CUDA 12.4 Update 1. (#2330) (@maleadt)

Closed issues:

Missing CUBLASLt wrappers (#2322)
error when switching device (#2323)
v5.3.0: regression in Zygote performance (#2333)

Contributors

maleadt, jlchan, and 2 other contributors
