Releases: JuliaGPU/CUDA.jl
v5.3.2
CUDA v5.3.2
Merged pull requests:
- Add EnzymeCore extension for parent_job (#2281) (@vchuravy)
- Consider running GC when allocating and synchronizing (#2304) (@maleadt)
- Refactor memory wrappers (#2335) (@maleadt)
- Auto-detect external profilers. (#2339) (@maleadt)
- Fix performance of indexing unified memory. (#2340) (@maleadt)
- Improve exception output (#2342) (@maleadt)
- Test multigpu on CI (#2348) (@maleadt)
- cuQuantum 24.3: Bump cuTensorNet. (#2350) (@maleadt)
- cuQuantum 24.3: Bump cuStateVec. (#2351) (@maleadt)
Closed issues:
- CuArrays don't seem to display correctly in VS code (#875)
- Task scheduling can result in delays when synchronizing (#1525)
- Docs: add example on task-based parallelism with explicit synchronization (#1566)
- Exception output from many threads is not helpful (#1780)
- Autodetect external profiler (#2176)
- LazyInitialized is not GC-safe (#2216)
- Track CuArray stream usage (#2236)
- Improve cross-device usage (#2323)
- CUBLASLt wrapper for `cublasLtMatmulDescSetAttribute` can have device buffers as input (#2337)
- Improve error message when assigning real-valued array with complex numbers (#2341)
- `@device_code_sass` broken (#2343)
- README says CUDA 11 is supported, but the last version to support it is v4.4 (#2345)
- `@gcsafe_ccall` breaks inlining of ccall wrappers (#2347)
v5.3.1
CUDA v5.3.1
Merged pull requests:
- [CUSOLVER] Fix the dispatch for syevd! and heevd! (#2309) (@amontoison)
- Regenerate headers (#2324) (@maleadt)
- Add some installation tips to docs/README.md (#2326) (@jlchan)
- fix broadcast defaulting to Mem.Unified() (#2327) (@vpuri3)
- Diagnose kernel limits on launch failure. (#2329) (@maleadt)
- Work around a CUPTI bug in CUDA 12.4 Update 1. (#2330) (@maleadt)
v5.3.0
CUDA v5.3.0
Merged pull requests:
- CuSparseArrayCSR (fixed cat ambiguities from #1944) (#2244) (@nikopj)
- Slightly rework error handling (#2245) (@maleadt)
- cuTENSOR improvements (#2246) (@maleadt)
- Make `@device_code_sass` work with non-Julia kernels. (#2247) (@maleadt)
- Improve Tegra detection. (#2251) (@maleadt)
- Added a few SparseArrays functions (#2254) (@albertomercurio)
- Reduce locking in the handle cache (#2256) (@maleadt)
- Mark all CUDA ccalls as GC safe (#2262) (@vchuravy)
- cuTENSOR: Fix reference to undefined variable (#2263) (@lkdvos)
- cuTENSOR: refactor obtaining compute_type as part of plan (#2264) (@lkdvos)
- Re-generate headers. (#2265) (@maleadt)
- Update to CUDNN 9. (#2267) (@maleadt)
- [CUBLAS] Use the ILP64 API with CUDA 12 (#2270) (@amontoison)
- CompatHelper: bump compat for GPUCompiler to 0.26, (keep existing compat) (#2271) (@github-actions[bot])
- Minor improvements to nonblocking synchronization. (#2272) (@maleadt)
- Add extension package for StaticArrays (#2273) (@trahflow)
- Fix cuTensor, cuTensorNet and cuStateVec when using local Toolkit (#2274) (@bjoe2k4)
- Cached workspace prototype for custatevec (#2279) (@kshyatt)
- Update the Julia wrappers for v12.4 (#2282) (@amontoison)
- Add support for CUDA 12.4. (#2286) (@maleadt)
- Test suite changes (#2288) (@maleadt)
- Fix mixed-buffer/mixed-shape broadcasts. (#2290) (@maleadt)
- Towards supporting Julia 1.11 (#2291) (@maleadt)
- Fix typo in performance tips (#2294) (@Zentrik)
- Make it possible to customize the CuIterator adaptor. (#2297) (@maleadt)
- Set default buffer size in CUSPARSE `mm!` functions (#2298) (@lpawela)
- Avoid OOMs during OOM handling. (#2299) (@maleadt)
- [CUSOLVER] Add tests for geqrf, orgqr and ormqr (#2300) (@amontoison)
- [CUSOLVER] Interface larft! (#2301) (@amontoison)
- Fix RNG determinism when using wrapped arrays. (#2307) (@maleadt)
- sortperm with dims (#2308) (@xaellison)
- [CUBLAS] Interface gemm_grouped_batched (#2310) (@amontoison)
- [CUSPARSE] Add a method convert for the type cusparseSpSMUpdate_t (#2311) (@amontoison)
- Avoid capturing `AbstractArray`s in `BoundsError` (#2314) (@lcw)
- Clarify debug level hint. (#2316) (@maleadt)
Closed issues:
- Failed to compile PTX code when using NSight on Win11 (#1601)
- `sortperm` fails with `dims` keyword (#2061)
- NVTX-related segfault on Windows under compute-sanitizer (#2204)
- Inverse Complex-to-Real FFT allocates GPU memory (#2249)
- cuDNN not available for your platform (#2252)
- Cannot reset CuArray to zero (#2257)
- Cannot take gradient of `sort` on 2D CuArray (#2259)
- Multi-threaded code hanging forever with Julia 1.10 (#2261)
- CUBLAS: nrm2 support for StridedCuArray with length requiring Int64 (#2268)
- Adjoint not supported on Diagonal arrays (#2275)
- Regression in broadcast: getting Array (Julia 1.10) instead of CuArray (Julia 1.9) (#2276)
- Release v5.3? (#2283)
- Wrap CUDSS? (#2287)
- Bug concerning broadcast between device array and unified array (#2289)
- `StackOverflowError` trying to throw `OutOfGPUMemoryError`, subsequent errors (#2292)
- BUG: sortperm! seems to perform much slower than it should (#2293)
- Multiplying `CuSparseMatrixCSC` by `CuMatrix` results in `Out of GPU memory` (#2296)
- BFloat16 support broken on Julia 1.11 (#2306)
- does not emit line info for debugging/profiling (#2312)
- Kernel using `StaticArray` compiles in Julia v1.9.4 but not in v1.10.2 (#2313)
- Using copyto! with SharedArray triggers scalar indexing disallowed error (#2317)
v4.4.2
CUDA v4.4.2
Merged pull requests:
- Added support for more transform directions (#1903) (@RainerHeintzmann)
- CuSparseArrayCSR (N dim array) with batched matmatmul (bmm) (#1944) (@nikopj)
- Add some performance tips to the documentation (#1999) (@Zentrik)
- Re-introduce the 'blocking' kwargs to at-sync. (#2060) (@maleadt)
- Adapt to GPUCompiler#master. (#2062) (@maleadt)
- Batched SVD added (gesvdjBatched and gesvdaStridedBatched) (#2063) (@nikopj)
- Use released GPUCompiler. (#2064) (@maleadt)
- Fixes for Windows. (#2065) (@maleadt)
- Switch to GPUArrays buffer management. (#2068) (@maleadt)
- Update CUDA 12 to Update 2. (#2071) (@maleadt)
- [CUSOLVER] Add generic routines (#2074) (@amontoison)
- Update manifest (#2076) (@github-actions[bot])
- Test improvements (#2079) (@maleadt)
- Rework and extend the cooperative groups API. (#2081) (@maleadt)
- Update manifest (#2082) (@github-actions[bot])
- [CUSOLVER] Add a method for geqrf! (#2085) (@amontoison)
- Fix some typos in performance tips (#2086) (@Zentrik)
- Improve PTX ISA selection (#2088) (@maleadt)
- Update manifest (#2090) (@github-actions[bot])
- support ChainRulesCore inplaceability (#2091) (@piever)
- Add a method inv(CuMatrix) (#2095) (@amontoison)
- Add mul!(A, B, C) where B or C is a diagonal matrix (#2096) (@amontoison)
- Add CUDA_Runtime_Discovery dependency to sublibraries. (#2097) (@maleadt)
- Handle and test zero-size inputs to RNGs. (#2098) (@maleadt)
- Add a with_workspaces function (#2099) (@amontoison)
- [CUSOLVER] Add a method for getrf! (#2100) (@amontoison)
- [CUSOLVER] Fix a typo with jobu / jobvt in gesvd (#2101) (@amontoison)
- Call exit when handling exceptions. (#2103) (@maleadt)
- Bump packages. (#2104) (@maleadt)
- Bump actions/checkout from 3 to 4 (#2106) (@dependabot[bot])
- Update manifest (#2107) (@github-actions[bot])
- Make Ref mutable on the GPU. (#2109) (@maleadt)
- CompatHelper: bump compat for CEnum to 0.5, (keep existing compat) (#2110) (@github-actions[bot])
- Small profiler improvements (#2113) (@maleadt)
- Update manifest (#2114) (@github-actions[bot])
- [CUSPARSE] Wrap new functions added with CUDA 12.2 (#2116) (@amontoison)
- [CUSOLVER] Add new methods for \ and inv (#2117) (@amontoison)
- Fix incorrect timing results for `CUDA.@elapsed` (#2118) (@thomasfaingnaert)
- [CUSOLVER] Interface sparse Cholesky and QR factorizations (#2121) (@amontoison)
- Update manifest (#2123) (@github-actions[bot])
- Profiler: Show used local memory. (#2124) (@maleadt)
- Support for CUDA 12.3 (#2125) (@maleadt)
- [CUSOLVER] Add Xsyevdx! and Xgesvdr! (#2127) (@amontoison)
- [CUSOLVER] Add Xgesvdp (#2128) (@amontoison)
- Profiler: don't crop when rendering to a file. (#2131) (@maleadt)
- Regenerate headers for CUDA 12.3. (#2132) (@maleadt)
- [CUSPARSE] Fix a bug with triangular solves (#2134) (@amontoison)
- CompatHelper: add new compat entry for Statistics at version 1, (keep existing compat) (#2135) (@github-actions[bot])
- CompatHelper: add new compat entry for LazyArtifacts at version 1, (keep existing compat) (#2136) (@github-actions[bot])
- Profiler: Parse and visualize NVTX marker data. (#2137) (@maleadt)
- Better support for unified and host memory (#2138) (@maleadt)
- Profiler: Improve compatibility with Pluto.jl and friends. (#2139) (@maleadt)
- Avoid allocations during derived array construction. (#2142) (@maleadt)
- More performance tweaks for memory copying (#2143) (@maleadt)
- Don't use libdevice's fmin/fmax. (#2144) (@maleadt)
- Update documentation (#2146) (@maleadt)
- Fixes for sm_61 (#2151) (@maleadt)
- Update sparse factorizations (#2152) (@amontoison)
- Don't call into LLVM's fmin/fmax on <sm_80. (#2154) (@maleadt)
- Only prefetch unified memory if concurrent access is possible. (#2155) (@maleadt)
- Support wrapping an Array with a CuArray without HMM. (#2156) (@maleadt)
- Sanitizer improvements. (#2157) (@maleadt)
- [CUSPARSE] Update the wrapper of cusparseSpSV_updateMatrix (#2159) (@amontoison)
- Profiler improvements: (textual) time distribution, at-bprofile. (#2162) (@maleadt)
- [CUSPARSE] Update the interface for triangular solves (#2164) (@amontoison)
- [CUSPARSE] Remove code related to old CUDA toolkits (#2165) (@amontoison)
- Detect compute-exclusive mode and adjust testing. (#2166) (@maleadt)
- expand docs on launch parameters (#2167) (@simonbyrne)
- Make CUDA.set_runtime_version force the default behavior. (#2169) (@maleadt)
- kernel docs: fix formatting, clean up awkward sentence (#2172) (@simonbyrne)
- [CUSOLVER] Don't reuse the sparse handles (#2173) (@amontoison)
- Added kronecker product support for dense matrices (#2177) (@albertomercurio)
- Update to CUTENSOR 2.0 (#2178) (@maleadt)
- Fix typos and simplify wording in performance tips docs (#2179) (@Zentrik)
- provide more information on kernel compilation error (#2180) (@simonbyrne)
- [CUSPARSE] Test CUSPARSE_SPMV_COO_ALG2 (#2182) (@amontoison)
- [CUSPARSE] Use cusparseSpMM_preprocess (#2183) (@amontoison)
- [CUSPARSE] Use cusparseSDDMM_preprocess (#2184) (@amontoison)
- Add the structures ILU0Info() and IC0Info() for the preconditioners (#2187) (@amontoison)
- [CUSOLVER] Add a structure CuSolverParameters for the generic API (#2188) (@amontoison)
- Support more kwarg syntax with kernel launches (#2189) (@maleadt)
- Fix typo in docs/src/development/troubleshooting.md (#2193) (@jcsahnwaldt)
- NVML: Add support for clock queries. (#2194) (@maleadt)
- Fix Random.jl seeding for 1.11 (#2199) (@IanButterworth)
- Improvements to context handling (#2200) (@maleadt)
- Add a concurrent kwarg to profiling macros. (#2201) (@maleadt)
- Rework unique context management. (#2202) (@maleadt)
- Preserve the buffer type when broadcasting. (#2203) (@maleadt)
- Fixes for Windows (#2206) (@maleadt)
- Bump Aqua. (#2207) (@maleadt)
- Updates for new CUQUANTUM (#2210) (@kshyatt)
- CUSPARSE: Eagerly combine duplicate elements on construction. (#2213) (@maleadt)
- CompatHelper: bump compat for BFloat16s to 0.5, (keep existing compat) (#2214) (@github-actions[bot])
- Bump the CUDA Runtime for CUDA 12.3.2. (#2217) (@maleadt)
- Default to testing with only a single device. (#2221) (@maleadt)
- Backports for v5.1 (#2224) (@maleadt)
- Take care not to spawn tasks during precompilation. (#2226) (@maleadt)
- cuTensor fixes (#2228) (@maleadt)
- Bump versions. (#2229) (@maleadt)
- Add a note about threaded for-blocks. (#2232) (@kshyatt)
- cuTENSOR plan handling changes. (#2234) (@maleadt)
- Fix dynamic dispatch issues (#2235) (@MilesCranmer)
- CUPTI: Add high-level wrappers for the callback API. (#2239) (@maleadt)
- Fixes for nightly (#2240) (@maleadt)
- CUBLAS: Support more strided inputs (#2242) (@maleadt)
- CuSparseArrayCSR (fixed cat ambiguities from #1944) (#2244) (@nikopj)
- Slightly rework error handling (#2245) (@maleadt)
- cuTENSOR improvements (#2246) (@maleadt)
- Make `@device_code_sass` work with non-Julia kernels. (#2247) (@maleadt)
- Improve Tegra detection. (#2251) (@maleadt)
- Added a few SparseArrays functions (#2254) (@albertomercurio)
- Reduce locking in the handle cache (#2256) (@maleadt)
- Mark all CUDA ccalls as GC safe (#2262) (@vchuravy)
- cuTENSOR: Fix reference to undefined variable (#2263) (@lkdvos)
- cuTENSOR: refactor obtaining compute_type as part of plan (#2264) (@lkdvos)
- Re-generate headers. (#2265) (@maleadt)
- Update to CUDNN 9. (#2267) (@maleadt)
- [CUBLAS] Use the ILP64 API with CUDA 12 (#2270) (@amontoison)
- CompatHelper: bump compat for GPUCompiler to 0.26, (keep existing compat) (#2271) (@github-actions[bot])
- Minor improvements to nonblocking synchronization. (#2272) (@maleadt)
- Add extension package for StaticArrays (#2273) (@trahflow)
- Fix cuTensor, cuTensorNet and cuStateVec when using local Toolkit (#2274) (@bjoe2k4)
- Cached workspace prototype for custatevec (#2279) (@kshyatt)
- Update the Julia wrappers for v12.4 (#2282) (@amontoison)
- Add support for CUDA 12.4. (#2286) (@maleadt)
- Test suite changes (#2288) (@maleadt)
- Fix mixed-buffer/mixed-shape broadcasts. (#2290) (@maleadt)
- Fix typo in performance tips (#2294) (@Zentrik)
- Make it possible to customize the CuIterator adaptor. (#2297) (@maleadt)
- Set default buffer size in CUSPARSE `mm!` functions (#2298) (@lpawela)
- Avoid OOMs during OOM handling. (#2299) (@maleadt)
- [CUSOLVER] Add tests for geqrf, orgqr and ormqr (#2300) (@amontoison)
- [CUSOLVER] Interface larft! (#2301) (@amontoison)
- Fix RNG determinism when using wrapped arrays. (#2307) (@maleadt)
- [CUBLAS] Interface gemm_grouped_batched (#2310) (@amontoison)
- [CUSPARSE] Add a method convert for the type cusparseSpSMUpdate_t (#2311) (@amontoison)
Closed issues:
- Element-wise conversion to Duals (#127)
- IDEA: CuHostArray (#28)
- Make Ref pass by-reference (#267)
- Failed to compile PTX code when using NSight on Win11 (#1601)
- view(data, idx) boundschecking is disproportionately expensive (#1678)
- [CUSOLVER] Add a with_workspaces function to allocate two buffers (Device / Host) (#1767)
- Trouble using nsight systems for profiling CUDA in Julia (#1779)
- dlopen("libcudart") results in duplicate libraries (#1814)
- Support for JLD2 (#1833)
- Windows Defender mis-labels artifacts as threat (#1836)
- Support Cholesky factorization of CuSparseMatrixCSR (#1855)
- Runtime not re-selected after driver upgrade (#1877)
- Failure to initialize with CUDA_VISIBLE_DEVICES='' (#1945)
- Cannot precompile GPU code with PrecompileTools (#2006)
- Evaluating sparse matrices in the REPL has a huge memory footprint (#2016)
- CUDA_SDK_jll: cuda.h in different locations depending on the platform (#2066)
- `StaticArrays.SHermitianCompact` not working in kernels in Julia 1.10.0-beta2 (#2069)
- Support for LinearAlgebra.pinv (#2070)
- PTX ISA 8.1 support (#2080)
- Segmentation fault when importing CUDA (#2083)
- "No system CUDA driver found" on NixOS (#2089)
- `CUDA.rand(Int64, m, n)` cannot be used when `m` or `n` is zero (#2093)
- Miss...
v5.2.0
CUDA v5.2.0
Merged pull requests:
- CuSparseArrayCSR (N dim array) with batched matmatmul (bmm) (#1944) (@nikopj)
- Update to CUTENSOR 2.0 (#2178) (@maleadt)
- Updates for new CUQUANTUM (#2210) (@kshyatt)
- Take care not to spawn tasks during precompilation. (#2226) (@maleadt)
- cuTensor fixes (#2228) (@maleadt)
- Bump versions. (#2229) (@maleadt)
- Add a note about threaded for-blocks. (#2232) (@kshyatt)
- cuTENSOR plan handling changes. (#2234) (@maleadt)
- Fix dynamic dispatch issues (#2235) (@MilesCranmer)
- CUPTI: Add high-level wrappers for the callback API. (#2239) (@maleadt)
- Fixes for nightly (#2240) (@maleadt)
- CUBLAS: Support more strided inputs (#2242) (@maleadt)
Closed issues:
- Trouble using nsight systems for profiling CUDA in Julia (#1779)
- Evaluating sparse matrices in the REPL has a huge memory footprint (#2016)
- Intermittent CI failure: Segfault during nonblocking synchronization (#2141)
- First test for Julia/CUDA with 15 failures (#2158)
- Update to CUTENSOR 2.0 (#2174)
- Tests fail for CUDA#master (#2223)
- Test failures on Nvidia GH200 (#2227)
- mul! should support strided outputs (#2230)
- Please add support for older CUDA versions (CUDA 8 and older) (#2231)
- NSight Compute: prevent API calls during precompilation (#2233)
- Integrated profiler: detect lack of permissions (#2237)
v5.1.2
CUDA v5.1.2
Merged pull requests:
- kernel docs: fix formatting, clean up awkward sentence (#2172) (@simonbyrne)
- [CUSOLVER] Don't reuse the sparse handles (#2173) (@amontoison)
- Added kronecker product support for dense matrices (#2177) (@albertomercurio)
- Fix typos and simplify wording in performance tips docs (#2179) (@Zentrik)
- provide more information on kernel compilation error (#2180) (@simonbyrne)
- [CUSPARSE] Test CUSPARSE_SPMV_COO_ALG2 (#2182) (@amontoison)
- [CUSPARSE] Use cusparseSpMM_preprocess (#2183) (@amontoison)
- [CUSPARSE] Use cusparseSDDMM_preprocess (#2184) (@amontoison)
- Add the structures ILU0Info() and IC0Info() for the preconditioners (#2187) (@amontoison)
- [CUSOLVER] Add a structure CuSolverParameters for the generic API (#2188) (@amontoison)
- Support more kwarg syntax with kernel launches (#2189) (@maleadt)
- Fix typo in docs/src/development/troubleshooting.md (#2193) (@jcsahnwaldt)
- NVML: Add support for clock queries. (#2194) (@maleadt)
- Fix Random.jl seeding for 1.11 (#2199) (@IanButterworth)
- Improvements to context handling (#2200) (@maleadt)
- Add a concurrent kwarg to profiling macros. (#2201) (@maleadt)
- Rework unique context management. (#2202) (@maleadt)
- Preserve the buffer type when broadcasting. (#2203) (@maleadt)
- Fixes for Windows (#2206) (@maleadt)
- Bump Aqua. (#2207) (@maleadt)
- CUSPARSE: Eagerly combine duplicate elements on construction. (#2213) (@maleadt)
- CompatHelper: bump compat for BFloat16s to 0.5, (keep existing compat) (#2214) (@github-actions[bot])
- Bump the CUDA Runtime for CUDA 12.3.2. (#2217) (@maleadt)
- Default to testing with only a single device. (#2221) (@maleadt)
- Backports for v5.1 (#2224) (@maleadt)
Closed issues:
- More informative errors when parameter size is too big (#2119)
- Modifying `struct` containing `CuArray` fails in threads in 5.0.0 and 5.1.0 (#2171)
- Matmul of CuArray{ComplexF32} and CuArray{Float32} is slow (#2175)
- Support for combining duplicate elements in sparse matrices (#2185)
- Interactive sessions: periodically trim the memory pool (#2190)
- Broadcast does not preserve buffer type (#2191)
- CUDA doesn't precompile on Julia nightly/1.11 (#2195)
- Latest Julia: UndefVarError: `make_seed` not defined in `Random` (#2198)
- CUDA installation fails on Apple Silicon/Julia 1.10 (#2211)
- Most recent package versions not supported on CUDA.jl (#2212)
- Testing of CUDA fails (#2222)
- `--debug-info=2` makes `NNlibCUDACUDNNExt` precompilation run forever (#2225)
v5.1.1
CUDA v5.1.1
Merged pull requests:
- Sanitizer improvements. (#2157) (@maleadt)
- [CUSPARSE] Update the wrapper of cusparseSpSV_updateMatrix (#2159) (@amontoison)
- Profiler improvements: (textual) time distribution, at-bprofile. (#2162) (@maleadt)
- [CUSPARSE] Update the interface for triangular solves (#2164) (@amontoison)
- [CUSPARSE] Remove code related to old CUDA toolkits (#2165) (@amontoison)
- Detect compute-exclusive mode and adjust testing. (#2166) (@maleadt)
- expand docs on launch parameters (#2167) (@simonbyrne)
- Make CUDA.set_runtime_version force the default behavior. (#2169) (@maleadt)
Closed issues:
- High CPU load during GPU synchronization (#2161)
v5.1.0
CUDA v5.1.0
CUDA.jl 5.1 greatly improves support for two important parts of the CUDA toolkit: unified memory, for accessing GPU memory on the CPU and vice versa, and cooperative groups, which offer a more modular approach to kernel programming. For more details, see the blog post.
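As a rough illustration of both features (a hedged sketch: the `unified` keyword and the `CG` entry points follow the v5.1 release notes and blog post, so treat the exact names as assumptions):

```julia
using CUDA
using CUDA: CG

# Unified memory: one allocation visible from both host and device.
A = cu(rand(Float32, 1024); unified=true)
A[1] = 42f0    # host-side access, without a scalar-indexing error
B = A .+ 1f0   # ordinary GPU broadcast on the same buffer

# Cooperative groups: operate on an explicit block-level group
# instead of calling the bare sync_threads() intrinsic.
function scale!(y, x, a)
    block = CG.this_thread_block()
    i = (blockIdx().x - 1) * blockDim().x + CG.thread_rank(block)
    if i <= length(x)
        @inbounds y[i] = a * x[i]
    end
    CG.sync(block)  # group-scoped synchronization
    return
end

x = CUDA.rand(Float32, 1024); y = similar(x)
@cuda threads=256 blocks=4 scale!(y, x, 2f0)
```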
Merged pull requests:
- [CUSOLVER] Add generic routines (#2074) (@amontoison)
- Rework and extend the cooperative groups API. (#2081) (@maleadt)
- [CUSOLVER] Add a method for geqrf! (#2085) (@amontoison)
- Fix some typos in performance tips (#2086) (@Zentrik)
- Improve PTX ISA selection (#2088) (@maleadt)
- Update manifest (#2090) (@github-actions[bot])
- support ChainRulesCore inplaceability (#2091) (@piever)
- Add a method inv(CuMatrix) (#2095) (@amontoison)
- Add mul!(A, B, C) where B or C is a diagonal matrix (#2096) (@amontoison)
- Add CUDA_Runtime_Discovery dependency to sublibraries. (#2097) (@maleadt)
- Handle and test zero-size inputs to RNGs. (#2098) (@maleadt)
- Add a with_workspaces function (#2099) (@amontoison)
- [CUSOLVER] Add a method for getrf! (#2100) (@amontoison)
- [CUSOLVER] Fix a typo with jobu / jobvt in gesvd (#2101) (@amontoison)
- Call exit when handling exceptions. (#2103) (@maleadt)
- Bump packages. (#2104) (@maleadt)
- Bump actions/checkout from 3 to 4 (#2106) (@dependabot[bot])
- Update manifest (#2107) (@github-actions[bot])
- Make Ref mutable on the GPU. (#2109) (@maleadt)
- CompatHelper: bump compat for CEnum to 0.5, (keep existing compat) (#2110) (@github-actions[bot])
- Small profiler improvements (#2113) (@maleadt)
- Update manifest (#2114) (@github-actions[bot])
- [CUSPARSE] Wrap new functions added with CUDA 12.2 (#2116) (@amontoison)
- [CUSOLVER] Add new methods for \ and inv (#2117) (@amontoison)
- Fix incorrect timing results for `CUDA.@elapsed` (#2118) (@thomasfaingnaert)
- [CUSOLVER] Interface sparse Cholesky and QR factorizations (#2121) (@amontoison)
- Update manifest (#2123) (@github-actions[bot])
- Profiler: Show used local memory. (#2124) (@maleadt)
- Support for CUDA 12.3 (#2125) (@maleadt)
- [CUSOLVER] Add Xsyevdx! and Xgesvdr! (#2127) (@amontoison)
- [CUSOLVER] Add Xgesvdp (#2128) (@amontoison)
- Profiler: don't crop when rendering to a file. (#2131) (@maleadt)
- Regenerate headers for CUDA 12.3. (#2132) (@maleadt)
- [CUSPARSE] Fix a bug with triangular solves (#2134) (@amontoison)
- CompatHelper: add new compat entry for Statistics at version 1, (keep existing compat) (#2135) (@github-actions[bot])
- CompatHelper: add new compat entry for LazyArtifacts at version 1, (keep existing compat) (#2136) (@github-actions[bot])
- Profiler: Parse and visualize NVTX marker data. (#2137) (@maleadt)
- Better support for unified and host memory (#2138) (@maleadt)
- Profiler: Improve compatibility with Pluto.jl and friends. (#2139) (@maleadt)
- Avoid allocations during derived array construction. (#2142) (@maleadt)
- More performance tweaks for memory copying (#2143) (@maleadt)
- Don't use libdevice's fmin/fmax. (#2144) (@maleadt)
- Update documentation (#2146) (@maleadt)
- Fixes for sm_61 (#2151) (@maleadt)
- Update sparse factorizations (#2152) (@amontoison)
- Don't call into LLVM's fmin/fmax on <sm_80. (#2154) (@maleadt)
- Only prefetch unified memory if concurrent access is possible. (#2155) (@maleadt)
- Support wrapping an Array with a CuArray without HMM. (#2156) (@maleadt)
Closed issues:
- Element-wise conversion to Duals (#127)
- IDEA: CuHostArray (#28)
- Make Ref pass by-reference (#267)
- view(data, idx) boundschecking is disproportionately expensive (#1678)
- [CUSOLVER] Add a with_workspaces function to allocate two buffers (Device / Host) (#1767)
- dlopen("libcudart") results in duplicate libraries (#1814)
- Support for JLD2 (#1833)
- Windows Defender mis-labels artifacts as threat (#1836)
- Support Cholesky factorization of CuSparseMatrixCSR (#1855)
- Runtime not re-selected after driver upgrade (#1877)
- Failure to initialize with CUDA_VISIBLE_DEVICES='' (#1945)
- Cannot precompile GPU code with PrecompileTools (#2006)
- CUDA_SDK_jll: cuda.h in different locations depending on the platform (#2066)
- PTX ISA 8.1 support (#2080)
- Segmentation fault when importing CUDA (#2083)
- "No system CUDA driver found" on NixOS (#2089)
- `CUDA.rand(Int64, m, n)` cannot be used when `m` or `n` is zero (#2093)
- Missing CUDA_Runtime_Discovery as a dependency in cuDNN (#2094)
- Binaries for Jetson (#2105)
- Minimum/maximum of array of NaNs is infinity (#2111)
- Performance regression for multiple `@sync` copyto! on CUDA v5 (#2112)
- [CUBLAS] Regenerate the wrappers with updated argument types (#2115)
- Unable to allocate unified memory buffers (#2120)
- CUDA 12.3 has been released (#2122)
- atomic min, max for Float32 and Float64 (#2129)
- Native profiler output is limited to around 100 columns when printing to a file (#2130)
- LLVM generates max.NaN which only works on sm_80 (#2148)
- Unified memory-related error on Tegra T194 (#2149)
- Errors on sm_61 (#2150)
v5.0.0
CUDA v5.0.0
Blog post: https://info.juliahub.com/cuda-jl-5-0-changes
This is a breaking release, but the breaking changes are minimal (see the blog post for details):
- Julia 1.8 is now required, and only CUDA 11.4+ is supported
- the selection of local toolkits has changed slightly (see the sketch below)
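For reference, opting into a local toolkit now goes through a preference (a hedged sketch: `set_runtime_version!` and its `local_toolkit` keyword follow the v5.0 release notes, so verify against the documentation):

```julia
using CUDA

# Prefer a locally-installed CUDA toolkit over artifact-provided binaries.
# The setting is stored as a preference and takes effect after restarting Julia.
CUDA.set_runtime_version!(v"12.2"; local_toolkit=true)

# Revert to the default behavior (artifact-provided runtime).
CUDA.reset_runtime_version!()
```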
Merged pull requests:
- Added support for more transform directions (#1903) (@RainerHeintzmann)
- Add some performance tips to the documentation (#1999) (@Zentrik)
- Re-introduce the 'blocking' kwargs to at-sync. (#2060) (@maleadt)
- Adapt to GPUCompiler#master. (#2062) (@maleadt)
- Batched SVD added (gesvdjBatched and gesvdaStridedBatched) (#2063) (@nikopj)
- Use released GPUCompiler. (#2064) (@maleadt)
- Fixes for Windows. (#2065) (@maleadt)
- Switch to GPUArrays buffer management. (#2068) (@maleadt)
- Update CUDA 12 to Update 2. (#2071) (@maleadt)
- Update manifest (#2076) (@github-actions[bot])
- Test improvements (#2079) (@maleadt)
- Update manifest (#2082) (@github-actions[bot])
v4.4.1
CUDA v4.4.1
Closed issues:
- CUDA driver device support does not match toolkit (#70)
- Launching kernels should not allocate (#66)
- sync_threads() appears to not be sync'ing threads (#61)
- Exception when using CuArrays with Flux (#129)
- Kernel using MVector fails to compile or crashes at runtime due to heap allocation (#45)
- Performance regression on matrix multiplication between CUDA.jl 1.3.3 and 2.1.0/master (#538)
- Improve 'VS C++ redistributable' error message (#764)
- CUSPARSE does not support reductions (#1406)
- CUDA test failed (#1690)
- Type constructor in broadcast doesn't compile (#1761)
- accumulate(+) gives different results for CuArray compared to Array. (#1810)
- Compat driver: preload all libraries (#1859)
- Stream synchronization is slow when waiting on the event from CUDA (#1910)
- cuDNN: Store convolution algorithm choice to disk. (#1947)
- Disable 'No CUDA-capable device found' error log (#1955)
- CUDNN_STATUS_NOT_SUPPORTED using 1D CNN model (#1977)
- Memory allocations during in-place sparse matrix-vector multiplication (#1982)
- `CUSPARSE.sum_dim1` sums the absolute values of elements (#1983)
- Update to CUDA 12.2 (#1984)
- `unsafe_wrap` fails on zero element CuArrays (#1985)
- `rand` in kernel works in a deterministic way (#2008)
- Scalar indexing with `CuArray * ReshapedArray{SubArray{CuArray}}` (#2009)
- volumerhs performance regression (#2010)
- CuSparseMatrix constructors allocate too much memory? (#2015)
- Native profiler using CUPTI (#2017)
- libLLVM-15jl.so (#2018)
- "symbol multiply defined" error (#2021)
- Confusion on row major vs column major (#2023)
- Printing of CuArrays gives zeros or random numbers (#2033)
- `sortperm!` fails when output is `UInt` vector (#2046)
- Re-introduce spinning loop before nonblocking synchronization (#2057)
Merged pull requests:
- Check mathType only if not Float32 (#1943) (@RomeoV)
- 1.10 enablement (#1946) (@dkarrasch)
- Implement reverse lookup (Ptr->Tuple) for CUDNN descriptors. (#1948) (@RomeoV)
- Wrapper with tests for `gemmBatchedEx!` (#1975) (@lpawela)
- Add wrappers for `gemv_batched!` (#1981) (@lpawela)
- Update `CUSPARSE.sum_dim<n>` to allow for arbitrary function on elements (#1987) (@lpawela)
- Update manifest (#1988) (@github-actions[bot])
- Add vectorized cached loads (#1993) (@Zentrik)
- Update manifest (#1995) (@github-actions[bot])
- Fix typo in captured macro example (#1996) (@Zentrik)
- Adapt Type call broadcasting to a function (#2000) (@simonbyrne)
- [CUSPARSE] Added support for generalized dot product dot(x, A, y) = dot(x, A * y) without allocating A * y (#2001) (@albertomercurio)
- Update manifest (#2002) (@github-actions[bot])
- Support for printing types. (#2003) (@maleadt)
- Fix accumulate bug (#2005) (@chrstphrbrns)
- Update manifest (#2013) (@github-actions[bot])
- Add a raw mode to code_sass. (#2019) (@maleadt)
- Update manifest (#2022) (@github-actions[bot])
- Add a native profiler. (#2024) (@maleadt)
- Perform synchronization on a worker thread (#2025) (@maleadt)
- Remove broken video link in docs (#2028) (@christiangnrd)
- When freeing memory, use the high-level device getter. (#2029) (@maleadt)
- Add support for @cuda fastmath (#2030) (@maleadt)
- Make "CUDA.jl" a link on the doc entry page (#2031) (@carstenbauer)
- Add support for CUDA 12.2. (#2034) (@maleadt)
- rand: seed kernels from the host. (#2035) (@maleadt)
- Update wrappers for CUDA 12.2. (#2039) (@maleadt)
- On CUDA 12.2, have the memory pool enforce hard memory limits. (#2040) (@maleadt)
- Delay all initialization errors until run time. (#2041) (@maleadt)
- JLL/CI/Julia changes. (#2042) (@maleadt)
- Add support for NVTX events to the integrated profiler. (#2043) (@maleadt)
- Update cuStateVec to cuQuantum 23.6. (#2044) (@maleadt)
- Add some more fastmath functions (#2047) (@Zentrik)
- Fixup wrong key lookup. (#2048) (@RomeoV)
- Update manifest (#2049) (@github-actions[bot])
- Make sortperm! resilient to type mismatches. (#2051) (@maleadt)
- Disable tests that cause GC corruption on 1.10. (#2053) (@maleadt)
- enable dependabot for GitHub actions (#2054) (@ranocha)
- Bump actions/checkout from 2 to 3 (#2055) (@dependabot[bot])
- Bump peter-evans/create-pull-request from 3 to 5 (#2056) (@dependabot[bot])
- Rework how local toolkits are selected. (#2058) (@maleadt)
- Busy-wait before doing nonblocking synchronization. (#2059) (@maleadt)