Releases: JuliaGPU/CUDA.jl
v5.3.2
CUDA v5.3.2
Merged pull requests:
- Add EnzymeCore extension for parent_job (#2281) (@vchuravy)
- Consider running GC when allocating and synchronizing (#2304) (@maleadt)
- Refactor memory wrappers (#2335) (@maleadt)
- Auto-detect external profilers. (#2339) (@maleadt)
- Fix performance of indexing unified memory. (#2340) (@maleadt)
- Improve exception output (#2342) (@maleadt)
- Test multigpu on CI (#2348) (@maleadt)
- cuQuantum 24.3: Bump cuTensorNet. (#2350) (@maleadt)
- cuQuantum 24.3: Bump cuStateVec. (#2351) (@maleadt)
Closed issues:
- CuArrays don't seem to display correctly in VS code (#875)
- Task scheduling can result in delays when synchronizing (#1525)
- Docs: add example on task-based parallelism with explicit synchronization (#1566)
- Exception output from many threads is not helpful (#1780)
- Autodetect external profiler (#2176)
- LazyInitialized is not GC-safe (#2216)
- Track CuArray stream usage (#2236)
- Improve cross-device usage (#2323)
- CUBLASLt wrapper for `cublasLtMatmulDescSetAttribute` can have device buffers as input (#2337)
- Improve error message when assigning real-valued array with complex numbers (#2341)
- `@device_code_sass` broken (#2343)
- README says CUDA 11 is supported, but the last version to support it is v4.4 (#2345)
- `@gcsafe_ccall` breaks inlining of ccall wrappers (#2347)
v5.3.1
CUDA v5.3.1
Merged pull requests:
- [CUSOLVER] Fix the dispatch for syevd! and heevd! (#2309) (@amontoison)
- Regenerate headers (#2324) (@maleadt)
- Add some installation tips to docs/README.md (#2326) (@jlchan)
- fix broadcast defaulting to Mem.Unified() (#2327) (@vpuri3)
- Diagnose kernel limits on launch failure. (#2329) (@maleadt)
- Work around a CUPTI bug in CUDA 12.4 Update 1. (#2330) (@maleadt)
v5.3.0
CUDA v5.3.0
Merged pull requests:
- CuSparseArrayCSR (fixed cat ambiguities from #1944) (#2244) (@nikopj)
- Slightly rework error handling (#2245) (@maleadt)
- cuTENSOR improvements (#2246) (@maleadt)
- Make `@device_code_sass` work with non-Julia kernels. (#2247) (@maleadt)
- Improve Tegra detection. (#2251) (@maleadt)
- Added a few SparseArrays functions (#2254) (@albertomercurio)
- Reduce locking in the handle cache (#2256) (@maleadt)
- Mark all CUDA ccalls as GC safe (#2262) (@vchuravy)
- cuTENSOR: Fix reference to undefined variable (#2263) (@lkdvos)
- cuTENSOR: refactor obtaining compute_type as part of plan (#2264) (@lkdvos)
- Re-generate headers. (#2265) (@maleadt)
- Update to CUDNN 9. (#2267) (@maleadt)
- [CUBLAS] Use the ILP64 API with CUDA 12 (#2270) (@amontoison)
- CompatHelper: bump compat for GPUCompiler to 0.26, (keep existing compat) (#2271) (@github-actions[bot])
- Minor improvements to nonblocking synchronization. (#2272) (@maleadt)
- Add extension package for StaticArrays (#2273) (@trahflow)
- Fix cuTensor, cuTensorNet and cuStateVec when using local Toolkit (#2274) (@bjoe2k4)
- Cached workspace prototype for custatevec (#2279) (@kshyatt)
- Update the Julia wrappers for v12.4 (#2282) (@amontoison)
- Add support for CUDA 12.4. (#2286) (@maleadt)
- Test suite changes (#2288) (@maleadt)
- Fix mixed-buffer/mixed-shape broadcasts. (#2290) (@maleadt)
- Towards supporting Julia 1.11 (#2291) (@maleadt)
- Fix typo in performance tips (#2294) (@Zentrik)
- Make it possible to customize the CuIterator adaptor. (#2297) (@maleadt)
- Set default buffer size in CUSPARSE `mm!` functions (#2298) (@lpawela)
- Avoid OOMs during OOM handling. (#2299) (@maleadt)
- [CUSOLVER] Add tests for geqrf, orgqr and ormqr (#2300) (@amontoison)
- [CUSOLVER] Interface larft! (#2301) (@amontoison)
- Fix RNG determinism when using wrapped arrays. (#2307) (@maleadt)
- sortperm with dims (#2308) (@xaellison)
- [CUBLAS] Interface gemm_grouped_batched (#2310) (@amontoison)
- [CUSPARSE] Add a method convert for the type cusparseSpSMUpdate_t (#2311) (@amontoison)
- Avoid capturing `AbstractArray`s in `BoundsError` (#2314) (@lcw)
- Clarify debug level hint. (#2316) (@maleadt)
Closed issues:
- Failed to compile PTX code when using NSight on Win11 (#1601)
- `sortperm` fails with `dims` keyword (#2061)
- NVTX-related segfault on Windows under compute-sanitizer (#2204)
- Inverse Complex-to-Real FFT allocates GPU memory (#2249)
- cuDNN not available for your platform (#2252)
- Cannot reset CuArray to zero (#2257)
- Cannot take gradient of `sort` on 2D CuArray (#2259)
- Multi-threaded code hanging forever with Julia 1.10 (#2261)
- CUBLAS: nrm2 support for StridedCuArray with length requiring Int64 (#2268)
- Adjoint not supported on Diagonal arrays (#2275)
- Regression in broadcast: getting Array (Julia 1.10) instead of CuArray (Julia 1.9) (#2276)
- Release v5.3? (#2283)
- Wrap CUDSS? (#2287)
- Bug concerning broadcast between device array and unified array (#2289)
- `StackOverflowError` trying to throw `OutOfGPUMemoryError`, subsequent errors (#2292)
- BUG: sortperm! seems to perform much slower than it should (#2293)
- Multiplying `CuSparseMatrixCSC` by `CuMatrix` results in `Out of GPU memory` (#2296)
- BFloat16 support broken on Julia 1.11 (#2306)
- does not emit line info for debugging/profiling (#2312)
- Kernel using `StaticArray` compiles in Julia v1.9.4 but not in v1.10.2 (#2313)
- Using copyto! with SharedArray triggers scalar indexing disallowed error (#2317)
v4.4.2
CUDA v4.4.2
Merged pull requests:
- Added support for more transform directions (#1903) (@RainerHeintzmann)
- CuSparseArrayCSR (N dim array) with batched matmatmul (bmm) (#1944) (@nikopj)
- Add some performance tips to the documentation (#1999) (@Zentrik)
- Re-introduce the 'blocking' kwargs to at-sync. (#2060) (@maleadt)
- Adapt to GPUCompiler#master. (#2062) (@maleadt)
- Batched SVD added (gesvdjBatched and gesvdaStridedBatched) (#2063) (@nikopj)
- Use released GPUCompiler. (#2064) (@maleadt)
- Fixes for Windows. (#2065) (@maleadt)
- Switch to GPUArrays buffer management. (#2068) (@maleadt)
- Update CUDA 12 to Update 2. (#2071) (@maleadt)
- [CUSOLVER] Add generic routines (#2074) (@amontoison)
- Update manifest (#2076) (@github-actions[bot])
- Test improvements (#2079) (@maleadt)
- Rework and extend the cooperative groups API. (#2081) (@maleadt)
- Update manifest (#2082) (@github-actions[bot])
- [CUSOLVER] Add a method for geqrf! (#2085) (@amontoison)
- Fix some typos in performance tips (#2086) (@Zentrik)
- Improve PTX ISA selection (#2088) (@maleadt)
- Update manifest (#2090) (@github-actions[bot])
- support ChainRulesCore inplaceability (#2091) (@piever)
- Add a method inv(CuMatrix) (#2095) (@amontoison)
- Add mul!(A, B, C) where B or C is a diagonal matrix (#2096) (@amontoison)
- Add CUDA_Runtime_Discovery dependency to sublibraries. (#2097) (@maleadt)
- Handle and test zero-size inputs to RNGs. (#2098) (@maleadt)
- Add a with_workspaces function (#2099) (@amontoison)
- [CUSOLVER] Add a method for getrf! (#2100) (@amontoison)
- [CUSOLVER] Fix a typo with jobu / jobvt in gesvd (#2101) (@amontoison)
- Call exit when handling exceptions. (#2103) (@maleadt)
- Bump packages. (#2104) (@maleadt)
- Bump actions/checkout from 3 to 4 (#2106) (@dependabot[bot])
- Update manifest (#2107) (@github-actions[bot])
- Make Ref mutable on the GPU. (#2109) (@maleadt)
- CompatHelper: bump compat for CEnum to 0.5, (keep existing compat) (#2110) (@github-actions[bot])
- Small profiler improvements (#2113) (@maleadt)
- Update manifest (#2114) (@github-actions[bot])
- [CUSPARSE] Wrap new functions added with CUDA 12.2 (#2116) (@amontoison)
- [CUSOLVER] Add new methods for \ and inv (#2117) (@amontoison)
- Fix incorrect timing results for `CUDA.@elapsed` (#2118) (@thomasfaingnaert)
- [CUSOLVER] Interface sparse Cholesky and QR factorizations (#2121) (@amontoison)
- Update manifest (#2123) (@github-actions[bot])
- Profiler: Show used local memory. (#2124) (@maleadt)
- Support for CUDA 12.3 (#2125) (@maleadt)
- [CUSOLVER] Add Xsyevdx! and Xgesvdr! (#2127) (@amontoison)
- [CUSOLVER] Add Xgesvdp (#2128) (@amontoison)
- Profiler: don't crop when rendering to a file. (#2131) (@maleadt)
- Regenerate headers for CUDA 12.3. (#2132) (@maleadt)
- [CUSPARSE] Fix a bug with triangular solves (#2134) (@amontoison)
- CompatHelper: add new compat entry for Statistics at version 1, (keep existing compat) (#2135) (@github-actions[bot])
- CompatHelper: add new compat entry for LazyArtifacts at version 1, (keep existing compat) (#2136) (@github-actions[bot])
- Profiler: Parse and visualize NVTX marker data. (#2137) (@maleadt)
- Better support for unified and host memory (#2138) (@maleadt)
- Profiler: Improve compatibility with Pluto.jl and friends. (#2139) (@maleadt)
- Avoid allocations during derived array construction. (#2142) (@maleadt)
- More performance tweaks for memory copying (#2143) (@maleadt)
- Don't use libdevice's fmin/fmax. (#2144) (@maleadt)
- Update documentation (#2146) (@maleadt)
- Fixes for sm_61 (#2151) (@maleadt)
- Update sparse factorizations (#2152) (@amontoison)
- Don't call into LLVM's fmin/fmax on <sm_80. (#2154) (@maleadt)
- Only prefetch unified memory if concurrent access is possible. (#2155) (@maleadt)
- Support wrapping an Array with a CuArray without HMM. (#2156) (@maleadt)
- Sanitizer improvements. (#2157) (@maleadt)
- [CUSPARSE] Update the wrapper of cusparseSpSV_updateMatrix (#2159) (@amontoison)
- Profiler improvements: (textual) time distribution, at-bprofile. (#2162) (@maleadt)
- [CUSPARSE] Update the interface for triangular solves (#2164) (@amontoison)
- [CUSPARSE] Remove code related to old CUDA toolkits (#2165) (@amontoison)
- Detect compute-exclusive mode and adjust testing. (#2166) (@maleadt)
- expand docs on launch parameters (#2167) (@simonbyrne)
- Make CUDA.set_runtime_version force the default behavior. (#2169) (@maleadt)
- kernel docs: fix formatting, clean up awkward sentence (#2172) (@simonbyrne)
- [CUSOLVER] Don't reuse the sparse handles (#2173) (@amontoison)
- Added kronecker product support for dense matrices (#2177) (@albertomercurio)
- Update to CUTENSOR 2.0 (#2178) (@maleadt)
- Fix typos and simplify wording in performance tips docs (#2179) (@Zentrik)
- provide more information on kernel compilation error (#2180) (@simonbyrne)
- [CUSPARSE] Test CUSPARSE_SPMV_COO_ALG2 (#2182) (@amontoison)
- [CUSPARSE] Use cusparseSpMM_preprocess (#2183) (@amontoison)
- [CUSPARSE] Use cusparseSDDMM_preprocess (#2184) (@amontoison)
- Add the structures ILU0Info() and IC0Info() for the preconditioners (#2187) (@amontoison)
- [CUSOLVER] Add a structure CuSolverParameters for the generic API (#2188) (@amontoison)
- Support more kwarg syntax with kernel launches (#2189) (@maleadt)
- Fix typo in docs/src/development/troubleshooting.md (#2193) (@jcsahnwaldt)
- NVML: Add support for clock queries. (#2194) (@maleadt)
- Fix Random.jl seeding for 1.11 (#2199) (@IanButterworth)
- Improvements to context handling (#2200) (@maleadt)
- Add a concurrent kwarg to profiling macros. (#2201) (@maleadt)
- Rework unique context management. (#2202) (@maleadt)
- Preserve the buffer type when broadcasting. (#2203) (@maleadt)
- Fixes for Windows (#2206) (@maleadt)
- Bump Aqua. (#2207) (@maleadt)
- Updates for new CUQUANTUM (#2210) (@kshyatt)
- CUSPARSE: Eagerly combine duplicate elements on construction. (#2213) (@maleadt)
- CompatHelper: bump compat for BFloat16s to 0.5, (keep existing compat) (#2214) (@github-actions[bot])
- Bump the CUDA Runtime for CUDA 12.3.2. (#2217) (@maleadt)
- Default to testing with only a single device. (#2221) (@maleadt)
- Backports for v5.1 (#2224) (@maleadt)
- Take care not to spawn tasks during precompilation. (#2226) (@maleadt)
- cuTensor fixes (#2228) (@maleadt)
- Bump versions. (#2229) (@maleadt)
- Add a note about threaded for-blocks. (#2232) (@kshyatt)
- cuTENSOR plan handling changes. (#2234) (@maleadt)
- Fix dynamic dispatch issues (#2235) (@MilesCranmer)
- CUPTI: Add high-level wrappers for the callback API. (#2239) (@maleadt)
- Fixes for nightly (#2240) (@maleadt)
- CUBLAS: Support more strided inputs (#2242) (@maleadt)
- CuSparseArrayCSR (fixed cat ambiguities from #1944) (#2244) (@nikopj)
- Slightly rework error handling (#2245) (@maleadt)
- cuTENSOR improvements (#2246) (@maleadt)
- Make `@device_code_sass` work with non-Julia kernels. (#2247) (@maleadt)
- Improve Tegra detection. (#2251) (@maleadt)
- Added a few SparseArrays functions (#2254) (@albertomercurio)
- Reduce locking in the handle cache (#2256) (@maleadt)
- Mark all CUDA ccalls as GC safe (#2262) (@vchuravy)
- cuTENSOR: Fix reference to undefined variable (#2263) (@lkdvos)
- cuTENSOR: refactor obtaining compute_type as part of plan (#2264) (@lkdvos)
- Re-generate headers. (#2265) (@maleadt)
- Update to CUDNN 9. (#2267) (@maleadt)
- [CUBLAS] Use the ILP64 API with CUDA 12 (#2270) (@amontoison)
- CompatHelper: bump compat for GPUCompiler to 0.26, (keep existing compat) (#2271) (@github-actions[bot])
- Minor improvements to nonblocking synchronization. (#2272) (@maleadt)
- Add extension package for StaticArrays (#2273) (@trahflow)
- Fix cuTensor, cuTensorNet and cuStateVec when using local Toolkit (#2274) (@bjoe2k4)
- Cached workspace prototype for custatevec (#2279) (@kshyatt)
- Update the Julia wrappers for v12.4 (#2282) (@amontoison)
- Add support for CUDA 12.4. (#2286) (@maleadt)
- Test suite changes (#2288) (@maleadt)
- Fix mixed-buffer/mixed-shape broadcasts. (#2290) (@maleadt)
- Fix typo in performance tips (#2294) (@Zentrik)
- Make it possible to customize the CuIterator adaptor. (#2297) (@maleadt)
- Set default buffer size in CUSPARSE `mm!` functions (#2298) (@lpawela)
- Avoid OOMs during OOM handling. (#2299) (@maleadt)
- [CUSOLVER] Add tests for geqrf, orgqr and ormqr (#2300) (@amontoison)
- [CUSOLVER] Interface larft! (#2301) (@amontoison)
- Fix RNG determinism when using wrapped arrays. (#2307) (@maleadt)
- [CUBLAS] Interface gemm_grouped_batched (#2310) (@amontoison)
- [CUSPARSE] Add a method convert for the type cusparseSpSMUpdate_t (#2311) (@amontoison)
Closed issues:
- Element-wise conversion to Duals (#127)
- IDEA: CuHostArray (#28)
- Make Ref pass by-reference (#267)
- Failed to compile PTX code when using NSight on Win11 (#1601)
- view(data, idx) boundschecking is disproportionately expensive (#1678)
- [CUSOLVER] Add a with_workspaces function to allocate two buffers (Device / Host) (#1767)
- Trouble using nsight systems for profiling CUDA in Julia (#1779)
- dlopen("libcudart") results in duplicate libraries (#1814)
- Support for JLD2 (#1833)
- Windows Defender mis-labels artifacts as threat (#1836)
- Support Cholesky factorization of CuSparseMatrixCSR (#1855)
- Runtime not re-selected after driver upgrade (#1877)
- Failure to initialize with CUDA_VISIBLE_DEVICES='' (#1945)
- Cannot precompile GPU code with PrecompileTools (#2006)
- Evaluating sparse matrices in the REPL has a huge memory footprint (#2016)
- CUDA_SDK_jll: cuda.h in different locations depending on the platform (#2066)
- `StaticArrays.SHermitianCompact` not working in kernels in Julia 1.10.0-beta2 (#2069)
- Support for LinearAlgebra.pinv (#2070)
- PTX ISA 8.1 support (#2080)
- Segmentation fault when importing CUDA (#2083)
- "No system CUDA driver found" on NixOS (#2089)
- `CUDA.rand(Int64, m, n)` cannot be used when `m` or `n` is zero (#2093)
- Miss...
v5.2.0
CUDA v5.2.0
Merged pull requests:
- CuSparseArrayCSR (N dim array) with batched matmatmul (bmm) (#1944) (@nikopj)
- Update to CUTENSOR 2.0 (#2178) (@maleadt)
- Updates for new CUQUANTUM (#2210) (@kshyatt)
- Take care not to spawn tasks during precompilation. (#2226) (@maleadt)
- cuTensor fixes (#2228) (@maleadt)
- Bump versions. (#2229) (@maleadt)
- Add a note about threaded for-blocks. (#2232) (@kshyatt)
- cuTENSOR plan handling changes. (#2234) (@maleadt)
- Fix dynamic dispatch issues (#2235) (@MilesCranmer)
- CUPTI: Add high-level wrappers for the callback API. (#2239) (@maleadt)
- Fixes for nightly (#2240) (@maleadt)
- CUBLAS: Support more strided inputs (#2242) (@maleadt)
Closed issues:
- Trouble using nsight systems for profiling CUDA in Julia (#1779)
- Evaluating sparse matrices in the REPL has a huge memory footprint (#2016)
- Intermittent CI failure: Segfault during nonblocking synchronization (#2141)
- First test for Julia/CUDA with 15 failures (#2158)
- Update to CUTENSOR 2.0 (#2174)
- Tests fail for CUDA#master (#2223)
- Test failures on Nvidia GH200 (#2227)
- mul! should support strided outputs (#2230)
- Please add support for older CUDA versions (CUDA 8 and older) (#2231)
- NSight Compute: prevent API calls during precompilation (#2233)
- Integrated profiler: detect lack of permissions (#2237)
v5.1.2
CUDA v5.1.2
Merged pull requests:
- kernel docs: fix formatting, clean up awkward sentence (#2172) (@simonbyrne)
- [CUSOLVER] Don't reuse the sparse handles (#2173) (@amontoison)
- Added kronecker product support for dense matrices (#2177) (@albertomercurio)
- Fix typos and simplify wording in performance tips docs (#2179) (@Zentrik)
- provide more information on kernel compilation error (#2180) (@simonbyrne)
- [CUSPARSE] Test CUSPARSE_SPMV_COO_ALG2 (#2182) (@amontoison)
- [CUSPARSE] Use cusparseSpMM_preprocess (#2183) (@amontoison)
- [CUSPARSE] Use cusparseSDDMM_preprocess (#2184) (@amontoison)
- Add the structures ILU0Info() and IC0Info() for the preconditioners (#2187) (@amontoison)
- [CUSOLVER] Add a structure CuSolverParameters for the generic API (#2188) (@amontoison)
- Support more kwarg syntax with kernel launches (#2189) (@maleadt)
- Fix typo in docs/src/development/troubleshooting.md (#2193) (@jcsahnwaldt)
- NVML: Add support for clock queries. (#2194) (@maleadt)
- Fix Random.jl seeding for 1.11 (#2199) (@IanButterworth)
- Improvements to context handling (#2200) (@maleadt)
- Add a concurrent kwarg to profiling macros. (#2201) (@maleadt)
- Rework unique context management. (#2202) (@maleadt)
- Preserve the buffer type when broadcasting. (#2203) (@maleadt)
- Fixes for Windows (#2206) (@maleadt)
- Bump Aqua. (#2207) (@maleadt)
- CUSPARSE: Eagerly combine duplicate elements on construction. (#2213) (@maleadt)
- CompatHelper: bump compat for BFloat16s to 0.5, (keep existing compat) (#2214) (@github-actions[bot])
- Bump the CUDA Runtime for CUDA 12.3.2. (#2217) (@maleadt)
- Default to testing with only a single device. (#2221) (@maleadt)
- Backports for v5.1 (#2224) (@maleadt)
Closed issues:
- More informative errors when parameter size is too big (#2119)
- Modifying `struct` containing `CuArray` fails in threads in 5.0.0 and 5.1.0 (#2171)
- Matmul of CuArray{ComplexF32} and CuArray{Float32} is slow (#2175)
- Support for combining duplicate elements in sparse matrices (#2185)
- Interactive sessions: periodically trim the memory pool (#2190)
- Broadcast does not preserve buffer type (#2191)
- CUDA doesn't precompile on Julia nightly/1.11 (#2195)
- Latest Julia: UndefVarError: `make_seed` not defined in `Random` (#2198)
- CUDA installation fails on Apple Silicon/Julia 1.10 (#2211)
- Most recent package versions not supported on CUDA.jl (#2212)
- Testing of CUDA fails (#2222)
- `--debug-info=2` makes `NNlibCUDACUDNNExt` precompilation run forever (#2225)
v5.1.1
CUDA v5.1.1
Merged pull requests:
- Sanitizer improvements. (#2157) (@maleadt)
- [CUSPARSE] Update the wrapper of cusparseSpSV_updateMatrix (#2159) (@amontoison)
- Profiler improvements: (textual) time distribution, at-bprofile. (#2162) (@maleadt)
- [CUSPARSE] Update the interface for triangular solves (#2164) (@amontoison)
- [CUSPARSE] Remove code related to old CUDA toolkits (#2165) (@amontoison)
- Detect compute-exclusive mode and adjust testing. (#2166) (@maleadt)
- expand docs on launch parameters (#2167) (@simonbyrne)
- Make CUDA.set_runtime_version force the default behavior. (#2169) (@maleadt)
Closed issues:
- High CPU load during GPU synchronization (#2161)
v5.1.0
CUDA v5.1.0
CUDA.jl 5.1 greatly improves support for two important parts of the CUDA toolkit: unified memory, for accessing GPU memory on the CPU and vice versa, and cooperative groups, which offer a more modular approach to kernel programming. For more details, see the blog post.
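As a rough illustration of both features (a hedged sketch: the `unified` keyword and the `CG` entry points follow the v5.1 release notes and blog post, so treat the exact names as assumptions):

```julia
using CUDA
using CUDA: CG

# Unified memory: one allocation visible from both host and device.
A = cu(rand(Float32, 1024); unified=true)
A[1] = 42f0    # host-side access, without a scalar-indexing error
B = A .+ 1f0   # ordinary GPU broadcast on the same buffer

# Cooperative groups: operate on an explicit block-level group
# instead of calling the bare sync_threads() intrinsic.
function scale!(y, x, a)
    block = CG.this_thread_block()
    i = (blockIdx().x - 1) * blockDim().x + CG.thread_rank(block)
    if i <= length(x)
        @inbounds y[i] = a * x[i]
    end
    CG.sync(block)  # group-scoped synchronization
    return
end

x = CUDA.rand(Float32, 1024); y = similar(x)
@cuda threads=256 blocks=4 scale!(y, x, 2f0)
```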
Merged pull requests:
- [CUSOLVER] Add generic routines (#2074) (@amontoison)
- Rework and extend the cooperative groups API. (#2081) (@maleadt)
- [CUSOLVER] Add a method for geqrf! (#2085) (@amontoison)
- Fix some typos in performance tips (#2086) (@Zentrik)
- Improve PTX ISA selection (#2088) (@maleadt)
- Update manifest (#2090) (@github-actions[bot])
- support ChainRulesCore inplaceability (#2091) (@piever)
- Add a method inv(CuMatrix) (#2095) (@amontoison)
- Add mul!(A, B, C) where B or C is a diagonal matrix (#2096) (@amontoison)
- Add CUDA_Runtime_Discovery dependency to sublibraries. (#2097) (@maleadt)
- Handle and test zero-size inputs to RNGs. (#2098) (@maleadt)
- Add a with_workspaces function (#2099) (@amontoison)
- [CUSOLVER] Add a method for getrf! (#2100) (@amontoison)
- [CUSOLVER] Fix a typo with jobu / jobvt in gesvd (#2101) (@amontoison)
- Call exit when handling exceptions. (#2103) (@maleadt)
- Bump packages. (#2104) (@maleadt)
- Bump actions/checkout from 3 to 4 (#2106) (@dependabot[bot])
- Update manifest (#2107) (@github-actions[bot])
- Make Ref mutable on the GPU. (#2109) (@maleadt)
- CompatHelper: bump compat for CEnum to 0.5, (keep existing compat) (#2110) (@github-actions[bot])
- Small profiler improvements (#2113) (@maleadt)
- Update manifest (#2114) (@github-actions[bot])
- [CUSPARSE] Wrap new functions added with CUDA 12.2 (#2116) (@amontoison)
- [CUSOLVER] Add new methods for \ and inv (#2117) (@amontoison)
- Fix incorrect timing results for `CUDA.@elapsed` (#2118) (@thomasfaingnaert)
- [CUSOLVER] Interface sparse Cholesky and QR factorizations (#2121) (@amontoison)
- Update manifest (#2123) (@github-actions[bot])
- Profiler: Show used local memory. (#2124) (@maleadt)
- Support for CUDA 12.3 (#2125) (@maleadt)
- [CUSOLVER] Add Xsyevdx! and Xgesvdr! (#2127) (@amontoison)
- [CUSOLVER] Add Xgesvdp (#2128) (@amontoison)
- Profiler: don't crop when rendering to a file. (#2131) (@maleadt)
- Regenerate headers for CUDA 12.3. (#2132) (@maleadt)
- [CUSPARSE] Fix a bug with triangular solves (#2134) (@amontoison)
- CompatHelper: add new compat entry for Statistics at version 1, (keep existing compat) (#2135) (@github-actions[bot])
- CompatHelper: add new compat entry for LazyArtifacts at version 1, (keep existing compat) (#2136) (@github-actions[bot])
- Profiler: Parse and visualize NVTX marker data. (#2137) (@maleadt)
- Better support for unified and host memory (#2138) (@maleadt)
- Profiler: Improve compatibility with Pluto.jl and friends. (#2139) (@maleadt)
- Avoid allocations during derived array construction. (#2142) (@maleadt)
- More performance tweaks for memory copying (#2143) (@maleadt)
- Don't use libdevice's fmin/fmax. (#2144) (@maleadt)
- Update documentation (#2146) (@maleadt)
- Fixes for sm_61 (#2151) (@maleadt)
- Update sparse factorizations (#2152) (@amontoison)
- Don't call into LLVM's fmin/fmax on <sm_80. (#2154) (@maleadt)
- Only prefetch unified memory if concurrent access is possible. (#2155) (@maleadt)
- Support wrapping an Array with a CuArray without HMM. (#2156) (@maleadt)
Closed issues:
- Element-wise conversion to Duals (#127)
- IDEA: CuHostArray (#28)
- Make Ref pass by-reference (#267)
- view(data, idx) boundschecking is disproportionately expensive (#1678)
- [CUSOLVER] Add a with_workspaces function to allocate two buffers (Device / Host) (#1767)
- dlopen("libcudart") results in duplicate libraries (#1814)
- Support for JLD2 (#1833)
- Windows Defender mis-labels artifacts as threat (#1836)
- Support Cholesky factorization of CuSparseMatrixCSR (#1855)
- Runtime not re-selected after driver upgrade (#1877)
- Failure to initialize with CUDA_VISIBLE_DEVICES='' (#1945)
- Cannot precompile GPU code with PrecompileTools (#2006)
- CUDA_SDK_jll: cuda.h in different locations depending on the platform (#2066)
- PTX ISA 8.1 support (#2080)
- Segmentation fault when importing CUDA (#2083)
- "No system CUDA driver found" on NixOS (#2089)
- `CUDA.rand(Int64, m, n)` cannot be used when `m` or `n` is zero (#2093)
- Missing CUDA_Runtime_Discovery as a dependency in cuDNN (#2094)
- Binaries for Jetson (#2105)
- Minimum/maximum of array of NaNs is infinity (#2111)
- Performance regression for multiple `@sync` copyto! on CUDA v5 (#2112)
- [CUBLAS] Regenerate the wrappers with updated argument types (#2115)
- Unable to allocate unified memory buffers (#2120)
- CUDA 12.3 has been released (#2122)
- atomic min, max for Float32 and Float64 (#2129)
- Native profiler output is limited to around 100 columns when printing to a file (#2130)
- LLVM generates max.NaN which only works on sm_80 (#2148)
- Unified memory-related error on Tegra T194 (#2149)
- Errors on sm_61 (#2150)
v5.0.0
CUDA v5.0.0
Blog post: https://info.juliahub.com/cuda-jl-5-0-changes
This is a breaking release, but the breaking changes are minimal (see the blog post for details):
- Julia 1.8 is now required, and only CUDA 11.4+ is supported
- the selection of local toolkits has changed slightly (see the sketch below)
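For reference, opting into a local toolkit now goes through a preference (a hedged sketch: `set_runtime_version!` and its `local_toolkit` keyword follow the v5.0 release notes, so verify against the documentation):

```julia
using CUDA

# Prefer a locally-installed CUDA toolkit over artifact-provided binaries.
# The setting is stored as a preference and takes effect after restarting Julia.
CUDA.set_runtime_version!(v"12.2"; local_toolkit=true)

# Revert to the default behavior (artifact-provided runtime).
CUDA.reset_runtime_version!()
```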
Merged pull requests:
- Added support for more transform directions (#1903) (@RainerHeintzmann)
- Add some performance tips to the documentation (#1999) (@Zentrik)
- Re-introduce the 'blocking' kwargs to at-sync. (#2060) (@maleadt)
- Adapt to GPUCompiler#master. (#2062) (@maleadt)
- Batched SVD added (gesvdjBatched and gesvdaStridedBatched) (#2063) (@nikopj)
- Use released GPUCompiler. (#2064) (@maleadt)
- Fixes for Windows. (#2065) (@maleadt)
- Switch to GPUArrays buffer management. (#2068) (@maleadt)
- Update CUDA 12 to Update 2. (#2071) (@maleadt)
- Update manifest (#2076) (@github-actions[bot])
- Test improvements (#2079) (@maleadt)
- Update manifest (#2082) (@github-actions[bot])
v4.4.1
CUDA v4.4.1
Closed issues:
- CUDA driver device support does not match toolkit (#70)
- Launching kernels should not allocate (#66)
- sync_threads() appears to not be sync'ing threads (#61)
- Exception when using CuArrays with Flux (#129)
- Kernel using MVector fails to compile or crashes at runtime due to heap allocation (#45)
- Performance regression on matrix multiplication between CUDA.jl 1.3.3 and 2.1.0/master (#538)
- Improve 'VS C++ redistributable' error message (#764)
- CUSPARSE does not support reductions (#1406)
- CUDA test failed (#1690)
- Type constructor in broadcast doesn't compile (#1761)
- accumulate(+) gives different results for CuArray compared to Array. (#1810)
- Compat driver: preload all libraries (#1859)
- Stream synchronization is slow when waiting on the event from CUDA (#1910)
- cuDNN: Store convolution algorithm choice to disk. (#1947)
- Disable 'No CUDA-capable device found' error log (#1955)
- CUDNN_STATUS_NOT_SUPPORTED using 1D CNN model (#1977)
- Memory allocations during in-place sparse matrix-vector multiplication (#1982)
- `CUSPARSE.sum_dim1` sums the absolute values of elements (#1983)
- Update to CUDA 12.2 (#1984)
- `unsafe_wrap` fails on zero element CuArrays (#1985)
- `rand` in kernel works in a deterministic way (#2008)
- Scalar indexing with `CuArray * ReshapedArray{SubArray{CuArray}}` (#2009)
- volumerhs performance regression (#2010)
- CuSparseMatrix constructors allocate too much memory? (#2015)
- Native profiler using CUPTI (#2017)
- libLLVM-15jl.so (#2018)
- "symbol multiply defined" error (#2021)
- Confusion on row major vs column major (#2023)
- Printing of CuArrays gives zeros or random numbers (#2033)
- `sortperm!` fails when output is `UInt` vector (#2046)
- Re-introduce spinning loop before nonblocking synchronization (#2057)
Merged pull requests:
- Check mathType only if not Float32 (#1943) (@RomeoV)
- 1.10 enablement (#1946) (@dkarrasch)
- Implement reverse lookup (Ptr->Tuple) for CUDNN descriptors. (#1948) (@RomeoV)
- Wrapper with tests for `gemmBatchedEx!` (#1975) (@lpawela)
- Add wrappers for `gemv_batched!` (#1981) (@lpawela)
- Update `CUSPARSE.sum_dim<n>` to allow for arbitrary function on elements (#1987) (@lpawela)
- Update manifest (#1988) (@github-actions[bot])
- Add vectorized cached loads (#1993) (@Zentrik)
- Update manifest (#1995) (@github-actions[bot])
- Fix typo in captured macro example (#1996) (@Zentrik)
- Adapt Type call broadcasting to a function (#2000) (@simonbyrne)
- [CUSPARSE] Added support for generalized dot product dot(x, A, y) = dot(x, A * y) without allocating A * y (#2001) (@albertomercurio)
- Update manifest (#2002) (@github-actions[bot])
- Support for printing types. (#2003) (@maleadt)
- Fix accumulate bug (#2005) (@chrstphrbrns)
- Update manifest (#2013) (@github-actions[bot])
- Add a raw mode to code_sass. (#2019) (@maleadt)
- Update manifest (#2022) (@github-actions[bot])
- Add a native profiler. (#2024) (@maleadt)
- Perform synchronization on a worker thread (#2025) (@maleadt)
- Remove broken video link in docs (#2028) (@christiangnrd)
- When freeing memory, use the high-level device getter. (#2029) (@maleadt)
- Add support for @cuda fastmath (#2030) (@maleadt)
- Make "CUDA.jl" a link on the doc entry page (#2031) (@carstenbauer)
- Add support for CUDA 12.2. (#2034) (@maleadt)
- rand: seed kernels from the host. (#2035) (@maleadt)
- Update wrappers for CUDA 12.2. (#2039) (@maleadt)
- On CUDA 12.2, have the memory pool enforce hard memory limits. (#2040) (@maleadt)
- Delay all initialization errors until run time. (#2041) (@maleadt)
- JLL/CI/Julia changes. (#2042) (@maleadt)
- Add support for NVTX events to the integrated profiler. (#2043) (@maleadt)
- Update cuStateVec to cuQuantum 23.6. (#2044) (@maleadt)
- Add some more fastmath functions (#2047) (@Zentrik)
- Fixup wrong key lookup. (#2048) (@RomeoV)
- Update manifest (#2049) (@github-actions[bot])
- Make sortperm! resilient to type mismatches. (#2051) (@maleadt)
- Disable tests that cause GC corruption on 1.10. (#2053) (@maleadt)
- enable dependabot for GitHub actions (#2054) (@ranocha)
- Bump actions/checkout from 2 to 3 (#2055) (@dependabot[bot])
- Bump peter-evans/create-pull-request from 3 to 5 (#2056) (@dependabot[bot])
- Rework how local toolkits are selected. (#2058) (@maleadt)
- Busy-wait before doing nonblocking synchronization. (#2059) (@maleadt)