Releases: m4rs-mt/ILGPU

Release v0.8.1.1

03 Jan 03:00
v0.8.1.1
38bd74e

The new stable version offers significant performance and code quality improvements of the generated kernel programs.

  • Fixed issues related to Trace and Debug asserts (#176).
  • Improved compile-time performance by up to 4X (#110).
  • Reduced memory footprint by up to 3X (#109, #118).
  • Added new optimization level O2 to enable expensive and aggressive optimizations (#70, #110, #111, #121).
  • Now ships compiler release builds in the NuGet package to improve runtime performance (#130).
  • Added new IR verifier that can be enabled via ContextFlags.EnableVerifier (#121).
  • Added generation of vectorized instructions to PTX backend (#111).
  • Fixed critical code-generation issue on Unix platforms (#116).
  • Added dynamic shared memory support for all platforms (#97, #98).
  • Added new KernelInfo objects to kernel loaders in order to query detailed kernel statistics (e.g. amount of local memory in bytes) (#104).

Release v0.8.1-beta1

03 Jan 02:59
Pre-release

The new beta version offers significant performance improvements of the generated kernel programs.

  • Improved compile-time performance by up to 4X (#110).
  • Reduced memory footprint by up to 3X (#109, #118).
  • Added new optimization level O2 to enable expensive and aggressive optimizations (#70, #110, #111, #121).
  • Now ships compiler release builds in the NuGet package to improve runtime performance (#130).
  • Added new IR verifier that can be enabled via ContextFlags.EnableVerifier (#121).
  • Added generation of vectorized instructions to PTX backend (#111).
  • Fixed critical code-generation issue on Unix platforms (#116).
  • Added dynamic shared memory support for all platforms (#97, #98).
  • Added new KernelInfo objects to kernel loaders in order to query detailed kernel statistics (e.g. amount of local memory in bytes) (#104).

Release v0.8.0

03 Jan 02:58

The new stable version offers significant performance and code quality improvements of the generated kernel programs.

  • Added support for on-the-fly specialization of kernels using dynamic partial evaluation.
  • Added support for dynamic shared memory (CPU & Cuda backends).
  • Added new KernelConfig structure to specify launch dimensions for explicitly grouped kernels.
  • Added new Index1 structure to avoid name clashes with new System.Index structure.
  • Added additional tuple conversion methods to Index2 and Index3 types.
  • Added new EntryPointDescription structure to specify an entry point and its index type.
  • Added RuntimeKernelConfig structure to combine static and dynamic information about a particular kernel launch.
  • Added support for linear arrays in local memory.
  • Added support for enum-value interop (#66).
  • Reworked explicitly grouped kernel launchers to use the new KernelConfig structure instead of GroupedIndex types.
  • Simplified static Grid and Group properties.
  • Removed all GroupedIndex types.
  • Updated the whole compilation pipeline to enable more aggressive optimizations.
  • Significantly improved performance of emitted PTX and OpenCL code by enabling more aggressive optimizations and clever code generation (#70).
  • Added support for "unmanaged" C# structures in the scope of buffers and views.
  • Reworked PTX backend to support all API changes and to fix several critical code-generation issues. This also includes emission of PTX instructions that mimic the Cuda compiler (#68).
  • Reworked OpenCL backend to support all API changes and to fix several critical code-generation issues (#67, #72, #73, #74, #78, #85, #88, #91, #92).
  • New debug information input module to support the latest PDB format updates.
  • Considerably improved error messages using debug information (#86).
  • Reduced memory consumption during the compilation process.
  • Performance improvements of the internal compilation pipeline.
  • Improved performance of kernel launchers.
  • Extended CudaAPI to support page-locked host-memory allocation functions.
  • Extended ExchangeBuffer to use new page-locked memory allocation (if available).
  • Added new IR-rewriter API to perform more advanced IR transformations.
  • Adapted all existing transformations to use the new rewriter API.
  • Reduced memory consumption of all nodes by compressing information.
  • Redesigned several IR nodes to support global program transformations.
  • Reworked implementation of GetSubView in the context of generic and multidimensional array views (#19).
  • Fixed several issues in the scope of address-space inference.
  • Fixed critical code generation issues that could occur when replacing values.

Special thanks to @MoFtZ for contributing to this release.

Release v0.8.0-beta3

03 Jan 02:55
Pre-release
  • Considerably improved error messages using debug information (#86).
  • Reduced memory consumption during the compilation process.
  • Performance improvements of the internal compilation pipeline.
  • Added support for "unmanaged" C# structures in the scope of buffers and views.
  • New debug information input module to support the latest PDB format updates.
  • Fixed several OpenCL code generation issues (#85, #88, #91, #92).

Special thanks to @MoFtZ for contributing to this release.

Release v0.8.0-beta2

03 Jan 02:54
Pre-release
  • Significantly improved performance of emitted PTX and OpenCL code by enabling more aggressive optimizations and clever code generation (#70).
  • Improved performance of kernel launchers.
  • Added support for linear arrays in local memory.
  • Added support for enum-value interop (#66).
  • Reworked PTXBackend to support all API changes and to fix several critical code-generation issues. This also includes emission of PTX instructions that mimic the Cuda compiler.
  • Reworked OpenCL backend to support all API changes and to fix several critical code-generation issues (#72, #73, #74, #78).
  • Updated the whole compilation pipeline to enable more aggressive optimizations.
  • Added new IR-rewriter API to perform more advanced IR transformations.
  • Adapted all existing transformations to use the new rewriter API.
  • Reduced memory consumption of all nodes by compressing information.
  • Redesigned several IR nodes to support global program transformations.

Special thanks to @MoFtZ for contributing to this release.

Release v0.8.0-beta1

03 Jan 02:52
Pre-release
  • Added support for on-the-fly specialization of kernels using dynamic partial evaluation.
  • Added support for dynamic shared memory (CPU & Cuda backends).
  • Added new KernelConfig structure to specify launch dimensions for explicitly grouped kernels.
  • Reworked explicitly grouped kernel launchers to use the new KernelConfig structure instead of GroupedIndex types.
  • Simplified static Grid and Group properties.
  • Added new Index1 structure to avoid name clashes with new System.Index structure.
  • Added additional tuple conversion methods to Index2 and Index3 types.
  • Added new EntryPointDescription structure to specify an entry point and its index type.
  • Added RuntimeKernelConfig structure to combine static and dynamic information about a particular kernel launch.
  • Removed all GroupedIndex types.
  • Extended PTXInstructions to support bool-based IOs in PTXBackend (#68).
  • Extended ExchangeBuffer to use new page-locked memory allocation (if available).
  • Extended CudaAPI to support page-locked host-memory allocation functions.
  • Reworked implementation of GetSubView in the context of generic and multidimensional array views (#19).
  • Fixed several issues in the scope of address-space inference.
  • Fixed critical code generation issues that could occur when replacing values.
  • Fixed invalid pointer types in the scope of AtomicCAS operations on AMD hardware (#67).

Release v0.7.1

03 Jan 02:50
  • Added extension method to load the effective address for Cuda and CPU-based array views.
  • Added support for data blocks (value containers) to ease interop with value tuples.
  • Added additional primitive data blocks to simplify operations on tuples consisting of primitive values.
  • Added new ExchangeBuffer class to simplify memory transfers between CPU and GPU memory.
  • Fixed invalid sub-group extension name in CLAccelerator.
  • Fixed invalid association of supported and unsupported CL accelerators.
  • Removed obsolete dispose functionality from AcceleratorId classes.
  • Fixed OpenCL code generator for float values that are assigned integer values.
  • Fixed invalid creation of kernel interop types in OpenCL backend.
  • Made ABI thread safe to support concurrent queries of size/alignment information.

Release v0.7.0

03 Jan 02:47
  • Added support for .NET Standard 2.1.
  • Added support for OpenCL-compatible GPUs (beta).
  • Added parallel code generation in backends to improve code-generation speed.
  • Added minimum CUDA driver version detection.
  • Enabled adaptive shared-memory allocation in CPUAccelerator.
  • Added new Utility.Select method that can be used to create highly efficient select instructions instead of if branches.
  • Added support to access Grid and Group indices via properties.
  • Added support for generic Warp intrinsics that will be automatically generated by the compiler.
  • Redesigned intrinsic math functions and moved XMath functions to the ILGPU.Algorithms library. Use the new IntrinsicMath class for math functions that are supported on all platforms.
  • Reworked intrinsic functions to allow custom implementations of intrinsics for different backends.
  • Ported project to VS2019 including all static-program analysis checks.
  • Applied general code cleanup to be compliant with the new analysis checks.
  • Redesigned AcceleratorId functionality.
  • Updated CudaMemoryBuffer to support MemSetToZero using alternate streams.
  • Fixed retrieving version number of ILGPU assembly.
  • Fixed non-deterministic generation of Phi mappings.
  • Fixed invalid loading of small basic types onto the evaluation stack.
  • Added utility property to Accelerator to resolve a launch extent with the maximum number of groups.
  • Fixed invalid shared-memory allocation within non-kernel functions in PTXBackend.

Special thanks to @MoFtZ for contributing to this release.

Release v0.6.0

03 Jan 02:45

Greatly improved ILGPU version that includes significant performance and code quality improvements.

  • Added support for new GeForce RTX cards.
  • Added initial support for arrays in kernels.
  • Added additional 3D indexing functionality to ArrayView types.
  • Added automatic binding of accelerators in advanced multi-GPU scenarios.
  • Tested debugging and profiling capabilities on NVIDIA GPUs.
  • Released test framework to verify generated kernel code.
  • Improved performance of predicates in PTXBackend.
  • Removed strict array-length restriction from allocation nodes.
  • Enhanced generation of get/set field operations.
  • Optimized generation of conditional branches.
  • Fixed invalid generation of predicate barriers in PTXBackend.
  • Fixed invalid register allocation of string types in PTXBackend.
  • Removed explicit tracking of predecessors in phi nodes.
  • Fixed invalid debug assertion in SequencePoint.
  • Fixed invalid alignment of shared-memory allocations in PTXBackend.
  • Fixed invalid shared memory configuration of Cuda kernels.

Special thanks to @MoFtZ and @mikhail-khalizev for contributing to this release.

Release v0.5.1

03 Jan 02:42

Improved version of v0.5 that contains bug fixes, performance improvements, and features based on community feedback.

  • Polished error messages and util methods.
  • Fixed invalid DebuggerDisplay attributes on array views.
  • Added support for loading addresses of static fields.
  • Added support to disable kernel caches and automatic disposal of kernels and memory buffers (Community request).
  • Extended kernel loaders with additional overloads.
  • Added support to clear internal caches (Community request).
  • Fixed invalid extent and bounds checks in MemoryBuffer.CopyTo.
  • Fixed invalid initialization of PTX-specific intrinsic functions.
  • Fixed invalid load/store instructions of bytes in PTXBackend.
  • Fixed invalid generation of null values in PTXBackend.