Releases: m4rs-mt/ILGPU
Release v0.8.1.1
The new stable version offers significant performance and code quality improvements of the generated kernel programs.
- Fixed an issue related to Trace and Debug asserts (#176).
- Improved compile-time performance by up to 4X (#110).
- Reduced memory footprint by up to 3X (#109, #118).
- Added new optimization level O2 to enable expensive and aggressive optimizations (#70, #110, #111, #121).
- No compiler release builds in NuGet package to improve runtime performance (#130).
- Added new IR verifier that can be enabled via ContextFlags.EnableVerifier (#121).
- Added generation of vectorized instructions to PTX backend (#111).
- Fixed critical code-generation issue on Unix platforms (#116).
- Added dynamic shared memory support for all platforms (#97, #98).
- Added new KernelInfo objects to kernel loaders in order to query detailed kernel statistics (e.g. amount of local memory in bytes) (#104).
Release v0.8.1-beta1
The new beta version offers significant performance improvements of the generated kernel programs.
- Improved compile-time performance by up to 4X (#110).
- Reduced memory footprint by up to 3X (#109, #118).
- Added new optimization level O2 to enable expensive and aggressive optimizations (#70, #110, #111, #121).
- No compiler release builds in NuGet package to improve runtime performance (#130).
- Added new IR verifier that can be enabled via ContextFlags.EnableVerifier (#121).
- Added generation of vectorized instructions to PTX backend (#111).
- Fixed critical code-generation issue on Unix platforms (#116).
- Added dynamic shared memory support for all platforms (#97, #98).
- Added new KernelInfo objects to kernel loaders in order to query detailed kernel statistics (e.g. amount of local memory in bytes) (#104).
Release v0.8.0
The new stable version offers significant performance and code quality improvements of the generated kernel programs.
- Added support for on-the-fly specialization of kernels using dynamic partial evaluation.
- Added support for dynamic shared memory (CPU & Cuda backends).
- Added new KernelConfig structure to specify launch dimensions for explicitly grouped kernels.
- Added new Index1 structure to avoid name clashes with new System.Index structure.
- Added additional tuple conversion methods to Index2 and Index3 types.
- Added new EntryPointDescription structure to specify an entry point and its index type.
- Added RuntimeKernelConfig structure to combine static and dynamic information about a particular kernel launch.
- Added support for linear arrays in local memory.
- Added support for enum-value interop (#66).
- Reworked explicitly grouped kernel launchers to use the new KernelConfig structure instead of GroupedIndex types.
- Simplified static Grid and Group properties.
- Removed all GroupedIndex types.
- Updated the whole compilation pipeline to enable more aggressive optimizations.
- Significantly improved performance of emitted PTX and OpenCL code by enabling more aggressive optimizations and clever code generation (#70).
- Added support for "unmanaged" C# structures in the scope of buffers and views.
- Reworked PTX backend to support all API changes and to fix several critical code-generation issues. This also includes emission of PTX instructions that mimic the Cuda compiler (#68).
- Reworked OpenCL backend to support all API changes and to fix several critical code-generation issues (#67, #72, #73, #74, #78, #85, #88, #91, #92).
- New debug information input module to support the latest PDB format updates.
- Considerably improved error messages using debug information (#86).
- Reduced memory consumption during the compilation process.
- Performance improvements of the internal compilation pipeline.
- Improved performance of kernel launchers.
- Extended CudaAPI to support page-locked host-memory allocation functions.
- Extended ExchangeBuffer to use new page-locked memory allocation (if available).
- Added new IR-rewriter API to perform more advanced IR transformations.
- Adapted all existing transformations to use the new rewriter API.
- Reduced memory consumption of all nodes by compressing information.
- Redesigned several IR nodes to support global program transformations.
- Reworked implementation of GetSubView in the context of generic and multidimensional array views (#19).
- Fixed several issues in the scope of address-space inference.
- Fixed critical code generation issues that could occur when replacing values.
Special thanks to @MoFtZ for contributing to this release.
Release v0.8.0-beta3
- Considerably improved error messages using debug information (#86).
- Reduced memory consumption during the compilation process.
- Performance improvements of the internal compilation pipeline.
- Added support for "unmanaged" C# structures in the scope of buffers and views.
- New debug information input module to support the latest PDB format updates.
- Fixed several OpenCL code generation issues (#85, #88, #91, #92).
Special thanks to @MoFtZ for contributing to this release.
Release v0.8.0-beta2
- Significantly improved performance of emitted PTX and OpenCL code by enabling more aggressive optimizations and clever code generation (#70).
- Improved performance of kernel launchers.
- Added support for linear arrays in local memory.
- Added support for enum-value interop (#66).
- Reworked PTXBackend to support all API changes and to fix several critical code-generation issues. This also includes emission of PTX instructions that mimic the Cuda compiler.
- Reworked OpenCL backend to support all API changes and to fix several critical code-generation issues (#72, #73, #74, #78).
- Updated the whole compilation pipeline to enable more aggressive optimizations.
- Added new IR-rewriter API to perform more advanced IR transformations.
- Adapted all existing transformations to use the new rewriter API.
- Reduced memory consumption of all nodes by compressing information.
- Redesigned several IR nodes to support global program transformations.
Special thanks to @MoFtZ for contributing to this release.
Release v0.8.0-beta1
- Added support for on-the-fly specialization of kernels using dynamic partial evaluation.
- Added support for dynamic shared memory (CPU & Cuda backends).
- Added new KernelConfig structure to specify launch dimensions for explicitly grouped kernels.
- Reworked explicitly grouped kernel launchers to use the new KernelConfig structure instead of GroupedIndex types.
- Simplified static Grid and Group properties.
- Added new Index1 structure to avoid name clashes with new System.Index structure.
- Added additional tuple conversion methods to Index2 and Index3 types.
- Added new EntryPointDescription structure to specify an entry point and its index type.
- Added RuntimeKernelConfig structure to combine static and dynamic information about a particular kernel launch.
- Removed all GroupedIndex types.
- Extended PTXInstructions to support bool-based IOs in PTXBackend (#68).
- Extended ExchangeBuffer to use new page-locked memory allocation (if available).
- Extended CudaAPI to support page-locked host-memory allocation functions.
- Reworked implementation of GetSubView in the context of generic and multidimensional array views (#19).
- Fixed several issues in the scope of address-space inference.
- Fixed critical code generation issues that could occur when replacing values.
- Fixed invalid pointer types in the scope of AtomicCAS operations on AMD hardware (#67).
Release v0.7.1
- Added extension method to load the effective address for Cuda and CPU-based array views.
- Added support for data blocks (value containers) for easy interop with value tuples.
- Added additional primitive data blocks to simplify operations on tuples consisting of primitive values.
- Added new ExchangeBuffer class to simplify memory transfers between CPU and GPU memory.
- Fixed invalid sub-group extension name in CLAccelerator.
- Fixed invalid association of supported and unsupported CL accelerators.
- Removed obsolete dispose functionality from AcceleratorId classes.
- Fixed OpenCL code generator for float values that are assigned integer values.
- Fixed invalid creation of kernel interop types in OpenCL backend.
- Made ABI thread safe to support concurrent queries of size/alignment information.
Release v0.7.0
- Added support for .NET Standard 2.1.
- Added support for OpenCL-compatible GPUs (beta).
- Added parallel code generation in backends to improve code-generation speed.
- Added minimum CUDA driver version detection.
- Enabled adaptive shared-memory allocation in CPUAccelerator.
- Added new Utility.Select method that can be used to create highly-efficient select instructions in favor of if branches.
- Added support to access Grid and Group indices via properties.
- Added support for generic Warp intrinsics that will be automatically generated by the compiler.
- Redesigned intrinsic math functions and moved XMath functions to the ILGPU.Algorithms library. Use the new IntrinsicMath class for math functions that are supported on all platforms.
- Reworked intrinsic functions to allow custom implementations of intrinsics for different backends.
- Ported project to VS2019 including all static-program analysis checks.
- Applied general code cleanup to be compliant with the new analysis checks.
- Redesigned AcceleratorId functionality.
- Updated CudaMemoryBuffer to support MemSetToZero using alternate streams.
- Fixed retrieving version number of ILGPU assembly.
- Fixed non-deterministic generation of Phi mappings.
- Fixed invalid loading of small basic types onto the evaluation stack.
- Added utility property to Accelerator to resolve a launch extent with the maximum number of groups.
- Fixed invalid shared-memory allocation within non-kernel functions in PTXBackend.
Special thanks to @MoFtZ for contributing to this release.
Release v0.6.0
Greatly improved ILGPU version that includes significant performance and code quality improvements.
- Added support for new GeForce RTX cards.
- Added initial support for arrays in kernels.
- Added additional 3D indexing functionality to ArrayView types.
- Added automatic binding of accelerators in advanced multi-GPU scenarios.
- Tested debugging and profiling capabilities on NVIDIA GPUs.
- Released test framework to verify generated kernel code.
- Improved performance of predicates in PTXBackend.
- Removed strict array-length restriction from allocation nodes.
- Enhanced generation of get/set field operations.
- Optimized generation of conditional branches.
- Fixed invalid generation of predicate barriers in PTXBackend.
- Fixed invalid register allocation of string types in PTXBackend.
- Removed explicit tracking of predecessors in phi nodes.
- Fixed invalid debug assertion in SequencePoint.
- Fixed invalid alignment of shared-memory allocations in PTXBackend.
- Fixed invalid shared memory configuration of Cuda kernels.
Special thanks to @MoFtZ and @mikhail-khalizev for contributing to this release.
Release v0.5.1
Improved version of v0.5 that contains bug fixes, performance improvements, and features based on community feedback.
- Polished error messages and util methods.
- Fixed invalid DebuggerDisplay attributes on array views.
- Added support for loading addresses of static fields.
- Added support to disable kernel caches and automatic disposal of kernels and memory buffers (Community request).
- Extended kernel loaders with additional overloads.
- Added support to clear internal caches (Community request).
- Fixed invalid extent and bounds checks in MemoryBuffer.CopyTo.
- Fixed invalid initialization of PTX-specific intrinsic functions.
- Fixed invalid load/store instructions of bytes in PTXBackend.
- Fixed invalid generation of null values in PTXBackend.