Change log

Generated on 2024-02-17

Release 24.02

Features


#9926	[FEA] Add config option for the parquet reader input read limit.
#10270	[FEA] Add support for single quotes when reading JSON
#10253	[FEA] Enable mixed types as string in GpuJsonToStruct
#9692	[FEA] Remove Pascal support
#8806	[FEA] Support lazy quantifier and specified group index in regexp_extract function
#10079	[FEA] Add string parameter support for `unix_timestamp` for non-UTC time zones
#9667	[FEA][JSON] Add support for non default `dateFormat` in `from_json`
#9173	[FEA] Support format_number
#10145	[FEA] Support to_utc_timestamp
#9927	[FEA] Support to_date with non-UTC timezones without DST
#10006	[FEA] Support `ParseToTimestamp` for non-UTC time zones
#9096	[FEA] Add Spark 3.3.4 support
#9585	[FEA] support ascii function
#9260	[FEA] Create Spark 3.4.2 shim and build env
#10076	[FEA] Add performance test framework for non-UTC time zone features.
#9881	[TASK] Remove `spark.rapids.sql.nonUTC.enabled` configuration option
#9801	[FEA] Support DateFormat on GPU with a non-UTC timezone
#6834	[FEA] Support GpuHour expression for timezones other than UTC
#6842	[FEA] Support TimeZone aware operations for value extraction
#1860	[FEA] Optimize row based window operations for BOUNDED ranges
#9606	[FEA] Support unix_timestamp with CST(China Time Zone) support
#9815	[FEA] Support `unix_timestamp` for non-DST timezones
#8807	[FEA] support `yyyyMMdd` format in from_unixtime function
#9605	[FEA] Support from_unixtime with CST(China Time Zone) support
#6836	[FEA] Support FromUnixTime for non UTC timezones
#9175	[FEA] Support Databricks 13.3
#6881	[FEA] Support RAPIDS Spark plugin on ARM
#9274	[FEA] Regular deploy process to include arm artifacts
#9844	[FEA] Let GPU Arrow Python runners support writing one batch at a time for the single-threaded model.
#7309	[FEA] Detect multiple versions of the RAPIDS jar on the classpath at the same time

Performance


#9442	[FEA] For hash joins where the build side can change use the smaller table for the build side
#10142	[TASK] Benchmark existing timestamp functions that work in non-UTC time zone (non-DST)

Bugs Fixed


#9974	[BUG] host memory Leak in MultiFileCoalescingPartitionReaderBase in UTC time zone
#10359	[BUG] Build failure on Databricks nightly run with `GpuMapInPandasExecMeta`
#10327	[BUG] Unit test FAILED against : SPARK-24957: average with decimal followed by aggregation returning wrong result
#10324	[BUG] hash_aggregate_test.py test FAILED: Type conversion is not allowed from Table {...}
#10291	[BUG] SIGSEGV in libucp.so
#9212	[BUG] `from_json` fails with cuDF error `Invalid list size computation error`
#10264	[BUG] hash aggregate test failures due to type conversion errors
#10262	[BUG] Test "SPARK-24957: average with decimal followed by aggregation returning wrong result" failed.
#9353	[BUG] [JSON] A mix of lists and structs within the same column is not supported
#10099	[BUG] orc_test.py::test_orc_scan_with_aggregate_pushdown fails with a standalone cluster on spark 3.3.0
#10047	[BUG] CudfException during conditional hash join while running nds query64
#9779	[BUG] 330cdh failed test_hash_reduction_sum_full_decimal on CI
#10197	[BUG] Disable GetJsonObject by default and update docs
#10165	[BUG] Databricks 13.3 executor side broadcast failure
#10224	[BUG] DBR build fails when installing Maven
#10222	[BUG] to_utc_timestamp and from_utc_timestamp fallback when TZ is supported time zone
#10195	[BUG] test_window_aggs_for_negative_rows_partitioned failure in CI
#10182	[BUG] test_dpp_bypass / test_dpp_via_aggregate_subquery failures in CI (databricks)
#10169	[BUG] Host column vector leaks when running `test_cast_timestamp_to_date`
#10050	[BUG] test_cast_decimal_to_decimal[to:DecimalType(1,-1)-from:Decimal(5,-3)] fails with DATAGEN_SEED=1702439569
#10088	[BUG] GpuExplode single row split to fit cuDF limits
#10174	[BUG] json_test.py::test_from_json_struct_timestamp failed on: Part of the plan is not columnar
#10186	[BUG] test_to_date_with_window_functions failed in non-UTC nightly CI
#10154	[BUG] 'spark-test.sh' integration tests FAILED on 'ps: command not found' in Rocky Docker environment
#10175	[BUG] string_test.py::test_format_number_float_special FAILED : AssertionError 'NaN' ==
#10166	Detect Undeclared Shim in POM.xml
#10170	[BUG] `test_cast_timestamp_to_date` fails with `TZ=Asia/Hebron`
#10149	[BUG] GPU illegal access detected during delta_byte_array.parquet read
#9905	[BUG] GpuJsonScan incorrect behavior when parsing dates
#10163	Spark 3.3.4 Shim Build Failure
#10105	[BUG] scala:compile is not thread safe unless compiler bridge already exists
#10026	[BUG] test_hash_agg_with_nan_keys failed with a DATAGEN_SEED=1702335559
#10075	[BUG] `non-pinned blocking alloc with spill` unit test failed in HostAllocSuite
#10134	[BUG] test_window_aggs_for_batched_finite_row_windows_partitioned failed on Scala 2.13 with DATAGEN_SEED=1704033145
#10118	[BUG] non-UTC Nightly CI failed
#10136	[BUG] The canonicalized versions of `GpuFileSourceScanExec`s that are supposed to be semantically equal can be different
#10110	[BUG] disable collect_list and collect_set for window operations by default.
#10129	[BUG] Unit test suite fails with `Null data pointer` in GpuTimeZoneDB
#10089	[BUG] DATAGEN_SEED= environment does not override the marker datagen_overrides
#10108	[BUG] @datagen_overrides seed is sticky when it shouldn't be
#10064	[BUG] test_unsupported_fallback_regexp_replace failed with DATAGEN_SEED=1702662063
#10117	[BUG] test_from_utc_timestamp failed on Cloudera Env when TZ is Iran
#9914	[BUG] Report GPU OOM on recent passed CI premerges.
#10094	[BUG] spark351 PR check failure MockTaskContext method isFailed in class TaskContext of type ()Boolean is not defined
#10017	[BUG] test_casting_from_double_to_timestamp failed for DATAGEN_SEED=1702329497
#9992	[BUG] conditionals_test.py::test_conditional_with_side_effects_cast[String] failed with DATAGEN_SEED=1701976979
#9743	[BUG][AUDIT] SPARK-45652 - SPJ: Handle empty input partitions after dynamic filtering
#9859	[AUDIT] [SPARK-45786] Inaccurate Decimal multiplication and division results
#9555	[BUG] Scala 2.13 build with JDK 11 or 17 fails OpcodeSuite tests
#10073	[BUG] test_csv_prefer_date_with_infer_schema failed with DATAGEN_SEED=1702847907
#10004	[BUG] If a host memory buffer is spilled, it cannot be unspilled
#10063	[BUG] CI build failure with 341db: method getKillReason has weaker access privileges; it should be public
#10055	[BUG] array_test.py::test_array_transform_non_deterministic failed with non-UTC time zone
#10056	[BUG] Unit tests ToPrettyStringSuite FAILED on spark-3.5.0
#10048	[BUG] Fix `out of range` error from `pySpark` in `test_timestamp_millis` and other two integration test cases
#4204	casting double to string does not match Spark
#9938	Better to do some refactor for the Python UDF code
#10018	[BUG] `GpuToUnixTimestampImproved` off by 1 on GPU when handling timestamp before epoch
#10012	[BUG] test_str_to_map_expr_random_delimiters with DATAGEN_SEED=1702166057 hangs
#10029	[BUG] doc links fail with 404 for shims.md
#9472	[BUG] Non-Deterministic expressions in an array_transform can cause errors
#9884	[BUG] delta_lake_delete_test.py failed assertion [DATAGEN_SEED=1701225104, IGNORE_ORDER...
#9977	[BUG] test_cast_date_integral fails on databricks 3.4.1
#9936	[BUG] Nightly CI of non-UTC time zone reports 'year 0 is out of range' error
#9941	[BUG] A potential data corruption in Pandas UDFs
#9897	[BUG] Error message for multiple jars on classpath is wrong
#9916	[BUG] `test_cast_string_ts_valid_format` failed at `seed = 1701362564`
#9559	[BUG] precommit regularly fails with error trying to download a dependency
#9708	[BUG] test_cast_string_ts_valid_format fails with DATAGEN_SEED=1699978422

PRs


#10439	Reverts NVIDIA#10232 and fixes the plugin build on Databricks 11.3
#10380	Init changelog 24.02 [skip ci]
#10367	Update rapids JNI and private version to release 24.02.0
#10414	[DOC] Fix 24.02.0 documentation errors [skip ci]
#10403	Cherry-pick: Fix a memory leak in json tuple (#10360)
#10387	[DOC] Update docs for 24.02.0 release [skip ci]
#10399	Update NOTICE-binary
#10389	Change version and branch to 24.02 in docs [skip ci]
#10309	[DOC] add custom 404 page and fix some documentation issues [skip ci]
#10352	xfail mixed type test
#10355	Revert "Support barrier mode for mapInPandas/mapInArrow (#10343)"
#10353	Use fixed seed for test_from_json_struct_decimal
#10343	Support barrier mode for mapInPandas/mapInArrow
#10345	Fix auto merge conflict 10339 [skip ci]
#9991	Start to use explicit memory limits in the parquet chunked reader
#10328	Fix typo in spark-tests.sh [skip ci]
#10279	Run '--packages' only with default cuda11 jar
#10273	Support reading JSON data with single quotes around attribute names and values
#10306	Fix performance regression in from_json
#10272	Add FullOuter support to GpuShuffledSymmetricHashJoinExec
#10260	Add perf test for time zone operators
#10275	Add tests for window Python udf with array input
#10278	Clean up $M2_CACHE to avoid side-effect of previous dependency:get [skip ci]
#10268	Add config to enable mixed types as string in GpuJsonToStruct & GpuJsonScan
#10297	Revert "UCX 1.16.0 upgrade (#10190)"
#10289	Add gerashegalov to CODEOWNERS [skip ci]
#10290	Fix merge conflict with 23.12 [skip ci]
#10190	UCX 1.16.0 upgrade
#10211	Use parse_url kernel for QUERY literal and column key
#10267	Update to libcudf unsigned sum aggregation types change
#10208	Added Support for Lazy Quantifier
#9993	Enable mixed types as string in GpuJsonScan
#10246	Refactor full join iterator to allow access to build tracker
#10257	Enable auto-merge from branch-24.02 to branch-24.04 [skip CI]
#10178	Mark hash reduction decimal overflow test as a permanent seed override
#10244	Use POSIX mode in assembly plugin to avoid issues with large UID/GID
#10238	Smoke test with '--package' to fetch the plugin jar
#10201	Deploy release candidates to local maven repo for dependency check [skip ci]
#10240	Improved inner joins with large build side
#10220	Disable GetJsonObject by default and add tests for as many issues with it as possible
#10230	Fix Databricks 13.3 BroadcastHashJoin using executor side broadcast fed by ColumnarToRow [Databricks]
#10232	Fixed 330db Shims to Adopt the PythonRunner Changes
#10225	Download Maven from apache.org archives [skip ci]
#10210	Add string parameter support for unix_timestamp for non-UTC time zones
#10223	Fix to_utc_timestamp and from_utc_timestamp fallback when TZ is supported time zone
#10205	Deterministic ordering in window tests
#10204	Further prevent degenerative joins in dpp_test
#10156	Update string to float compatibility doc [skip ci]
#10193	Fix explode with carry-along columns on GpuExplode single row retry handling
#10191	Updating the config documentation for filecache configs [skip ci]
#10131	With a single row GpuExplode tries to split the generator array
#10179	Fix build regression against Spark 3.2.x
#10189	test needs marks for non-UTC and for non_supported timezones
#10176	Fix format_number NaN symbol in high jdk version
#10074	Update the legacy mode check: only take effect when reading date/timestamp column
#10167	Defined Shims Should Be Declared In POM
#10168	Prevent a degenerative join in test_dpp_reuse_broadcast_exchange
#10171	Fix `test_cast_timestamp_to_date` when running in a DST time zone
#9975	Improve dateFormat support in GpuJsonScan and make tests consistent with GpuStructsToJson
#9790	Support float case of format_number with format_float kernel
#10144	Support to_utc_timestamp
#10162	Fix Spark 334 Build
#10146	Refactor the window code so it is not mostly kept in a few very large files
#10155	Install procps tools for rocky docker images [skip ci]
#10153	Disable multi-threaded Maven
#10100	Enable to_date (via gettimestamp and casting timestamp to date) for non-UTC time zones
#10140	Removed Unnecessary Whitespaces From Spark 3.3.4 Shim [skip ci]
#10148	fix test_hash_agg_with_nan_keys floating point sum failure
#10150	Increase timeouts in HostAllocSuite to avoid timeout failures on slow machines
#10143	Fix `test_window_aggs_for_batched_finite_row_windows_partitioned` fail
#9887	Reduce pre-merge time consumption
#10130	Change unit tests that force ooms to specify the oom type (gpu
#10138	Update copyright dates in NOTICE files [skip ci]
#10139	Add Delta Lake 2.3.0 to list of versions to test for Spark 3.3.x
#10135	Fix CI: can't find script when there is pushd in script [skip ci]
#10137	Fix the canonicalizing for GPU file scan
#10132	Disable collect_list and collect_set for window by default
#10084	Refactor GpuJsonToStruct to reduce code duplication and manage resources more efficiently
#10087	Additional unit tests for GeneratedInternalRowToCudfRowIterator
#10082	Add Spark 3.3.4 Shim
#10054	Support Ascii function for ascii and latin-1
#10127	Fix merge conflict with branch-23.12
#10097	[DOC] Update docs for 23.12.1 release [skip ci]
#10109	Fixes a bug where datagen seed overrides were sticky and adds datagen_seed_override_disabled
#10093	Fix test_unsupported_fallback_regexp_replace
#10119	Fix from_utc_timestamp case failure on Cloudera when TZ is Iran
#10106	Add `isFailed()` to MockTaskContext and Remove MockTaskContextBase.scala
#10112	Remove datagen seed override for test_conditional_with_side_effects_cast
#10104	[DOC] Add in docs about memory debugging [skip ci]
#9925	Use threads, cache Scala compiler in GH mvn workflow
#9967	Added Spark-3.4.2 Shims
#10061	Use parse_url kernel for QUERY parsing
#10101	[DOC] Add column order error docs [skip ci]
#10078	Add perf test for non-UTC operators
#10096	Shim MockTaskContext to fix Spark 3.5.1 build
#10092	Implement Math.round using floor on GPU
#10085	Update tests that originally restricted the Spark timestamp range
#10090	Replace GPU-unsupported `\z` with an alternative RLIKE expression
#10095	Temporarily fix date format failed cases for non-UTC time zone.
#9999	Add some odd time zones for timezone transition tests
#9962	Add 3.5.1-SNAPSHOT Shim
#10071	Cleanup usage of non-utc configuration here
#10057	Add support for StringConcatFactory.makeConcatWithConstants (#9555)
#9996	Test full timestamp output range in PySpark
#10081	Add a fallback Cloudera Maven repo URL [skip ci]
#10065	Improve host memory spill interfaces
#10070	Fix 332db build failure
#10060	Fix failed cases for non-utc time zone
#10038	Remove spark.rapids.sql.nonUTC.enabled configuration option
#10059	Fixed Failing ToPrettyStringSuite Test for 3.5.0
#10013	Extended configuration of OOM injection mode
#10052	Set seed=0 for some integration test cases
#10053	Remove invalid user from CODEOWNER file [skip ci]
#10049	Fix out of range error from pySpark in test_timestamp_millis and other two integration test cases
#9721	Support date_format via Gpu for non-UTC time zone
#9845	Use parse_url kernel for HOST parsing
#10024	Support hour minute second for non-UTC time zone
#9973	Batching support for row-based bounded window functions
#10042	Update tests to not have hard coded fallback when not needed
#9816	Support unix_timestamp and to_unix_timestamp with non-UTC timezones (non-DST)
#9902	Some refactor for the Python UDF code
#10023	GPU supports `yyyyMMdd` format by post process for the `from_unixtime` function
#10033	Remove GpuToTimestampImproved and spark.rapids.sql.improvedTimeOps.enabled
#10016	Fix infinite loop in test_str_to_map_expr_random_delimiters
#10030	Update links in shims.md
#10015	Fix array_transform to not recompute the argument
#10011	Add cpu oom retry split handling to InternalRowToColumnarBatchIterator
#10019	Fix auto merge conflict 10010 [skip ci]
#9760	Support split broadcast join condition into ast and non-ast
#9827	Enable ORC timestamp and decimal predicate push down tests
#10002	Use Spark 3.3.3 instead of 3.3.2 for Scala 2.13 premerge builds
#10000	Optimize from_unixtime
#10003	Fix merge conflict with branch-23.12
#9984	Fix 340+ (including DB341+) not supporting casting date to integral/float
#9972	Fix year 0 is out of range in test_from_json_struct_timestamp
#9814	Support from_unixtime via Gpu for non-UTC time zone
#9929	Add host memory retries for GeneratedInternalRowToCudfRowIterator
#9957	Update cases for cast between integral and (date/time)
#9959	Append new authorized user to blossom-ci whitelist [skip ci]
#9942	Fix a potential data corruption for Pandas UDF
#9922	Fix `allowMultipleJars` recommend setting message
#9947	Fix merge conflict with branch-23.12
#9908	Register default allocator for host memory
#9944	Fix Java OOM caused by incorrect state of shouldCapture when exception occurred
#9937	Refactor to use CLASSIFIER instead of CUDA_CLASSIFIER [skip ci]
#9904	Params for build and test CI scripts on Databricks
#9719	Support fine grained timezone checker instead of type based
#9918	Prevent generation of 'year 0 is out of range' strings in IT
#9852	Avoid generating duplicate nan keys with MapGen(FloatGen)
#9674	Add cache action to speed up mvn workflow [skip ci]
#9900	Revert "Remove Databricks 13.3 from release 23.12 (#9890)"
#9888	Update nightly build and deploy script for arm artifacts [skip ci]
#9656	Update for new retry state machine JNI APIs
#9654	Detect multiple jars on the classpath when init plugin
#9857	Skip redundant steps in nightly build [skip ci]
#9812	Update JNI and private dep version to 24.02.0-SNAPSHOT

Release 23.12

Features


#6832	[FEA] Convert Timestamp/Timezone tests/checks to be per operator instead of generic
#9805	[FEA] Support `current_date` expression function with CST (UTC + 8) timezone support
#9515	[FEA] Support temporal types in to_json
#9872	[FEA][JSON] Support Decimal type in `to_json`
#9802	[FEA] Support FromUTCTimestamp on the GPU with a non-UTC time zone
#6831	[FEA] Support timestamp transitions to and from UTC for single time zones with no repeating rules
#9590	[FEA][JSON] Support temporal types in `from_json`
#9804	[FEA] Support CPU path for from_utc_timestamp function with timezone
#9461	[FEA] Validate nvcomp-3.0 with spark rapids plugin
#8832	[FEA] rewrite join conditions where only part of it can fit on the AST
#9059	[FEA] Support spark.sql.parquet.datetimeRebaseModeInRead=LEGACY
#9037	[FEA] Support spark.sql.parquet.int96RebaseModeInWrite=LEGACY
#9632	[FEA] Take into account `org.apache.spark.timeZone` in Parquet/Avro from Spark 3.2
#8770	[FEA] add more metrics to Eventlogs or Executor logs
#9597	[FEA][JSON] Support boolean type in `from_json`
#9516	[FEA] Add support for JSON data source option `ignoreNullFields=false` in `to_json`
#9520	[FEA] Add support for `LAST()` as running window function
#9518	[FEA] Add support for relevant JSON data source options in `to_json`
#9218	[FEA] Support stack function
#9532	[FEA] Support Delta Lake 2.3.0
#1525	[FEA] Support Scala 2.13
#7279	[FEA] Support OverwriteByExpressionExecV1 for Delta Lake
#9326	[FEA] Specify `recover_with_null` when reading JSON files
#8780	[FEA] Support to_json function
#7278	[FEA] Support AppendDataExecV1 for Delta Lake
#6266	[FEA] Support Percentile
#7277	[FEA] Support AtomicReplaceTableAsSelect for Delta Lake
#7276	[FEA] Support AtomicCreateTableAsSelect for Delta Lake

Performance


#8137	[FEA] Upgrade to UCX 1.15
#8157	[FEA] Add string comparison to AST expressions
#9398	[FEA] Compress/encrypt spill to disk

Bugs Fixed


#9687	[BUG] `test_in_set` fails when DATAGEN_SEED=1698940723
#9659	[BUG] executor crash intermittently in scala2.13-built spark332 integration tests
#9923	[BUG] Failed case about `test_timestamp_seconds_rounding_necessary[Decimal(20,7)][DATAGEN_SEED=1701412018] – src.main.python.date_time_test`
#9982	[BUG] test "convert large InternalRow iterator to cached batch single col" failed with arena pool
#9683	[BUG] test_map_scalars_supported_key_types fails with DATAGEN_SEED=1698940723
#9976	[BUG] test_part_write_round_trip[Float] Failed on -0.0 partition
#9948	[BUG] parquet reader data corruption in nested schema after rapidsai/cudf#13302
#9867	[BUG] Unable to use Spark Rapids with Spark Thrift Server
#9934	[BUG] test_delta_multi_part_write_round_trip_unmanaged and test_delta_part_write_round_trip_unmanaged failed DATA_SEED=1701608331
#9933	[BUG] collection_ops_test.py::test_sequence_too_long_sequence[Long(not_null)][DATAGEN_SEED=1701553915, INJECT_OOM]
#9837	[BUG] test_part_write_round_trip failed
#9932	[BUG] Failed test_multi_tier_ast[DATAGEN_SEED=1701445668] on CI
#9829	[BUG] Java OOM when testing non-UTC time zone with lots of cases fallback.
#9403	[BUG] test_cogroup_apply_udf[Short(not_null)] failed with pandas 2.1.X
#9684	[BUG] test_coalesce fails with DATAGEN_SEED=1698940723
#9685	[BUG] test_case_when fails with DATAGEN_SEED=1698940723
#9776	[BUG] fastparquet compatibility tests fail with data mismatch if TZ is not set and system timezone is not UTC
#9733	[BUG] Complex AST expressions can crash with non-matching operand type error
#9877	[BUG] Fix resource leak in to_json
#9722	[BUG] test_floor_scale_zero fails with DATAGEN_SEED=1700009407
#9846	[BUG] test_ceil_scale_zero may fail with different datagen_seed
#9781	[BUG] test_cast_string_date_valid_format fails on DATAGEN_SEED=1700250017
#9714	Scala Map class not found when executing the benchmark on Spark 3.5.0 with Scala 2.13
#9856	collection_ops_test.py failed on Dataproc-2.1 with: Column 'None' does not exist
#9397	[BUG] RapidsShuffleManager MULTITHREADED on Databricks, we see loss of executors due to Rpc issues
#9738	[BUG] `test_delta_part_write_round_trip_unmanaged` and `test_delta_multi_part_write_round_trip_unmanaged` fail with `DATAGEN_SEED=1700105176`
#9771	[BUG] ast_test.py::test_X[(String, True)][DATAGEN_SEED=1700205785] failed
#9782	[BUG] Error messages appear in a clean build
#9798	[BUG] GpuCheckOverflowInTableInsert should be added to databricks shim
#9820	[BUG] test_parquet_write_roundtrip_datetime_with_legacy_rebase fails with "year 0 is out of range"
#9817	[BUG] FAILED dpp_test.py::test_dpp_reuse_broadcast_exchange[false-0-parquet][DATAGEN_SEED=1700572856, IGNORE_ORDER]
#9768	[BUG] `cast decimal to string` ScalaTest relies on side effects
#9711	[BUG] test_lte fails with DATAGEN_SEED=1699987762
#9751	[BUG] cmp_test test_gte failed with DATAGEN_SEED=1700149611
#9469	[BUG] [main] ERROR com.nvidia.spark.rapids.GpuOverrideUtil - Encountered an exception applying GPU overrides java.lang.IllegalStateException: the broadcast must be on the GPU too
#9648	[BUG] Existence default values in schema are not being honored
#9676	Fix Delta Lake Integration tests; `test_delta_atomic_create_table_as_select` and `test_delta_atomic_replace_table_as_select`
#9701	[BUG] test_ts_formats_round_trip and test_datetime_roundtrip_with_legacy_rebase fail with DATAGEN_SEED=1699915317
#9691	[BUG] Repeated Maven invocations w/o changes recompile too many Scala sources despite recompileMode=incremental
#9547	Update buildall and doc to generate bloop projects for test debugging
#9697	[BUG] Iceberg multiple file readers can not read files if the file paths contain encoded URL unsafe chars
#9681	Databricks Build Failing For 330db+
#9521	[BUG] Multi Threaded Shuffle Writer needs flow control
#9675	Failing Delta Lake Tests for Databricks 13.3 Due to WriteIntoDeltaCommand
#9669	[BUG] Rebase exception states not in UTC but timezone is Etc/UTC
#7940	[BUG] UCX peer connection issue in multi-nic single node cluster
#9650	[BUG] Github workflow for missing scala2.13 updates fails to detect when pom is new
#9621	[BUG] Scala 2.13 with-classifier profile is picking up Scala2.12 spark.version
#9636	[BUG] All parquet integration tests failed "Part of the plan is not columnar class" in databricks runtimes
#9108	[BUG] nullability on some decimal operations is wrong
#9625	[BUG] Typo in github Maven check install-modules
#9603	[BUG] fastparquet_compatibility_test fails on dataproc
#8729	[BUG] nightly integration test failed OOM kill in JDK11 ENV
#9589	[BUG] Scala 2.13 build hard-codes Java 8 target
#9581	Delta Lake 2.4 missing equals/hashCode override for file format and some metrics for merge
#9507	[BUG] Spark 3.2+/ParquetFilterSuite/Parquet filter pushdown - timestamp/ FAILED
#9540	[BUG] Job failed with SparkUpgradeException no matter which value is set for spark.sql.parquet.datetimeRebaseModeInRead
#9545	[BUG] Dataproc 2.0 test_reading_file_rewritten_with_fastparquet tests failing
#9552	[BUG] Inconsistent CDH dependency overrides across submodules
#9571	[BUG] non-deterministic compiled SQLExecPlugin.class with scala 2.13 deployment
#9569	[BUG] test_window_running failed in 3.1.2+3.1.3
#9480	[BUG] mapInPandas doesn't invoke udf on empty partitions
#8644	[BUG] Parquet file with malformed dictionary does not error when loaded
#9310	[BUG] Improve support for reading JSON files with malformed rows
#9457	[BUG] CDH 332 unit tests failing
#9404	[BUG] Spark reports a decimal error when creating a lit scalar while generating Decimal(34, -5) data.
#9110	[BUG] GPU Reader fails due to partition column creating column larger than cudf column size limit
#8631	[BUG] Parquet load failure on repeated_no_annotation.parquet
#9364	[BUG] CUDA illegal access error is triggering split and retry logic

PRs


#10384	[DOC] Update docs for 23.12.2 release [skip ci]
#10341	Update changelog for v23.12.2 [skip ci]
#10340	Copyright to 2024 [skip ci]
#10323	Upgrade version to 23.12.2-SNAPSHOT
#10329	update download page for v23.12.2 release [skip ci]
#10274	PythonRunner Changes
#10124	Update changelog for v23.12.1 [skip ci]
#10123	Change version to v23.12.1 [skip ci]
#10122	Init changelog for v23.12.1 [skip ci]
#10121	[DOC] update download page for db hot fix [skip ci]
#10116	Upgrade to 23.12.1-SNAPSHOT
#10069	Revert "Support split broadcast join condition into ast and non-ast […
#9470	Use float to string kernel
#9481	Use parse_url kernel for PROTOCOL parsing
#9935	Init 23.12 changelog [skip ci]
#9943	[DOC] Update docs for 23.12.0 release [skip ci]
#10014	Add documentation for how to run tests with a fixed datagen seed [skip ci]
#9954	Update private and JNI version to released 23.12.0
#10009	Use fixed seed to unblock 23.12 release; move the blocked issues to 24.02
#10007	Fix Java OOM in non-UTC case with lots of xfail (#9944)
#9985	Avoid allocating GPU memory out of RMM managed pool in test
#9970	Avoid leading and trailing zeros in test_timestamp_seconds_rounding_necessary
#9978	Avoid using floating point values as partition values in tests
#9979	Add compatibility notes for writing ORC with lost Gregorian days [skip ci]
#9949	Override the seed for `test_map_scalars_supported_key_types` for version of Spark before 3.4.0 [Databricks]
#9961	Avoid using floating point for partition values in Delta Lake tests
#9960	Fix LongGen accidentally using special cases when none are desired
#9950	Avoid generating NaNs as partition values in test_part_write_round_trip
#9940	Fix 'year 0 is out of range' by setting a fixed seed
#9946	Fix test_multi_tier_ast to ignore ordering of output rows
#9928	Test `inset` with `NaN` only for Spark from 3.1.3
#9906	Fix test_initcap to use the intended limited character set
#9831	Skip fastparquet timestamp tests when plugin cannot read/write timestamps
#9893	Add multiple expression tier regression test for AST
#9889	Fix test_cast_string_ts_valid_format test
#9833	Fix a hang for Pandas UDFs on DB 13.3
#9873	Add support for decimal in `to_json`
#9890	Remove Databricks 13.3 from release 23.12
#9874	Fix zero-scale floor and ceil tests
#9879	Fix resource leak in to_json
#9600	Add date and timestamp support to to_json
#9871	Fix test_cast_string_date_valid_format generating year 0
#9885	Preparation for non-UTC nightly CI [skip ci]
#9810	Support from_utc_timestamp on the GPU for non-UTC timezones (non-DST)
#9865	Fix problems with nulls in sequence tests
#9864	Add compatibility documentation with respect to decimal overflow detection [skip ci]
#9860	Fixing FAQ dead link in plugin code [skip ci]
#9840	Avoid using NaNs as Delta Lake partition values
#9773	xfail all the impacted cases when using non-UTC time zone
#9849	Instantly delete pre-merge content of stage workspace on success
#9848	Force datagen_seed for test_ceil_scale_zero and test_decimal_round
#9677	Enable build for Databricks 13.3
#9809	Re-enable AST string integration cases
#9835	Avoid pre-Gregorian dates in schema_evolution_test
#9786	Check paths for existence to prevent ignorable error messages during build
#9824	UCX 1.15 upgrade
#9800	Add GpuCheckOverflowInTableInsert to Databricks 11.3+
#9821	Update timestamp gens to avoid "year 0 is out of range" errors
#9826	Set seed to 0 for test_hash_reduction_sum
#9720	Support timestamp in `from_json`
#9818	Specify nullable=False when generating filter values in dpp tests
#9689	Support CPU path for from_utc_timestamp function with timezone
#9769	Use withGpuSparkSession to customize SparkConf
#9780	Fix NaN handling in GpuLessThanOrEqual and GpuGreaterThanOrEqual
#9795	xfail AST string tests
#9666	Add support for parsing strings as dates in `from_json`
#9673	Fix the broadcast joins issues caused by InputFileBlockRule
#9785	Force datagen_seed for 9781 and 9784 [skip ci]
#9765	Let GPU scans fall back when default values exist in schema
#9729	Fix Delta Lake atomic table operations on spark341db
#9770	[BUG] Fix the doc for Maven and Scala 2.13 test example [skip ci]
#9761	Fix bug in tagging of JsonToStructs
#9758	Remove forced seed from Delta Lake part_write_round_trip_unmanaged tests
#9652	Add time zone config to set non-UTC
#9736	Fix `TimestampGen` to generate value not too close to the minimum allowed timestamp
#9698	Speed up build: unnecessary invalidation in the incremental recompile mode
#9748	Fix Delta Lake part_write_round_trip_unmanaged tests with floating point
#9702	Support split BroadcastNestedLoopJoin condition for AST and non-AST
#9746	Force test_hypot to be single seed for now
#9745	Avoid generating null filter values in test_delta_dfp_reuse_broadcast_exchange
#9741	Set seed=0 for the delta lake part roundtrip tests
#9660	Fully support date/time legacy rebase for nested input
#9672	Support String type for AST
#9716	Initiate project version 24.02.0-SNAPSHOT
#9732	Temporarily force `datagen_seed=0` for `test_re_replace_all` to unblock CI
#9726	Fix leak in BatchWithPartitionData
#9717	Encode the file path from Iceberg when converting to a PartitionedFile
#9441	Add a random seed specific to datagen cases
#9649	Support `spark.sql.parquet.datetimeRebaseModeInRead=LEGACY` and `spark.sql.parquet.int96RebaseModeInRead=LEGACY`
#9612	Escape quotes and newlines when converting strings to json format in to_json
#9644	Add Partial Delta Lake Support for Databricks 13.3
#9690	Changed `extractExecutedPlan` to consider ResultQueryStageExec for Databricks 13.3
#9686	Removed Maven Profiles From `tests/pom.xml`
#9509	Fine-grained spill metrics
#9658	Support `spark.sql.parquet.int96RebaseModeInWrite=LEGACY`
#9695	Revert "Support split non-AST-able join condition for BroadcastNested…
#9693	Enable automerge from 23.12 to 24.02 [skip ci]
#9679	[Doc] update the dead link in download page [skip ci]
#9678	Add flow control for multithreaded shuffle writer
#9635	Support split non-AST-able join condition for BroadcastNestedLoopJoin
#9646	Fix Integration Test Failures for Databricks 13.3 Support
#9670	Normalize file timezone and handle missing file timezone in datetimeRebaseUtils
#9657	Update verify check to handle new pom files [skip ci]
#9663	Making User Guide info in bold and adding it as top right link in github.io [skip ci]
#9609	Add valid retry solution to mvn-verify [skip ci]
#9655	Document problem with handling of invalid characters in CSV reader
#9620	Add support for parsing boolean values in `from_json`
#9615	Bloop updates - require JDK11 in buildall + docs, build bloop for all targets.
#9631	Refactor Parquet readers
#9637	Added Support For Various Execs for Databricks 13.3
#9640	Add support for `ignoreNullFields=false` in `to_json`
#9623	Running window optimization for `LAST()`
#9641	Revert "Support rebase checking for nested dates and timestamps (#9617)"
#9423	Re-enable `from_json` / `JsonToStructs`
#9624	Add jenkins-level retry for pre-merge build in databricks runtimes
#9608	Fix nullability issues for some decimal operations
#9617	Support rebase checking for nested dates and timestamps
#9611	Move simple classes after refactoring to sql-plugin-api
#9618	Remove unused dataTypes argument from HostShuffleCoalesceIterator
#9626	Fix ENV typo in pre-merge github actions [skip ci]
#9593	PythonRunner and RapidsErrorUtils Changes For Databricks 13.3
#9607	Integration tests: Install specific fastparquet version.
#9610	Propagate local properties to broadcast execs
#9544	Support batching for `RANGE` running window aggregations. Including on
#9601	Remove usage of deprecated scala.Proxy
#9591	Enable implicit JDK profile activation
#9586	Merge metrics and file format fixes to Delta 2.4 support
#9594	Revert "Ignore failing Parquet filter test to unblock CI (#9519)"
#9454	Support encryption and compression in disk store
#9439	Support stack function
#9583	Fix fastparquet tests to work with HDFS
#9508	Consolidate deps switching in an intermediate pom
#9562	Delta Lake 2.3.0 support
#9576	Move Stack classes to wrapper classes to fix non-deterministic build issue
#9572	Add retry for CrossJoinIterator and ConditionalNestedLoopJoinIterator
#9575	Fix `test_window_running*()` for `NTH_VALUE IGNORE NULLS`.
#9574	Fix broken #endif scala comments [skip ci]
#9568	Enforce Apache 3.3.0+ for Scala 2.13
#9557	Support launching Map Pandas UDF on empty partitions
#9489	Batching support for ROW-based `FIRST()` window function
#9510	Add Databricks 13.3 shim boilerplate code and refactor Databricks 12.2 shim
#9554	Fix fastparquet installation for
#9536	Add CPU POC of TimeZoneDB; Test some time zones by comparing CPU POC and Spark
#9558	Support integration test against scala2.13 spark binaries [skip ci]
#8592	Scala 2.13 Support
#9551	Enable malformed Parquet failure test
#9546	Support OverwriteByExpressionExecV1 for Delta Lake tables
#9527	Support Split And Retry for GpuProjectAstExec
#9541	Move simple classes to API
#9548	Append new authorized user to blossom-ci whitelist [skip ci]
#9418	Fix STRUCT comparison between Pandas and Spark dataframes in fastparquet tests
#9468	Add SplitAndRetry to GpuRunningWindowIterator
#9486	Add partial support for `to_json`
#9538	Fix tiered project breaking higher order functions
#9539	Add delta-24x to delta-lake/README.md [skip ci]
#9534	Add pyarrow tests for Databricks runtime
#9444	Remove redundant pass-through shuffle manager classes
#9531	Fix relative path for spark-shell nightly test [skip ci]
#9525	Follow-up to dbdeps consolidation
#9506	Move ProxyShuffleInternalManagerBase to api
#9504	Add a spark-shell smoke test to premerge and nightly
#9519	Ignore failing Parquet filter test to unblock CI
#9478	Support AppendDataExecV1 for Delta Lake tables
#9366	Add tests to check compatibility with `fastparquet`
#9419	Add retry to RoundRobin Partitioner and Range Partitioner
#9502	Install Dependencies Needed For Databricks 13.3
#9296	Implement `percentile` aggregation
#9488	Add Shim JSON Headers for Databricks 13.3
#9443	Add AtomicReplaceTableAsSelectExec support for Delta Lake
#9476	Refactor common Delta Lake test code
#9463	Fix Cloudera 3.3.2 shim for handling CheckOverflowInTableInsert and orc zstd support
#9460	Update links in old release notes to new doc locations [skip ci]
#9405	Wrap scalar generation into spark session in integration test
#9459	Fix 332cdh build [skip ci]
#9425	Add support for AtomicCreateTableAsSelect with Delta Lake
#9434	Add retry support to `HostToGpuCoalesceIterator.concatAllAndPutOnGPU`
#9453	Update codeowner and blossom-ci ACL [skip ci]
#9396	Add support for Cloudera CDS-3.3.2
#9380	Fix parsing of Parquet legacy list-of-struct format
#9438	Fix auto merge conflict 9437 [skip ci]
#9424	Refactor aggregate functions
#9414	Add retry to GpuHashJoin.filterNulls
#9388	Add developer documentation about working with data sources [skip ci]
#9369	Improve JSON empty row fix to use less memory
#9373	Fix auto merge conflict 9372
#9308	Initiate arm64 CI support [skip ci]
#9292	Init project version 23.12.0-SNAPSHOT

Older Releases

Changelog of older releases can be found at docs/archives
