Apache DataFusion 40.0.0 Changelog

This release consists of 263 commits from 64 contributors. See credits at the end of this changelog for more information.

Breaking changes:

  • Convert StringAgg to UDAF #10945 (lewiszlw)
  • Convert bool_and & bool_or to UDAF #11009 (jcsherin)
  • Convert Average to UDAF #10942 #10964 (dharanad)
  • fix: remove the Sized requirement on ExecutionPlan::name() #11047 (waynexia)
  • Return &Arc reference to inner trait object #11103 (linhr)
  • Support COPY TO Externally Defined File Formats, add FileType trait #11060 (devinjdangelo)
  • expose table name in proto extension codec #11139 (leoyvens)
  • fix(typo): unqualifed to unqualified #11159 (waynexia)
  • Consolidate Filter::remove_aliases into Expr::unalias_nested #11001 (alamb)
  • Convert nth_value to UDAF #11287 (jcsherin)

Implemented enhancements:

  • feat: Add support for Int8 and Int16 data types in data page statistics #10931 (Weijun-H)
  • feat: add CliSessionContext trait for cli #10890 (tshauck)
  • feat(optimizer): handle partial anchored regex cases and improve doc #10977 (waynexia)
  • feat: support uint data page extraction #11018 (tshauck)
  • feat: propagate EmptyRelation for more join types #10963 (tshauck)
  • feat: Add method to add analyzer rules to SessionContext #10849 (pingsutw)
  • feat: Support duplicate column names in Joins in Substrait consumer #11049 (Blizzara)
  • feat: Add support for Timestamp data types in data page statistics. #11123 (efredine)
  • feat: Add support for Binary/LargeBinary/Utf8/LargeUtf8 data types in data page statistics #11136 (PsiACE)
  • feat: Support Map type in Substrait conversions #11129 (Blizzara)
  • feat: Conditionally allow to keep partition_by columns when using PARTITIONED BY enhancement #11107 (hveiga)
  • feat: enable "substring" as a UDF in addition to "substr" #11277 (Blizzara)

Fixed bugs:

  • fix: use total ordering in the min & max accumulator for floats #10627 (westonpace)
  • fix: Support double quotes in date_part #10833 (Weijun-H)
  • fix: Ignore nullability of list elements when consuming Substrait #10874 (Blizzara)
  • fix: Support NOT <field> IN (<subquery>) via anti join #10936 (akoshchiy)
  • fix: CTEs defined in a subquery can escape their scope #10954 (jonahgao)
  • fix: Fix the incorrect null joined rows for SMJ outer join with join filter #10892 (viirya)
  • fix: gcd returns negative results #11099 (jonahgao)
  • fix: LCM panicked due to overflow #11131 (jonahgao)
  • fix: Support dictionary type in parquet metadata statistics. #11169 (efredine)
  • fix: Ignore nullability in Substrait structs #11130 (Blizzara)
  • fix: typo in comment about FinalPhysicalPlan #11181 (c8ef)
  • fix: Support Substrait's compound names also for window functions #11163 (Blizzara)
  • fix: Incorrect LEFT JOIN evaluation result on OR conditions #11203 (viirya)
  • fix: Be more lenient in interpreting input args for builtin window functions #11199 (Blizzara)
  • fix: correctly handle Substrait windows with rows bounds (and validate executability of test plans) #11278 (Blizzara)
  • fix: When consuming Substrait, temporarily rename clashing duplicate columns #11329 (Blizzara)

Documentation updates:

  • Minor: Clarify SessionContext::state docs #10847 (alamb)
  • Minor: Update SIGMOD paper reference url #10860 (alamb)
  • docs(variance): Correct typos in comments #10844 (pingsutw)
  • Add missing code close tick in LiteralGuarantee docs #10859 (adriangb)
  • Minor: Add more docs and examples for Transformed and TransformedResult #11003 (alamb)
  • doc: Update links in the documentation #11044 (Weijun-H)
  • Minor: Examples cleanup + more docs in pruning example #11086 (alamb)
  • Minor: refine documentation pointing to examples #11110 (alamb)
  • Fix running in Docker instructions #11141 (findepi)
  • docs: add example for custom file format with COPY TO #11174 (tshauck)
  • Fix docs wordings #11226 (findepi)
  • Fix count() docs around including null values #11293 (findepi)

Other:

  • chore: Prepare 39.0.0-rc1 #10828 (andygrove)
  • Remove expr_fn::sum and replace them with function stub #10816 (jayzhan211)
  • Debug print as many fields as possible for SessionState #10818 (lewiszlw)
  • Prune Parquet RowGroup in a single call to PruningPredicate::prune, update StatisticsExtractor API #10802 (alamb)
  • Remove Built-in sum and Rename to lowercase sum #10831 (jayzhan211)
  • Convert stddev and stddev_pop to UDAF #10834 (goldmedal)
  • Introduce expr builder for aggregate function #10560 (jayzhan211)
  • chore: Improve change log generator #10841 (andygrove)
  • Support user defined ParquetAccessPlan in ParquetExec, validation to ParquetAccessPlan::select #10813 (alamb)
  • Convert VariancePopulation to UDAF #10836 (mknaw)
  • Convert approx_median to UDAF #10840 (goldmedal)
  • MINOR: use workspace deps in proto-common (upgrade object store dependency) #10848 (waynexia)
  • Minor: add Window::try_new_with_schema constructor #10850 (sadboy)
  • Add support for reading CSV files with comments #10467 (bbannier)
  • Convert approx_distinct to UDAF #10851 (Lordworms)
  • minor: add proto-common crate to release instructions #10858 (andygrove)
  • Implement TPCH substrait integration test, support tpch_1 #10842 (Lordworms)
  • Remove unnecessary passing around of suffix: &str in pruning.rs's RequiredColumns #10863 (adriangb)
  • chore: Make DFSchema::datatype_is_logically_equal function public #10867 (advancedxy)
  • Bump braces from 3.0.2 to 3.0.3 in /datafusion/wasmtest/datafusion-wasm-app #10865 (dependabot[bot])
  • Docs: Add unnest to SQL Reference #10839 (gloomweaver)
  • Support correct output column names and struct field names when consuming/producing Substrait #10829 (Blizzara)
  • Make Logical Plans more readable by removing extra aliases #10832 (MohamedAbdeen21)
  • Minor: Improve ListingTable documentation #10854 (alamb)
  • Extending join fuzz tests to support join filtering #10728 (edmondop)
  • replace and(, not()) with and_not(*) #10885 (RTEnzyme)
  • Disabling test for semi join with filters #10887 (edmondop)
  • Minor: Update min_statistics and max_statistics to be helpers, update docs #10866 (alamb)
  • Remove Interval column test // parquet extraction #10888 (marvinlanhenke)
  • Minor: SMJ fuzz tests fix for rowcounts #10891 (comphead)
  • Move Count to functions-aggregate, update MSRV to rust 1.75 #10484 (jayzhan211)
  • refactor: fetch statistics for a given ParquetMetaData #10880 (NGA-TRAN)
  • Move FileSinkExec::metrics to the correct place #10901 (joroKr21)
  • Refine ParquetAccessPlan comments and tests #10896 (alamb)
  • ci: fix clippy failures on main #10903 (jonahgao)
  • Minor: disable flaky fuzz test #10904 (comphead)
  • Remove builtin count #10893 (jayzhan211)
  • Move Regr_* functions to use UDAF #10898 (eejbyfeldt)
  • Docs: clarify when the parquet reader will read from object store when using cached metadata #10909 (alamb)
  • Minor: Fix bench.sh tpch data #10905 (alamb)
  • Minor: use venv in benchmark compare #10894 (tmi)
  • Support explicit type and name during table creation #10273 (duongcongtoai)
  • Simplify Join Partition Rules #10911 (berkaysynnada)
  • Move Literal to physical-expr-common #10910 (lewiszlw)
  • chore: update some error messages for clarity #10916 (jeffreyssmith2nd)
  • Initial Extract parquet data page statistics API #10852 (marvinlanhenke)
  • Add contains function, and support in datafusion substrait consumer #10879 (Lordworms)
  • Minor: Improve arrow_statistics tests #10927 (alamb)
  • Minor: Remove prefer_hash_join env variable for clickbench #10933 (jayzhan211)
  • Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF #10917 (goldmedal)
  • refactor: remove extra default in max rows #10941 (tshauck)
  • chore: Improve performance of Parquet statistics conversion #10932 (Weijun-H)
  • Add catalog::resolve_table_references #10876 (leoyvens)
  • Convert BitAnd, BitOr, BitXor to UDAF #10930 (dharanad)
  • refactor: improve PoolType argument handling for CLI #10940 (tshauck)
  • Minor: remove potential string copy from Column::from_qualified_name #10947 (alamb)
  • Fix: StatisticsConverter counts for missing columns #10946 (marvinlanhenke)
  • Add initial support for Utf8View and BinaryView types #10925 (XiangpengHao)
  • Use shorter aliases in CSE #10939 (peter-toth)
  • Substrait support for ParquetExec round trip for simple select #10949 (xinlifoobar)
  • Support to unparse ScalarValue::IntervalMonthDayNano to String #10956 (goldmedal)
  • Minor: Return option from row_group_row_count #10973 (marvinlanhenke)
  • Minor: Add routine to debug join fuzz tests #10970 (comphead)
  • Support to unparse ScalarValue::TimestampNanosecond to String #10984 (goldmedal)
  • build(deps-dev): bump ws from 8.14.2 to 8.17.1 in /datafusion/wasmtest/datafusion-wasm-app #10988 (dependabot[bot])
  • Minor: reuse Rows buffer in GroupValuesRows #10980 (alamb)
  • Add example for writing SQL analysis using DataFusion structures #10938 (LorrensP-2158466)
  • Push down filter for Unnest plan #10974 (jayzhan211)
  • Add parquet page stats for float{16, 32, 64} #10982 (tmi)
  • Fix file_stream_provider example compilation failure on windows #10975 (lewiszlw)
  • Stop copying LogicalPlan and Exprs in CommonSubexprEliminate (2-3% planning speed improvement) #10835 (alamb)
  • chore: Update documentation link in PhysicalOptimizerRule comment #11002 (Weijun-H)
  • Push down filter plan for unnest on non-unnest column only #10991 (jayzhan211)
  • Minor: add test for pushdown past unnest #11017 (alamb)
  • Update docs for protoc minimum installed version #11006 (jcsherin)
  • propagate error instead of panicking on out of bounds in physical-expr/src/analysis.rs #10992 (LorrensP-2158466)
  • Add drop_columns to dataframe api #11010 (Omega359)
  • Push down filter plan for non-unnest column #11019 (jayzhan211)
  • Consider timezones with UTC and +00:00 to be the same #10960 (marvinlanhenke)
  • Deprecate OptimizerRule::try_optimize #11022 (lewiszlw)
  • Relax combine partial final rule #10913 (mustafasrepo)
  • Compute gcd with u64 instead of i64 because of overflows #11036 (LorrensP-2158466)
  • Add distinct_on to dataframe api #11012 (Omega359)
  • chore: add test to show current behavior of AT TIME ZONE for string vs. timestamp #11056 (appletreeisyellow)
  • Boolean parquet get datapage stat #11054 (LorrensP-2158466)
  • Using display_name for Expr::Aggregation #11020 (Lordworms)
  • Minor: Convert Count's name to lowercase #11028 (jayzhan211)
  • Minor: Move function::Hint to datafusion-expr crate to avoid physical-expr dependency for datafusion-function crate #11061 (jayzhan211)
  • Support to unparse ScalarValue::TimestampMillisecond to String #11046 (pingsutw)
  • Support to unparse IntervalYearMonth and IntervalDayTime to String #11065 (goldmedal)
  • SMJ: fix streaming row concurrency issue for LEFT SEMI filtered join #11041 (comphead)
  • Add advanced_parquet_index.rs example of indexing into parquet files #10701 (alamb)
  • Add Expr::column_refs to find column references without copying #10948 (alamb)
  • Give OptimizerRule::try_optimize default implementation and cleanup duplicated custom implementations #11059 (lewiszlw)
  • Fix FormatOptions::CSV propagation #10912 (svranesevic)
  • Support parsing SQL strings to Exprs #10995 (xinlifoobar)
  • Support dictionary data type in array_to_string #10908 (EduardoVega)
  • Implement min/max for interval types #11015 (maxburke)
  • Improve LIKE performance for Dictionary arrays #11058 (Lordworms)
  • handle overflow in gcd and return this as an error #11057 (LorrensP-2158466)
  • Convert Correlation to UDAF #11064 (pingsutw)
  • Migrate more code from Expr::to_columns to Expr::column_refs #11067 (alamb)
  • decimal support for unparser #11092 (y-f-u)
  • Improve CommonSubexprEliminate identifier management (10% faster planning) #10473 (peter-toth)
  • Change wildcard qualifier type from String to TableReference #11073 (linhr)
  • Allow access to UDTF in SessionContext #11071 (linhr)
  • Strip table qualifiers from schema in UNION ALL for unparser #11082 (phillipleblanc)
  • Update ListingTable to use StatisticsConverter #11068 (xinlifoobar)
  • to_timestamp functions should preserve timezone #11038 (maxburke)
  • Rewrite array operator to function in parser #11101 (jayzhan211)
  • Resolve empty relation opt for join types #11066 (LorrensP-2158466)
  • Add composed extension codec example #11095 (lewiszlw)
  • Minor: Avoid some repetition in to_timestamp #11116 (alamb)
  • Minor: fix ScalarValue::new_ten error message (cites one not ten) #11126 (gstvg)
  • Deprecate Expr::column_refs #11115 (alamb)
  • Overflow in negate operator #11084 (LorrensP-2158466)
  • Minor: Add Architectural Goals to the docs #11109 (alamb)
  • Fix overflow in pow #11124 (LorrensP-2158466)
  • Support to unparse Time scalar value to String #11121 (goldmedal)
  • Support to unparse TimestampSecond and TimestampMicrosecond to String #11120 (goldmedal)
  • Add standalone example for OptimizerRule #11087 (alamb)
  • Fix overflow in factorial #11134 (LorrensP-2158466)
  • Temporary Fix: Query error when grouping by case expressions #11133 (jonahgao)
  • Fix nullability of return value of array_agg #11093 (eejbyfeldt)
  • Support filter for List #11091 (jayzhan211)
  • [MINOR]: Fix some minor silent bugs #11127 (mustafasrepo)
  • Minor Fix for Logical and Physical Expr Conversions #11142 (berkaysynnada)
  • Support Date Parquet Data Page Statistics #11135 (dharanad)
  • fix flaky array query slt test #11140 (leoyvens)
  • Support Decimal and Decimal256 Parquet Data Page Statistics #11138 (Lordworms)
  • Implement comparisons on nested data types such that distinct/except would work #11117 (rtyler)
  • Minor: don't panic with bad arguments to round #10899 (tmi)
  • Minor: reduce replication for nested comparison #11149 (alamb)
  • [Minor]: Remove datafusion-functions-aggregate dependency from physical-expr crate #11158 (mustafasrepo)
  • adding config to control Varchar behavior #11090 (Lordworms)
  • minor: consolidate gcd related tests #11164 (jonahgao)
  • Minor: move batch spilling methods to lib.rs to make it reusable #11154 (comphead)
  • Move schema projection to where it's used in ListingTable #11167 (adriangb)
  • Make running in docker instruction be copy-pastable #11148 (findepi)
  • Rewrite array @> array and array <@ array in sql_expr_to_logical_expr #11155 (jayzhan211)
  • Minor: make some physical_optimizer rules public #11171 (askalt)
  • Remove pr_benchmarks.yml #11165 (alamb)
  • Optionally display schema in explain plan #11177 (alamb)
  • Minor: Add more support for ScalarValue::Float16 #11156 (Lordworms)
  • Minor: fix SQLOptions::with_allow_ddl comments #11166 (alamb)
  • Update sqllogictest requirement from 0.20.0 to 0.21.0 #11189 (dependabot[bot])
  • Support Time Parquet Data Page Statistics #11187 (dharanad)
  • Adds support for Dictionary data type statistics from parquet data pages. #11195 (efredine)
  • [Minor]: Make sort_batch public #11191 (mustafasrepo)
  • Introduce user defined SQL planner API #11180 (jayzhan211)
  • Convert grouping to UDAF #11147 (Rachelint)
  • Make statistics_from_parquet_meta a sync function #11205 (adriangb)
  • Allow user defined SQL planners to be registered #11208 (samuelcolvin)
  • Recursive unnest #11062 (duongcongtoai)
  • Document how to test examples in user guide, add some more coverage #11178 (alamb)
  • Minor: Move MemoryCatalog*Provider into a module, improve comments #11183 (alamb)
  • Add standalone example of using the SQL frontend #11088 (alamb)
  • Add Optimizer Sanity Checker, improve sortedness equivalence properties #11196 (mustafasrepo)
  • Implement user defined planner for extract #11215 (xinlifoobar)
  • Move basic SQL query examples to user guide #11217 (alamb)
  • Support FixedSizedBinaryArray Parquet Data Page Statistics #11200 (dharanad)
  • Implement ScalarValue::Map #11224 (goldmedal)
  • Remove unmaintained python pre-commit configuration #11255 (findepi)
  • Enable clone_on_ref_ptr clippy lint on execution crate #11239 (lewiszlw)
  • Minor: Improve documentation about pushdown join predicates #11209 (alamb)
  • Minor: clean up data page statistics tests and fix bugs #11236 (efredine)
  • Replacing pattern matching through downcast with trait method #11257 (edmondop)
  • Update substrait requirement from 0.34.0 to 0.35.0 #11206 (dependabot[bot])
  • Enhance short circuit handling in CommonSubexprEliminate #11197 (peter-toth)
  • Add bench for data page statistics parquet extraction #10950 (marvinlanhenke)
  • Register SQL planners in SessionState constructor #11253 (dharanad)
  • Support DuckDB style struct syntax #11214 (jayzhan211)
  • Enable clone_on_ref_ptr clippy lint on expr crate #11238 (lewiszlw)
  • Optimize PushDownFilter to avoid recreating schema columns #11211 (alamb)
  • Remove outdated rewrite_expr.rs example #11085 (alamb)
  • Implement TPCH substrait integration test, support tpch_2 #11234 (Lordworms)
  • Enable clone_on_ref_ptr clippy lint on physical-expr crate #11240 (lewiszlw)
  • Add standalone AnalyzerRule example that implements row level access control #11089 (alamb)
  • Replace println! with assert! if possible in DataFusion examples #11237 (Nishi46)
  • minor: format Expr::get_type() #11267 (jonahgao)
  • Fix hash join for nested types #11232 (eejbyfeldt)
  • Infer count() aggregation is not null #11256 (findepi)
  • Remove unnecessary qualified names #11292 (findepi)
  • Fix running examples readme #11225 (findepi)
  • Minor: Add ConstExpr::from and use in physical optimizer #11283 (alamb)
  • Implement TPCH substrait integration test, support tpch_3 #11298 (Lordworms)
  • Implement user defined planner for position #11243 (xinlifoobar)
  • Upgrade to arrow 52.1.0 (and fix clippy issues on main) #11302 (alamb)
  • AggregateExec: Take grouping sets into account for InputOrderMode #11301 (thinkharderdev)
  • Add user_defined_sql_planners(..) to FunctionRegistry #11296 (Omega359)
  • use safe cast in propagate_constraints #11297 (Lordworms)
  • Minor: Remove clone in optimizer #11315 (jayzhan211)
  • minor: Add PhysicalSortExpr::new #11310 (andygrove)
  • Fix data page statistics when all rows are null in a data page #11295 (efredine)
  • Made UserDefinedFunctionPlanner to uniform the usages #11318 (xinlifoobar)
  • Implement user defined planner for create_struct & create_named_struct #11273 (dharanad)
  • Improve stats convert performance for Binary/String/Boolean arrays #11319 (Rachelint)
  • Fix typos in datafusion-examples/datafusion-cli/docs #11259 (lewiszlw)
  • Minor: Fix Failing TPC-DS Test #11331 (berkaysynnada)
  • HashJoin can preserve the right ordering when join type is Right #11276 (berkaysynnada)
  • Update substrait requirement from 0.35.0 to 0.36.0 #11328 (dependabot[bot])
  • Support to unparse logical plans with timestamp cast to string #11326 (sgrebnov)
  • Implement user defined planner for sql_substring_to_expr #11327 (xinlifoobar)
  • Improve volatile expression handling in CommonSubexprEliminate #11265 (peter-toth)
  • Support IS NULL and IS NOT NULL on Unions #11321 (samuelcolvin)
  • Implement TPCH substrait integration test, support tpch_4 and tpch_5 #11311 (Lordworms)
  • Enable clone_on_ref_ptr clippy lint on physical-plan crate #11241 (lewiszlw)
  • Remove any aliases in Filter::try_new rather than erroring #11307 (samuelcolvin)
  • Improve DataFrame Users Guide #11324 (alamb)
  • chore: Rename UserDefinedSQLPlanner to ExprPlanner #11338 (andygrove)
  • Revert "remove derive(Copy) from Operator (#11132)" #11341 (alamb)

Credits

Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.

    41	Andrew Lamb
    17	Jay Zhan
    12	Lordworms
    12	张林伟
    10	Arttu
     9	Jax Liu
     9	Lorrens Pantelis
     8	Piotr Findeisen
     7	Dharan Aditya
     7	Jonah Gao
     7	Xin Li
     6	Andy Grove
     6	Marvin Lanhenke
     6	Trent Hauck
     5	Alex Huang
     5	Eric Fredine
     5	Mustafa Akur
     5	Oleks V
     5	dependabot[bot]
     4	Adrian Garcia Badaracco
     4	Berkay Şahin
     4	Kevin Su
     4	Peter Toth
     4	Ruihang Xia
     4	Samuel Colvin
     3	Bruce Ritchie
     3	Edmondo Porcu
     3	Emil Ejbyfeldt
     3	Heran Lin
     3	Leonardo Yvens
     3	jcsherin
     3	tmi
     2	Duong Cong Toai
     2	Liang-Chi Hsieh
     2	Max Burke
     2	kamille
     1	Albert Skalt
     1	Andrey Koshchiy
     1	Benjamin Bannier
     1	Bo Lin
     1	Chojan Shang
     1	Chunchun Ye
     1	Dan Harris
     1	Devin D'Angelo
     1	Eduardo Vega
     1	Georgi Krastev
     1	Hector Veiga
     1	Jeffrey Smith II
     1	Kirill Khramkov
     1	Matt Nawara
     1	Mohamed Abdeen
     1	Nga Tran
     1	Nishi
     1	Phillip LeBlanc
     1	R. Tyler Croy
     1	RT_Enzyme
     1	Sava Vranešević
     1	Sergei Grebnov
     1	Weston Pace
     1	Xiangpeng Hao
     1	advancedxy
     1	c8ef
     1	gstvg
     1	yfu

Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.