Releases: Eventual-Inc/Daft
v0.2.4
Changes
✨ New Features
- [FEAT] show number of truncated columns @samster25 (#1673)
- [FEAT] add retries to s3 credential provider timeouts @samster25 (#1663)
- [FEAT] Dynamic Responsive Printing of Tables, Schema and Series @samster25 (#1662)
- [FEAT] Print the results of a df.show() to stdout if running in non-interactive mode @jaychia (#1655)
- [FEAT] 1606 - Adding hour expression in date util @suriya-ganesh (#1637)
- [FEAT] [CSV Reader] Bulk CSV reader + general CSV reader refactor @clarkzinzow (#1614)
- [FEAT] Use cached preview from `df.collect()` in `df.show()`. @clarkzinzow (#1651)
🚀 Performance Improvements
👾 Bug Fixes
- [BUG] Add an allowlist of DataTypes that ColumnRangeStatistics supports and validation of TableStatistics @jaychia (#1632)
- [BUG] favor char indices instead of slicing to deal with unicode @samster25 (#1664)
- [BUG] pass in pyarrow dtype manually into parquet read @samster25 (#1650)
- [CHORE] Fixed bug in ray version @dioptre (#1649)
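The Unicode fix above (#1664) touches a general pitfall: cutting text at a byte offset can split a multi-byte UTF-8 character, while cutting at a character index cannot. The following is a minimal Python sketch of that idea only. It is not Daft's code, and the helper names `truncate_bytes` and `truncate_chars` are hypothetical.

```python
# Illustrative sketch only: hypothetical helpers, not part of Daft.


def truncate_bytes(text: str, max_bytes: int) -> bytes:
    """Naive truncation: slice the UTF-8 encoding at a byte offset."""
    return text.encode("utf-8")[:max_bytes]


def truncate_chars(text: str, max_chars: int) -> str:
    """Safer truncation: slice by character (code point) index."""
    return text[:max_chars]


sample = "na\u00efve"  # the third character encodes to two bytes in UTF-8

byte_cut = truncate_bytes(sample, 3)  # ends in the middle of that character
char_cut = truncate_chars(sample, 3)  # keeps the character whole

# Check whether the byte-sliced result is still valid UTF-8.
try:
    byte_cut.decode("utf-8")
    byte_cut_is_valid = True
except UnicodeDecodeError:
    byte_cut_is_valid = False
```

Here the byte-based cut leaves a dangling lead byte that no longer decodes, whereas the character-based cut returns a well-formed string.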
🧰 Maintenance
- [CHORE] pin pandas for 3.8 @samster25 (#1661)
- [CHORE] pin ray to 2.7.1 if less than 3.8 @samster25 (#1657)
- [CHORE] enable refresh on tqdm total updates @samster25 (#1654)
⬆️ Dependencies
- Bump chrono-tz from 0.8.3 to 0.8.4 @dependabot (#1670)
- Bump pytest from 7.4.1 to 7.4.3 @dependabot (#1644)
- Bump pandas from 2.0.3 to 2.1.3 @dependabot (#1643)
- Bump azure-storage-blob from 12.17.0 to 12.19.0 @dependabot (#1645)
- Bump async-compression from 0.4.4 to 0.4.5 @dependabot (#1638)
- Bump serde_json from 1.0.107 to 1.0.108 @dependabot (#1639)
- Bump base64 from 0.21.4 to 0.21.5 @dependabot (#1640)
- Bump dyn-clone from 1.0.14 to 1.0.16 @dependabot (#1642)
v0.2.3
Changes
✨ New Features
- Enabling quote, comment and escape character @suriya-ganesh (#1582)
- [FEAT] Iceberg Scan Operator @samster25 (#1561)
- [FEAT] Enable Progress Bars for PyRunner and RayRunner @samster25 (#1609)
👾 Bug Fixes
- [BUG] Fix CSV roundtrip for decimals (actually an f64->decimal casting bug) @jaychia (#1626)
- [BUG] Filter out size-0 directory marker files during s3 globs @jaychia (#1629)
- [BUG] raise error if non valid parquet file (less than parquet footer size) @samster25 (#1628)
- [BUG] Fix parquet timestamp tz roundtrip inference @jaychia (#1625)
- [BUG] Roundtrip tests for CSVs and Parquet @jaychia (#1616)
- [BUG] Self-concat breaks with the RayRunner @jaychia (#1617)
- [BUG] Add better handling for case where glob of parquet files returns empty @jaychia (#1615)
- [BUG] enable fixed size binary ingest to daft binary @samster25 (#1612)
- [BUG] Manually specify region in tutorial read_json @jaychia (#1608)
- [BUG] remove f strings from logging @samster25 (#1611)
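One common reason to avoid f-strings in logging calls, as in #1611 above, is that an f-string is formatted eagerly even when the log level is disabled, whereas `%`-style arguments are only formatted if the record is actually emitted. The sketch below illustrates that general behavior with Python's standard `logging` module. The `Expensive` class and the logger name are hypothetical, and the PR's actual motivation may differ.

```python
import logging


class Expensive:
    """Hypothetical object that counts how often it is rendered to text."""

    def __init__(self) -> None:
        self.calls = 0

    def __str__(self) -> str:
        self.calls += 1
        return "expensive"


logger = logging.getLogger("daft_example")  # hypothetical logger name
logger.setLevel(logging.WARNING)            # DEBUG messages are disabled
logger.propagate = False
logger.addHandler(logging.NullHandler())

eager = Expensive()
logger.debug(f"value: {eager}")   # f-string is built before debug() runs

lazy = Expensive()
logger.debug("value: %s", lazy)   # formatting is skipped: DEBUG is disabled
```

After running this, `eager.calls` is 1 and `lazy.calls` is 0, which shows the work saved by deferring formatting to the logging framework.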
📖 Documentation
🧰 Maintenance
- [CHORE] Fix style lints from #1582 @jaychia (#1635)
- [CHORE] add ray client to deps @samster25 (#1631)
- [CHORE] update fsspecs (s3, gcs, adlfs) in lockstep @samster25 (#1620)
- [CHORE] update azure storage blobs to 0.17.0 @samster25 (#1622)
- [CHORE] delete old rule runners @samster25 (#1619)
- [CHORE] drop ray default dep to make room for Pydantic > 2.0 @samster25 (#1618)
⬆️ Dependencies
- Bump moonrepo/setup-rust from 0 to 1 @dependabot (#1237)
- Bump google-cloud-storage from 0.13.1 to 0.14.0 @dependabot (#1549)
- Bump async-compat from 0.2.2 to 0.2.3 @dependabot (#1567)
v0.2.2
Changes
- [CHORE] Edit 'make-hooks' command to install pre-commit script @colin-ho (#1602)
- [CHORE] Improve error messages when calling aggregation methods on dataframe without input columns @colin-ho (#1587)
✨ New Features
- [FEAT] Add translation of IOConfig to PyArrow filesystem arguments @jaychia (#1592)
- [FEAT] [Scan Operator] Refactor planning and execution code to use shared `Pushdowns` struct. @clarkzinzow (#1595)
- [FEAT] [Scan Operator] Add `ChunkSpec` for specifying format-specific per-file row subset selection for `ScanTask`s. @clarkzinzow (#1590)
- [FEAT] [Scan Operator] Integrate `size_bytes` with `ScanOperator`s @clarkzinzow (#1586)
- [FEAT] [Scan Operator] Add Python I/O support (+ JSON) to `MicroPartition` reads @clarkzinzow (#1578)
- [FEAT][ScanOperator 1/3] Add MVP e2e `ScanOperator` integration. @clarkzinzow (#1559)
🚀 Performance Improvements
- [PERF][REVERT] Reverts: use pyarrow table for pickling rather than ChunkedArray (#1488) @jaychia (#1605)
- [PERF] Speed Up MicroPartition Ops when we know the result is empty @samster25 (#1604)
👾 Bug Fixes
- [BUG] clean up ray scheduler threads after computing partial results @samster25 (#1597)
- [BUG] Update requirements for typing_extensions @jaychia (#1593)
- [BUG] Fix Deadlock with ScanOperators in `to_physical_plan_scheduler` and show iostats for glob and from_scan_task @samster25 (#1581)
- [BUG] add allow threads for io pool operations @samster25 (#1580)
🧰 Maintenance
- [CHORE] delete unused wheel tools @samster25 (#1603)
- [CHORE] add IOStats to all micropartition ops @samster25 (#1584)
- [CHORE] Use DAFT_MICROPARTITIONS as shared feature flag for data catalog support @jaychia (#1579)
- [CHORE] Convert GlobScanOperator to perform streaming into result and take a list of glob paths @jaychia (#1577)
⬆️ Dependencies
- Bump numpy from 1.25.2 to 1.26.2 @dependabot (#1596)
v0.2.1
Changes
- [FEAT] Support disabling using doubled quotes to escape in CSV @ravern (#1544)
- [DOCS]: fix typo in doc @amir-f (#1534)
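For context on #1544 above: CSV files commonly escape a quote inside a quoted field either by doubling it (`""`) or by prefixing it with an escape character. The sketch below shows the two conventions using Python's standard `csv` module, not Daft's reader. Daft's own parameter names are not shown here, and the sample data is made up.

```python
import csv
import io

# The same logical row written with two quote-escaping conventions.
doubled = 'id,"she said ""hi"""\n'        # quotes escaped by doubling
backslash = 'id,"she said \\"hi\\""\n'    # quotes escaped with a backslash

# Default dialect: doubled quotes inside a quoted field mean one quote.
rows_doubled = list(csv.reader(io.StringIO(doubled)))

# Doubled-quote escaping disabled; a backslash escapes the quote instead.
rows_backslash = list(
    csv.reader(io.StringIO(backslash), doublequote=False, escapechar="\\")
)
```

Both readers recover the same field value, `she said "hi"`, which is why a reader needs to know which convention a file uses.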
✨ New Features
- [FEAT] GlobScanOperator @jaychia (#1550)
- [FEAT] [New Query Planner] [2/N] Push partition spec into physical plan, remove Coalesce logical op. @clarkzinzow (#1540)
👾 Bug Fixes
- [BUG] Fix reads of empty parquet files @jaychia (#1555)
- [BUG] Bump Parquet reader max_page_size to 256MB @jaychia (#1553)
- [BUG] add sort after running passes @samster25 (#1545)
- [BUG] Fix credentials issues in colab/CI @jaychia (#1539)
📖 Documentation
🧰 Maintenance
- [CHORE] Fix bad merge conflict in GlobScanOperator wrt CSV schema inference @jaychia (#1556)
- [CHORE] Revert "Bump pandas from 2.0.3 to 2.1.2" @jaychia (#1554)
- [CHORE] [New Query Planner] [1/N] Remove Python query planner. @clarkzinzow (#1538)
- [CHORE] changes to partition field and field creation @samster25 (#1537)
- [CHORE] Move code from daft-csv to daft-decoding @jaychia (#1533)
⬆️ Dependencies
- Bump pandas from 2.0.3 to 2.1.2 @dependabot (#1542)
- Bump tempfile from 3.8.0 to 3.8.1 @dependabot (#1548)
- Bump opencv-python from 4.8.0.76 to 4.8.1.78 @dependabot (#1546)
- Bump aws-actions/configure-aws-credentials from 3 to 4 @dependabot (#1384)
- Bump async-trait from 0.1.71 to 0.1.74 @dependabot (#1496)
- Bump serde from 1.0.188 to 1.0.190 @dependabot (#1541)
v0.2.0
Changes
✨ New Features
- [FEAT] Anonymous Scan Operator @samster25 (#1526)
- [FEAT] Micropartition integration and tests @jaychia (#1502)
- [FEAT] Make Binary Type Comparable @samster25 (#1528)
- [FEAT] implement series serde @samster25 (#1519)
- [FEAT] Add streaming + parallel CSV reader, with decompression support. @clarkzinzow (#1501)
- [FEAT] IOStats for Native Reader @samster25 (#1493)
🚀 Performance Improvements
- [PERF] Add "eager mode" to limits and use in .show() @jaychia (#1498)
- [PERF] Micropartition, lazy loading and Column Stats @samster25 (#1470)
- [PERF] Use pyarrow table for pickling rather than ChunkedArray @samster25 (#1488)
- [PERF] Use region from system and leverage cached credentials when making new clients @samster25 (#1490)
- [PERF] Update default max_connections 64->8 because it is now per-io-thread @jaychia (#1485)
- [PERF] Pass-through multithreaded_io flag in read_parquet @jaychia (#1484)
👾 Bug Fixes
- [BUG] Fix timestamp timezone parsing bug in CSVs @jaychia (#1530)
- [BUG] Re-raise exceptions in rayrunner @jaychia (#1522)
- [BUG] [CSV Reader] Fix CSV parsing bugs around nulls and timestamps. @clarkzinzow (#1523)
- [BUG] Fix handling of special characters in S3LikeSource @jaychia (#1495)
- [BUG] Fix local globbing of current directory @jaychia (#1494)
- [BUG] fix script to upload file 1 at a time @samster25 (#1492)
- [CHORE] Add tests and fixes for Azure globbing @jaychia (#1482)
📖 Documentation
🧰 Maintenance
- [CHORE] Allow release-drafter to increment minor version @jaychia (#1532)
- [CHORE] Soft deprecation of fsspec from user-facing APIs @jaychia (#1467)
- [CHORE] bring up fixtures for iceberg @samster25 (#1527)
- [CHORE] Skip IO integration tests if being run from dependabot @jaychia (#1521)
- [CHORE] Better logging for physical plan @jaychia (#1499)
- [CHORE] Refactor logging @jaychia (#1489)
- [CHORE] Add Workflow to build artifacts and upload to S3 @samster25 (#1491)
- [CHORE] Update default num_tries on S3Config to 25 @jaychia (#1487)
- [CHORE] Add tests and fixes for Azure globbing @jaychia (#1482)
v0.1.20
Changes
✨ New Features
- [FEAT] Streaming CSV reads @xcharleslin (#1479)
- [FEAT] [Native I/O] Add a native CSV reader. @clarkzinzow (#1475)
🚀 Performance Improvements
- [PERF] Update number of cores on every iteration @jaychia (#1480)
- [Hotfix] Change to streaming reader for CSV schema inference. @clarkzinzow (#1471)
👾 Bug Fixes
- [BUG] Properly dispatch limited reads in new query planner @xcharleslin (#1476)
- [BUG] Fixes globbing on windows by consolidating on posix-style paths @jaychia (#1472)
🧰 Maintenance
- [CHORE] Create SECURITY.md @samster25 (#1481)
v0.1.19
Changes
✨ New Features
- [FEAT] Native globbing for other backends @jaychia (#1460)
- [FEAT] Native `glob` functionality @jaychia (#1450)
- [FEAT] ls/list_dir for AzureBlobStorage @xcharleslin (#1408)
- [FEAT] Add `.str.split()` API for splitting string columns. @clarkzinzow (#1409)
- [FEAT] Add local native filesystem globbing. @clarkzinzow (#1449)
- [FEAT] Native listing of http URLs @jaychia (#1405)
🚀 Performance Improvements
- [PERF] Local filesystem parquet reader @samster25 (#1461)
- [PERF] Native globbing early stopping @jaychia (#1452)
👾 Bug Fixes
- [BUG] fix circular import when pythonpath is set @samster25 (#1474)
- [BUG] Don't remove all handles and Only use handler for files in `src/` @samster25 (#1473)
🧰 Maintenance
- [FEAT] Native globbing for other backends @jaychia (#1460)
- [CHORE] update s3 connection defaults @samster25 (#1451)
v0.1.18
Changes
✨ New Features
- [FEAT] Add support for windows in daft @samster25 (#1386)
- [FEAT] Add debug logging to s3 native apis @samster25 (#1414)
- [FEAT] enable path style for s3 custom endpoints by default @samster25 (#1410)
- [FEAT] Native S3 Lister, support trailing slashes and fix panics when connection is dropped for tokio @samster25 (#1404)
- [FEAT] Native Rust listing of GCS @jaychia (#1392)
- [FEAT] [New Query Planner] Enable new query planner by default. @clarkzinzow (#1398)
- [FEAT] Parameter to set num_parallel_tasks for bulk readers @samster25 (#1399)
- [FEAT] Native S3 Client: allow disabling ssl verification or checking hostnames @samster25 (#1395)
- [FEAT] Improved projection folding. @xcharleslin (#1374)
- [FEAT] bulk parquet pyarrow reader @samster25 (#1396)
- [FEAT] Native Recursive File Lister @samster25 (#1353)
- [FEAT] Implement .dt.year/month/day for timestamp types @jaychia (#1385)
- [FEAT] [New Query Planner] Add support for fsspec filesystems to new query planner. @clarkzinzow (#1357)
- [FEAT] Common subexpression elimination in Projection construction @xcharleslin (#1347)
👾 Bug Fixes
- [BUG] Fix num input partitions in coalesce. @clarkzinzow (#1442)
- [BUG] Fix scheme bug in GCS anonymous mode @jaychia (#1443)
- [BUG] Fix runner check at plan execution time for new query planner @clarkzinzow (#1435)
- [BUG] [Docs] Allow source code discovery to fail silently for pyo3-defined classes when generating docs. @clarkzinzow (#1430)
- [BUG] patch workspace version when building wheels @samster25 (#1418)
- [BUG] Anaconda client don't upload src wheels @samster25 (#1415)
- [BUG] Anaconda client needs wildcard for upload @samster25 (#1413)
- [BUG] Fix gs listing to include 0 sized marker files @jaychia (#1412)
- [BUG] force upload of anaconda nightly wheels @samster25 (#1411)
- [BUG] add test cases for bulk minio reading @samster25 (#1402)
- [BUG] Fixes to S3 Native Lister with correct Error propagation @samster25 (#1401)
- [BUG] Fix public API decorator type annotations. @clarkzinzow (#1397)
- [BUG] Fix partition spec bugs from old query planner @xcharleslin (#1372)
📖 Documentation
- [BUG] [Docs] Allow source code discovery to fail silently for pyo3-defined classes when generating docs. @clarkzinzow (#1430)
- [FEAT] Implement .dt.year/month/day for timestamp types @jaychia (#1385)
🧰 Maintenance
- [CHORE] disable windows pytest after building @samster25 (#1420)
- [CHORE] add caching for pip wheels @samster25 (#1419)
- [CHORE] macos xl runners are 0.32/minute not hour... @samster25 (#1417)
- [CHORE] Centralize pyo3 pickling around `__reduce__` + bincode macro. @clarkzinzow (#1394)
- [CHORE] larger macos runner for builds @samster25 (#1403)
- [CHORE] Add stubs and improve comments for pyo3-exposed abstractions, + driveby type/bug fixes. @clarkzinzow (#1377)
- [CHORE] add retries for broken link checker @samster25 (#1378)
- [CHORE] pin azure-storage-blob due to breaking new version @samster25 (#1373)
- [CHORE] [New Query Planner] Misc. user-facing error tweaks to improve UX. @clarkzinzow (#1358)
v0.1.17
Changes
✨ New Features
- [FEAT] Native Parquet Reader into pyarrow directly @samster25 (#1366)
- [FEAT] Add configurable io thread pool size @samster25 (#1363)
- [FEAT] Add flag to limit number of connections to S3 @samster25 (#1360)
- [FEAT] export jemalloc arm64 flag inside container @samster25 (#1362)
🚀 Performance Improvements
- [PERF] Used owned Stream in Parquet Page Iterator @samster25 (#1365)
- [PERF] enable jemalloc with background threads @samster25 (#1361)
- [PERF] Add microbenchmarks for takes @jaychia (#1350)
- [PERF] Optimize filter on nested growables @jaychia (#1349)
👾 Bug Fixes
- [BUG] Respect `multithreaded_io` flag when reading parquet @samster25 (#1359)
- [BUG] Schema Display should use dtype Display instead of Debug @jaychia (#1355)
- [BUG] propagate parquet io error instead of panicking @samster25 (#1352)
🧰 Maintenance
- [CHORE] [New Query Planner] Add simple `df.explain()` option; change to fixed-point policy for rule batch @clarkzinzow (#1354)
- [CHORE] Add status code to IO integration tests @jaychia (#1356)
- [CHORE] Fix List/FixedSizeList DataType to hold a dtype instead of Field @jaychia (#1351)
- [CHORE] Add Series::full_null/empty/from_arrow to reduce code duplication @jaychia (#1331)
- [CHORE] Add a Growable factory method @jaychia (#1330)
- [CHORE] Add new ListArray @jaychia (#1329)
⬆️ Dependencies
- Bump tokio from 1.29.1 to 1.32.0 @dependabot (#1371)
- Bump tempfile from 3.7.1 to 3.8.0 @dependabot (#1285)
- Bump pyo3 from 0.19.1 to 0.19.2 @dependabot (#1312)
- Bump pytest from 7.4.0 to 7.4.1 @dependabot (#1339)
- Bump actions/checkout from 3 to 4 @dependabot (#1337)
v0.1.16
Changes
✨ New Features
- [FEAT] __repr__ for ResourceRequest @xcharleslin (#1343)
- [FEAT] [New Query Planner] Refactor file globbing logic by exposing `FileInfos` to Python @clarkzinzow (#1307)
- [FEAT] S3 Native List Impl for a directory @samster25 (#1324)
- [FEAT] [New Query Planner] Add support for `DropRepartition` @clarkzinzow (#1302)
- [FEAT] Add all projection optimization rules to new query planner. @xcharleslin (#1288)
- [FEAT] [New Query Planner] Add support for `PushDownLimit` @clarkzinzow (#1300)
👾 Bug Fixes
- [BUG] Fix Table.read_parquet behavior when it encounters arrow_schema @jaychia (#1336)
- [BUG] [New Query Planner] Revert file info partition column names. @clarkzinzow (#1333)
- [BUG] Fix fixed size list array FullNull implementation @jaychia (#1320)
🧰 Maintenance
- [CHORE] install perl before maturin @samster25 (#1345)
- [CHORE] Switch to openssl @samster25 (#1344)
- [CHORE] [New Query Planner] pyo3-agnostic `LogicalPlanBuilder`, op constructor arg orderings @clarkzinzow (#1332)
- [CHORE] factor io config into common code @samster25 (#1335)
- [CHORE] [New Query Planner] Remove `ExpressionsProjection` from builder, move validation into `Op::try_new()` @clarkzinzow (#1327)
- [CHORE] StructArray refactors @jaychia (#1326)
- [CHORE] drop flag for non native compile for daft profiling @samster25 (#1323)
- [CHORE] pin pyarrow to 12 for ray compat tests @samster25 (#1322)
- [CHORE] Move FixedSizeListArray to array/fixed_size_list_array.rs @jaychia (#1319)
- [CHORE] Add fix for list schema inference tests using PyArrow 13.0.0 @jaychia (#1318)
- [CHORE] Implementations of FixedSizeListArray @jaychia (#1281)
⬆️ Dependencies
- Bump ray[data,default] from 2.6.0 to 2.6.3 @dependabot (#1315)
- Bump orjson from 3.9.4 to 3.9.5 @dependabot (#1316)
- Bump aws-actions/configure-aws-credentials from 2 to 3 @dependabot (#1317)