v0.9 release: stats-based predicate pushdown, scalar index, performance improvement, and bug fixes
Summary
- Stats-based predicate pushdown
- Scalar index
- TensorFlow and PyTorch data loaders
- Pre / post filter combined with vector search
- Performance improvements across the stack
Breaking changes:
- Change IVF_PQ algorithm for cosine distance. Requires rebuilding index with cosine distance.
- Bump `pyarrow` version to 12.0+
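The two breaking changes above can be handled with a short upgrade script. The following is a minimal sketch, not an official migration tool. The helper names (`parse_major_minor`, `pyarrow_meets_minimum`, `rebuild_cosine_index`), the default column name `vector`, and the partition and sub-vector counts are assumptions chosen for illustration. The `create_index` arguments reflect the Python API as we understand it, so check them against the documentation for your installed version.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

# Minimum pyarrow version stated in the breaking changes above.
MIN_PYARROW = (12, 0)


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '12.0.1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def pyarrow_meets_minimum(version: Optional[str] = None) -> bool:
    """Return True if the given (or installed) pyarrow version is >= 12.0."""
    if version is None:
        try:
            version = metadata.version("pyarrow")
        except metadata.PackageNotFoundError:
            # pyarrow is not installed at all.
            return False
    return parse_major_minor(version) >= MIN_PYARROW


def rebuild_cosine_index(uri, column="vector", num_partitions=256, num_sub_vectors=16):
    """Rebuild an IVF_PQ index that uses cosine distance.

    Hypothetical helper: the column name and sizing parameters are
    placeholders, and the keyword arguments are assumed to match the
    Python `create_index` API of the installed lance version.
    """
    import lance  # imported lazily so the version check works without lance

    ds = lance.dataset(uri)
    ds.create_index(
        column,
        index_type="IVF_PQ",
        metric="cosine",
        num_partitions=num_partitions,
        num_sub_vectors=num_sub_vectors,
        replace=True,  # overwrite the index built with the old algorithm
    )
    return ds
```

A typical use would be to call `pyarrow_meets_minimum()` before upgrading, then call `rebuild_cosine_index("path/to/dataset.lance")` once for each dataset that has a cosine IVF_PQ index. Indices built with other metrics are not mentioned in the breaking change and presumably do not need rebuilding.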
What's Changed
- feat: add a cache for dynamodb schema validation by @chebbyChefNEQ in #1308
- chore: speed up kmean training for cosine by @eddyxu in #1334
- feat: add take() method to RecordBatchExt by @eddyxu in #1337
- feat(python): expose row id in python API by @eddyxu in #1339
- feat: data generation of dbpedia dataset by @eddyxu in #1340
- feat: build ivf partition using disk based shuffler by @eddyxu in #1312
- feat: friendlier error messages in nearest api by @rok in #1336
- chore: index / recall benchmark over dbpedia by @eddyxu in #1348
- feat: support storing page-level stats by @wjones127 in #1316
- fix: pq cosine fast lookup table by @eddyxu in #1354
- chore: compute distance using pytorch and GPU/MPS by @eddyxu in #1351
- feat: train kmean using pytorch by @eddyxu in #1358
- build: use larger runner for doc build by @eddyxu in #1364
- feat(python): gpu based ivf partition training by @eddyxu in #1361
- fix: stop reading latest manifest by @wjones127 in #1365
- chore: improve kmean training performance on CUDA by @eddyxu in #1368
- chore: improve kmean performance on MPS by @eddyxu in #1370
- chore: use torch.index_add to compute new centroids, to improve training performance on MPS by @eddyxu in #1371
- feat: schema::field_by_id() by @eddyxu in #1375
- feat(python): design an image extension type by @rok in #1272
- perf: improve KNN and ANN performance by @wjones127 in #1367
- feat: collect int/float/boolean/date page-level statistics on write by @rok in #1346
- chore: part 1/N of refactoring vector index into separate crate by @eddyxu in #1388
- fix: handle larger arrays in take by @wjones127 in #1383
- chore: cleanup tests to avoid errors when optional components are not present by @westonpace in #1374
- chore: object write trait by @eddyxu in #1389
- refactor: fast path to find fragments for flat scan by @eddyxu in #1394
- refactor: move reader trait to lance-core by @eddyxu in #1393
- refactor: move pq to lance-index by @eddyxu in #1400
- refactor: make pq a batch transformer by @eddyxu in #1401
- feat: run pq portion of ivf_pq in parallel by @westonpace in #1386
- feat: generic shuffler over RecordBatchStream by @eddyxu in #1402
- feat: remap indices on compaction by @westonpace in #1403
- chore: remove unused crate by @eddyxu in #1405
- docs: make overwrite row green by @wjones127 in #1409
- feat: add removed_indices to CreateIndex transaction operation by @eddyxu in #1408
- feat(rust): incremental index update by @eddyxu in #1406
- feat(python): expose index optimization via python by @eddyxu in #1412
- test: add test case to ensure optimize returns to flat KNN by @westonpace in #1416
- fix: fix bug in index remapping when plan contained multiple rewrite groups by @westonpace in #1415
- chore: upgrade to datafusion 32 by @wjones127 in #1391
- ci: cross compile arm wheels by @wjones127 in #1407
- test: add new ann scenarios to the python benchmarks by @westonpace in #1411
- chore: instrument various steps in the ann search by @westonpace in #1404
- refactor: refactor flat search to lance-index by @eddyxu in #1419
- refactor: move encodings to lance-core by @eddyxu in #1425
- feat: expose latest version id api by @chebbyChefNEQ in #1426
- refactor: use function pointers instead of trait objects by @wjones127 in #1424
- feat: automatically convert image to tensors in TF data pipeline by @rok in #1420
- refactor: migrate schema and data types to lance-core by @eddyxu in #1429
- perf: better parallelism in delete vector prefiltering by @westonpace in #1428
- test: fix flaky tests involving tokio::fs::File by @westonpace in #1430
- perf: use selection vector strategy to improve exact knn performance with deletions by @wjones127 in #1418
- chore: use arrow 47 function by @eddyxu in #1439
- refactor: move format definitions to lance-core by @eddyxu in #1440
- refactor: migrate object reader and object writer by @eddyxu in #1442
- fix: fix an issue where the GPU index trainer was taking too much data into memory by @westonpace in #1447
- feat: store a separate tensor blob for IVF centroids by @eddyxu in #1446
- refactor: move all Python operations to the same runtime by @wjones127 in #1445
- chore: bump prost version to latest by @eddyxu in #1449
- chore: update half to 2.3.1 by @jacobBaumbach in #1450
- feat: allow prefiltering to be used with an index by @westonpace in #1435
- feat: benchmark and improve L2 partition compute by @eddyxu in #1453
- chore: increase ivf assignment parallelism during indexing by @eddyxu in #1451
- feat: support keyboard interrupt in Python by @wjones127 in #1438
- feat: add parameter to split by file size by @wjones127 in #1444
- ci: fix ARM build due to Ring dependency by @wjones127 in #1462
- refactor: move read and write manifest file to lance-core by @eddyxu in #1467
- feat: create_index take torch.device object by @eddyxu in #1465
- feat: added dataset stats api by @albertlockett in #1452
- refactor: move commit traits to lance-core by @eddyxu in #1469
- chore: use ruff format to replace isort and black by @eddyxu in #1472
- refactor: move ObjectStore, FileReader and FileWriter to lance-core by @eddyxu in #1473
- perf: support multi-threading shuffler by @eddyxu in #1474
- feat: poor man's SIMD lib by @eddyxu in #1478
- perf: use simd lib to implement dot by @eddyxu in #1480
- feat: expose progress on write_fragments and write_dataset by @wjones127 in #1464
- feat: split out datagen utilities, expand them, expose to python by @westonpace in #1315
- chore: remove outdated warnings about prefiltering with a vector index by @westonpace in #1484
- fix: fix L2 computation on GPU by @eddyxu in #1485
- perf: improve kmeans and make pq training multi-threaded by @eddyxu in #1479
- chore: mention GPU support in README by @eddyxu in #1489
- fix: fix PQ training metric type is not appropriately propagated by @eddyxu in #1493
- docs: clarify behaviour of refine_factor by @albertlockett in #1496
- ci: cancel in progress runs on new push by @albertlockett in #1497
- chore: remove unused value settings by @eddyxu in #1494
- feat: provide a f32x16 abstraction to make unrolling 256-bit code easier by @eddyxu in #1495
- fix: remove channel closed messages by @wjones127 in #1502
- perf: dimension-based kernel for L2 and Cosine by @eddyxu in #1503
- feat: add location for all error by @Weijun-H in #1475
- feat: add sorting to the scanner by @westonpace in #1498
- feat: add tf.data APIs for reading batches by @wjones127 in #1488
- feat: experimental avx512 features by @eddyxu in #1506
- feat: add read ahead for take scan by @wjones127 in #1501
- feat: use caller location in error conversion functions by @chebbyChefNEQ in #1510
- chore(rust): reduce debug message log level by @changhiskhan in #1512
- feat: collect page-level statistics on write by @rok in #1335
- feat(rust): simd ops of reduce min, min, find and gather by @eddyxu in #1514
- feat: add btree scalar index by @westonpace in #1476
- feat: support `true` in deletion logic by @Weijun-H in #1515
- fix: make sure we have physical rows by @wjones127 in #1511
- chore: benchmark of large IVF partitions by @eddyxu in #1524
- feat: make dot generic to support bf16/f16/f32 with one `dot_distance` interface. by @eddyxu in #1522
- chore: add same target-features to python pyo3 build by @eddyxu in #1527
- feat: expose index cache configure via open dataset API by @eddyxu in #1523
- fix: fix assertion of cosine values by @eddyxu in #1530
- feat: generic cosine code by @eddyxu in #1537
- perf: improve f16 performance for norm L2 on aarch64 by @eddyxu in #1539
- feat: make L2 generic to work with all float numbers by @eddyxu in #1532
- fix: pq index does not handle dot product metric correctly during search by @rok in #1536
- chore: move scalar_index benchmark to break circular dependency by @westonpace in #1540
- feat: safer API for physical_rows by @wjones127 in #1529
- feat: implement datafusion tableprovider trait for `Dataset` by @universalmind303 in #1526
- feat: expose `Dataset.validate()` in Python by @wjones127 in #1538
- fix: add versioning and bypass broken row counts by @wjones127 in #1534
- feat: generic kmeans that supports bf16 and f16 by @eddyxu in #1544
- chore: disable avx512 for now by @eddyxu in #1546
- chore: fix type inference errors in benchmarks by @westonpace in #1556
- chore: provide a trait to dynamically dispatch different pq based on different vector data type by @eddyxu in #1555
- chore: update the CI build to check/build all crates in the workspace and not just the lance crate by @westonpace in #1557
- feat: make it possible to create and load scalar indices for a dataset by @westonpace in #1516
- feat: generic Product Quantization by @eddyxu in #1560
- test: add property-based testing for statistics by @wjones127 in #1554
- feat: ffi to accelerate norm_l2 for f16 if the instruction set is available by @eddyxu in #1562
- feat: extend FSL with sample by @eddyxu in #1572
- feat: allow for more advanced storage options in objectstore by @universalmind303 in #1547
- feat: implement as_slice for bfloat16 array by @eddyxu in #1574
- perf: add bf16 benchmarks by @eddyxu in #1575
- feat: f16 for L2 by @eddyxu in #1577
- chore: update cc dependency to 1.0.83 by @westonpace in #1578
- feat: make IVF model support f16 and bf16 by @eddyxu in #1573
- feat: allow the scanner to take advantage of scalar indices by @westonpace in #1543
- chore: dotprod should be on mac target, not haswell and better randomness for bf16 by @westonpace in #1579
- feat: make Dataset::nearest() accept arbitrary query type by @eddyxu in #1582
- chore(rust): remove extraneous dbg message by @changhiskhan in #1598
- feat: torch cache-able dataset, with sampling support by @eddyxu in #1591
- fix: tell writer correct schema when writing index file by @wjones127 in #1518
- feat: add support for remapping scalar indices during compaction by @westonpace in #1571
- refactor: switch to using DataFusions physical expr by @wjones127 in #1581
- chore: various fixes for Python benchmarks by @wjones127 in #1513
- feat: adaptive cuda allocation for l2/cosine distance computation by @eddyxu in #1601
- fix: fix a memory leak where a dataset would not be fully deleted by @westonpace in #1606
- fix: google objectstore uses proper gs configuration by @universalmind303 in #1608
- perf: kmean fit uses cached torch dataset by @eddyxu in #1603
- fix: add migration for bad fragment bitmaps by @westonpace in #1611
- feat: allow scalar indices to be updated with new data by @westonpace in #1576
- feat: add python bindings for creating scalar indices by @westonpace in #1592
- fix: handle no max value for string by @wjones127 in #1600
- feat: expose index cache size by @rok in #1587
- feat: track index cache hit rate by @rok in #1586
- feat: serialize arbitrary float type of PQ to protobuf by @eddyxu in #1624
- ci: use M1 runner for now for release by @wjones127 in #1623
- feat: coerce float array for nearest query by @eddyxu in #1618
- chore: expose avx512fp16 feature via main lance crate by @eddyxu in #1626
- feat: make partition calculation parallel by @chebbyChefNEQ in #1625
- feat(rust): simplify object store option API by @wjones127 in #1627
- fix: fix chunk size issue by @wjones127 in #1630
- perf: more efficient treemap implementation for row ids by @wjones127 in #1632
- feat(python): add `index_cache_hit_rate` to `index_stats()` by @rok in #1631
- chore: make lance-linalg benchmark ready to test bf16 data by @eddyxu in #1634
- perf: fast L2 distance table build by @eddyxu in #1639
- fix: correctly avg centroids in update logic in GPU IVF training by @chebbyChefNEQ in #1646
- perf: add a fast path for converting bytes into array when the bytes has the correct alignment by @chebbyChefNEQ in #1652
- fix: prevent OOM when IVF centroids are provided by @wjones127 in #1653
- test: fix for test by @wjones127 in #1644
- perf: minor change to cleanup allowing for size to be collected in parallel by @westonpace in #1649
- perf: add type coercion for in-list expressions by @westonpace in #1655
- chore: minor changes to tracing instrumentation by @westonpace in #1619
- fix: fix error message for invalid nprobes by @albertlockett in #1666
- feat: add support for update queries by @wjones127 in #1585
- fix: support no-op filters again by @wjones127 in #1669
- fix: row_id range fix for index training on gpu by @jerryyifei in #1663
- feat: better warnings when the PQ assignment over cosine distance is wrong by @eddyxu in #1672
- fix: add retries for failed response stream by @wjones127 in #1671
- chore: add utility to compute ground truth for benchmarks by @eddyxu in #1668
- fix: dont use scalar indices unless we are prefiltering by @westonpace in #1678
- fix: lance pytorch dataset parameter to load with row_id by @eddyxu in #1676
- feat: a tensor dataset that shares the same behavior as Lance torch Dataset by @eddyxu in #1679
- chore: add new python benchmarks for testing scalar indices by @westonpace in #1658
- feat: add option to pass in precomputed row_id -> ivf partition mapping and compute partition on GPU by @chebbyChefNEQ in #1680
- fix: make sure to prefilter the flat portion of a combined knn by @westonpace in #1583
- perf: use datafusion to shuffle index partition data by @wjones127 in #1645
- feat: add batch buffering and async loading to torch.LanceDataset by @chebbyChefNEQ in #1687
- feat: optimized pushdown scanner by @wjones127 in #1328
- fix: add shutdown to async loader by @chebbyChefNEQ in #1690
- fix: use epsilon to handle all zero cosine values by @eddyxu in #1696
- fix: prevent stats meta from breaking old readers by @wjones127 in #1699
- fix: add _rowid when `use_stats=False` by @wjones127 in #1700
- perf: revert back to hashmap by @chebbyChefNEQ in #1692
- fix: remove default memory cap for index training by @wjones127 in #1702
- feat: do not use residual vector for cosine similarity by @eddyxu in #1708
- feat: add support for new and deleted data to scalar indices by @westonpace in #1689
- fix: update list_indices to report if an index is vector or scalar by @westonpace in #1710
- perf: allow take to process multiple fragments in parallel by @westonpace in #1713
- feat: turn on argument tracking in tracing by @wjones127 in #1706
- perf: make sure we use multiple threads when scanning by @wjones127 in #1705
- chore: kmeans fit takes pyarrow FixedSizeListArray by @eddyxu in #1714
- revert: use epsilon to handle all zero cosine values (#1696) by @eddyxu in #1715
- chore: add ruff copyright check by @eddyxu in #1716
- chore: compute pairwise cosine using pytorch by @eddyxu in #1717
- chore: normalize vector kernel by @eddyxu in #1720
- fix: fix l2 normalize by @eddyxu in #1722
- perf: use an asynchronous open function even for local files by @westonpace in #1721
- perf: small performance fixes for scan by @wjones127 in #1719
- fix: cosine kmeans by @eddyxu in #1723
- fix: cosine kmeans on GPU by @eddyxu in #1726
- fix: pq code for cosine distance by @eddyxu in #1727
- chore: adjust cosine value from l2 distance by @eddyxu in #1730
- fix: various fixes to GPU kmeans by @chebbyChefNEQ in #1731
- feat: handroll ivf partition shuffle by @chebbyChefNEQ in #1729
Full Changelog: v0.8.0...v0.9.0