Ray-1.12.0

Released by @jianoaix on 08 Apr 03:05 (commit f18fc31)

Highlights

  • Ray AI Runtime (AIR), an open-source toolkit for building end-to-end ML applications on Ray, is now in Alpha. AIR is an effort to unify the experience of using different Ray libraries (Ray Data, Train, Tune, Serve, RLlib). You can find more information on the docs or on the public RFC.
    • Getting involved with Ray AIR. We’ll be holding office hours, development sprints, and other activities as we get closer to the Ray AIR Beta/GA release. Want to join us? Fill out this short form!
  • Ray usage data collection is now off by default. If you have any questions or concerns, please comment on the RFC.
  • New algorithms are added to RLlib: SlateQ & Bandits (for recommender system use cases) and AlphaStar (multi-agent, multi-GPU w/ league-based self-play).
  • Ray Datasets: new lazy execution model with automatic task fusion and memory-optimizing move semantics; first-class support for Pandas DataFrame blocks; efficient random access datasets.

Ray Autoscaler

🎉 New Features

  • Support cache_stopped_nodes on Azure (#21747)
  • AWS Cloudwatch support (#21523)

💫 Enhancements

  • Improved documentation and standards around built-in autoscaler node providers. (#22236, #22237)
  • Improved KubeRay support (#22987, #22847, #22348, #22188)
  • Remove redis requirement (#22083)

🔨 Fixes

  • No longer print infeasible warnings for internal placement group resources. Placement groups which cannot be satisfied by the autoscaler still trigger warnings. (#22235)
  • Default AMIs per AWS region are updated/fixed. (#22506)
  • GCP node termination updated (#23101)
  • Retry legacy k8s operator on monitor failure (#22792)
  • Cap min and max workers for manually managed on-prem clusters (#21710)
  • Fix initialization artifacts (#22570)
  • Ensure initial scaleup with high upscaling_speed isn't limited. (#21953)

Ray Client

🎉 New Features:

  • ray.init now has a consistent return value in client mode and driver mode (#21355)

💫 Enhancements:

  • Gets and puts are streamed to support arbitrary object sizes, as sketched below (#22100, #22327)
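
The streaming behavior applies transparently to the existing ray.put / ray.get calls. A minimal sketch, assuming a running cluster with the Ray Client server on its default port; the address below is a placeholder:

```python
import numpy as np
import ray

# Placeholder address; point this at your cluster's Ray Client server.
ray.init("ray://<head-node-ip>:10001")

# Large objects are now chunked over the client connection rather than sent
# as a single gRPC message, so transfers are no longer capped by message size.
ref = ray.put(np.zeros((256, 1024, 1024), dtype=np.uint8))  # ~256 MiB
arr = ray.get(ref)
```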

🔨 Fixes:

  • Fix Ray Client object refs being released in the wrong context (#22025)

Ray Core

🎉 New Features

  • RuntimeEnv:
    • Support setting timeout for runtime_env setup. (#23082)
    • Support setting pip_check and pip_version for runtime_env. (#22826, #23306)
    • env_vars now take effect when the pip install command is executed (temporarily not supported with conda). (#22730)
    • Support the strongly-typed API ray.runtime_env.RuntimeEnv for defining runtime environments; see the sketch after this list. (#22522)
    • Introduce virtualenv to isolate the pip-type runtime env. (#21801, #22309)
  • The raylet now shares fate with the dashboard agent, and the dashboard agent stays alive when it detects a port conflict. (#22382, #23024)
  • Enable the dashboard in the minimal Ray installation (#21896)
  • Add task and object reconstruction status to the ray memory CLI tool (#22317)
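
A minimal sketch of the new runtime_env options above, using the plain dict form; the package, versions, and environment variable are placeholders, and the same fields can also be passed through the typed ray.runtime_env.RuntimeEnv class:

```python
import ray

ray.init(
    runtime_env={
        "pip": {
            "packages": ["requests==2.27.1"],  # placeholder dependency
            "pip_check": False,                # new: toggle `pip check` after install
            "pip_version": "==22.0.2;python_version=='3.8'",  # new: pin pip itself
        },
        # env_vars are now visible while `pip install` runs (not yet for conda).
        "env_vars": {"EXAMPLE_FLAG": "1"},
    }
)
```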

🔨 Fixes

  • Report only memory usage of pinned object copies to improve scaledown. (#22020)
  • Scheduler:
    • No spreading if a node is selected for lease request due to locality. (#22015)
    • Placement group scheduling: sort non-STRICT_PACK PGs by resource priority and size (#22762)
    • Round robin during spread scheduling (#21303)
  • Object store:
    • Increment ref count when creating an ObjectRef to prevent object from going out of scope (#22120)
    • Cleanup handling for nondeterministic object size during transfer (#22639)
    • Fix bug in fusion for spilled objects (#22571)
    • Handle IO worker failures correctly (#20752)
  • Improve ray stop behavior (#22159)
  • Avoid warning when receiving too many logs from a different job (#22102)
  • GCS resource manager bug fixes and cleanup. (#22462, #22459)
  • Release the GIL when running parallel_memcopy() / memcpy() during serialization. (#22492)
  • Fix registering serializer before initializing Ray. (#23031)

Ray Data Processing

🎉 New Features

  • Big Performance and Stability Improvements:
    • Add lazy execution mode with automatic stage fusion and optimized memory reclamation via block move semantics (#22233, #22374, #22373, #22476)
    • Support for random access datasets, providing efficient random access to rows via binary search (#22749)
    • Add automatic round-robin load balancing for reading and shuffle reduce tasks, obviating the need for the _spread_resource_prefix hack (#21303)
  • More Efficient Tabular Data Wrangling:
    • Add first-class support for Pandas blocks, removing expensive Arrow <-> Pandas conversion costs (#21894)
    • Expose TableRow API + minimize copies/type-conversions on row-based ops (#22305)
  • Groupby + Aggregations Improvements:
    • Support mapping over groupby groups (#22715)
    • Support ignoring nulls in aggregations (#20787)
  • Improved Dataset Windowing:
    • Support windowing a dataset by bytes instead of number of blocks (#22577)
    • Batch across windows in DatasetPipelines (#22830)
  • Better Text I/O:
    • Support streaming snappy compression for text files (#22486)
    • Allow for custom decoding error handling in read_text() (#21967)
    • Add option for dropping empty lines in read_text() (#22298)
  • New Operations:
    • Add add_column() utility for adding derived columns (#21967); see the sketch after this list.
  • Support for metadata provider callback for read APIs (#22896)
  • Support configuring autoscaling actor pool size (#22574)
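
A minimal sketch tying together a few of the additions above (Pandas blocks, add_column(), and mapping over groupby groups); the data and column names are made up for illustration:

```python
import pandas as pd
import ray

ray.init()

# Pandas DataFrames are now first-class block types.
ds = ray.data.from_pandas([pd.DataFrame({"group": [1, 1, 2, 2], "value": [1, 2, 3, 4]})])

# add_column() derives a new column from each Pandas batch.
ds = ds.add_column("doubled", lambda df: df["value"] * 2)

# map_groups() applies a user function to each group produced by groupby().
per_group = ds.groupby("group").map_groups(lambda df: df.assign(total=df["doubled"].sum()))
print(per_group.take(4))
```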

🔨 Fixes

  • Force lazy datasource materialization in order to respect DatasetPipeline stage boundaries (#21970)
  • Simplify lifetime of designated block owner actor, and don’t create it if dynamic block splitting is disabled (#22007)
  • Respect 0 CPU resource request when using manual resource-based load balancing (#22017)
  • Remove batch format ambiguity by always converting Arrow batches to Pandas when batch_format="native" is given (#21566)
  • Fix leaked stats actor handle due to closure capture reference counting bug (#22156)
  • Fix boolean tensor column representation and slicing (#22323)
  • Fix unhandled empty block edge case in shuffle (#22367)
  • Fix unserializable Arrow Partitioning spec (#22477)
  • Fix incorrect iter_epochs() batch format (#22550)
  • Fix infinite iter_epochs() loop on unconsumed epochs (#22572)
  • Fix infinite hang on split() when num_shards < num_rows (#22559)
  • Patch Parquet file fragment serialization to prevent metadata fetching (#22665)
  • Don’t reuse task workers for actors or GPU tasks (#22482)
  • Pin pipeline executor actors to driver node to allow for lineage-based fault tolerance for pipelines (#22715)
  • Always use non-empty blocks to determine schema (#22834)
  • API fix bash (#22886)
  • Make label_column optional for to_tf() so it can be used for inference (#22916)
  • Fix schema() for DatasetPipelines (#23032)
  • Fix equalized split when num_splits == num_blocks (#23191)

💫 Enhancements

  • Optimize Parquet metadata serialization via batching (#21963)
  • Optimize metadata read/write for Ray Client (#21939)
  • Add sanity checks for memory utilization (#22642)

🏗 Architecture refactoring

  • Use threadpool to submit DatasetPipeline stages (#22912)

RLlib

🎉 New Features

  • New “AlphaStar” algorithm: A parallelized, multi-agent/multi-GPU learning algorithm, implementing league-based self-play. (#21356, #21649)
  • SlateQ algorithm has been re-tested, upgraded (multi-GPU capable, TensorFlow version), and bug-fixed (added to weekly learning tests). (#22389, #23276, #22544, #22543, #23168, #21827, #22738)
  • Bandit algorithms: Moved into the agents folder as first-class citizens, given a TensorFlow version, and unified with other agents’ APIs. (#22821, #22028, #22427, #22465, #21949, #21773, #21932, #22421)
  • ReplayBuffer API (in progress): Allow users to customize and configure their own replay buffers and use these inside custom or built-in algorithms. (#22114, #22390, #21808)
  • Datasets support for RLlib: Dataset Reader/Writer and documentation. (#21808, #22239, #21948)

🏗 Architecture refactoring

  • A3C: Moved into the new training_iteration API (from the execution_plan API). This led to a ~2.7x performance increase on an Atari + CNN + LSTM benchmark. (#22126, #22316)
  • Make multiagent->policies_to_train more flexible via a callable option (as an alternative to providing a list of policy IDs); see the sketch below. (#20735)
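
A minimal sketch of the callable form; the policy IDs and mapping function are placeholders, and the exact callable signature shown here is an assumption rather than the definitive API:

```python
# Illustrative multi-agent config fragment (would be merged into a trainer config).
config = {
    "multiagent": {
        "policies": {"learned", "scripted"},
        "policy_mapping_fn": lambda agent_id, episode, worker, **kwargs: "learned",
        # New: a callable (instead of a list of policy IDs) that decides, per
        # policy and optionally per sample batch, whether that policy is trained.
        "policies_to_train": lambda policy_id, batch=None: policy_id == "learned",
    },
}
```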

💫 Enhancements:

  • Env pre-checking module now active by default. (#22191)
  • Callbacks: Added on_sub_environment_created and on_trainer_init callback options. (#21893, #22493)
  • RecSim environment wrappers: Ability to use Google’s RecSim for recommender systems more easily w/ RLlib algorithms (3 RLlib-ready example environments). (#22028, #21773, #22211)
  • MARWIL loss function enhancement (exploratory term for stddev). (#21493)

Ray Workflow

🎉 New Features:

  • Support skipping checkpointing.

🔨 Fixes:

  • Fix an issue where the event loop is not set.

Tune

🎉 New Features:

  • Expose new checkpoint interface to users (#22741)

Train

🎉 New Features

  • Integration with the PyTorch profiler. Easily enable the PyTorch profiler with Ray Train to profile training and visualize stats in TensorBoard (#22345).
  • Automatic pipelining of host-to-device transfers. While training is happening on one batch of data, the next batch is concurrently moved from CPU to GPU (#22716, #22974).
  • Automatic Mixed Precision. Easily enable PyTorch automatic mixed precision during training (#22227).

💫 Enhancements

  • Add utility function to enable reproducibility for PyTorch training (#22851)
  • Add initial support for metrics aggregation (#22099)
  • Add support for trainer.best_checkpoint and Trainer.load_checkpoint_path. You can now directly access the best in-memory checkpoint or load an arbitrary checkpoint path to memory; see the sketch below. (#22306)
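
A minimal sketch tying together metric reporting, checkpoint saving, and the new best_checkpoint accessor; the training loop is a placeholder rather than a real model, and default checkpoint tracking behavior is assumed:

```python
from ray import train
from ray.train import Trainer

def train_func():
    # A real loop would build a model and wrap it with
    # train.torch.prepare_model(...) / train.torch.prepare_data_loader(...).
    for epoch in range(3):
        train.save_checkpoint(epoch=epoch)      # checkpoints tracked by the Trainer
        train.report(loss=1.0 / (epoch + 1))    # per-epoch metrics

trainer = Trainer(backend="torch", num_workers=2)
trainer.start()
trainer.run(train_func)
print(trainer.best_checkpoint)  # new: best checkpoint kept in memory
trainer.shutdown()
```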

🔨 Fixes

  • Add a utility function to turn off TF autosharding (#21887)
  • Fix fault tolerance for Tensorflow training (#22508)
  • Train utility methods (train.report(), etc.) can now be called outside of a Train session (#21969)
  • Fix accuracy calculation for CIFAR example (#22292)
  • Better error message for placement group timeout (#22845)

📖 Documentation

  • Update docs for ray.train.torch import (#22555)
  • Clarify shuffle documentation in prepare_data_loader (#22876)
  • Denote train.torch.get_device as a Public API (#22024)
  • Minor fixes on Ray Train user guide doc (#22379)

Serve

🔨 Fixes

  • The autoscaling algorithm will now relinquish most idle nodes when scaling down (#22669)
  • Serve can now manage Java replicas (#22628)
  • Added a hands-on self-contained MLflow and Ray Serve deployment example (#22192)
  • Added root_path setting to http_options (#21090); see the sketch below.
  • Remove shard_key, http_method, and http_headers in ServeHandle (#21590)
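
A minimal sketch of the root_path setting; the prefix and the deployment below are placeholders, and root_path is mainly useful when Serve runs behind a proxy that adds a path prefix:

```python
from ray import serve

serve.start(http_options={"root_path": "/app"})  # placeholder prefix

@serve.deployment(route_prefix="/hello")
def hello(request):
    return "Hello from Ray Serve"

hello.deploy()
```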

Dashboard

🔨 Fixes:

  • Update CPU and memory reporting on Kubernetes. (#21688)

Thanks

Many thanks to all those who contributed to this release!
@edoakes, @pcmoritz, @jiaodong, @iycheng, @krfricke, @smorad, @kfstorm, @jjyyxx, @rodrigodelazcano, @scv119, @dmatrix, @avnishn, @fyrestone, @clarkzinzow, @wumuzi520, @gramhagen, @XuehaiPan, @iasoon, @birgerbr, @n30111, @tbabej, @Zyiqin-Miranda, @suquark, @pdames, @tupui, @ArturNiederfahrenhorst, @ashione, @ckw017, @siddgoel, @Catch-Bull, @vicyap, @spolcyn, @stephanie-wang, @mopga, @Chong-Li, @jjyao, @raulchen, @sven1977, @nikitavemuri, @jbedorf, @mattip, @bveeramani, @czgdp1807, @dependabot[bot], @Fabien-Couthouis, @willfrey, @mwtian, @SlowShip, @Yard1, @WangTaoTheTonic, @Wendi-anyscale, @kaushikb11, @kennethlien, @acxz, @DmitriGekhtman, @matthewdeng, @mraheja, @orcahmlee, @richardliaw, @dsctt, @yupbank, @Jeffwan, @gjoliver, @jovany-wang, @clay4444, @shrekris-anyscale, @jwyyy, @kyle-chen-uber, @simon-mo, @ericl, @amogkam, @jianoaix, @rkooo567, @maxpumperla, @architkulkarni, @chenk008, @xwjiang2010, @robertnishihara, @qicosmos, @sriram-anyscale, @SongGuyang, @jon-chuang, @wuisawesome, @valiantljk, @simonsays1980, @ijrsvt