Ray-2.1.0

@c21 c21 released this 08 Nov 01:55
· 3 commits to releases/2.1.0 since this release

Release Highlights

  • Ray AI Runtime (AIR)
    • Better support for Image-based workloads.
      • Ray Datasets read_images() API for loading data.
      • NumPy-based API for user-defined functions in Preprocessor.
    • Ability to read TFRecord input.
      • Ray Datasets read_tfrecords() API to read TFRecord files.
  • Ray Serve:
    • Add support for gRPC endpoints (alpha release). As an alternative to the HTTP server, Ray Serve can now serve requests over the gRPC protocol, and users can bring their own schema for their use case.
  • RLlib:
    • Introduce decision transformer (DT) algorithm.
    • New hook for callbacks with on_episode_created().
    • Add learning rate schedule to SimpleQ and PG.
  • Ray Core:
    • Ray OOM prevention (alpha release).
    • Support dynamic generators as task return values.
  • Dashboard:
    • Time series metrics support.
    • Exported default configuration files can be used in Prometheus or Grafana instances.
    • New progress bar in job detail view.

Ray Libraries

Ray AIR

💫Enhancements:

  • Improve readability of training failure output (#27946, #28333, #29143)
  • Auto-enable GPU for Predictors (#26549)
  • Add ability to create TorchCheckpoint from state dict (#27970)
  • Add ability to create TensorflowCheckpoint from saved model/h5 format (#28474)
  • Add attribute to retrieve URI from Checkpoint (#28731)
  • Add all allowable types to WandB Callback (#28888)

🔨 Fixes:

  • Handle nested metrics properly as scoring attribute (#27715)
  • Fix serializability of Checkpoints (#28387, #28895, #28935)

📖Documentation:

🏗 Architecture refactoring:

  • Deprecate Checkpoint.to_object_ref and Checkpoint.from_object_ref (#28318)
  • Deprecate legacy train/tune functions in favor of Session (#28856)

Ray Data Processing

🎉 New Features:

  • Add read_images (#29177)
  • Add read_tfrecords (#28430)
  • Add NumPy batch format to Preprocessor and BatchMapper (#28418)
  • Ragged tensor extension type (#27625)
  • Add KBinsDiscretizer Preprocessor (#28389)

💫Enhancements:

  • Simplify to_tf interface (#29028)
  • Add metadata override and inference in Dataset.to_dask() (#28625)
  • Prune unused columns before aggregate (#28556)
  • Add Dataset.default_batch_format (#28434)
  • Add partitioning parameter to read_ functions (#28413)
  • Deprecate "native" batch format in favor of "default" (#28489)
  • Support None partition field name (#28417)
  • Re-enable Parquet sampling and add progress bar (#28021)
  • Cap the number of stats kept in StatsActor and purge in FIFO order if the limit is exceeded (#27964)
  • Customized serializer for Arrow JSON ParseOptions in read_json (#27911)
  • Optimize groupby/mapgroups performance (#27805)
  • Improve size estimation of image folder data source (#27219)
  • Use detached lifetime for stats actor (#25271)
  • Pin _StatsActor to the driver node (#27765)
  • Better error message for partition filtering if no file found (#27353)
  • Make Concatenator deterministic (#27575)
  • Change FeatureHasher input schema to expect token counts (#27523)
  • Avoid unnecessary reads when truncating a dataset with ds.limit() (#27343)
  • Hide tensor extension from UDFs (#27019)
  • Add repr to AIR classes (#27006)
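
The StatsActor change above (#27964) caps how many stats records are kept and purges the oldest first. The following is a minimal standalone sketch of that FIFO-capping idea, not Ray's actual implementation; the `CappedStats` class, its methods, and the record contents are hypothetical names chosen for illustration.

```python
from collections import OrderedDict


class CappedStats:
    """Hypothetical store: keep at most `max_entries` records, oldest purged first."""

    def __init__(self, max_entries):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self.max_entries = max_entries
        self._records = OrderedDict()  # insertion order doubles as FIFO order

    def put(self, key, stats):
        # Updating an existing key keeps its original position in the queue.
        self._records[key] = stats
        while len(self._records) > self.max_entries:
            self._records.popitem(last=False)  # drop the oldest insertion

    def get(self, key):
        return self._records.get(key)

    def keys(self):
        return list(self._records)


store = CappedStats(max_entries=2)
store.put("ds-1", {"rows": 10})
store.put("ds-2", {"rows": 20})
store.put("ds-3", {"rows": 30})  # exceeds the cap, so "ds-1" is purged
print(store.keys())  # ['ds-2', 'ds-3']
```

An insertion-ordered dictionary keeps both lookup and eviction cheap, which is one plausible way to bound memory in a long-lived actor.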

🔨 Fixes:

  • Add upper bound to pyarrow version check (#29674) (#29744)
  • Fix map_groups to work with different output type (#29184)
  • read_csv no longer filters out files by default (#29032)
  • Check columns when adding rows to TableBlockBuilder (#29020)
  • Fix the peak memory usage calculation (#28419)
  • Change sampling to use same API as read Parquet (#28258)
  • Fix column assignment in Concatenator for Pandas 1.2 (#27531)
  • Do partition filtering in reader constructor (#27156)
  • Fix split ownership (#27149)

📖Documentation:

  • Clarify dataset transformation. (#28482)
  • Update map_batches documentation (#28435)
  • Improve docstring and doctest for read_parquet (#28488)
  • Activate dataset doctests (#28395)
  • Document using a different separator for read_csv (#27850)
  • Convert custom datetime column when reading a CSV file (#27854)
  • Improve preprocessor documentation (#27215)
  • Improve limit() and take() docstrings (#27367)
  • Reorganize the tensor data support docs (#26952)
  • Fix nyc_taxi_basic_processing notebook (#26983)

Ray Train

🎉 New Features:

  • Add FullyShardedDataParallel support to TorchTrainer (#28096)

💫Enhancements:

  • Add rich notebook repr for DataParallelTrainer (#26335)
  • Fast fail if training loop raises an error on any worker (#28314)
  • Use torch.encode_data with HorovodTrainer when torch is imported (#28440)
  • Automatically set NCCL_SOCKET_IFNAME to use ethernet (#28633)
  • Don't add Trainer resources when running on Colab (#28822)
  • Support large checkpoints and other arguments (#28826)

🔨 Fixes:

📖Documentation:

  • Clarify LGBM/XGB Trainer documentation (#28122)
  • Improve Hugging Face notebook example (#28121)
  • Update Train API reference and docs (#28192)
  • Mention FSDP in HuggingFaceTrainer docs (#28217)

🏗 Architecture refactoring:

  • Improve Trainer modularity for extensibility (#28650)

Ray Tune

🎉 New Features:

  • Add Tuner.get_results() to retrieve results after restore (#29083)

💫Enhancements:

  • Exclude files in sync_dir_between_nodes, exclude temporary checkpoints (#27174)
  • Add rich notebook output for Tune progress updates (#26263)
  • Add logdir to W&B run config (#28454)
  • Improve readability for long column names in table output (#28764)
  • Add functionality to recover from latest available checkpoint (#29099)
  • Add retry logic for restoring trials (#29086)

🔨 Fixes:

  • Re-enable progress metric detection (#28130)
  • Add timeout to retry_fn to catch hanging syncs (#28155)
  • Correct PB2’s beta_t parameter implementation (#28342)
  • Ignore directory exists errors to tackle race conditions (#28401)
  • Correctly overwrite files on restore (#28404)
  • Disable pytorch-lightning multiprocessing by default (#28335)
  • Raise error if scheduling an empty PlacementGroupFactory (#28445)
  • Fix trial cleanup after x seconds, set default to 600 (#28449)
  • Fix trial checkpoint syncing after recovery from other node (#28470)
  • Catch empty hyperopt search space, raise better Tuner error message (#28503)
  • Fix and optimize sample search algorithm quantization logic (#28187)
  • Support tune.with_resources for class methods (#28596)
  • Maintain consistent Trial/TrialRunner state when pausing and resuming trial with PBT (#28511)
  • Raise better error for incompatible gcsfs version (#28772)
  • Ensure that exploited in-memory checkpoint is used by trial with PBT (#28509)
  • Fix Tune checkpoint tracking for minimizing metrics (#29145)
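
One of the fixes above adds a timeout to retry_fn so that a hanging sync no longer blocks forever (#28155). The sketch below shows the general pattern of retrying a call with a per-attempt timeout. It is an illustration under assumptions, not Tune's code: the function name `retry_with_timeout`, its parameters, and the `flaky_sync` helper are all hypothetical.

```python
import threading
import time


def retry_with_timeout(fn, num_retries=3, sleep_time=0.0, timeout=1.0):
    """Call fn() up to num_retries times; return True on the first success.

    An attempt fails if fn raises or does not finish within `timeout` seconds.
    """

    def attempt(outcome):
        try:
            fn()
            outcome["ok"] = True
        except Exception:
            pass  # a raised error counts as a failed attempt

    for _ in range(num_retries):
        outcome = {"ok": False}  # fresh per attempt, so late finishers cannot leak in
        worker = threading.Thread(target=attempt, args=(outcome,), daemon=True)
        worker.start()
        worker.join(timeout)  # returns after `timeout` seconds even if fn hangs
        if not worker.is_alive() and outcome["ok"]:
            return True
        time.sleep(sleep_time)
    return False


calls = []


def flaky_sync():
    """Hypothetical sync that hangs on its first two calls."""
    calls.append(1)
    if len(calls) < 3:
        time.sleep(0.5)


succeeded = retry_with_timeout(flaky_sync, num_retries=3, timeout=0.1)
print(succeeded, len(calls))
```

A timed-out attempt is abandoned rather than killed: the daemon thread keeps running in the background until it finishes or the process exits, which is a known limitation of thread-based timeouts.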

📖Documentation:

🏗 Architecture refactoring:

  • Store SyncConfig and CheckpointConfig in Experiment and Trial (#29019)

Ray Serve

🎉 New Features:

  • Added gRPC direct ingress support [alpha version] (#28175)
  • Serve CLI can provide Kubernetes-formatted output (#28918)
  • Serve CLI can provide user config output without default values (#28313)

💫Enhancements:

  • Add more benchmarks:
    • Image objection with ResNet50 model and image preprocessing (#29096)
    • gRPC vs. HTTP inference performance (#28175)
  • Add health check metrics to reflect the replica health status (#29154)

🔨 Fixes:

  • Fix memory leak issues during inference (#29187)
  • Fix unexpected warning about omitted HTTP options when using the Serve CLI to start Ray Serve (#28257)
  • Fix unexpected long poll exceptions (#28612)

📖Documentation:

  • Add e2e fault tolerance instructions (#28721)
  • Add Direct Ingress instructions (#29149)
  • Various doc improvements on “dev workflow”, “custom resources”, “serve cli”, etc. (#29147, #28708, #28529, #28527)

RLlib

🎉 New Features:

  • Decision Transformer (DT) Algorithm added (#27890, #27889, #27872, #27829).
  • Callbacks now have a new hook on_episode_created(). (#28600)
  • Added learning rate schedule to SimpleQ and PG. (#28381)
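
A learning rate schedule maps the current training timestep to a learning rate. The sketch below assumes the schedule is given as [timestep, learning_rate] breakpoints with linear interpolation between them; it is a standalone illustration rather than RLlib's implementation, and the `piecewise_linear` helper and the example values are hypothetical.

```python
def piecewise_linear(schedule, timestep):
    """Linearly interpolate a value from [[timestep, value], ...] breakpoints."""
    points = sorted(schedule)
    if timestep <= points[0][0]:
        return points[0][1]  # before the first breakpoint, use its value
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if t0 <= timestep < t1:
            frac = (timestep - t0) / (t1 - t0)
            return v0 + frac * (v1 - v0)
    return points[-1][1]  # hold the final value past the last breakpoint


# Hypothetical schedule: decay from 1e-3 to 1e-4 over the first 1000 timesteps.
lr_schedule = [[0, 1e-3], [1000, 1e-4]]
for step in (0, 500, 1000, 5000):
    print(step, piecewise_linear(lr_schedule, step))
```

Holding the last value after the final breakpoint means the learning rate never drops below the configured floor, however long training runs.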

💫Enhancements:

🔨 Fixes:

📖Documentation:

Ray Workflows

🔨 Fixes:

  • Fixed the object loss due to driver exit (#29092)
  • Change the name in step to task_id (#28151)

Ray Core and Ray Clusters

Ray Core

🎉 New Features:

  • Ray OOM prevention feature alpha release! If your Ray jobs suffer from OOM issues, please give it a try.
  • Support dynamic generators as task return values. (#29082 #28864 #28291)

💫Enhancements:

  • Fix spread scheduling imbalance issues (#28804 #28551)
  • Widening range of grpcio versions allowed (#28623)
  • Support encrypted redis connection. (#29109)
  • Upgrade redis from 6.x to 7.0.5. (#28936)
  • Batch ScheduleAndDispatchTasks calls (#28740)

🔨 Fixes:

  • More robust spilled object deletion (#29014)
  • Fix the initialization/destruction order between reference_counter_ and node change subscription (#29108)
  • Suppress the logging error when python exits and actor not deleted (#27300)
  • Mark run_function_on_all_workers as deprecated until we get rid of this (#29062)
  • Remove unused args for default_worker.py (#28177)
  • Don't include script directory in sys.path if it's started via python -m (#28140)
  • Handling edge cases of max_cpu_fraction argument (#27035)
  • Fix out-of-band deserialization of actor handle (#27700)
  • Allow reuse of cluster address if Ray is not running (#27666)
  • Fix an uncaught exception upon deallocation for actors (#27637)
  • Support placement_group=None in PlacementGroupSchedulingStrategy (#27370)

📖Documentation:

Ray Clusters

💫Enhancements:

  • Distinguish Kubernetes deployment stacks (#28490)

📖Documentation:

  • State intent to remove legacy Ray Operator (#29178)
  • Improve KubeRay migration notes (#28672)
  • Add FAQ for cluster multi-tenancy support (#29279)

Dashboard

🎉 New Features:

  • Time series metrics are now built into the dashboard
  • Ray now exports some default configuration files which can be used for your Prometheus or Grafana instances. This includes default metrics which show common information important to your Ray application.
  • A new progress bar in the job detail view shows how far along your Ray job is.

🔨 Fixes:

  • Fix Prometheus exporter producing a slightly incorrect format.
  • Fix several performance issues and memory leaks

📖Documentation:

  • Added additional documentation on the new time series and the metrics page

Many thanks to all those who contributed to this release!

@sihanwang41, @simon-mo, @avnishn, @MyeongKim, @markrogersjr, @christy, @xwjiang2010, @kouroshHakha, @zoltan-fedor, @wumuzi520, @alanwguo, @Yard1, @liuyang-my, @charlesjsun, @DevJake, @matteobettini, @jonathan-conder-sm, @mgerstgrasser, @guidj, @JiahaoYao, @Zyiqin-Miranda, @jvanheugten, @aallahyar, @SongGuyang, @clarng, @architkulkarni, @Rohan138, @heyitsmui, @mattip, @ArturNiederfahrenhorst, @maxpumperla, @vale981, @krfricke, @DmitriGekhtman, @amogkam, @richardliaw, @maldil, @zcin, @jianoaix, @cool-RR, @kira-lin, @gramhagen, @c21, @jiaodong, @sijieamoy, @tupui, @ericl, @anabranch, @se4ml, @suquark, @dmatrix, @jjyao, @clarkzinzow, @smorad, @rkooo567, @jovany-wang, @edoakes, @XiaodongLv, @klieret, @rozsasarpi, @scottsun94, @ijrsvt, @bveeramani, @chengscott, @jbedorf, @kevin85421, @nikitavemuri, @sven1977, @acxz, @stephanie-wang, @PaulFenton, @WangTaoTheTonic, @cadedaniel, @nthai, @wuisawesome, @rickyyx, @artemisart, @peytondmurray, @pingsutw, @olipinski, @davidxia, @stestagg, @yaxife, @scv119, @mwtian, @yuanchi2807, @ntlm1686, @shrekris-anyscale, @cassidylaidlaw, @gjoliver, @ckw017, @hakeemta, @ilee300a, @avivhaber, @matthewdeng, @afarid, @pcmoritz, @Chong-Li, @Catch-Bull, @justinvyu, @iycheng