
Releases: ray-project/ray

Ray-2.24.0

06 Jun 18:16

Ray Libraries

Ray Data

🎉 New Features:

  • Allow users to configure the timeout for the actor pool (#45508)
  • Add override_num_blocks to from_pandas and perform auto-partition (#44937); see the sketch after this list
  • Upgrade Arrow version to 16 in CI (#45565)
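
As a quick illustration of the from_pandas change above, a minimal sketch (the DataFrame contents and block count are illustrative):

    import pandas as pd
    import ray

    df = pd.DataFrame({"x": range(1000)})

    # Ask Ray Data to split the in-memory DataFrame into 4 blocks instead of
    # relying purely on auto-partitioning.
    ds = ray.data.from_pandas(df, override_num_blocks=4)
    print(ds.count())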

💫 Enhancements:

  • Clarify that num_rows_per_file isn't strict (#45529)
  • Record more telemetry for newly added datasources (#45647)
  • Avoid pickling LanceFragment when creating read tasks for Lance (#45392)

Ray Train

📖 Documentation:

  • [HPU] Add example of Stable Diffusion fine-tuning and serving on Intel Gaudi (#45217)
  • [HPU] Add example of Llama-2 fine-tuning on Intel Gaudi (#44667)

Ray Tune

🏗 Architecture refactoring:

  • Improve excessive syncing warning and deprecate TUNE_RESULT_DIR, RAY_AIR_LOCAL_CACHE_DIR, local_dir (#45210)

Ray Serve

💫 Enhancements:

  • Clean up Serve proxy files (#45486)

📖 Documentation:

  • Add vLLM example for serving LLM models (#45430)

RLlib

💫 Enhancements:

  • DreamerV3 on tf: Fixed a bug so it runs again with tf==2.11.1 (2.11.0 is no longer available) (#45419); added a weekly release test for DreamerV3.
  • Added support for multi-agent off-policy algorithms (DQN and SAC) in the new API stack (#45182)
  • Added a config option for APPO/IMPALA to change the number of GPU-loader threads (#45467)

📖 Documentation:

  • Example script for new API stack: how to restore 1 of n agents from a checkpoint. (#45462)
  • Example script for new API stack: autoregressive action module. (#45525)

Ray Core

🔨 Fixes:

  • Fix worker crash when getting actor name from runtime context (#45194)
  • Log deduplication should not dedupe number-only lines (#45385)

📖 Documentation:

  • Improve doc for --object-store-memory to describe how the default value is set (#45301)

Dashboard

🔨 Fixes:

  • Move Job package uploading to another thread to unblock the event loop. (#45282)

Many thanks to all those who contributed to this release: @maxliuofficial, @simonsays1980, @GeneDer, @dudeperf3ct, @khluu, @justinvyu, @andrewsykim, @Catch-Bull, @zcin, @bveeramani, @rynewang, @angelinalg, @matthewdeng, @jjyao, @kira-lin, @harborn, @hongchaodeng, @peytondmurray, @aslonnie, @timkpaine, @982945902, @maxpumperla, @stephanie-wang, @ruisearch42, @alanwguo, @can-anyscale, @c21, @Atry, @KamenShah, @sven1977, @raulchen

Ray-2.23.0

22 May 23:37
a0947ea

Ray Libraries

Ray Data

🎉 New Features:

  • Add support for using GPUs with map_groups (#45305); see the sketch after this list
  • Add support for using actors with map_groups (#45310)
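
A minimal sketch of the two map_groups additions above, assuming map_groups forwards actor-pool and GPU options (concurrency, num_gpus) the same way map_batches does; the normalization UDF is illustrative:

    import ray

    ds = ray.data.from_items([{"group": i % 2, "value": float(i)} for i in range(100)])

    class Normalizer:
        # A callable class lets Ray Data run the UDF inside long-lived actors.
        def __call__(self, batch):
            values = batch["value"]
            batch["value"] = (values - values.mean()) / (values.std() + 1e-8)
            return batch

    result = ds.groupby("group").map_groups(
        Normalizer,
        batch_format="pandas",
        concurrency=2,   # assumed: size of the actor pool
        num_gpus=0.25,   # assumed: fractional GPU per actor; requires GPU nodes
    )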

💫 Enhancements:

  • Refine exception handling for Arrow data conversion (#45294)

🔨 Fixes:

  • Fix Ray databricks UC reader with dynamic Databricks notebook scope token (#45153)
  • Fix bug where you can't return objects and arrays from a UDF (#45287)
  • Fix bug where map_groups triggers execution during input validation (#45314)

Ray Tune

🔨 Fixes:

  • [tune] Fix PB2 scheduler error resulting from trying to sort by Trial objects (#45161)

Ray Serve

🔨 Fixes:

  • Log application unhealthy errors at error level instead of warning level (#45211)

RLlib

💫 Enhancements:

  • Examples and tuned_examples learning tests for the new API stack are now self-executable (they no longer require a third-party script to run); WandB support added. (#45023)

🔨 Fixes:

  • Fix result dict “spam” (duplicate, deprecated keys, e.g. “sampler_results” dumped into top level). (#45330)

📖 Documentation:

  • Add example for training with fractional GPUs on new API stack. (#45379)
  • Cleanup examples folder and remove deprecated sub directories. (#45327)

Ray Core

💫 Enhancements:

  • [Logs] Add runtime env started logs to job driver (#45255)
  • ray.util.collective now supports torch.bfloat16 (#39845)
  • [Core] Better propagate node death information (#45128)

🔨 Fixes:

  • [Core] Fix worker process leaks after job finishes (#44214)

Many thanks to all those who contributed to this release: @hongchaodeng, @khluu, @antoni-jamiolkowski, @ameroyer, @bveeramani, @can-anyscale, @WeichenXu123, @peytondmurray, @jackhumphries, @kevin85421, @jjyao, @robcaulk, @rynewang, @scottsun94, @swang, @GeneDer, @zcin, @ruisearch42, @aslonnie, @angelinalg, @raulchen, @ArthurBook, @sven1977, @wuxibin89

Ray-2.22.0

14 May 23:39
a8ab7b8

Ray Libraries

Ray Data

🎉 New Features:

  • Add function to dynamically generate ray_remote_args for Map APIs (#45143); see the sketch after this list
  • Allow manually setting resource limits for training jobs (#45188)
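
A minimal sketch of the dynamic ray_remote_args hook above, assuming it is exposed as a ray_remote_args_fn argument on the map APIs (the parameter name and returned options are assumptions based on the release note):

    import ray

    ds = ray.data.range(1000)

    def dynamic_remote_args():
        # Assumed hook: called when tasks are launched, so scheduling options
        # can be generated dynamically instead of being fixed up front.
        return {"num_cpus": 1}

    ds = ds.map_batches(
        lambda batch: batch,
        ray_remote_args_fn=dynamic_remote_args,  # assumed parameter name
    )
    print(ds.take(1))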

💫 Enhancements:

  • Introduce abstract interface for data autoscaling (#45002)
  • Add debugging info for SplitCoordinator (#45226)

🔨 Fixes:

  • Don’t show AllToAllOperator progress bar if the disable flag is set (#45136)
  • Don't load Arrow PyExtensionType by default (#45084)
  • Don't raise batch size error if num_gpus=0 (#45202)

Ray Train

💫 Enhancements:

  • [XGBoost][LightGBM] Update RayTrainReportCallback to only save checkpoints on rank 0 (#45083)

Ray Core

🔨 Fixes:

  • Fix the CPU percentage metrics for the dashboard process (#45124)

Dashboard

💫 Enhancements:

  • Improvements to log viewer so line numbers do not get selected when copying text.
  • Improvements to the log viewer to avoid unnecessary re-rendering which causes text selection to clear.

Many thanks to all those who contributed to this release: @justinvyu, @simonsays1980, @chris-ray-zhang, @kevin85421, @angelinalg, @rynewang, @brycehuang30, @alanwguo, @jjyao, @shaikhismail, @khluu, @can-anyscale, @bveeramani, @jrosti, @WeichenXu123, @MortalHappiness, @raulchen, @scottjlee, @ruisearch42, @aslonnie, @alexeykudinkin

Ray-2.21.0

08 May 20:34
a912be8

Ray Libraries

Ray Data

🎉 New Features:

  • Add read_lance API to read Lance Dataset (#45106)
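
A minimal sketch of the new read_lance API (the URI is illustrative):

    import ray

    # Read a Lance dataset from a local path or cloud-storage URI.
    ds = ray.data.read_lance("s3://my-bucket/my_dataset.lance")
    print(ds.schema())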

🔨 Fixes:

  • Retry RaySystemError application errors (#45079)

📖 Documentation:

  • Fix broken references in data documentation (#44956)

Ray Train

📖 Documentation:

  • Fix broken links in Train documentation (#44953)

Ray Tune

📖 Documentation:

  • Update Hugging Face example to add reference (#42771)

🏗 Architecture refactoring:

  • Remove deprecated ray.air.callbacks modules (#45104)

Ray Serve

💫 Enhancements:

  • Allow methods to pass a @serve.batch type hint (#45004); see the sketch after this list
  • Allow configuring Serve control loop interval (#45063)
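
A minimal sketch of a type-hinted @serve.batch method (the deployment, batch sizes, and type hints are illustrative):

    from typing import List

    from ray import serve

    @serve.deployment
    class BatchedModel:
        @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
        async def predict(self, inputs: List[str]) -> List[str]:
            # The decorated method receives a whole batch of inputs at once.
            return [text.upper() for text in inputs]

        async def __call__(self, request) -> str:
            text = (await request.body()).decode()
            # Callers pass a single item; @serve.batch groups calls into batches.
            return await self.predict(text)

    app = BatchedModel.bind()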

🔨 Fixes:

  • Fix bug with controller failing to recover for autoscaling deployments (#45118)
  • Fix issue where Ctrl+C after serve run doesn't shut down Serve components (#45087)
  • Fix lightweight update of max_ongoing_requests (#45006)

RLlib

🎉 New Features:

  • New MetricsLogger API now fully functional on the new API stack (working now also inside Learner classes, i.e. loss functions). (#44995, #45109)

💫 Enhancements:

  • Renamings and cleanups (toward the new API stack and more consistent naming schemata): WorkerSet -> EnvRunnerGroup, DEFAULT_POLICY_ID -> DEFAULT_MODULE_ID, config.rollouts() -> config.env_runners(), etc. (#45022, #44920)
  • Changed behavior of EnvRunnerGroup.foreach_worker… methods to new defaults: mark_healthy=True (used to be False) and healthy_only=True (used to be False). (#44993)
  • Fix get_state()/from_state() methods in SingleAgent- and MultiAgentEpisodes. (#45012)

📖 Documentation:

  • Example scripts using the MetricsLogger for env rendering and recording with WandB (#45073, #45107)

Ray Core

🔨 Fixes:

  • Fix ray.init(logging_format) argument being ignored (#45037)
  • Handle unserializable user exception (#44878)
  • Fix dashboard process event loop blocking issues (#45048, #45047)

Dashboard

🔨 Fixes:

  • Fix Nodes page sorting not working correctly.
  • Add back “actors per page” UI control in the actors page.

Many thanks to all those who contributed to this release: @rynewang, @can-anyscale, @scottsun94, @bveeramani, @ceddy4395, @GeneDer, @zcin, @JoshKarpel, @nikitavemuri, @stephanie-wang, @jackhumphries, @matthewdeng, @yash97, @simonsays1980, @peytondmurray, @evalaiyc98, @c21, @alanwguo, @shrekris-anyscale, @kevin85421, @hongchaodeng, @sven1977, @st--, @khluu

Ray-2.20.0

01 May 21:58
5708e75

Ray Libraries

Ray Data

💫 Enhancements:

  • Dedupe repeated schema during ParquetDatasource metadata prefetching (#44750)
  • Update map_groups implementation to better handle large outputs (#44862)
  • Deprecate prefetch_batches arg of iter_rows and change default value (#44982)
  • Change the default behavior to not create directories on S3 writes (#44972)
  • Make internal UDF names more descriptive (#44985)
  • Make name a required argument for AggregateFn (#44880)
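
A minimal sketch of an AggregateFn with the now-required name argument; the aggregation itself is illustrative and the exact keyword arguments are assumed from the current Ray Data API:

    import ray
    from ray.data.aggregate import AggregateFn

    ds = ray.data.range(100)  # rows look like {"id": 0}, {"id": 1}, ...

    sum_of_squares = AggregateFn(
        init=lambda key: 0,
        accumulate_row=lambda acc, row: acc + row["id"] ** 2,
        merge=lambda a, b: a + b,
        name="sum_of_squares",  # now required (#44880)
    )
    print(ds.aggregate(sum_of_squares))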

📖 Documentation:

  • Add key concepts to and revise "Data Internals" page (#44751)

Ray Train

💫 Enhancements:

  • Setup XGBoost CommunicatorContext automatically (#44883)
  • Track Train Run Info with TrainStateActor (#44585)

📖 Documentation:

  • Add documentation for accelerator_type (#44882)
  • Update Ray Train example titles (#44369)

Ray Tune

💫 Enhancements:

  • Remove trial table when running Ray Train in a Jupyter notebook (#44858)
  • Clean up temporary checkpoint directories for class Trainables (ex: RLlib) (#44366)

📖 Documentation:

  • Fix minor doc format issues (#44865)
  • Remove outdated ScalingConfig references (#44918)

Ray Serve

💫 Enhancements:

  • The handle metric push interval is now configurable with the environment variable RAY_SERVE_HANDLE_METRIC_PUSH_INTERVAL_S (#32920); see the sketch after this list
  • Improve performance of developer API serve.get_app_handle (#44812)
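
A minimal sketch combining the two Serve enhancements above; the env var name comes from the release note, while the deployment and app name are illustrative:

    import os

    # Set before Serve starts so handles push metrics every second (assumed usage).
    os.environ["RAY_SERVE_HANDLE_METRIC_PUSH_INTERVAL_S"] = "1"

    from ray import serve

    @serve.deployment
    class Echo:
        def __call__(self, value: str) -> str:
            return value

    serve.run(Echo.bind(), name="echo_app")

    # The developer API referenced above (#44812).
    handle = serve.get_app_handle("echo_app")
    print(handle.remote("hello").result())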

🔨 Fixes:

  • Fix memory leak in handles for autoscaling deployments (the leak happens when RAY_SERVE_COLLECT_AUTOSCALING_METRICS_ON_HANDLE=1) (#44877)

RLlib

🎉 New Features:

  • Introduce MetricsLogger, a unified API for users of RLlib to log custom metrics and stats in all of RLlib's components (Algorithm, EnvRunners, and Learners). Rolled out on the new API stack for Algorithm (training_step) and EnvRunners (custom callbacks); Learner (custom loss functions) support is in progress. (#44888, #44442)
  • Introduce "inference-only" (slim) mode for RLModules that run inside an EnvRunner (and thus don't require value functions or target networks) (#44797)

💫 Enhancements:

  • MultiAgentEpisodeReplayBuffer for new API stack (preparation for multi-agent support of SAC and DQN) (#44450)
  • AlgorithmConfig cleanup and renaming of properties and methods for better consistency/transparency (#44896)

Ray Core and Ray Clusters

💫 Enhancements:

  • Report GCS internal pubsub buffer metrics and cap message size (#44749)

🔨 Fixes:

  • Fix task submission never returning when a network partition happens (#44692)
  • Fix incorrect use of ssh port forward option. (#44973)
  • Make sure dashboard will exit if grpc server fails (#44928)
  • Make sure dashboard agent will exit if grpc server fails (#44899)

Thanks @can-anyscale, @hongchaodeng, @zcin, @marwan116, @khluu, @bewestphal, @scottjlee, @andrewsykim, @anyscalesam, @MortalHappiness, @justinvyu, @JoshKarpel, @woshiyyya, @rynewang, @Abirdcfly, @omatthew98, @sven1977, @marcelocarmona, @rueian, @mattip, @angelinalg, @aslonnie, @matthewdeng, @abizjakpro, @simonsays1980, @jjyao, @terraflops1048576, @hongpeng-guo, @stephanie-wang, @bw-matthew, @bveeramani, @ruisearch42, @kevin85421, @Tongruizhe

Many thanks to all those who contributed to this release!

Ray-2.12.0

25 Apr 21:50

Ray Libraries

Ray Data

🎉 New Features:

  • Store Ray Data logs in special subdirectory (#44743)

💫 Enhancements:

  • Add in local_read option to from_torch (#44752)
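
A minimal sketch of the new local_read option; the torchvision dataset is illustrative, and the keyword name is taken from the release note:

    import ray
    import torchvision

    # Any map-style torch dataset works here; MNIST is just an example.
    mnist = torchvision.datasets.MNIST("/tmp/mnist", download=True)

    # local_read reads the torch dataset on the local node instead of
    # shipping it to remote read tasks (assumed behavior per #44752).
    ds = ray.data.from_torch(mnist, local_read=True)
    print(ds.take(1))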

🔨 Fixes:

  • Fix the config to disable progress bar (#44342)

📖 Documentation:

  • Clarify deprecated Datasource docstrings (#44790)

Ray Train

🔨 Fixes:

  • Disable gathering the full state dict in RayFSDPStrategy for lightning>2.1 (#44569)

Ray Tune

💫 Enhancements:

  • Remove spammy log for "new output engine" (#44824)
  • Enable isort (#44693)

Ray Serve

🔨 Fixes:

  • [Serve] Fix getting attributes on stdout during Serve logging redirect (#44787)

RLlib

🎉 New Features:

  • Support of images and video logging in WandB (env rendering example script for the new API stack coming up). (#43356)

💫 Enhancements:

  • Better support and separation-of-concerns for model_config_dict in new API stack. (#44263)
  • Added example script to pre-train an RLModule in single-agent fashion, then bring checkpoint into multi-agent setup and continue training. (#44674)
  • More example scripts were translated from the old to the new API stack: curriculum learning, custom gym env, etc. (#44706, #44707, #44735, #44841)

Ray Core and Ray Clusters

🔨 Fixes:

  • Fix GetAllJobInfo is_running_tasks not returning the correct value when the driver starts Ray (#44459)

Thanks

Many thanks to all those who contributed to this release!
@can-anyscale, @hongpeng-guo, @sven1977, @zcin, @shrekris-anyscale, @liuxsh9, @jackhumphries, @GeneDer, @woshiyyya, @simonsays1980, @omatthew98, @andrewsykim, @n30111, @architkulkarni, @bveeramani, @aslonnie, @alexeykudinkin, @WeichenXu123, @rynewang, @matthewdeng, @angelinalg, @c21

Ray-2.11.0

17 Apr 23:31

Release Highlights

  • [data] Support reading Avro files with ray.data.read_avro
  • [train] Added experimental support for AWS Trainium (Neuron) and Intel HPU.

Ray Libraries

Ray Data

🎉 New Features:

  • Support reading Avro files with ray.data.read_avro (#43663)
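
A minimal sketch of the new Avro reader (the path is illustrative):

    import ray

    # Read one or more Avro files from local disk or cloud storage.
    ds = ray.data.read_avro("s3://my-bucket/events/part-0000.avro")
    print(ds.schema())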

💫 Enhancements:

  • Pin ipywidgets==7.7.2 to enable Data progress bars in VSCode Web (#44398)
  • Change log level for ignored exceptions (#44408)

🔨 Fixes:

  • Change Parquet encoding ratio lower bound from 2 to 1 (#44470)
  • Fix throughput time calculations for metrics (#44138)
  • Fix nested ragged numpy.ndarray (#44236)
  • Fix Ray debugger incompatibility caused by trimmed error stack trace (#44496)

📖 Documentation:

  • Update "Data Loading and Preprocessing" doc (#44165)
  • Move imports into TFPredictor in batch inference example (#44434)

Ray Train

🎉 New Features:

  • Add experimental support for AWS Trainium (Neuron) (#39130)
  • Add experimental support for Intel HPU (#43343)

💫 Enhancements:

  • Log a deprecation warning for local_dir and related environment variables (#44029)
  • Enforce xgboost>=1.7 for XGBoostTrainer usage (#44269)

🔨 Fixes:

  • Fix ScalingConfig(accelerator_type) to request an appropriate resource amount (#44225)
  • Fix maximum recursion issue when serializing exceptions (#43952)
  • Remove base config deepcopy when initializing the trainer actor (#44611)

🏗 Architecture refactoring:

  • Remove deprecated BatchPredictor (#43934)

Ray Tune

💫 Enhancements:

  • Add support for new style lightning import (#44339)
  • Log a deprecation warning for local_dir and related environment variables (#44029)

🏗 Architecture refactoring:

  • Remove scikit-optimize search algorithm (#43969)

Ray Serve

🔨 Fixes:

  • Dynamically-created applications will no longer be deleted when a config is PUT via the REST API (#44476).
  • Fix _to_object_ref memory leak (#43763)
  • Log warning to reconfigure max_ongoing_requests if max_batch_size is less than max_ongoing_requests (#43840)
  • Deployment fails to start with ModuleNotFoundError in Ray 2.10 (#44329)
    • This was fixed by reverting the original core changes on the sys.path behavior. Revert "[core] If there's working_dir, don't set _py_driver_sys_path." (#44435)
  • The batch_queue_cls parameter is removed from the @serve.batch decorator (#43935)

RLlib

🎉 New Features:

  • New API stack: DQN Rainbow is now available for single-agent (#43196, #43198, #43199)
  • PrioritizedEpisodeReplayBuffer is available for off-policy learning using the EnvRunner API (SingleAgentEnvRunner) and supports random n-step sampling (#42832, #43258, #43458, #43496, #44262)

💫 Enhancements:

  • Restructured examples/ folder; started moving example scripts to the new API stack (#44559, #44067, #44603)
  • Evaluation do-over: Deprecate enable_async_evaluation option (in favor of existing evaluation_parallel_to_training setting). (#43787)
  • Add module_for API to MultiAgentEpisode (analogous to the policy_for API of the old Episode classes). (#44241)
  • All rllib_contrib old stack algorithms have been removed from rllib/algorithms (#43656)

Ray Core and Ray Clusters

🎉 New Features:

  • Added Ray check-open-ports CLI for checking potential open ports to the public (#44488)

💫 Enhancements:

  • Support nodes sharing the same spilling directory without conflicts. (#44487)
  • Create two subclasses of RayActorError to distinguish between actor died (ActorDiedError) and actor temporarily unavailable (ActorUnavailableError) cases.

🔨 Fixes:

  • Fixed the ModuleNotFoundError issue introduced in 2.10 (#44435)
  • Fixed an issue where the agent process was using too much CPU (#44348)
  • Fixed race condition in multi-threaded actor creation (#44232)
  • Fixed several streaming generator bugs (#44079, #44257, #44197)
  • Fixed an issue where user exceptions raised from tasks cannot be subclassed (#44379)

Dashboard

💫 Enhancements:

  • Add serve controller metrics to serve system dashboard page (#43797)
  • Add Serve Application rows to Serve top-level deployments details page (#43506)
  • [Actor table page enhancements] Include "NodeId", "CPU", "Memory", "GPU", "GRAM" columns in the actor table page. Add sort functionality to resource utilization columns. Enable searching table by "Class" and "Repr". (#42588) (#42633) (#42788)

🔨 Fixes:

  • Fix default sorting of nodes in Cluster table page to first be by "Alive" nodes, then head nodes, then alphabetical by node ID. (#42929)
  • Fix bug where the Serve Deployment detail page fails to load if the deployment is in "Starting" state (#43279)

Docs

💫 Enhancements:

  • Landing page refreshes its look and feel. (#44251)

Thanks

Many thanks to all those who contributed to this release!

@aslonnie, @brycehuang30, @MortalHappiness, @astron8t-voyagerx, @edoakes, @sven1977, @anyscalesam, @scottjlee, @hongchaodeng, @slfan1989, @hebiao064, @fishbone, @zcin, @GeneDer, @shrekris-anyscale, @kira-lin, @chappidim, @raulchen, @c21, @WeichenXu123, @marian-code, @bveeramani, @can-anyscale, @mjd3, @justinvyu, @jackhumphries, @Bye-legumes, @ashione, @alanwguo, @Dreamsorcerer, @KamenShah, @jjyao, @omatthew98, @autolisis, @Superskyyy, @stephanie-wang, @simonsays1980, @davidxia, @angelinalg, @architkulkarni, @chris-ray-zhang, @kevin85421, @rynewang, @peytondmurray, @zhangyilun, @khluu, @matthewdeng, @ruisearch42, @pcmoritz, @mattip, @jerome-habana, @alexeykudinkin

Ray-2.10.0

21 Mar 19:02
09abba2

Release Highlights

The Ray 2.10 release brings important stability improvements and enhancements to Ray Data, which is now generally available (GA).

  • [Data] Ray Data becomes generally available with stability improvements in streaming execution, reading and writing data, better tasks concurrency control, and debuggability improvement with dashboard, logging and metrics visualization.
  • [RLlib] “New API Stack” officially announced as alpha for PPO and SAC.
  • [Serve] Added a default autoscaling policy set via num_replicas="auto" (#42613).
  • [Serve] Added support for active load shedding via max_queued_requests (#42950).
  • [Serve] Added replica queue length caching to the DeploymentHandle scheduler (#42943).
    • This should improve overhead in the Serve proxy and handles.
    • max_ongoing_requests (max_concurrent_queries) is also now strictly enforced (#42947).
    • If you see any issues, please report them on GitHub and you can disable this behavior by setting: RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE=0.
  • [Serve] Renamed the following parameters. Each of the old names will be supported for another release before removal.
    • max_concurrent_queries -> max_ongoing_requests
    • target_num_ongoing_requests_per_replica -> target_ongoing_requests
    • downscale_smoothing_factor -> downscaling_factor
    • upscale_smoothing_factor -> upscaling_factor
  • [Core] Autoscaler v2 is in alpha and can be tried out with KubeRay. It has improved observability and stability compared to v1.
  • [Train] Added support for accelerator types via ScalingConfig(accelerator_type).
  • [Train] Revamped the XGBoostTrainer and LightGBMTrainer to no longer depend on xgboost_ray and lightgbm_ray. A new, more flexible API will be released in a future release.
  • [Train/Tune] Refactored local staging directory to remove the need for local_dir and RAY_AIR_LOCAL_CACHE_DIR.

Ray Libraries

Ray Data

🎉 New Features:

  • Streaming execution stability improvements to avoid memory issues, including per-operator resource reservation, streaming generator output buffer management, and better runtime resource estimation (#43026, #43171, #43298, #43299, #42930, #42504)
  • Metadata read stability improvements to avoid transient AWS errors, including retrying on application-level exceptions, spreading tasks across multiple nodes, and configurable retry intervals (#42044, #43216, #42922, #42759)
  • Allow task concurrency control for read, map, and write APIs (#42849, #43113, #43177, #42637)
  • Data dashboard and statistics improvements with more runtime metrics for each component (#43790, #43628, #43241, #43477, #43110, #43112)
  • Allow specifying application-level errors to retry for actor tasks (#42492)
  • Add num_rows_per_file parameter to file-based writes (#42694); see the sketch after this list
  • Add DataIterator.materialize (#43210)
  • Skip schema call in DataIterator.to_tf if tf.TypeSpec is provided (#42917)
  • Add option to append for Dataset.write_bigquery (#42584)
  • Deprecate legacy components and classes (#43575, #43178, #43347, #43349, #43342, #43341, #42936, #43144, #43022, #43023)
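
A minimal sketch of the num_rows_per_file write option referenced above (the dataset and output path are illustrative):

    import ray

    ds = ray.data.range(10_000)

    # Target roughly 1,000 rows per output Parquet file. As the 2.24.0 notes
    # clarify, this is a best-effort hint rather than a strict limit.
    ds.write_parquet("/tmp/range_parquet", num_rows_per_file=1_000)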

💫 Enhancements:

  • Restructure stdout logging for better readability (#43360)
  • Add a more performant way to read large TFRecord datasets (#42277)
  • Modify ImageDatasource to use Image.BILINEAR as the default image resampling filter (#43484)
  • Reduce internal stack trace output by default (#43251)
  • Perform incremental writes to Parquet files (#43563)
  • Warn on excessive driver memory usage during shuffle ops (#42574)
  • Distributed reads for ray.data.from_huggingface (#42599); see the sketch after this list
  • Remove Stage class and related usages (#42685)
  • Improve stability of reading JSON files to avoid PyArrow errors (#42558, #42357)
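
A minimal sketch of from_huggingface, which with this release can perform distributed reads for supported datasets (the dataset name is illustrative):

    import ray
    from datasets import load_dataset

    # Load a Hugging Face dataset, then hand it to Ray Data.
    hf_ds = load_dataset("imdb", split="train")
    ds = ray.data.from_huggingface(hf_ds)
    print(ds.count())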

🔨 Fixes:

  • Turn off actor locality by default (#44124)
  • Normalize block types before internal multi-block operations (#43764)
  • Fix memory metrics for OutputSplitter (#43740)
  • Fix race condition issue in OpBufferQueue (#43015)
  • Fix early stop for multiple Limit operators. (#42958)
  • Fix deadlocks caused by Dataset.streaming_split that led to job hangs (#42601)

Ray Train

🎉 New Features:

  • Add support for accelerator types via ScalingConfig(accelerator_type) for improved worker scheduling (#43090)
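
A minimal sketch of ScalingConfig(accelerator_type); the accelerator value, worker count, and trainer choice are illustrative:

    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    def train_func():
        ...  # regular Ray Train training loop

    trainer = TorchTrainer(
        train_func,
        scaling_config=ScalingConfig(
            num_workers=4,
            use_gpu=True,
            accelerator_type="A10G",  # schedule workers on nodes with this accelerator
        ),
    )
    # trainer.fit()  # requires a cluster with the requested GPUs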

💫 Enhancements:

  • Add a backend-specific context manager for train_func for setup/teardown logic (#43209)
  • Remove DEFAULT_NCCL_SOCKET_IFNAME to simplify network configuration (#42808)
  • Colocate the Trainer with the rank 0 Worker to improve scheduling behavior (#43115)

🔨 Fixes:

  • Enable scheduling workers with memory resource requirements (#42999)
  • Make path behavior OS-agnostic by using Path.as_posix over os.path.join (#42037)
  • [Lightning] Fix resuming from checkpoint when using RayFSDPStrategy (#43594)
  • [Lightning] Fix deadlock in RayTrainReportCallback (#42751)
  • [Transformers] Fix checkpoint reporting behavior when get_latest_checkpoint returns None (#42953)

📖 Documentation:

  • Enhance docstring and user guides for train_loop_config (#43691)
  • Clarify in ray.train.report docstring that it is not a barrier (#42422)
  • Improve documentation for prepare_data_loader shuffle behavior and set_epoch (#41807)

🏗 Architecture refactoring:

  • Simplify XGBoost and LightGBM Trainer integrations. Implemented XGBoostTrainer and LightGBMTrainer as DataParallelTrainer. Removed dependency on xgboost_ray and lightgbm_ray. (#42111, #42767, #43244, #43424)
  • Refactor local staging directory to remove the need for local_dir and RAY_AIR_LOCAL_CACHE_DIR. Add isolation between driver and distributed worker artifacts so that large files written by workers are not uploaded implicitly. Results are now only written to storage_path, rather than having another copy in the user’s home directory (~/ray_results). (#43369, #43403, #43689)
  • Split the overloaded ray.train.torch.get_device into a separate get_devices API for multi-GPU worker setups (#42314); see the sketch after this list
  • Refactor restoration configuration to be centered around storage_path (#42853, #43179)
  • Deprecations related to SyncConfig (#42909)
  • Remove deprecated preprocessor argument from Trainers (#43146, #43234)
  • Hard-deprecate MosaicTrainer and remove SklearnTrainer (#42814)
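
A minimal sketch of the get_device / get_devices split referenced above, inside a training function (the scaling configuration is illustrative):

    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer, get_device, get_devices

    def train_func():
        device = get_device()    # this worker's primary device
        devices = get_devices()  # all devices assigned to this worker
        print(device, devices)

    trainer = TorchTrainer(
        train_func,
        scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
    )
    # trainer.fit()  # requires a cluster with at least 2 GPUs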

Ray Tune

💫 Enhancements:

  • Increase the minimum number of allowed pending trials for faster auto-scaleup (#43455)
  • Add support to TBXLogger for logging images (#37822)
  • Improve validation of Experiment(config) to handle RLlib AlgorithmConfig (#42816, #42116)

🔨 Fixes:

  • Fix reuse_actors error on actor cleanup for function trainables (#42951)
  • Make path behavior OS-agnostic by using Path.as_posix over os.path.join (#42037)

🏗 Architecture refactoring:

  • Refactor local staging directory to remove the need for local_dir and RAY_AIR_LOCAL_CACHE_DIR. Add isolation between driver and distributed worker artifacts so that large files written by workers are not uploaded implicitly. Results are now only written to storage_path, rather than having another copy in the user’s home directory (~/ray_results). (#43369, #43403, #43689)
  • Deprecations related to SyncConfig and chdir_to_trial_dir (#42909)
  • Refactor restoration configuration to be centered around storage_path (#42853, #43179)
  • Add back NevergradSearch (#42305)
  • Clean up invalid checkpoint_dir and reporter deprecation notices (#42698)

Ray Serve

🎉 New Features:

  • Added support for active load shedding via max_queued_requests (#42950); see the sketch after this list.
  • Added a default autoscaling policy set via num_replicas="auto" (#42613).
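
A minimal sketch combining the two options above on a single deployment (the queue limit and the deployment body are illustrative):

    from ray import serve

    @serve.deployment(
        num_replicas="auto",     # opt into the new default autoscaling policy (#42613)
        max_queued_requests=16,  # shed load once the request queue is full (#42950)
    )
    class Model:
        async def __call__(self, request) -> str:
            return "ok"

    serve.run(Model.bind())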

🏗 API Changes:

  • Renamed the following parameters. Each of the old names will be supported for another release before removal.
    • max_concurrent_queries to max_ongoing_requests
    • target_num_ongoing_requests_per_replica to target_ongoing_requests
    • downscale_smoothing_factor to downscaling_factor
    • upscale_smoothing_factor to upscaling_factor
  • WARNING: the following default values will change in Ray 2.11:
    • Default for max_ongoing_requests will change from 100 to 5.
    • Default for target_ongoing_requests will change from 1 to 2.

💫 Enhancements:

  • Add RAY_SERVE_LOG_ENCODING env to set the global logging behavior for Serve (#42781).
  • Configure Serve's gRPC proxy to allow large payloads (#43114).
  • Add blocking flag to serve.run() (#43227).
  • Add actor id and worker id to Serve structured logs (#43725).
  • Added replica queue length caching to the DeploymentHandle scheduler (#42943).
    • This should improve overhead in the Serve proxy and handles.
    • max_ongoing_requests (max_concurrent_queries) is also now strictly enforced (#42947).
    • If you see any issues, please report them on GitHub and you can disable this behavior by setting: RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE=0.
  • Autoscaling metrics (trackin...

Ray-2.9.3

22 Feb 19:57
62655e1

This patch release contains fixes for Ray Core, Ray Data, and Ray Serve.

Ray Core

🔨 Fixes:

  • Fix protobuf breaking change by adding a compat layer. (#43172)
  • Bump task failure logs to warnings to make sure failures can be troubleshot (#43147)
  • Fix placement group leaks (#42942)

Ray Data

🔨 Fixes:

  • Skip schema call in to_tf if tf.TypeSpec is provided (#42917)
  • Skip recording memory spilled stats when get_memory_info_reply fails (#42824)

Ray Serve

🔨 Fixes:

  • Fix DeploymentStateManager qualifying replicas as running prematurely (#43075)

Thanks

Many thanks to all those who contributed to this release!

@rynewang, @GeneDer, @alexeykudinkin, @edoakes, @c21, @rkooo567

Ray-2.9.2

06 Feb 01:23
fce7a36

This patch release contains fixes for Ray Core, Ray Data, and Ray Serve.

Ray Core

🔨 Fixes:

  • Fix out of disk test on release branch (#42724)

Ray Data

🔨 Fixes:

  • Fix failing huggingface test (#42727)
  • Fix deadlocks caused by streaming_split (#42601) (#42755)
  • Fix locality config not being respected in DataConfig (#42204) (#42722)
  • Stability & accuracy improvements for Data+Train benchmark (#42027)
  • Add retry for _sample_fragment during ParquetDatasource._estimate_files_encoding_ratio() (#42759) (#42774)
  • Skip recording memory spilled stats when get_memory_info_reply fails (#42824) (#42834)

Ray Serve

🔨 Fixes:

  • Pin the fastapi & starlette versions to avoid breaking the proxy (#42740)
  • Fix IS_PYDANTIC_2 logic for pydantic<1.9.0 (#42704) (#42708)
  • Fix missing message body for JSON log formats (#42729) (#42874)

Thanks

Many thanks to all those who contributed to this release!

@c21, @raulchen, @can-anyscale, @edoakes, @peytondmurray, @scottjlee, @aslonnie, @architkulkarni, @GeneDer, @Zandew, @sihanwang41