Releases: ray-project/ray
Ray-2.5.1
The Ray 2.5.1 patch release adds wheels for MacOS for Python 3.11.
It also contains fixes for multiple components, along with fixes for our documentation.
Ray Train
🔨 Fixes:
- Don't error on eventual success when running with auto-recovery (#36266)
Ray Core
🎉 New Features:
- Build Python wheels on Mac OS for Python 3.11 (#36373)
🔨 Fixes:
Ray-2.5.0
The Ray 2.5 release focuses on a number of enhancements and improvements across the Ray ecosystem, including:
- Training LLMs with Ray Train: New support for checkpointing distributed models, and PyTorch Lightning FSDP to enable training large models on Ray Train’s LightningTrainer
- LLM applications with Ray Serve & Core: New support for streaming responses and model multiplexing
- Improvements to Ray Data: In 2.5, strict mode is enabled by default. This means that schemas are required for all Datasets, and standalone Python objects are no longer supported. Also, the default batch format is fixed to NumPy, giving better performance for batch inference (see the sketch after this list).
- RLlib enhancements: New support for multi-GPU training, along with ray-project/rllib-contrib to contain the community-contributed algorithms
- Core enhancements: Enable lightweight resource broadcasting to improve reliability and scalability, and add many enhancements for Core reliability, logging, the scheduler, and worker processes.
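A minimal sketch of what the new NumPy default batch format looks like under strict mode; the column name and transformation are illustrative, not part of the release itself:

```python
import ray

# Under strict mode, batches arrive as dicts of NumPy arrays keyed by column name.
ds = ray.data.from_items([{"value": i} for i in range(8)])

def double(batch):
    batch["value"] = batch["value"] * 2  # batch["value"] is a NumPy array
    return batch

print(ds.map_batches(double).take(2))
```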
Ray Libraries
Ray AIR
💫Enhancements:
- Experiment restore stress tests (#33706)
- Context-aware output engine
- Add parameter columns to status table (#35388)
- Context-aware output engine: Add docs, experimental feature docs, prepare default on (#35129)
- Fix trial status at end (more info + cut off) (#35128)
- Improve leaked mentions of Tune concepts (#35003)
- Improve passed time display (#34951)
- Use flat metrics in results report, use Trainable._progress_metrics (#35035)
- Print experiment information at experiment start (#34952)
- Print single trial config + results as table (#34788)
- Print out worker ip for distributed train workers. (#33807)
- Minor fix to print configuration on start. (#34575)
- Check `air_verbosity` against None. (#33871)
- Better wording for empty config. (#33811)
- Flatten config and metrics before passing to mlflow (#35074)
- Remote_storage: Prefer fsspec filesystems over native pyarrow (#34663)
- Use filesystem wrapper to exclude files from upload (#34102)
- GCE test variants for air_benchmark and air_examples (#34466)
- New storage path configuration
🔨 Fixes:
- Store unflattened metrics in _TrackedCheckpoint (#35658) (#35706)
- Fix `test_tune_torch_get_device_gpu` race condition (#35004)
- Deflake test_e2e_train_flow.py (#34308)
- Pin deepspeed version for now to unblock ci. (#34406)
- Fix AIR benchmark configuration link failure. (#34597)
- Fix unused config building function in lightning MNIST example.
📖Documentation:
- Change doc occurrences of ray.data.Dataset to ray.data.Datastream (#34520)
- DreamBooth example: Fix code for batch size > 1 (#34398)
- Synced tabs in AIR getting started (#35170)
- New Ray AIR link for try it out (#34924)
- Correctly Render the Enumerate Numbers in `convert_torch_code_to_ray_air` (#35224)
Ray Data Processing
🎉 New Features:
- Implement Strict Mode and enable it by default.
- Add column API to Dataset (#35241)
- Configure progress bars via DataContext (#34638)
- Support using concurrent actors for ActorPool (#34253)
- Add take_batch API for collecting data in the same format as iter_batches and map_batches (#34217); see the sketch below
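A hedged sketch of `take_batch`: it pulls a single batch onto the driver in the same format that `iter_batches`/`map_batches` would produce; the dataset below is illustrative:

```python
import ray

ds = ray.data.range(100)
# Collect one batch of up to 5 rows; by default the batch is a dict of NumPy arrays.
batch = ds.take_batch(batch_size=5)
print(batch)
```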
💫Enhancements:
- Improve map batches error message for strict mode migration (#35368)
- Improve docstring and warning message for from_huggingface (#35206)
- Improve notebook widget display (#34359)
- Implement some operator fusion logic for the new backend (#35178 #34847)
- Use wait based prefetcher by default (#34871)
- Implement limit physical operator (#34705 #34844)
- Require compute spec to be explicitly spelled out (#34610)
- Log a warning if the batch size is misconfigured in a way that would grossly reduce parallelism for actor pool. (#34594)
- Add alias parameters to the aggregate function, and add quantile fn (#34358)
- Improve repr for Arrow Table and pandas types (#34286 #34502)
- Defer first block computation when reading a Datasource with schema information in metadata (#34251)
- Improve handling of KeyboardInterrupt (#34441)
- Validate aggregation key in Aggregate LogicalOperator (#34292)
- Add usage tag for which block formats are used (#34384)
- Validate sort key in Sort LogicalOperator (#34282)
- Combine_chunks before chunking pyarrow.Table block into batches (#34352)
- Use read stage name for naming Data-read tasks on Ray Dashboard (#34341)
- Update path expansion warning (#34221)
- Improve state initialization for ActorPoolMapOperator (#34037)
🔨 Fixes:
- Fix ipython representation (#35414)
- Fix bugs in handling of nested ndarrays (and other complex object types) (#35359)
- Capture the context when the dataset is first created (#35239)
- Cooperatively exit producer threads for iter_batches (#34819)
- Autoshutdown executor threads when deleted (#34811)
- Fix backpressure when reading directly from input datasource (#34809)
- Fix backpressure handling of queued actor pool tasks (#34254)
- Fix row count after applying filter (#34372)
- Remove unnecessary setting of global logging level to INFO when using Ray Data (#34347)
- Make sure the tf and tensor iteration work in dataset pipeline (#34248)
- Fix '_unwrap_protocol' for Windows systems (#31296)
📖Documentation:
Ray Train
🎉 New Features:
- Experimental support for distributed checkpointing (#34709)
💫Enhancements:
- LightningTrainer: Enable prog bar (#35350)
- LightningTrainer enable checkpoint full dict with FSDP strategy (#34967)
- Support FSDP Strategy for LightningTrainer (#34148)
🔨 Fixes:
- Fix HuggingFace -> Transformers wrapping logic (#35276, #35284)
- LightningTrainer always resumes from the latest AIR checkpoint during restoration. (#35617) (#35791)
- Fix lightning trainer devices setting (#34419)
- TorchCheckpoint: Specifying pickle_protocol in `torch.save()` (#35615) (#35790)
📖Documentation:
- Improve visibility of Trainer restore and stateful callback restoration (#34350)
- Fix rendering of diff code-blocks (#34355)
- LightningTrainer Dolly V2 FSDP Fine-tuning Example (#34990)
- Update LightningTrainer MNIST example. (#34867)
- LightningTrainer Advanced Example (#34082, #34429)
🏗 Architecture refactoring:
- Restructure `ray.train` HuggingFace modules (#35270) (#35488)
- Rename _base_dataset to _base_datastream (#34423)
Ray Tune
🎉 New Features:
💫Enhancements:
- Make `Tuner.restore(trainable=...)` a required argument (#34982)
- Enable `tune.ExperimentAnalysis` to pull experiment checkpoint files from the cloud if needed (#34461)
- Add support for nested hyperparams in PB2 (#31502)
- Release test for durable multifile checkpoints (#34860)
- GCE variants for remaining Tune tests (#34572)
- Add tune frequent pausing release test. (#34501)
- Add PyArrow to ray[tune] dependencies (#34397)
- Fix new execution backend for BOHB (#34828)
🔨 Fixes:
- Set config on trial restore (#35000)
- Fix `test_tune_torch_get_device_gpu` race condition (#35004)
- Fix a typo in `tune/execution/checkpoint_manager` state serialization. (#34368)
- Fix tune_scalability_network_overhead by adding `--smoke-test`. (#34167)
- Fix lightning_gpu_tune_.* release test (#35193)
📖Documentation:
🏗 Architecture refactoring:
- Use Ray-provided `tabulate` package (#34789)
Ray Serve
🎉 New Features:
- Add support for JSON logging format. (#35118)
- Add experimental support for model multiplexing (#35399, #35326); see the sketch after this list.
- Added experimental support for HTTP StreamingResponses. (#35720)
- Add support for application builders & arguments (#34584)
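A hedged sketch of the experimental model multiplexing API; the deployment name and the fake "model" below are placeholders rather than anything shipped in this release:

```python
from ray import serve
from starlette.requests import Request

@serve.deployment
class ModelServer:
    @serve.multiplexed(max_num_models_per_replica=2)
    async def get_model(self, model_id: str):
        # Stand-in for loading real model weights for this model_id.
        return f"fake-model-{model_id}"

    async def __call__(self, request: Request):
        # The model ID is read from the request's multiplexing header.
        model_id = serve.get_multiplexed_model_id()
        model = await self.get_model(model_id)
        return {"served_by": model}

app = ModelServer.bind()
```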
💫Enhancements:
- Add more bucket size for histogram metrics. (#35242).
- Add route information into the custom metrics. (#35246)
- Add HTTPProxy details to Serve Dashboard UI (#35159)
- Add status_code to http qps & latency (#35134)
- Stream Serve logs across different drivers (#35070)
- Add health checking for http proxy actors (#34944)
- Better surfacing of errors in serve status (#34773)
- Enable TLS on gRPCIngress if RAY_USE_TLS is on (#34403)
- Wait until replicas have finished recovering (with timeout) to broadcast `LongPoll` updates (#34675)
- Replace `ClassNode` and `FunctionNode` with `Application` in top-level Serve APIs (#34627)
🔨 Fixes:
- Set `app_msg` to empty string by default (#35646)
- Fix dead replica counts in the stats. (#34761)
- Add default app name (#34260)
- gRPC Deployment schema check & minor improvements (#34210)
📖Documentation:
- Clean up API reference and various docstrings (#34711)
- Clean up `RayServeHandle` and `RayServeSyncHandle` docstrings & typing (#34714)
RLlib
🎉 New Features:
- Migrating approximately 25 of the 30 algorithms from RLlib into rllib_contrib. You can review the REP here. In this release we have covered A3C and MAML.
- APPO, IMPALA, and PPO have all been moved to the new Learner and RLModule stack.
- The RLModule now supports checkpointing. (#34717, #34760)
💫Enhancements:
- Intro...
Ray-2.3.1
The Ray 2.3.1 patch release contains fixes for multiple components:
Ray Data Processing
- Support different number of blocks/rows per block in `zip()` (#32795)
Ray Serve
- Revert `serve run` to use Ray Client instead of Ray Jobs (#32976)
- Fix issue with `max_concurrent_queries` being ignored when autoscaling (#32772 and #33022)
Ray Core
- Write Ray address even if Ray node is started with `--block` (#32961)
- Fix Ray on Spark running on layered virtualenv python environment (#32996)
Dashboard
- Fix disk metric showing double the actual value (#32674)
Ray-2.3.0
Release Highlights
- The streaming backend for Ray Datasets is in Developer Preview. It is designed to enable terabyte-scale ML inference and training workloads. Please contact us if you'd like to try it out on your workload, or you can find the preview guide here: https://docs.google.com/document/d/1BXd1cGexDnqHAIVoxTnV3BV0sklO9UXqPwSdHukExhY/edit
- New Information Architecture (Beta): We’ve restructured the Ray dashboard to be organized around user personas and workflows instead of entities.
- Ray-on-Spark is now available (Preview)!: You can launch Ray clusters on Databricks and Spark clusters and run Ray applications. Check out the documentation to learn more.
Ray Libraries
Ray AIR
💫Enhancements:
- Add `set_preprocessor` method to `Checkpoint` (#31721)
- Rename Keras callback and its parameters to be more descriptive (#31627)
- Deprecate MlflowTrainableMixin in favor of setup_mlflow() function (#31295)
- W&B
- Have train_loop_config logged as a config (#31901)
- Allow users to exclude config values with WandbLoggerCallback (#31624)
- Rename WandB `save_checkpoints` to `upload_checkpoints` (#31582)
- Add hook to get project/group for W&B integration (#31035, #31643)
- Use Ray actors instead of multiprocessing for WandbLoggerCallback (#30847)
- Update `WandbLoggerCallback` example (#31625)
- Predictor
- Checkpoints
🔨 Fixes:
- Fix and improve support for HDFS remote storage. (#31940)
- Use specified Preprocessor configs when using stream API. (#31725)
- Support nested Chain in BatchPredictor (#31407)
📖Documentation:
- Restructure API References (#32535)
- API Deprecations (#31777, #31867)
- Various fixes to docstrings, documentation, and examples (#30782, #30791)
🏗 Architecture refactoring:
- Use NodeAffinitySchedulingPolicy for scheduling (#32016)
- Internal resource management refactor (#30777, #30016)
Ray Data Processing
🎉 New Features:
- Lazy execution by default (#31286)
- Introduce streaming execution backend (#31579)
- Introduce DatasetIterator (#31470)
- Add per-epoch preprocessor (#31739)
- Add TorchVisionPreprocessor (#30578)
- Persist Dataset statistics automatically to log file (#30557)
💫Enhancements:
- Async batch fetching for map_batches (#31576)
- Add informative progress bar names to map_batches (#31526)
- Provide a size-in-bytes estimate for MongoDB blocks (#31930)
- Add support for dynamic block splitting to actor pool (#31715)
- Improve str/repr of Dataset to include execution plan (#31604)
- Deal with nested Chain in BatchPredictor (#31407)
- Allow MultiHotEncoder to encode arrays (#31365)
- Allow specifying batch_size when reading Parquet files (#31165)
- Add zero-copy batch API for `ds.map_batches()` (#30000)
- Text dataset should save texts in ArrowTable format (#30963)
- Return ndarray dicts for single-column tabular datasets (#30448)
- Execute randomize_block_order eagerly if it's the last stage for ds.schema() (#30804)
🔨 Fixes:
- Don't drop first dataset when peeking DatasetPipeline (#31513)
- Handle np.array(dtype=object) constructor for ragged ndarrays (#31670)
- Emit warning when starting Dataset execution with no CPU resources available (#31574)
- Fix the bug of eagerly clearing up input blocks (#31459)
- Fix Imputer failing with categorical dtype (#31435)
- Fix schema unification for Datasets with ragged Arrow arrays (#31076)
- Fix Discretizers transforming ignored cols (#31404)
- Fix to_tf when the input feature_columns is a list. (#31228)
- Raise error message if user calls Dataset.iter (#30575)
📖Documentation:
Ray Train
🎉 New Features:
- Add option for per-epoch preprocessor (#31739)
💫Enhancements:
- Change default `NCCL_SOCKET_IFNAME` to blacklist `veth` (#31824)
- Introduce DatasetIterator for bulk and streaming ingest (#31470)
- Clarify which `RunConfig` is used when there are multiple places to specify it (#31959)
- Change `ScalingConfig` to be optional for `DataParallelTrainer`s if already in Tuner `param_space` (#30920)
🔨 Fixes:
- Use specified `Preprocessor` configs when using stream API. (#31725)
- Fix off-by-one AIR Trainer checkpoint ID indexing on restore (#31423)
- Force GBDTTrainer to use distributed loading for Ray Datasets (#31079)
- Fix bad case in ScalingConfig->RayParams (#30977)
- Don't raise TuneError on `fail_fast="raise"` (#30817)
- Report only once in `SklearnTrainer` (#30593)
- Ensure GBDT PGFs match passed ScalingConfig (#30470)
📖Documentation:
- Restructure API References (#32535)
- Remove Ray Client references from Train docs/examples (#32321)
- Various fixes to docstrings, documentation, and examples (#29463, #30492, #30543, #30571, #30782, #31692, #31735)
🏗 Architecture refactoring:
- API Deprecations (#31763)
Ray Tune
💫Enhancements:
- Improve trainable serialization error (#31070)
- Add support for Nevergrad optimizer with extra parameters (#31015)
- Add timeout for experiment checkpoint syncing to cloud (#30855)
- Move `validate_upload_dir` to Syncer (#30869)
- Enable experiment restore from moved cloud uri (#31669)
- Save and restore stateful callbacks as part of experiment checkpoint (#31957)
🔨 Fixes:
- Do not default to reuse_actors=True when mixins are used (#31999)
- Only keep cached actors if search has not ended (#31974)
- Fix best trial in ProgressReporter with nan (#31276)
- Make ResultGrid return cloud checkpoints (#31437)
- Wait for final experiment checkpoint sync to finish (#31131)
- Fix CheckpointConfig validation for function trainables (#31255)
- Fix checkpoint directory assignment for new checkpoints created after restoring a function trainable (#31231)
- Fix `AxSearch` save and nan/inf result handling (#31147)
- Fix `AxSearch` search space conversion for fixed list hyperparameters (#31088)
- Restore searcher and scheduler properly on `Tuner.restore` (#30893)
- Fix progress reporter `sort_by_metric` with nested metrics (#30906)
- Don't raise TuneError on `fail_fast="raise"` (#30817)
- Fix duplicate printing when trial is done (#30597)
📖Documentation:
- Restructure API references (#32449)
- Remove Ray Client references from Tune docs/examples (#32321)
- Various fixes to docstrings, documentation, and examples (#29581, #30782, #30571, #31045, #31793, #32505)
🏗 Architecture refactoring:
- Deprecate passing a custom trial executor (#31792)
- Move signal handling into separate method (#31004)
- Update staged resources in a fixed counter for faster lookup (#32087)
- Rename `overwrite_trainable` argument in Tuner restore to `trainable` (#32059)
Ray Serve
🎉 New Features:
- Serve python API to support multi application (#31589)
💫Enhancements:
- Add exponential backoff when retrying replicas (#31436)
- Enable Log Rotation on Serve (#31844)
- Use tasks/futures for asyncio.wait (#31608)
- Change target_num_ongoing_requests_per_replica to positive float (#31378)
🔨 Fixes:
- Upgrade deprecated calls (#31839)
- Change Gradio integration to take a builder function to avoid serialization issues (#31619)
- Add initial health check before marking a replica as RUNNING (#31189)
📖Documentation:
RLlib
🎉 New Features:
- Gymnasium is now supported. (Notes)
- Connectors are now activated by default (#31693, 30388, 31618, 31444, 31092)
- Contribution of LeelaChessZero algorithm for playing chess in a MultiAgent env. (#31480)
💫Enhancements:
- [RLlib] Error out if action_dict is empty in MultiAgentEnv. (#32129)
- [RLlib] Upgrade tf eager code to no longer use `experimental_relax_shapes` (but `reduce_retracing` instead). (#29214)
- [RLlib] Reduce SampleBatch counting complexity (#30936)
- [RLlib] Use PyTorch vectorized max() and sum() in SampleBatch.init when possible (#28388)
- [RLlib] Support multi-gpu CQL for torch (tf already supported). (#31466)
- [RLlib] Introduce IMPALA off_policyness test with GPU (#31485)
- [RLlib] Properly serialize and restore StateBufferConnector states for policy stashing (#31372)
- [RLlib] Clean up deprecated concat_samples calls (#31391)
- [RLlib] Better support MultiBinary spaces by treating Tuples as superset of them in ComplexInputNet. (#28900)
- [RLlib] Add backward compatibility to MeanStdFilter to restore from older checkpoints. (#30439)
- [RLlib] Clean up some signatures for compute_actions. (#31241)
- [RLlib] Simplify logging configuration. (#30863)
- [RLlib] Remove native Keras Models. (#30986)
- [RLlib] Convert PolicySpec to a readable format when converting to_dict(). (#31146)
- [RLlib] Issue 30394: Add proper `__str__()` method to PolicyMap. (#31098)
- [RLlib] Issue 30840: Option to only checkpoint policies that are trainable. (#31133)
- [RLlib] Deprecate (delete) `contrib` folder. (#30992)
- [RLlib] Better behavior if user does not specify stopping condition in RLlib CLI. (#31078)
- ...
Ray-2.2.0
Release Highlights
Ray 2.2 is a stability-focused release, featuring stability improvements across many Ray components.
- Ray Jobs API is now GA. The Ray Jobs API allows you to submit locally developed applications to a remote Ray Cluster for execution. It simplifies the experience of packaging, deploying, and managing a Ray application.
- Ray Dashboard has received a number of improvements, such as the ability to see cpu flame graphs of your Ray workers and new metrics for memory usage.
- The Out-Of-Memory (OOM) Monitor is now enabled by default. This will increase the stability of memory-intensive applications on top of Ray.
- [Ray Data] We’ve heard numerous users report that when files are too large, Ray Data can have out-of-memory or performance issues. In this release, we’re enabling dynamic block splitting by default, which will address the above issues by avoiding holding too much data in memory.
Ray Libraries
Ray AIR
🎉 New Features:
- Add a NumPy first path for Torch and TensorFlow Predictors (#28917)
💫Enhancements:
- Suppress "NumPy array is not writable" error in torch conversion (#29808)
- Add node rank and local world size info to session (#29919)
🔨 Fixes:
- Fix MLflow database integrity error (#29794)
- Fix ResourceChangingScheduler dropping PlacementGroupFactory args (#30304)
- Fix bug passing 'raise' to FailureConfig (#30814)
- Fix reserved CPU warning if no CPUs are used (#30598)
📖Documentation:
- Fix examples and docs to specify batch_format in BatchMapper (#30438)
🏗 Architecture refactoring:
- Deprecate Wandb mixin (#29828)
- Deprecate Checkpoint.to_object_ref and Checkpoint.from_object_ref (#30365)
Ray Data Processing
🎉 New Features:
- Support all PyArrow versions released by Apache Arrow (#29993, #29999)
- Add `select_columns()` to select a subset of columns (#29081)
- Add `write_tfrecords()` to write TFRecord files (#29448); a combined sketch of a few of these APIs follows this list
- Support MongoDB data source (#28550)
- Enable dynamic block splitting by default (#30284)
- Add `from_torch()` to create dataset from Torch dataset (#29588)
- Add `from_tf()` to create dataset from TensorFlow dataset (#29591)
- Allow to set `batch_size` in `BatchMapper` (#29193)
- Support read/write from/to local node file system (#29565)
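A hedged sketch combining a few of the new 2.2 Dataset APIs; the output path is a placeholder:

```python
import ray

ds = ray.data.range_table(100)            # columnar dataset with a "value" column
ds = ds.select_columns(["value"])         # new select_columns() API
ds.write_tfrecords("/tmp/tfrecords_out")  # new write_tfrecords() API (path is illustrative)
```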
💫Enhancements:
- Add `include_paths` in `read_images()` to return image file path (#30007)
- Print out Dataset statistics automatically after execution (#29876)
- Cast tensor extension type to opaque object dtype in `to_pandas()` and `to_dask()` (#29417)
- Encode number of dimensions in variable-shaped tensor extension type (#29281)
- Fuse AllToAllStage and OneToOneStage with compatible remote args (#29561)
- Change `read_tfrecords()` output from Pandas to Arrow format (#30390)
- Handle all Ray errors in task compute strategy (#30696)
- Allow nested Chain preprocessors (#29706)
- Warn user if missing columns and support `str` exclude in `Concatenator` (#29443)
- Raise ValueError if preprocessor column doesn't exist (#29643)
🔨 Fixes:
- Support custom resource with remote args for `random_shuffle()` (#29276)
- Support custom resource with remote args for `random_shuffle_each_window()` (#29482)
- Add PublicAPI annotation to preprocessors (#29434)
- Tensor extension column concatenation fixes (#29479)
- Fix `iter_batches()` to not return empty batch (#29638)
- Change `map_batches()` to fetch input blocks on-demand (#29289)
- Change `take_all()` to not accept limit argument (#29746)
- Convert between block and batch correctly for `map_groups()` (#30172)
- Fix `stats()` call causing Dataset schema to be unset (#29635)
- Raise error when `batch_format` is not specified for `BatchMapper` (#30366)
- Fix ndarray representation of single-element ragged tensor slices (#30514)
📖Documentation:
- Improve `map_batches()` documentation about execution model and UDF pickle-ability requirement (#29233)
- Improve `to_tf()` docstring (#29464)
Ray Train
🎉 New Features:
💫Enhancements:
🔨 Fixes:
- Propagate DatasetContext to training workers (#29192)
- Show correct error message on training failure (#29908)
- Fix prepare_data_loader with enable_reproducibility (#30266)
- Fix usage of NCCL_BLOCKING_WAIT (#29562)
📖Documentation:
- Deduplicate Train examples (#29667)
🏗 Architecture refactoring:
- Hard deprecate train.report (#29613)
- Remove deprecated Train modules (#29960)
- Deprecate old prepare_model DDP args (#30364)
Ray Tune
🎉 New Features:
- Make `Tuner.restore` work with relative experiment paths (#30363)
- `Tuner.restore` from a local directory that has moved (#29920)
💫Enhancements:
- `with_resources` takes in a `ScalingConfig` (#30259)
- Keep resource specifications when nesting `with_resources` in `with_parameters` (#29740)
- Add `trial_name_creator` and `trial_dirname_creator` to `TuneConfig` (#30123)
- Add option to not override the working directory (#29258)
- Only convert a `BaseTrainer` to `Trainable` once in the Tuner (#30355)
- Dynamically identify PyTorch Lightning Callback hooks (#30045)
- Make `remote_checkpoint_dir` work with query strings (#30125)
- Make cloud checkpointing retry configurable (#30111)
- Sync experiment-checkpoints more often (#30187)
- Update generate_id algorithm (#29900)
🔨 Fixes:
- Catch SyncerCallback failure with dead node (#29438)
- Do not warn in BayesOpt w/ Uniform sampler (#30350)
- Fix `ResourceChangingScheduler` dropping PGF args (#30304)
- Fix Jupyter output with Ray Client and `Tuner` (#29956)
- Fix tests related to `TUNE_ORIG_WORKING_DIR` env variable (#30134)
📖Documentation:
- Add user guide for analyzing results (using `ResultGrid` and `Result`) (#29072)
- Tune checkpointing and Tuner restore docfix (#29411)
- Fix and clean up PBT examples (#29060)
- Fix TrialTerminationReporter in docs (#29254)
🏗 Architecture refactoring:
- Remove hard deprecated SyncClient/Syncer (#30253)
- Deprecate Wandb mixin, move to `setup_wandb()` function (#29828)
Ray Serve
🎉 New Features:
💫Enhancements:
🔨 Fixes:
- Fix log format error (#28760)
- Inherit previous deployment num_replicas (#29686)
- Polish serve run deploy message (#29897)
- Remove calling of get_event_loop from python 3.10
RLlib
🎉 New Features:
- Fault tolerant, elastic WorkerSets: An asynchronous Ray Actor manager class is now used inside all of RLlib’s Algorithms, adding fully flexible fault tolerance to rollout workers and workers used for evaluation. If one or more workers (which are Ray actors) fails - e.g. due to a SPOT instance going down - the RLlib Algorithm will now flexibly wait it out and periodically try to recreate the failed workers. In the meantime, only the remaining healthy workers are used for sampling and evaluation. (#29938, #30118, #30334, #30252, #29703, #30183, #30327, #29953)
💫Enhancements:
- RLlib CLI: A new and enhanced RLlib command line interface (CLI) has been added, allowing for automatically downloading example configuration files, python-based config files (defining an AlgorithmConfig object to use), better interoperability between training and evaluation runs, and many more. For a detailed overview of what has changed, check out the new CLI documentation. (#29204, #29459, #30526, #29661, #29972)
- Checkpoint overhaul: Algorithm checkpoints and Policy checkpoints are now more cohesive and transparent. All checkpoints are now characterized by a directory (with files and maybe sub-directories), rather than a single pickle file. Both Algorithm and Policy classes now have a utility static method (`from_checkpoint()`) for directly instantiating instances from a checkpoint directory without knowing the original configuration used or any other information (having the checkpoint is sufficient; a hedged sketch follows this list). For a detailed overview, see here. (#28812, #29772, #29370, #29520, #29328)
- A new metric for APPO/IMPALA/PPO has been added that measures off-policy’ness: the difference in number of grad-updates the sampler policy has received thus far vs the trained policy’s number of grad-updates thus far. (#29983)
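A hedged sketch of the new checkpoint loading utilities; the checkpoint path is a placeholder for a directory previously written by `algo.save()`:

```python
from ray.rllib.algorithms.algorithm import Algorithm
from ray.rllib.policy.policy import Policy

# Re-instantiate an Algorithm directly from a checkpoint directory,
# without needing the original configuration.
algo = Algorithm.from_checkpoint("/tmp/ppo_checkpoint")
# Policy.from_checkpoint on an Algorithm checkpoint returns the contained
# policies (keyed by policy ID), per the checkpoint overhaul described above.
policies = Policy.from_checkpoint("/tmp/ppo_checkpoint")
```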
🏗 Architecture refactoring:
- AlgorithmConfig classes: All of RLlib’s Algorithms, RolloutWorkers, and other important classes now use AlgorithmConfig objects under the hood, instead of python config dicts. It is no longer recommended (however, still supported) to create a new algorithm (or a Tune+RLlib experiment) using a python dict as configuration. For more details on how to convert your scripts to the new AlgorithmConfig design, see here (a short builder-style sketch follows this list). (#29796, #30020, #29700, #29799, #30096, #29395, #29755, #30053, #29974, #29854, #29546, #30042, #29544, #30079, #30486, #30361)
- Major progress was made on the new Connector API and making sure it can be used (tentatively) with the “config.rollouts(enable_connectors=True)” flag. Will be fully supported, across all of RLlib’s algorithms, in Ray 2.3. (#30307, #30434, #30459, #303...
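A hedged sketch of the builder-style AlgorithmConfig pattern that replaces plain python config dicts; the environment and hyperparameters are illustrative:

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment(env="CartPole-v1")
    .rollouts(num_rollout_workers=2)
    .training(lr=5e-5, train_batch_size=4000)
)
algo = config.build()
print(algo.train())  # run one training iteration
```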
Ray-2.1.0
Release Highlights
- Ray AI Runtime (AIR)
  - Better support for Image-based workloads.
    - Ray Datasets `read_images()` API for loading data.
    - Numpy-based API for user-defined functions in Preprocessor.
  - Ability to read TFRecord input.
    - Ray Datasets `read_tfrecords()` API to read TFRecord files.
- Ray Serve:
- Add support for gRPC endpoint (alpha release). Instead of using an HTTP server, Ray Serve supports gRPC protocol and users can bring their own schema for their use case.
- RLlib:
- Introduce decision transformer (DT) algorithm.
- New hook for callbacks with `on_episode_created()`.
- Learning rate schedule to SimpleQ and PG.
- Ray Core:
- Ray OOM prevention (alpha release).
- Support dynamic generators as task return values.
- Dashboard:
- Time series metrics support.
- Export configuration files can be used in Prometheus or Grafana instances.
- New progress bar in job detail view.
Ray Libraries
Ray AIR
💫Enhancements:
- Improve readability of training failure output (#27946, #28333, #29143)
- Auto-enable GPU for Predictors (#26549)
- Add ability to create TorchCheckpoint from state dict (#27970)
- Add ability to create TensorflowCheckpoint from saved model/h5 format (#28474)
- Add attribute to retrieve URI from Checkpoint (#28731)
- Add all allowable types to WandB Callback (#28888)
🔨 Fixes:
- Handle nested metrics properly as scoring attribute (#27715)
- Fix serializability of Checkpoints (#28387, #28895, #28935)
📖Documentation:
- Miscellaneous updates to documentation and examples (#28067, #28002, #28189, #28306, #28361, #28364, #28631, #28800)
🏗 Architecture refactoring:
- Deprecate Checkpoint.to_object_ref and Checkpoint.from_object_ref (#28318)
- Deprecate legacy train/tune functions in favor of Session (#28856)
Ray Data Processing
🎉 New Features:
- Add read_images (#29177)
- Add read_tfrecords (#28430); a combined sketch of the two new readers follows this list
- Add NumPy batch format to Preprocessor and `BatchMapper` (#28418)
- Ragged tensor extension type (#27625)
- Add KBinsDiscretizer Preprocessor (#28389)
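A minimal sketch of the two new readers; both paths are placeholders for real image and TFRecord data:

```python
import ray

image_ds = ray.data.read_images("s3://my-bucket/images/")       # placeholder path
record_ds = ray.data.read_tfrecords("s3://my-bucket/records/")   # placeholder path
print(image_ds.schema(), record_ds.schema())
```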
💫Enhancements:
- Simplify to_tf interface (#29028)
- Add metadata override and inference in `Dataset.to_dask()` (#28625)
- Prune unused columns before aggregate (#28556)
- Add Dataset.default_batch_format (#28434)
- Add partitioning parameter to read_ functions (#28413)
- Deprecate "native" batch format in favor of "default" (#28489)
- Support None partition field name (#28417)
- Re-enable Parquet sampling and add progress bar (#28021)
- Cap the number of stats kept in StatsActor and purge in FIFO order if the limit exceeded (#27964)
- Customized serializer for Arrow JSON ParseOptions in read_json (#27911)
- Optimize groupby/mapgroups performance (#27805)
- Improve size estimation of image folder data source (#27219)
- Use detached lifetime for stats actor (#25271)
- Pin _StatsActor to the driver node (#27765)
- Better error message for partition filtering if no file found (#27353)
- Make Concatenator deterministic (#27575)
- Change FeatureHasher input schema to expect token counts (#27523)
- Avoid unnecessary reads when truncating a dataset with `ds.limit()` (#27343)
- Hide tensor extension from UDFs (#27019)
- Add repr to AIR classes (#27006)
🔨 Fixes:
- Add upper bound to pyarrow version check (#29674) (#29744)
- Fix map_groups to work with different output type (#29184)
- read_csv not filter out files by default (#29032)
- Check columns when adding rows to TableBlockBuilder (#29020)
- Fix the peak memory usage calculation (#28419)
- Change sampling to use same API as read Parquet (#28258)
- Fix column assignment in Concatenator for Pandas 1.2. (#27531)
- Doing partition filtering in reader constructor (#27156)
- Fix split ownership (#27149)
📖Documentation:
- Clarify dataset transformation. (#28482)
- Update map_batches documentation (#28435)
- Improve docstring and doctest for read_parquet (#28488)
- Activate dataset doctests (#28395)
- Document using a different separator for read_csv (#27850)
- Convert custom datetime column when reading a CSV file (#27854)
- Improve preprocessor documentation (#27215)
- Improve `limit()` and `take()` docstrings (#27367)
- Reorganize the tensor data support docs (#26952)
- Fix nyc_taxi_basic_processing notebook (#26983)
Ray Train
🎉 New Features:
- Add FullyShardedDataParallel support to TorchTrainer (#28096)
💫Enhancements:
- Add rich notebook repr for DataParallelTrainer (#26335)
- Fast fail if training loop raises an error on any worker (#28314)
- Use torch.encode_data with HorovodTrainer when torch is imported (#28440)
- Automatically set NCCL_SOCKET_IFNAME to use ethernet (#28633)
- Don't add Trainer resources when running on Colab (#28822)
- Support large checkpoints and other arguments (#28826)
🔨 Fixes:
- Fix and improve HuggingFaceTrainer (#27875, #28154, #28170, #28308, #28052)
- Maintain dtype info in LightGBMPredictor (#28673)
- Fix prepare_model (#29104)
- Fix `train.torch.get_device()` (#28659)
📖Documentation:
- Clarify LGBM/XGB Trainer documentation (#28122)
- Improve Hugging Face notebook example (#28121)
- Update Train API reference and docs (#28192)
- Mention FSDP in HuggingFaceTrainer docs (#28217)
🏗 Architecture refactoring:
- Improve Trainer modularity for extensibility (#28650)
Ray Tune
🎉 New Features:
- Add `Tuner.get_results()` to retrieve results after restore (#29083); see the sketch below
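A hedged sketch of retrieving results from a restored experiment without re-running it; the experiment path and metric name are placeholders:

```python
from ray import tune

tuner = tune.Tuner.restore("~/ray_results/my_experiment")  # placeholder path
result_grid = tuner.get_results()
best = result_grid.get_best_result(metric="loss", mode="min")  # "loss" is illustrative
print(best.config)
```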
💫Enhancements:
- Exclude files in sync_dir_between_nodes, exclude temporary checkpoints (#27174)
- Add rich notebook output for Tune progress updates (#26263)
- Add logdir to W&B run config (#28454)
- Improve readability for long column names in table output (#28764)
- Add functionality to recover from latest available checkpoint (#29099)
- Add retry logic for restoring trials (#29086)
🔨 Fixes:
- Re-enable progress metric detection (#28130)
- Add timeout to retry_fn to catch hanging syncs (#28155)
- Correct PB2’s beta_t parameter implementation (#28342)
- Ignore directory exists errors to tackle race conditions (#28401)
- Correctly overwrite files on restore (#28404)
- Disable pytorch-lightning multiprocessing per default (#28335)
- Raise error if scheduling an empty PlacementGroupFactory (#28445)
- Fix trial cleanup after x seconds, set default to 600 (#28449)
- Fix trial checkpoint syncing after recovery from other node (#28470)
- Catch empty hyperopt search space, raise better Tuner error message (#28503)
- Fix and optimize sample search algorithm quantization logic (#28187)
- Support tune.with_resources for class methods (#28596)
- Maintain consistent Trial/TrialRunner state when pausing and resuming trial with PBT (#28511)
- Raise better error for incompatible gcsfs version (#28772)
- Ensure that exploited in-memory checkpoint is used by trial with PBT (#28509)
- Fix Tune checkpoint tracking for minimizing metrics (#29145)
📖Documentation:
- Miscellaneous documentation fixes (#27117, #28131, #28210, #28400, #28068, #28809)
- Add documentation around trial/experiment checkpoint (#28303)
- Add basic parallel execution guide for Tune (#28677)
- Add example PBT notebook (#28519)
🏗 Architecture refactoring:
- Store SyncConfig and CheckpointConfig in Experiment and Trial (#29019)
Ray Serve
🎉 New Features:
- Added gRPC direct ingress support [alpha version] (#28175)
- Serve cli can provide kubernetes formatted output (#28918)
- Serve cli can provide user config output without default value (#28313)
💫Enhancements:
- Enrich more benchmarks
- Image inference with a resnet50 model and image preprocessing (#29096)
- gRPC vs HTTP inference performance (#28175)
- Add health check metrics to reflect the replica health status (#29154)
🔨 Fixes:
- Fix memory leak issues during inference (#29187)
- Fix unexpected warning about omitted HTTP options when using the Serve CLI to start Ray Serve (#28257)
- Fix unexpected long poll exceptions (#28612)
📖Documentation:
- Add e2e fault tolerance instructions (#28721)
- Add Direct Ingress instructions (#29149)
- Bunch of doc improvements on “dev workflow”, “custom resources”, “serve cli” etc (#29147, #28708, #28529, #28527)
RLlib
🎉 New Features:
- Decision Transformer (DT) Algorithm added (#27890, #27889, #27872, #27829).
- Callbacks now have a new hook `on_episode_created()`. (#28600)
- Added learning rate schedule to SimpleQ and PG. (#28381)
💫Enhancements:
- Soft target network update is now supported by all off-policy algorithms (e.g DQN, DDPG, etc.) (#28135)
- Stop RLlib from "silently" selecting atari preprocessors. (#29011)
- Improved offline RL and off-policy evaluation performance (#28837, #28834, #28593, #28420, #28136, #28013, #27356, #27161, #27451).
- Escalated old deprecation warnings to errors (#28807, #28795, #28733, #28697).
- Others: #27619, #27087.
🔨 Fixes:
- Various bug fixes: #29077, #28811, #28637, #27785, #28703, #28422, #28405, #28358, #27540, #28325, #28357, #28334, #27090, #28133, #27981, #27980, #26666, #27390, #27791, #27741, #27424, #27544, #27459, #27572, #27255, #27304, #26629, #28166, #27864, #28938, #28845, #28588, #28202, #28201, #27806
📖Documentation:
Ray Workflows
🔨 Fixes:
Ray Core and Ray Clusters
Ray Core
🎉 New Features:
- Ray OOM prevention feature alpha release! If your Ray jobs suffer from OOM issues, please give it a try.
- Support dynamic generators as task return values. (#29082 #28864 #28291)
💫Enhancements:
Ray-2.0.1
The Ray 2.0.1 patch release contains dependency upgrades and fixes for multiple components:
- Upgrade grpcio version to 1.32 (#28025)
- Upgrade redis version to 7.0.5 (#28936)
- Fix segfault when using runtime environments (#28409)
- Increase RPC timeout for dashboard (#28330)
- Set correct path when using `python -m` (#28140)
- [Autoscaler] Fix autoscaling for 0 CPU head node (#26813)
- [Serve] Allow code in private remote Git URIs to be imported (#28250)
- [Serve] Allow `host` and `port` in Serve config (#27026)
- [RLlib] Evaluation supports asynchronous rollout (single slow eval worker will not block the overall evaluation progress). (#27390)
- [Tune] Fix hang during checkpoint synchronization (#28155)
- [Tune] Fix trial restoration from different IP (#28470)
- [Tune] Fix custom synchronizer serialization (#28699)
- [Workflows] Replace deprecated `name` option with `task_id` (#28151)
Ray-2.0.0
Release Highlights
Ray 2.0 is an exciting release with enhancements to all libraries in the Ray ecosystem. With this major release, we take strides towards our goal of making distributed computing scalable, unified, and open.
Towards these goals, Ray 2.0 features new capabilities for unifying the machine learning (ML) ecosystem, improving Ray's production support, and making it easier than ever for ML practitioners to use Ray's libraries.
Highlights:
- Ray AIR, a scalable and unified toolkit for ML applications, is now in Beta.
- Ray now supports natively shuffling 100TB or more of data with the Ray Datasets library.
- KubeRay, a toolkit for running Ray on Kubernetes, is now in Beta. This replaces the legacy Python-based Ray operator.
- Ray Serve’s Deployment Graph API is a new and easier way to build, test, and deploy an inference graph of deployments. This is released as Beta in 2.0.
A migration guide for all the different libraries can be found here: Ray 2.0 Migration Guide.
Ray Libraries
Ray AIR
Ray AIR is now in beta. Ray AIR builds upon Ray’s libraries to enable end-to-end machine learning workflows and applications on Ray. You can install all dependencies needed for Ray AIR via `pip install -U "ray[air]"`.
🎉 New Features:
- Predictors:
- BatchPredictors now have support for scalable inference on GPUs.
- All Predictors can now be constructed from pre-trained models, allowing you to easily scale batch inference with trained models from common ML frameworks (see the sketch after this list).
- ray.ml.predictors has been moved to the Ray Train namespace (ray.train).
- Preprocessing: New preprocessors and API changes on Ray Datasets now make feature processing easier to do on AIR. See the Ray Data release notes for more details.
- New features for Datasets/Train/Tune/Serve can be found in the corresponding library release notes for more details.
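A hedged sketch of constructing a scalable batch predictor from a pre-trained model; the toy model is illustrative and stands in for a real trained checkpoint:

```python
import torch
from ray.air.checkpoint import Checkpoint
from ray.train.batch_predictor import BatchPredictor
from ray.train.torch import TorchPredictor

# Wrap a pre-trained model in a Checkpoint, then build a BatchPredictor from it.
model = torch.nn.Linear(2, 1)
checkpoint = Checkpoint.from_dict({"model": model})
batch_predictor = BatchPredictor.from_checkpoint(checkpoint, TorchPredictor)
# batch_predictor.predict(dataset) then runs inference over a Ray Dataset,
# optionally on GPUs via num_gpus_per_worker.
```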
💫 Enhancements:
- Major package refactoring is included in this release.
- ray.ml is renamed to ray.air.
- ray.ml.preprocessors have been moved to ray.data.
- train_test_split is now a new method of ray.data.Dataset (#27065)
- ray.ml.trainers have been moved to ray.train (#25570)
- ray.ml.predictors has been moved to ray.train.
- ray.ml.config has been moved to ray.air.config (#25712).
- Checkpoints are now framework-specific -- meaning that each Trainer generates its own Framework-specific Checkpoint class. See Ray Train for more details.
- ModelWrappers have been renamed to PredictorDeployments.
- API stability annotations have been added (#25485)
- Train/Tune now have the same reporting and checkpointing API -- see the Train notes for more details (#26303)
- ScalingConfigs are now Dataclasses not Dict types
- Many AIR examples, benchmarks, and documentation pages were added in this release. The Ray AIR documentation will cover breadth of usage (end to end workflows across different libraries) while library-specific documentation will cover depth (specific features of a specific library).
🔨 Fixes:
- Many documentation examples were previously untested. This release fixes those examples and adds them to the CI.
- Predictors:
- Torch/Tensorflow Predictors have correctness fixes (#25199, #25190, #25138, #25136)
- Update `KerasCallback` to work with `TensorflowPredictor` (#26089)
- Add streaming BatchPredictor support (#25693)
- Add `predict_pandas` implementation (#25534)
- Add `_predict_arrow` interface for Predictor (#25579)
- Allow creating Predictor directly from a UDF (#26603)
- Execute GPU inference in a separate stage in BatchPredictor (#26616, #27232, #27398)
- Accessors for preprocessor in Predictor class (#26600)
- [AIR] Predictor `call_model` API for unsupported output types (#26845)
Ray Data Processing
🎉 New Features:
- Add ImageFolderDatasource (#24641)
- Add the NumPy batch format for batch mapping and batch consumption (#24870)
- Add iter_torch_batches() and iter_tf_batches() APIs (#26689)
- Add local shuffling API to iterators (#26094)
- Add drop_columns() API (#26200)
- Add randomize_block_order() API (#25568)
- Add random_sample() API (#24492)
- Add support for len(Dataset) (#25152)
- Add UDF passthrough args to map_batches() (#25613)
- Add Concatenator preprocessor (#26526)
- Change range_arrow() API to range_table() (#24704); a short sketch of several of these new APIs follows this list
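A short sketch touching several of the new Dataset APIs above; the values are illustrative:

```python
import ray

ds = ray.data.range_table(100)         # renamed from range_arrow()
sample = ds.random_sample(0.1)         # new random_sample() API
shuffled = ds.randomize_block_order()  # new randomize_block_order() API
print(len(ds), sample.count())         # len(Dataset) is now supported
```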
💫 Enhancements:
- Autodetect dataset parallelism based on available resources and data size (#25883)
- Use polars for sorting (#25454)
- Support tensor columns in to_tf() and to_torch() (#24752)
- Add explicit resource allocation option via a top-level scheduling strategy (#24438)
- Spread actor pool actors evenly across the cluster by default (#25705)
- Add ray_remote_args to read_text() (#23764)
- Add max_epoch argument to iter_epochs() (#25263)
- Add Pandas-native groupby and sorting (#26313)
- Support push-based shuffle in groupby operations (#25910)
- More aggressive memory releasing for Dataset and DatasetPipeline (#25461, #25820, #26902, #26650)
- Automatically cast tensor columns on Pandas UDF outputs (#26924)
- Better error messages when reading from S3 (#26619, #26669, #26789)
- Make dataset splitting more efficient and stable (#26641, #26768, #26778)
- Use sampling to estimate in-memory data size for Parquet data source (#26868)
- De-experimentalized lazy execution mode (#26934)
🔨 Fixes:
- Fix pipeline pre-repeat caching (#25265)
- Fix stats construction for from_*() APIs (#25601)
- Fixes label tensor squeezing in to_tf() (#25553)
- Fix stage fusion between equivalent resource args (fixes BatchPredictor) (#25706)
- Fix tensor extension string formatting (repr) (#25768)
- Workaround for unserializable Arrow JSON ReadOptions (#25821)
- Make ActorPoolStrategy kill pool of actors if exception is raised (#25803)
- Fix max number of actors for default actor pool strategy (#26266)
- Fix byte size calculation for non-trivial tensors (#25264)
Ray Train
Ray Train has received a major expansion of scope with Ray 2.0.
In particular, the Ray Train module now contains:
- Trainers
- Predictors
- Checkpoints
for common ML frameworks including PyTorch, TensorFlow, XGBoost, LightGBM, HuggingFace, and Scikit-Learn. These APIs help provide end-to-end usage of Ray libraries in Ray AIR workflows.
🎉 New Features:
- The Trainer API is now deprecated for the new Ray AIR Trainers API. Trainers for Pytorch, Tensorflow, Horovod, XGBoost, and LightGBM are now in Beta. (#25570)
- ML framework-specific Predictors have been moved into the `ray.train` namespace. This provides streamlined API for offline and online inference of Pytorch, Tensorflow, XGBoost models and more. (#25769, #26215, #26251, #26451, #26531, #26600, #26603, #26616, #26845)
- ML framework-specific checkpoints are introduced. Checkpoints are consumed by Predictors to load model weights and information. (#26777, #25940, #26532, #26534)
💫 Enhancements:
- Train and Tune now use the same reporting and checkpointing API (#24772, #25558)
- Add tunable ScalingConfig dataclass (#25712)
- Randomize block order by default to avoid hotspots (#25870)
- Improve checkpoint configurability and extend results (#25943)
- Improve prepare_data_loader to support multiple batch data types (#26386)
- Discard returns of train loops in Trainers (#26448)
- Clean up logs, reprs, warnings (#26259, #26906, #26988, #27228, #27519)
📖 Documentation:
- Update documentation to use new Train API (#25735)
- Update documentation to use session API (#26051, #26303)
- Add Trainer user guide and update Trainer docs (#27570, #27644, #27685)
- Add Predictor documentation (#25833)
- Replace to_torch with iter_torch_batches (#27656)
- Replace to_tf with iter_tf_batches (#27768)
- Minor doc fixes (#25773, #27955)
🏗 Architecture refactoring:
🔨 Fixes:
- An issue with GPU ID detection and assignment was fixed. (#26493)
- Fix AMP for models with a custom `__getstate__` method (#25335)
- Fix transformers example for multi-gpu (#24832)
- Fix ScalingConfig key validation (#25549)
- Fix ResourceChangingScheduler integration (#26307)
- Fix auto_transfer cuda device (#26819)
- Fix BatchPredictor.predict_pipelined not working with GPU stage (#27398)
- Remove rllib dependency from tensorflow_predictor (#27688)
Ray Tune
🎉 New Features:
- The Tuner API is the new way of running Ray Tune experiments; see the sketch after this list. (#26987, #26987, #26961, #26931, #26884, #26930)
- Ray Tune and Ray Train now have the same API for reporting (#25558)
- Introduce tune.with_resources() to specify function trainable resources (#26830)
- Add Tune benchmark for AIR (#26763, #26564)
- Allow Tuner().restore() from cloud URIs (#26963)
- Add top-level imports for Tuner, TuneConfig, move CheckpointConfig (#26882)
- Add resume experiment options to Tuner.restore() (#26826)
- Add checkpoint_frequency/checkpoint_at_end arguments to CheckpointConfig (#26661)
- Add more config arguments to Tuner (#26656)
- Better error message for Tune nested tasks / actors (#25241)
- Allow iterators in tune.grid_search (#25220)
- Add `get_dataframe()` method to result grid, fix config flattening (#24686)
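A minimal sketch of the new Tuner API together with the unified reporting API; the trainable and search space are illustrative:

```python
from ray import tune
from ray.air import session

def trainable(config):
    # Report a metric through the unified session API.
    session.report({"score": config["x"] ** 2})

tuner = tune.Tuner(
    trainable,
    param_space={"x": tune.grid_search([1, 2, 3])},
    tune_config=tune.TuneConfig(metric="score", mode="min"),
)
results = tuner.fit()
print(results.get_best_result().config)
```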
💫 Enhancements:
- Expose number of errored/terminated trials in ResultGrid (#26655)
- remove f...
Ray-1.13.0
Highlights:
- Python 3.10 support is now in alpha.
- Ray usage stats collection is now on by default (guarded by an opt-out prompt).
- Ray Tune can now synchronize Trial data from worker nodes via the object store (without rsync!)
- Ray Workflow comes with a new API and is integrated with Ray DAG.
Ray Autoscaler
💫Enhancements:
- CI tests for KubeRay autoscaler integration (#23365, #23383, #24195)
- Stability enhancements for KubeRay autoscaler integration (#23428)
🔨 Fixes:
- Improved GPU support in KubeRay autoscaler integration (#23383)
- Resources scheduled with the node affinity strategy are not reported to the autoscaler (#24250)
Ray Client
💫Enhancements:
- Add option to configure ray.get with >2 sec timeout (#22165)
- Return `None` from internal KV for non-existent keys (#24058)
🔨 Fixes:
- Fix deadlock by switching to `SimpleQueue` on Python 3.7 and newer in async `dataclient` (#23995)
Ray Core
🎉 New Features:
- Ray usage stats collection is now on by default (guarded by an opt-out prompt)
- Alpha support for python 3.10 (on Linux and Mac)
- Node affinity scheduling strategy (#23381)
- Add metrics for disk and network I/O (#23546)
- Improve exponential backoff when connecting to the redis (#24150)
- Add the ability to inject a setup hook for customization of runtime_env on init (#24036)
- Add a utility to check GCS / Ray cluster health (#23382)
🔨 Fixes:
- Fixed internal storage S3 bugs (#24167)
- Ensure "get_if_exists" takes effect in the decorator. (#24287)
- Reduce memory usage for Pubsub channels that do not require total memory cap (#23985)
- Add memory buffer limit in publisher for each subscribed entity (#23707)
- Use gRPC instead of socket for GCS client health check (#23939)
- Trim size of Reference struct (#23853)
- Enable debugging into pickle backend (#23854)
🏗 Architecture refactoring:
- Gcs storage interfaces unification (#24211)
- Cleanup pickle5 version check (#23885)
- Simplify options handling (#23882)
- Moved function and actor importer away from pubsub (#24132)
- Replace the legacy ResourceSet & SchedulingResources at Raylet (#23173)
- Unification of AddSpilledUrl and UpdateObjectLocationBatch RPCs (#23872)
- Save task spec in separate table (#22650)
Ray Datasets
🎉 New Features:
- Performance improvement: the aggregation computation is vectorized (#23478)
- Performance improvement: bulk parquet file reading is optimized with the fast metadata provider (#23179)
- Performance improvement: more efficient move semantics for Datasets block processing (#24127)
- Supports Datasets lineage serialization (aka out-of-band serialization) (#23821, #23931, #23932)
- Supports native Tensor views in map processing for pure-tensor datasets (#24812)
- Implemented push-based shuffle (#24281)
🔨 Fixes:
- Documentation improvement: Getting Started page (#24860)
- Documentation improvement: FAQ (#24932)
- Documentation improvement: End to end examples (#24874)
- Documentation improvement: Feature guide - Creating Datasets (#24831)
- Documentation improvement: Feature guide - Saving Datasets (#24987)
- Documentation improvement: Feature guide - Transforming Datasets (#25033)
- Documentation improvement: Datasets APIs docstrings (#24949)
- Performance: fixed block prefetching (#23952)
- Fixed zip() for Pandas dataset (#23532)
🏗 Architecture refactoring:
- Refactored LazyBlockList (#23624)
- Added path-partitioning support for all content types (#23624)
- Added fast metadata provider and refactored Parquet datasource (#24094)
RLlib
🎉 New Features:
- Replay buffer API: First algorithms are using the new replay buffer API, allowing users to define and configure their own custom buffers or use RLlib’s built-in ones: SimpleQ, DQN (#24164, #22842, #23523, #23586)
🏗 Architecture refactoring:
- More algorithms moved into the training iteration function API (no longer using execution plans). Users can now more easily read, develop, and debug RLlib’s algorithms: A2C, APEX-DQN, CQL, DD-PPO, DQN, MARWIL + BC, PPO, QMIX , SAC, SimpleQ, SlateQ, Trainers defined in examples folder. (#22937, #23420, #23673, #24164, #24151, #23735, #24157, #23798, #23906, #24118, #22842, #24166, #23712). This will be fully completed and documented with Ray 2.0.
- Make RolloutWorkers (optionally) recoverable after failure via the new `recreate_failed_workers=True` config flag. (#23739)
- POC for new TrainerConfig objects (instead of python config dicts): PPOConfig (for PPOTrainer) and PGConfig (for PGTrainer). (#24295, #23491)
- Hard-deprecate `build_trainer()` (trainer_templates.py): All custom Trainers should now sub-class from any existing `Trainer` class. (#23488)
💫Enhancements:
- Add support for complex observations in CQL. (#23332)
- Bandit support for tf2. (#22838)
- Make actions sent by RLlib to the env immutable. (#24262)
- Memory leak finding toolset using tracemalloc + CI memory leak tests. (#15412)
- Enable DD-PPO to run on Windows. (#23673)
🔨 Fixes:
- APPO eager fix (APPOTFPolicy gets wrapped `as_eager()` twice by mistake). (#24268)
- CQL gets stuck when deprecated `timesteps_per_iteration` is used (use `min_train_timesteps_per_reporting` instead). (#24345)
- SlateQ runs on GPU (torch). (#23464)
- Other bug fixes: #24016, #22050, #23814, #24025, #23740, #23741, #24006, #24005, #24273, #22010, #24271, #23690, #24343, #23419, #23830, #24335, #24148, #21735, #24214, #23818, #24429
Ray Workflow
🎉 New Features:
🔨 Fixes:
- Fix one bug where max_retries is not aligned with ray core’s max_retries. (#22903)
🏗 Architecture refactoring:
- Integrate ray storage in workflow (#24120)
Tune
🎉 New Features:
- Add RemoteTask based sync client (#23605) (rsync not required anymore!)
- Chunk file transfers in cross-node checkpoint syncing (#23804)
- Also interrupt training when SIGUSR1 received (#24015)
- reuse_actors per default for function trainables (#24040)
- Enable AsyncHyperband to continue training for last trials after max_t (#24222)
💫Enhancements:
- Improve testing (#23229)
- Improve docstrings (#23375)
- Improve documentation (#23477, #23924)
- Simplify trial executor logic (#23396)
- Make `MLflowLoggerUtil` copyable (#23333)
- Use new Checkpoint interface internally (#22801)
- Beautify Optional typehints (#23692)
- Improve missing search dependency info (#23691)
- Skip tmp checkpoints in analysis and read iteration from metadata (#23859)
- Treat checkpoints with nan value as worst (#23862)
- Clean up base ProgressReporter API (#24010)
- De-clutter log outputs in trial runner (#24257)
- hyperopt searcher to support tune.choice([[1,2],[3,4]]). (#24181)
🔨Fixes:
- Optuna should ignore additional results after trial termination (#23495)
- Fix PTL multi GPU link (#23589)
- Improve Tune cloud release tests for durable storage (#23277)
- Fix tensorflow distributed trainable docstring (#23590)
- Simplify experiment tag formatting, clean directory names (#23672)
- Don't include nan metrics for best checkpoint (#23820)
- Fix syncing between nodes in placement groups (#23864)
- Fix memory resources for head bundle (#23861)
- Fix empty CSV headers on trial restart (#23860)
- Fix checkpoint sorting with nan values (#23909)
- Make Timeout stopper work after restoring in the future (#24217)
- Small fixes to tune-distributed for new restore modes (#24220)
Train
Most distributed training enhancements will be captured in the new Ray AIR category!
🔨Fixes:
- Copy resources_per_worker to avoid modifying user input
- Fix `train.torch.get_device()` for fractional GPU or multiple GPU per worker case (#23763)
- Fix multi node horovod bug (#22564)
- Fully deprecate Ray SGD v1 (#24038)
- Improvements to fault tolerance (#22511)
- MLflow start run under correct experiment (#23662)
- Raise helpful error when required backend isn't installed (#23583)
- Warn pending deprecation for `ray.train.Trainer` and `ray.tune` DistributedTrainableCreators (#24056)
📖Documentation:
- Add FAQ (#22757)
Ray AIR
🎉 New Features:
- `HuggingFaceTrainer` & `HuggingFacePredictor` (#23615, #23876)
- `SklearnTrainer` & `SklearnPredictor` (#23803, #23850)
- `HorovodTrainer` (#23437)
- `RLTrainer` & `RLPredictor` (#23465, #24172)
- `BatchMapper` preprocessor (#23700)
- `Categorizer` preprocessor (#24180)
- `BatchPredictor` (#23808)
💫Enhancements:
- Add `Checkpoint.as_directory()` for efficient checkpoint fs processing (#23908)
- Add `config` to `Result`, extend `ResultGrid.get_best_config` (#23698)
(#23698) - Add Scaling Config validation (#23889)
- Add tuner test. (#23364)
- Move storage handling to pyarrow.fs.FileSystem (#23370)
- Refactor `_get_unique_value_indices` (#24144)
- Refactor `most_frequent` `SimpleImputer` (#23706)
- Set name of Trainable to match with Trainer (#23697)
- Use checkpoint.as_directory() instead of cleaning up manually (#24113)
- Improve file packing/unpacking (#23621)
- Make Dataset ingest configurable (#24066)
- Remove postprocess_checkpoint (#24297)
🔨Fixes:
- Better exception handling (#23695)
- Do not deepcopy RunConfig (#23499)
- reduce unnecessary stacktrace (#23475)
- Tuner should use `run_config` from Trainer per default (#24079)
- Use custom fsspec handler for GS (#24008)
📖Documentation:
Serve
🎉 New Features:
- Serve logging system was revamped! Access log is now turned on by default. (#23558)
- New Gradio notebook example for Ray Serve deployments (#23494)
- Serve now includes full traceback in deployment update error message (#23752)
💫Enhancements:
- Serve Deployment Graph was...
Ray-1.12.1
Patch release with the following fixes:
- Ray now works on Google Colab again! The bug with memory limit fetching when running Ray in a container is now fixed (#23922).
- `ray-ml` Docker images for CPU will start being built again after they were stopped in Ray 1.9 (#24266).
- [Train/Tune] Start MLflow run under the correct experiment for Ray Train and Ray Tune integrations (#23662).
- [RLlib] Fix for APPO in eager mode (#24268).
- [RLlib] Fix Alphastar for TF2 and tracing enabled (c5502b2).
- [Serve] Fix replica leak in anonymous namespaces (#24311).