Sync Upstream master (#50)
* [core] Pull Manager exponential backoff (#13024)
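
As a rough illustration of the exponential backoff named in #13024: each retry waits twice as long as the previous one, up to a cap. This is a generic sketch, not the Pull Manager's actual C++ code, and `backoff_delays` and its parameters are hypothetical names chosen for the example.

```python
from typing import List


def backoff_delays(base_s: float, max_s: float, attempts: int) -> List[float]:
    """Return the wait before each retry: base, 2*base, 4*base, ... capped at max_s."""
    return [min(base_s * (2 ** i), max_s) for i in range(attempts)]


# Hypothetical settings: start at 1 second, never wait longer than 10 seconds.
print(backoff_delays(1.0, 10.0, 6))  # prints [1.0, 2.0, 4.0, 8.0, 10.0, 10.0]
```

Capping the delay keeps a long-unavailable object from pushing the retry interval out indefinitely.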

* [RLlib] Issue 12789: RLlib throws the warning "The given NumPy array is not writeable" (#12793)

* [release tests] test_many_tasks fix (#12984)

* Add "beta" documentation for enabling object spilling manually (#13047)

* [Serve] Handle Bug Fixes (#12971)

* [Dashboard] Add GET /logical/actors API (#12913)

* [GCS]Decouple gcs resource manager and gcs node manager (#13012)

* [ray_client]: Insert decorators into the real ray module to allow for client mode (#13031)

* [GCS] Delete redis gcs client and redis_xxx_accessor (#12996)

* [RLlib] Fix broken unity3d_env import in example server script. (#13040)

* [RLlib] TorchPolicies: Accessing "infos" dict in train_batch causes `TypeError`. (#13039)

* [joblib] Fix flaky joblib test. (#13046)

* [Tune]Add integer loguniform support (#12994)

* Add integer quantization and loguniform support

* Fix hyperopt qloguniform not being np.log'd first

* Add tests, __init__

* Try to fix tests, better exceptions

* Tweak docstrings

* Type checks in SearchSpaceTest

* Update docs

* Lint, tests

* Update doc/source/tune/api_docs/search_space.rst

Co-authored-by: Kai Fricke <[email protected]>
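
The idea behind integer log-uniform sampling in #12994 can be sketched with the standard library alone: take the logarithm of the bounds first, sample uniformly in log space, then exponentiate and truncate to an integer. This is an illustrative assumption about the technique, not Tune's implementation, and `sample_int_loguniform` is a hypothetical helper.

```python
import math
import random


def sample_int_loguniform(lower: int, upper: int, rng: random.Random) -> int:
    """Draw an integer in [lower, upper) whose logarithm is roughly uniform."""
    if not 0 < lower < upper:
        raise ValueError("need 0 < lower < upper")
    # Log the bounds first, then sample uniformly between them.
    log_value = rng.uniform(math.log(lower), math.log(upper))
    # Clamp to guard against floating-point rounding at either bound.
    return max(lower, min(int(math.exp(log_value)), upper - 1))


rng = random.Random(0)
samples = [sample_int_loguniform(1, 1000, rng) for _ in range(5)]
print(samples)
```

Sampling in log space gives each order of magnitude (1 to 10, 10 to 100, 100 to 1000) about the same share of draws.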

* [core][new scheduler] Move tasks from ready to dispatch to waiting on argument eviction (#13048)

* Add index for tasks to dispatch

* Task dependency manager interface

* Unsubscribe dependencies and tests

* NodeManager

* Revert "Add index for tasks to dispatch"

This reverts commit c6ccb9aa306e00f80d34b991055e4e83872595ea.

* tmp

* Move back to waiting if args not ready

* update

* Update to new form of brew cask install command

* [Autoscaler] New output log format (#12772)

* Fix typo RMSProp -> RMSprop (#13063)

* [serve] Centralize HTTP-related logic in HTTPState (#13020)

* Remove suppress output to see why wheel is not building

* Refactor TaskDependencyManager, allow passing bundles of objects to ObjectManager (#13006)

* New dependency manager

* Switch raylet to new DependencyManager

* PullManager accepts bundles

* Cleanup, remove old task dependency manager

* x

* PullManager unit tests

* lint

* Unit tests

* Rename

* lint

* test

* Update src/ray/raylet/dependency_manager.cc

Co-authored-by: SangBin Cho <[email protected]>

* Update src/ray/raylet/dependency_manager.cc

Co-authored-by: SangBin Cho <[email protected]>

* x

* lint

Co-authored-by: SangBin Cho <[email protected]>

* [docs] Fix args + kwargs instead of docstrings (#13068)

* functools wraps

* Fix typo (functoools -> functools)
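
For context on the `functools` wraps fix above: without `functools.wraps`, a decorated function reports the wrapper's `(*args, **kwargs)` signature and loses its docstring, which is the symptom the title of #13068 describes. The decorator `passthrough` and function `add` below are hypothetical stand-ins, not code from the PR.

```python
import functools


def passthrough(func):
    """A do-nothing decorator that preserves the wrapped function's metadata."""

    @functools.wraps(func)  # copies __name__, __doc__, etc. onto the wrapper
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)

    return wrapper


@passthrough
def add(a, b):
    """Return the sum of a and b."""
    return a + b


print(add.__name__)  # prints add
print(add.__doc__)   # prints Return the sum of a and b.
```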

* Fix OS X Wheel Build - Update brew cask install (#13062)

Co-authored-by: Richard Liaw <[email protected]>

* speed up local mode object store get (#13052)

Co-authored-by: senlin.zsl <[email protected]>

* [RLlib] Execution Annotation (#13036)

* [RLlib] Improved Documentation for PPO, DDPG, and SAC (#12943)

* [C++ API] Added reference counting to ObjectRef (#13058)

* Added reference counting to ObjectRef

* Addressed the comments

* [Core] Remove cuda support in plasma store (#13070)

* remove cuda support in plasma store

* [Core] Remove outdated external store (#13080)

* remove outdated external store

* [GCS] Move resource usage info to gcs resource manager (#13059)

* [RLlib] JAXPolicy prep. PR #1. (#13077)

* [RLlib] Preprocessor fixes (multi-discrete) and tests. (#13083)

* [RLlib] BC/MARWIL/recurrent nets minor cleanups and bug fixes. (#13064)

* [Collective][PR 3.5/6] Send/Recv calls and some initial code for communicator caching (#12935)

* other collectives all work

* auto-linting

* manual linting #1

* manual linting 2

* bugfix

* add send/recv point-to-point calls

* add some initial code for communicator caching

* auto linting

* optimize imports

* minor fix

* fix unpassed tests

* support more dtypes

* rerun some distributed tests for send/recv

* linting

* [Serve] [Doc] Front page update (#13032)

* Deprecate experimental / dynamic resources (#13019)

* [docs] fix wandb url (#13094)

* [Serve] Implement Graceful Shutdown (#13028)

* [Serve] Use ServeHandle in HTTP proxy (#12523)

* [Java] Format ray java code (#13056)

* [docker] Fix restart behavior with Docker (#12898)

Co-authored-by: Richard Liaw <[email protected]>
Co-authored-by: ijrsvt <[email protected]>

* Disable broken streaming tests (#13095)

* [autoscaler] Make placement groups bypass max launch limit (#13089)

* Serve metrics docs (#13096)

* [RLlib] run_regression_tests.py: --framework flag (instead of --torch). (#13097)

* [RLLib] Readme.md Documentation for Almost All Algorithms in rllib/agents (#13035)

* [Doc] Fix Sphinx.add_stylesheet deprecation (#13067)

* Fix streaming ci failure (#12830)

* [RLlib] New Offline RL Algorithm: CQL (based on SAC) (#13118)

* [Bugfix][Dashboard] Fix undefined logCount, errorCount UI crash (#13113)

* [RLlib] Deflake test case: 2-step game MADDPG. (#13121)

* [RLlib] Trajectory view API docs. (#12718)

* Job module without submission (#13081)

Co-authored-by: 刘宝 <[email protected]>

* [RLlib] JAXPolicy prep PR #2 (move get_activation_fn (backward-compatibly), minor fixes and preparations). (#13091)

* [Java] Avoid failure of serializing a user-defined unserializable exception. (#13119)

* [Tune] Update URL to fix 403 not found error in PBT transformers test case (#13131)

* [serve] Async controller (#13111)

* [dashboard] Fix RAY_RAYLET_PID KeyError on Windows (#12948)

* [Serve] Use a small object to track requests (#13125)

* [docs][kubernetes][minor] Update K8s examples in docs (#13129)

* [RLlib] Support easy `use_attention=True` flag for using the GTrXL model. (#11698)

* [docs] Documentation + example for the C++ language API (#13138)

* [Java] Support `wasCurrentActorRestarted` in actor task. (#13120)

* Remove check.

* Add test

* fix lint

* lint

* Fix spotless lint

* Address comments.

* Fix lint

Co-authored-by: Qing Wang <[email protected]>

* [docs] Minor change to formatting C++ docs. (#13151)

* Deprecate setResource java api (#13117)

* [docs] Small fix in C++ documentation. (#13154)

* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* small fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* deflake test_joblib

* lint

* placement groups bypass

* remove space

* Eric

* first commit

* lint

* example

* documentation

* hmm

* file path fix

* fix test

* some format issue in docs

* modified docs

Co-authored-by: Ameer Haj Ali <[email protected]>
Co-authored-by: Alex Wu <[email protected]>
Co-authored-by: Alex Wu <[email protected]>
Co-authored-by: Eric Liang <[email protected]>
Co-authored-by: Ameer Haj Ali <[email protected]>
Co-authored-by: root <[email protected]>

* [Serve] [Doc] Add existing web server integration ServeHandle tutorial (#13127)

* [kubernetes][docs][minor] Kubernetes version warning (#13161)

* [Core] Locality-aware leasing: Milestone 1 - Owned refs, pinned location (#12817)

* Locality-aware leasing for owned refs (pinned locations).

* LessorPicker --> LeasePolicy.

* Consolidate GetBestNodeIdForTask and GetBestNodeIdForObjects.

* Update comments.

* Turn on locality-aware leasing feature flag by default.

* Move local fallback logic to LeasePolicy, move feature flag check to CoreWorker constructor, add local-only lease policy.

* Add lease policy consulting assertions to the direct task submitter tests.

* Add lease policy tests.

* LocalityLeasePolicy --> LocalityAwareLeasePolicy.

* Add missing const declarations.

Co-authored-by: SangBin Cho <[email protected]>

* Add RAY_CHECK for raylet address nullptr when creating lease client.

* Make the fact that LocalLeasePolicy always returns the local node more explicit.

* Flatten GetLocalityData conditionals to make it more readable.

* Add ReferenceCounter::GetLocalityData() unit test.

* Add data-intensive microbenchmarks for single-node perf testing.

* Add data-intensive microbenchmarks for simulated cluster perf testing.

* Remove redundant comment.

* Remove data-intensive benchmarks.

* Add locality-aware leasing Python test.

* Formatting changes in ray_perf.py.

Co-authored-by: SangBin Cho <[email protected]>

* Enabling the cancellation of non-actor tasks in a worker's queue (#12117)

* wrote code to enable cancellation of queued non-actor tasks

* minor changes

* bug fixes

* added comments

* rev1

* linting

* making ActorSchedulingQueue::CancelTaskIfFound raise a fatal error

* bug fix

* added two unit tests

* linting

* iterating through pending_normal_tasks starting from end

* fixup! iterating through pending_normal_tasks starting from end

* fixup! fixup! iterating through pending_normal_tasks starting from end

* post merge fixes

* added debugging instructions, pulled Accept() out of guarded loop

* removed debugging instructions, linting

* [Serve] Bug in Serve node memory-related resources calculation #11198 (#13061)

* [Release] Update Release Process Documentation (#13123)

* [Core] Remove Arrow dependencies (#13157)

* remove arrow ubsan

* remove arrow build depend

* remove arrow buffer

* [XGboost] Update Documentation (#13017)

Co-authored-by: Richard Liaw <[email protected]>

* [SGD] Fix Docstring for `as_trainable` (#13173)

* Revert "Enabling the cancellation of non-actor tasks in a worker's queue (#12117)" (#13178)

This reverts commit b4d688b4a64c595a071e8c7380b653e0bfea4ad2.

* Surface object store spilling statistics in `ray memory` (#13124)

* [ray_client]: Move from experimental to util (#13176)

Change-Id: I9f054881f0429092d265cd6944d89804cce9d946

* Remove unused file(object_manager_integration_test.cc) (#12989)

* Notify listeners after registered node stored (#13069)

* [build]Update description and add some keywords (#13163)

* [Collective][PR 2/6] Driver program declarative interfaces (#12874)

* scaffold of the code

* some scratch and options change

* NCCL mostly done, supporting API#1

* interface 2.1 2.2 scratch

* put code into ray and fix some importing issues

* add an additional Rendezvous class to safely meet at named actor

* fix some small bugs in nccl_util

* some small fix

* add a Backend class to make Backend string more robust

* add several useful APIs

* add some tests

* added allreduce test

* fix typos

* fix several bugs found via unittests

* fix and update torch test

* changed back actor

* rearrange a bit before importing distributed test

* add distributed test

* remove scratch code

* auto-linting

* linting 2

* linting 2

* linting 3

* linting 4

* linting 5

* linting 6

* 2.1 2.2

* fix small bugs

* minor updates

* linting again

* auto linting

* linting 2

* final linting

* Update python/ray/util/collective_utils.py

Co-authored-by: Richard Liaw <[email protected]>

* Update python/ray/util/collective_utils.py

Co-authored-by: Richard Liaw <[email protected]>

* Update python/ray/util/collective_utils.py

Co-authored-by: Richard Liaw <[email protected]>

* added actor test

* lint

* remove local sh

* address most of richard's comments

* minor update

* remove the actor.option() interface to avoid changes in ray core

* minor updates

Co-authored-by: YLJALDC <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>

* [serve] Merge ActorReconciler and BackendState (#13139)

* [tune] better signature check for `tune.sample_from` (#13171)

* [tune] better signature check for `tune.sample_from`

* Update python/ray/tune/sample.py

Co-authored-by: Sumanth Ratna <[email protected]>

* Disable atexit test on windows (#13207)

* [serve] Move controller state into separate files (#13204)

* Update multi_agent_independent_learning.py (#13196)

pettingzoo.utils.error.DeprecatedEnv: waterworld_v0 is now depreciated, use waterworld_v2 instead

* [Collective] Some necessary abstraction of collective calls before introducing stream management (#13162)

* [Tune] Fix PBT Transformers Example (#13174)

* [Serve] HTTPOptions for deployment modes (#13142)

* [tests] Fix Autoscaler Test failure on Windows (#13211)

* skip create_or_update tests

* Update python/ray/tests/test_autoscaler.py

Co-authored-by: Ameer Haj Ali <[email protected]>

* [BugFix][GCS]Fix gcs_actor_manager_test multithreading bug (#13158)

* [GCS]Fix TestActorSubscribeAll bug (#13193)

* [Metrics] Record per node and raylet cpu / mem usage (#12982)

* Record per node and raylet cpu / mem usage

* Add comments.

* Addressed code review.

* [Tune] Fix tune serve integration example (#13233)

* [Redis] Note that each Redis Connect retry takes two minutes (#12183)

* Slightly alter error message so it's the same in both cases.

* Each retry takes about two minutes.

* [Log] fix spdlog init race (#12973)

* fix spdlog init race

* use global logger

* refine logger name and constructor

* [Release] Add 1.1.0 release test logs (#13054)

* Add microbenchmark to release logs

* check in many_tasks stress test result

* Add results of placement group stress test for 1.1.0

* Add result for test_dead_actors test and correct the name of test_many_tasks.txt

* Add rllib regression test result

* Add pytorch test results for rllib

* remove extraneous log entries

* [Core] Fix incorrect comment (#13228)

* [Serialization] Fix cloudpickle (#13242)

* [GCS]Fix gcs table storage `GetAll` and `GetByJobId` api bug (#13195)

* Start ray client server with 'ray start' (#13217)

* [GCS]Add gcs actor schedule strategy (#13156)

* Publish job/worker info with Hex format instead of Binary (#13235)

* [RLlib] SquashedGaussians should throw error when entropy or kl are called. (#13126)

* [Serve] Rescale Serve's Long Running Test to Cluster Mode (#13247)

Now that `HeadOnly` is the new default HTTP location, we can
re-enable the long running tests to use local multi-clusters.
(Also fixed the controller's API usage so it is up to date. We should
have caught these earlier; I will open issues for this.)

* Update autoscaler-cluster yaml files for release tests (#13114)

* [Release] Use ray-ml image for long running test (#13267)

* [RLlib] Fix missing "info_batch" arg (None) in `compute_actions` calls. (#13237)

* [Tune] Improve error message for Session Detection (#13255)

* Improve error message

* log once

* [Tune] Pin Tune Dependencies (#13027)

Co-authored-by: Ian <[email protected]>

* [Dependabot] Add Dependabot (#13278)

Co-authored-by: Ian <[email protected]>

* [docker] Pull if image is not present (#13136)

* [GCS] Remove old lightweight resource usage report code path (#13192)

* [Dashboard] Add GET /log_proxy API (#13165)

* Fix a crash problem caused by GetActorHandle in ActorManager (#13164)

* [ray_client] Add metadata to gRPC requests (#13167)

* [RLlib] Preparatory PR for: Documentation on Model Building. (#13260)

* [tune](deps): Bump mlflow from 1.13.0 to 1.13.1 in /python/requirements (#13286)

* [tune](deps): Bump gluoncv from 0.9.0 to 0.9.1 in /python/requirements (#13287)

* Remove top-level ray.connect() and ray.disconnect() APIs (#13273)

* [Pull manager] Only pull once per retry period (#13245)

* .

* docs

* cleanup

* .

* .

* .

* .

Co-authored-by: Alex <[email protected]>

* [Cancellation] Make Test Cancel Easier to Debug (#13243)

* first commit

* lint-fix

* [ray_client]: first draft of documentation (#13216)

* Do not give an error if both `RAY_ADDRESS` and `address` are specified on initialization (#13305)

* Finalize handling of RAY_ADDRESS

* lint

* [serve] Clean up EndpointState interface, move checkpointing inside of EndpointState (#13215)

* [RLlib] SlateQ Documentation (#13266)

* [RLlib] Add more detailed Documentation on Model building API (#13261)

* [tune] convert search spaces: parse spec before flattening (#12785)

* Parse spec before flattening

* flatten after parse

* Test for ValueError if grid search is passed to search algorithms

* remove empty extras streaming deps (#12933)

* add the method annotation and a comment explaining what's happening (#13306)

Change-Id: I848cc2f0beaed95340d9de7cca19a50c78d9da9a

* Use wait_for_condition to reduce flakiness in test_queue.py::test_custom_resources (#13210)
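
The pattern behind this change is to poll until a condition holds instead of sleeping for a fixed time and hoping the cluster state has caught up. The helper below is a minimal standard-library sketch of that pattern. `wait_until` is a hypothetical name, and its signature is an assumption rather than that of Ray's own `wait_for_condition` test utility.

```python
import time


def wait_until(predicate, timeout_s=10.0, interval_s=0.1):
    """Poll predicate() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.2f seconds" % timeout_s)
        time.sleep(interval_s)


# Hypothetical condition that becomes true on the third poll.
polls = {"count": 0}


def resources_ready():
    polls["count"] += 1
    return polls["count"] >= 3


wait_until(resources_ready, timeout_s=2.0, interval_s=0.01)
print(polls["count"])  # prints 3
```

A test written this way passes as soon as the condition is met and fails with a clear timeout otherwise, which is what reduces flakiness compared with a fixed `time.sleep`.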

* [RLlib] Issue 13330: No TF installed causes crash in `ModelCatalog.get_action_shape()` (#13332)

* [serve] Cleanup backend state, move checkpointing and async goal logic inside (#13298)

* fix removal of task dependencies (#13333)

Co-authored-by: senlin.zsl <[email protected]>

* [Serve] Support Starlette streaming response (#13328)

* [RLlib] Make TFModelV2 behave more like TorchModelV2: Obsolete register_variables. Unify variable dicts. (#13339)

* [client] Report number of currently active clients on connect (#13326)

* wip

* update

* update

* reset worker

* fix conn

* fix

* disable pycodestyle

* Implement internal kv in ray client (#13344)

* kv internal

* fix

* [Tune] Rename MLFlow to MLflow (#13301)

* Forgot overwrite parameter in Ray client internal kv

* Fix typo in Tune Docs (Checkpointing) (#13348)

See issue #13299

* [Kubernetes][Docs] GPU usage (#13325)

* gpu-note

* gpu-note

* More info

* lint?

* Update doc/source/cluster/kubernetes.rst

Co-authored-by: Richard Liaw <[email protected]>

* Update doc/source/cluster/kubernetes.rst

Co-authored-by: Richard Liaw <[email protected]>

* Update doc/source/cluster/kubernetes.rst

Co-authored-by: Richard Liaw <[email protected]>

* Update doc/source/cluster/kubernetes.rst

Co-authored-by: Richard Liaw <[email protected]>

* GKE->Kubernetes

Co-authored-by: Richard Liaw <[email protected]>

* Revert "[RLlib] Make TFModelV2 behave more like TorchModelV2: Obsolete register_variables. Unify variable dicts. (#13339)" (#13361)

This reverts commit e2b2abb88b82c0c2402a338bba51e5dbd1739419.

* [Dependabot] [CI] Re-configure Dependabot and disable duplicate builds (#13359)

* [tune] buffer trainable results (#13236)

* Working prototype

* Pass buffer length, fix tests

* Don't buffer per default

* Dispatch and process save in one go, added tests

* Fix tests

* Pass adaptive seconds to train_buffered, stop result processing after STOP decision

* Fix tests, add release test

* Update tests

* Added detailed logs for slow operations

* Update python/ray/tune/trial_runner.py

Co-authored-by: Richard Liaw <[email protected]>

* Apply suggestions from code review

* Revert tests and go back to old tuning loop

* nit

Co-authored-by: Richard Liaw <[email protected]>

* [Serve] Add dependency management support for driver not running in a conda env (#13269)

* [RLlib] Add `__len__()` method to SampleBatch (#13371)

* [Serve] Backend state unit tests (#13319)

* trigger doc build for serve updates (#13373)

* [Object Spilling] Long running object spilling test (#13331)

* done.

* formatting.

* Remove unimplemented GetAll method in actor info accessor (#13362)

* [Doc] Remove trailing whitespaces (#13390)

* Enable Ray client server by default (#13350)

* update

* fix

* fix test

* update

* [RLlib] Trajectory View API: Atari framestacking. (#13315)

* [ray_client]: Wait for ready and retry on ray.connect() (#13376)

* [ray_client]: wait until connection ready

Change-Id: Ie443be60c33ab7d6da406b3dcaa57fbb7ba57dd6

* lint

Change-Id: I30f8e870bbd5f8859a9f11ae244e210f077cedd0

* docs and retry minimum

Change-Id: I43f5378322029267ddd69f518ce8206876e2129d

* [Dashboard] Fix missing actor pid (#13229)

* [ray_client]: Fix multiple attempts at checking connection (#13422)

* Plumb retries update (#13411)

* [Serve] [Doc] Improve batching doc (#13389)

* [autoscaler/k8s] [CI] Kubernetes test ray up, exec, down (#12514)

* Fix Serve release test (#13385)

* Add bazel logs upload to GHA (#13251)

* [tune] Fix f-string in error message (#13423)

* [serve] Pull out goal management logic into AsyncGoalManager class (#13341)

* Make request_resources() use internal kv instead of redis pub sub (#13410)

* Remove unused handler methods (#13394)

* [Tune] Pin Transitive Dependencies (#13358)

* Split out the part of get_node_ip_address for which the docstring is correct (#12796)

* Fix raylet::MockWorker::GetProcess crashes (#13440)

Co-authored-by: 刘宝 <[email protected]>

* Revert "Enable Ray client server by default (#13350)" (#13429)

This reverts commit 912d0cbbf912d5b52d6176155bdff02f504b657d.

* Fix linter error (#13451)

* [GCS]Add gcs resource scheduler (#13072)

* [RLlib] Redo: Make TFModelV2 fully modular like TorchModelV2 (soft-deprecate register_variables, unify var names wrt torch). (#13363)

* [Core]Fix raylet scheduling bug (#13452)

* [Core]Fix raylet scheduling bug

* fix lint error

* fix lint error

Co-authored-by: 灵洵 <[email protected]>

* [joblib] joblib strikes again but this time on windows (#13212)

* [ray_client]: fix exceptions raised while executing on the server on behalf of the client (#13424)

* [kubernetes][minor] Operator garbage collection fix (#13392)

* [Core][CLI] `ray status` and `ray memory` no longer starts a new job (#13391)

* Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init()

* Modify ray status cli so that it doesn't start a new job via ray.init()

* Remove local test file

* Make status and error args required in commands.py#debug.status

* Remove unnecessary imports

* Job 38482.1 should now pass

* Resolve merge conflict

* [RLlib] Deflake 2x remote & local inference tests (external env). (#13459)

* [docs] Add more guideline on using ray in slurm cluster (#12819)

Co-authored-by: Sumanth Ratna <[email protected]>
Co-authored-by: PENG Zhenghao <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>

* [Dashboard] Fix GPU resource rendering issue (#13388)

* [Release] Fix Serve release test (#13303)

The Docker image we were using now runs as the `ray` user, so we have to
call sudo.

* [serve] Properly obey SERVE_LOG_DEBUG=0 (#13460)

* Fix getting runtime context dict in driver (#13417)

* [xgb] re-enable xgboost_ray tests (#13416)

* re-enable

* fix

* update xgb_ray version

* [Serialization] New custom serialization API (#13291)

* new serialization API with doc & test

* add more notes

* refine notes

* doc

* [Core] Ownership-based Object Directory: Consolidate location table and reference table. (#13220)

* Added owned object reference before Plasma put on Create() + Seal() path.

* Consolidated location table and reference table in reference counter.

* Restore type in definition.

* Clean up owned reference on failed Seal().

* Added RemoveOwnedObject test for reference counter.

* Guard against ref going out of scope before location RPCs.

* Add 'owner must have ref in scope' precondition to documentation for object location methods.

* Move to separate Create() + Seal() methods for existing objects.

* Clearer distinction between Create() and Seal() methods.

* Make it clear that references will normally be cleaned up by reference counting.

* [ray_client]: Support runtime_context as metadata (#13428)

* [GCS]Remove unused class variable (#13454)

* [Object Spilling] Dedup restore objects (#13470)

* done.

* Addressed code review.

* [CI] Enable Dashboard tests for master (#13425)

* [docker/dashboard] Fix ray dashboard (#12899)

* [CI] Fix Windows Bazel Upload (#13436)

* Return version info from Ray client connect, to allow for discovering version mismatches

* Update ID specification doc (#13356)

* [ray_client]: fix wrong reference in server_pickler (#13474)

Change-Id: Ie3d219541b1875e986e72e3ae73ece145c715acf

* Bump dev branch to 2.0 to avoid endless version bump toil (#13497)

* wip

* fix

* fix

* Remove an unnecessary file (#13499)

* [Tests] Skip failing windows tests (#13495)

* skip failing windows tests

* skip more

* remove

* updates

* [tune] fix small docs typo (#13355)

Signed-off-by: Richard Liaw <[email protected]>

* move message to debug (#13472)

* Minimal version of piping autoscaler events to driver logs (#13434)

* sync write internal config in gcs (#13197)

* Refactor node manager to eliminate `new_scheduler_enabled_` (#12936)

* [GCS]Only publish changed field when node dead (#13364)

* Only update changed field when node dead

* node_id missed

* [CI] Buildkite PR Environment for Simple Tests (#13130)

* [GCS] Remove task info publish as nowhere uses it (#13509)

* Remove task info publish as nowhere uses it

* simplify right publish channel

* [RLlib] Solve PyTorch/TF-eager A3C async race condition between calling model and its value function. (#13467)

* [tune] placement group support (#13370)

* [Serve] Allow ObjectRef for Composition (#12592)

* Add Dashboard Python Test to Buildkite (#13530)

* Add ability to not start Monitor when calling `ray start` (#13505)

* [tune] support experiment checkpointing for grid search (#13357)

* Fix typo (#13098)

* Remove PYTHON_MODE that is not defined in Ray so that import * will work from other packages. (#13544)

* [RLlib] MARWIL loss function test case and cleanup. (#13455)

* [RLlib] Deprecate `vf_share_layers` in top-level PPO/MAML/MB-MPO configs. (#13397)

* [RLlib] Env directory cleanup and tests. (#13082)

* [RLlib] Issue 9071 A3C w/ RNN not working due to VF assuming no RNN. (#13238)

* Fix passing env on windows (#13253)

* [Object Spilling] Remove retries and use a timer instead. (#13175)

* [metrics] Better validation for tags (#13421)

* [Tune] MLflow Credentials (#13533)

* Make AWSNodeProvider.create_node return nodes created (#13498)

* Make AWSNodeProvider.create_node return node config

* return-dict

* Node provider interface create node return type Any

* Type clarification.

* Delete debug code

* Oops reset example-full changes

* Return type specified. GCP create node returns None.

* Article

* Fix Docker Permission for Serve release test again (#13543)

* Pipe monitor.err logs to driver

* Debug info to GCS pub sub (#13564)

* Fix restoration request dedup issues. (#13546)

* [core] refactor disconnect message processing and enrich WorkExitType (#13527)

* [core] refactor disconnect message processing and enrich WorkExitType

add changes from refactor pr

fix type typo

fix typo

fix

* address comments

* also update WorkerTableData

* fix tests

* [GCS]Only publish fields used by sub clients in WorkerTableData (#13508)

* Revert "Pipe monitor.err logs to driver" (#13574)

This reverts commit a0d08c2cc638c1766a08e2030642c9b434609efa.

* [tune] wandb - WandbLogger now also accepts wandb.data_types.Video (#13169)

* [tune] Allow actor reuse for new trials (#13549)

* Allow actor reuse for new trials

* Fix tests and update conf when starting new trial

* Move magic config to `reset_trial`

* [Core] add thread name to help performance profiling (#13506)

* Extra fix ray client newline (#13577)

* [xgboost] Add XGBoost release tests (#13456)

* Add XGBoost release tests

* Add more xgboost release tests

* Use failure state manager

* Add release test documentation

* Fix wording

* Automate fault tolerance tests

* Fix for operator role definition to add raycluster/finalizer (#13567)

* [metrics] Check that all tag_keys are set when recording (#13420)

* [Core] Remove 'PlasmaBuffer' in the buffer header (#13188)

* Sync Bonsai Changes in 1.1.0 (#49)

* [autoscaler/AWS] Updated AWS Node Provider threading logic (#11422)

* [autoscaler] Add rsync_exclude and rsync_filter options to cluster config (#11512)

* Add --worker-port-list option to ray start (#11481)

* [hotfix] Pin node version (fix linux wheel build) (#11532)

Co-authored-by: Max Fitton <[email protected]>

* [Core] Allow creating tasks/actors in a detached actor when driver has exited (#11493)

* Allow creating tasks/actors in a detached actor when driver has exited

* lint

* Address comment

* [Autoscaler] Do not count unmanaged nodes in load metrics (#11458)

* fixed

* lint

* fixed other test case

* .

Co-authored-by: Alex Wu <[email protected]>

* [RaySGD] Docs for SGD+Tune usage (#11479)

* Clean up release tests (#11420)

* [tune] a tiny ptl example (#11497)

* [yaml] HotFix for correct example full (#11584)

* [releng]: Quiet Docker Push (and explain why) (#11623)

* [release] Do not tag docker latest on release builds  (#11694)

* fix

* Added comment

Co-authored-by: Alex Wu <[email protected]>

* [tune] fixed validation for search metrics (#11583)

* fixed validation for search metrics

* formatting

* made error report better

* if only one metric is missing extract it from list

* any can take a generator

* Fix asyncio plasma integration in cluster mode (#11665)

* [tune] PB2 (#11466)

Co-authored-by: Sumanth Ratna <[email protected]>
Co-authored-by: Amog Kamsetty <[email protected]>
Co-authored-by: Amog Kamsetty <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>

* Version bump 1.0.1

* Disable validation of cluster config on the cluster to allow for cluster configs with new properties. (#11693)

* [Hotfix] Pin Pydantic Version (#11622)

* [docker] Fix docker regex (#11726)

Co-authored-by: Alex Wu <[email protected]>

* [GCS]Decouple node failure detector with resource related operations (#11465)

* [Placement Group] Placement group automatic cleanup. (#11546)

* In progress. Done with all placement group manager code.

* It is working with job.

* Finished detached actor implementation.

* Fix minor issue.

* In progress.

* Addressed code review.

* Addressed code review.

* Addressed code review.

* Fix a build error.

* [docker] Push to DockerHub in CI (#11442)

* [docker] Disable Readme push to avoid errors (#11770)

* Release testing things

* rllib regression results

* [Metrics] Implement basic metrics changes (#11769)

* Implement basic metrics changes

* Addressed code review.

* Fix build issue.

* Fix build issue.

* [Core] Fix ray start failure due to bug of redis address detection (#11735)

* Fix ray start failure due to redis address detection bug

* Address comment

* [Test] Ignore setproctitle for local mode (#11819)

* [Dashboard] Patch issue in 1.0.1 release where worker stats are not present for a node (#12062)

* [autoscaler] Add the cluster_name to docker file mounts directory prefix to make it more unique (#11600)

* Set version to 1.0.1.post1

* Sync Bonsai Changes in 1.0.1 (#47)

* Bump up the version to 0.8.6

* Linting fix.

* Add release test running full asan python test (#8836)

* [MERGE TO MASTER] Add microbenchmark result.

* Fix asyncio re-entry error message (#8842)

* Change os.uname()[1] and socket.gethostname() to the portable and faster platform.node() (#8839)

Co-authored-by: Mehrdad <[email protected]>

* [serve] Fix long running failure test (#8863)

* [Serve] Serve long running test fix (#8864)

* Replace ps call with psutil (#8851)

* Replace ps call with psutil

* Minor formatting

Co-authored-by: Mehrdad <[email protected]>
Co-authored-by: Robert Nishihara <[email protected]>

* [Core] Fix a detached actor bug fix when GCS actor management is off. (#8843)

* [Testing] Fix LINT/sphinx errors. (#8874)

* Node failure test fix (#8882)

* [core] Check that port is unused before assigning to worker (#8773)

* [rllib] Set framework to tf by default and remove import checks; "Auto" option (#8748)

* tf by default

* Update rllib/agents/trainer.py

Co-authored-by: Sven Mika <[email protected]>

* remove it

* fix

* remove

* fix

* lint

Co-authored-by: Sven Mika <[email protected]>

* [RLlib] Issue 8889: action clipping bug ppo not learning mujoco (#8898)

* Fix Windows build (#8905)

Co-authored-by: Mehrdad <[email protected]>

* Use no_restart=False for ray.kill in Serve failure test (#8952)

* Display GPU Utilization in the Dashboard (#8564)

* Update incorrect detached actor docs (#8930)

* [Dashboard] Dashboard pubsub hotfix. (#8944)

* [CI] Fix Conda Permission on MacOS Github Action(#9004)


Co-authored-by: Mehrdad <[email protected]>

* Update pandas to 1.0.5 (#9065)

Co-authored-by: Mehrdad <[email protected]>

* Do not add reference count when it is local mode. (#8979)

* [Dashboard] Update the Ray dashboard documentation to explain memory view. (#8945)

* Windows compatibility (#93)

Co-authored-by: mehrdadn <[email protected]>
Co-authored-by: Mehrdad <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>

* Preparing 0.8.6 (#26)

* Updated Version to 0.8.5.

* Formatting.

* Fix Serve long running test (#8223)

* Fix release 0.8.5 tests for PPO torch Breakout. (#8226)

* Remove logging (#8211)

* [BRING BACK TO MASTER] Fix cluster.yaml config.

* [rllib] Copy plasma memory before adding data to replay buffer

* [sgd] Resource limit lift for GPU test (#8238)

* Fix resource_ids_ data race (#8253)

* [rllib] [hotfix] Remove assert that trips on pytorch multiagent (#8241)

* [BRING BACK TO MASTER] add torch download for rllib regression test.

* [serve] Master actor fault tolerance (#8116)

* [serve] Add delete_backend call (#8252)

* Fix resource_ids_ data race (#8253)

* [serve] Add delete_endpoint call (#8256)

* [serve] Refactor BackendConfig (#8202)

* Delete example files.

* Fix serve long running test (#8268)

* [tune] Avoid breakage - soft deprecation warning for search algs (#8258)

* [tune] Hotfix Ax breakage when fixing backwards-compat (#8285)

* Async actor microbenchmark Script (#8275)

* [core] Disable GCS actor management (#8271)

* Pin redis-py version (#8290)

* [BRING BACK TO MASTER] add pip install upgrade to the command.

* Add ipython as dependency for autoscaler container (#8297)

Co-authored-by: rbusche <[email protected]>

* Revert "Async actor microbenchmark Script (#8275)"

This reverts commit 6a6eead1fe45c774ce75da0d5f90f443ac3748ec.

* Docs and LINT.

* [RLlib] Increasing reusability v0 (#8)

* Set up CI with Azure Pipelines

Specifically, we are setting a
travis like ADO pipeline following
what is already present in the .travis.yml
file in the root of the repo.

* Separating travis like pipeline from main pipeline

* Adding Jenkins jobs equivalent

* Making some improvements

* Adding validation of the upstream CI

* Disabling Tune and large memory tests

* Changing threshold for simple reservoir sampling test

* Addressing comments

* Updating Azure Pipelines with travis updates

* Updating Azure Pipelines with more travis updates

* Updating CI with new cpp worker tests

* Setting code owners

* Fixing the version number generation

* Making main pipeline also our release pipeline

* Updating Azure Pipelines with travis updates

* Fixing wheels test

* Fixing codeowners

* Updating Azure Pipelines with travis updates

* Bumping up MACOSX_DEPLOYMENT_TARGET

* Updating Azure Pipelines with travis updates

* Updating Azure Pipelines with travis updates

* Updating Azure Pipelines with travis updates

* Disabling Serve tests

* Making explicit which branches GitHubActions workflows should watch

* Disabling Ray serve tests

* Installing numpy explicitly

* consolidating Ray test steps in one yml

* Making worker set, apex and ppo a little bit more reusable for custom agents

* Making Dynamic TF policy more reusable

* Allow the actions dict carry user data defined for the episodes

* Forcing RLlib tests to run always

* Making SAC model more extensible

* Adapting exploration API

* Reverting the random worker index change

* Making epsilon configurable

* Fixing method doc

* Fixing arguments check in reset_schedule

* Fixing per worker epsilon greedy

* Activating logs for failing test

* Making original_space check more robust

* Allow normalized actions rescaling to happen outside RLlib

* Passing infos values from agents to callbacks

* Installing node js using a task

* Adding kwargs in TFModels

* Fixing npm and node in mac

* Fixing the num workers value passed

* Forcing RLlib tests

* Merging 0.8.5

* Running some RLlib test in custom agent

* Adding echo bazelisk

* Force CI

* Force CI

* Relaxing an installation

* Using container jobs

* Fixing container jobs

* Change base image for container job

* Install with sudo

* Exec with sudo

* Test

* Changing agent pool

* Remove python selection

* Fix version replacement

* Fix version replacement

* Trying Bazel

* Installing node with sudo

* Run all install as sudo

* Reverting sudo -s

* Fixing omitted param

* install python manually

* Fixing missing param

* Making NVM available

* Fix nvm installation

* Fix copy-paste

* renaming to req file

* fix typo

* Install JDK 8

* Install req in other jobs

* Install JDK with sudo

* Removing docker clean up

* Install Docker

* fix installation issue

* Adding azure package source

* Fix docker permissions

* Install jq

* downloading with sudo

* Install llvm as root

* Skipping flaky test

* copy artifacts as sudo

* Fix Bazel build in MacOS (#23)

* Fixing mac os building issue

* Bazelisk check

* Increase bazel version

* Fixing typos

* Update hash

* Include unzip

* Improved compilation and convergence tests

Added compilation tests that follow proper PyTest conventions.
These tests use parametrized settings, and allow for multiple algorithms to be
tested with a single test.
I've commented out the tests that these two tests can replace, to show the improvement.
Only about half of the algorithms have been transitioned to the new tests in
interest of keeping the PR small.

* Increasing bazel version

* Increasing bazel version only mac pipelines

* Printing system info in Ubuntu wheels pipeline

* making docker install optional

* Compilation and convergence tests for more algos

Added compilation and convergence tests for Apex DQN, Apex DDPG
Added convergence tests for SAC
Removed old (commented out) compilation test code from
`rllib.agents.dqn.tests.test_apex`

* Clean up

Deleted old (commented out) test code

* Updated BUILD file

Split tests into test_compilation and test_learning.py to work with BAZEL build files.

* Updated BUILD file

Fixed bug in BUILD - wrong files passed in.

* BugFix: Improper imports causing test failures

* BugFix: Improper imports causing test failures

* Removed test_appo from BUILD file

* Fixing copy-paste error

* Applying some bazel fixes

* Fixing installation issues

* Update hash

* Fixing NVM/NODE installation

* Applying latest changes in travis.yml

* Fixing fixture data exclusions

* Disable some java tests

* Adgudime/apex sac (#25)

* WIP: Compilation tests work

* Fixed bugs with Apex SAC continuous action spaces

* Bugfix: Bad imports

* Fixing PyArrow issue

* Fixing guava check

* Fix datetime java format

* Fixing Bazel issues finding or loading conftest

* Fixing pytest module loading order

* Trying different approach to pickle check

* Installing latest pickle5 explicitly

* Fixing conftest resolution

* Temporarily disabling pickle5 validation

* Fixing fixture data exclusions

* Fixing data files treated as src

* Disable some java tests

Co-authored-by: Edilmo Palencia <[email protected]>

* Fix multiple CI errors

* Update hash

* Fixing more build issues

* Fixing more build issues

* Fix pipeline cache path

* More fixes

* Fix cache

* Fixing bazel test command

* Fix bazel test

* Allowing custom summarize episodes

* Adding custom metrics ops in exec plan

* Apex SAC exploration should be stochastic

* Letting DQN deal with reshaping for Discrete spaces

* Commenting the cache

Co-authored-by: SangBin Cho <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Sven Mika <[email protected]>
Co-authored-by: ijrsvt <[email protected]>
Co-authored-by: Eric Liang <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Co-authored-by: Stephanie Wang <[email protected]>
Co-authored-by: Rüdiger Busche <[email protected]>
Co-authored-by: rbusche <[email protected]>
Co-authored-by: sven1977 <[email protected]>
Co-authored-by: Aditya Gudimella <[email protected]>

* Fix system info step (#29)

* Fix system info step (#30)

* adding testing framework (#28)

* adding testing framework

* install kubebuilder for testing

* adding correct hash

Co-authored-by: Ali Kanso <[email protected]>

* add shared mem max flag

* change readme

* Tuned hyperparams for ApexSAC

* Bugfix for exploration config.

* Allowing PPO to handle async sampling (#34)

* Making ppo ParallelRollouts mode configurable

* Making dqn ParallelRollouts mode configurable

* Making RolloutWorker generator function public

* Missing argument

* Stop iteration if round robin proportion is not met

* fixing wheels parsing

* Improving iter union stop-iteration conditions

* Fixing DDPG

* Fixing MADDPG

* Fix tflite compat issue (#35)

* Fix tflite compat issue

* Fixing iter corner case

* Manual stride with ellipsis

* Fix unnecessary stop iteration

* Allow replay ops to stop if they are unhealthy (#36)

* Allow the replay ops to stop if they are unhealthy

* Allowing to configure dqn execution plan consistently

* Making configurable concurrency mode in DQN and metric collection in Apex (#37)

* Fixing concurrency op in dqn (#38)

* Replaced Prioritized Experience Replay with normal Experience replay to create AsyncSAC.

* Setting prioritized_replay in config now uses PrioritizedReplay correctly.

* Renamed LocalAsyncReplayBuffer and AsyncReplayActor to better reflect usage

* Added test with prioritized_replay set to True

* Cleaned up code.

* Fixing manual slicing (#40)

* Fixing manual slicing

* Handling the Box space explicitly

* Including the force stop in gather_async (#41)

* Including the force stop in gather_async

* Fix missing bar

* Fix for gather across shards

* Fix for gather async extreme case

* Making env-runner an explicit iterator and Local Iterator regenerable  (#42)

* Making env-runner an explicit iterator
And also making the LocalIterator able to regenerate.

* Fix multi agent test

* Fix union

* Making infinite sequence explicit

For the sake of the parallel iterators, an iterator that holds an infinite sequence can be called again after a StopIteration.
In other words, a StopIteration from an infinite sequence must be seen as a "no items available" message.

* Fix unexpected error

* Fixing gym version

* Update hash

* Addressing comments

* Improve gathering async and by shards (#44)

* Improve gathering async and by shards

* Making ParallelIteratorWorker an explicit Iterator in all cases

* Making ParallelIteratorWorker an explicit Iterator in all cases

* Fixing inverted condition

* Removing ForceStopIteration

* Make seeding possible even if env cannot be seeded.

* Fix grep versions (#46)

* Fix grep versions

* Splitting the stages

* Using pool for all rllib

* Update hash

* fixing path permissions

* Changing node version

* Reverting some OS changes

* Fixing compilation errors

* More compilation errors

* More compilations errors

* Fix node installation

* Fixing some package versions

* Using right bazel version

* Fix mac os version in wheels

* Fix mac os version in wheels

* Some minor fixes

* Force the target mac os

* Fix path

* Disable stress test temporarily

* Fixing gitignore

* Fixing Sampler merge mistakes

* Fixing epsilon greedy merge mistakes and requirements versions

* Fix merge error

* Apply changes in travis.yml

* Fix several issues

* Fixing more compatibility bugs

* Fix more incompatibilities

* More incompatibilities

* Fixing more compat issues

* Disable tune horovod torch tests

* Fixing more tests

Co-authored-by: SangBin Cho <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: mehrdadn <[email protected]>
Co-authored-by: Mehrdad <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Co-authored-by: Robert Nishihara <[email protected]>
Co-authored-by: Sven Mika <[email protected]>
Co-authored-by: Ian Rodney <[email protected]>
Co-authored-by: Eric Liang <[email protected]>
Co-authored-by: Stephanie Wang <[email protected]>
Co-authored-by: Max Fitton <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>
Co-authored-by: Rüdiger Busche <[email protected]>
Co-authored-by: rbusche <[email protected]>
Co-authored-by: sven1977 <[email protected]>
Co-authored-by: Aditya Gudimella <[email protected]>
Co-authored-by: Ali Kanso <[email protected]>
Co-authored-by: Ali Kanso <[email protected]>

* Applying travis.yml changes

* Use latest pip

* Update the hash

* Fix rllib issues

* Fix rllib issues 2

* Fix tune errors

* Fix ray issues

* Remove old operator

* revert some rllib test deletions

* revert changes on release folder

* Revert more changes

* Logging dashboard building

* Use previous docker image

* Use centos docker image

* more logging

* Comment step

* hash

* installing node 14

* Fix hash

Co-authored-by: Gekho457 <[email protected]>
Co-authored-by: Alan Guo <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Co-authored-by: Max Fitton <[email protected]>
Co-authored-by: Max Fitton <[email protected]>
Co-authored-by: Kai Yang <[email protected]>
Co-authored-by: Alex Wu <[email protected]>
Co-authored-by: Alex Wu <[email protected]>
Co-authored-by: Amog Kamsetty <[email protected]>
Co-authored-by: Barak Michener <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>
Co-authored-by: Ian Rodney <[email protected]>
Co-authored-by: Raoul Khouri <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Jack Parker-Holder <[email protected]>
Co-authored-by: Sumanth Ratna <[email protected]>
Co-authored-by: Amog Kamsetty <[email protected]>
Co-authored-by: Alan Guo <[email protected]>
Co-authored-by: Tao Wang <[email protected]>
Co-authored-by: SangBin Cho <[email protected]>
Co-authored-by: Ameer Haj Ali <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: mehrdadn <[email protected]>
Co-authored-by: Mehrdad <[email protected]>
Co-authored-by: Robert Nishihara <[email protected]>
Co-authored-by: Sven Mika <[email protected]>
Co-authored-by: Eric Liang <[email protected]>
Co-authored-by: Stephanie Wang <[email protected]>
Co-authored-by: Max Fitton <[email protected]>
Co-authored-by: Rüdiger Busche <[email protected]>
Co-authored-by: rbusche <[email protected]>
Co-authored-by: sven1977 <[email protected]>
Co-authored-by: Aditya Gudimella <[email protected]>
Co-authored-by: Ali Kanso <[email protected]>
Co-authored-by: Ali Kanso <[email protected]>

* Apply changes in travis.yml

* Apply changes in travis.yml

* Fix hash

* Fix sampler

* node 14

* Fix sampler 2

* Disable flaky test

* Fix tune test

Co-authored-by: Alex Wu <[email protected]>
Co-authored-by: Sven Mika <[email protected]>
Co-authored-by: Eric Liang <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: fyrestone <[email protected]>
Co-authored-by: fangfengbin <[email protected]>
Co-authored-by: Barak Michener <[email protected]>
Co-authored-by: DK.Pino <[email protected]>
Co-authored-by: Ameer Haj Ali <[email protected]>
Co-authored-by: Antoni Baum <[email protected]>
Co-authored-by: Kai Fricke <[email protected]>
Co-authored-by: Stephanie Wang <[email protected]>
Co-authored-by: Max Fitton <[email protected]>
Co-authored-by: Corey Lowman <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Co-authored-by: SangBin Cho <[email protected]>
Co-authored-by: Sumanth Ratna <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>
Co-authored-by: ZhuSenlin <[email protected]>
Co-authored-by: senlin.zsl <[email protected]>
Co-authored-by: Michael Luo <[email protected]>
Co-authored-by: Alind Khare <[email protected]>
Co-authored-by: Siyuan (Ryans) Zhuang <[email protected]>
Co-authored-by: Hao Zhang <[email protected]>
Co-authored-by: architkulkarni <[email protected]>
Co-authored-by: Lavanya Shukla <[email protected]>
Co-authored-by: chaokunyang <[email protected]>
Co-authored-by: Ian Rodney <[email protected]>
Co-authored-by: ijrsvt <[email protected]>
Co-authored-by: 刘宝 <[email protected]>
Co-authored-by: Qing Wang <[email protected]>
Co-authored-by: Amog Kamsetty <[email protected]>
Co-authored-by: Dmitri Gekhtman <[email protected]>
Co-authored-by: Qing Wang <[email protected]>
Co-authored-by: Ameer Haj Ali <[email protected]>
Co-authored-by: Alex Wu <[email protected]>
Co-authored-by: Ameer Haj Ali <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Clark Zinzow <[email protected]>
Co-authored-by: Gabriele Oliaro <[email protected]>
Co-authored-by: Raed Shabbir <[email protected]>
Co-authored-by: Tao Wang <[email protected]>
Co-authored-by: YLJALDC <[email protected]>
Co-authored-by: Basu Jindal <[email protected]>
Co-authored-by: Ameer Haj Ali <[email protected]>
Co-authored-by: dHannasch <[email protected]>
Co-authored-by: Lingxuan Zuo <[email protected]>
Co-authored-by: Philipp Moritz <[email protected]>
Co-authored-by: Hao Chen <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Alex <[email protected]>
Co-authored-by: Akash Patel <[email protected]>
Co-authored-by: Edwin Goh <[email protected]>
Co-authored-by: Maltimore <[email protected]>
Co-authored-by: 灵洵 <[email protected]>
Co-authored-by: Micah Yong <[email protected]>
Co-authored-by: PENG Zhenghao <[email protected]>
Co-authored-by: SameerF <[email protected]>
Co-authored-by: Todd A. Anderson <[email protected]>
Co-authored-by: Keqiu Hu <[email protected]>
Co-authored-by: Daan Klijn <[email protected]>
Co-authored-by: dmatch01 <[email protected]>
Co-authored-by: Gekho457 <[email protected]>
Co-authored-by: Alan Guo <[email protected]>
Co-authored-by: Max Fitton <[email protected]>
Co-authored-by: Kai Yang <[email protected]>
Co-authored-by: Raoul Khouri <[email protected]>
Co-authored-by: Jack Parker-Holder <[email protected]>
Co-authored-by: Amog Kamsetty <[email protected]>
Co-authored-by: Alan Guo <[email protected]>
Co-authored-by: Tao Wang <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: mehrdadn <[email protected]>
Co-authored-by: Mehrdad <[email protected]>
Co-authored-by: Robert Nishihara <[email protected]>
Co-authored-by: Max Fitton <[email protected]>
Co-authored-by: Rüdiger Busche <[email protected]>
Co-authored-by: rbusche <[email protected]>
Co-authored-by: sven1977 <[email protected]>
Co-authored-by: Aditya Gudimella <[email protected]>
Co-authored-by: Ali Kanso <[email protected]>
Co-authored-by: Ali Kanso <[email protected]>
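
The "Making infinite sequence explicit" entry in the commit message above says that a StopIteration from an infinite sequence should be read as "no items available" rather than "finished". A minimal Python sketch of that idea follows. `ResumableQueueIterator` and `drain` are hypothetical names used only for illustration; they are not taken from Ray's iterator code.

```python
from collections import deque


class ResumableQueueIterator:
    """Iterator over an unbounded stream that may be temporarily empty.

    Hypothetical class for illustration only; not part of Ray.
    """

    def __init__(self):
        self._pending = deque()

    def push(self, item):
        # New items may arrive at any time, even after a StopIteration.
        self._pending.append(item)

    def __iter__(self):
        return self

    def __next__(self):
        if not self._pending:
            # For an infinite sequence this means "nothing available now",
            # not "the sequence has ended".
            raise StopIteration
        return self._pending.popleft()


def drain(iterator):
    """Collect whatever the iterator can yield right now."""
    return list(iterator)
```

Python's iterator protocol normally expects an exhausted iterator to keep raising StopIteration, so a consumer of an iterator like this has to know about the relaxed contract and call it again later.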
Showing 2,833 changed files with 224,652 additions and 108,079 deletions.
33 changes: 25 additions & 8 deletions .bazelrc
@@ -7,12 +7,22 @@ build --enable_platform_specific_config
build --action_env=PATH
# For --compilation_mode=dbg, consider enabling checks in the standard library as well (below).
build --compilation_mode=opt
build --experimental_ui_deduplicate
#build --cxxopt="-std=c++11"
# This workaround is needed to prevent Bazel from compiling the same file twice (once PIC and once not).
build:linux --force_pic
build:macos --force_pic
build:clang-cl --compiler=clang-cl
build:msvc --compiler=msvc-cl
# `LC_ALL` and `LANG` is needed for cpp worker tests, because they will call "ray start".
# If we don't add them, python's `click` library will raise an error.
test --action_env=LC_ALL
test --action_env=LANG
# Allow C++ worker tests to execute "ray start" with the correct version of Python.
test --action_env=VIRTUAL_ENV
test --action_env=PYENV_VIRTUAL_ENV
test --action_env=PYENV_VERSION
test --action_env=PYENV_SHELL
test --action_env=RAY_ENABLE_NEW_SCHEDULER
# This is needed for some core tests to run correctly
test:windows --enable_runfiles
# TODO(mehrdadn): Revert the "-\\.(asm|S)$" exclusion when this Bazel bug
@@ -26,9 +36,11 @@ build:msvc --per_file_copt="-\\.(asm|S)$@-WX"
# Ignore warnings for protobuf generated files and external projects.
build --per_file_copt="\\.pb\\.cc$@-w"
build --per_file_copt="-\\.(asm|S)$,external/.*@-w"
#build --per_file_copt="external/.*@-Wno-unused-result"
# Ignore minor warnings for host tools, which we generally can't control
build:clang-cl --host_copt="-Wno-inconsistent-missing-override"
build:clang-cl --host_copt="-Wno-microsoft-unqualified-friend"
build:clang-cl --host_copt="-Wno-range-loop-analysis"
# This workaround is needed due to https://github.com/bazelbuild/bazel/issues/4341
build --per_file_copt="-\\.(asm|S)$,external/com_github_grpc_grpc/.*@-DGRPC_BAZEL_BUILD"
# Don't generate warnings about kernel features we don't need https://github.com/ray-project/ray/issues/6832
@@ -51,6 +63,10 @@ build:windows --color=yes
build:clang-cl --per_file_copt="-\\.(asm|S)$@-fansi-escape-codes"
build:clang-cl --per_file_copt="-\\.(asm|S)$@-fcolor-diagnostics"

build:manylinux2010 --copt="-Wno-unused-result"
build:manylinux2010 --linkopt="-lrt"


# Debug build flags. Uncomment in '-c dbg' builds to enable checks in the C++ standard library:
#build:linux --cxxopt="-D_GLIBCXX_DEBUG=1"
#build:linux --cxxopt="-D_GLIBCXX_DEBUG_PEDANTIC=1"
@@ -86,21 +102,22 @@ aquery:ci --color=no
aquery:ci --noshow_progress
build:ci --color=yes
build:ci --curses=no
build:ci --disk_cache=~/ray-bazel-cache
build:ci --remote_cache="https://storage.googleapis.com/ray-bazel-cache"
build:ci --keep_going
build:ci --progress_report_interval=100
build:ci --show_progress_rate_limit=15
build:ci --show_task_finish
build:ci --ui_actions_shown=1024
build:ci-travis --show_timestamps # Travis doesn't have an option to show timestamps, but GitHub Actions does
# GitHub Actions has low disk space, so prefer hardlinks there.
build:ci-github --experimental_repository_cache_hardlinks
test:ci --flaky_test_attempts=5
build:ci --show_timestamps
build:ci-travis --disk_cache=~/ray-bazel-cache
build:ci-travis --remote_cache="https://storage.googleapis.com/ray-bazel-cache"
build:ci-github --experimental_repository_cache_hardlinks # GitHub Actions has low disk space, so prefer hardlinks there.
build:ci-github --disk_cache=~/ray-bazel-cache
build:ci-github --remote_cache="https://storage.googleapis.com/ray-bazel-cache"
test:ci --flaky_test_attempts=3
test:ci --nocache_test_results
test:ci --spawn_strategy=local
test:ci --test_output=errors
test:ci --test_verbose_timeout_warnings
test:ci --test_env=RAY_GCS_ACTOR_SERVICE_ENABLED

aquery:get-toolchain --include_commandline=false
aquery:get-toolchain --noimplicit_deps
2 changes: 1 addition & 1 deletion .bazelversion
@@ -1 +1 @@
3.3.0
3.4.1
29 changes: 29 additions & 0 deletions .buildkite/Dockerfile
@@ -0,0 +1,29 @@
FROM ubuntu:focal

ARG REMOTE_CACHE_URL
ARG BUILDKITE_PULL_REQUEST

ENV DEBIAN_FRONTEND=noninteractive
ENV TZ=America/Los_Angeles
ENV BUILDKITE=true
ENV CI=true
ENV PYTHON=3.6

RUN apt-get update -qq
RUN apt-get install -y -qq \
curl python-is-python3 git build-essential \
sudo unzip apt-utils dialog tzdata wget
RUN locale -a

# Setup Bazel caches
RUN (echo "build --remote_cache=${REMOTE_CACHE_URL}" >> /root/.bazelrc); \
(if [ ${BUILDKITE_PULL_REQUEST} != "false" ]; then (echo "build --remote_upload_local_results=false" >> /root/.bazelrc); fi); \
cat /root/.bazelrc

RUN mkdir /ray
WORKDIR /ray

# Below should be re-run each time
COPY . .
RUN ./ci/travis/ci.sh init
RUN bash --login -i ./ci/travis/ci.sh build
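
The "Setup Bazel caches" step in the Dockerfile above always writes a remote cache line and, for pull-request builds, also turns off uploading of local results. A small Python sketch of that branching is given here. The function name `bazelrc_cache_lines` is hypothetical, and the sketch assumes both build arguments are always set (the shell test in the Dockerfile may behave differently on an empty value).

```python
def bazelrc_cache_lines(remote_cache_url, buildkite_pull_request):
    """Return the lines the cache setup step would append to /root/.bazelrc.

    Hypothetical helper for illustration; it mirrors the shell logic only.
    """
    lines = [f"build --remote_cache={remote_cache_url}"]
    # The Dockerfile compares BUILDKITE_PULL_REQUEST against the string
    # "false"; any other value is treated as a pull-request build.
    if buildkite_pull_request != "false":
        lines.append("build --remote_upload_local_results=false")
    return lines
```

With this split, pull-request builds can still read from the remote cache but do not write to it.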
6 changes: 6 additions & 0 deletions .buildkite/pipeline.yml
@@ -0,0 +1,6 @@
- label: "Ray Core Tests (:buildkite: Experimental)"
commands:
- bazel test --config=ci $(./scripts/bazel_export_options) --build_tests_only -- //:all -rllib/...
- label: "Ray Dashboard Tests"
commands:
- bazel test --config=ci $(./scripts/bazel_export_options) python/ray/new_dashboard/...
25 changes: 25 additions & 0 deletions .flake8
@@ -0,0 +1,25 @@
[flake8]
exclude =
python/ray/core/generated/
streaming/python/generated
doc/source/conf.py
python/ray/cloudpickle/
python/ray/thirdparty_files/
python/build/
python/.eggs/
max-line-length = 79
inline-quotes = "
ignore =
C408
E121
E123
E126
E226
E24
E704
W503
W504
W605
I
N
avoid-escape = no
6 changes: 3 additions & 3 deletions .github/ISSUE_TEMPLATE/bug_report.md
@@ -14,9 +14,9 @@ assignees: ''
*Ray version and other system information (Python version, TensorFlow version, OS):*

### Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have **no external library dependencies** (i.e., use fake or mock data / environments):
Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have **no external library dependencies** (i.e., use fake or mock data / environments):

If we cannot run your script, we cannot fix your issue.
If the code snippet cannot be run by itself, the issue will be closed with "needs-repro-script".

- [ ] I have verified my script runs in a clean environment and reproduces the issue.
- [ ] I have verified the issue also occurs with the [latest wheels](https://docs.ray.io/en/latest/installation.html).
- [ ] I have verified the issue also occurs with the [latest wheels](https://docs.ray.io/en/master/installation.html).
8 changes: 5 additions & 3 deletions .github/ISSUE_TEMPLATE/feature_request.md
@@ -1,12 +1,14 @@
---
name: Feature request
about: Suggest an idea for Ray, Tune, RLlib, etc.
name: Feature request/Question
about: For feature requests or questions, post on https://discuss.ray.io/ instead!
title: ''
labels: enhancement, triage
labels: enhancement
assignees: ''

---

<!--Please include [tune], [rllib], [autoscaler] etc. in the issue title if relevant-->

### Describe your feature request

For feature requests or questions, post on our Discussion page instead: https://discuss.ray.io/
14 changes: 0 additions & 14 deletions .github/ISSUE_TEMPLATE/question.md

This file was deleted.

8 changes: 5 additions & 3 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -1,5 +1,7 @@
<!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. -->

## Why are these changes needed?

<!-- Please give a short summary of the change and the problem this solves. -->
@@ -11,9 +13,9 @@
## Checks

- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/latest/.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [ ] Unit tests
- [ ] Release tests
- [ ] This PR is not tested (please justify below)
- [ ] This PR is not tested :(
23 changes: 23 additions & 0 deletions .github/dependabot.yml
@@ -0,0 +1,23 @@
version: 2
updates:
# Tune/SGD/Doc requirements
- package-ecosystem: "pip"
# The requirements base directory currently only contains tune requirements.
# If we want to add more requirements here (Core, RLlib, etc.), then we should make subdirectories for each one.
directory: "/python/requirements"
schedule:
# Automatic upgrade checks Saturday at 12 AM.
# Dependabot updates can still be manually triggered via Github at any time.
interval: "weekly"
day: "saturday"
# 12 AM
time: "00:00"
# Use Pacific Standard Time
timezone: "America/Los_Angeles"
commit-message:
prefix: "[tune]"
include: "scope"
# Only 3 upgrade PRs open at a time.
open-pull-requests-limit: 3
reviewers:
- "ray-project/ray-tune"
81 changes: 81 additions & 0 deletions .github/stale.yml
@@ -0,0 +1,81 @@
# Configuration for probot-stale - https://github.com/probot/stale

# Number of days of inactivity before an Issue or Pull Request becomes stale
daysUntilStale: 120

# Number of days of inactivity before an Issue or Pull Request with the stale label is closed.
# Set to false to disable. If disabled, issues still need to be closed manually, but will remain marked as stale.
daysUntilClose: 14

# Only issues or pull requests with all of these labels are check if stale. Defaults to `[]` (disabled)
onlyLabels: []

# Issues or Pull Requests with these labels will never be considered stale. Set to `[]` to disable
exemptLabels:
- P0
- P1
- P2
- P3
- good first issue
- release-blocker
- fix-docs
- regression
- fix-error-msg

# Set to true to ignore issues in a project (defaults to false)
exemptProjects: false

# Set to true to ignore issues in a milestone (defaults to false)
exemptMilestones: true

# Set to true to ignore issues with an assignee (defaults to false)
exemptAssignees: false

# Label to use when marking as stale
staleLabel: stale

# Comment to post when marking as stale. Set to `false` to disable
markComment: |
Hi, I'm a bot from the Ray team :)
To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.
If there is no further activity in the 14 days, the issue will be closed!
- If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
- If you'd like to get more attention to the issue, please tag one of Ray's contributors.
You can always ask for help on our [discussion forum](https://discuss.ray.io/) or [Ray's public slack channel](https://github.com/ray-project/ray#getting-involved).
# Comment to post when removing the stale label.
# unmarkComment: >
# Your comment here.

# Comment to post when closing a stale Issue or Pull Request.
closeComment: |
Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.
Please feel free to reopen or open a new issue if you'd still like it to be addressed.
Again, you can always ask for help on our [discussion forum](https://discuss.ray.io) or [Ray's public slack channel](https://github.com/ray-project/ray#getting-involved).
Thanks again for opening the issue!
# Limit the number of actions per hour, from 1-30. Default is 30
# It will check 120 issues per day.
limitPerRun: 5

# Limit to only `issues` or `pulls`
only: issues

# Optionally, specify configuration settings that are specific to just 'issues' or 'pulls':
# pulls:
# daysUntilStale: 30
# markComment: >
# This pull request has been automatically marked as stale because it has not had
# recent activity. It will be closed if no further activity occurs. Thank you
# for your contributions.

# issues:
# exemptLabels:
# - confirmed
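
The numbers in the stale.yml above fit together as simple arithmetic: 120 days of inactivity is roughly the "4 months" in the bot comment, closing follows 14 days later, and 5 actions per hourly run gives the "120 issues per day" in the config comment. The Python sketch below works this through. The function names are hypothetical, and the dates are the earliest possible ones, assuming no further activity and ignoring the per-run limit.

```python
from datetime import date, timedelta

DAYS_UNTIL_STALE = 120  # daysUntilStale
DAYS_UNTIL_CLOSE = 14   # daysUntilClose
LIMIT_PER_RUN = 5       # limitPerRun, applied per hour per the config comment


def stale_timeline(last_activity):
    """Return the earliest (marked_stale, closed) dates for an inactive issue."""
    marked_stale = last_activity + timedelta(days=DAYS_UNTIL_STALE)
    closed = marked_stale + timedelta(days=DAYS_UNTIL_CLOSE)
    return marked_stale, closed


def max_actions_per_day(limit_per_run=LIMIT_PER_RUN, runs_per_day=24):
    """Upper bound on bot actions per day, assuming one run per hour."""
    return limit_per_run * runs_per_day
```

For example, an issue last touched on 2021-01-01 would be labeled stale no earlier than 2021-05-01 and closed no earlier than 2021-05-15.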
0 comments on commit 21b289e