Releasing 1.8.3 (#15757)
* chlog update

* Fix typo in script name (#15724)

(cherry picked from commit d925077)

* Torch inference mode for prediction (#15719)

torch inference mode for prediction

(cherry picked from commit 08d14ec)

* [App] Update multi-node examples (#15700)

Co-authored-by: Jirka Borovec <[email protected]>
Co-authored-by: Carlos Mocholí <[email protected]>
(cherry picked from commit 8306797)

* feature(docs/app/lit_tabs): add works (#15731)

(cherry picked from commit 1a31d13)

* [App] Fix VSCode IDE debugger (#15747)

(cherry picked from commit 6714ca7)

* Update tensorboard requirement from <2.11.0,>=2.9.1 to >=2.9.1,<2.12.0 in /requirements (#15746)

Update tensorboard requirement in /requirements

Updates the requirements on [tensorboard](https://github.com/tensorflow/tensorboard) to permit the latest version.
- [Release notes](https://github.com/tensorflow/tensorboard/releases)
- [Changelog](https://github.com/tensorflow/tensorboard/blob/master/RELEASE.md)
- [Commits](tensorflow/tensorboard@2.9.1...2.11.0)

---
updated-dependencies:
- dependency-name: tensorboard
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
(cherry picked from commit 0b58b69)

* Update beautifulsoup4 requirement from <=4.8.2 to <4.11.2 in /requirements (#15745)

* Update beautifulsoup4 requirement in /requirements

Updates the requirements on [beautifulsoup4](https://www.crummy.com/software/BeautifulSoup/bs4/) to permit the latest version.

---
updated-dependencies:
- dependency-name: beautifulsoup4
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* Apply suggestions from code review

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <[email protected]>
(cherry picked from commit 1ffbe1b)

* [App] Fix multi-node pytorch example CI (#15753)

(cherry picked from commit bc797fd)

* [App] Improve `LightningTrainerScript` start-up time (#15751)

(cherry picked from commit c2c1974)

* Enable Probot CheckGroup v5 (#15670)

(cherry picked from commit 6c8ee01)

* [App] Enable properties for the Lightning flow (#15750)

(cherry picked from commit 5cfb176)

* test for Enable setting property (#15755)

Co-authored-by: thomas chaton <[email protected]>
Co-authored-by: Ethan Harris <[email protected]>
(cherry picked from commit ba14038)

* Move s3fs to cloud extras (#15729)

Co-authored-by: Luca Antiga <[email protected]>
(cherry picked from commit dd75906)

* Revert new Hydra launch behavior (#15737)

* revert new hydra cwd behavior
* remove debug statements
* changelog

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka <[email protected]>
Co-authored-by: Jirka Borovec <[email protected]>

(cherry picked from commit 88b2e5a)

* FCCV Docs (#15598)

* add custom data iter docs

* add custom data iter docs

* Update docs/source-pytorch/data/custom_data_iterables.rst

* remove ToDevice

* nit

* Update docs/source-pytorch/data/custom_data_iterables.rst

Co-authored-by: Luca Antiga <[email protected]>

* clarification for @lantiga

* typo

* Update docs/source-pytorch/data/custom_data_iterables.rst

* Update docs/source-pytorch/data/custom_data_iterables.rst

* Update docs/source-pytorch/data/custom_data_iterables.rst

Co-authored-by: Jirka Borovec <[email protected]>
Co-authored-by: Akihiro Nitta <[email protected]>
Co-authored-by: Luca Antiga <[email protected]>
(cherry picked from commit 006fde9)

* Switch from tensorboard to tensorboardx in logger (#15728)

* Switch from tensorboard to tensorboardx in logger
* Warn if log_graph is set to True but tensorboard is not installed
* Fix warning message formatting
* Apply suggestions from code review
* simplify for TBX as required pkg
* docs example
* chlog
* tbx 2.2

Co-authored-by: Luca Antiga <[email protected]>
Co-authored-by: William Falcon <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <[email protected]>
Co-authored-by: Jirka Borovec <[email protected]>
Co-authored-by: Jirka <[email protected]>

(cherry picked from commit 9c2eb52)

* resolve conflicts

* Fix azure path excludes (#15756)

Co-authored-by: Jirka Borovec <[email protected]>
(cherry picked from commit aef94ce)

* Disable XSRF protection in StreamlitFrontend to support upload in localhost (#15684)

* Enable CORS in StreamlitFrontend to support upload
* Only disable XSRF when running on localhost
* Update test
* Use utility fn to detect if localhost

Co-authored-by: Luca Antiga <[email protected]>
(cherry picked from commit ed3eef0)

* Enable Probot CheckGroup v5.1 (#15763)

(cherry picked from commit c55f80f)

* Bump pytest from 7.1.3 to 7.2.0 in /requirements (#15677)

Bumps [pytest](https://github.com/pytest-dev/pytest) from 7.1.3 to 7.2.0.
- [Release notes](https://github.com/pytest-dev/pytest/releases)
- [Changelog](https://github.com/pytest-dev/pytest/blob/main/CHANGELOG.rst)
- [Commits](pytest-dev/pytest@7.1.3...7.2.0)

---
updated-dependencies:
- dependency-name: pytest
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
(cherry picked from commit cfb27bd)

* Fix the `examples/app_dag` App (#14359)

* Fix app dag example
* Add test
* Update doc
* Update tests/tests_app_examples/test_app_dag.py

Co-authored-by: Sherin Thomas <[email protected]>
(cherry picked from commit 2b61c92)

* mergify: drop ready for draft (#15766)

(cherry picked from commit 1a07a9c)

* lightning delete cluster CLI command help text update (#15760)

* updated the lighting delete cluster CLI command help text output
* updated changelog
* typo fix
* Apply suggestions from code review

Co-authored-by: Jirka Borovec <[email protected]>
(cherry picked from commit 75b0573)

* Deduplicate top level lighting CLI command groups (#15761)

* unify remove and delete command groups & the add and delete command groups
* added changelog
* fix tests
* Apply suggestions from code review

Co-authored-by: Jirka Borovec <[email protected]>
(cherry picked from commit 7b2788e)

* releasing 1.8.3

* CI: lite on GPU

* Fix App Docs for lightning ssh-keys command (#15773)

fixed ssh-keys docs

(cherry picked from commit 317591d)

Co-authored-by: thomas chaton <[email protected]>
Co-authored-by: yiftachbeer <[email protected]>
Co-authored-by: Sherin Thomas <[email protected]>
Co-authored-by: Ethan Harris <[email protected]>
Co-authored-by: Yurij Mikhalevich <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Carlos Mocholí <[email protected]>
Co-authored-by: Luca Antiga <[email protected]>
Co-authored-by: Adrian Wälchli <[email protected]>
Co-authored-by: Justus Schock <[email protected]>
Co-authored-by: Kaushik B <[email protected]>
Co-authored-by: Rick Izzo <[email protected]>
13 people authored Nov 23, 2022
1 parent 8bea72b commit 7d6cfb1
Showing 60 changed files with 440 additions and 352 deletions.
7 changes: 4 additions & 3 deletions .azure/app-cloud-e2e.yml
@@ -35,9 +35,10 @@ pr:
- "tests/tests_app_examples/**"
- "setup.py"
- ".actions/**"
- "!requirements/app/docs.txt"
- "!*.md"
- "!**/*.md"
exclude:
- "requirements/app/docs.txt"
- "*.md"
- "**/*.md"

# variables are automatically exported as environment variables so this will override pip's default cache dir
variables:
7 changes: 4 additions & 3 deletions .azure/gpu-benchmark.yml
@@ -23,9 +23,10 @@ pr:
- ".azure/gpu-benchmark.yml"
- "tests/tests_pytorch/benchmarks/**"
- "requirements/pytorch/**"
- "!requirements/pytorch/docs.txt"
- "!*.md"
- "!**/*.md"
exclude:
- "requirements/pytorch/docs.txt"
- "*.md"
- "**/*.md"

schedules:
- cron: "0 0 * * *" # At the end of every day
8 changes: 4 additions & 4 deletions .azure/gpu-tests-lite.yml
@@ -30,9 +30,10 @@ pr:
- "tests/tests_lite/**"
- "setup.cfg" # includes pytest config
- ".actions/**"
- "!requirements/lite/docs.txt"
- "!*.md"
- "!**/*.md"
exclude:
- "requirements/lite/docs.txt"
- "*.md"
- "**/*.md"

jobs:
- job: testing
@@ -74,7 +75,6 @@ jobs:
- bash: |
PYTORCH_VERSION=$(python -c "import torch; print(torch.__version__.split('+')[0])")
python ./requirements/pytorch/adjust-versions.py requirements/lite/base.txt ${PYTORCH_VERSION}
python ./requirements/pytorch/adjust-versions.py requirements/lite/examples.txt ${PYTORCH_VERSION}
displayName: 'Adjust dependencies'
- bash: |
7 changes: 4 additions & 3 deletions .azure/gpu-tests-pytorch.yml
@@ -37,9 +37,10 @@ pr:
- "requirements/lite/**"
- "src/lightning_lite/**"
- ".actions/**"
- "!requirements/**/docs.txt"
- "!*.md"
- "!**/*.md"
exclude:
- "requirements/**/docs.txt"
- "*.md"
- "**/*.md"

jobs:
- job: testing
7 changes: 4 additions & 3 deletions .azure/hpu-tests.yml
@@ -26,9 +26,10 @@ pr:
- "tests/tests_pytorch/**"
- "setup.cfg" # includes pytest config
- ".actions/**"
- "!requirements/**/docs.txt"
- "!*.md"
- "!**/*.md"
exclude:
- "requirements/**/docs.txt"
- "*.md"
- "**/*.md"

jobs:
- job: testing
7 changes: 4 additions & 3 deletions .azure/ipu-tests.yml
@@ -23,9 +23,10 @@ pr:
- "tests/tests_pytorch/**"
- "setup.cfg" # includes pytest config
- ".actions/**"
- "!requirements/**/docs.txt"
- "!*.md"
- "!**/*.md"
exclude:
- "requirements/**/docs.txt"
- "*.md"
- "**/*.md"

variables:
- name: poplar_sdk
2 changes: 1 addition & 1 deletion .github/checkgroup.yml
@@ -244,7 +244,7 @@ subprojects:
- ".github/workflows/ci-app-examples.yml"
- "src/lightning_app/**"
- "tests/tests_app_examples/**"
- "examples/app_*"
- "examples/app_*/**"
- "requirements/app/**"
- "setup.py"
- ".actions/**"
1 change: 1 addition & 0 deletions .github/mergify.yml
@@ -50,6 +50,7 @@ pull_request_rules:
- name: Not ready yet
conditions:
- or:
- draft # filter-out GH draft PRs
- label="has conflicts"
- "#approved-reviews-by=0" # number of review approvals
- "#changes-requested-reviews-by>=1" # no requested changes
2 changes: 1 addition & 1 deletion .github/workflows/ci-app-examples.yml
@@ -11,7 +11,7 @@ on:
- ".github/workflows/ci-app-examples.yml"
- "src/lightning_app/**"
- "tests/tests_app_examples/**"
- "examples/app_*"
- "examples/app_*/**"
- "requirements/app/**"
- "setup.py"
- ".actions/**"
2 changes: 1 addition & 1 deletion .github/workflows/ci-app-tests.yml
@@ -11,7 +11,7 @@ on:
- ".github/workflows/ci-app-tests.yml"
- "src/lightning_app/**"
- "tests/tests_app/**"
- "examples/app_*" # some tests_app tests call examples files
- "examples/app_*/**" # some tests_app tests call examples files
- "requirements/app/**"
- "setup.py"
- ".actions/**"
6 changes: 3 additions & 3 deletions .github/workflows/probot-check-group.yml
@@ -14,12 +14,12 @@ jobs:
if: github.event.pull_request.draft == false
timeout-minutes: 61 # in case something is wrong with the internal timeout
steps:
- uses: Lightning-AI/probot@v4
- uses: Lightning-AI/probot@v5.1
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
with:
job: check-group
interval: 180 # seconds
timeout: 60 # minutes
maintainers: '@Lightning-AI/lai-frameworks'
owner: '@carmocca'
maintainers: 'Lightning-AI/lai-frameworks'
owner: 'carmocca'
5 changes: 2 additions & 3 deletions docs/source-app/examples/dag/dag_from_scratch.rst
@@ -39,10 +39,9 @@ First, let's define the component we need:
:lines: 55-79

And its run method executes the steps described above.
Additionally, ``work.stop`` is used to reduce cost when running in the cloud.

.. literalinclude:: ../../../examples/app_dag/app.py
:lines: 81-108
:lines: 80-103

----

@@ -51,4 +50,4 @@ Step 2: Define the scheduling
*****************************

.. literalinclude:: ../../../examples/app_dag/app.py
:lines: 109-137
:lines: 106-135
Original file line number Diff line number Diff line change
@@ -5,8 +5,7 @@


class LightningTrainerDistributed(L.LightningWork):
@staticmethod
def run():
def run(self):
model = BoringModel()
trainer = L.Trainer(max_epochs=10, strategy="ddp")
trainer.fit(model)
Original file line number Diff line number Diff line change
@@ -22,8 +22,7 @@ def distributed_train(local_rank: int, main_address: str, main_port: int, num_no
# 2. PREPARE DISTRIBUTED MODEL
model = torch.nn.Linear(32, 2)
device = torch.device(f"cuda:{local_rank}") if torch.cuda.is_available() else torch.device("cpu")
device_ids = device if torch.cuda.is_available() else None
model = DistributedDataParallel(model, device_ids=device_ids).to(device)
model = DistributedDataParallel(model, device_ids=[local_rank] if torch.cuda.is_available() else None).to(device)

# 3. SETUP LOSS AND OPTIMIZER
criterion = torch.nn.MSELoss()
3 changes: 2 additions & 1 deletion docs/source-app/levels/basic/hero_components.rst
@@ -1,7 +1,8 @@
.. lit_tabs::
:titles: Hello world; Hello GPU world; PyTorch & ⚡⚡⚡ Trainer (1+ cloud GPUs); Train PyTorch (cloud GPU); Train PyTorch (32 cloud GPUs); Deploy a model on cloud GPUs; Run a model script; XGBoost; Streamlit demo
:code_files: /levels/basic/hello_components/hello_world.py; /levels/basic/hello_components/hello_world_gpu.py; /levels/basic/hello_components/pl_multinode.py; /levels/basic/hello_components/train_pytorch.py; /levels/basic/hello_components/pt_multinode.py; /levels/basic/hello_components/deploy_model.py; /levels/basic/hello_components/run_ptl_script.py; /levels/basic/hello_components/xgboost.py; /levels/basic/hello_components/streamlit_demo.py
:highlights: 7; 10, 11; 10-12, 17, 18; 4, 8, 12, 18-19, 26; 5, 10, 22, 28, 32, 42, 58-60; 3, 11-12, 25, 29; 7, 10; 15, 21; 9, 15, 24
:highlights: 7; 10, 11; 9-11, 16, 17; 4, 8, 12, 18-19, 26; 5, 10, 22, 27, 31, 41, 57-59; 3, 11-12, 25, 29; 7, 10; 15, 21; 9, 15, 24
:works: [{"name":"root.work","spec":{"buildSpec":{"commands":[],"pythonDependencies":{"packageManager":"PACKAGE_MANAGER_PIP","packages":""}},"drives":[],"userRequestedComputeConfig":{"count":1,"diskSize":0,"name":"default","preemptible":false,"shmSize":0},"networkConfig":[{"name":"dzodf","port":61304}]}}];[{"name":"root.work","spec":{"buildSpec":{"commands":[],"pythonDependencies":{"packageManager":"PACKAGE_MANAGER_PIP","packages":""}},"drives":[],"networkConfig":[{"name":"qnlgd","port":61516}],"userRequestedComputeConfig":{"count":1,"diskSize":0,"name":"gpu","preemptible":false,"shmSize":0}}}];[{"name":"root.ws.0","spec":{"buildSpec":{"commands":[],"pythonDependencies":{"packageManager":"PACKAGE_MANAGER_PIP","packages":""}},"drives":[],"networkConfig":[{"name":"ajfrc","port":61553}],"userRequestedComputeConfig":{"count":1,"diskSize":0,"name":"gpu-fast-multi","preemptible":false,"shmSize":0}}},{"name":"root.ws.1","spec":{"buildSpec":{"commands":[],"pythonDependencies":{"packageManager":"PACKAGE_MANAGER_PIP","packages":""}},"drives":[],"networkConfig":[{"name":"ttyqc","port":61554}],"userRequestedComputeConfig":{"count":1,"diskSize":0,"name":"gpu-fast-multi","preemptible":false,"shmSize":0}}},{"name":"root.ws.2","spec":{"buildSpec":{"commands":[],"pythonDependencies":{"packageManager":"PACKAGE_MANAGER_PIP","packages":""}},"drives":[],"networkConfig":[{"name":"svyej","port":61555}],"userRequestedComputeConfig":{"count":1,"diskSize":0,"name":"gpu-fast-multi","preemptible":false,"shmSize":0}}},{"name":"root.ws.3","spec":{"buildSpec":{"commands":[],"pythonDependencies":{"packageManager":"PACKAGE_MANAGER_PIP","packages":""}},"drives":[],"networkConfig":[{"name":"parme","port":61556}],"userRequestedComputeConfig":{"count":1,"diskSize":0,"name":"gpu-fast-multi","preemptible":false,"shmSize":0}}}];[{"name":"root.work","spec":{"buildSpec":{"commands":[],"pythonDependencies":{"packageManager":"PACKAGE_MANAGER_PIP","packages":""}},"drives":[],"networkConfig":[{"name":"cutdu","p
ort":61584}],"userRequestedComputeConfig":{"count":1,"diskSize":0,"name":"gpu","preemptible":false,"shmSize":0}}}];[{"name":"root.ws.0","spec":{"buildSpec":{"commands":[],"pythonDependencies":{"packageManager":"PACKAGE_MANAGER_PIP","packages":""}},"drives":[],"networkConfig":[{"name":"whhby","port":61613}],"userRequestedComputeConfig":{"count":1,"diskSize":0,"name":"gpu-fast-multi","preemptible":false,"shmSize":0}}},{"name":"root.ws.1","spec":{"buildSpec":{"commands":[],"pythonDependencies":{"packageManager":"PACKAGE_MANAGER_PIP","packages":""}},"drives":[],"networkConfig":[{"name":"yhjtf","port":61614}],"userRequestedComputeConfig":{"count":1,"diskSize":0,"name":"gpu-fast-multi","preemptible":false,"shmSize":0}}},{"name":"root.ws.2","spec":{"buildSpec":{"commands":[],"pythonDependencies":{"packageManager":"PACKAGE_MANAGER_PIP","packages":""}},"drives":[],"networkConfig":[{"name":"rqwkt","port":61615}],"userRequestedComputeConfig":{"count":1,"diskSize":0,"name":"gpu-fast-multi","preemptible":false,"shmSize":0}}},{"name":"root.ws.3","spec":{"buildSpec":{"commands":[],"pythonDependencies":{"packageManager":"PACKAGE_MANAGER_PIP","packages":""}},"drives":[],"networkConfig":[{"name":"pjdsj","port":61616}],"userRequestedComputeConfig":{"count":1,"diskSize":0,"name":"gpu-fast-multi","preemptible":false,"shmSize":0}}},{"name":"root.ws.4","spec":{"buildSpec":{"commands":[],"pythonDependencies":{"packageManager":"PACKAGE_MANAGER_PIP","packages":""}},"drives":[],"networkConfig":[{"name":"efdor","port":61617}],"userRequestedComputeConfig":{"count":1,"diskSize":0,"name":"gpu-fast-multi","preemptible":false,"shmSize":0}}},{"name":"root.ws.5","spec":{"buildSpec":{"commands":[],"pythonDependencies":{"packageManager":"PACKAGE_MANAGER_PIP","packages":""}},"drives":[],"networkConfig":[{"name":"pxmso","port":61618}],"userRequestedComputeConfig":{"count":1,"diskSize":0,"name":"gpu-fast-multi","preemptible":false,"shmSize":0}}},{"name":"root.ws.6","spec":{"buildSpec":{"commands":[],"pyth
onDependencies":{"packageManager":"PACKAGE_MANAGER_PIP","packages":""}},"drives":[],"networkConfig":[{"name":"feevy","port":61619}],"userRequestedComputeConfig":{"count":1,"diskSize":0,"name":"gpu-fast-multi","preemptible":false,"shmSize":0}}},{"name":"root.ws.7","spec":{"buildSpec":{"commands":[],"pythonDependencies":{"packageManager":"PACKAGE_MANAGER_PIP","packages":""}},"drives":[],"networkConfig":[{"name":"tbmse","port":61620}],"userRequestedComputeConfig":{"count":1,"diskSize":0,"name":"gpu-fast-multi","preemptible":false,"shmSize":0}}}];[{"name":"root.work","spec":{"buildSpec":{"commands":[],"pythonDependencies":{"packageManager":"PACKAGE_MANAGER_PIP","packages":""}},"drives":[],"networkConfig":[{"name":"umqqg","port":7777}],"userRequestedComputeConfig":{"count":1,"diskSize":0,"name":"gpu","preemptible":false,"shmSize":0}}}];[];[{"name":"root.work","spec":{"buildSpec":{"commands":[],"pythonDependencies":{"packageManager":"PACKAGE_MANAGER_PIP","packages":""}},"drives":[],"networkConfig":[{"name":"tggba","port":61729}],"userRequestedComputeConfig":{"count":1,"diskSize":0,"name":"default","preemptible":false,"shmSize":0}}}];[{"name":"root.work","spec":{"buildSpec":{"commands":[],"pythonDependencies":{"packageManager":"PACKAGE_MANAGER_PIP","packages":""}},"drives":[],"networkConfig":[{"name":"hpyaz","port":61763}],"userRequestedComputeConfig":{"count":1,"diskSize":0,"name":"default","preemptible":false,"shmSize":0}}}]
:enable_run: true
:tab_rows: 3
:height: 620px
Original file line number Diff line number Diff line change
@@ -26,7 +26,7 @@ or cloud GPUs without code changes.
.. lit_tabs::
:descriptions: import Lightning; We're using a demo LightningModule; Move your training code here (usually your main.py); Pass your component to the multi-node executor (it works on CPU or single GPUs also); Select the number of machines (nodes). Here we choose 2.; Choose from over 15+ machine types. This one has 4 v100 GPUs.; Initialize the App object that executes the component logic.
:code_files: /levels/basic/hello_components/pl_multinode.py; /levels/basic/hello_components/pl_multinode.py; /levels/basic/hello_components/pl_multinode.py; /levels/basic/hello_components/pl_multinode.py; /levels/basic/hello_components/pl_multinode.py; /levels/basic/hello_components/pl_multinode.py; /levels/basic/hello_components/pl_multinode.py;
:highlights: 2; 4; 10-12; 15-18; 17; 18; 20
:highlights: 2; 4; 9-11; 14-17; 16; 17; 19
:enable_run: true
:tab_rows: 5
:height: 420px
2 changes: 1 addition & 1 deletion docs/source-app/workflows/ssh/index.rst
@@ -48,7 +48,7 @@ You can add SSH keys using Lightning.ai website (Lightning.ai > Profile > Keys)

.. code:: bash
$ lightning add ssh-key --public-key ~/.ssh/id_ed25519.pub
$ lightning create ssh-key --public-key ~/.ssh/id_ed25519.pub
You are now ready to access your Lightning Flow and Work containers.

2 changes: 1 addition & 1 deletion docs/source-pytorch/cli/lightning_cli_intermediate.rst
@@ -107,7 +107,7 @@ Which prints out:

.. code:: bash
usage: a.py [-h] [-c CONFIG] [--print_config [={comments,skip_null,skip_default}+]]
usage: main.py [-h] [-c CONFIG] [--print_config [={comments,skip_null,skip_default}+]]
{fit,validate,test,predict,tune} ...
pytorch-lightning trainer command line tool
122 changes: 122 additions & 0 deletions docs/source-pytorch/data/custom_data_iterables.rst
@@ -0,0 +1,122 @@
.. _dataiters:

##################################
Injecting 3rd Party Data Iterables
##################################

When training a model on a specific task, data loading and preprocessing might become a bottleneck.
Lightning does not enforce a specific data loading approach nor does it try to control it.
The only assumption Lightning makes is that the data is returned as an iterable of batches.

For PyTorch-based programs, these iterables are typically instances of :class:`~torch.utils.data.DataLoader`.

However, Lightning also supports other data types such as a plain list of batches, generators, or other custom iterables.

.. code-block:: python

    # random list of batches
    data = [(torch.rand(32, 3, 32, 32), torch.randint(0, 10, (32,))) for _ in range(100)]
    model = LitClassifier()
    trainer = Trainer()
    trainer.fit(model, data)

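Beyond a plain list, the same contract can be met by a small class that defines ``__iter__``. The following is a minimal sketch in plain Python, not part of Lightning: ``SyntheticBatches`` is a hypothetical name, and nested lists stand in for tensors so that the snippet runs without PyTorch. Unlike a generator, which is exhausted after one pass, an object like this can be iterated again for every epoch.

```python
from typing import Iterator, List, Tuple

Batch = Tuple[List[float], List[int]]


class SyntheticBatches:
    """Hypothetical re-iterable source of (inputs, targets) batches."""

    def __init__(self, num_batches: int, batch_size: int) -> None:
        self.num_batches = num_batches
        self.batch_size = batch_size

    def __len__(self) -> int:
        # Optional here; included so callers can ask how many batches to expect.
        return self.num_batches

    def __iter__(self) -> Iterator[Batch]:
        # A fresh generator is created on every call, so each epoch starts over.
        for step in range(self.num_batches):
            inputs = [float(step)] * self.batch_size
            targets = [step % 10] * self.batch_size
            yield inputs, targets


data = SyntheticBatches(num_batches=3, batch_size=2)
first_epoch = list(data)
second_epoch = list(data)
```

With tensors in place of the lists, an instance like this could presumably be passed where ``data`` is used above, provided the model accepts the batch format.
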
Examples for custom iterables include `NVIDIA DALI <https://github.com/NVIDIA/DALI>`__ or `FFCV <https://github.com/libffcv/ffcv>`__ for computer vision.
Both libraries offer support for custom data loading and preprocessing (also hardware accelerated) and can be used with Lightning.


For example, taking the example from FFCV's readme, we can use it with Lightning by just removing the hardcoded ``ToDevice(0)``
as Lightning takes care of GPU placement. In case you want to use some data transformations on GPUs, change the
``ToDevice(0)`` to ``ToDevice(self.trainer.local_rank)`` to correctly map to the desired GPU in your pipeline.

.. code-block:: python

    from ffcv.loader import Loader, OrderOption
    from ffcv.transforms import ToTensor, ToDevice, ToTorchImage, Cutout
    from ffcv.fields.decoders import IntDecoder, RandomResizedCropRGBImageDecoder


    class CustomClassifier(LitClassifier):
        def train_dataloader(self):
            # Random resized crop
            decoder = RandomResizedCropRGBImageDecoder((224, 224))

            # Data decoding and augmentation
            image_pipeline = [decoder, Cutout(), ToTensor(), ToTorchImage()]
            label_pipeline = [IntDecoder(), ToTensor()]

            # Pipeline for each data field
            pipelines = {"image": image_pipeline, "label": label_pipeline}

            # Replaces PyTorch data loader (`torch.utils.data.DataLoader`)
            loader = Loader(
                write_path, batch_size=bs, num_workers=num_workers, order=OrderOption.RANDOM, pipelines=pipelines
            )

            return loader

When moving data to a specific device, you can always refer to ``self.trainer.local_rank`` to get the index of the
accelerator used by the current process.

By just changing ``device_id=0`` to ``device_id=self.trainer.local_rank`` we can also leverage DALI's GPU decoding:

.. code-block:: python

    from nvidia.dali.pipeline import pipeline_def
    import nvidia.dali.types as types
    import nvidia.dali.fn as fn
    from nvidia.dali.plugin.pytorch import DALIGenericIterator
    import os


    class CustomLitClassifier(LitClassifier):
        def train_dataloader(self):
            # To run with different data, see documentation of nvidia.dali.fn.readers.file
            # points to https://github.com/NVIDIA/DALI_extra
            data_root_dir = os.environ["DALI_EXTRA_PATH"]
            images_dir = os.path.join(data_root_dir, "db", "single", "jpeg")

            @pipeline_def(num_threads=4, device_id=self.trainer.local_rank)
            def get_dali_pipeline():
                images, labels = fn.readers.file(file_root=images_dir, random_shuffle=True, name="Reader")
                # decode data on the GPU
                images = fn.decoders.image_random_crop(images, device="mixed", output_type=types.RGB)
                # the rest of processing happens on the GPU as well
                images = fn.resize(images, resize_x=256, resize_y=256)
                images = fn.crop_mirror_normalize(
                    images,
                    crop_h=224,
                    crop_w=224,
                    mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                    std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
                    mirror=fn.random.coin_flip(),
                )
                return images, labels

            train_data = DALIGenericIterator(
                [get_dali_pipeline(batch_size=16)],
                ["data", "label"],
                reader_name="Reader",
            )

            return train_data

Limitations
------------
Lightning works with all kinds of custom data iterables as shown above. There are, however, a few features that cannot
be supported this way. These restrictions exist because supporting them
requires Lightning to know a lot about the internals of these iterables.

- In a distributed multi-GPU setting (ddp),
  Lightning automatically replaces the DataLoader's sampler with its distributed counterpart.
  This makes sure that each GPU sees a different part of the dataset.
  As sampling can be implemented in arbitrary ways with custom iterables,
  there is no way for Lightning to know how to replace the sampler.

- When training fails for some reason, Lightning is able to extract all of the relevant data from the model,
  optimizers, trainer and dataloader to resume at the exact batch where it crashed.
  This feature is called fault-tolerance and is limited to PyTorch DataLoaders.
  Lightning needs to know a lot about sampling, fast forwarding and random number handling to enable fault tolerance,
  meaning that it cannot be supported for arbitrary iterables.
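
The first limitation means that sharding is left to the iterable itself. One simple approach, sketched below under stated assumptions, is to let each process keep every ``world_size``-th batch, starting at its own rank. ``shard_batches`` is a hypothetical helper, not a Lightning API; inside a LightningModule the rank and world size would presumably come from ``self.trainer.global_rank`` and ``self.trainer.world_size``.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shard_batches(batches: Iterable[T], rank: int, world_size: int) -> Iterator[T]:
    """Return an iterator over the batches assigned to one process.

    Hypothetical helper: the process with index `rank` keeps batches
    rank, rank + world_size, rank + 2 * world_size, and so on.
    """
    if world_size < 1 or not 0 <= rank < world_size:
        raise ValueError(f"invalid rank {rank} for world size {world_size}")
    return islice(batches, rank, None, world_size)


# Ten batches (represented by their indices) split across four processes.
shards = [list(shard_batches(range(10), rank, 4)) for rank in range(4)]
```

Ranks can end up with different numbers of batches when the total is not divisible by ``world_size``. Synchronized training typically needs this evened out, for example by dropping the remainder.
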
1 change: 1 addition & 0 deletions docs/source-pytorch/index.rst
@@ -207,6 +207,7 @@ Current Lightning Users
Train on single or multiple TPUs <accelerators/tpu>
Train on MPS <accelerators/mps>
Use a pretrained model <advanced/pretrained>
Inject Custom Data Iterables <data/custom_data_iterables>
model/own_your_loop

.. toctree::