Update pascal voc12 example #1125

Merged 9 commits on Jun 16, 2020

@vfdev-5 (Collaborator) commented Jun 12, 2020

Description:

  • Updated the Pascal VOC12 example to use the idist API
  • Refactored the training files
    • single training
    • added an exp_tracking module that dispatches calls to specific implementations: mlflow, plx, trains
    • updated docs/notes
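
The dispatch idea behind the exp_tracking module can be illustrated with a small registry. This is only a sketch of the pattern, not the module added in this PR: the function names, the in-memory `calls` list, and the `log_params` operation are all hypothetical, and no tracking library is imported so that the snippet runs on its own.

```python
from typing import Callable, Dict, List, Tuple

# Recorded calls stand in for real tracking clients (assumption: a real
# implementation would call mlflow, Polyaxon or Trains here instead).
calls: List[Tuple[str, str, dict]] = []


def _make_logger(backend: str) -> Callable[[dict], None]:
    """Build a hypothetical per-backend implementation of `log_params`."""

    def _log_params(params: dict) -> None:
        calls.append((backend, "log_params", dict(params)))

    return _log_params


# Registry: backend name -> implementation. The names mirror the PR description.
_BACKENDS: Dict[str, Callable[[dict], None]] = {
    name: _make_logger(name) for name in ("mlflow", "plx", "trains")
}


def log_params(backend: str, params: dict) -> None:
    """Dispatch the call to the implementation registered for `backend`."""
    try:
        impl = _BACKENDS[backend]
    except KeyError:
        raise ValueError(f"Unknown experiment tracking backend: {backend!r}") from None
    impl(params)


log_params("mlflow", {"lr": 0.01})
```

The training code then calls a single function and stays unaware of which tracking system is active.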

Check list:

  • New tests are added (if a new feature is added)
  • New doc strings: description and/or example code are in RST format
  • Documentation is updated (if required)

@vfdev-5 vfdev-5 marked this pull request as draft June 12, 2020 22:11
@vfdev-5 vfdev-5 marked this pull request as ready for review June 12, 2020 22:27
@vfdev-5 vfdev-5 requested a review from sdesrozis June 12, 2020 22:28
@sdesrozis (Contributor) left a comment

Some comments but it looks good 😉

The training script contains a `run` method, which [py_config_runner](https://github.com/vfdev-5/py_config_runner) requires in order to
run a script with a configuration.

The training script and the Python configuration file are split as follows.

Contributor

I wonder whether gin-config or hydra could help compose the cfg script and run it.

Collaborator Author

I'll let you explore this option :)
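
For readers unfamiliar with the split discussed in this thread, here is a minimal sketch of a training script that exposes a `run` entry point and receives its settings from a separate configuration object. The `run(config, **kwargs)` signature is an assumption based on the py_config_runner convention, the `SimpleNamespace` stands in for a real Python configuration file, and the loop body is a placeholder rather than the example's actual training code.

```python
from types import SimpleNamespace

# Stand-in for the Python configuration file (assumption: the real example
# keeps these values in a separate .py module loaded by the runner).
config = SimpleNamespace(device="cpu", num_epochs=3, batch_size=8)


def run(config, **kwargs):
    """Entry point that a configuration runner would call with the loaded config."""
    history = []
    for epoch in range(1, config.num_epochs + 1):
        # Placeholder for the real training and validation loops.
        history.append(f"epoch {epoch} on {config.device}")
    return history


history = run(config)
```

Keeping the script free of hard-coded values means the same `run` function can be reused with different configuration files.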

- other parameters: device, number of epochs, etc

The training script uses these components to set up and run the training and validation loops. By default,
a process group with the "nccl" backend is initialized for the distributed configuration (even for a single GPU).

Contributor

Maybe we could think about full CPU training on a cluster.

Collaborator Author

IMO there is very limited interest in that. Would it be one of your use cases?

Contributor

That does not matter for now 😉
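
As background for this thread: in torch.distributed, "nccl" is the backend used for GPU communication, while "gloo" is the usual choice on CPU. A CPU-only variant of the example would therefore mostly need a different backend name. The helper below is a hypothetical sketch of that choice (`pick_backend` is not part of the example or of ignite), written without importing torch so it runs anywhere.

```python
def pick_backend(num_gpus: int) -> str:
    """Return a torch.distributed backend name for the available hardware.

    Hypothetical helper: "nccl" when at least one GPU is present, "gloo" otherwise.
    """
    if num_gpus < 0:
        raise ValueError("num_gpus must be non-negative")
    return "nccl" if num_gpus > 0 else "gloo"


gpu_backend = pick_backend(num_gpus=1)
cpu_backend = pick_backend(num_gpus=0)

# With ignite installed, the result could be passed on, for example:
#   with idist.Parallel(backend=pick_backend(n)) as parallel:
#       parallel.run(training, config)
```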

@vfdev-5 vfdev-5 merged commit f833124 into pytorch:master Jun 16, 2020
@vfdev-5 vfdev-5 deleted the update/pascal-voc12 branch June 16, 2020 13:59
vfdev-5 added a commit that referenced this pull request Jun 24, 2020
* Added Windows/MacOSX CI for py3.7 only (#1113)

* [WIP] Added Windows CI for py3.7 only

* Excluded examples from windows ci

* Update unittests.yml

* Update unittests.yml

* Fixed shell bash as suggested

* Fixed failing tests on Win32
- added MNIST test for Win32 in Github actions
- added tests on macosx in Github actions

* Fixed isort

* Fixes tests with IterableDataset

* Skipped slow deterministic tests on win32

* skip failing timer tests on macos

* fix macos platform name

* fix _test_setup_logging

* skip frequency tests on win platform

* skip time tests on macos

* fix flake8

* fix isort

* Skip distrib tests for Win32

* skip time test for macos

* Updated github actions yaml

* skip modules for macos

* Fixed bad skip of deterministic tests, reduced time for slow tests

* Do not run dist tests on macosx

Co-authored-by: Sylvain Desroziers <[email protected]>

* [FR] Parallel helper tools (#1014) (#1116)

* [FR] Parallel helper tools (#1014)

* [WIP] auto and parallel dist modules

* [WIP] auto optim

* Added xla optimizer wrapper
- other code updates

* Updated auto and cifar10 example

* - Fixed resume from
- other cosmetics

* Fixed bug with _XLADistributedOptimizer
- updated default LR

* autopep8 fix

* Updated README and minor fixes

* autopep8 fix

* - Removed mnist distributed example
- Reverted unintended modifications

* Tests of auto methods

* autopep8 fix

* Tests, docs and code updates

* autopep8 fix

* Up code, test, cifar10 example and docs

* Added option to stop the training
- updated ci

* Updated readme and fixed ci configs

* - Updated code, README and remove old cifar10

Co-authored-by: vfdev-5 <[email protected]>
Co-authored-by: AutoPEP8 <>

* Fixes failing tests

* Minor updates

* Other minor updates

* Example readme update and minor fixes

* Added test on load_objects ddp to improve coverage

* Added more tests for parallel launcher

* Replaced pbars by logger

* Updated link to cifar10 example

* Fixes codecov upload

* Updated coverage report type for gpu/tpu

Co-authored-by: Sylvain Desroziers <[email protected]>

* Fixes #1120 (#1122)

* Fixes #1120
- Aligned idist args and method names to torch.distributed.launch

* replace missing num_procs_per_node

* black format

* fix bug

* replace num_nodes by nnodes

Co-authored-by: Desroziers <[email protected]>
Co-authored-by: Sylvain Desroziers <[email protected]>
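
To illustrate the naming that the commit above aligns with, the sketch below parses the launcher-style flags used by `torch.distributed.launch` (`--nproc_per_node`, `--nnodes`, `--node_rank`, `--master_addr`, `--master_port`). The parser itself is a stand-in written for this note, not code from ignite; only the flag names and defaults follow the PyTorch launcher.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Stand-in parser using the same flag names as torch.distributed.launch."""
    parser = argparse.ArgumentParser(description="Launcher-style arguments (sketch)")
    parser.add_argument("--nproc_per_node", type=int, default=1)
    parser.add_argument("--nnodes", type=int, default=1)
    parser.add_argument("--node_rank", type=int, default=0)
    parser.add_argument("--master_addr", type=str, default="127.0.0.1")
    parser.add_argument("--master_port", type=int, default=29500)
    return parser


args = build_parser().parse_args(
    ["--nnodes", "2", "--nproc_per_node", "4", "--node_rank", "1"]
)

# Total number of processes across all nodes.
world_size = args.nnodes * args.nproc_per_node
```

Using the same names on both sides lets values be forwarded from the launcher to the helper without renaming.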

* reverse order of remove/save in Checkpoint handling (#1117)

* reverse order of remove/save so there is never an n+1 checkpoint situation.

* evict if new item is better than candidate for eviction.

* swap order of updating saved and saving to ensure consistency of state

* remove redundant method.

Co-authored-by: vfdev <[email protected]>
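
The ordering described in the commit above (remove before save, and evict only when the new item beats the eviction candidate) can be sketched as follows. This is an independent illustration, not ignite's `Checkpoint` code: the class name, the in-memory `storage` dict standing in for files on disk, and the `max_stored` counter are all hypothetical.

```python
from typing import Dict, List, Tuple


class TopNCheckpoints:
    """Keep at most `n_saved` checkpoints, preferring higher scores (sketch)."""

    def __init__(self, n_saved: int) -> None:
        if n_saved < 1:
            raise ValueError("n_saved must be at least 1")
        self.n_saved = n_saved
        self.saved: List[Tuple[float, str]] = []  # (score, name), worst first
        self.storage: Dict[str, dict] = {}  # stand-in for files on disk
        self.max_stored = 0  # largest number of entries ever held at once

    def __call__(self, score: float, name: str, state: dict) -> bool:
        if len(self.saved) >= self.n_saved:
            worst_score, worst_name = self.saved[0]
            if score <= worst_score:
                # New item is not better than the eviction candidate: skip it.
                return False
            # Remove first, then save, so storage never holds n_saved + 1 items.
            self.saved.pop(0)
            del self.storage[worst_name]
        self.storage[name] = state
        self.saved.append((score, name))
        self.saved.sort()
        self.max_stored = max(self.max_stored, len(self.storage))
        return True


keeper = TopNCheckpoints(n_saved=2)
results = [
    keeper(score, f"ckpt_{i}", {"step": i})
    for i, score in enumerate([0.5, 0.7, 0.6, 0.4])
]
```

In this run the third checkpoint replaces the first one, and the fourth is rejected because it scores below everything already kept.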

* Fix test auto tpu (#1126)

* Fixed failing tpu tests

* Updated docstring of cifar10 example

* Auto pin_memory (#1129)

* Auto pin_memory

* autopep8 fix

Co-authored-by: AutoPEP8 <>

* fix auto pin_memory : idist.device().type should be used (#1131)

* fix auto pin_memory : idist.device().type should be used

* fix cuda in device

* fix test

* use idist.device().type to test

* add missing ()

Co-authored-by: Desroziers <[email protected]>

* Update pascal voc12 example (#1125)

* [WIP][Pascal-VOC12] Update/refactor example

* [WIP][Pascal-VOC12] Update/refactor example 2

* [WIP] Updated mlflow files

* Removed unused files

* Fixed flake and black

* Removed unused import and fixed version for mlflow

Co-authored-by: Sylvain Desroziers <[email protected]>

* fix cifar10 model : num_classes missing (#1134)

Co-authored-by: Desroziers <[email protected]>

* Accuracy MultiLabel Handling and Error Message (#1132)

* Updated check for multilabel and error message

* Updated docstring and error message

* Updated error message formatting

Co-authored-by: Sylvain Desroziers <[email protected]>

* Updated ImageNet example (#1138)

* [WIP] Updated ImageNet example
- minor fixes for Pascal VOC12

* Fixed flake8

* Updated pytorch-version-tests.yml to run cron every day at 00:00 UTC (#1141)

Co-authored-by: Sylvain Desroziers <[email protected]>

* Added check_compute_fn argument to EpochMetric and related metrics (#1140)

* Added check_compute_fn argument to EpochMetric and related functions.

* Updated docstrings

* Added check_compute_fn to _BaseRegressionEpoch

* Adding typing hints for check_compute_fn

* Update roc_auc.py

Co-authored-by: Sylvain Desroziers <[email protected]>
Co-authored-by: vfdev <[email protected]>

* Docs cosmetics (#1142)

* Updated docs, replaced single quote by double quote if is code
- fixed missing link to Engine
- cosmetics

* More doc updates

* More updates

* Fix batch size calculation error (#1137)

* Fix batch size calculation error

* Add tests for fixed batch size calculation

* Fix tests

* Test for num_workers

* Fix nproc comparison

* Improve docs

* Fixed docstring

Co-authored-by: vfdev <[email protected]>

* Docs updates (#1139)

* [WIP] Added teaser gif

* [WIP] Updated README

* [WIP] Updated README

* [WIP] Updated docs

* Reverted unintended pyproject.toml edits

* Updated README and examples parts

* More updates of README

* Added badge to check pytorch/python compatible versions

* Updated README

* Added ref to blog "Using Optuna to Optimize PyTorch Ignite Hyperparameters"

* Update README.md

* Fixed bad internal link in examples

* Updated README

* Fixes docs (#1147)

* Fixed bad link on teaser

* Added manual_seed into docs

* Issue #1115 : pbar persists due to specific rule in tqdm (notebook) when n < total (#1145)

* Issue #1115
pbar persists in notebook due to specific rules when n < total

* closing the pbar doesn't raise the danger bar

* fix when pbar.total is None

Co-authored-by: vfdev <[email protected]>
Co-authored-by: Desroziers <[email protected]>

* Updated codebase such that torch>=1.3 (#1150)

Co-authored-by: vfdev <[email protected]>

* add wandb (#1152)

wandb integration already exists, just adding it to the requirements file

* Fixed typo and missing part of "Where to go next" (#1151)

* Fixes #1153 (#1154)

- temporary downgrade of scipy to 1.4.1 instead of 1.5.0

* Use global_step as priority, if it exists (#1155)

* Use global_step as priority, if it exists

* Fix flake8 error

* Style fix

Co-authored-by: vfdev <[email protected]>

* Fix TrainsSaver handling of Checkpoint's n_saved (#1135)

* Utilize Trains framework callbacks to better support checkpoint saving and respect Checkpoint.n_saved

* Update trains callbacks to new format

* autopep8 fix

* Fix trains mnist example (store checkpoints in local folder)

* Use trains 0.15.1rc0 until PR is approved

* Use CallbackType for Trains callback type resolution.
Add unit test for Trains callbacks

* Update trains version

* Updated test_trains_saver_callbacks

Co-authored-by: jkhenning <>
Co-authored-by: vfdev <[email protected]>

* Stateful handlers (#1156)

* Stateful handlers

* Added state_dict/load_state_dict tests for Checkpoint

* integration test

* Updated docstring and added include_self to ModelCheckpoint

* An integration test for checkpointing with stateful handlers

* Black and flake8

Co-authored-by: vfdev-5 <[email protected]>

* Bump version to 0.4rc.0.post1

* bump version to v0.4.0 🎉

Co-authored-by: vfdev <[email protected]>
Co-authored-by: Sylvain Desroziers <[email protected]>
Co-authored-by: Desroziers <[email protected]>
Co-authored-by: Marijan Smetko <[email protected]>
Co-authored-by: Anmol Joshi <[email protected]>
Co-authored-by: Lavanya Shukla <[email protected]>
Co-authored-by: Akihiro Matsukawa <[email protected]>
Co-authored-by: Jake Henning <[email protected]>
2 participants