Commit
[data] Stability & accuracy improvements for Data+Train benchmark (ray-project#42027)

Shuffles input images for the read_images_train_4_gpu release test, which fixes the issue of accuracy going to 0.

Adds AWS Error NETWORK_CONNECTION and AWS Error ACCESS_DENIED to the exception types retried during reads, since these can be transient errors that succeed on retry.

Other small fixes for optional parameters in the benchmark file, used for debugging purposes.

Results of sample release test run:

read_images_train_4_gpu:
Result of case cache-none: {'time': 11964.644112934, 'tput': 429.22158930338344, 'accuracy': 0.4667895757295709, 'extra_metrics': {}}

read_images_train_16_gpu:
Result of case cache-none: {'time': 5400.357632072, 'tput': 1593.6668981608586, 'accuracy': 0.5293150227295434, 'extra_metrics': {}}

read_images_train_16_gpu_preserve_order:
Result of case cache-none: {'time': 5566.524269388, 'tput': 1571.1312653719967, 'accuracy': 0.5295374787691078, 'extra_metrics': {}}

(The difference in accuracy is because the 4-worker test runs for only 3 epochs, while the 16-worker test runs for 5 epochs, using the entire dataset per epoch.)

---------

Signed-off-by: Andrew Xue <[email protected]>
Signed-off-by: Scott Lee <[email protected]>
Co-authored-by: Scott Lee <[email protected]>
Co-authored-by: Scott Lee <[email protected]>
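
For illustration only, the retry behavior described in the commit message could be sketched roughly as follows. This is not the code from the commit: the names `RETRYABLE_ERROR_FRAGMENTS`, `is_retryable`, and `call_with_retry` are hypothetical, and matching on the exception message plus exponential backoff are assumptions about how such a retry might work.

```python
import time

# Hypothetical list of error-message fragments treated as transient.
# The two AWS strings come from the commit message; the matching
# strategy (substring search on the message) is an assumption.
RETRYABLE_ERROR_FRAGMENTS = (
    "AWS Error NETWORK_CONNECTION",
    "AWS Error ACCESS_DENIED",
)


def is_retryable(exc):
    """Return True if the exception message contains a known transient error."""
    message = str(exc)
    return any(fragment in message for fragment in RETRYABLE_ERROR_FRAGMENTS)


def call_with_retry(fn, max_attempts=4, base_delay_s=1.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff on transient errors.

    Non-retryable errors, and the error from the final attempt, are re-raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if not is_retryable(exc) or attempt == max_attempts:
                raise
            # Delay doubles after each failed attempt: 1s, 2s, 4s, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
```

A read function would be wrapped as `call_with_retry(lambda: read_one_file(path))`, where `read_one_file` stands in for whatever performs the read. Passing `sleep` as a parameter keeps the helper easy to test without real delays.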