Release v1.5.0-rc.0 release · kubeflow/training-operator

Full Changelog

Closed issues:

MPIJob worker still running when NotEnoughResources with enable-gang-scheduling==true? #1617
unable to fetch TFJob when I use client.go run tfjob #1612
Pytorchjob dist-mnist no training logs #1601
kubectl get tfjob -o yaml, but not status output #1598
missing image in tf_job_design_doc.md #1590
Labels in Python client are out of date #1587
PyTorchJob Pods "Not Ready" After Completing Training #1577
cannot use "github.com/go-openapi/spec".Schema{...} (type "github.com/go-openapi/spec".Schema) as type "k8s.io/kube-openapi/pkg/validation/spec".Schema in field value #1576
PyTorchJob: OnFailure Policy won't handle pod failure gracefully #1570
pytorchjob doesn't have status.startTime. #1566
Optional-test-infra Deprecation Notice - Training #1561
Should we update MPIJob unit test CleanPodPolicy field? #1555
--enable-gang-scheduling=true doesn't work for MPIJob #1548
PyTorchJob fails when creating a task with a different namespace but the same name #1543
Reconcile PyTorchJob error: PyTorchJob.status.replicaStatuses: Invalid value: "null" after enable-gang-scheduling #1538
Job TTLs not working #1533
Support PodGroup in scheduler-plugins/coscheduling #1518
support elastic training #1515
Modified the configuration of RootLogger #1514
Add checking import order in CI #1510
Scale down of pytorchJob cause workers pod to restart #1509
Support label selector based success/failure conditions #1507
[feat] Support SuccessPolicy in PyTorchJob #1505
pytorch elastic scheduler error #1504
Could you add the example of MPIJob in this repository #1502
[Feature] Create a Informer/ClientSet for PyTorch Jobs #1499
[feature] Make init container injection logic available to all jobs #1498
Roadmaps for 1.4 release #1496
[bug] (MpiJob)Init container KubectlDeliveryImage should remain the ability that it can be specified from container parameters or environment variables. #1494
Reconcile PyTorch Job error Operation cannot be fulfilled on pytorchjobs.kubeflow.org #1492
Python PytorchJob: no attribute openapi_types for example code #1481
PyTorch DistributedDataParallel training with multi nodes #1475
Installing kubeflow-training breaks import for other kubeflow packages (katib, fairing, etc.) #1471
Deprecate ksonnet and use python/golang to submit jobs #1468
Help Wanted in ParameterServerStrategy Example. #1459
Bug: SomeTimes Coredumped using tfjob #1456
[question] PyTorchJob MNIST example training speed #1454
tfjob status not match when EnableDynamicWorker set true #1452
training-operator set scheduler error #1447
[sdk]: Replace TableLogger component in the SDK for better support with ipykernel>=6.x #1446
SDK: wait_for_job reports typeError #1445
Update prometheus monitoring doc #1443
Master branch should provide a nightly image #1433
Clean up test folder before testing #1429
Clean up TF specific docs #1424
[feature] Support SchedulingPolicy in PyTorchJob #1414
Hyperlinks in the "Overview" section is incorrect/not found #1411
add workqueue metric #1407
Validation fails for MXJob Tune example #1402
Rate exceeded for aws ecr image #1400
change layout to follow the standard of kubebuilder? #1397
[example] kubeflow/tf-dist-mnist-test:1.0 is missing in v1.2-branch examples/v1/dist-mnist #1393
Update kubeflow/website for 1.4 release #1392
Cut beta release of tf-operator for 1.4 release #1385
"invalid memory address or nil pointer dereference" #1382
some questions about job sync #1379
Provides a default Grafana dashboard #1376
[feature] Support different PS/worker types #1369
Need to copy all (mainly pytorch) framework's example dir to tf-operator/examples #1366
Add more CRD validations markers to block invalid job on client apply #1363
Update presubmit and post submit job triggers #1354
Optimize post submit jobs flow #1353
Enable leader election in controller manager using controllermanagerconfig #1350
Support mpi jobs in universal operator #1345
post-submit job failure in master branch #1343
Improve observability of universal operator #1340
Best practice to organize main.go and Dockerfile? #1333
Should training operator keep clientset in the same repository? #1332
Test image has incorrect tag? #1329
Prepare e2e tests for all frameworks #1323
Reduce e2e replica-restart-policy-tests running time #1319
Improve logs structure by consolidating libs from controller runtime and controllers #1313
Enable tests for all frameworks #1311
[bug] The pod will be recreated until the expectation expires #1306
Upgrade CRDs to apiextensions.k8s.io/v1 #1304
Add role details as new columns to kubectl get jobs output for CRD. #1301
How to handle long pending pods in a TF-job? #1282
Could you release a new version of Python SDK #1279
Update swagger.json schema for TFJobSpec to include RunPolicy #1278
Not able to pass environment variable from tfjob to pod #1273
v1_time.py is not generated by hack/python-sdk/gen-sdk.sh #1271
Add a step to upload artifact #1258
[feature] Support multi port in TFJob #1251
[feat] Add scale subresource #1220
Pod get re-created after it exited and get garbage collected #1186
Clean up vendor dependencies #1162

Merged pull requests:

Update training controller image to latest #1625 (johnugeorge)
Update SDK version to 1.5.0 #1624 (johnugeorge)
Upgrade common to v0.4.3 #1623 (johnugeorge)
fix: MPIJob worker still running when NotEnoughResources #1621 (hackerboy01)
fix comments for pytorch-controller #1620 (hackerboy01)
MXNet SDK with Status check fix #1618 (johnugeorge)
Adding GHA for automatic image build and push #1615 (johnugeorge)
fix: requeue when expire time is not up yet #1614 (Garrybest)
Add clientset for MPIJob, PytorchJob, MXJob, and XGBoostJob #1610 (tenzen-y)
Add all generation tools to Makefile #1609 (johnugeorge)
Adding MPI python sdk #1608 (johnugeorge)
Adding XGboost Python sdk #1607 (johnugeorge)
Generating MPI python sdk #1606 (johnugeorge)
Update k8s dependencies to v0.24.1 #1604 (johnugeorge)
Migrate test framework to GHA #1603 (johnugeorge)
Add mpi in update-codegen.sh #1600 (ggaaooppeenngg)
Remove presubmit test depending on optional-test-infra #1596 (aws-kf-ci-bot)
chore: stop action on first fail #1595 (jasonliu747)
fix PyTorchJob status inaccuracy when task replica scale down #1593 (PeterChg)
update img url in design doc #1591 (zw0610)
Look for fully-qualified job role label in Python sdk #1588 (person142)
fix torch env typo #1573 (kuizhiqing)
Restart job on failure for Always,OnFailure Policy #1572 (georgkaleido)
Increase success threshold #1568 (haoxins)
update status.startTime for pytorchjob and xgboostjob #1567 (cheimu)
fix: add mpijobs to kubeflow training role #1565 (henrysecond1)
Remove uncalled mpi-controller DeletePodsAndServices() #1558 (cheimu)
fix: MPIJob cannot use gang-scheduling when --enable-gang-scheduling is set #1557 (cheimu)
Update MPIJob unit tests to use spec.runPolicy.cleanPodPolicy #1556 (cheimu)
fix: set mpijob runPolicy.cleanPodPolicy to default none #1554 (cheimu)
fix api reader issue #1551 (zw0610)
fix label and CleanPodPolicy for mpi-controller #1550 (zw0610)
fix UpdateJobStatusInApiServer when gang-scheduling is enabled #1549 (zw0610)
fix: add namespace filtering when getting pods/services for jobs #1545 (henrysecond1)
Remove table-logger dependency #1544 (person142)
Bump pyyaml from 5.1 to 5.4 in /py/kubeflow/tf_operator #1542 (dependabot[bot])
Release Python SDK 1.4.0 #1541 (alembiewski)
mod: Upgrade ginkgo to v2 #1537 (haoxins)
docs: Fix broken links in quick-start-v1.md #1536 (nakamasato)
extends path in __init__.py for SDK correctly #1531 (cakeislife100)
chore: Update changelog for v1.4.0-rc.0 release #1528 (terrytangyuan)
