-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore(docker): Decouple Tensorboard from the project and add CI build workflow. #2344
Conversation
Also tensorboard isn't needed when using wandb or other types of logging(grafana agent on k8s clusters). I think this is a great idea. |
Hmm, I thought of another question. Can accelerate write logs correctly without installing tensorboard?
This is totally my fault, I tested with a wrong config that leads to OOM 😂 |
After testing, accelerate do needs the tensorboard package to write logs. I have fixed this part and the image size is 10.3GB now. It did not have a significant impact on the size since the actual huge one is tensorflow. |
Explained the different designs of tensorboard in README.md. |
Rebase on to bmaltais/kohya_ss/dev |
- The Dockerfile no longer installs TensorFlow, only PyTorch - The TensorBoard exposure port has been removed from the Dockerfile - A new tensorboard service has been added to the docker-compose.yaml file, utilizing the TensorFlow:latest-gpu image and exposing port 6006. - The requirements_linux_docker.txt file has the tensorboard and tensorflow requirements commented out. Signed-off-by: 陳鈞 <[email protected]>
Remove tensorboard reduce the docker image size from 17GB to 10GB. This reduction enables the image to be built on the GitHub free runner. Signed-off-by: 陳鈞 <[email protected]>
- Modify the Docker publish workflow to only trigger on tag pushes that match the pattern `v*`, for instance, `v1.0.0`. - Make the job run only on Ubuntu and tag commits. - Enable submodules during checkout to pull all the data. - Change Docker image name to `kohya-ss-gui`. - Configure Docker metadata dynamically based on commit SHA and SemVer pattern. - Set up QEMU, a generic and open source machine emulator and virtualizer, possibly for cross-platform docker image builds. - Use `GITHUB_TOKEN` instead of `CR_PAT` for Docker login. - Adjust the build arguments for Docker, including the version and the release number. - Modify cache configuration to store in registry, rather than Github Action (gha) to avoid capacity limit. - Enable software bill of materials (sbom) and provenance for better software supply chain and artifact traceability and verifiability. Signed-off-by: 陳鈞 <[email protected]>
- Remove the mapping for the port 6006 in the docker-compose file Signed-off-by: 陳鈞 <[email protected]>
Signed-off-by: 陳鈞 <[email protected]>
…flow - Removed the `github.sha` tag from the docker publish workflow file. Signed-off-by: 陳鈞 <[email protected]>
- Add the `--do_not_use_shell` option to the Python CMD in the Dockerfile. Signed-off-by: 陳鈞 <[email protected]>
This reverts commit 1bd0465. `shell=False` does not work in Linux when passing the command as a string with arguments. #2284 (comment)
…service - Update the image source for `kohya-ss-gui` service in docker-compose file from `kohya-ss-gui:latest` to `ghcr.io/jim60105/kohya-ss-gui:latest` `ghcr.io/jim60105/kohya-ss-gui` is for testing the forked repository and must be changed to `bmaltais`'s registry before merging the PR. Signed-off-by: 陳鈞 <[email protected]>
Signed-off-by: 陳鈞 <[email protected]>
Signed-off-by: 陳鈞 <[email protected]>
…kage for writing logs. - Removed exact version specification for tensorboard and tensorflow in the linux docker requirements, allowing the latest versions to be installed. - Removed commented-out installations for tensorboard and tensorflow. Signed-off-by: 陳鈞 <[email protected]>
- Added a new section in README.md about the design of the Dockerfile. - Made changes in the Docker limitations section, including the separation of TensorBoard from the project and the related changes, no automatic browser launch, and manual file path input requirement. - Updated instructions for docker upgrade using new code version. - Note added for possible lengthy docker image build times. Signed-off-by: 陳鈞 <[email protected]>
- Implemented rich logging in Dockerfile and set FORCE_COLOR environment variable to true - Set the COLUMNS environment variable to `100` in Dockerfile for wider output - Replaced single quotes with double quotes for `device_ids` value in docker-compose.yaml Signed-off-by: 陳鈞 <[email protected]>
- Change the docker image from `ghcr.io/jim60105/kohya-ss-gui:latest` to `ghcr.io/bmaltais/kohya-ss-gui:latest` - Update the cache source from `ghcr.io/jim60105/kohya-ss-gui:cache` to `ghcr.io/bmaltais/kohya-ss-gui:cache` Signed-off-by: 陳鈞 <[email protected]>
The "Package" tab on your GitHub page is the ghcr docker registry. I think we can first merge this PR and trigger the workflow, allowing it to automatically publish the package. This process should enable the workflow to automatically obtain permission for the package. If the publication fails, come back to check the permission issue. After publishing is completed, remember to adjust the package to be public. The new package will be set as private by default. |
I have triggered a build... tried to configure the GITHUB_TOKEN secret but github won't let me... so I assume it is auto managed by github on the platform... Look like it worked |
Give it a try and let me know if all is fine... I tagged the last commit in dev so it should contain the latest dev code. |
I tried to run the built image with:
as specified in the README but this just ended up building the whole thing locally... |
Yes, GITHUB_TOKEN is builtin feature.
It looks great, and the CI is successful.
When I write it like this, the default behavior is that it will first pull the image, unless the image does not exist (online), then it will build locally. Lines 4 to 7 in 059afdb
|
Yeah… I think a new flag will be required to hide it… I need to think of the best way to implement it… but it was supposed to not display if tensorflow was missing… not sure why this did not work. |
I think it did not pull the image because there is no latest tag for it… maybe… |
Because visibility or self.headless kohya_ss/kohya_gui/class_tensorboard.py Lines 114 to 119 in 059afdb
|
I had take a look at why you modified this for #2349 In my opinion, docker user can directly restart the container when they want to interrupt the training. @YachiZhang FYI |
Yes, I'm not mistaken. It is mentioned in the document. |
ls
I will add a check for tensorflow... I must have broken the logic when I implemented the two button visibility in headless. |
I figured why it was not working... I was in the master branch instead of dev ;-) |
Try the laest build, should do OK with buttons now. |
@jim60105 I was thinking... since running in the repo built docker container... we know what port tensorboard is listening on... So it would be possible to show a button called "Open tensorboard" that would simply open a tab in a browser pointing to the http://127.0.0.0:6006 for tensorboard... do you think it would make sense? It would remove the need for users to remember what port it might run on. |
I think it is not quite appropriate because users may modify the port mapping for tensorboard from docker-compose. If this requirement really needs to be implemented, I would generally use environment variables. |
I already implemented it dev... One can pass a parameter when starting the gui by setting an environment variable: self.tensorboard_port = os.environ.get(
"TENSORBOARD_PORT", self.DEFAULT_TENSORBOARD_PORT
) to change the tensorboard port... maybe this is enough? I am trying to make the GUI as convenient as possible so it can be done from there also. I understand there are other methods... but having it there allow users to stay in the GUI for most everything. |
Nice, then I will complete the part of Docker. |
@bmaltais It doesn't look right here |
Darn, copy paste issue… did not notice as I moved code from my Linux test env to my main code commit env. |
I noticed that you were adjusting the "Start tensorboard" button recently, which made me realize that I could try to decouple it from the project to further reduce the size of the docker image.
In fact, Tensorboard is not an essential core part of this project. Of course, I know that all users using kohya-ss-gui need tensorboard, which is why you bundled it together. But from a docker perspective, we can start another tensorboard container to do the job separately.
I think Tensorboard is very suitable for decoupling in this project. After separating it, we no longer need to manage its upgrade requests (whether it matches our python, cuda version, etc.) in the future. Tensorboard has its own official docker image at https://hub.docker.com/r/tensorflow/tensorflow, and I directly used it in the
docker-compose.yaml
.After this adjustment, the docker image size has reached 10.23GB.
In my experience, reducing it to around 10GB is an important achievement: this size can be built on a GitHub Free runner!
Future Docker users will no longer need to build images themselves; they can pull from your registry. This ensures that the user's runtime environment always matches the one at build time. There won't be any "dependencies cannot be obtained/was updated when building past versions," which would cause programs written at present to not run in the future.
You can check the status of this CI running on my repo: https://github.com/jim60105/kohya_ss/actions/workflows/docker_publish.yml
CI is triggered when you create a tag starting with
v
, build the docker image, and push it to your ghcr.Thanks to GitHub, actions and packages (ghcr.io) are totally free for open source projects.
Caution
This PR will require you to add a GitHub package and configure related permissions.
And there are two places in the docker-compose that need to be modified to your package, which I will mention below.
Although I have adjusted it to run automatically, it still increases management complexity to some extent.
Here is a guide for your reference.
https://docs.github.com/en/packages/managing-github-packages-using-github-actions-workflows/publishing-and-installing-a-package-with-github-actions
I'm not sure if you want to set up CI, feel free to let me know if I should adjust this PR.
I could implement and manage the docker CI on my own account instead just like this one if you think I took it a little out of control.