chore(docker): Decouple Tensorboard from the project and add CI build workflow. #2344

jim60105 · 2024-04-19T06:37:16Z

Decouple Tensorboard from the project (in terms of docker)
Reduce the size of docker image to 10.23GB
Add docker image build CI workflow.

I noticed that you were adjusting the "Start tensorboard" button recently, which made me realize that I could try to decouple it from the project to further reduce the size of the docker image.

In fact, Tensorboard is not an essential core part of this project. Of course, I know that all users using kohya-ss-gui need tensorboard, which is why you bundled it together. But from a docker perspective, we can start another tensorboard container to do the job separately.

Each container should have only one concern. Decoupling applications into multiple containers makes it easier to scale horizontally and reuse containers. —— from Docker Official doc: General best practices for writing Dockerfiles

I think Tensorboard is very suitable for decoupling in this project. After separating it, we no longer need to manage its upgrade requests (whether it matches our python, cuda version, etc.) in the future. Tensorboard has its own official docker image at https://hub.docker.com/r/tensorflow/tensorflow, and I directly used it in the docker-compose.yaml.

After this adjustment, the docker image size has reached 10.23GB.
In my experience, reducing it to around 10GB is an important achievement: this size can be built on a GitHub Free runner!

Future Docker users will no longer need to build images themselves; they can pull from your registry. This ensures that the user's runtime environment always matches the one at build time. There won't be any "dependencies cannot be obtained/was updated when building past versions," which would cause programs written at present to not run in the future.

You can check the status of this CI running on my repo: https://github.com/jim60105/kohya_ss/actions/workflows/docker_publish.yml
CI is triggered when you create a tag starting with v, build the docker image, and push it to your ghcr.
Thanks to GitHub, actions and packages (ghcr.io) are totally free for open source projects.

Caution

This PR will require you to add a GitHub package and configure related permissions.
And there are two places in the docker-compose that need to be modified to your package, which I will mention below.

Although I have adjusted it to run automatically, it still increases management complexity to some extent.
Here is a guide for your reference.
https://docs.github.com/en/packages/managing-github-packages-using-github-actions-workflows/publishing-and-installing-a-package-with-github-actions

I'm not sure if you want to set up CI, feel free to let me know if I should adjust this PR.
I could implement and manage the docker CI on my own account instead just like this one if you think I took it a little out of control.

ccharest93 · 2024-04-19T11:26:26Z

Also tensorboard isn't needed when using wandb or other types of logging(grafana agent on k8s clusters). I think this is a great idea.

jim60105 · 2024-04-19T12:19:10Z

Hmm, I thought of another question.

Can accelerate write logs correctly without installing tensorboard?
I didn't see it being used in the project when I searched globally and didn't notice this problem.

~~Off topic, I cleanup all the build cache and do a clean docker build and start to encounter accelerate died with <Signals.SIGKILL: 9>.~~

~~This even happened when I build with the git commit before this PR, I think it's a non relative issue.~~

~~I have not yet confirmed it is a code issue or my platform issue. I'll come back when I have more information.~~

This is totally my fault, I tested with a wrong config that leads to OOM 😂

jim60105 · 2024-04-19T14:19:44Z

After testing, accelerate do needs the tensorboard package to write logs.

I have fixed this part and the image size is 10.3GB now.

It did not have a significant impact on the size since the actual huge one is tensorflow.

jim60105 · 2024-04-19T16:44:34Z

Explained the different designs of tensorboard in README.md.

jim60105 · 2024-04-21T03:59:15Z

Rebase on to bmaltais/kohya_ss/dev

docker-compose.yaml

- The Dockerfile no longer installs TensorFlow, only PyTorch - The TensorBoard exposure port has been removed from the Dockerfile - A new tensorboard service has been added to the docker-compose.yaml file, utilizing the TensorFlow:latest-gpu image and exposing port 6006. - The requirements_linux_docker.txt file has the tensorboard and tensorflow requirements commented out. Signed-off-by: 陳鈞 <[email protected]>

Remove tensorboard reduce the docker image size from 17GB to 10GB. This reduction enables the image to be built on the GitHub free runner. Signed-off-by: 陳鈞 <[email protected]>

- Modify the Docker publish workflow to only trigger on tag pushes that match the pattern `v*`, for instance, `v1.0.0`. - Make the job run only on Ubuntu and tag commits. - Enable submodules during checkout to pull all the data. - Change Docker image name to `kohya-ss-gui`. - Configure Docker metadata dynamically based on commit SHA and SemVer pattern. - Set up QEMU, a generic and open source machine emulator and virtualizer, possibly for cross-platform docker image builds. - Use `GITHUB_TOKEN` instead of `CR_PAT` for Docker login. - Adjust the build arguments for Docker, including the version and the release number. - Modify cache configuration to store in registry, rather than Github Action (gha) to avoid capacity limit. - Enable software bill of materials (sbom) and provenance for better software supply chain and artifact traceability and verifiability. Signed-off-by: 陳鈞 <[email protected]>

- Remove the mapping for the port 6006 in the docker-compose file Signed-off-by: 陳鈞 <[email protected]>

Signed-off-by: 陳鈞 <[email protected]>

…flow - Removed the `github.sha` tag from the docker publish workflow file. Signed-off-by: 陳鈞 <[email protected]>

- Add the `--do_not_use_shell` option to the Python CMD in the Dockerfile. Signed-off-by: 陳鈞 <[email protected]>

This reverts commit 1bd0465. `shell=False` does not work in Linux when passing the command as a string with arguments. #2284 (comment)

…service - Update the image source for `kohya-ss-gui` service in docker-compose file from `kohya-ss-gui:latest` to `ghcr.io/jim60105/kohya-ss-gui:latest` `ghcr.io/jim60105/kohya-ss-gui` is for testing the forked repository and must be changed to `bmaltais`'s registry before merging the PR. Signed-off-by: 陳鈞 <[email protected]>

Signed-off-by: 陳鈞 <[email protected]>

…kage for writing logs. - Removed exact version specification for tensorboard and tensorflow in the linux docker requirements, allowing the latest versions to be installed. - Removed commented-out installations for tensorboard and tensorflow. Signed-off-by: 陳鈞 <[email protected]>

- Added a new section in README.md about the design of the Dockerfile. - Made changes in the Docker limitations section, including the separation of TensorBoard from the project and the related changes, no automatic browser launch, and manual file path input requirement. - Updated instructions for docker upgrade using new code version. - Note added for possible lengthy docker image build times. Signed-off-by: 陳鈞 <[email protected]>

- Implemented rich logging in Dockerfile and set FORCE_COLOR environment variable to true - Set the COLUMNS environment variable to `100` in Dockerfile for wider output - Replaced single quotes with double quotes for `device_ids` value in docker-compose.yaml Signed-off-by: 陳鈞 <[email protected]>

- Change the docker image from `ghcr.io/jim60105/kohya-ss-gui:latest` to `ghcr.io/bmaltais/kohya-ss-gui:latest` - Update the cache source from `ghcr.io/jim60105/kohya-ss-gui:cache` to `ghcr.io/bmaltais/kohya-ss-gui:cache` Signed-off-by: 陳鈞 <[email protected]>

jim60105 · 2024-04-25T17:14:33Z

The "Package" tab on your GitHub page is the ghcr docker registry.
Here's the documentation for the registry.
https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry

I think we can first merge this PR and trigger the workflow, allowing it to automatically publish the package. This process should enable the workflow to automatically obtain permission for the package. If the publication fails, come back to check the permission issue.

After publishing is completed, remember to adjust the package to be public. The new package will be set as private by default.

bmaltais · 2024-04-25T18:32:55Z

I have triggered a build... tried to configure the GITHUB_TOKEN secret but github won't let me... so I assume it is auto managed by github on the platform... Look like it worked

bmaltais · 2024-04-25T18:40:48Z

Give it a try and let me know if all is fine... I tagged the last commit in dev so it should contain the latest dev code.

bmaltais · 2024-04-25T18:58:10Z

I tried to run the built image with:

git clone https://github.com/bmaltais/kohya_ss.git
cd kohya_ss
docker compose up -d

as specified in the README but this just ended up building the whole thing locally...

jim60105 · 2024-04-25T19:13:31Z

I have triggered a build... tried to configure the GITHUB_TOKEN secret but github won't let me... so I assume it is auto managed by github on the platform... Look like it worked

Yes, GITHUB_TOKEN is builtin feature.

Give it a try and let me know if all is fine... I tagged the last commit in dev so it should contain the latest dev code.

It looks great, and the CI is successful.

I tried to run the built image with:
git clone https://github.com/bmaltais/kohya_ss.git
cd kohya_ss
docker compose up -d
as specified in the README but this just ended up building the whole thing locally...

When I write it like this, the default behavior is that it will first pull the image, unless the image does not exist (online), then it will build locally.
I'm not sure if this is a new behavior, just to be safe, please update Docker to the latest version.

kohya_ss/docker-compose.yaml

Lines 4 to 7 in 059afdb

    
           image: ghcr.io/bmaltais/kohya-ss-gui:latest 
        
           user: 1000:0 
        
           build: 
        
             context: .

jim60105 · 2024-04-25T19:21:57Z

I notice that the Tensorboard buttons appeared because I am using --headless in Docker.
In my understanding, the file selector in the container won't works, so headless mode should be used.

This may be a separate issue.

bmaltais · 2024-04-25T20:10:46Z

Yeah… I think a new flag will be required to hide it… I need to think of the best way to implement it… but it was supposed to not display if tensorflow was missing… not sure why this did not work.

bmaltais · 2024-04-25T20:11:52Z

I think it did not pull the image because there is no latest tag for it… maybe…

jim60105 · 2024-04-25T20:46:46Z

I think it did not pull the image because there is no latest tag for it… maybe…

latest tag is exists.
Docker compose do pull the image on my platform.🤔

jim60105 · 2024-04-25T20:49:43Z

Yeah… I think a new flag will be required to hide it… I need to think of the best way to implement it… but it was supposed to not display if tensorflow was missing… not sure why this did not work.

Because visibility or self.headless

kohya_ss/kohya_gui/class_tensorboard.py

Lines 114 to 119 in 059afdb

    
           with gr.Row(): 
        
               button_start_tensorboard = gr.Button( 
        
                   value="Start tensorboard", 
        
                   elem_id="myTensorButton", 
        
                   visible=visibility or self.headless, 
        
               )

jim60105 · 2024-04-25T21:00:45Z

I had take a look at why you modified this for #2349

In my opinion, docker user can directly restart the container when they want to interrupt the training.
Since we handle the signal correctly, the training process will be interrupted before the container stops. Just like when you use Ctrl+C to terminate a program.

@YachiZhang FYI

jim60105 · 2024-04-25T21:11:35Z

When I write it like this, the default behavior is that it will first pull the image, unless the image does not exist (online), then it will build locally.

Yes, I'm not mistaken. It is mentioned in the document.
https://docs.docker.com/compose/compose-file/05-services/#image

bmaltais · 2024-04-25T22:15:13Z

ls

Yeah… I think a new flag will be required to hide it… I need to think of the best way to implement it… but it was supposed to not display if tensorflow was missing… not sure why this did not work.

Because visibility or self.headless

kohya_ss/kohya_gui/class_tensorboard.py

Lines 114 to 119 in 059afdb

with gr.Row():

button_start_tensorboard = gr.Button(

value="Start tensorboard",

elem_id="myTensorButton",

visible=visibility or self.headless,

)

I will add a check for tensorflow... I must have broken the logic when I implemented the two button visibility in headless.

bmaltais · 2024-04-25T22:18:31Z

I figured why it was not working... I was in the master branch instead of dev ;-)

bmaltais · 2024-04-25T23:18:23Z

Try the laest build, should do OK with buttons now.

bmaltais · 2024-04-26T12:31:57Z

@jim60105 I was thinking... since running in the repo built docker container... we know what port tensorboard is listening on... So it would be possible to show a button called "Open tensorboard" that would simply open a tab in a browser pointing to the http://127.0.0.0:6006 for tensorboard... do you think it would make sense? It would remove the need for users to remember what port it might run on.

jim60105 · 2024-04-26T13:54:43Z

I think it is not quite appropriate because users may modify the port mapping for tensorboard from docker-compose.

If this requirement really needs to be implemented, I would generally use environment variables.
Create a .env file to define the port number, use environment variable for port mapping on tensorboard in docker-compose, pass the environment variables to the kohya-ss-gui container, and then read the environment variables in the program to handle the url to open when the button is pressed.
This is a bit too complicated.

jim60105 · 2024-04-26T14:00:17Z

For Windows Docker Desktop users, we simply click here to open the browser, and TensorBoard is listed alongside it, so there's no issue.

For Linux user... Linux users are always aware of what's happening, so there's no need to worry.😉

bmaltais · 2024-04-26T14:04:28Z

I already implemented it dev... One can pass a parameter when starting the gui by setting an environment variable:

self.tensorboard_port = os.environ.get(
            "TENSORBOARD_PORT", self.DEFAULT_TENSORBOARD_PORT
        )

to change the tensorboard port... maybe this is enough?

I am trying to make the GUI as convenient as possible so it can be done from there also. I understand there are other methods... but having it there allow users to stay in the GUI for most everything.

jim60105 · 2024-04-26T14:15:32Z

I already implemented it dev... One can pass a parameter when starting the gui by setting an environment variable:
self.tensorboard_port = os.environ.get(
            "TENSORBOARD_PORT", self.DEFAULT_TENSORBOARD_PORT
        )
to change the tensorboard port... maybe this is enough?

I am trying to make the GUI as convenient as possible so it can be done from there also. I understand there are other methods... but having it there allow users to stay in the GUI for most everything.

Nice, then I will complete the part of Docker.

jim60105 · 2024-04-26T14:20:21Z

@bmaltais It doesn't look right here
0c2c2d4#diff-43cdb079ff236e1a6370e26b93f4b59dc682064756e172593601be8c60b1423aL129-R138

bmaltais · 2024-04-26T16:29:13Z

Darn, copy paste issue… did not notice as I moved code from my Linux test env to my main code commit env.

jim60105 mentioned this pull request Apr 19, 2024

[Feature request] Hide "Start tensorboard" buttons with cli argument. #2345

Closed

jim60105 commented Apr 21, 2024

View reviewed changes

docker-compose.yaml Outdated Show resolved Hide resolved

docker-compose.yaml Outdated Show resolved Hide resolved

jim60105 added 15 commits April 26, 2024 01:09

ci(docker): add docker CI again

1197293

Remove tensorboard reduce the docker image size from 17GB to 10GB. This reduction enables the image to be built on the GitHub free runner. Signed-off-by: 陳鈞 <[email protected]>

chore(docker): remove 6006 port from kohya-ss-gui in docker-compose

05d420f

- Remove the mapping for the port 6006 in the docker-compose file Signed-off-by: 陳鈞 <[email protected]>

chore(docker): version is obsolete

c73e737

Signed-off-by: 陳鈞 <[email protected]>

ci(docker): removed the github.sha tag from the docker publish work…

627ebaf

…flow - Removed the `github.sha` tag from the docker publish workflow file. Signed-off-by: 陳鈞 <[email protected]>

chore(docker): add the --do_not_use_shell in Docker commands

b73fc92

- Add the `--do_not_use_shell` option to the Python CMD in the Dockerfile. Signed-off-by: 陳鈞 <[email protected]>

Revert "chore(docker): add the --do_not_use_shell in Docker commands"

ca15158

This reverts commit 1bd0465. `shell=False` does not work in Linux when passing the command as a string with arguments. #2284 (comment)

docs: update README about docker

b569b27

Signed-off-by: 陳鈞 <[email protected]>

docs: update README toc

c6626b5

Signed-off-by: 陳鈞 <[email protected]>

jim60105 marked this pull request as ready for review April 25, 2024 17:14

bmaltais merged commit 059afdb into bmaltais:dev Apr 25, 2024
1 check passed

jim60105 mentioned this pull request Apr 26, 2024

chore(docker): Configure TensorBoard port through .env file #2397

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore(docker): Decouple Tensorboard from the project and add CI build workflow. #2344

chore(docker): Decouple Tensorboard from the project and add CI build workflow. #2344

jim60105 commented Apr 19, 2024 •

edited

Loading

ccharest93 commented Apr 19, 2024

jim60105 commented Apr 19, 2024 •

edited

Loading

jim60105 commented Apr 19, 2024

jim60105 commented Apr 19, 2024

jim60105 commented Apr 21, 2024

jim60105 commented Apr 25, 2024

bmaltais commented Apr 25, 2024

bmaltais commented Apr 25, 2024 •

edited

Loading

bmaltais commented Apr 25, 2024

jim60105 commented Apr 25, 2024

jim60105 commented Apr 25, 2024 •

edited

Loading

bmaltais commented Apr 25, 2024

bmaltais commented Apr 25, 2024

jim60105 commented Apr 25, 2024

jim60105 commented Apr 25, 2024

jim60105 commented Apr 25, 2024

jim60105 commented Apr 25, 2024

bmaltais commented Apr 25, 2024

bmaltais commented Apr 25, 2024

bmaltais commented Apr 25, 2024

bmaltais commented Apr 26, 2024 •

edited

Loading

jim60105 commented Apr 26, 2024

jim60105 commented Apr 26, 2024

bmaltais commented Apr 26, 2024 •

edited

Loading

jim60105 commented Apr 26, 2024

jim60105 commented Apr 26, 2024

bmaltais commented Apr 26, 2024

chore(docker): Decouple Tensorboard from the project and add CI build workflow. #2344

chore(docker): Decouple Tensorboard from the project and add CI build workflow. #2344

Conversation

jim60105 commented Apr 19, 2024 • edited Loading

ccharest93 commented Apr 19, 2024

jim60105 commented Apr 19, 2024 • edited Loading

jim60105 commented Apr 19, 2024

jim60105 commented Apr 19, 2024

jim60105 commented Apr 21, 2024

jim60105 commented Apr 25, 2024

bmaltais commented Apr 25, 2024

bmaltais commented Apr 25, 2024 • edited Loading

bmaltais commented Apr 25, 2024

jim60105 commented Apr 25, 2024

jim60105 commented Apr 25, 2024 • edited Loading

bmaltais commented Apr 25, 2024

bmaltais commented Apr 25, 2024

jim60105 commented Apr 25, 2024

jim60105 commented Apr 25, 2024

jim60105 commented Apr 25, 2024

jim60105 commented Apr 25, 2024

bmaltais commented Apr 25, 2024

bmaltais commented Apr 25, 2024

bmaltais commented Apr 25, 2024

bmaltais commented Apr 26, 2024 • edited Loading

jim60105 commented Apr 26, 2024

jim60105 commented Apr 26, 2024

bmaltais commented Apr 26, 2024 • edited Loading

jim60105 commented Apr 26, 2024

jim60105 commented Apr 26, 2024

bmaltais commented Apr 26, 2024

jim60105 commented Apr 19, 2024 •

edited

Loading

jim60105 commented Apr 19, 2024 •

edited

Loading

bmaltais commented Apr 25, 2024 •

edited

Loading

jim60105 commented Apr 25, 2024 •

edited

Loading

bmaltais commented Apr 26, 2024 •

edited

Loading

bmaltais commented Apr 26, 2024 •

edited

Loading