Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(docker): Decouple Tensorboard from the project and add CI build workflow. #2344

Merged
merged 15 commits into from
Apr 25, 2024
Merged

Conversation

jim60105
Copy link
Contributor

@jim60105 jim60105 commented Apr 19, 2024

  • Decouple Tensorboard from the project (in terms of docker)
  • Reduce the size of docker image to 10.23GB
  • Add docker image build CI workflow.

image

I noticed that you were adjusting the "Start tensorboard" button recently, which made me realize that I could try to decouple it from the project to further reduce the size of the docker image.

In fact, Tensorboard is not an essential core part of this project. Of course, I know that all users using kohya-ss-gui need tensorboard, which is why you bundled it together. But from a docker perspective, we can start another tensorboard container to do the job separately.

Each container should have only one concern. Decoupling applications into multiple containers makes it easier to scale horizontally and reuse containers. —— from Docker Official doc: General best practices for writing Dockerfiles

I think Tensorboard is very suitable for decoupling in this project. After separating it, we no longer need to manage its upgrade requests (whether it matches our python, cuda version, etc.) in the future. Tensorboard has its own official docker image at https://hub.docker.com/r/tensorflow/tensorflow, and I directly used it in the docker-compose.yaml.

After this adjustment, the docker image size has reached 10.23GB.
In my experience, reducing it to around 10GB is an important achievement: this size can be built on a GitHub Free runner!

Future Docker users will no longer need to build images themselves; they can pull from your registry. This ensures that the user's runtime environment always matches the one at build time. There won't be any "dependencies cannot be obtained/was updated when building past versions," which would cause programs written at present to not run in the future.

You can check the status of this CI running on my repo: https://github.com/jim60105/kohya_ss/actions/workflows/docker_publish.yml
CI is triggered when you create a tag starting with v, build the docker image, and push it to your ghcr.
Thanks to GitHub, actions and packages (ghcr.io) are totally free for open source projects.

Caution

This PR will require you to add a GitHub package and configure related permissions.
And there are two places in the docker-compose that need to be modified to your package, which I will mention below.

Although I have adjusted it to run automatically, it still increases management complexity to some extent.
Here is a guide for your reference.
https://docs.github.com/en/packages/managing-github-packages-using-github-actions-workflows/publishing-and-installing-a-package-with-github-actions

I'm not sure if you want to set up CI, feel free to let me know if I should adjust this PR.
I could implement and manage the docker CI on my own account instead just like this one if you think I took it a little out of control.

@ccharest93
Copy link
Contributor

Also tensorboard isn't needed when using wandb or other types of logging(grafana agent on k8s clusters). I think this is a great idea.

@jim60105
Copy link
Contributor Author

jim60105 commented Apr 19, 2024

Hmm, I thought of another question.

Can accelerate write logs correctly without installing tensorboard?
I didn't see it being used in the project when I searched globally and didn't notice this problem.


Off topic, I cleanup all the build cache and do a clean docker build and start to encounter accelerate died with <Signals.SIGKILL: 9>.

This even happened when I build with the git commit before this PR, I think it's a non relative issue.

I have not yet confirmed it is a code issue or my platform issue. I'll come back when I have more information.

This is totally my fault, I tested with a wrong config that leads to OOM 😂

@jim60105
Copy link
Contributor Author

After testing, accelerate do needs the tensorboard package to write logs.

I have fixed this part and the image size is 10.3GB now.

It did not have a significant impact on the size since the actual huge one is tensorflow.

@jim60105
Copy link
Contributor Author

Explained the different designs of tensorboard in README.md.

@jim60105
Copy link
Contributor Author

Rebase on to bmaltais/kohya_ss/dev

docker-compose.yaml Outdated Show resolved Hide resolved
docker-compose.yaml Outdated Show resolved Hide resolved
- The Dockerfile no longer installs TensorFlow, only PyTorch
- The TensorBoard exposure port has been removed from the Dockerfile
- A new tensorboard service has been added to the docker-compose.yaml file, utilizing the TensorFlow:latest-gpu image and exposing port 6006.
- The requirements_linux_docker.txt file has the tensorboard and tensorflow requirements commented out.

Signed-off-by: 陳鈞 <[email protected]>
Remove tensorboard reduce the docker image size from 17GB to 10GB.
This reduction enables the image to be built on the GitHub free runner.

Signed-off-by: 陳鈞 <[email protected]>
- Modify the Docker publish workflow to only trigger on tag pushes that match the pattern `v*`, for instance, `v1.0.0`.
- Make the job run only on Ubuntu and tag commits.
- Enable submodules during checkout to pull all the data.
- Change Docker image name to `kohya-ss-gui`.
- Configure Docker metadata dynamically based on commit SHA and SemVer pattern.
- Set up QEMU, a generic and open source machine emulator and virtualizer, possibly for cross-platform docker image builds.
- Use `GITHUB_TOKEN` instead of `CR_PAT` for Docker login.
- Adjust the build arguments for Docker, including the version and the release number.
- Modify cache configuration to store in registry, rather than Github Action (gha) to avoid capacity limit.
- Enable software bill of materials (sbom) and provenance for better software supply chain and artifact traceability and verifiability.

Signed-off-by: 陳鈞 <[email protected]>
- Remove the mapping for the port 6006 in the docker-compose file

Signed-off-by: 陳鈞 <[email protected]>
…flow

- Removed the `github.sha` tag from the docker publish workflow file.

Signed-off-by: 陳鈞 <[email protected]>
- Add the `--do_not_use_shell` option to the Python CMD in the Dockerfile.

Signed-off-by: 陳鈞 <[email protected]>
This reverts commit 1bd0465.

`shell=False` does not work in Linux when passing the command as a string with arguments.
#2284 (comment)
…service

- Update the image source for `kohya-ss-gui` service in docker-compose file from `kohya-ss-gui:latest` to `ghcr.io/jim60105/kohya-ss-gui:latest`

`ghcr.io/jim60105/kohya-ss-gui` is for testing the forked repository and must be changed to `bmaltais`'s registry before merging the PR.

Signed-off-by: 陳鈞 <[email protected]>
…kage for writing logs.

- Removed exact version specification for tensorboard and tensorflow in the linux docker requirements, allowing the latest versions to be installed.
- Removed commented-out installations for tensorboard and tensorflow.

Signed-off-by: 陳鈞 <[email protected]>
- Added a new section in README.md about the design of the Dockerfile.
- Made changes in the Docker limitations section, including the separation of TensorBoard from the project and the related changes, no automatic browser launch, and manual file path input requirement.
- Updated instructions for docker upgrade using new code version.
- Note added for possible lengthy docker image build times.

Signed-off-by: 陳鈞 <[email protected]>
- Implemented rich logging in Dockerfile and set FORCE_COLOR environment variable to true
- Set the COLUMNS environment variable to `100` in Dockerfile for wider output
- Replaced single quotes with double quotes for `device_ids` value in docker-compose.yaml

Signed-off-by: 陳鈞 <[email protected]>
- Change the docker image from `ghcr.io/jim60105/kohya-ss-gui:latest` to `ghcr.io/bmaltais/kohya-ss-gui:latest`
- Update the cache source from `ghcr.io/jim60105/kohya-ss-gui:cache` to `ghcr.io/bmaltais/kohya-ss-gui:cache`

Signed-off-by: 陳鈞 <[email protected]>
@jim60105
Copy link
Contributor Author

The "Package" tab on your GitHub page is the ghcr docker registry.
Here's the documentation for the registry.
https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry

I think we can first merge this PR and trigger the workflow, allowing it to automatically publish the package. This process should enable the workflow to automatically obtain permission for the package. If the publication fails, come back to check the permission issue.

After publishing is completed, remember to adjust the package to be public. The new package will be set as private by default.

@jim60105 jim60105 marked this pull request as ready for review April 25, 2024 17:14
@bmaltais bmaltais merged commit 059afdb into bmaltais:dev Apr 25, 2024
1 check passed
@bmaltais
Copy link
Owner

I have triggered a build... tried to configure the GITHUB_TOKEN secret but github won't let me... so I assume it is auto managed by github on the platform... Look like it worked

@bmaltais
Copy link
Owner

bmaltais commented Apr 25, 2024

Give it a try and let me know if all is fine... I tagged the last commit in dev so it should contain the latest dev code.

@bmaltais
Copy link
Owner

I tried to run the built image with:

git clone https://github.com/bmaltais/kohya_ss.git
cd kohya_ss
docker compose up -d

as specified in the README but this just ended up building the whole thing locally...

@jim60105
Copy link
Contributor Author

I have triggered a build... tried to configure the GITHUB_TOKEN secret but github won't let me... so I assume it is auto managed by github on the platform... Look like it worked

Yes, GITHUB_TOKEN is builtin feature.

Give it a try and let me know if all is fine... I tagged the last commit in dev so it should contain the latest dev code.

It looks great, and the CI is successful.

I tried to run the built image with:

git clone https://github.com/bmaltais/kohya_ss.git
cd kohya_ss
docker compose up -d

as specified in the README but this just ended up building the whole thing locally...

When I write it like this, the default behavior is that it will first pull the image, unless the image does not exist (online), then it will build locally.
I'm not sure if this is a new behavior, just to be safe, please update Docker to the latest version.

image: ghcr.io/bmaltais/kohya-ss-gui:latest
user: 1000:0
build:
context: .

@jim60105
Copy link
Contributor Author

jim60105 commented Apr 25, 2024

image
I notice that the Tensorboard buttons appeared because I am using --headless in Docker.
In my understanding, the file selector in the container won't works, so headless mode should be used.

This may be a separate issue.

@bmaltais
Copy link
Owner

Yeah… I think a new flag will be required to hide it… I need to think of the best way to implement it… but it was supposed to not display if tensorflow was missing… not sure why this did not work.

@bmaltais
Copy link
Owner

I think it did not pull the image because there is no latest tag for it… maybe…

@jim60105
Copy link
Contributor Author

I think it did not pull the image because there is no latest tag for it… maybe…

latest tag is exists.
Docker compose do pull the image on my platform.🤔
image

@jim60105
Copy link
Contributor Author

Yeah… I think a new flag will be required to hide it… I need to think of the best way to implement it… but it was supposed to not display if tensorflow was missing… not sure why this did not work.

Because visibility or self.headless

with gr.Row():
button_start_tensorboard = gr.Button(
value="Start tensorboard",
elem_id="myTensorButton",
visible=visibility or self.headless,
)

@jim60105
Copy link
Contributor Author

I had take a look at why you modified this for #2349

In my opinion, docker user can directly restart the container when they want to interrupt the training.
Since we handle the signal correctly, the training process will be interrupted before the container stops. Just like when you use Ctrl+C to terminate a program.

@YachiZhang FYI

@jim60105
Copy link
Contributor Author

When I write it like this, the default behavior is that it will first pull the image, unless the image does not exist (online), then it will build locally.

Yes, I'm not mistaken. It is mentioned in the document.
https://docs.docker.com/compose/compose-file/05-services/#image

@bmaltais
Copy link
Owner

ls

Yeah… I think a new flag will be required to hide it… I need to think of the best way to implement it… but it was supposed to not display if tensorflow was missing… not sure why this did not work.

Because visibility or self.headless

with gr.Row():
button_start_tensorboard = gr.Button(
value="Start tensorboard",
elem_id="myTensorButton",
visible=visibility or self.headless,
)

I will add a check for tensorflow... I must have broken the logic when I implemented the two button visibility in headless.

@bmaltais
Copy link
Owner

I figured why it was not working... I was in the master branch instead of dev ;-)

@bmaltais
Copy link
Owner

Try the laest build, should do OK with buttons now.

@bmaltais
Copy link
Owner

bmaltais commented Apr 26, 2024

@jim60105 I was thinking... since running in the repo built docker container... we know what port tensorboard is listening on... So it would be possible to show a button called "Open tensorboard" that would simply open a tab in a browser pointing to the http://127.0.0.0:6006 for tensorboard... do you think it would make sense? It would remove the need for users to remember what port it might run on.

@jim60105
Copy link
Contributor Author

I think it is not quite appropriate because users may modify the port mapping for tensorboard from docker-compose.

If this requirement really needs to be implemented, I would generally use environment variables.
Create a .env file to define the port number, use environment variable for port mapping on tensorboard in docker-compose, pass the environment variables to the kohya-ss-gui container, and then read the environment variables in the program to handle the url to open when the button is pressed.
This is a bit too complicated.

@jim60105
Copy link
Contributor Author

For Windows Docker Desktop users, we simply click here to open the browser, and TensorBoard is listed alongside it, so there's no issue.
image

For Linux user... Linux users are always aware of what's happening, so there's no need to worry.😉

@bmaltais
Copy link
Owner

bmaltais commented Apr 26, 2024

I already implemented it dev... One can pass a parameter when starting the gui by setting an environment variable:

self.tensorboard_port = os.environ.get(
            "TENSORBOARD_PORT", self.DEFAULT_TENSORBOARD_PORT
        )

to change the tensorboard port... maybe this is enough?

I am trying to make the GUI as convenient as possible so it can be done from there also. I understand there are other methods... but having it there allow users to stay in the GUI for most everything.

@jim60105
Copy link
Contributor Author

I already implemented it dev... One can pass a parameter when starting the gui by setting an environment variable:

self.tensorboard_port = os.environ.get(
            "TENSORBOARD_PORT", self.DEFAULT_TENSORBOARD_PORT
        )

to change the tensorboard port... maybe this is enough?

I am trying to make the GUI as convenient as possible so it can be done from there also. I understand there are other methods... but having it there allow users to stay in the GUI for most everything.

Nice, then I will complete the part of Docker.

@jim60105
Copy link
Contributor Author

@bmaltais
Copy link
Owner

Darn, copy paste issue… did not notice as I moved code from my Linux test env to my main code commit env.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants