Kedro-MLflow on AWS batch causes every node to be logged as a separate run #432

Closed
hugocool opened this issue Jul 11, 2023 · 6 comments
Description

When running kedro pipelines on AWS Batch with kedro-mlflow, every node is logged as a separate run. This is because the pipeline is executed on Batch by running each node in its own Docker container with a separate docker run command, i.e. kedro run --node=....

Context

This is undesirable since these are not separate runs, simply individual nodes, and it quickly pollutes your mlflow tracking server.
Therefore each kedro run command issued to Batch should be made aware of the already-active run this node is part of.

Possible Implementation

While the changes to the batch runner should be implemented in the deployment pattern, kedro-mlflow should allow one to pass an mlflow_run_id CLI kwarg which sets the run_id.
I'm currently implementing a solution using the config loader, a custom cloud runner, and changes to the batch runner CLI.
I'm curious whether there is a better/more minimal alternative.

Possible Alternatives

setting an environment variable?
overriding the run_id with the git commit? (although this is difficult on Batch, since the container would need to be made aware of the git commit)

@Galileo-Galilei
Owner

Galileo-Galilei commented Jul 16, 2023

Hi @hugocool, this is a common feature request, and it is already partially possible.

First, I want to emphasize that on AWS Batch, having each node logged in a separate run is a feature, not a bug 😄 AWS Batch nodes are orchestrator nodes and they don't have the same purpose as kedro nodes. You can find a very similar discussion about kedro-mlflow support for Airflow here: #44, which explains that orchestrator nodes should be mapped to kedro pipelines rather than nodes. This has been discussed with the kedro team a couple of times too.

That said, your request is valid: you may want to propagate a mlflow run id through different orchestrator nodes. Some good news:

  1. If a mlflow run is already active, kedro-mlflow uses it instead of starting a new one, so nothing prevents you from starting a mlflow run "manually", e.g. with mlflow.start_run(run_id=YOUR_RUN_ID) before running the node (see the sketch below). I think mlflow used to let you set up a MLFLOW_RUN_ID environment variable, but it was not encouraged, so I am not sure what the current recommendation for this is. The drawback is that you need to set up the entire mlflow configuration manually (including the MLFLOW_TRACKING_URI, the MLFLOW_REGISTRY_URI...) because kedro-mlflow will ignore its configuration file.
  2. kedro-mlflow has a mlflow.tracking.run.id key in its configuration. If you override the configuration file, or just this key (e.g. with the OmegaConfigLoader you can use a custom resolver to read an environment variable), this will work out of the box.
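To illustrate option 1, here is a minimal sketch of what a container entrypoint could do before the kedro run starts; it assumes the tracking URI and the parent run id are passed into the container as environment variables, and it is not an API provided by kedro-mlflow:

    import os

    import mlflow

    # Values assumed to be injected into the container by the orchestrator /
    # the Batch job definition.
    mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])

    # Attach to the run created by the "main" process; since this run is now
    # active, kedro-mlflow will reuse it instead of starting a new one.
    mlflow.start_run(run_id=os.environ["MLFLOW_RUN_ID"])

    # ... then launch the kedro run in the same process, e.g. through a KedroSession.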

Both solutions are valid and quite easy to setup. You are not the first one who wants to add some configuration overriding at runtime through CLI args (see #395) but I am quite reluctant to add some extra API when I think kedro will enable it natively with OmegaConfigLoader and runtime parameters (see #kedro-org/kedro#2504, kedro-org/kedro#2175) because this generates much more boilerplate code and responsibility on my side and I have hard time to support this so I'd prefer it to be on the framework side.

@hugocool
Author

Would it not make sense to use a kedro run parameter override?
kedro run --params=<param_key1>:<value1>,<param_key2>:<value2>

So basically I need to override the kedro run command sent to each Docker container on Batch to be kedro run --params=mlflow.tracking.run.id:main_run_id, where main_run_id is the mlflow run id of the main process that manages the execution of nodes on Batch.
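Purely as an illustration (the function name below and the main_run_id parameter key are assumptions, not an existing Kedro or kedro-mlflow API), the orchestrating runner could build each per-node command along these lines:

    import mlflow

    def build_node_command(node_name: str) -> list[str]:
        # Assumes the orchestrating process already has an active run,
        # e.g. one started by kedro-mlflow before the runner dispatches jobs.
        main_run_id = mlflow.active_run().info.run_id
        # main_run_id is only a templated-config global here; the project still has
        # to map it onto mlflow.tracking.run.id (see the mlflow.yml snippet below).
        return [
            "kedro", "run",
            f"--node={node_name}",
            f"--params=main_run_id:{main_run_id}",
        ]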

One could probably make this work through the TemplatedConfigLoader and the use of a global.
Although after some digging around, I found this working for kedro 0.17, and it could be a different story with 0.18 and the OmegaConfigLoader.

Also, I don't know off the top of my head how to push this run.id to kedro-mlflow.
I could do the following in the mlflow.yml:

  run:
    id: "${main_run_id|None}" # if `id` is None, a new run will be created
    name: null # if `name` is None, pipeline name will be used for the run name
    nested: True  # if `nested` is False, you won't be able to launch sub-runs inside your nodes

And then try to pass it as a global, but I can't pass extra_params with 0.18.

Do you have an example of how to override the mlflow.tracking.run.id?

I would love to contribute a full working solution, and incorporate it into a kedro-aws extension that is compatible with kedro-mlflow!

@marrrcin

marrrcin commented Jul 21, 2023

If AWS Batch injects some unique ID environment variable into every container (like a run ID, but specific to the AWS Batch service itself), you can follow the same idea we have for kedro-sagemaker. There, we first add a node to the pipeline to "start the mlflow run", which adds a tag (mlflow.tag) with this unique identifier (PIPELINE_EXECUTION_ARN in the SageMaker case; I guess that for AWS Batch it would be one of the variables from https://docs.aws.amazon.com/batch/latest/userguide/job_env_vars.html), as shown here: https://github.com/getindata/kedro-sagemaker/blob/dbd78fd6c1781cc9e8cf046e14b3ab96faf63719/kedro_sagemaker/cli.py#L380. The subsequent nodes then look up the MLflow run ID using the MLflow SDK, as shown here: https://github.com/getindata/kedro-sagemaker/blob/dbd78fd6c1781cc9e8cf046e14b3ab96faf63719/kedro_sagemaker/cli_functions.py#L104, and set it as the MLFLOW_RUN_ID environment variable in the container before Kedro starts ;) We have this programmed in docker entrypoints due to SageMaker limitations, but you can do the same with Kedro Hooks.
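A minimal sketch of that lookup step, assuming the "start mlflow run" step tagged the parent run with the Batch identifier (the tag name aws_batch_job_id and the experiment lookup are illustrative, not the actual kedro-sagemaker code):

    import os

    from mlflow.tracking import MlflowClient

    def attach_to_parent_run(batch_job_id: str, experiment_name: str) -> None:
        client = MlflowClient()  # relies on MLFLOW_TRACKING_URI being set in the container
        experiment = client.get_experiment_by_name(experiment_name)
        runs = client.search_runs(
            experiment_ids=[experiment.experiment_id],
            filter_string=f"tags.aws_batch_job_id = '{batch_job_id}'",
            max_results=1,
        )
        if runs:
            # mlflow (and therefore kedro-mlflow) will attach to this run instead of
            # creating a new one when the run starts.
            os.environ["MLFLOW_RUN_ID"] = runs[0].info.run_id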

@Galileo-Galilei
Owner

Yes, @marrrcin's suggestion is likely the best way to do it: as explained, you need to start mlflow yourself in the container (e.g. by manually setting the MLFLOW_RUN_ID environment variable) before the kedro run starts, so that kedro-mlflow will use it as the default configuration instead of starting a new run. The drawback is that you have to set up all the other environment variables too (including MLFLOW_TRACKING_URI and MLFLOW_REGISTRY_URI).

@hugocool
Author

Thanks @marrrcin for the suggestion, I had not thought of that!

So the main difference with your suggestion would be the way the mlflow run id is communicated between the orchestrator and the AWS Batch containers.

In one approach, it is communicated through the docker run command for the container running in AWS Batch: the Kedro runner (AWSBatchRunner) could set it through the kedro run --params=... command. It would be passed as a command override and could then be picked up by kedro-mlflow and set as the run id.

The approach mentioned by @marrrcin, if I understand correctly, leverages an identifier issued by AWS Batch that is shared across all the containers resulting from a single run. This could be the job definition ID, the way I am currently implementing it, since for every kedro run I create a job definition to set the kedro run command.
One could tag the mlflow run with this job definition id and look it up before each container run in AWS Batch. This approach does not need an extra node to run, since the job definition is created before any nodes are running in Batch (unless you count the nodes running locally to orchestrate the Batch jobs).
The downsides to this approach are that we now also need to set the other mlflow configuration variables, as mentioned by @Galileo-Galilei, that there is an extra API call to mlflow before running each node in Batch, and that this mechanism does not hook into established methods for modifying variables before a run (using a kedro hook would, but looking up and setting env vars does not).
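For completeness, a rough sketch of the hook-based variant: it assumes the shared identifier (e.g. the job definition id) is exposed to the container as an environment variable, here called BATCH_GROUP_ID (hypothetical, since AWS Batch does not inject it automatically), that the parent run was tagged accordingly, and that this hook fires early enough relative to kedro-mlflow's own hooks, which would need to be verified:

    import os

    import mlflow
    from kedro.framework.hooks import hook_impl

    class AttachToParentMlflowRunHook:
        """Hypothetical project hook attaching the Batch container to the parent mlflow run."""

        @hook_impl
        def after_context_created(self, context):
            # BATCH_GROUP_ID is a hypothetical variable you would expose in the job
            # definition yourself (e.g. the job definition id).
            group_id = os.environ.get("BATCH_GROUP_ID")
            if not group_id or "MLFLOW_RUN_ID" in os.environ:
                return
            runs = mlflow.search_runs(
                search_all_experiments=True,
                filter_string=f"tags.batch_group_id = '{group_id}'",
                max_results=1,
                output_format="list",
            )
            if runs:
                os.environ["MLFLOW_RUN_ID"] = runs[0].info.run_id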

One other thing to take into account is that with the recent addition of support for Prefect 2.0, there is now also the possibility of using the Prefect AWS Batch job. Since I want to migrate to a single open-source orchestrator for as many projects as possible, I would like the method to work for Prefect as well.
We discussed some of the hiccups in the slack channel already, but we could open a separate issue for that of course.

@Galileo-Galilei
Owner

I'm closing the issue in favor of #395. I hope we can make it work after the 0.19 release with OmegaConfigLoader resolvers, but it still needs some work and design.
