Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Python 3.10 and 3.11 #1937

Merged
merged 72 commits into from
Mar 19, 2024
Merged

Conversation

simonzhaoms
Copy link
Collaborator

@simonzhaoms simonzhaoms commented Jun 6, 2023

Description

This PR added the support for Python 3.10 and 3.11 with the following changes:

  • Remove dependencies that are required by other dependencies.
  • Upgrade dependency versions to include those support Python 3.10 and 3.11.
  • Replace out-of-date dependencies with their new alternatives, such as nvidia-ml-py3 -> nvidia-ml-py.
  • Resolved errors introduced by the upgrade, such as errors introduced from fastai to fastai2

However, this upgrade doesn't fix the following issues:

Related Issues

Checklist:

  • I have followed the contribution guidelines and code style for this project.
  • I have added tests covering my contributions.
  • I have updated the documentation accordingly.
  • This PR is being made to staging branch and not to main branch.

@simonzhaoms simonzhaoms self-assigned this Jun 6, 2023
@miguelgfierro
Copy link
Collaborator

hey @simonzhaoms the tests are broken right now, see #1934. The last thing I've been trying to do is to build a docker file and inject it into the RunConfiguration.

I think we should first fix the tests, and then we can do the upgrade to python 3.10 and 3.11. Do you think you could help me with it?

@simonzhaoms
Copy link
Collaborator Author

hey @simonzhaoms the tests are broken right now, see #1934. The last thing I've been trying to do is to build a docker file and inject it into the RunConfiguration.

I think we should first fix the tests, and then we can do the upgrade to python 3.10 and 3.11. Do you think you could help me with it?

Sure, I can take a look.

@miguelgfierro
Copy link
Collaborator

hey @simonzhaoms, after the tests are fixed, we should do a PR to main. Do you want to merge this PR before doing the PR to main or after?

@simonzhaoms simonzhaoms changed the title Add support for Python 3.10 and 3.11 Add support for Python 3.10 and 3.11 and drop for 3.7 Jun 9, 2023
@simonzhaoms
Copy link
Collaborator Author

hey @simonzhaoms, after the tests are fixed, we should do a PR to main. Do you want to merge this PR before doing the PR to main or after?

after, because there are still some problems with this PR to be solved.

@simonzhaoms
Copy link
Collaborator Author

@miguelgfierro this PR is still a draft, not ready for review.

@simonzhaoms
Copy link
Collaborator Author

simonzhaoms commented Jun 11, 2023

After upgrading the dependent packages and the docker images in the commit b71c4ed, it failed because AzureML SDK tried to compile and install an old version of ruamel.yaml (0.15.89) (See the log).

Screenshot 2023-06-11 at 09 57 01

It looks like an issue with AzureML SDK and Conda (See Azure Machine Learning SDK installation failing with an exception). And it seems that AzureML SDK sets the upper version for ruamel.yaml to 0.15.89, not sure why.

Screenshot 2023-06-11 at 09 57 33

I tried setup.py of the commit b71c4ed on my local machine with Python 3.8 3.9, 3.10 and 3.11. All works because there is no dependency on the old version of ruamel.yaml.

Possible solutions:

  • Upgrade AzureML SDK from v1 to v2 (not sure whether this would work)
  • Wait until the issue with AzureML and Conda resolved.
  • Try install everything in the docker file without using Conda.

@@ -181,7 +178,20 @@ def create_run_config(
run_azuremlcompute = RunConfiguration()
run_azuremlcompute.target = cpu_cluster
run_azuremlcompute.environment.docker.enabled = True
run_azuremlcompute.environment.docker.base_image = docker_proc_type
# See https://learn.microsoft.com/en-us/azure/machine-learning/how-to-train-with-custom-image?view=azureml-api-1#use-a-custom-dockerfile-optional
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simonzhaoms right now in the actions we are installing run: pip install --quiet "azureml-core>1,<2" "azure-cli>2,<3". What is the azurml sdk that we are installing? Here I see the latest one as 1.51 https://pypi.org/project/azureml-sdk/#history.

Of the 3 options, I think one that is interesting to explore would be Try install everything in the docker file without using Conda. iif we are reducing dependencies. I think 80% of our problems come from dependencies: #1936 So maybe something to reflect on is how can we reduce dependencies and use more standardize and robust software?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@miguelgfierro

@simonzhaoms right now in the actions we are installing run: pip install --quiet "azureml-core>1,<2" "azure-cli>2,<3". What is the azurml sdk that we are installing? Here I see the latest one as 1.51 https://pypi.org/project/azureml-sdk/#history.

I think this azureml-core is only used for launching the script submit_groupwise_azureml_pytest.py.

https://github.com/microsoft/recommenders/blob/b71c4ed66991f8eddfd80a8afbff05b34591c7ee/.github/actions/azureml-test/action.yml#L75-L77

And the AzureML SDK I mentioned is used inside the docker image launched inside submit_groupwise_azureml_pytest.py

https://github.com/microsoft/recommenders/blob/b71c4ed66991f8eddfd80a8afbff05b34591c7ee/.github/actions/azureml-test/action.yml#L85-L94

https://github.com/microsoft/recommenders/blob/b71c4ed66991f8eddfd80a8afbff05b34591c7ee/tests/ci/azureml_tests/submit_groupwise_azureml_pytest.py#L178-L194

I think when we use CondaDependencies.add_pip_package("xxx"), AzureML adds the item xxx in a Conda env yaml file maintained by itself, and the AzureML SDK is also an item added by AzureML by default implicitly.

https://github.com/microsoft/recommenders/blob/b71c4ed66991f8eddfd80a8afbff05b34591c7ee/tests/ci/azureml_tests/submit_groupwise_azureml_pytest.py#L202-L223

What I don't know is why the AzureML SDK trigger the error now when I try to add support for Python 3.10 and upgrade all dependencies.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of the 3 options, I think one that is interesting to explore would be Try install everything in the docker file without using Conda. iif we are reducing dependencies. I think 80% of our problems come from dependencies: #1936 So maybe something to reflect on is how can we reduce dependencies and use more standardize and robust software?

I agree. In addition, if possible, I'd use what GitHub actions and workflows can provide to build the testing pipeline rather than use the AzureML service, because AzureML service is not transparent.

setup.py Outdated Show resolved Hide resolved
setup.py Outdated Show resolved Hide resolved
setup.py Outdated Show resolved Hide resolved
@SimonYansenZhao
Copy link
Collaborator

SimonYansenZhao commented Mar 12, 2024

folks, when I run this manually on my AzureML workspace with tests/ci/azureml_tests/submit_groupwise_azureml_pytest.py, it works without issues. Can someone paste the actual AzureML job's logs (error message) here?

@loomlike The errors can be found at the nightly build on the branch simony-dep-upgrade-20230606: https://github.com/recommenders-team/recommenders/actions/runs/8036183347

But the newest error is the one pasted by Miguel above from running the notebook examples/06_benchmarks/movielens.ipynb run by tests/functional/examples/test_notebooks_gpu.py::test_benchmark_movielens_gpu()

@SimonYansenZhao
Copy link
Collaborator

SimonYansenZhao commented Mar 16, 2024

Hi @miguelgfierro the gpu machine seems to be in a bad status. Several runs of tests have the same error saying

"code": "UserError",
"message": "AzureMLCompute cluster gpu-cluster has an error with code UnhealthyGPU that prevents it from scaling to 1 nodes. Message: ECC was disabled on the node. Node is unhealthy and will reboot.",
"ClusterName": "gpu-cluster",
"Code": "UnhealthyGPU",
"NodeCount": "1",
"code": "ClusterProvisionedWithErrors"

Could you help to check when you have time?
I have already fixed all errors found. I think all tests should pass if no further errors occur.

Signed-off-by: miguelgfierro <[email protected]>
@miguelgfierro
Copy link
Collaborator

@miguelgfierro
Copy link
Collaborator

@SimonYansenZhao see this comment: #2071 (comment)

@miguelgfierro
Copy link
Collaborator

@SimonYansenZhao if the unit tests pass, what do you think of merging this PR, and then figure out how to fix the issues we have in the nightly?

Copy link
Collaborator

@miguelgfierro miguelgfierro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unit tests pass

Comment on lines +71 to +72
df = train if validation is None else pd.concat([train, validation])
df = df if test is None else pd.concat([df, test])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can avoid having two concats with this one liner:

df = pd.concat(filter(None, [train, validation, test]))

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @daviddavo, can you please create a PR to staging with these changes? I gave you written permissions.

git checkout staging
git branch -b new_branch
.... code changes ...
git commit -asm "you message don't forget the -s to sign the commits"
git push origin new_branch

And then PR.

@SimonYansenZhao
Copy link
Collaborator

@SimonYansenZhao if the unit tests pass, what do you think of merging this PR, and then figure out how to fix the issues we have in the nightly?

sounds good @miguelgfierro

@miguelgfierro miguelgfierro merged commit 3821655 into staging Mar 19, 2024
38 checks passed
This was referenced Mar 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants