Add semi supervised classification #40

Merged
merged 14 commits into from
Aug 28, 2023

Conversation

Collaborator

@EdenWuyifan EdenWuyifan commented May 24, 2023

Fix #22.

@roquelopez
Collaborator

@EdenWuyifan Could you please test this dataset using f1 as metric? The target column is defects.

I think it will raise this error: ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].
I think it will fail during the dataset-splitting phase because the test split would contain three labels: the two classes of the problem (e.g., 0 and 1) and the NaN value. I think we should check for missing values and make sure none are present in the test split.
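
For reference, a minimal sketch of the failure described above and of a split that keeps unlabeled rows out of the test set. It assumes unlabeled rows are encoded as -1 (scikit-learn's convention) rather than NaN, and `split_labeled_test` is a hypothetical helper for illustration, not code from this PR.

```python
import numpy as np
from sklearn.metrics import f1_score

UNLABELED = -1  # assumption: unlabeled rows are encoded as -1 (scikit-learn's convention)


def split_labeled_test(y, test_fraction=0.25, seed=0):
    """Hypothetical helper: return (train_idx, test_idx) with no unlabeled rows in test."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    # Shuffle only the labeled rows; unlabeled rows always go to the training split.
    labeled = rng.permutation(np.flatnonzero(y != UNLABELED))
    unlabeled = np.flatnonzero(y == UNLABELED)
    n_test = max(1, int(len(labeled) * test_fraction))
    test_idx = labeled[:n_test]
    train_idx = np.concatenate([labeled[n_test:], unlabeled])
    return train_idx, test_idx


y_true = np.array([0, 1, 0, 1, UNLABELED, UNLABELED, 0, 1])
y_pred = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# With the unlabeled marker present, the target has three distinct values,
# so the default average='binary' is rejected.
try:
    f1_score(y_true, y_pred)
    binary_f1_failed = False
except ValueError as error:
    binary_f1_failed = True
    print(error)

train_idx, test_idx = split_labeled_test(y_true)
```

With a split like this, only labeled rows are scored, so the binary `f1` setting sees at most the two real classes.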

@EdenWuyifan
Collaborator Author

Sure. I will try it.

@EdenWuyifan EdenWuyifan force-pushed the eden_semi_supervised branch 3 times, most recently from 30620ec to f526a65 Compare June 12, 2023 16:39
@EdenWuyifan EdenWuyifan force-pushed the eden_semi_supervised branch 4 times, most recently from f526a65 to f95ec58 Compare June 15, 2023 17:43
@EdenWuyifan EdenWuyifan changed the title from "[WIP] Eden semi supervised" to "Add semi supervised classification" Jun 15, 2023
@EdenWuyifan EdenWuyifan force-pushed the eden_semi_supervised branch 2 times, most recently from eb2546e to c196545 Compare June 30, 2023 20:18
@EdenWuyifan EdenWuyifan force-pushed the eden_semi_supervised branch 3 times, most recently from ace9e1d to b812886 Compare August 8, 2023 23:33
Member

@aecio aecio left a comment

Generally, the code looks good. Just adding a few suggestions for improvement here.

tests/test_semisupervised.py (outdated)
time_bound_run=5, score_sorting='auto', metric_kwargs={'average': 'micro'}, split_strategy_kwargs=None,
start_mode='auto', verbose=False):
"""
Create/instantiate an AutoMLSemiSupervisedClassifier object.
Member

Briefly explain what semi-supervised classification is (e.g., what the assumptions on the training data/labels are). Maybe point to the sklearn documentation if it covers this well.
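
As a rough illustration of the setting being asked about: in semi-supervised classification only part of the training set is labeled, and the unlabeled rows are still used during fitting. The sketch below uses scikit-learn's `SelfTrainingClassifier`, where unlabeled targets are marked with -1. The synthetic data, the 70% masking rate, and the choice of base estimator are assumptions made for the example, not settings taken from this PR.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic binary problem (assumption: stands in for a real dataset).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hide roughly 70% of the labels; scikit-learn expects -1 for unlabeled samples.
rng = np.random.default_rng(0)
y_partial = y.copy()
y_partial[rng.random(len(y)) < 0.7] = -1

# The wrapper repeatedly fits the base estimator and pseudo-labels the
# unlabeled rows it is confident about.
model = SelfTrainingClassifier(RandomForestClassifier(n_estimators=50, random_state=0))
model.fit(X, y_partial)

predictions = model.predict(X)
```

Predictions are always drawn from the real classes; the -1 marker only appears in the training targets.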

Collaborator Author

Roque did this in commit 9f99863.
@roquelopez Do we want to move this script?

Collaborator

The idea of having a demo_utils folder was to group all the code needed to run Alpha-AutoML on the NYU HSRN server. For instance, the values.yaml file contains some configs that are specific to the NYU HSRN server and will not work on other Kubernetes servers. That is why I prefer a demo folder over a kubernetes folder. However, if we can add documentation on how to set up these configs, I think it's OK to have the kubernetes folder.
Also, why do we need jupyterhub, jupyterlab, and jupyterlab-server in the Dockerfile? If it is a 'generic' Dockerfile only the notebook dependency is enough. By generic I mean that users can run it in any kind of environment, not exclusively in Kubernetes. I think that jupyterhub, jupyterlab, and jupyterlab-server dependencies were added there because they are needed by the Jupyterhub instance in NYU HSRN. If that is the case, that Dockerfile should not be in the root directory, it should be in a folder specifically for the demo. The Dockerfile in the root should contain only the necessary dependencies to run Alpha-AutoML. Also, I think PipelineProfiler doesn't work in jupyterlab.

Member

The current Dockerfile in the root folder is not specific to the HSRN cluster. It works on any installation of JupyterHub or in a single machine using docker (it starts the traditional Jupyter UI). This is why there are two 'flavors' of the image being built in the CI scripts (ghcr.io/vida-nyu/alpha-automl and ghcr.io/vida-nyu/alpha-automl-jupyterhub).

That said, I don't know why we are installing the dependencies like that with pinned versions. We should probably just extend pre-built images with the latest versions of JupyterHub. In any case, I think all of this should be discussed and changed in a separate PR. I would just revert it for now and we can discuss it at the next meeting.

BTW, many people only use JupyterLab, so PipelineProfiler needs to be fixed too. It should be a high priority.

Collaborator

I was not aware of ghcr.io/vida-nyu/alpha-automl and ghcr.io/vida-nyu/alpha-automl-jupyterhub. I agree, let's discuss this in another PR.

@aecio
Member

aecio commented Aug 21, 2023

Also, why is this moving the Dockerfile to scripts/demo_utils/Dockerfile? Do we still want to do that?

@aecio
Member

aecio commented Aug 22, 2023

Not blocking the merge of this PR, but I found another detail that should be addressed. I ran the example notebook and noticed that the pipeline summary shown in plot_leaderboard() does not show the actual estimator. For example:
the summary reads SimpleImputer, SelfTrainingClassifier, but SelfTrainingClassifier is actually wrapping a RandomForestClassifier estimator. This is inconsistent with what is shown in the PipelineProfiler plot. Maybe unwrapping the primitive to show SimpleImputer, SelfTrainingClassifier, RandomForestClassifier would be best?

@roquelopez
Collaborator

@EdenWuyifan feel free to merge it.

@EdenWuyifan EdenWuyifan merged commit 799f657 into devel Aug 28, 2023
1 check passed
@roquelopez roquelopez deleted the eden_semi_supervised branch September 5, 2023 19:14
Development

Successfully merging this pull request may close these issues.

Add support for semi-supervised tasks
3 participants