Add semi supervised classification #40

EdenWuyifan · 2023-05-24T19:57:34Z

Fix #22.

roquelopez · 2023-05-26T20:29:04Z

@EdenWuyifan Could you please test this dataset using f1 as metric? The target column is defects.

I think it will raise this error: ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].
I think it will fail during the dataset-splitting phase because the test split would contain 3 labels: the two classes of the problem (e.g. 0 and 1) and the NaN value. I think we should just check for missing values and make sure they are not present in the test split.
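A rough sketch of the check suggested above, assuming unlabeled rows are marked with NaN (or None) in the target column. The helper names `is_missing` and `split_keeping_unlabeled_in_train` are hypothetical and are not part of Alpha-AutoML; the split is a plain random split, not a stratified one.

```python
import math
import random


def is_missing(label):
    """Return True when a target value is None or NaN."""
    return label is None or (isinstance(label, float) and math.isnan(label))


def split_keeping_unlabeled_in_train(labels, test_fraction=0.25, seed=0):
    """Split row indices so that only labeled rows can land in the test split."""
    labeled = [i for i, y in enumerate(labels) if not is_missing(y)]
    unlabeled = [i for i, y in enumerate(labels) if is_missing(y)]

    # Shuffle only the labeled rows and carve the test split out of them.
    rng = random.Random(seed)
    rng.shuffle(labeled)
    n_test = max(1, round(len(labeled) * test_fraction))

    test_idx = sorted(labeled[:n_test])
    # Unlabeled rows are always sent to training.
    train_idx = sorted(labeled[n_test:] + unlabeled)
    return train_idx, test_idx


labels = [0, 1, float("nan"), 1, 0, float("nan"), 1, 0]
train_idx, test_idx = split_keeping_unlabeled_in_train(labels)
```

With a split like this the test labels contain only the real classes, so a binary metric such as f1 would not see NaN as a third label.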

EdenWuyifan · 2023-05-26T20:34:19Z

Sure. I will try it.

aecio

Generally, the code looks good. Just adding a few suggestions for improvement here.

tests/test_semisupervised.py

alpha_automl/pipeline_synthesis/pipeline_builder.py

alpha_automl/builtin_primitives/semisupervised_classifier.py

aecio · 2023-08-21T21:30:41Z

alpha_automl/automl_api.py

+                 time_bound_run=5, score_sorting='auto', metric_kwargs={'average': 'micro'}, split_strategy_kwargs=None,
+                 start_mode='auto', verbose=False):
+        """
+        Create/instantiate an AutoMLSemiSupervisedClassifier object.


Briefly explain what semi-supervised classification is (e.g., what the assumptions on the training data/labels are). Maybe point to the sklearn documentation if it covers this well.
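For context, a minimal sketch of what semi-supervised classification looks like in scikit-learn (this is not the Alpha-AutoML API): the training set mixes labeled and unlabeled rows, and scikit-learn expects unlabeled rows to carry the label -1. The synthetic data and the choice to hide about half of the labels are assumptions made only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic binary problem; in practice X and y come from the user's dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hide roughly half of the labels. scikit-learn marks unlabeled rows with -1.
rng = np.random.RandomState(0)
y_partial = y.copy()
y_partial[rng.rand(len(y)) < 0.5] = -1

# Self-training: fit on the labeled rows, then iteratively pseudo-label
# the unlabeled rows that the wrapped estimator predicts with high confidence.
model = SelfTrainingClassifier(RandomForestClassifier(n_estimators=20, random_state=0))
model.fit(X, y_partial)
predictions = model.predict(X)
```

The wrapped estimator only needs to implement predict_proba, and the -1 marker never appears among the predicted classes.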

This was done by Roque in commit 9f99863.
@roquelopez Do we want to move this script?

The idea of having a demo_utils folder was to group all the code needed to run Alpha-AutoML on the NYU HSRN server. For instance, the values.yaml file contains some configs specific to the NYU HSRN server, which will not work on other Kubernetes servers. That is why I prefer a demo folder to a kubernetes folder. However, if we can add documentation on how to set up these configs, I think it's OK to have the kubernetes folder.
Also, why do we need jupyterhub, jupyterlab, and jupyterlab-server in the Dockerfile? If it is a 'generic' Dockerfile, the notebook dependency alone is enough. By generic I mean that users can run it in any kind of environment, not exclusively in Kubernetes. I think the jupyterhub, jupyterlab, and jupyterlab-server dependencies were added because they are needed by the JupyterHub instance in NYU HSRN. If that is the case, that Dockerfile should not be in the root directory; it should be in a folder specifically for the demo. The Dockerfile in the root should contain only the dependencies necessary to run Alpha-AutoML. Also, I think PipelineProfiler doesn't work in JupyterLab.

The current Dockerfile in the root folder is not specific to the HSRN cluster. It works on any installation of JupyterHub or on a single machine using Docker (it starts the traditional Jupyter UI). This is why there are two 'flavors' of the image being built in the CI scripts (ghcr.io/vida-nyu/alpha-automl and ghcr.io/vida-nyu/alpha-automl-jupyterhub).

That said, I don't know why we are installing the dependencies like that with pinned versions. We should probably just extend pre-built images with the latest versions of JupyterHub. In any case, I think all of this should be discussed and changed in a separate PR. I would just revert it for now, and we can discuss it at the next meeting.

BTW, many people only use JupyterLab, so PipelineProfiler needs to be fixed too. It should be a high priority.

I was not aware of ghcr.io/vida-nyu/alpha-automl and ghcr.io/vida-nyu/alpha-automl-jupyterhub. I agree, let's discuss this in another PR.

aecio · 2023-08-21T22:07:03Z

Also, why is this moving the Dockerfile to scripts/demo_utils/Dockerfile? Do we still want to do that?

alpha_automl/builtin_primitives/semisupervised_classifier.py

aecio · 2023-08-22T19:56:23Z

Not blocking the merge of this PR, but I found another detail that should be addressed. I ran the example notebook and noticed that the pipeline summary shown in plot_leaderboard() does not show the actual estimator. For example:
SimpleImputer, SelfTrainingClassifier is actually wrapping a RandomForestClassifier estimator. This is inconsistent with what is shown in the PipelineProfiler plot. Maybe unwrapping the primitive to show SimpleImputer, SelfTrainingClassifier, RandomForestClassifier would be best?
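One possible way to unwrap the primitive for the summary, sketched against plain scikit-learn objects rather than the Alpha-AutoML leaderboard code. The helper `summarize_steps` is hypothetical; it assumes the wrapper exposes the inner model as `estimator` or `base_estimator` (the attribute name differs across scikit-learn versions) and unwraps only one level.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.semi_supervised import SelfTrainingClassifier


def summarize_steps(pipeline):
    """List step class names, appending the estimator wrapped by a step, if any."""
    names = []
    for _, step in pipeline.steps:
        names.append(type(step).__name__)
        # Meta-estimators keep the wrapped model in one of these attributes.
        for attr in ("estimator", "base_estimator"):
            inner = getattr(step, attr, None)
            if hasattr(inner, "fit"):
                names.append(type(inner).__name__)
                break
    return ", ".join(names)


pipeline = Pipeline([
    ("imputer", SimpleImputer()),
    ("classifier", SelfTrainingClassifier(RandomForestClassifier())),
])
summary = summarize_steps(pipeline)
print(summary)  # SimpleImputer, SelfTrainingClassifier, RandomForestClassifier
```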

roquelopez · 2023-08-28T15:20:53Z

@EdenWuyifan feel free to merge it.

EdenWuyifan force-pushed the eden_semi_supervised branch 3 times, most recently from 30620ec to f526a65 Compare June 12, 2023 16:39

EdenWuyifan force-pushed the eden_semi_supervised branch 4 times, most recently from f526a65 to f95ec58 Compare June 15, 2023 17:43

EdenWuyifan changed the title ~~[WIP] Eden semi supervised~~ Add semi supervised classification Jun 15, 2023

EdenWuyifan force-pushed the eden_semi_supervised branch 2 times, most recently from eb2546e to c196545 Compare June 30, 2023 20:18

EdenWuyifan added 10 commits August 8, 2023 19:14

add semi supervised classifier base

95a45a1

update text semi supervised classification

8f29b2f

add SEMISUPERVISED_CLASSIFIER extra params

f588999

ready for merge

ad33549

clean examples

b0d698a

add autonbox estimator for semi supervised learning

7f1d7fe

fix testcase

b4fa397

remove text classification

819fce1

fix format

55a5a24

fix warning messages

b812886

EdenWuyifan force-pushed the eden_semi_supervised branch 3 times, most recently from ace9e1d to b812886 Compare August 8, 2023 23:33

fix __version__

573a488

aecio reviewed Aug 21, 2023

aecio reviewed Aug 22, 2023

alpha_automl/builtin_primitives/semisupervised_classifier.py (outdated, resolved)

aecio reviewed Aug 22, 2023

alpha_automl/builtin_primitives/semisupervised_classifier.py (outdated, resolved)

EdenWuyifan added 2 commits August 24, 2023 17:48

resolve reviews

24e9654

fix semi classifier plot_pipeline

15974a3

EdenWuyifan force-pushed the eden_semi_supervised branch from 66798e3 to 15974a3 Compare August 24, 2023 23:28

resolve reviews

8a10b21

aecio approved these changes Aug 25, 2023

roquelopez approved these changes Aug 25, 2023

EdenWuyifan merged commit 799f657 into devel Aug 28, 2023
1 check passed

roquelopez deleted the eden_semi_supervised branch September 5, 2023 19:14
