
Fiber Notebook line 15 - train test split #552

Closed

beyucel opened this issue May 4, 2021 · 4 comments
beyucel commented May 4, 2021

(Line 15) We create the train/test split and end up with a flat array, which we then feed into the pipeline. Ideally the input should be a 2D microstructure; as written, we flatten each image to 1D and run the whole pipeline on 1D microstructures.

We need to take a look at that.
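For context, a minimal NumPy sketch of the shape problem (the 21×21 image size and 640-sample count are taken from the snippets later in this thread; the array here is synthetic):

```python
import numpy as np

# Synthetic stand-in for the notebook's data: 640 microstructures of
# 21x21 pixels that the train/test split has left flattened.
x_flat = np.random.randint(0, 2, size=(640, 21 * 21))

# The pipeline's spatial-statistics steps expect 2D images, so the
# spatial structure has to be restored before (or inside) the pipeline.
x_2d = x_flat.reshape(x_flat.shape[0], 21, 21)

print(x_flat.shape)  # (640, 441)
print(x_2d.shape)    # (640, 21, 21)
```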

@beyucel beyucel added the bug label May 4, 2021
@beyucel beyucel self-assigned this May 4, 2021
@wd15 wd15 added this to the 0.4.1 milestone May 4, 2021
beyucel commented Jul 9, 2021

@wd15 I added a generic transformer to fix the issue here. When I use full PCA it gives this error:

ValueError: operands could not be broadcast together with shapes (882,882) (640,1) (882,882)

It is the first GenericTransformer that causes the issue. Randomized PCA does not hit the same error, but it doesn't give good results. Something seems wrong with the dask randomized PCA, since the sklearn randomized PCA doesn't have the same issue.

Any suggestions for the broadcast issue?
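For reference, a minimal NumPy sketch of why shapes like those in the traceback cannot broadcast (the actual failure happens inside the dask PCA, but the broadcasting rule is the same): the trailing dimensions 882 and 1 are compatible, but 882 and 640 are not.

```python
import numpy as np

a = np.ones((882, 882))  # e.g. a matrix built from 882 features
b = np.ones((640, 1))    # e.g. a column derived from 640 samples

try:
    a - b  # any elementwise op triggers broadcasting
except ValueError as err:
    print(err)  # prints the "operands could not be broadcast" message
```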


wd15 commented Jul 13, 2021

> @wd15 I added a generic transformer to fix the issue here. When I use full PCA it gives this error:
>
> ValueError: operands could not be broadcast together with shapes (882,882) (640,1) (882,882)
>
> It is the first GenericTransformer that causes the issue. Randomized PCA does not hit the same error, but it doesn't give good results. Something seems wrong with the dask randomized PCA, since the sklearn randomized PCA doesn't have the same issue.
>
> Any suggestions for the broadcast issue?

What's happening here is that the number of features is larger than the number of samples by the time the data arrives at the PCA, and that breaks the "full" PCA when using dask. If you reduce the amount of data coming out of the correlations it will work. Use this function as the reshape function:

def reshape_func(x):
    print(x.shape)                   # shape going into the flatten step
    out = x.reshape(x.shape[0], -1)  # flatten everything but the sample axis
    print(out.shape)                 # shape arriving at the PCA
    return out

in

pca_steps = [
    ("reshape", GenericTransformer(lambda x: x.reshape(x.shape[0], 21, 21))),
    ("discritize", PrimitiveTransformer(n_state=2, min_=0.0, max_=1.0)),
    ("correlations", TwoPointCorrelation(periodic_boundary=True, cutoff=5, correlations=[(0, 0), (1, 1)])),
#   ('flatten', FlattenTransformer()),
#   ('flatten', GenericTransformer(lambda x: x.reshape(x.shape[0], -1))),
    ('flatten', GenericTransformer(reshape_func)),
    ("pca", PCA(svd_solver='full', n_components=20))
]

and you'll see it. For example, with a cutoff of 5 and 2 correlations the shapes are:

>>> params_to_tune = {'pca__n_components': np.arange(20, 24), 'poly__degree': np.arange(2, 4)}
>>> grid_search = GridSearchCV(pipeline, params_to_tune).fit(x_train, y_train)
(640, 11, 11, 2)
(640, 242)
(640, 11, 11, 2)
(640, 242)
(640, 11, 11, 2)
(640, 242)
(320, 11, 11, 2)
(320, 242)
[... the (320, 11, 11, 2) -> (320, 242) pair repeats for each grid-search fit ...]
(960, 11, 11, 2)
(960, 242)

Now 320 > 242, so things work. If you change the setup so that the feature count exceeds 320 then it breaks: 3 correlations, for example, or cutoff=11.
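As a quick sanity check, the feature count coming out of the correlations step can be predicted from the shapes above (with cutoff=5 each correlation map is 11×11, i.e. (2·cutoff+1) per axis); a sketch, assuming that sizing rule holds in general:

```python
def n_correlation_features(cutoff, n_correlations):
    # Each two-point correlation map spans (2*cutoff + 1) pixels per axis,
    # and the maps for all correlations are flattened together.
    side = 2 * cutoff + 1
    return side * side * n_correlations

n_train = 320  # samples per grid-search fit in the output above
print(n_correlation_features(5, 2))   # 242 -> fewer than 320 samples, full PCA works
print(n_correlation_features(5, 3))   # 363 -> more than 320, breaks
print(n_correlation_features(11, 2))  # 1058 -> also breaks
```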


wd15 commented Jul 13, 2021

Should we raise a warning in the two-point correlations when the number of features exceeds the number of samples?
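A sketch of what that check might look like (check_feature_count is a hypothetical helper, not part of the actual pymks API):

```python
import warnings

def check_feature_count(n_samples, n_features):
    # Warn when the downstream PCA is likely to misbehave: the "full"
    # SVD solver with dask needs n_samples >= n_features.
    if n_features > n_samples:
        warnings.warn(
            f"Two-point correlations produced {n_features} features for only "
            f"{n_samples} samples; 'full' PCA may fail. Consider a smaller "
            "cutoff or fewer correlations.",
            UserWarning,
        )
```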

@beyucel beyucel mentioned this issue Jul 13, 2021

wd15 commented Jul 15, 2021

Looks like we have this repaired now.

@wd15 wd15 closed this as completed Jul 15, 2021