
Fiber Notebook line 15 - train test split #552

Closed

beyucel opened this issue May 4, 2021 · 4 comments
beyucel commented May 4, 2021

(Line 15) We create the train/test split and end up with a flat array, which we then feed into the pipeline. Ideally the input should be a 2D microstructure; as written, we flatten each image to 1D and run the whole pipeline on 1D microstructures.

We need to take a look at that.
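For context, a minimal NumPy sketch of the shape problem (the 21×21 image size and 640-sample count are taken from the snippets later in this thread; the array here is synthetic):

```python
import numpy as np

# Synthetic stand-in for the notebook's data: 640 microstructures of
# 21x21 pixels that the train/test split has left flattened.
x_flat = np.random.randint(0, 2, size=(640, 21 * 21))

# The pipeline's spatial-statistics steps expect 2D images, so the
# spatial structure has to be restored before (or inside) the pipeline.
x_2d = x_flat.reshape(x_flat.shape[0], 21, 21)

print(x_flat.shape)  # (640, 441)
print(x_2d.shape)    # (640, 21, 21)
```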

@beyucel beyucel added the bug label May 4, 2021
@beyucel beyucel self-assigned this May 4, 2021
@wd15 wd15 added this to the 0.4.1 milestone May 4, 2021
beyucel commented Jul 9, 2021

@wd15 I added a generic transformer to fix the issue here. When I use full PCA it gives this error:

ValueError: operands could not be broadcast together with shapes (882,882) (640,1) (882,882)

It is the first GenericTransformer that causes the issue. Randomized PCA does not hit the same error, but it doesn't give good results. Something seems wrong with the dask randomized PCA, since the sklearn randomized PCA doesn't have the same issue.

Any suggestions for the broadcast issue?
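For reference, a minimal NumPy sketch of why shapes like those in the traceback cannot broadcast (the actual failure happens inside the dask PCA, but the broadcasting rule is the same): the trailing dimensions 882 and 1 are compatible, but 882 and 640 are not.

```python
import numpy as np

a = np.ones((882, 882))  # e.g. a matrix built from 882 features
b = np.ones((640, 1))    # e.g. a column derived from 640 samples

try:
    a - b  # any elementwise op triggers broadcasting
except ValueError as err:
    print(err)  # prints the "operands could not be broadcast" message
```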


wd15 commented Jul 13, 2021

> @wd15 I added a generic transformer to fix the issue here. When I use full PCA it gives this error:
>
> ValueError: operands could not be broadcast together with shapes (882,882) (640,1) (882,882)
>
> It is the first GenericTransformer that causes the issue. Randomized PCA does not hit the same error, but it doesn't give good results. Something seems wrong with the dask randomized PCA, since the sklearn randomized PCA doesn't have the same issue.
>
> Any suggestions for the broadcast issue?

What's happening here is that the number of features is larger than the number of samples by the time the data arrives at the PCA, and that breaks the "full" PCA when using dask. If you reduce the amount of data coming out of the correlations it will work. Use this function as the reshape function:

def reshape_func(x):
    print(x.shape)                   # shape going into the flatten step
    out = x.reshape(x.shape[0], -1)  # flatten everything but the sample axis
    print(out.shape)                 # shape arriving at the PCA
    return out

in

pca_steps = [
    ("reshape", GenericTransformer(lambda x: x.reshape(x.shape[0], 21, 21))),
    ("discritize", PrimitiveTransformer(n_state=2, min_=0.0, max_=1.0)),
    ("correlations", TwoPointCorrelation(periodic_boundary=True, cutoff=5, correlations=[(0, 0), (1, 1)])),
#   ('flatten', FlattenTransformer()),
#   ('flatten', GenericTransformer(lambda x: x.reshape(x.shape[0], -1))),
    ('flatten', GenericTransformer(reshape_func)),
    ("pca", PCA(svd_solver='full', n_components=20))
]

and you'll see it. For example, with a cutoff of 5 and 2 correlations the shapes are:

>>> params_to_tune = {'pca__n_components': np.arange(20, 24), 'poly__degree': np.arange(2, 4)}
>>> grid_search = GridSearchCV(pipeline, params_to_tune).fit(x_train, y_train)
(640, 11, 11, 2)
(640, 242)
(640, 11, 11, 2)
(640, 242)
(640, 11, 11, 2)
(640, 242)
(320, 11, 11, 2)
(320, 242)
[... the (320, 11, 11, 2) -> (320, 242) pair repeats for each grid-search fit ...]
(960, 11, 11, 2)
(960, 242)

Now 320 > 242, so things work. If you change the setup so that the feature count exceeds 320 then it breaks: 3 correlations, for example, or cutoff=11.
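As a quick sanity check, the feature count coming out of the correlations step can be predicted from the shapes above (with cutoff=5 each correlation map is 11×11, i.e. (2·cutoff+1) per axis); a sketch, assuming that sizing rule holds in general:

```python
def n_correlation_features(cutoff, n_correlations):
    # Each two-point correlation map spans (2*cutoff + 1) pixels per axis,
    # and the maps for all correlations are flattened together.
    side = 2 * cutoff + 1
    return side * side * n_correlations

n_train = 320  # samples per grid-search fit in the output above
print(n_correlation_features(5, 2))   # 242 -> fewer than 320 samples, full PCA works
print(n_correlation_features(5, 3))   # 363 -> more than 320, breaks
print(n_correlation_features(11, 2))  # 1058 -> also breaks
```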


wd15 commented Jul 13, 2021

Should we raise a warning in the two-point correlations when the number of features exceeds the number of samples?
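A sketch of what that check might look like (check_feature_count is a hypothetical helper, not part of the actual pymks API):

```python
import warnings

def check_feature_count(n_samples, n_features):
    # Warn when the downstream PCA is likely to misbehave: the "full"
    # SVD solver with dask needs n_samples >= n_features.
    if n_features > n_samples:
        warnings.warn(
            f"Two-point correlations produced {n_features} features for only "
            f"{n_samples} samples; 'full' PCA may fail. Consider a smaller "
            "cutoff or fewer correlations.",
            UserWarning,
        )
```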

@beyucel beyucel mentioned this issue Jul 13, 2021

wd15 commented Jul 15, 2021

Looks like we have this repaired now.

@wd15 wd15 closed this as completed Jul 15, 2021