hwk0702/BRCA-Breast-Cancer-Prediction-Model-Using-DNN-and-K-means

BRCA Breast Cancer Prediction Model Using DNN and K-means

1. Introduction

1.1 Set goals for project

  • Use BRCA_prognosis data to create a program that can detect gene anomalies and predict breast cancer.

① Split the data into separate training and test sets.

② Using gene data from patients, distinguish between good and risky genes.

③ As a result, the program predicts whether a gene is good or risky.

1.2 Data description

  • Data structure


2. Model training with sklearn

  • Results of supervised learning without preprocessing

i. KNN (k=[3, 5, 7, 9, 11, 13, 15])

ii. Naive Bayes classification

iii. Decision tree using information gain (max_depth=[3, 5, 7, 9, 11, 13, 15])

iv. SVM (kernel = [linear, poly, rbf, sigmoid])

v. DNN (solver=[adam, sgd, lbfgs], activation= [identity, logistic, tanh, relu])

The DNN had the highest accuracy, about 0.85, and the highest value on every metric except sensitivity. The best classifier was therefore the DNN with solver = lbfgs and activation = logistic.

When the SVM kernel is sigmoid, or when the DNN solver is adam or sgd, the data is not classified properly.
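The hyperparameter sweeps listed in this section could be set up roughly as follows. This is a minimal sketch, not the repository's code: the function names `build_candidates` and `evaluate` are hypothetical, `GaussianNB` is an assumed choice of naive Bayes variant, and `criterion="entropy"` is assumed to be what "information gain" refers to.

```python
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_candidates(max_iter=300):
    """Return {name: estimator} covering the sweeps listed in this section."""
    sweep = [3, 5, 7, 9, 11, 13, 15]
    models = {}
    for k in sweep:
        models[f"knn_k={k}"] = KNeighborsClassifier(n_neighbors=k)
    # Assumption: Gaussian naive Bayes; the README does not name the variant.
    models["naive_bayes"] = GaussianNB()
    for depth in sweep:
        # Assumption: "information gain" means an entropy-based decision tree.
        models[f"tree_depth={depth}"] = DecisionTreeClassifier(
            criterion="entropy", max_depth=depth, random_state=0
        )
    for kernel in ["linear", "poly", "rbf", "sigmoid"]:
        models[f"svm_{kernel}"] = SVC(kernel=kernel)
    for solver in ["adam", "sgd", "lbfgs"]:
        for activation in ["identity", "logistic", "tanh", "relu"]:
            models[f"dnn_{solver}_{activation}"] = MLPClassifier(
                solver=solver,
                activation=activation,
                max_iter=max_iter,
                random_state=0,
            )
    return models


def evaluate(models, X_train, X_test, y_train, y_test):
    """Fit each model on the training split and return its test accuracy."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores
```

Calling `evaluate(build_candidates(), X_train, X_test, y_train, y_test)` returns one accuracy per configuration. Other metrics such as sensitivity would need an additional scorer.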


3. Method

  1. Preprocessing

① Edit labels array

For use in the DNN model, the labels array is reshaped into a two-dimensional array, and its column vectors are converted to row vectors.

② One-Hot-Encoding

One-hot encoding is used to transform the label values. Also called one-of-K encoding, it converts an integer scalar in the range 0 to K-1 into a K-dimensional vector whose entries are 0 or 1, with a single 1 at the index given by the label.
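As an illustration of the encoding, here is a minimal NumPy sketch. The helper name `one_hot` is hypothetical and is not taken from the repository.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Map integer labels in [0, num_classes - 1] to one-hot row vectors."""
    labels = np.asarray(labels, dtype=int).reshape(-1)
    if labels.size and (labels.min() < 0 or labels.max() >= num_classes):
        raise ValueError("labels must lie in the range 0 to num_classes - 1")
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype=int)[labels]
```

For a two-class problem, `one_hot([0, 1, 1], 2)` gives the rows `[1, 0]`, `[0, 1]`, `[0, 1]`.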

③ Normalization

Normalize the data values. Normalization transforms all of the individual features to the same scale.
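The README does not say which normalization is used. The sketch below assumes per-feature min-max scaling to the range [0, 1]; the function name `min_max_normalize` is hypothetical.

```python
import numpy as np


def min_max_normalize(X):
    """Scale each column of X to [0, 1] (assumed min-max normalization)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Constant columns have zero range; divide by 1 so they map to 0.
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range
```

In practice the minimum and range would be computed on the training set and reused for the test set, so that both are scaled consistently.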

  2. Data grouping

Divide the samples into two groups with similar characteristics to get better results.

① Principal component analysis (PCA)

The dimension of the data is reduced to two dimensions.

② K-means

Use K-means to divide into two groups.

③ Grouping
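The grouping steps above can be sketched with scikit-learn. This is a sketch under assumptions: it assumes K-means is run on the two-dimensional PCA projection (the README does not say which representation is clustered), and `split_into_two_groups` is a hypothetical helper name.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def split_into_two_groups(X, random_state=0):
    """Return the row indices of the two K-means groups."""
    X = np.asarray(X, dtype=float)
    # Step 1: reduce the data to two dimensions with PCA.
    X_2d = PCA(n_components=2, random_state=random_state).fit_transform(X)
    # Step 2: cluster the 2-D points into two groups (assumed input to K-means).
    cluster_ids = KMeans(
        n_clusters=2, n_init=10, random_state=random_state
    ).fit_predict(X_2d)
    # Step 3: collect the sample indices belonging to each group.
    idx_first = np.flatnonzero(cluster_ids == 0)
    idx_second = np.flatnonzero(cluster_ids == 1)
    return idx_first, idx_second
```

Each index array can then be used to select the features and labels for one group, for example `X[idx_first]`, so that a separate model is trained per group.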

  3. Training

① DNN (Deep Neural Network)

The network has four hidden layers (4096, 1024, 256, and 32 nodes), and the learning rate is set to 0.0001. The solver is the Adam optimizer and the activation function is ReLU. The number of training steps was set to 500.
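The README does not show the training code or name the framework used for this model. As a rough sketch only, the same settings can be expressed with scikit-learn's `MLPClassifier`. Mapping "train steps" to `max_iter` is an assumption, `build_dnn` is a hypothetical name, and `MLPClassifier` has no dropout, so this does not reproduce the full model.

```python
from sklearn.neural_network import MLPClassifier


def build_dnn(hidden_layers=(4096, 1024, 256, 32),
              learning_rate=0.0001,
              train_steps=500):
    """Build an MLP with the layer sizes and optimizer settings listed above."""
    return MLPClassifier(
        hidden_layer_sizes=hidden_layers,
        activation="relu",
        solver="adam",
        learning_rate_init=learning_rate,
        # Assumption: the README's "train steps" is mapped to max_iter.
        max_iter=train_steps,
        random_state=0,
    )
```

Smaller layer sizes, for example `build_dnn(hidden_layers=(8, 4))`, are useful for a quick smoke test before training the full-size network.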

② Dropout

Dropout deactivates a random subset of neurons at each training step so that features do not become tied to specific neurons. This balances the weights and helps prevent overfitting.

The dropout value was set to 0.8.
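A minimal NumPy sketch of dropout on one layer's activations follows. It assumes that 0.8 is the probability of keeping a neuron and that the "inverted dropout" variant is used. The README states neither, and the function name `dropout` and its parameters are hypothetical.

```python
import numpy as np


def dropout(activations, keep_prob=0.8, rng=None):
    """Zero out each activation with probability 1 - keep_prob (training only)."""
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    rng = np.random.default_rng() if rng is None else rng
    activations = np.asarray(activations, dtype=float)
    # Each neuron is kept independently with probability keep_prob.
    mask = rng.random(activations.shape) < keep_prob
    # Rescale the survivors so the expected activation is unchanged.
    return activations * mask / keep_prob
```

At evaluation time dropout is switched off, which corresponds to calling the function with `keep_prob=1.0`.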

③ Regularization

Regularization penalizes large weight values, which helps prevent overfitting.

The regularization coefficient was set to 0.001.
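The README does not say which penalty is used. The sketch below assumes an L2 (sum of squared weights) penalty that is added to the training loss; `l2_penalty` is a hypothetical helper name.

```python
import numpy as np


def l2_penalty(weights, coeff=0.001):
    """Return coeff * sum of squared entries over all weight matrices."""
    # Assumption: L2 regularization; large weights increase the penalty.
    return coeff * sum(float(np.sum(np.square(w))) for w in weights)
```

The total training loss would then be the classification loss plus `l2_penalty(weights)`. Some libraries scale this term differently, for example by a factor of 1/2.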


4. Result

  1. Result

① First Group & Second Group

② Combined result of the first and second groups

The accuracy was 0.88.
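One way to combine the two group results is to pool the correct predictions over both test sets. This is an assumed reading of how the results are summed, and `combined_accuracy` is a hypothetical helper name.

```python
def combined_accuracy(correct_first, total_first, correct_second, total_second):
    """Pool correct predictions from both groups into one accuracy."""
    total = total_first + total_second
    if total == 0:
        raise ValueError("at least one test sample is required")
    return (correct_first + correct_second) / total
```

With made-up counts, `combined_accuracy(18, 20, 25, 30)` pools 43 correct predictions out of 50 samples. These numbers are illustrative only and are not taken from the project.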

  2. Comparison with other methods

① No grouping, no regularization, nodes (1024, 256, 32)

② No grouping, no regularization, nodes (4096, 1024, 256, 32)

③ No grouping, with regularization, nodes (4096, 1024, 256, 32)
