This repository is a tutorial for machine learning preprocessing, exploratory data analysis (EDA), and machine learning algorithms.
- What is Machine Learning?
- Learning methods
- Supervised Learning
- Unsupervised Learning
- Machine learning algorithm graphs and explanations
- Regression models
- Classification models
- Supervised Learning
- Unsupervised Learning
- Introduction to Deep Learning with Logistic Regression (a separate deep learning repository will be prepared)
- Encoding Types
- Label encoder
- One-Hot Encoding
- NLP (Natural Language Processing)
- PCA (Principal Component Analysis)
- What is K-Fold Cross-Validation?
- GridSearchCV vs RandomizedSearchCV
- What is overfitting for models?
- How can we detect overfitting in our model?
- Recommendation Systems
- Exploratory Data Analysis and the training data that we use
Machine learning is a branch of artificial intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy.
There are three types of learning methods in this repository: supervised, unsupervised, and ensemble methods.
The goal of supervised learning algorithms: if the data has labels and we want to predict those labels, we use supervised learning methods.
Supervised learning methods come in two types: regression and classification.
The goal of regression algorithms is to train on data with continuous labels. The regression types are given below.
- Linear Regression
- Decision Tree Regression
- Random Forest Regression
The goal of classification models is to train on data with discrete labels. The classification types are given below.
- K-Nearest Neighbors Classification
- Linear SVM
- Decision Tree Classification
- Random Forest Classification
- Naive Bayes Classification
Unsupervised learning: if the data has no labels and we want to cluster the data, we can use unsupervised learning methods.
- KMeans
- Hierarchical Clustering
*Source for graph : https://www.jmp.com/en_us/statistics-knowledge-portal/what-is-multiple-regression/fitting-multiple-regression-model.html
The model is useful for continuous labels.
The goal is to draw the most optimized line. So why does the algorithm square the errors before summing them? Because some errors are positive values and some errors are negative values; if we just summed the errors, they could cancel out to zero, which is not a realistic value. So the algorithm tries to minimize the MSE (mean squared error): MSE = (1/n) * Σ(yᵢ - ŷᵢ)².
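A minimal sketch of this with scikit-learn, on a made-up single-feature dataset (the numbers are illustrative only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical single-feature data; scikit-learn expects X to be 2D
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

model = LinearRegression()
model.fit(X, y)  # finds the line that minimizes the MSE

y_pred = model.predict(X)
print("MSE:", mean_squared_error(y, y_pred))  # mean of the squared errors
print("slope:", model.coef_[0], "intercept:", model.intercept_)
```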
The algorithm tries to find the best splits of the coordinate plane into many parts (we can call them leaves) and makes predictions as a result of comparisons against those splits.
- Source for graph : https://www.researchgate.net/figure/Fig-A10-Random-Forest-Regressor-The-regressor-used-here-is-formed-of-100-trees-and-the_fig3_313489088
The algorithm is useful for recommendation systems (for example, Netflix and YouTube recommendations). A random forest is actually a collection of decision trees whose results we average. It belongs to the ensemble learning family, which uses multiple ML algorithms simultaneously.
For the data point we want to predict, the algorithm calculates the distances to the nearest points using the Euclidean distance and decides the label from the labels of the surrounding nearest neighbors. Algorithms that use the Euclidean distance always need normalization, because some distances are far bigger than the others and would dominate them, so the model would not perform well!
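A minimal sketch of KNN with normalization in scikit-learn (MinMaxScaler is one common scaling choice; the breast cancer dataset is used here only as a convenient example):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale every feature to [0, 1] so no feature dominates the Euclidean distance
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit the scaler on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same scaling on test data

knn = KNeighborsClassifier(n_neighbors=5)       # label decided by the 5 nearest neighbors
knn.fit(X_train_scaled, y_train)
print("accuracy:", knn.score(X_test_scaled, y_test))
```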
The training process aims to optimize the best margin, which is defined by the support vectors.
The Naive Bayes algorithm depends on probability, given a point's position: by Bayes' theorem, P(class | features) is proportional to P(features | class) * P(class).
The decision tree tries to find the best splits for classification, and then uses the thresholds of those best splits during the prediction process.
The random forest has a lot of decision trees and uses these trees for its prediction process.
The aim is to minimize the WCSS (within-cluster sum of squares) value.
There is no definitive answer, since cluster analysis is essentially an exploratory approach; but generally, to choose the optimal cluster number, we should look at the Euclidean distances in the dendrogram and cut at the threshold with the longest vertical distance.
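A minimal sketch of the elbow method for choosing k in KMeans (scikit-learn exposes the WCSS as `inertia_`; the blob data here is synthetic):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic unlabeled data with 4 natural clusters
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)  # within-cluster sum of squares for this k

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("WCSS")
plt.title("Elbow method: pick k at the bend of the curve")
plt.show()
```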
Introduction to Deep Learning with Logistic Regression (a separate deep learning repository will be prepared)
In short, deep learning's training process is driven by the data rather than a hand-designed model, which is why deep learning outperforms classical machine learning algorithms on big datasets.
Logistic regression is the basis of neural networks.
The goal of forward and backward propagation is to find the best weight (w) and bias (b).
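A minimal NumPy sketch of one forward and backward pass (the tiny dataset, shapes, and learning rate are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def propagate(w, b, X, y):
    # Forward pass (predictions and cost) and backward pass (gradients).
    # Shapes: X is (n_features, n_samples), y is (1, n_samples).
    m = X.shape[1]
    a = sigmoid(np.dot(w.T, X) + b)                           # forward: predictions
    cost = -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))  # cross-entropy cost
    dw = np.dot(X, (a - y).T) / m                             # backward: gradient for w
    db = np.sum(a - y) / m                                    # backward: gradient for b
    return dw, db, cost

# Hypothetical data: 2 features, 3 samples; then one gradient-descent step
X = np.array([[1.0, 2.0, 3.0], [0.5, 1.0, 1.5]])
y = np.array([[0, 0, 1]])
w, b, lr = np.zeros((2, 1)), 0.0, 0.1

dw, db, cost = propagate(w, b, X, y)
w, b = w - lr * dw, b - lr * db  # update the parameters with the gradients
print("cost after one step:", cost)
```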
Machine learning algorithms need categorical data to be numerical, because the computer needs to understand correlations, so there are two encoding types in the scikit-learn library.
A popular conversion tool for manipulating categorical variables. In this technique, each category is assigned a different integer, in alphabetical order.
Basic example for label encoder:
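For instance, a minimal LabelEncoder sketch (the 'colors' list is hypothetical):

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "green", "blue", "green", "red"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(colors)

print(encoded)           # [2 1 0 1 2] -> integers assigned in alphabetical order
print(encoder.classes_)  # ['blue' 'green' 'red']
```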
One-hot encoding means that categorical variables are displayed as binary vectors. This process first converts the categorical values to integer values. Then each integer value is represented as a binary vector that is all zeros except at the integer's index, which is marked with 1.
Basic example for one hot encoding:
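For instance, a minimal OneHotEncoder sketch on the same hypothetical column (note: the `sparse_output` argument is named `sparse` in scikit-learn versions before 1.2):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects a 2D array, one column per categorical feature
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

encoder = OneHotEncoder(sparse_output=False)
onehot = encoder.fit_transform(colors)

print(encoder.categories_)  # [array(['blue', 'green', 'red'], ...)]
print(onehot)
# [[0. 0. 1.]   red   -> 1 at the 'red' index, 0 elsewhere
#  [0. 1. 0.]   green
#  [1. 0. 0.]   blue
#  [0. 1. 0.]]  green
```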
Natural language processing (NLP) is a branch of artificial intelligence that helps computers understand, interpret and manipulate human language.
As we can see, these methods try to clean the text for the computer and encode all the words for the models.
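A minimal sketch of that cleaning pipeline with NLTK (assumes the `stopwords` and `wordnet` corpora are downloaded; the sample sentence is made up):

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")  # one-time downloads, safe to re-run
nltk.download("wordnet")

text = "This movie is very scary and long!! :) //"

text = re.sub("[^a-zA-Z]", " ", text)  # drop everything except letters
text = text.lower()                    # normalize case
words = text.split()                   # simple whitespace tokenization

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))
words = [lemmatizer.lemmatize(w) for w in words if w not in stop_words]

print(" ".join(words))  # -> "movie scary long"
```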
The Bag of Words (BoW) model is the simplest form of text representation in numbers. Like the term itself, we can represent a sentence as a bag of words vector (a string of numbers).
Let’s recall the three types of movie reviews we saw earlier:
- Review 1: This movie is very scary and long
- Review 2: This movie is not scary and is slow
- Review 3: This movie is spooky and good

We will first build a vocabulary from all the unique words in the above three reviews. The vocabulary consists of these 11 words: ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’, ‘slow’, ‘spooky’, ‘good’.
We can now take each of these words and mark their occurrence in the three movie reviews above with 1s and 0s. This will give us 3 vectors for 3 reviews:
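A minimal sketch of the same construction with scikit-learn's CountVectorizer (note it lowercases the text and counts occurrences, so 'is' in Review 2 gets a 2 rather than a 1):

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(reviews)  # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(bow.toarray())                       # one row per review, one column per word
```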
*Source : https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
In short, principal component analysis, or PCA, is a statistical procedure that allows you to summarize the information content in large data tables by means of a smaller set of “summary indices” that can be more easily visualized and analyzed.
PCA explained with a graph
So what is variance? As you can see in the graph, variance tells you the degree of spread in your data set. The more spread out the data, the larger the variance is in relation to the mean.
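A minimal PCA sketch with scikit-learn, checking how much variance each summary index (principal component) keeps (the iris dataset is used only as a convenient example):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

pca = PCA(n_components=2)          # summarize 4 features into 2 components
X_2d = pca.fit_transform(X_scaled)

print("explained variance ratio:", pca.explained_variance_ratio_)
print("shape before:", X.shape, "after:", X_2d.shape)
```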
We can say that K-fold cross-validation is a testing process used to avoid overfitting. How does it work?
As you can see, the process splits the training data into k folds: k-1 of the folds are used for the training process and the remaining 1 fold is used for the testing process, and this repeats until each fold has served as the test fold.
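A minimal sketch of 5-fold cross-validation with scikit-learn (the model and dataset are just examples):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Each of the 5 folds takes one turn as the test fold; the other 4 train the model
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5)

print("fold scores:", scores)
print("mean accuracy:", scores.mean())
```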
GridSearchCV is a process that searches exhaustively through a manually specified subset of the hyperparameter space of the targeted algorithm.
In RandomizedSearchCV, instead of providing a discrete set of values to explore for each hyperparameter, we provide a statistical distribution or a list of values, and values for the different hyperparameters are picked at random from this distribution. Actually, we can say both search methods are important for model optimization.
As we can see, there are differences between grid search and random search. So we can say: if we have time and our model has no complexity, we can use GridSearchCV; but if we have very big data and our model is complex, we can use RandomizedSearchCV, as in the sketch below.
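A minimal side-by-side sketch (the parameter ranges are illustrative, not tuned):

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Grid search: tries every combination in the grid (3 x 2 = 6 candidates)
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100, 200], "max_depth": [5, 10]},
    cv=3,
)
grid.fit(X, y)

# Random search: samples n_iter candidates from the given distributions
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={"n_estimators": randint(50, 300), "max_depth": randint(3, 15)},
    n_iter=6,
    cv=3,
    random_state=42,
)
random_search.fit(X, y)

print("grid best:", grid.best_params_)
print("random best:", random_search.best_params_)
```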
Overfitting, basically, is when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. An example of overfitting is given below.
- As you can see, in the overfitting case the model follows patterns specific to the training data.
Underfitting, basically, is the counterpart of overfitting; it happens when a machine learning model is not complex enough to accurately capture the relationships between a dataset's features and a target variable. An example of the underfitting case is given below.
- As you can see, in the underfitting case there is no well-fitted line for the training data.
- As you can see from our learning curve, our example model shows a big gap between the training and validation scores, so we can tell it is overfitting (a sketch for drawing such a curve follows this list).
- So, what should we do when we have a learning curve like that?
- First of all, we can try to increase the training data.
- We can try feature selection; feature selection is actually picking the features that are important for the target label.
- We can also reduce the feature dimension with PCA, which we explained above.
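A minimal sketch of drawing such a learning curve with scikit-learn (the decision tree and dataset are placeholders; a wide gap between the two curves suggests overfitting):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=42), X, y,
    cv=5, train_sizes=np.linspace(0.1, 1.0, 5),
)

plt.plot(train_sizes, train_scores.mean(axis=1), marker="o", label="training score")
plt.plot(train_sizes, val_scores.mean(axis=1), marker="o", label="validation score")
plt.xlabel("training set size")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```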
Basically, a recommendation system is a subclass of information filtering systems that seeks to predict the "rating" or "preference" a user would give to an item.
Even if these systems are sometimes too dangerous for humanity (as you know from the Cambridge Analytica scandal in the US presidential election), they are useful in some places, for example in applications like Netflix, YouTube, Amazon, and Facebook.
How do these systems work? There are two types of these systems: "user-based" and "item-based".
As you can see, the user-based approach depends on users' habits and tries to find similarity between users. The item-based approach, on the other hand, tries to find similarity between items. Generally, item-based systems are more useful than user-based systems, because people sometimes change their habits, so the process might not recommend an appropriate product for them in the future; but the items will not change in the future, so the optimization can keep performing well.
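A minimal item-based sketch on a hypothetical user-item rating matrix (item_A..item_D are made-up items; 0 means not rated):

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical ratings: one row per user, one column per item
ratings = pd.DataFrame(
    [[5, 4, 0, 1],
     [4, 5, 1, 0],
     [1, 0, 5, 4],
     [0, 1, 4, 5]],
    columns=["item_A", "item_B", "item_C", "item_D"],
)

# Item-item similarity = cosine similarity between the items' rating columns
item_sim = pd.DataFrame(
    cosine_similarity(ratings.T),
    index=ratings.columns,
    columns=ratings.columns,
)

# Items most similar to item_A (excluding itself) -> candidates to recommend
print(item_sim["item_A"].drop("item_A").sort_values(ascending=False))
```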
The data has Audi cars from between the years 1997 and 2020, with the features given below, and we try to predict the 'price' label depending on the features given in the data.
We use three regression models, because the 'price' feature is a continuous label.
- Data analysis
- Average price by years
- Transmission type by years
- Fuel types by years
- Model types by years
- Average MPG by years
- Average Engine Size by years
- General display of numerical features by year
- ML Preprocessing
- Obtaining the train and test splitting process
- Encoding categorical features for learning processes
- Using LabelEncoder.
- Train and test split
- Learning Time !
- Linear Regression
- Learning curve
- Decision tree regressor
- Learning curve
- Random Forest regressor
- Learning curve
The data has cancer cell features, given below, and we try to make a diagnosis depending on these features. We use the logistic regression method, which is the basis of artificial neural networks and performs well when the data has binary labels.
- ML preprocessings
- Trying to understand data
- Encoding labels with LabelEncoder method
- Assigning the variables
- Normalization values
- Train and test splitting
- Learning Time!
- Logistic Regression
- Implementing parameter initialization and the sigmoid function
- Implementing forward and backward propagation
- Implementing parameter updates
- Implementing prediction
- Implementing Logistic Regression
- Logistic Regression with sklearn
- Grid search for the best hyperparameters
- KNN classification
- Linear SVM
- Decision Tree Classification
- Random Forest Classification
- Naive Bayes Classification
- Logistic Regression
- Compare the learning algorithms
- Visualization part
- Confusion matrices
The stroke prediction dataset tries to predict stroke status depending on some features given in the data.
The data is also imbalanced, which was a good experience for me.
- Imported data
- Loaded the data
- Tried to understand the data
- Checked the missing values
- Feature engineering
- Correlation numerical values
- Smoking status by gender type and relation with stroke
- Smoking status by work type relation with stroke
- Smoking status by age mean relation with stroke
- Smoking status by avg_glucose_level mean relation with stroke
- Smoking status by bmi mean relation with stroke
- General visualization of the feature engineering we did
- Density map of numerical values(hypertension; age and bmi level relation)
- Examined the median values of the numerical features
- Checked the outlier values of the numerical features
- Filling the missing value
- Obtained training and testing variables for the learning processes
- Encoding categorical features with the label encoder for the learning processes
- Train and test splitting, and trying to balance the labels
- Normalization for continuous columns
- Implementing PCA (Principal Component Analysis) on the data and 2D visualization
- Learning time!
- Optimizing hyperparameters with RandomizedSearchCV
- Logistic Regression
- Fitting and testing model
- KNN Classification
- Fitting and testing model
- Decision Tree Classification
- Fitting and testing model
- Logistic Regression
- Optimizing hyperparameters with RandomizedSearchCV
This content uses the Biomechanical Features of Orthopedic Patients dataset. Human biomechanics has a lot of features, which you can see below.
- Loading the data
- Trying to understand data
- Feature engineering
- Analyze the correlation between features
- lumbar_lordosis_angle and pelvic_incidence
- degree_spondylolisthesis and pelvic_incidence
- pelvic_tilt_number and pelvic_incidence
- lumbar_lordosis_angle and sacral_slope
- Train and test splitting processes
- Declaring variables for the splitting process
- Encoding labels (object to int64) with the label encoding method
- Test and train splitting
- Normalization for numerical values
- Learning time!
- Logistic Regression
- KNN classification
- Linear SVM
- Decision Tree Classification
- Random Forest Classification
- Naive Bayes Classification
- Compare the learning algorithms
- Visualization part
- Confusion matrices
The mall customers dataset calculates a spending score from the features given in the data.
- The data has no labels, so it is suitable for clustering.
- We will cluster the data with the KMeans algorithm (an unsupervised method).
- Trying to understand data
- Encoding the feature that has an object type
- Feature engineering
- Analyzing the correlation values.
- Average annual income by gender.
- Average spending score by gender.
- Relation with age and spending score.
- General
- By gender
- Relation with age and annual incomes
- General
- By gender
- Relation with annual incomes and spending score
- General
- By gender
- Reducing to 3 features for clustering and visualization
- Clustering time!
- KMeans
- Specifying the k value with the elbow method.
- Clustering with the KMeans algorithm
- Visualizing the clusters and centroids.
- Hierarchical Clustering (HC)
- Visualizing the dendrogram and deciding the cluster number
- Implementing the HC algorithm
- Visualizing the clusters.
The dataset has tweets related to Covid-19, and also a label for each tweet's type.
The content predicts the positive, negative, neutral, extremely positive, and extremely negative tweet types with NLP.
Content
- Importing libraries
- Load to data
- Trying to understand data
- Visualizing the overall percentages of tweet types
- Analyzing the tweet types of locations that have more than 100 tweets
- Detecting locations that have more than 100 tweets
- Creating a new data frame for the tweet types of locations with more than 100 tweets
- Filling the data frame that we created
- Visualizing the percentages of Covid-19 tweet types by location with more than 100 tweets
- NLP(Natural Language Processing) Processes
- Picking the indices for testing the preprocessing steps
- Removing the irrelevant strings (':' , ':)' , '!' , '//'...) and converting to lower case
- Tokenizing the text
- Lemmatizing all words, and converting back to text form
- Create a new data frame for tweet types and tweets
- Data cleaning
- Dropping NaN values
- Applying the processes that we tested on one index to the whole data.
- Removing the stopwords and implementing the CountVectorizer
- Bag of words
- Encoding labels with LabelEncoder method
- The most used words, with data visualization
- Splitting into train and test data
- Learning Time!
- Logistic Regression
- Confusion matrix
The dataset has Google App Store reviews for mobile applications, with a label for each review: positive, negative, or neutral.
The content predicts the positive, negative, and neutral tags of the reviews with NLP.
Content
- Importing libraries.
- Trying to understand data.
- Create a new data frame for replies and labels(Concat feature).
- Dropping NaN replies from data frame.
- Pick the indices for testing preprocessing processes
- Remove the redundant strings(these are ':' , ':)' , '//'...).
- Splitting into words.
- Lemmatization.
- Applying the processes that we tested on one index to the whole data.
- Bag of words.
- Creating the 'Sparse Matrix' for the bag of words.
- The most used words, with data visualization
- Learning Time!
- Split the train and test data.
- Training with Random Forest.
- Training with Logistic Regression
- Confusion matrices
- Random Forest
- Logistic Regression