what's machine learning?
the science of programming computers so they can learn from data
why use ml?
problems for which existing solutions require a lot of hand-tuning or long lists of rules: one ML algorithm can often simplify code and perform better
complex problems for which there is no good solution at all using a traditional approach
fluctuating environments: an ML system can adapt to new data
getting insights about complex problems and large amounts of data (data mining, data science)
Types of ML systems
data: supervised, unsupervised, reinforcement learning
offline/batch vs. online learning
instance-based(lazy) vs. model-based(eager) learning
Supervised learning: given input X, predict output Y
training set: a set of examples with correct input-output pairs
labelled set: input data with its correct output
unsupervised learning: given input X, find structure or patterns in the data
no training sets: the data is unlabelled
semi-supervised learning: a mix of labelled and unlabelled data
offline learning : all inputs are available from the beginning
online learning : inputs come into the system as a stream
instance-based learning : no training phase
evaluate the data point at the time it is chosen
compare the new data point with existing ones in the system
typically no parameters to be set
model-based learning
maintain a data model
update the model based on new data
parameters to be set and maintained
main challenges:
data issues:
insufficient data
nonrepresentative data
poor quality data
algorithm issues
feature selection
overfitting
underfitting
generalisation: good performance on both training data and never-seen-before data
overfitting : high accuracy on training data, but poor predictions on new data
testing and validation
K-fold cross validation
partition the data points into K disjoint subsets (folds)
use K-1 folds for training, the remaining fold for testing
repeat this K times
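a minimal sketch of K-fold cross-validation with scikit-learn (not from the lecture; the dataset and model choices are illustrative):

    # K-fold cross-validation sketch (assumes scikit-learn is available)
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000)
    # cv=5: split into 5 disjoint folds, train on 4, test on the 5th, repeat 5 times
    scores = cross_val_score(model, X, y, cv=5)
    print(scores.mean())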
02 Machine Learning Technologies Classification
top down: inspiration from higher abstraction levels
bottom up: inspiration from biology
binary classifiers
performance measures
goal: measure the goodness of the ML model
standard performance metrics
root mean square error: RMSE = sqrt( (1/m) * Σ_{i=1..m} (h(x_i) - y_i)^2 )
mean absolute error: MAE = (1/m) * Σ_{i=1..m} |h(x_i) - y_i|
the smaller the error the more accurate
used in regression models
for classification, use other metrics
accuracy : percentage of predictions where the classifier is correct
accuracy as a performance metric
imbalanced data
accuracy is not a good performance metric for classifiers
confusion matrix
tp: true positive
tn: true negative
fp: false positive (error)
fn: false negative (error)
precision = tp/(tp+fp) : ratio of correct ones among positive ("yes") predictions
recall = tp/(tp+fn) : ratio of actual positives that were successfully predicted
F-score (F1 score)
F1 = tp/(tp + (fn+fp)/2)
   = 2/(1/precision + 1/recall)
captures both precision and recall in a concise way
harmonic mean
F1 is high only if both precision and recall are high
not always appropriate: some applications care more about precision, others about recall
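these metrics follow directly from the confusion-matrix counts; a small sketch (the counts are made-up numbers):

    # precision, recall, F1 from confusion-matrix counts (illustrative numbers)
    tp, fp, fn = 80, 10, 20
    precision = tp / (tp + fp)           # 0.889: correct among "yes" predictions
    recall    = tp / (tp + fn)           # 0.800: actual positives found
    f1 = 2 / (1/precision + 1/recall)    # harmonic mean ≈ 0.842
    print(precision, recall, f1)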
precision/recall trade-off (how to set the parameters)
decision threshold
plot precision directly against recall
ROC curve (receiver operating characteristic)
another common tool used with binary classifiers
plots the true positive rate (TPR, i.e. recall) against the false positive rate (FPR)
FPR: ratio of negative instances incorrectly classified as positive = 1 - true negative rate (TNR)
TNR: ratio of negative instances correctly classified as negative, also called specificity
ROC curve plots sensitivity (recall) versus 1-specificity
recall = tp/(tp+fn)
TNR = tn/(tn+fp)
so the ROC curve plots recall against (1-TNR)
compare different classifiers
measure the area under the curve (AUC)
A perfect classifier will have a ROC AUC equal to 1
a purely random classifier will have a ROC AUC equal to 0.5
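a sketch of computing the ROC curve and AUC with scikit-learn (illustrative labels and scores; in practice y_scores would come from a classifier's decision_function or predict_proba):

    from sklearn.metrics import roc_curve, roc_auc_score

    y_true   = [0, 0, 1, 1, 0, 1]                 # ground-truth binary labels
    y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]    # classifier decision scores
    fpr, tpr, thresholds = roc_curve(y_true, y_scores)  # FPR vs TPR per threshold
    print(roc_auc_score(y_true, y_scores))        # 1.0 = perfect, 0.5 = random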
multiclass classification
more than 2 labels
some classifiers can handle it by default (e.g. random forest)
others require non-trivial modifications (e.g. SVM)
one-versus-all (OvA) strategy: use multiple binary classifiers, one for each class -> choose the class with the highest decision score
one-versus-one (OvO): one binary classifier for each pair of classes -> choose the class that wins the most duels
more classifiers to train, but each uses only part of the data, so it scales better with large training sets (see the sketch below)
multilabel classification: assign multiple labels to each data point
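scikit-learn exposes both strategies as wrappers around any binary classifier; a sketch (dataset and classifier choices are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
    from sklearn.svm import LinearSVC

    X, y = load_iris(return_X_y=True)                # 3 classes
    ova = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)  # 3 classifiers, one per class
    ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)   # 3 classifiers, one per pair
    print(ova.predict(X[:2]), ovo.predict(X[:2]))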
03 Machine Learning Technologies Regression Models
Linear Regression
input X,output Y
Linear correlation between X and Y
dot product
minimise the MSE loss function
Polynomial Regression
underfitting
loss on both training and validation data converges to a large value
increase model complexity
overfitting
there is a big gap in loss for training and validation data sets
increase training data size
Regularisation
capture model complexity in the cost function
Ridge regression : linear regression + regularisation term
Lasso (Least Absolute Shrinkage and Selection Operator) regression
Elastic Net:
Solving regularised regression models
closed form solution for ridge regression
Lasso + Elastic Net: gradient descent approach
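a sketch of the three regularised linear models in scikit-learn (which solves ridge in closed form and the others iteratively internally; data and parameter values are illustrative):

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso, ElasticNet

    X = np.random.rand(100, 3)
    y = X @ np.array([1.0, 2.0, 0.0]) + 0.1 * np.random.randn(100)

    ridge   = Ridge(alpha=1.0).fit(X, y)                    # L2 penalty
    lasso   = Lasso(alpha=0.1).fit(X, y)                    # L1 penalty, drives weights to 0
    elastic = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y) # mix of L1 and L2
    print(lasso.coef_)   # the useless third feature tends toward exactly 0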
Gradient Descent
main issue: need to use the whole training data set to calculate the error and the gradients
stochastic gradient descent (SGD): randomly choose one data point and calculate the error function and its gradient for that single data point
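a hand-rolled numpy sketch of SGD on a 1-D linear model (illustrative, not the lecture's code): one random point per update instead of the whole training set

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random(100)
    y = 3 * X + 0.1 * rng.standard_normal(100)   # true slope = 3
    w, lr = 0.0, 0.1
    for step in range(500):
        i = rng.integers(len(X))                 # pick one random data point
        grad = 2 * (w * X[i] - y[i]) * X[i]      # gradient of its squared error
        w -= lr * grad
    print(w)   # converges close to 3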
regression as classifier
logistic regression : computes the probability that the data belongs to a certain class
linear regression -> logistic function
training the logistic regression model
the log loss:
no closed form solution
SGD can find a local minimum efficiently
Softmax regression : multiclass classification
idea: maintain a score for each class
use these scores to calculate the probability of each class
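the score-to-probability step is the softmax function; a small numpy sketch (scores are made-up):

    import numpy as np

    scores = np.array([2.0, 1.0, 0.1])      # one score per class (illustrative)
    exp = np.exp(scores - scores.max())     # subtract max for numerical stability
    probs = exp / exp.sum()                 # softmax: probabilities summing to 1
    print(probs, probs.argmax())            # predicted class = highest probability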
04 Machine Learning Technologies Support Vector Machines
large empty stripe
most data are off the street
2 data points are on the edge of the street
large margin classification
fitting the widest possible street between the classes (wide margin)
adding more training data "off the street" does not affect the decision boundary at all
fully determined by data lying on the edge -> support vectors
support vector classifiers/ support vector machines
classification with wide margin
fully determined by the support vectors (data on the edge of the margin)
initial step: feature scaling
hard margin classification : none of the data points can be within the margin
in general : linearly non-separable data
soft margin classification : data points can be within the margin (we allow violations)
need to minimise the number of violations
wide margin vs. low violation
hyperparameter C
small C : wider margin but more violations
large C : smaller margin (lower generalisation power) but fewer violations
in case of overfitting : reduce C
nonlinear SVM
polynomial features
add combinations and powers of the original features as new features
the "magical" kernel trick
polynomial features : too low a degree = underfitting, too high a degree = very slow computation
kernel trick "magic" it makes it possible to get the same result as if you added many polynomial features, even with very high-degree polynomials, without actually having to add them.
SVC(coef0=1)
useful for non-linear data
for linearly separable ones, using LinearSVC class is faster
d = degree
C = margin hyperparameter
r = how much the model is influenced by high-degree polynomials vs. low-degree ones
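a sketch of a polynomial-kernel SVC in scikit-learn, with feature scaling as the initial step noted above (dataset and hyperparameter values are illustrative):

    from sklearn.datasets import make_moons
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.15, random_state=42)
    clf = make_pipeline(StandardScaler(),
                        SVC(kernel="poly", degree=3, coef0=1, C=5))  # d, r, C as above
    clf.fit(X, y)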
another trick: Radial Basis Functions (RBF)
a set of landmarks (some chosen data points)
Measure similarity of other data points to these landmarks
similarity function : Gaussian radial basis function
RBF kernel
similarity features : how many do we need?
too few : cannot handle the complexity of the data
too many : computation time becomes very large
same kernel trick
SVC(gamma=5)
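the RBF kernel follows the same pattern; a sketch (gamma and C values are illustrative - a large gamma means narrow Gaussian bumps around the landmarks, a small gamma means wide ones):

    from sklearn.datasets import make_moons
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.15, random_state=42)
    # reduce gamma or C to regularise if the model overfits
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=5, C=0.001)).fit(X, y)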
SVM regression
the trick is to reverse the objective
instead of trying to fit the widest possible street between two classes while limiting margin violations, SVM regression tries to fit as many instances as possible on the street while limiting margin violations (instances off the street)
the width of the street is controlled by a hyperparameter: epsilon
SVM regression
linear SVM regression
LinearSVR(epsilon=1.5)
nonlinear SVM regression
SVR(kernel="poly", degree=2, C=100, epsilon=0.1)
05 Machine Learning Technologies Decision Trees
A decision tree takes a series of inputs defining a situation, and outputs a binary decision/classification.
a decision tree spells out an order for checking the properties (features) of the situation until we have enough information to decide what's going on.
we use the observable features to predict the outcomes (or some important hidden or unknown quantity)
which feature to start with
choose the next feature whose value can reduce the uncertainty about the outcome of the classification the most
reducing uncertainty (in knowledge )=increase (known) information
choose the attribute that provides the highest information gain
choose the one that has the highest Gini score improvement
entropy
borrow similar concepts from information and coding theory
a measure of the amount of disorder or uncertainty in a system
a tidy room has low entropy: you can be reasonably certain your keys are on the hook you made for them
a messy room has high entropy : things are all over the place and your keys could be absolutely anywhere
conditional entropy
entropy measures the uncertainty of a given state of the system
how much uncertainty would remain about the outcome Y if we knew (for instance) the value of attribute X
information gain
the difference represents how much uncertainty would decrease
idea: Y is the outcome of the classification, X is the chosen attribute
information gain : IG(Y, X) = H(Y) - H(Y|X), the change in uncertainty if we choose X
choose the X with the highest information gain
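a hand-rolled sketch of entropy and information gain for one candidate attribute (toy labels and attribute values, not any library API):

    import math
    from collections import Counter

    def entropy(labels):
        # H(Y) = -sum p*log2(p) over the label distribution
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(labels, attribute_values):
        # IG(Y, X) = H(Y) - H(Y|X): uncertainty removed by splitting on X
        n = len(labels)
        groups = {}
        for lab, val in zip(labels, attribute_values):
            groups.setdefault(val, []).append(lab)
        h_cond = sum(len(g) / n * entropy(g) for g in groups.values())
        return entropy(labels) - h_cond

    y = ["yes", "yes", "no", "no"]
    x = ["sunny", "sunny", "rainy", "rainy"]   # perfectly predictive attribute
    print(information_gain(y, x))              # 1.0 bit: all uncertainty removed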
Gini score
choose the feature that produces the lowest Gini impurity in the children nodes
repeat until reaching leaf node
regularisation
decision tree is a nonparametric model
it doesn't have any restriction on its parameters
easy to become overfitted
by regularising the decision tree, we improve the generalisation power of the model
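in scikit-learn, regularising a tree means restricting its freedom through hyperparameters; a sketch (values and toy data are illustrative):

    from sklearn.tree import DecisionTreeClassifier

    tree = DecisionTreeClassifier(max_depth=3,         # limit tree depth
                                  min_samples_leaf=5)  # each leaf needs >= 5 samples
    X = [[i] for i in range(20)]
    y = [0] * 10 + [1] * 10
    tree.fit(X, y)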
regression with decision trees
instability
decision trees use orthogonal decision boundaries (splits perpendicular to an axis), which makes them sensitive to rotations of the data
06 Machine Learning Technologies Ensemble Learning
combining multiple(weak) classifiers
single classifier: train one single data model on training data
idea: combine multiple classifiers
hard voting vs. soft voting
hard voting
each classifier produces a single prediction
take the majority vote
soft voting
if the classifiers can output class probabilities
average the class probabilities across classifiers, and take the class with the highest average
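a sketch of both voting modes with scikit-learn's VotingClassifier (component classifiers and data are illustrative):

    from sklearn.datasets import make_moons
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.3, random_state=42)
    estimators = [("lr", LogisticRegression()),
                  ("rf", RandomForestClassifier()),
                  ("svc", SVC(probability=True))]  # probability=True enables soft voting
    hard = VotingClassifier(estimators, voting="hard").fit(X, y)  # majority vote
    soft = VotingClassifier(estimators, voting="soft").fit(X, y)  # average probabilities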
common practices
Bagging and pasting
use a combination of the same type of classifier (e.g. SVM)
randomly sample a subset of training data to train each classifier
bagging (bootstrap aggregating): choose with replacement (the same data point can be sampled multiple times for the same classifier) - see the sketch after this list
pasting: choose without replacement (the same data point cannot be sampled multiple times for the same classifier)
with Bagging, decision boundaries are less orthogonal
random patches and random subspaces
BaggingClassifier class: can support feature sampling as well
set max_features and bootstrap_features
work the same way as max_samples and bootstrap, but for feature sampling instead of instance sampling
useful when we have high-dimensional data
random patches method: sampling both training instances and features
random subspaces method: keeping all training instances but sampling features
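a sketch covering bagging/pasting and feature sampling with scikit-learn's BaggingClassifier (base estimator and parameter values are illustrative):

    from sklearn.datasets import make_moons
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
    bag = BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=100,
        max_samples=100, bootstrap=True,   # bootstrap=True -> bagging; False -> pasting
    ).fit(X, y)
    # sampling features too (max_features / bootstrap_features) gives random patches;
    # sampling only features while keeping all instances gives random subspaces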
Boosting
ensemble learning = combine multiple classifiers at the same time
boosting (= hypothesis boosting): combine them sequentially
Adaboost (adaptive boosting)
sequentially train predictors (classifiers), each trying to correct its predecessor
the first predictor is trained on the training data set
then used to make predictions -> identify misclassified data points
maintain a weight w(i) for each data point i
if data point i is misclassified -> boost w(i)
the weighted data will be the input data for the next predictor
the next predictor will focus more on the misclassified cases of the previous predictor (due to the higher weight values)
continue until desired number of predictors is reached
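a sketch with scikit-learn's AdaBoostClassifier (weak learner and parameter values are illustrative):

    from sklearn.datasets import make_moons
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
    ada = AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1),  # weak learner: a decision stump
        n_estimators=200, learning_rate=0.5,  # each stump focuses on reweighted errors
    ).fit(X, y)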
gradient Boosting
also sequential training
new predictor is trained on the residual errors of the previous predictor
stacking
use a separate ML model to learn how to aggregate
Blender(meta-learner)
(StackingClassifier is available in scikit-learn)
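a sketch of stacking with scikit-learn (base estimators and blender choices are illustrative):

    from sklearn.datasets import make_moons
    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC

    X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
    stack = StackingClassifier(
        estimators=[("rf", RandomForestClassifier()), ("svc", LinearSVC(max_iter=10000))],
        final_estimator=LogisticRegression(),   # the blender / meta-learner
    ).fit(X, y)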
07 Machine Learning Technologies Dimensionality Reduction
curse of dimensionality
dimension reduction with projection
in real-world problems
data are typically not spread out uniformly across all dimensions
many features can be constant (or have few values)
others are highly correlated
dimension reduction with manifold learning
manifold (informal definition): a bent and twisted version of a low-dimensional shape in a (much) higher-dimensional space
locally resembles a d-dimensional hyperplane (d is lower than the current dimension)
manifold assumption(hypothesis)
most real-world high-dimensional datasets lie close to a much lower dimensional manifold
this assumption is very often empirically observed
idea: unroll the manifold to the lower dimensional shape -> easier classification/regression
principal component analysis(PCA)
projection method
by far the most popular dimensionality reduction method
first it identifies the hyperplane that lies closest to the data, and then it projects the data onto it
PCA identifies the axis that accounts for the largest amount of variance in the training set (first principal component)
PCA then identifies the next axis, orthogonal to the previous one, that accounts for the largest amount of remaining variance (2nd principal component)
continues until it reaches the required number of dimensions (ith axis = ith principal component)
choose the right number of dimensions
generally preferable to choose the number of dimensions that add up to a sufficiently large portion of the variance
for data visualisation - in that case you will generally want to reduce the dimensionality down to 2 or 3
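a sketch with scikit-learn's PCA, where n_components can be given as a variance fraction to keep (dataset choice is illustrative):

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA

    X, _ = load_digits(return_X_y=True)   # 64-dimensional data
    pca = PCA(n_components=0.95)          # keep enough components for 95% of variance
    X_reduced = pca.fit_transform(X)
    print(X_reduced.shape, pca.explained_variance_ratio_.sum())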
incremental PCA
sometimes the whole dataset doesn't fit into memory -> plain PCA cannot be used
split the data into mini-batches
kernel PCA
locally linear embedding (LLE)
manifold learning technique
nonlinear dimensionality reduction
high level description
it measures how each training instance linearly relates to its closest neighbours
it then looks for a low-dimensional representation of the training set where these local relationships are best preserved
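a sketch with scikit-learn's LocallyLinearEmbedding on the classic swiss-roll manifold (parameter values are illustrative):

    from sklearn.datasets import make_swiss_roll
    from sklearn.manifold import LocallyLinearEmbedding

    X, _ = make_swiss_roll(n_samples=1000)   # a 2D manifold rolled up in 3D
    lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10)
    X_unrolled = lle.fit_transform(X)        # local neighbour relations preserved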
08 Machine Learning Technologies (Deep) Artificial Neural Networks
classification
top down:
inspiration from higher abstraction levels
decision trees
nearest neighbour
bottom up:
(artificial) neural networks - inspiration from biology
inspiration from the brain
contains key properties of real neurons
synaptic weights
cumulative effect
threshold for activation "all or nothing"(neuron fires an output signal if the sum of inputs is above threshold)
the perceptron model
synaptic weights
cumulative effect
threshold for activation "all or nothing"
self-training the weights
weighted sum of the inputs
1 neuron = linear separator/regressor
types of activation functions
threshold function
piecewise-linear function
sigmoid function
multi-layered neural networks
achieve non-linear separator with the perceptron model
perceptrons feeding into other perceptrons
our black box is quite complicated now: it can approximate arbitrary functions given enough hidden neurons
training multi-layered neural networks
backpropagation of errors
high level description :
we build a loss function (e.g. L = RMSE)
we employ a bit of calculus to calculate the partial derivative of L with respect to each weight (we use the chain rule to do so)
use a differentiable activation function
we can thus know which way we need to "nudge" each weight for a given training example
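in practice a framework handles this loop; a minimal sketch with tf.keras (assumed available; layer sizes are illustrative):

    import tensorflow as tf

    # a small multi-layer network; fit() would run backpropagation: forward pass,
    # chain-rule gradients of the loss w.r.t. each weight, then gradient-descent nudges
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(4,)),
        tf.keras.layers.Dense(16, activation="sigmoid"),  # differentiable activation
        tf.keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")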
some further issues of neural networks
deep learning
main idea: transform the input space into higher-level abstractions with lower dimensions (similar to the feature expansion trick)
multi-layer architecture(typically with many hidden layers)
each layer is responsible for a space transformation step
by doing so, the complexity of non-linearity is decreased
this is very expensive. Need to rely on new computational solutions: GPUs, grid computing
09 Machine Learning Technologies Deep Neural Nets
issues when using DNNs
vanishing gradients(or the other extreme case : exploding gradients)
slow training with gradient descent techniques
overfitting issues in large networks
the vanishing gradients problem
training = backpropagation
we first calculate the weight gradient needed at the last hidden layer for the required change in the error function
we then backpropagate to the previous layers(also calculate the weight gradient of that layer)
observation : gradients often get smaller and smaller as we progress back to the previous layers
consequence: weights in the first layers (those closer to the input layer) never get significant changes
gradient descent gets stuck in bad local minima
reason
usage of the sigmoid activation function
random initialisation with truncated Gaussian
saturated (sigmoid) function
value will be close to 0 or 1
gradient will be very close to 0
affects the layers below it
mitigation strategy 1 = different random initialisation
Xavier initialisation
mitigation strategy 2 = different activation functions
nonsaturating activation functions:
ReLU(rectified linear unit)
easy to compute
no limit on the maximum value
issue: dying ReLU - when the output becomes 0, it can stay 0 forever
observation : in some cases up to half of the neurons suffer from this
solution for the dying ReLU problem :
leaky ReLU
ELU(exponential linear unit)
it can have negative values + no max value -> reduces the saturation problem
nonzero gradient for z < 0 -> avoids the dying unit problem
smooth function (differentiable) -> good for gradient descent
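in tf.keras these alternatives are one-line swaps (a sketch; layer sizes are illustrative, and he_normal is the ReLU-oriented counterpart of the Xavier/Glorot initialisation mentioned above):

    import tensorflow as tf

    layer_relu = tf.keras.layers.Dense(64, activation="relu", kernel_initializer="he_normal")
    layer_elu  = tf.keras.layers.Dense(64, activation="elu",  kernel_initializer="he_normal")
    # leaky ReLU is usually added as its own layer after a linear Dense layer
    leaky = [tf.keras.layers.Dense(64), tf.keras.layers.LeakyReLU()]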
mitigation strategy 3: batch normalisation
additional operation just before the activation function
normalises the inputs(centering around 0)
uses 2 new parameters per layer to scale and shift the normalised values
the model has to learn these new parameters as well
benefit:
significantly reduces the vanishing/exploding gradients problems
achieves improvement even without any other mitigation solutions
drawback
computationally very slow
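a tf.keras sketch placing batch normalisation just before the activation, as described above (layer sizes are illustrative):

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(784,)),
        tf.keras.layers.Dense(300),
        tf.keras.layers.BatchNormalization(),  # normalise, then learn scale + shift
        tf.keras.layers.Activation("relu"),    # BN placed just before the activation
        tf.keras.layers.Dense(10, activation="softmax"),
    ])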
the exploding gradients problem
mitigation strategy 1: batch normalisation
mitigation strategy 2: gradient clipping
clip the gradients if they become too large
frequently used in RNNs
mitigation strategies against slow training
transfer learning
reuse network already trained on similar problems
faster optimisers
so far: stochastic gradient descent(SGD)
more efficient techniques
momentum optimization
Nesterov accelerated gradient
AdaGrad
RMSProp
Adam(adaptive moment estimation)
learning rate in optimisation algorithms
start with large then gradually reduce it
avoiding overfitting in DNNs
too many parameters (weights,neurons,layers)
overfitting will occur quite often
use regularisation
early stopping
train on a mini-batch, then evaluate on the validation set, then train again (on another batch)
but stop training as soon as the error rate on the validation set starts increasing (performance starts dropping)
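a tf.keras sketch of early stopping as a callback (X_train/X_val names are hypothetical placeholders):

    import tensorflow as tf

    # stop when the validation loss stops improving; keep the best weights seen
    early = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                             restore_best_weights=True)
    # model.fit(X_train, y_train, validation_data=(X_val, y_val), callbacks=[early])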
dropout:
at every training step, every neuron (excluding the output layer) has a probability p of being entirely ignored within that training step
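a tf.keras sketch of dropout layers (rate and layer sizes are illustrative):

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(784,)),
        tf.keras.layers.Dropout(rate=0.2),  # each unit dropped with p=0.2 per step
        tf.keras.layers.Dense(300, activation="relu"),
        tf.keras.layers.Dropout(rate=0.2),  # active only during training, not inference
        tf.keras.layers.Dense(10, activation="softmax"),
    ])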
data augmentation
artificially generate new data points from existing ones
add some noise to existing data but keep the same label
10 Machine Learning Technologies Convolutional Neural Nets
Neural nets in computer vision
one of the core domains of CS and AI
identify(classify) visual objects
NNs are the state-of-the-art solution techniques
natural representation of human vision
intuitive model for convolution networks
ideally
DNN with fully connected layers(each node is connected to every single node in the next layer)
can learn any data model
issues:
too many parameters
idea:
convolution layer
motivated by nature
neurons in the first convolution layer are not connected to every single pixel in the input image, but only to pixels in their receptive fields
next convolution layer : each neuron is connected only to neurons within a small rectangle of first layer
reducing size of convolutional layer
stride : size of jump between neighbouring rectangles
using large strides will significantly reduce the layer's size, but the cost is information loss
filters
the weights of the edges that connect the rectangle to the neuron form a matrix
matrix = filter
feature map
the whole layer applies the same filter
feature maps and filters in CNN
during training , the CNN learns the most useful filters
combines them into more complex patterns
stacking feature maps
packing 2D feature layers on top of each other -> 3D stacks
TensorFlow implementation
X is the input mini-batch (a 4D tensor)
filters is the set of filters to apply (a 4D tensor)
strides is a four-element 1D array: the two central elements are the vertical and horizontal strides; the first and last elements must currently be equal to 1 (they may one day be used to specify a batch stride - to skip some instances - and a channel stride - to skip some of the previous layer's feature maps or channels)
padding must be either "VALID" or "SAME"
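a sketch of the tf.nn.conv2d call being described (tensor shapes are illustrative, matching the memory example below):

    import tensorflow as tf

    X = tf.random.normal([32, 150, 100, 3])     # mini-batch: 32 RGB images, 150x100
    filters = tf.random.normal([5, 5, 3, 200])  # 5x5 filters, 3 in-channels, 200 maps
    fmaps = tf.nn.conv2d(X, filters,
                         strides=[1, 1, 1, 1],  # [batch, vertical, horizontal, channel]
                         padding="SAME")        # "SAME": zero-pad to keep spatial size
    print(fmaps.shape)                          # (32, 150, 100, 200)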
memory requirements of CNNs
convolutional layers require a huge amount of RAM, especially during training
because the reverse pass of backpropagation requires all the intermediate values computed during the forward pass
example: 1 convolutional layer with 5*5 filters, 200 feature maps (150*100), stride=1, padding=same
total parameters: 15,200
225 million float multiplications
(with 32-bit floats) this = 11.4 MB RAM for 1 instance
100 instances = 1 GB RAM
if training crashes because of an out-of-memory error:
reduce the mini-batch size
reduce dimensionality using a stride, or remove a few layers
use 16bit floats instead of 32bit floats
distribute the CNN across multiple devices
pooling layers
goal: to reduce input image size(to reduce computational load,memory)
additional consequence : location invariance - the CNN can tolerate a small image shift
pooling layer: similar to a convolution layer - each neuron is connected to a rectangle of the input layer
rectangle size, stride, padding are defined the same way
there are no weights
aggregation : max or mean
max pooling : aggregation = max (the other inputs are dropped)
average pooling : aggregation = mean
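a sketch of max pooling with TensorFlow (shapes continue the example above and are illustrative):

    import tensorflow as tf

    fmaps = tf.random.normal([32, 150, 100, 200])
    # 2x2 max pooling with stride 2: halves height and width, keeps channel count
    pooled = tf.nn.max_pool2d(fmaps, ksize=2, strides=2, padding="VALID")
    print(pooled.shape)   # (32, 75, 50, 200)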
famous CNN: LeNet-5
the most widely known CNN architecture
AlexNet
much larger and deeper
it was the first to stack convolution layers directly on top of each other, instead of stacking a pooling layer on top of each convolutional layer
GoogLeNet
much deeper than previous CNNs
contains sub-networks called inception modules: use parameters much more efficiently
actually has 10 times fewer parameters than AlexNet (roughly 6 million instead of 60 million)
ResNet
extremely deep CNN composed of 152 layers
idea: skip connections(also called shortcut connections ): the signal feeding into a layer is also added to the output of a layer located a bit higher up the stack
11 Machine Learning Technologies Recurrent Neural Nets
classification + regression : given X, predict Y
time series :
sequence of data in time
the next data point depends on the previous data (historical data)
e.g. stock prices, weather
recurrent neural nets
feedforward neural nets
each connection is a forward connection
no loops
recurrent neural net
backward connections (and therefore loops) are allowed
simplest version : a neuron + self-loop
the output of the previous time step is also an input of the current step
unrolling the network through time
more complex network: input x and output y (both are vectors)
each neuron has 2 types of weights
one for the input x(t) and one for the output of the previous step y(t-1)
realisation: a part of the network can preserve some state across time steps
this is called a memory cell
basic cells: a single recurrent neuron, or a recurrent layer
memory cells
state of a memory cell : h(t) = f(h(t-1), x(t))
RNNs can do
in terms of input-output pairs
sequence-to-sequence: input sequence, output sequence
stock price prediction
sequence-to-vector: input sequence, output a single vector/data point
NLP: from a single sentence, guess the mood of the speaker
vector-to-sequence: input a single vector/data point, output a sequence
from a single image, generate a description in words
encoder-decoder model: encoder: sequence-to-vector, decoder: vector-to-sequence
machine translation from one language to another
training RNNs
idea: unroll the network through time, then use regular backpropagation
BPTT (backpropagation through time)
issue: the unrolled network is typically very large
mitigation strategies
ReLU
Dropout
gradient clipping ,faster optimizers
truncated backpropagation: set a time horizon for the unrolling
long short-term memory (LSTM)
2 states: h(t) for short-term, c(t) for long-term memory
an LSTM cell can learn to recognise an important input, store it in the long-term state, learn to preserve it for as long as it is needed, and learn to extract it whenever it is needed
this explains why they have been amazingly successful at capturing long-term patterns in time series, long texts, audio recordings, and more
input gate : recognise important inputs
forget gate: keeps content in memory until needed (then erases it)
output gate: generates the output
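a tf.keras sketch of a sequence-to-vector LSTM model (shapes are illustrative: 50 time steps with 1 feature each, predicting the next value of the series):

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(50, 1)),
        tf.keras.layers.LSTM(32),   # swap in tf.keras.layers.GRU(32) for a GRU
        tf.keras.layers.Dense(1),
    ])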
gated recurrent unit (GRU)
a simpler version of LSTM
performance-wise it's still quite good (seems to perform as well as LSTM)