what's machine learning?
the science of programming computers so they can learn from data
why use ml?
problems for which existing solutions require a lot of hand-tuning or long lists of rules: one ML algorithm can often simplify code and perform better
complex problems for which there is no good solution at all using a traditional approach
fluctuating environments: an ML system can adapt to new data
getting insights about complex problems and large amounts of data (data mining, data science)
Types of ML systems
data: supervised, unsupervised, reinforcement learning
offline/batch vs. online learning
instance-based(lazy) vs. model-based(eager) learning
Supervised learning: given input X, predict output Y
training set: a set of examples with correct input-output pairs
labelled set: input data with its correct output
unsupervised learning: given input X, find structure or patterns in the data
no training sets: the data is unlabelled
semi-supervised learning: a mix of labelled and unlabelled data
offline learning : all inputs are available from the beginning
online learning : inputs come into the system as a stream
instance-based learning : no training phase
evaluate the data point at the time it is chosen
compare the new data point with existing ones in the system
typically no parameters to be set
model-based learning
maintain a data model
update the model based on new data
parameters to be set and maintained
main challenges:
data issues:
insufficient data
nonrepresentative data
poor quality data
algorithm issues
feature selection
overfitting
underfitting
generalisation: good performance on both training data and never-seen-before data
overfitting : high accuracy on training data, but poor predictions on new data
testing and validation
K-fold cross validation
partition the data points into K disjoint subsets (folds)
use K-1 folds for training, the remaining fold for testing
repeat this K times
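a minimal sketch of K-fold cross-validation with scikit-learn (not from the lecture; the dataset and model choices are illustrative):

    # K-fold cross-validation sketch (assumes scikit-learn is available)
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000)
    # cv=5: split into 5 disjoint folds, train on 4, test on the 5th, repeat 5 times
    scores = cross_val_score(model, X, y, cv=5)
    print(scores.mean())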
02 Machine Learning Technologies Classification
top down: inspiration from higher abstraction levels
bottom up: inspiration from biology
binary classifiers
performance measures
goal: measure the goodness of the ML model
standard performance metrics
root mean square error: RMSE = sqrt( (1/m) * Σ_{i=1..m} (h(x_i) - y_i)^2 )
mean absolute error: MAE = (1/m) * Σ_{i=1..m} |h(x_i) - y_i|
the smaller the error the more accurate
used in regression models
for classification, use other metrics
accuracy : percentage of predictions where the classifier is correct
accuracy as a performance metric
imbalanced data
accuracy is not a good performance metric for classifiers
confusion matrix
tp: true positive
tn: true negative
fp: false positive (error)
fn: false negative (error)
precision = tp/(tp+fp) : ratio of correct ones among positive ("yes") predictions
recall = tp/(tp+fn) : ratio of actual positives that were successfully predicted
F-score (F1 score)
F1 = tp/(tp + (fn+fp)/2)
   = 2/(1/precision + 1/recall)
captures both precision and recall in a concise way
harmonic mean
F1 is high only if both precision and recall are high
not always appropriate: some applications care more about precision, others about recall
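these metrics follow directly from the confusion-matrix counts; a small sketch (the counts are made-up numbers):

    # precision, recall, F1 from confusion-matrix counts (illustrative numbers)
    tp, fp, fn = 80, 10, 20
    precision = tp / (tp + fp)           # 0.889: correct among "yes" predictions
    recall    = tp / (tp + fn)           # 0.800: actual positives found
    f1 = 2 / (1/precision + 1/recall)    # harmonic mean ≈ 0.842
    print(precision, recall, f1)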
precision/recall trade-off (how to set the parameters)
decision threshold
plot precision directly against recall
ROC curve (receiver operating characteristic)
another common tool used with binary classifiers
plots the true positive rate (TPR, i.e. recall) against the false positive rate (FPR)
FPR: ratio of negative instances incorrectly classified as positive = 1 - true negative rate (TNR)
TNR: ratio of negative instances correctly classified as negative, also called specificity
ROC curve plots sensitivity (recall) versus 1-specificity
recall = tp/(tp+fn)
TNR = tn/(tn+fp)
so the ROC curve plots recall against (1-TNR)
compare different classifiers
measure the area under the curve (AUC)
A perfect classifier will have a ROC AUC equal to 1
a purely random classifier will have a ROC AUC equal to 0.5
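a sketch of computing the ROC curve and AUC with scikit-learn (illustrative labels and scores; in practice y_scores would come from a classifier's decision_function or predict_proba):

    from sklearn.metrics import roc_curve, roc_auc_score

    y_true   = [0, 0, 1, 1, 0, 1]                 # ground-truth binary labels
    y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]    # classifier decision scores
    fpr, tpr, thresholds = roc_curve(y_true, y_scores)  # FPR vs TPR per threshold
    print(roc_auc_score(y_true, y_scores))        # 1.0 = perfect, 0.5 = random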
multiclass classification
more than 2 labels
some classifiers can handle it by default (e.g. random forest)
others require non-trivial modifications (e.g. SVM)
one-versus-all (OvA) strategy: use multiple binary classifiers, one for each class -> choose the class with the highest decision score
one-versus-one (OvO): one binary classifier for each pair of classes -> choose the class that wins the most duels
more classifiers to train, but each uses only part of the data, so it scales better with large training sets (see the sketch below)
multilabel classification: assign multiple labels to each data point
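scikit-learn exposes both strategies as wrappers around any binary classifier; a sketch (dataset and classifier choices are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
    from sklearn.svm import LinearSVC

    X, y = load_iris(return_X_y=True)                # 3 classes
    ova = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)  # 3 classifiers, one per class
    ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)   # 3 classifiers, one per pair
    print(ova.predict(X[:2]), ovo.predict(X[:2]))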
03 Machine Learning Technologies Regression Models
Linear Regression
input X,output Y
Linear correlation between X and Y
dot product
minimise the MSE loss function
Polynomial Regression
underfitting
loss on both training and validation data converges to a large value
increase model complexity
overfitting
there is a big gap in loss for training and validation data sets
increase training data size
Regularisation
capture model complexity in the cost function
Ridge regression : linear regression + regularisation term
Lasso (Least Absolute Shrinkage and Selection Operator) regression
Elastic Net:
Solving regularised regression models
closed form solution for ridge regression
Lasso + Elastic Net: gradient descent approach
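a sketch of the three regularised linear models in scikit-learn (which solves ridge in closed form and the others iteratively internally; data and parameter values are illustrative):

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso, ElasticNet

    X = np.random.rand(100, 3)
    y = X @ np.array([1.0, 2.0, 0.0]) + 0.1 * np.random.randn(100)

    ridge   = Ridge(alpha=1.0).fit(X, y)                    # L2 penalty
    lasso   = Lasso(alpha=0.1).fit(X, y)                    # L1 penalty, drives weights to 0
    elastic = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y) # mix of L1 and L2
    print(lasso.coef_)   # the useless third feature tends toward exactly 0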
Gradient Descent
main issue: need to use the whole training data set to calculate the error and the gradients
stochastic gradient descent (SGD): randomly choose one data point and calculate the error function and its gradient for that single data point
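a hand-rolled numpy sketch of SGD on a 1-D linear model (illustrative, not the lecture's code): one random point per update instead of the whole training set

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random(100)
    y = 3 * X + 0.1 * rng.standard_normal(100)   # true slope = 3
    w, lr = 0.0, 0.1
    for step in range(500):
        i = rng.integers(len(X))                 # pick one random data point
        grad = 2 * (w * X[i] - y[i]) * X[i]      # gradient of its squared error
        w -= lr * grad
    print(w)   # converges close to 3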
regression as classifier
logistic regression : computes the probability that the data belongs to a certain class
linear regression -> logistic function
training the logistic regression model
the log loss:
no closed form solution
SGD can find a local minimum efficiently
Softmax regression : multiclass classification
idea: maintain a score for each class
use these scores to calculate the probability of each class
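the score-to-probability step is the softmax function; a small numpy sketch (scores are made-up):

    import numpy as np

    scores = np.array([2.0, 1.0, 0.1])      # one score per class (illustrative)
    exp = np.exp(scores - scores.max())     # subtract max for numerical stability
    probs = exp / exp.sum()                 # softmax: probabilities summing to 1
    print(probs, probs.argmax())            # predicted class = highest probability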
04 Machine Learning Technologies Support Vector Machines
large empty stripe
most data are off the street
2 data points are on the edge of the street
large margin classification
fitting the widest possible street between the classes (wide margin)
adding more training data "off the street" does not affect the decision boundary at all
fully determined by data lying on the edge -> support vectors
support vector classifiers/ support vector machines
classification with wide margin
fully determined by the support vectors (data on the edge of the margin)
initial step: feature scaling
hard margin classification : none of the data points can be within the margin
in general : linearly non-separable data
soft margin classification : data points can be within the margin (we allow violations)
need to minimise the number of violations
wide margin vs. low violation
hyperparameter C
small C : wider margin but more violations
large C : smaller margin (lower generalisation power) but fewer violations
in case of overfitting : reduce C
nonlinear SVM
polynomial features
add combinations and powers of the original features as new features
the "magical" kernel trick
polynomial features : too low a degree = underfitting, too high a degree = very slow computation
kernel trick "magic" it makes it possible to get the same result as if you added many polynomial features, even with very high-degree polynomials, without actually having to add them.
SVC(coef0=1)
useful for non-linear data
for linearly separable ones, using LinearSVC class is faster
d = degree
C = margin hyperparameter
r = how much the model is influenced by high-degree polynomials vs. low-degree ones
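a sketch of a polynomial-kernel SVC in scikit-learn, with feature scaling as the initial step noted above (dataset and hyperparameter values are illustrative):

    from sklearn.datasets import make_moons
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.15, random_state=42)
    clf = make_pipeline(StandardScaler(),
                        SVC(kernel="poly", degree=3, coef0=1, C=5))  # d, r, C as above
    clf.fit(X, y)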
another trick: Radial Basis Functions (RBF)
a set of landmarks (some chosen data points)
Measure similarity of other data points to these landmarks
similarity function : Gaussian radial basis function
RBF kernel
similarity features : how many do we need?
too few : cannot handle the complexity of the data
too many : computation time becomes very large
same kernel trick
SVC(gamma=5)
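the RBF kernel follows the same pattern; a sketch (gamma and C values are illustrative - a large gamma means narrow Gaussian bumps around the landmarks, a small gamma means wide ones):

    from sklearn.datasets import make_moons
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.15, random_state=42)
    # reduce gamma or C to regularise if the model overfits
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=5, C=0.001)).fit(X, y)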
SVM regression
the trick is to reverse the objective
instead of trying to fit the widest possible street between two classes while limiting margin violations, SVM regression tries to fit as many instances as possible on the street while limiting margin violations (instances off the street)
the width of the street is controlled by a hyperparameter: epsilon
SVM regression
linear SVM regression
LinearSVR(epsilon=1.5)
nonlinear SVM regression
SVR(kernel="poly", degree=2, C=100, epsilon=0.1)
05 Machine Learning Technologies Decision Trees
A decision tree takes a series of inputs defining a situation, and outputs a binary decision/classification.
a decision tree spells out an order for checking the properties (features) of the situation until we have enough information to decide what's going on.
we use the observable features to predict the outcomes (or some important hidden or unknown quantity)
which feature to start with
choose the next feature whose value can reduce the uncertainty about the outcome of the classification the most
reducing uncertainty (in knowledge )=increase (known) information
choose the attribute that provides the highest information gain
choose the one that has the highest Gini score improvement
entropy
borrow similar concepts from information and coding theory
a measure of the amount of disorder or uncertainty in a system
a tidy room has low entropy: you can be reasonably certain your keys are on the hook you made for them
a messy room has high entropy : things are all over the place and your keys could be absolutely anywhere
conditional entropy
entropy measures the uncertainty of a given state of the system
how much uncertainty would remain about the outcome Y if we knew (for instance) the value of attribute X
information gain
the difference represents how much uncertainty would decrease
idea: Y is the outcome of the classification, X is the chosen attribute
information gain : IG(Y, X) = H(Y) - H(Y|X), the change in uncertainty if we choose X
choose the X with the highest information gain
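a hand-rolled sketch of entropy and information gain for one candidate attribute (toy labels and attribute values, not any library API):

    import math
    from collections import Counter

    def entropy(labels):
        # H(Y) = -sum p*log2(p) over the label distribution
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(labels, attribute_values):
        # IG(Y, X) = H(Y) - H(Y|X): uncertainty removed by splitting on X
        n = len(labels)
        groups = {}
        for lab, val in zip(labels, attribute_values):
            groups.setdefault(val, []).append(lab)
        h_cond = sum(len(g) / n * entropy(g) for g in groups.values())
        return entropy(labels) - h_cond

    y = ["yes", "yes", "no", "no"]
    x = ["sunny", "sunny", "rainy", "rainy"]   # perfectly predictive attribute
    print(information_gain(y, x))              # 1.0 bit: all uncertainty removed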
Gini score
choose the feature that produces the lowest Gini impurity in the children nodes
repeat until reaching leaf node
regularisation
decision tree is a nonparametric model
it doesn't have any restriction on its parameters
easy to become overfitted
by regularising the decision tree, we improve the generalisation power of the model
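in scikit-learn, regularising a tree means restricting its freedom through hyperparameters; a sketch (values and toy data are illustrative):

    from sklearn.tree import DecisionTreeClassifier

    tree = DecisionTreeClassifier(max_depth=3,         # limit tree depth
                                  min_samples_leaf=5)  # each leaf needs >= 5 samples
    X = [[i] for i in range(20)]
    y = [0] * 10 + [1] * 10
    tree.fit(X, y)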
regression with decision trees
instability
decision trees use orthogonal decision boundaries (splits perpendicular to an axis), which makes them sensitive to rotations of the data
06 Machine Learning Technologies Ensemble Learning
combining multiple(weak) classifiers
single classifier: train one single data model on training data
idea: combine multiple classifiers
hard voting vs. soft voting
hard voting
each classifier produces a single prediction
take the majority vote
soft voting
if the classifiers can output class probabilities
average the class probabilities across classifiers, and take the class with the highest average
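a sketch of both voting modes with scikit-learn's VotingClassifier (component classifiers and data are illustrative):

    from sklearn.datasets import make_moons
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.3, random_state=42)
    estimators = [("lr", LogisticRegression()),
                  ("rf", RandomForestClassifier()),
                  ("svc", SVC(probability=True))]  # probability=True enables soft voting
    hard = VotingClassifier(estimators, voting="hard").fit(X, y)  # majority vote
    soft = VotingClassifier(estimators, voting="soft").fit(X, y)  # average probabilities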
common practices
Bagging and pasting
use a combination of the same type of classifier (e.g. SVM)
randomly sample a subset of training data to train each classifier
bagging (bootstrap aggregating): choose with replacement (the same data point can be sampled multiple times for the same classifier) - see the sketch after this list
pasting: choose without replacement (the same data point cannot be sampled multiple times for the same classifier)
with Bagging, decision boundaries are less orthogonal
random patches and random subspaces
BaggingClassifier class: can support feature sampling as well
set max_features and bootstrap_features
work the same way as max_samples and bootstrap, but for feature sampling instead of instance sampling
useful when we have high-dimensional data
random patches method: sampling both training instances and features
random subspaces method: keeping all training instances but sampling features
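a sketch covering bagging/pasting and feature sampling with scikit-learn's BaggingClassifier (base estimator and parameter values are illustrative):

    from sklearn.datasets import make_moons
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
    bag = BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=100,
        max_samples=100, bootstrap=True,   # bootstrap=True -> bagging; False -> pasting
    ).fit(X, y)
    # sampling features too (max_features / bootstrap_features) gives random patches;
    # sampling only features while keeping all instances gives random subspaces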
Boosting
ensemble learning = combine multiple classifiers at the same time
boosting (= hypothesis boosting): combine them sequentially
Adaboost (adaptive boosting)
sequentially train predictors (classifiers), each trying to correct its predecessor
the first predictor is trained on the training data set
then used to make predictions -> identify misclassified data points
maintain a weight w(i) for each data point i
if data point i is misclassified -> boost w(i)
the weighted data will be the input data for the next predictor
the next predictor will focus more on the misclassified cases of the previous predictor (due to the higher weight values)
continue until desired number of predictors is reached
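a sketch with scikit-learn's AdaBoostClassifier (weak learner and parameter values are illustrative):

    from sklearn.datasets import make_moons
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
    ada = AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1),  # weak learner: a decision stump
        n_estimators=200, learning_rate=0.5,  # each stump focuses on reweighted errors
    ).fit(X, y)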
gradient Boosting
also sequential training
new predictor is trained on the residual errors of the previous predictor
stacking
use a separate ML model to learn how to aggregate
Blender(meta-learner)
(StackingClassifier is available in scikit-learn)
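a sketch of stacking with scikit-learn (base estimators and blender choices are illustrative):

    from sklearn.datasets import make_moons
    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC

    X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
    stack = StackingClassifier(
        estimators=[("rf", RandomForestClassifier()), ("svc", LinearSVC(max_iter=10000))],
        final_estimator=LogisticRegression(),   # the blender / meta-learner
    ).fit(X, y)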
07 Machine Learning Technologies Dimensionality Reduction
curse of dimensionality
dimension reduction with projection
in real-world problems
data are typically not spread out uniformly across all dimensions
many features can be constant (or have few values)
others are highly correlated
dimension reduction with manifold learning
manifold (informal definition): a bent and twisted version of a low-dimensional shape in a (much) higher-dimensional space
locally resembles a d-dimensional hyperplane (d is lower than the current dimension)
manifold assumption(hypothesis)
most real-world high-dimensional datasets lie close to a much lower dimensional manifold
this assumption is very often empirically observed
idea: unroll the manifold to the lower dimensional shape -> easier classification/regression
principal component analysis(PCA)
projection method
by far the most popular dimensionality reduction method
first it identifies the hyperplane that lies closest to the data, and then it projects the data onto it
PCA identifies the axis that accounts for the largest amount of variance in the training set (first principal component)
PCA then identifies the next axis, orthogonal to the previous one, that accounts for the largest amount of remaining variance (2nd principal component)
continues until it reaches the required number of dimensions (ith axis = ith principal component)
choose the right number of dimensions
generally preferable to choose the number of dimensions that add up to a sufficiently large portion of the variance
for data visualisation - in that case you will generally want to reduce the dimensionality down to 2 or 3
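a sketch with scikit-learn's PCA, where n_components can be given as a variance fraction to keep (dataset choice is illustrative):

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA

    X, _ = load_digits(return_X_y=True)   # 64-dimensional data
    pca = PCA(n_components=0.95)          # keep enough components for 95% of variance
    X_reduced = pca.fit_transform(X)
    print(X_reduced.shape, pca.explained_variance_ratio_.sum())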
incremental PCA
sometimes the whole dataset doesn't fit into memory -> plain PCA cannot be used
split the data into mini-batches
kernel PCA
locally linear embedding (LLE)
manifold learning technique
nonlinear dimensionality reduction
high level description
it measures how each training instance linearly relates to its closest neighbours
it then looks for a low-dimensional representation of the training set where these local relationships are best preserved
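a sketch with scikit-learn's LocallyLinearEmbedding on the classic swiss-roll manifold (parameter values are illustrative):

    from sklearn.datasets import make_swiss_roll
    from sklearn.manifold import LocallyLinearEmbedding

    X, _ = make_swiss_roll(n_samples=1000)   # a 2D manifold rolled up in 3D
    lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10)
    X_unrolled = lle.fit_transform(X)        # local neighbour relations preserved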
08 Machine Learning Technologies (Deep) Artificial Neural Networks
classification
top down:
inspiration from higher abstraction levels
decision trees
nearest neighbour
bottom up:
(artificial) neural networks - inspiration from biology
inspiration from the brain
contains key properties of real neurons
synaptic weights
cumulative effect
threshold for activation "all or nothing"(neuron fires an output signal if the sum of inputs is above threshold)
the perceptron model
synaptic weights
cumulative effect
threshold for activation "all or nothing"
self-training the weights
weighted sum of the inputs
1 neuron = linear separator/regressor
types of activation functions
threshold function
piecewise-linear function
sigmoid function
multi-layered neural networks
achieve non-linear separator with the perceptron model
perceptrons feeding into other perceptrons
our black box is quite complicated now: it can approximate arbitrary functions given enough hidden neurons
training multi-layered neural networks
backpropagation of errors
high level description :
we build a loss function (e.g. L = RMSE)
we employ a bit of calculus to calculate the partial derivative of L with respect to each weight (we use the chain rule to do so)
use a differentiable activation function
we can thus know which way we need to "nudge" each weight for a given training example
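in practice a framework handles this loop; a minimal sketch with tf.keras (assumed available; layer sizes are illustrative):

    import tensorflow as tf

    # a small multi-layer network; fit() would run backpropagation: forward pass,
    # chain-rule gradients of the loss w.r.t. each weight, then gradient-descent nudges
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(4,)),
        tf.keras.layers.Dense(16, activation="sigmoid"),  # differentiable activation
        tf.keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")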
some further issues of neural networks
deep learning
main idea: transform the input space into higher-level abstractions with lower dimensions (similar to the feature expansion trick)
multi-layer architecture(typically with many hidden layers)
each layer is responsible for a space transformation step
by doing so, the complexity of non-linearity is decreased
this is very expensive. Need to rely on new computational solutions: GPUs, grid computing
09 Machine Learning Technologies Deep Neural Nets
issues when using DNNs
vanishing gradients(or the other extreme case : exploding gradients)
slow training with gradient descent techniques
overfitting issues in large networks
the vanishing gradients problem
training = backpropagation
we first calculate the weight gradient needed at the last hidden layer for the required change in the error function
we then backpropagate to the previous layers(also calculate the weight gradient of that layer)
observation : gradients often get smaller and smaller as we progress back to the previous layers
consequence: weights in the first layers (those closer to the input layer) never get significant changes
gradient descent gets stuck in bad local minima
reason
usage of the sigmoid activation function
random initialisation with truncated Gaussian
saturated (sigmoid) function
value will be close to 0 or 1
gradient will be very close to 0
affects the layers below it
mitigation strategy 1 = different random initialisation
Xavier initialisation
mitigation strategy 2 = different activation functions
nonsaturating activation functions:
ReLU(rectified linear unit)
easy to compute
no limit on the maximum value
issue: dying ReLU - when the output becomes 0, it can stay 0 forever
observation : in some cases up to half of the neurons suffer from this
solution for the dying ReLU problem :
leaky ReLU
ELU(exponential linear unit)
it can have negative values + no max value -> reduces the saturation problem
nonzero gradient for z < 0 -> avoids the dying unit problem
smooth function (differentiable) -> good for gradient descent
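in tf.keras these alternatives are one-line swaps (a sketch; layer sizes are illustrative, and he_normal is the ReLU-oriented counterpart of the Xavier/Glorot initialisation mentioned above):

    import tensorflow as tf

    layer_relu = tf.keras.layers.Dense(64, activation="relu", kernel_initializer="he_normal")
    layer_elu  = tf.keras.layers.Dense(64, activation="elu",  kernel_initializer="he_normal")
    # leaky ReLU is usually added as its own layer after a linear Dense layer
    leaky = [tf.keras.layers.Dense(64), tf.keras.layers.LeakyReLU()]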
mitigation strategy 3: batch normalisation
additional operation just before the activation function
normalises the inputs(centering around 0)
uses 2 new parameters per layer to scale and shift the normalised values
the model has to learn these new parameters as well
benefit:
significantly reduces the vanishing/exploding gradients problems
achieves improvement even without any other mitigation solutions
drawback
computationally very slow
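a tf.keras sketch placing batch normalisation just before the activation, as described above (layer sizes are illustrative):

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(784,)),
        tf.keras.layers.Dense(300),
        tf.keras.layers.BatchNormalization(),  # normalise, then learn scale + shift
        tf.keras.layers.Activation("relu"),    # BN placed just before the activation
        tf.keras.layers.Dense(10, activation="softmax"),
    ])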
the exploding gradients problem
mitigation strategy 1: batch normalisation
mitigation strategy 2: gradient clipping
clip the gradients if they become too large
frequently used in RNNs
mitigation strategies against slow training
transfer learning
reuse network already trained on similar problems
faster optimisers
so far: stochastic gradient descent(SGD)
more efficient techniques
momentum optimization
Nesterov accelerated gradient
AdaGrad
RMSProp
Adam(adaptive moment estimation)
learning rate in optimisation algorithms
start with large then gradually reduce it
avoiding overfitting in DNNs
too many parameters (weights,neurons,layers)
overfitting will occur quite often
use regularisation
early stopping
train on a mini-batch, then evaluate on the validation set, then train again (on another batch)
but stop training as soon as the error rate on the validation set starts increasing (performance starts dropping)
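a tf.keras sketch of early stopping as a callback (X_train/X_val names are hypothetical placeholders):

    import tensorflow as tf

    # stop when the validation loss stops improving; keep the best weights seen
    early = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                             restore_best_weights=True)
    # model.fit(X_train, y_train, validation_data=(X_val, y_val), callbacks=[early])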
dropout:
at every training step, every neuron (excluding the output layer) has a probability p of being entirely ignored within that training step
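a tf.keras sketch of dropout layers (rate and layer sizes are illustrative):

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(784,)),
        tf.keras.layers.Dropout(rate=0.2),  # each unit dropped with p=0.2 per step
        tf.keras.layers.Dense(300, activation="relu"),
        tf.keras.layers.Dropout(rate=0.2),  # active only during training, not inference
        tf.keras.layers.Dense(10, activation="softmax"),
    ])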
data augmentation
artificially generate new data points from existing ones
add some noise to existing data but keep the same label
10 Machine Learning Technologies Convolutional Neural Nets
Neural nets in computer vision
one of the core domains of CS and AI
identify(classify) visual objects
NNs are the state-of-the-art solution techniques
natural representation of human vision
intuitive model for convolution networks
ideally
DNN with fully connected layers(each node is connected to every single node in the next layer)
can learn any data model
issues:
too many parameters
idea:
convolution layer
motivated by nature
neurons in the first convolution layer are not connected to every single pixel in the input image, but only to pixels in their receptive fields
next convolution layer : each neuron is connected only to neurons within a small rectangle of first layer
reducing size of convolutional layer
stride : size of jump between neighbouring rectangles
using large strides will significantly reduce the layer's size, but the cost is information loss
filters
the weights of the edges that connect the rectangle to the neuron form a matrix
matrix = filter
feature map
the whole layer applies the same filter
feature maps and filters in CNN
during training , the CNN learns the most useful filters
combines them into more complex patterns
stacking feature maps
packing 2D feature layers on top of each other -> 3D stacks
TensorFlow implementation
X is the input mini-batch (a 4D tensor)
filters is the set of filters to apply (a 4D tensor)
strides is a four-element 1D array: the two central elements are the vertical and horizontal strides; the first and last elements must currently be equal to 1 (they may one day be used to specify a batch stride - to skip some instances - and a channel stride - to skip some of the previous layer's feature maps or channels)
padding must be either "VALID" or "SAME"
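a sketch of the tf.nn.conv2d call being described (tensor shapes are illustrative, matching the memory example below):

    import tensorflow as tf

    X = tf.random.normal([32, 150, 100, 3])     # mini-batch: 32 RGB images, 150x100
    filters = tf.random.normal([5, 5, 3, 200])  # 5x5 filters, 3 in-channels, 200 maps
    fmaps = tf.nn.conv2d(X, filters,
                         strides=[1, 1, 1, 1],  # [batch, vertical, horizontal, channel]
                         padding="SAME")        # "SAME": zero-pad to keep spatial size
    print(fmaps.shape)                          # (32, 150, 100, 200)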
memory requirements of CNNs
convolutional layers require a huge amount of RAM, especially during training
because the reverse pass of backpropagation requires all the intermediate values computed during the forward pass
example: 1 convolutional layer with 5*5 filters, 200 feature maps (150*100), stride=1, padding=same
total parameters: 15,200
225 million float multiplications
(with 32-bit floats) this = 11.4 MB RAM for 1 instance
100 instances = 1 GB RAM
if training crashes because of an out-of-memory error:
reduce the mini-batch size
reduce dimensionality using a stride, or remove a few layers
use 16bit floats instead of 32bit floats
distribute the CNN across multiple devices
pooling layers
goal: to reduce input image size(to reduce computational load,memory)
additional consequence : location invariance - the CNN can tolerate a small image shift
pooling layer: similar to a convolution layer - each neuron is connected to a rectangle of the input layer
rectangle size, stride, padding are defined the same way
there are no weights
aggregation : max or mean
max pooling : aggregation = max (the other inputs are dropped)
average pooling : aggregation = mean
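a sketch of max pooling with TensorFlow (shapes continue the example above and are illustrative):

    import tensorflow as tf

    fmaps = tf.random.normal([32, 150, 100, 200])
    # 2x2 max pooling with stride 2: halves height and width, keeps channel count
    pooled = tf.nn.max_pool2d(fmaps, ksize=2, strides=2, padding="VALID")
    print(pooled.shape)   # (32, 75, 50, 200)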
famous CNN: LeNet-5
the most widely known CNN architecture
AlexNet
much larger and deeper
it was the first to stack convolution layers directly on top of each other, instead of stacking a pooling layer on top of each convolutional layer
GoogLeNet
much deeper than previous CNNs
contains sub-networks called inception modules: use parameters much more efficiently
actually has 10 times fewer parameters than AlexNet (roughly 6 million instead of 60 million)
ResNet
extremely deep CNN composed of 152 layers
idea: skip connections(also called shortcut connections ): the signal feeding into a layer is also added to the output of a layer located a bit higher up the stack
11 Machine Learning Technologies Recurrent Neural Nets
classification + regression : given X, predict Y
time series :
sequence of data in time
the next data point depends on the previous data (historical data)
e.g. stock prices, weather
recurrent neural nets
feedforward neural nets
each connection is a forward connection
no loops
recurrent neural net
backward connections (and therefore loops) are allowed
simplest version : a neuron + self-loop
the output of the previous time step is also an input of the current step
unrolling the network through time
more complex network: input x and output y (both are vectors)
each neuron has 2 types of weights
one for the input x(t) and one for the output of the previous step y(t-1)
realisation: a part of the network can preserve some state across time steps
this is called a memory cell
basic cells: a single recurrent neuron, or a recurrent layer
memory cells
state of a memory cell : h(t) = f(h(t-1), x(t))
RNNs can do
in terms of input-output pairs
sequence-to-sequence: input sequence, output sequence
stock price prediction
sequence-to-vector: input sequence, output a single vector/data point
NLP: from a single sentence, guess the mood of the speaker
vector-to-sequence: input a single vector/data point, output a sequence
from a single image, generate a description in words
encoder-decoder model: encoder: sequence-to-vector, decoder: vector-to-sequence
machine translation from one language to another
training RNNs
idea: unroll the network through time, then use regular backpropagation
BPTT (backpropagation through time)
issue: the unrolled network is typically very large
mitigation strategies
ReLU
Dropout
gradient clipping ,faster optimizers
truncated backpropagation: set a time horizon for the unrolling
long short-term memory (LSTM)
2 states: h(t) for short-term, c(t) for long-term memory
an LSTM cell can learn to recognise an important input, store it in the long-term state, learn to preserve it for as long as it is needed, and learn to extract it whenever it is needed
this explains why they have been amazingly successful at capturing long-term patterns in time series, long texts, audio recordings, and more
input gate : recognise important inputs
forget gate: keeps content in memory until needed (then erases it)
output gate: generates the output
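a tf.keras sketch of a sequence-to-vector LSTM model (shapes are illustrative: 50 time steps with 1 feature each, predicting the next value of the series):

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(50, 1)),
        tf.keras.layers.LSTM(32),   # swap in tf.keras.layers.GRU(32) for a GRU
        tf.keras.layers.Dense(1),
    ])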
gated recurrent unit (GRU)
a simpler version of LSTM
performance-wise it's still quite good (seems to perform as well as LSTM)