They proposed a way to alleviate the unstable gradients problem. We need the signal to flow properly in both directions (forward and backward). We don't want the signal to die out, nor do we want it to explode and saturate. For the signal to flow properly, we need the variance of the outputs of each layer to be equal to the variance of its inputs (imagine a chain of amplifiers: your voice has to come out of each amplifier at the same amplitude as it went in), and we need the gradients to have equal variance before and after flowing through a layer in the reverse direction. This is not possible unless fan-in = fan-out, but Glorot and Bengio proposed a good compromise that has proven to work very well in practice:
the connection weights of each layer must be initialized randomly as described below, using a normal distribution with mean 0 and variance σ² = 1 / fan_avg, where fan_avg = (fan_in + fan_out) / 2 (this is Glorot initialization; the variances for He and LeCun initialization are given in the table below).
Using SVD, we can decompose a matrix.
Initialization | Activation functions | σ² (Normal) |
---|---|---|
Glorot | None, tanh, sigmoid, softmax | 1 / fan_avg |
He | ReLU, Leaky ReLU, ELU, GELU, Swish, Mish | 2 / fan_in |
LeCun | SELU | 1 / fan_in |
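As an illustration, here is a minimal Keras sketch (the layer sizes and input size are arbitrary placeholders) showing how to pick the matching initializer for each activation:

```python
import tensorflow as tf

# He initialization pairs with ReLU-family activations, LeCun with SELU,
# and Glorot (the Keras default) with tanh/sigmoid/softmax layers.
model = tf.keras.Sequential([
    tf.keras.Input(shape=[28 * 28]),  # placeholder input size
    tf.keras.layers.Dense(100, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.Dense(100, activation="selu",
                          kernel_initializer="lecun_normal"),
    tf.keras.layers.Dense(10, activation="softmax",
                          kernel_initializer="glorot_uniform"),
])
```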
Once a dataset has been clustered, it is usually possible to measure each instance's affinity with each cluster. Each instance's feature vector x can then be replaced with the vector of its cluster affinities. If there are k clusters, this affinity vector is k-dimensional.
The cluster affinities can often be useful as extra features.
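As a sketch of this idea in scikit-learn (the blob dataset and k = 5 are arbitrary placeholders):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, random_state=42)  # placeholder data

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)

# transform() gives each instance's distance to every cluster center:
# a k-dimensional affinity vector that can be used as extra features.
X_affinities = kmeans.transform(X)
print(X_affinities.shape)  # (500, 5)
```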
We use the "sparse_categorical_crossentropy" loss because we have sparse labels (i.e., for each instance, there is just a target class index, from 0 to 9), and classes are exclusive. If instead we had one target probability per class for each instance (such as one-hot vectors, e.g., [0, 0, 0, 1, 0, 0, 0, 0, 0, 0] to represent class 3), then we would need to use the "categorical_crossentropy" loss instead.
If we were doing binary classification or multilabel binary classification, then we would use the "sigmoid" activation function in the output layer instead of the "softmax" activation function, and we would use the "binary_crossentropy" loss.
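For instance, a minimal sketch of the sparse-label case above (the architecture and optimizer choice are arbitrary assumptions):

```python
import tensorflow as tf

# Minimal 10-class classifier; the sparse labels are just class indices 0-9.
model = tf.keras.Sequential([
    tf.keras.Input(shape=[28 * 28]),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="nadam",
              metrics=["accuracy"])
```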
```sh
docker pull tensorflow/serving
```

```sh
docker run -it --rm -v "/home/raha/Desktop/DNN/my_mnist_model:/models/my_mnist_model" \
    -p 8500:8500 -p 8501:8501 -e MODEL_NAME=my_mnist_model tensorflow/serving
```
- -it: Makes the container interactive and displays the server's output
- --rm: Tells Docker to delete the container once it is stopped, so it doesn't clutter the machine
- -v: Makes the host's my_mnist_model directory available to the container at the path /models/my_mnist_model
- -p: The Docker image is configured to use port 8500 to serve the gRPC API and 8501 to serve the REST API by default.
- -e: Sets the container's MODEL_NAME environment variable, so TF Serving knows which model to serve. By default, it will look for models in the /models directory and it will automatically serve the latest version it finds.
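Once the container is running, a minimal sketch of querying its REST API on port 8501 (this assumes the served model takes 28x28 MNIST-like images; adjust the input shape to whatever the model expects):

```python
import json
import numpy as np
import requests

# Placeholder input: three 28x28 images with pixel values in [0, 1].
X_new = np.random.rand(3, 28, 28)

request_json = json.dumps({
    "signature_name": "serving_default",
    "instances": X_new.tolist(),
})

# TF Serving's REST API: POST to /v1/models/<model_name>:predict
server_url = "http://localhost:8501/v1/models/my_mnist_model:predict"
response = requests.post(server_url, data=request_json)
response.raise_for_status()
y_proba = np.array(response.json()["predictions"])
print(y_proba.round(2))
```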
If you set the learning rate slightly too high, it will make progress very quickly at first, but it will end up dancing around the optimum and never really settling down.
- Set the learning rate as a function of the iteration number.
- Use a constant learning rate for a number of epochs, then a smaller learning rate for another number of epochs, and so on.
- Measure the validation error every N steps and reduce the learning rate by a constant factor when the error stops dropping (see the sketch below).
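A minimal Keras sketch of that last strategy, using the ReduceLROnPlateau callback (the factor and patience values are arbitrary; the data names are placeholders):

```python
import tensorflow as tf

# Halve the learning rate whenever the validation loss has not improved
# for 5 consecutive epochs.
lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=5)

# history = model.fit(X_train, y_train, epochs=100,
#                     validation_data=(X_valid, y_valid),
#                     callbacks=[lr_scheduler])
```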
- An MLP may not have any activation function for the output layer, so it's free to output any value; this is generally fine.
- If you want to guarantee that the output will always be positive, then you should use the ReLU activation function in the output layer, or the softplus activation function, which is a smooth variant of ReLU.
- If you want to guarantee that the predictions will always fall within a given range of values, then you should use the sigmoid function or the hyperbolic tangent, and scale the targets to the appropriate range.
- You usually want mean squared error for regression
- If you have a lot of outliers in the training set, you may prefer mean absolute error
- You may want to use the Huber loss, which is a combination of both: it is quadratic when the error is smaller than a threshold, and linear when the error is larger.
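For reference, a sketch of the Huber loss in its piecewise form, with error e and threshold δ (Keras's Huber loss uses δ = 1 by default):

$$
L_\delta(e) =
\begin{cases}
\tfrac{1}{2} e^2 & \text{if } |e| \le \delta \\
\delta \left( |e| - \tfrac{1}{2}\delta \right) & \text{otherwise}
\end{cases}
$$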
Hyperparameter | Typical value |
---|---|
# hidden layers | Typically 1 to 5 |
# neurons per hidden layer | Typically 10 to 100 |
# output neurons | 1 per prediction dimension |
Hidden activation | ReLU |
Output activation | None, or ReLU/softplus (if positive outputs) or sigmoid/tanh (if bounded outputs) |
Loss function | MSE, or Huber if outliers |
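Putting these guidelines together, a minimal regression MLP sketch in Keras (the input dimension, layer sizes, and Huber loss are placeholder choices):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=[8]),                      # placeholder input dimension
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(1),                       # no output activation: unbounded output
])
model.compile(loss=tf.keras.losses.Huber(), optimizer="adam")
```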
You can combine the ModelCheckpoint and EarlyStopping callbacks to save checkpoints of your model (in case your computer crashes) and to interrupt training early when there is no more progress. In that case the number of epochs can be set to a large value; just make sure the learning rate is not too small, or else training might keep making slow progress until the end.
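A minimal sketch of combining both callbacks (the checkpoint path, patience, and data names are placeholders):

```python
import tensorflow as tf

checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    "my_checkpoints.keras", save_best_only=True)   # keep the best model seen so far
early_stopping_cb = tf.keras.callbacks.EarlyStopping(
    patience=10, restore_best_weights=True)        # stop after 10 epochs with no improvement

# history = model.fit(X_train, y_train, epochs=1000,
#                     validation_data=(X_valid, y_valid),
#                     callbacks=[checkpoint_cb, early_stopping_cb])
```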
Number of Hidden Layers
An MLP with just one hidden layer can theoretically model even the most complex functions, provided it has enough neurons. But for complex problems, deep networks have a much higher parameter efficiency than shallow ones: they can model complex functions using exponentially fewer neurons than shallow nets, allowing them to reach much better performance with the same amount of training data.
To understand why, suppose you are asked to draw a forest using some drawing software, but you are forbidden to copy and paste anything. It would take an enormous amount of time: you would have to draw each tree individually, branch by branch, leaf by leaf. If you could instead draw one leaf, copy and paste it to draw a branch, then copy and paste that branch to create a tree, and finally copy and paste this tree to make a forest, you would be finished in no time.
Real-world data is often structured in such a hierarchical way, and deep neural networks automatically take advantage of this fact: lower hidden layers model low-level structures (e.g., line segments of various shapes and orientations), intermediate hidden layers combine these low-level structures to model intermediate-level structures (e.g., squares, circles), and the highest hidden layers and the output layer combine these intermediate structures to model high-level structures (e.g., faces).
Number of Neurons per Hidden Layer
As for the hidden layers, it used to be common to size them to form a pyramid, with fewer and fewer neurons at each layer, the rationale being that many low-level features can coalesce into far fewer high-level features. However, this practice has been largely abandoned because it seems that using the same number of neurons in all hidden layers performs just as well in most cases, or even better; plus, there is only one hyperparameter to tune, instead of one per layer. That said, depending on the dataset, it can sometimes help to make the first hidden layer bigger than the others.
Just like the number of layers, you can try increasing the number of neurons gradually until the network starts overfitting. Alternatively, you can try building a model with slightly more layers and neurons than you actually need, then use early stopping and other regularization techniques to prevent it from overfitting too much. A scientist at Google has dubbed this the "stretch pants" approach: instead of wasting time looking for pants that perfectly match your size, just use large stretch pants that will shrink down to the right size. With this approach you avoid bottleneck layers that could ruin your model. Indeed, if a layer has too few neurons, it will not have enough representational power to preserve all the useful information from the inputs. No matter how big and powerful the rest of the network is, that information will never be recovered. In general you will get more bang for your buck by increasing the number of layers instead of the number of neurons per layer.
The main benefit of using large batch sizes is that hardware accelerators like GPUs can process them efficiently. However, large batch sizes often lead to training instabilities, especially at the beginning of training, and the resulting model may not generalize as well as a model trained with a small batch size.
This hierarchical architecture helps DNNs:
- converge faster
- generalize better to new datasets
It starts by training many different models for a few epochs, then it eliminates the worst models and keeps only the top 1/factor of them, repeating this selection process until a single model is left.
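This is the idea behind Hyperband-style tuning. A minimal sketch using the Keras Tuner library (the build_model architecture, objective, and budget values are assumptions for illustration):

```python
import keras_tuner as kt
import tensorflow as tf

def build_model(hp):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(hp.Int("n_units", 16, 128, step=16), activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(loss="sparse_categorical_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    return model

# factor=3 means roughly the top third of the models survive each selection round.
tuner = kt.Hyperband(build_model, objective="val_accuracy",
                     max_epochs=10, factor=3)
# tuner.search(X_train, y_train, validation_data=(X_valid, y_valid))
```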
Output of a recurrent layer for a single instance: $\hat{\mathbf{y}}_{(t)} = \phi(\mathbf{W}_x^\top \mathbf{x}_{(t)} + \mathbf{W}_{\hat{y}}^\top \hat{\mathbf{y}}_{(t-1)} + \mathbf{b})$, where $\phi$ is the activation function, $\mathbf{x}_{(t)}$ is the input vector at time step $t$, and $\hat{\mathbf{y}}_{(t-1)}$ is the layer's output from the previous time step.
- Unstable gradients, which can be alleviated using various techniques, including recurrent dropout and recurrent layer normalization
- A (very) limited short-term memory, which can be extended using LSTM and GRU cells
For short sequences, a regular dense network can work quite well, and for very long sequences, CNNs can also work quite well.
Many researchers prefer to use the tanh activation function in RNNs rather than ReLU.
Since the output of a recurrent neuron at time step t is a function of all the inputs from the previous time steps, it has a form of memory.
A cell's state at time step t, denoted h_(t), is a function of some inputs at that time step and of its state at the previous time step: h_(t) = f(x_(t), h_(t−1)).
- sequence-to-sequence network: forecast time series
- sequence-to-vector network: feed in the words of a movie review and output a sentiment score
- vector-to-sequence network: caption an image
- encoder-decoder: translate a sentence (the encoder is a sequence-to-vector network, the decoder is a vector-to-sequence network)
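As an illustration of the second architecture, a minimal sequence-to-vector sketch in Keras (the input shape and layer sizes are arbitrary placeholders):

```python
import tensorflow as tf

# Sequence-to-vector: the recurrent layer reads the whole input sequence,
# but only its final output is passed on (return_sequences=False by default).
model = tf.keras.Sequential([
    tf.keras.Input(shape=[None, 1]),     # variable-length univariate sequences
    tf.keras.layers.SimpleRNN(32),
    tf.keras.layers.Dense(1),            # e.g., a single forecast or sentiment score
])
model.compile(loss="mse", optimizer="adam")
```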
When a time series is correlated with a lagged version of itself, we say that the time series is autocorrelated.
Differencing is a common technique used to remove trend and seasonality from a time series: it's easier to study a stationary time series, meaning one whose statistical properties remain constant over time, without any seasonality or trends. Once you're able to make accurate forecasts on the differenced time series, it's easy to turn them into forecasts for the actual time series by just adding back the past values that were previously subtracted
Differencing over a single time step produces an approximation of the derivative of the time series, so it will eliminate any linear trend, transforming it into a constant value. If the original time series has a quadratic trend, then one round of differencing is not enough: two rounds of differencing will eliminate quadratic trends.
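A minimal pandas sketch of one round of differencing and of adding back the past value to recover a forecast for the original series (the values are placeholders):

```python
import pandas as pd

series = pd.Series([100, 112, 125, 139, 154], name="demand")  # placeholder values

diffed = series.diff().dropna()           # first-order differencing removes a linear trend
next_diff = diffed.iloc[-1]               # naive forecast of the next differenced value
next_value = series.iloc[-1] + next_diff  # add back the last observed value
print(next_value)
```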
It models the time series in the same way as ARIMA, but it additionally models a seasonal component for a given frequency (e.g., weekly), using the exact same ARIMA approach.
Good p, q, P, and Q values are usually fairly small (typically 0 to 2, sometimes up to 5 or 6), and d and D are typically 0 or 1, sometimes 2. There are more principled approaches to selecting good hyperparameters, based on analyzing the ACF and PACF or minimizing the AIC or BIC, but grid search is a good place to start.
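A minimal statsmodels sketch of fitting a seasonal ARIMA model (the synthetic series, the (p, d, q)(P, D, Q) values, and the seasonal period s = 7 are placeholder assumptions):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(42)
series = rng.normal(size=200).cumsum()   # placeholder series with a stochastic trend

# order=(p, d, q), seasonal_order=(P, D, Q, s); s=7 would mean weekly seasonality in daily data.
model = ARIMA(series, order=(2, 1, 1), seasonal_order=(0, 1, 1, 7))
results = model.fit()
print(results.forecast(steps=7))         # forecast the next 7 time steps
```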
The random forest algorithm introduces extra randomness when growing trees; instead of searching for the very best
feature when splitting a node, it searches for the best feature among a random subset of features. By default,
it samples √n features (where n is the total number of features).
It is hard to tell in advance whether a random forest will perform better or worse than extra-trees. Generally, the only way to know is to try both and compare them.
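A minimal scikit-learn sketch of that comparison (the synthetic dataset and hyperparameters are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
ext_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)

# Same data, same CV splits: just compare the mean scores.
print("Random forest:", cross_val_score(rnd_clf, X, y, cv=5).mean())
print("Extra-trees:  ", cross_val_score(ext_clf, X, y, cv=5).mean())
```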
Reducing C makes the street larger, but it also leads to more margin violations, so there is less risk of overfitting. If you reduce it too much, the model ends up underfitting. If your SVM model is overfitting, you can try regularizing it by reducing C.
"decision_function" function in SVM measures the signed distance between each instance and the decision boundary.
>>> svm_clf.decision_function(X_new)
array([0.66, -0.22])
Unlike LogisticRegression, LinearSVC doesn't have a predict_proba() method to estimate the class probabilities. That said, if you use the SVC class instead of LinearSVC, and if you set its probability hyperparameter to True, then the model will fit an extra model at the end of training to map the SVM decision function scores to estimated probabilities. Under the hood, this requires using 5-fold cross-validation to generate out-of-sample predictions for every instance in the training set, then training a LogisticRegression model, so it will slow down training considerably. After that, the predict_proba() and predict_log_proba() methods will be available.
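A minimal sketch of this (the moons dataset and hyperparameters are placeholders):

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.15, random_state=42)  # placeholder data

# probability=True fits an extra calibration model via 5-fold CV,
# which makes predict_proba() available but slows training down.
svm_clf = make_pipeline(StandardScaler(), SVC(probability=True, random_state=42))
svm_clf.fit(X, y)
print(svm_clf.predict_proba(X[:2]).round(3))
```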
Autoencoders can act as feature detectors, and they can be used for unsupervised pretraining of deep neural networks.
Some autoencoders are generative models
A neuron dies when its weights get tweaked in such a way that the input of the ReLU function is negative for all instances in the training set, so it just keeps outputting zeros, and gradient descent does not affect it anymore because the gradient of the ReLU function is zero when its input is negative. A dead neuron may come back to life if its inputs evolve over time.
- Leaky ReLU
- randomized leaky ReLU (RReLU), seemed to act as a regularizer
- parametric leaky ReLU (PReLU)
They all suffer from not being smooth functions: their derivatives change abruptly at z = 0. This sort of discontinuity can make gradient descent bounce around the optimum and slow down convergence.
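A minimal Keras sketch using a leaky ReLU after a dense layer (the layer sizes are placeholders; the default negative slope is about 0.3, and the argument name varies slightly across Keras versions, so it is left at its default here):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=[28 * 28]),                       # placeholder input size
    tf.keras.layers.Dense(100, kernel_initializer="he_normal"),
    tf.keras.layers.LeakyReLU(),                           # leaky ReLU applied as its own layer
    tf.keras.layers.Dense(10, activation="softmax"),
])
```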