In this page, we introduce a differential-based method for vector and matrix derivatives (matrix calculus), which needs only a few simple rules to derive most matrix derivatives. This method is useful and well established in mathematics; however, few documents describe it clearly and in detail. Therefore, this page aims to give a comprehensive introduction to matrix calculus via differentials.
* If you want results only, there is an awesome online tool, Matrix Calculus. If you want the "how to," let's get started.
- $x$, $\mathbf{x}$, and $\mathbf{X}$ denote a scalar, a vector, and a matrix respectively.
- Letters from the first half of the alphabet ($a$, $\mathbf{a}$, $\mathbf{A}$, $\dots$) denote constants, and letters from the second half ($x$, $\mathbf{x}$, $\mathbf{X}$, $\dots$) denote variables.
- $\mathbf{X}^\top$ denotes the matrix transpose, $\operatorname{tr}(\mathbf{X})$ is the trace, $|\mathbf{X}|$ is the determinant, and $\operatorname{adj}(\mathbf{X})$ is the adjugate matrix.
- $\otimes$ is the Kronecker product and $\odot$ is the Hadamard (element-wise) product.
- Here we use numerator layout, while the online tool Matrix Calculus seems to use a mixed layout. Please refer to Wiki - Matrix Calculus - Layout Conventions for the detailed layout definitions, and keep in mind that different layouts lead to different results. In numerator layout,

  $$\frac{\partial y}{\partial \mathbf{x}} = \left[ \frac{\partial y}{\partial x_1}, \cdots, \frac{\partial y}{\partial x_n} \right], \qquad \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix}, \qquad \left( \frac{\partial y}{\partial \mathbf{X}} \right)_{ij} = \frac{\partial y}{\partial X_{ji}},$$

  i.e. the scalar-by-vector derivative is a row vector, and the scalar-by-matrix derivative has the shape of $\mathbf{X}^\top$.
- Identities 1
- Identities 2
- Identities 3 - chain rules
- Identities 4 - total differential. In fact, all of identities 1 are matrix forms of the total differential in eq. (24).
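For reference, identities 1 are the standard relations between a differential and the corresponding derivative in numerator layout; assuming their numbering matches the equations cited in the examples below, they read

$$\begin{aligned} dy &= \frac{\partial y}{\partial x}\,dx, &(1)\\ dy &= \frac{\partial y}{\partial \mathbf{x}}\,d\mathbf{x}, &(2)\\ dy &= \operatorname{tr}\left( \frac{\partial y}{\partial \mathbf{X}}\,d\mathbf{X} \right), &(3)\\ d\mathbf{y} &= \frac{\partial \mathbf{y}}{\partial x}\,dx, &(4)\\ d\mathbf{y} &= \frac{\partial \mathbf{y}}{\partial \mathbf{x}}\,d\mathbf{x}. &(5) \end{aligned}$$

Identities 2 are the rules for computing the differentials themselves; the exact list here may differ from the original table, but the core rules are standard, e.g.

$$d(\mathbf{X}\mathbf{Y}) = (d\mathbf{X})\mathbf{Y} + \mathbf{X}\,d\mathbf{Y}, \qquad d(\mathbf{X}^\top) = (d\mathbf{X})^\top, \qquad d(\mathbf{X}^{-1}) = -\mathbf{X}^{-1}(d\mathbf{X})\mathbf{X}^{-1},$$

$$d|\mathbf{X}| = \operatorname{tr}\left( \operatorname{adj}(\mathbf{X})\,d\mathbf{X} \right), \qquad d\,\sigma(\mathbf{X}) = \sigma'(\mathbf{X}) \odot d\mathbf{X} \ \text{ for element-wise } \sigma.$$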
To derive a matrix derivative, we repeatedly apply identities 1 (the process is essentially a chain rule), assisted by identities 2.
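As a small illustration of this workflow (a made-up example, not necessarily one of the numbered examples below): for $f = \mathbf{a}^\top \mathbf{X} \mathbf{b}$ with constant vectors $\mathbf{a}$ and $\mathbf{b}$,

$$df = \mathbf{a}^\top (d\mathbf{X})\, \mathbf{b} = \operatorname{tr}\left( \mathbf{a}^\top (d\mathbf{X})\, \mathbf{b} \right) = \operatorname{tr}\left( \mathbf{b}\mathbf{a}^\top \, d\mathbf{X} \right),$$

and matching the last form against eq. (3) gives $\frac{\partial f}{\partial \mathbf{X}} = \mathbf{b}\mathbf{a}^\top$.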
finally, from eq. (2), we get .
finally, from eq. (3), we get .
finally, from eq. (1), we get .
finally, from eq. (5), we get .
finally, from eq. (2), we get .
finally, from eq. (3), we get . From line 3 to line 4, we use the conclusion of an earlier example; that is to say, we can derive more complicated matrix derivatives by properly utilizing existing ones. From line 6 to line 7, we use the fact that a scalar equals its own trace to introduce $\operatorname{tr}(\cdot)$, in order to use eq. (3) later; this step is common in scalar-by-matrix derivatives.
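The trace manipulation used here rests on two facts: a scalar equals its own trace, and the trace is invariant under cyclic permutation,

$$a = \operatorname{tr}(a) \ \text{ for scalar } a, \qquad \operatorname{tr}(\mathbf{A}\mathbf{B}\mathbf{C}) = \operatorname{tr}(\mathbf{C}\mathbf{A}\mathbf{B}) = \operatorname{tr}(\mathbf{B}\mathbf{C}\mathbf{A}).$$

Wrapping the scalar differential in $\operatorname{tr}(\cdot)$ and cycling the factors until $d\mathbf{X}$ stands alone on the right puts the expression exactly into the form required by eq. (3).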
finally, from eq. (3), we get .
finally, from eq. (3), we get .
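Since results like these are easy to get wrong by a transpose, a quick numerical check helps. Below is a minimal NumPy sketch (the two derivatives it verifies, $\partial(\mathbf{a}^\top\mathbf{X}\mathbf{b})/\partial\mathbf{X} = \mathbf{b}\mathbf{a}^\top$ and $\partial\log|\mathbf{X}|/\partial\mathbf{X} = \mathbf{X}^{-1}$ in numerator layout, are standard results assumed for illustration, not necessarily the exact examples above):

```python
import numpy as np

def numerator_layout_grad(f, X, eps=1e-6):
    """Finite-difference derivative of scalar f w.r.t. matrix X in
    numerator layout: entry (i, j) of the result is df/dX[j, i]."""
    G = np.zeros((X.shape[1], X.shape[0]))
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = eps
            G[j, i] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
a, b = rng.standard_normal(3), rng.standard_normal(4)
X = rng.standard_normal((3, 4))

# d(a^T X b)/dX = b a^T in numerator layout
assert np.allclose(numerator_layout_grad(lambda X: a @ X @ b, X),
                   np.outer(b, a), atol=1e-5)

# d(log|X|)/dX = X^{-1} for invertible square X (numerator layout)
S = rng.standard_normal((4, 4)) + 4 * np.eye(4)  # well-conditioned
assert np.allclose(numerator_layout_grad(lambda X: np.log(np.linalg.det(X)), S),
                   np.linalg.inv(S), atol=1e-5)
```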
E.g. 5 - two-layer neural network, where the loss function is, e.g., Softmax Cross Entropy or MSE, and the activation function is element-wise, e.g., Sigmoid or ReLU; see the assumed form and the sketch below.
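For concreteness, a two-layer network of this kind can be written as

$$l = \operatorname{loss}\big( \sigma( \mathbf{W}_2 \, \sigma( \mathbf{W}_1 \mathbf{x} ) ), \, \mathbf{y} \big),$$

where $\mathbf{x}$ is the input, $\mathbf{y}$ is the target, and $\mathbf{W}_1$, $\mathbf{W}_2$ are the weight matrices; the goal is to derive $\frac{\partial l}{\partial \mathbf{W}_1}$ and $\frac{\partial l}{\partial \mathbf{W}_2}$. This particular form is an assumption for illustration; the original example may differ in details such as bias terms.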
finally, from eq. (3), we get .
finally, from eq. (3), we get .
Since
then
therefore
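To connect the derivation with running code, here is a minimal NumPy sketch, assuming the network form above with a Sigmoid activation and MSE loss (these concrete choices, and all names in the code, are assumptions for illustration). It computes $\frac{\partial l}{\partial \mathbf{W}_1}$ and $\frac{\partial l}{\partial \mathbf{W}_2}$ from the backpropagation formulas that the differential method yields, and checks one entry against finite differences:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W1, W2, x, y):
    h = sigmoid(W1 @ x)                       # hidden activations
    p = sigmoid(W2 @ h)                       # network output
    return 0.5 * np.sum((p - y) ** 2), h, p   # MSE loss

def gradients(W1, W2, x, y):
    """Gradients from the differential-derived formulas, stored in the
    shape of W1/W2 (i.e., transposes of the numerator-layout derivatives)."""
    loss, h, p = forward(W1, W2, x, y)
    d2 = (p - y) * p * (1 - p)        # dl/d(W2 h), using sigmoid' = s(1-s)
    d1 = (W2.T @ d2) * h * (1 - h)    # dl/d(W1 x), propagated back through W2
    return d2[:, None] @ h[None, :], d1[:, None] @ x[None, :]

rng = np.random.default_rng(0)
x, y = rng.standard_normal(5), rng.random(2)
W1, W2 = rng.standard_normal((3, 5)), rng.standard_normal((2, 3))

gW2, gW1 = gradients(W1, W2, x, y)

# finite-difference check of dl/dW1[0, 0]
eps = 1e-6
E = np.zeros_like(W1); E[0, 0] = eps
num = (forward(W1 + E, W2, x, y)[0] - forward(W1 - E, W2, x, y)[0]) / (2 * eps)
assert np.allclose(num, gW1[0, 0], atol=1e-6)
```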
* See examples.md for more examples.
Now, if we fully understand the core ideas of the above examples, I believe we can derive most matrix derivatives in Wiki - Matrix Calculus by ourselves. Please correct me if there is any mistake, and raise an issue to request the detailed steps of computing any matrix derivative that you are interested in.