Linear Algebra for Deep Learning
The following content was posted to the lecture forum, summarizing my brief comments on parts of MIT 18.06 (Spring 2005) Linear Algebra by Prof. Strang (link). My main focus is on how this course connects to deep learning.
- lecture 1, the geometry of linear equations: helps you visualize vectors and matrix operations. Useful for understanding why the concept of ‘space’ is so important in deep learning, e.g., feature space, or how gradient descent searches for a low point (a minimum) in that space
- lecture 2, elimination with matrices: shows how systems of linear equations can be expressed and solved through matrix operations. This lecture helps you understand why deep learning is built around matrix operations, and why some vectors are treated as variables while certain matrices are called weights. Essentially, these are all just systems of equations, each describing relationships among different variables (a tiny numpy sketch of this appears after the list).
- lec3, multiplication and inverse matrices: explains matrix multiplication geometrically. This helps you grasp why we say a linear layer in deep learning ‘transforms’ features into another ‘space’, and why nonlinear functions like sigmoid need to be inserted between those transformations (see the sketch after the list showing how two linear layers collapse into one without a nonlinearity).
- lec4, matrix factorization (A = LU): it helps in understanding more advanced decompositions in machine learning and deep learning, such as PCA, SVD, and LoRA (a small SVD sketch follows the list)
- lec5-6, transposes, vector spaces, column space, and null space: introduces some common matrix operations. Deep learning uses transposes all the time, and this section covers their properties, e.g., \(X^TX\) (X’s transpose times X) is symmetric, and it is invertible whenever \(X\) has full column rank, which is why the linear regression analytical solution \((X^TX)^{-1}X^Ty\) is built around it (sketched in code after the list). It also explains vector spaces and subspaces, key for understanding deep learning at a more structural level: each layer operates on vectors within a vector space, and linear transformations merely preserve that structure, so out-of-distribution data outside the ‘span’ cannot be mapped in by a linear transformation alone, but nonlinear transformations (e.g., sigmoid, kernel tricks) can alter the space.
- lec7-8, solving \(Ax=b\): shows algebraic methods for finding solutions. In deep learning we usually approximate these with methods like gradient descent, but after learning this you’ll understand why we settle for approximations: exact solutions are computationally heavy, and ‘low-rank’/singular matrices can leave \(Ax=b\) with no exact solution at all (or with infinitely many), so an approximate answer is the best we can do (see the solve-vs-gradient-descent sketch after the list)
- lec9 & 14, independence, basis and dimension, orthogonal vectors and subspaces: helps in understanding information redundancy in deep learning and how some methods reduce it through decorrelation. Orthogonal subspaces also relate to more advanced matrix decomposition methods in deep learning, e.g., when separating a speech feature space into orthogonal ‘speaking style’ and ‘content’ subspaces, you can ask whether such a decomposition is really legitimate: are ‘speaking style’ and ‘content’ truly independent of each other? (a small orthogonality/redundancy check is sketched after the list)
- lec15-16, projection matrices: at the moment I cannot come up with a direct deep learning use, but understanding them helps in grasping other matrix operations. It shows why some decompositions can be viewed as feature projections, or, as another example, why the analytical solution of linear regression represents a projection onto the model space (like a line in a 2D plane); you’ll see how the geometry and the formulas elegantly align (sketched after the list).
- lec19-21, determinants, eigenvalues and eigenvectors: important for understanding transformations (all those feedforward layers/projection matrices in deep learning). Eigenvalues describe how a transformation stretches or shrinks vectors along certain directions, e.g., in SVD, directions with larger singular values (the eigenvalue counterpart used for general matrices) carry more of the transformation’s effect and are treated as more important for prediction. Phenomena like ‘gradient vanishing’ (or ‘gradient explosion’) can also be explained by eigenvalues: if a transformation has eigenvalues smaller than 1 in magnitude, applying it repeatedly makes a vector smaller and smaller until it effectively becomes zero -> the gradient vanishes. Which kind of network repeatedly applies the same transformation? RNNs. So if you’ve learned linear algebra well, when you see an RNN design you might immediately think, ‘wait, this setup can cause gradient vanishing or explosion (determined by the eigenvalues)’ (see the last sketch after the list). The determinant is also worth looking at, mainly because it’s used to find eigenvalues; of course it has many other uses, though I can’t think of them all right now.
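Below are a few minimal numpy sketches of the points above; all matrices, shapes, and data in them are made up purely for illustration, not taken from the lectures. First, lecture 2’s point that a system of equations and a weight matrix are the same kind of object:

```python
import numpy as np

# The system  2x + 3y = 8,  x - y = -1  written as  A @ v = b
A = np.array([[2.0, 3.0],
              [1.0, -1.0]])      # coefficient matrix (what deep learning calls "weights")
b = np.array([8.0, -1.0])

v = np.linalg.solve(A, b)        # Gaussian elimination, as in lecture 2
print(v)                         # [1. 2.]  ->  x = 1, y = 2

# Read the other way around, a linear layer is the same object: given an input v,
# it computes A @ v; training just adjusts the entries of A.
print(A @ v)                     # reproduces b
```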
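For lec3, a sketch of why a nonlinearity between linear layers matters: without it, two weight matrices collapse into a single linear map (random weights, hypothetical layer sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # a 3-dimensional feature vector
W1 = rng.normal(size=(4, 3))      # maps the 3-d feature space into a 4-d space
W2 = rng.normal(size=(2, 4))      # maps the 4-d space into a 2-d space

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Without a nonlinearity, two layers are just one linear transformation:
h_linear = W2 @ (W1 @ x)
h_collapsed = (W2 @ W1) @ x       # identical result, so the extra depth adds nothing
print(np.allclose(h_linear, h_collapsed))   # True

# With sigmoid in between, the composition can no longer be written as a single matrix:
h_nonlinear = W2 @ sigmoid(W1 @ x)
print(h_linear, h_nonlinear)
```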
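For lec4, a sketch of the low-rank idea behind PCA/SVD/LoRA: use SVD to replace a (randomly generated, stand-in) weight matrix with the product of two thin matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))          # a made-up stand-in for a weight matrix

# Full SVD, then keep only the top-r singular directions
U, s, Vt = np.linalg.svd(W, full_matrices=False)
r = 8
W_lowrank = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

# Same spirit as LoRA: store a 64x64 update as two thin factors
A = U[:, :r] * s[:r]                   # 64 x r
B = Vt[:r, :]                          # r x 64
print(np.allclose(W_lowrank, A @ B))   # True: 2*64*8 numbers instead of 64*64

# How much of W the rank-8 piece captures (Frobenius norm ratio)
print(np.linalg.norm(W_lowrank) / np.linalg.norm(W))
```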
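For lec5-6, a sketch of the \(X^TX\) point on synthetic regression data: it is symmetric, and because this \(X\) has full column rank it is invertible, which is what the analytical least-squares solution relies on:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))               # 100 samples, 3 features (full column rank)
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=100)

G = X.T @ X
print(np.allclose(G, G.T))                  # True: X^T X is symmetric

# Analytical linear regression: w = (X^T X)^{-1} X^T y
w_hat = np.linalg.solve(G, X.T @ y)
print(w_hat)                                # close to true_w
```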
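For lec7-8, a sketch contrasting an exact solve of \(Ax=b\) with a gradient-descent approximation, plus the singular case where no unique solution exists (matrix size and learning rate are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 50)) + 50 * np.eye(50)   # a well-conditioned square system (made up)
b = rng.normal(size=50)

x_exact = np.linalg.solve(A, b)      # direct elimination: exact (up to round-off) but O(n^3)

# Gradient descent on the least-squares loss ||Ax - b||^2, the way deep learning would do it:
x = np.zeros(50)
lr = 1e-4                            # chosen by hand to be stable for this problem
for _ in range(2000):
    x -= lr * 2 * A.T @ (A @ x - b)
print(np.linalg.norm(x - x_exact))   # tiny: GD only approaches the exact answer iteratively

# And the failure mode the lecture warns about: a singular matrix has no unique solution
try:
    np.linalg.solve(np.ones((2, 2)), np.array([1.0, 2.0]))
except np.linalg.LinAlgError as err:
    print("singular system:", err)
```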
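For lec9/14, a sketch of how you might check the orthogonality and redundancy questions numerically; the ‘style’ and ‘content’ matrices here are just random placeholders, not real speech features:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two candidate subspaces: columns of S span a 'style' space, columns of C a 'content' space
S = rng.normal(size=(16, 3))
C = rng.normal(size=(16, 3))

# Orthogonality check: S^T C should be (numerically) zero if the subspaces are orthogonal.
# For random vectors it is not, which is exactly the kind of assumption worth questioning.
print(np.abs(S.T @ C).max())

# Redundancy check inside one feature matrix: correlated columns show up off the diagonal
F = np.c_[S, S[:, [0]] + 0.01 * rng.normal(size=(16, 1))]   # last column nearly duplicates the first
corr = np.corrcoef(F, rowvar=False)
print(np.round(corr, 2))            # the near-1 off-diagonal entry reveals the redundancy
```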
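For lec15-16, a sketch showing that the least-squares fit is literally a projection onto the column space of the model matrix (random data again):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 2))                 # model space: the column space of A
b = rng.normal(size=100)                      # target vector, generally outside that space

# Projection matrix onto the column space of A:  P = A (A^T A)^{-1} A^T
P = A @ np.linalg.inv(A.T @ A) @ A.T
p = P @ b                                     # the projection of b

# The same point reached via the least-squares coefficients (linear regression):
w = np.linalg.solve(A.T @ A, A.T @ b)
print(np.allclose(p, A @ w))                  # True: fitting = projecting b onto the model space

# Projection properties from the lectures: P is symmetric and idempotent (P @ P = P)
print(np.allclose(P, P.T), np.allclose(P @ P, P))
```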
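For lec19-21, a sketch of the eigenvalue intuition for vanishing/exploding behaviour: apply the same (randomly generated) matrix many times, once rescaled so all eigenvalue magnitudes are below 1 and once so the largest is above 1:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
rho = np.abs(np.linalg.eigvals(W)).max()   # spectral radius of W
W_small = 0.5 * W / rho                    # all eigenvalue magnitudes <= 0.5
W_large = 2.0 * W / rho                    # largest eigenvalue magnitude = 2

v = rng.normal(size=8)
for name, M in [("|eig| <= 0.5", W_small), ("max |eig| = 2", W_large)]:
    x = v.copy()
    for _ in range(30):                    # apply the same map 30 times, like unrolling an RNN
        x = M @ x
    print(name, np.linalg.norm(x))
# The first norm shrinks toward 0 (vanishing); the second blows up (exploding).
```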