Large Scale Linear Algebra with Applications to Learning

Some Basic Information

Instructor: Constantine Caramanis


Office: TBA

Office Hours: TBA


Time: Wednesday 2:30-4:30 PM,

Location: 351 Meyer

Course Overview

This course develops some of the fundamental tools for large scale data analysis and machine learning. It is a course about data and computation, and thus focuses attention here. Specifically, the focus will fall on large scale linear algebra, including iterative, approximate and randomized algorithms for basic linear algebra computations and matrix functions. An important step along the way, will be to develop some of the basic results from Matrix Perturbation Theory. A fundamental companion to this will be the development of concentration inequalities for matrices. Throughout, these tools will be developed with a view towards applications in machine learning, such as clustering, dimensionality reduction, mixture models, etc.

Course Prerequisites

Good familiarity with Linear Algebra (e.g., facility with SVD, spectral theorem, matrix norms, etc.) is important, as we will freely use concepts, tools and techniques from linear algebra; the same goes for Probability and Stochastic Processes.


The grade for the course will be determined by the final written project. If you decide to take this class, I expect that we will discuss ideas for the written project before the 5th class. Projects could be of an expository, research or computational nature. Also, I will give out problem sets periodically, but these are not to be turned in, and are just extra exercises to help you practice material covered in class, or that fill in holes that were left during the lecture.

The course will be taught from a collection of sources; some books, but mainly research papers. These will be referenced and linked to (when possible) below.

Lecture Schedule and References

Here is a brief summary of what we covered in class, and the main references related to the material. I will continue filling this in as we go along.

  • Lecture 1, October 16: Some basic motivation and outline of the class. A quick introduction to spectral clustering, and motivation for considering matrix perturbation. For a much more in depth introduction on spectral clustering, including the leadup to the basic matrix perturbation results we will develop in subsequent classes, see this tutorial by Luxburg, and references therein.

  • Lecture 2, October 23: In this lecture we began discussing matrix perturbation. We considered the simple 2 by 2 case where notation is much more simple, and also the problem could essentially be solved by brute force. But we developed the results bringing out the main ideas that we will use in the general case. Then we began talking about invariant subspaces, and the QR decomposition.

    There are several excellent references. The main references we are using here are: Matrix Computations, by Golub and Van Loan, and also Matrix Perturbation Theory, by Stewart and Sun. An excellent linear algebra background book is Linear Algebra Done Right, by Axler.

  • Lecture 3, October 30: In this lecture we discussed the Schur decomposition, and showed how, among other things, this leads to results that demonstrate that normal matrices have a more stable spectrum. Indeed, several results based on, or related to, the Schur decomposition, show that the distance from being normal can control the instability of eigenvalues of a non-normal matrix. We also began our discussion of invariant subspaces in general, and we introduced the important linear operator T, given (in latex notation) by T_{A,B}(X) = AX - XB. We showed that if A and B have disjoint spectrum, then T is non-singular. We will soon use T to measure how well "separated" the spectrum of A is from that of B. We also showed that the spectrum of T is equal to the Minkowski-difference of the spectrum of A and the spectrum of B.

    The main reference for today's class is Stewart and Sun.

  • Lecture 4, November 6: I am traveling so there will be no in-person lecture. However, I have recorded a lecture which you can watch. It is divided into four parts which you can find here:

    Part 1 : review of invariant subspaces;
    Part 2 : the approximation theorem;
    Part 3 : the perturbation theorem for general matrices;
    Part 4 : two sin-theta theorems for Hermitian matrices.

    I have also posted the (hand-written) notes for this class here.

    This lecture finishes the main perturbation theory material. In particular, we covered the so-called approximation theorem, which says: how close is an invariant subspace of A to a given invariant subspace? The result relies critically on the separation between spectra of sub matrices, as measured by the smallest singular value of the (non-singular) operator T introduced in Lecture 3. We then showed that this notion of separation is Lipschitz continuous, which then allows us to immediately give perturbation theorems, which basically say: How close is an invariant subspace of A+E to an invariant subspace of A? Finally, we specialize everything to the setting of Hermitian matrices. Life is easier here because the existence of an invariant subspace is immediate, and so one can move directly to bounding the distance between subspaces. Thus we develop part of the so-called direct bounds, due to Davis and Kahan. It is these results that we use for spectral clustering. See also Problem Set 2.

    The main reference for today's class is Stewart and Sun.

  • Lecture 5, November 13: In this class, we look at the application of our sin theta theorems to spectral clustering, and along the way encounter our first use for matrix concentration inequalities. We then move to more computational issues, and the question of techniques for computing eigenvalues and eigenvectors for very large matrices. We will mostly be discussing techniques from sketching and randomized linear algebra, which will then motivate a more careful look at matrix concentration inequalities. For deterministic iterative algorithms, for those interested (total for the three lectures: about 2 hours), see here for three ghost-lectures on:

    The Power Method, and Lanczos Algorithm, part 1/2 , Lanczos Algorithm, part 2/2.

    These are the most popular iterative techniques for computing the top (and bottom) singular vectors of a matrix. There is a focus on understanding convergence rates, as well as quality of early termination. In the case of Lanczos, the important notion of Krylov subspaces (important in their own right, because of applications in other areas such as conjugatae gradient) is also developed.

    I have also posted the (hand-written) notes for these three lectures here.

  • Lecture 6, November 20: In this clas we finished the application of matrix perturbation to spectral clustering, and in particular obtain bounds for the planted model. To do this, we had to use the Matrix Bernstein Inequality. We also began our discussion of matrix sketching and randomized linear algebra.

  • Lecture 7, November 27: I am traveling so there will be no in-person lecture.

  • Lecture 8, December 11 : In this class, we start our series on matrix concentration inequalities, following Joel Tropp's lecture notes.

  • Lecture 9, December 18 : Guest Lecture -- Professor Elad Hazan. Multiplicative Weights and SDPs.

  • Lecture 10, December 25 : Guest Lecture -- Professor Nati Srebro. Optimization, Low Rank, and Matrix Completion.

  • Lecture 11, January 1 : In this video lecture, we continue our work on matrix concentration inequalities. Last time we had used Lieb's theorem to obtain what Tropp calls the "Master Bounds." In this series of three short video lectures, we recap the derivation of the master bounds, apply them to Gaussian matrix series, derive the Matrix Chernoff Bounds and apply them to bounding the spectrum of a submatrix of a given deterministic matrix, and, finally, we prove the Matrix Bernstein inequality, which we had used at the last part of our lecture on Spectral Clustering.

    Matrix Concentration Inequalities, part 1/3,
    Matrix Concentration Inequalities, part 2/3 ,
    Matrix Concentration Inequalities, part 3/3.

  • Lecture 12, January 8 : In this lecture we pick up where the video lectures left off, and discuss Matrix Concentration Inequalities and applications to randomized linear algebra. In particular, we discuss work in the recent STOC '13 paper Low Rank Approximation and Regression in Input Sparsity Time by K. Clarkson and D.P. Woodruff.

  • Lecture 13, January 15 : We continue discussing matrix concentration inequalities, and the need for bounds and algorithms that operate in the intrinsic dimension or complexity, be that defined by sparsity of the input, or fast decay of the spectrum. We also begin discussion the power of tensors for mixture models.

  • Lecture 14, January 22 : We continue our discussion of tensors, including the basic decomposition algorithms and some applications.

    Problem Sets

    In addition to the many problems I mark as exercises during the class, these are for practice, or to fill in details, and are not to be turned in. Some are easier/harder than others.

    Problem Set 1

    Problem Set 2

    Problem Set 3