**IE 598 BIG DATA OPTIMIZATION (Fall 2016)**

TR: 11:00AM-12:20PM, 206 Transportation Building

Updates

§ **12-16-2016:** Project report is due at midnight.

§ **12-06-2016:** Project Presentations III

Meghana Bande, Incremental Gradient Methods - SAGA

Juho Kim, Optimization Perspectives on Approximate Bayesian Inference

Harsh Gupta and Sai Kiran Burle, Fast Streaming Regression
§ **12-01-2016:** Project Presentations II

Juan Xu and Kaiqing Zhang, Communication-Efficient Algorithms for Distributed Optimization

Shripad Gade, Matrix Completion

Peter Liu, Second-Order Methods for Large-Scale Problems

Joseph Lubars, Approximate Graph Matching with Convex Optimization

§ **11-29-2016:** Project Presentations I

Samantha Thrust and Hongyi Li, Alternative Views of Accelerated Gradient Descent

Jialin Song, Economic Analysis of Public Wi-Fi Monetization and Content Promotion

Lucas Buccafusca, AlphaGo: The Program Ahead of Its Time

Heran Zhang, Optimization in Deep Reinforcement Learning
§ **11-22-2016 and 11-24-2016:** Thanksgiving Break!

§ **11-17-2016:** Lecture25-Extra Lecture: Optimization for Poisson Likelihood Models

§ **11-15-2016:** Lecture24-Summary and Outlook [Slide]

Case study on large-scale logistic regression; check the Python demo and Python code here.

§ **11-10-2016:** Lecture23-Incremental Gradient Methods for Finite Sum Problems (SAG, SAGA, SVRG, etc.)

§ **11-08-2016:** Lecture22-Stochastic Optimization II (Stochastic Mirror Descent)

§ **11-03-2016:** Lecture21-Stochastic Optimization I (Sample Average Approximation, Stochastic Gradient Descent)

§ **11-01-2016:** Template for final report: download here

§ **11-01-2016:** Lecture20-Splitting Algorithms for Nonsmooth + Nonsmooth Problems (Douglas-Rachford splitting, Peaceman-Rachford splitting, ADMM, etc.)

§ **10-27-2016:** Lecture19-(Accelerated) Proximal Gradient Method for Smooth + Nonsmooth Problems

Case study on LASSO; check the Python demo for LASSO and Python code here.

§ **10-25-2016:** Lecture18-Mirror Prox for Saddle Point Problems

§ **10-20-2016:** Lecture17-Smoothing Techniques II & Proximal Point Algorithm

§ **10-18-2016:** Lecture16-Smoothing Techniques I

§ **10-18-2016:** 2-round scribing + presentation: Sign up for presentation and scribing here

§ **10-13-2016:** Lecture15-From Subgradient Descent to Mirror Descent

§ **10-11-2016:** Lecture14-Subgradient Method

§ **10-06-2016:** Lecture13-Case Study on Logistic Regression. Check the Python demo for logistic regression and Python code here.

§ **10-04-2016:** Lecture12-Coordinate Descent Algorithms

§ **09-29-2016:** Class cancelled due to Allerton

§ **09-27-2016:** Lecture11-Conditional Gradient (a.k.a. Frank-Wolfe Algorithm)

§ **09-22-2016:** Lecture10-Projected Gradient Descent

§ **09-20-2016:** Lecture9-Gradient Descent and Its Acceleration

§ **09-15-2016:** Lecture8-Gradient Descent

§ **09-13-2016:** Lecture7-Introduction to Optimization Algorithms

§ **09-08-2016:** Lecture6-Conic Programming

§ **09-06-2016:** Lecture5-Convex Optimization

§ **08-29-2016 and 09-01-2016:** Lecture3&4-Convex Functions

§ **08-29-2016:** **Scribing:** Sign up for scribing here, and find the LaTeX template here

§ **08-25-2016:** Lecture2-Convex Sets

§ **08-23-2016:** Lecture1-Introduction


Course Information

Instructor

Niao He

Email: niaohe@illinois.edu

Office: Transportation Building 211

Office Hour: Tuesday, 10-11AM or by appointment via email

Course Description

Nearly every problem in machine learning and high-dimensional statistics can be formulated as the optimization of some function, possibly under constraints. In the wake of the Big Data era, there is an increasingly strong need for efficient and scalable algorithms for large-scale optimization problems, including problems with inherently high dimension and problems involving truly massive data sets. The key aim of this course is to expose students to a wide range of modern algorithmic developments in optimization (mainly convex optimization) and bring them near the frontier of research in large-scale optimization and machine learning.

*Courtesy warning:* The course will be theory-oriented and will emphasize a deep understanding of the structure of optimization problems and the computational complexity of numerical algorithms. The course will not cover software or tools used for big data analytics and is not an applications course.

Prerequisites

Students are expected to have strong knowledge of linear algebra, real analysis, and probability theory. Some prior exposure to optimization at a graduate level is preferred.

Textbooks

No book is required, but the following are highly recommended:

Ben-Tal & Nemirovski. *Lectures on Modern Convex Optimization*. SIAM, 2011.

Nesterov. *Introductory Lectures on Convex Optimization: A Basic Course*. Kluwer Academic, 2004.

Boyd & Vandenberghe. *Convex Optimization*. Cambridge University Press, 2004.

Sra, Nowozin, Wright (eds). *Optimization for Machine Learning*. MIT Press, 2011.

Bubeck. *Convex Optimization: Algorithms and Complexity*. Foundations and Trends in Machine Learning, 2015.

Grading Policy

There will be no homework or exams. Grades will be based on:

*Scribing* (25%): each student will be assigned to scribe lectures one or two times.

*Project* (50% report + 25% presentation): The project could be a survey of several papers on a topic, a novel large-scale experiment, work on a concrete open theoretical question, an application of optimization techniques to your own field, the formalization of an interesting new topic, or an attempt to relate several problems. The end result should be a 10-15 page report and a 25-30 minute presentation. I will provide a few ideas for possible project topics.

Instructions for Scribe Notes

Remember that you are not scribing the notes for me, but for the entire class! Be responsible and rigorous: take careful notes during class and be a good scribe.

Here is the rubric:

*Minimal (5/10)*: your source file should compile; your notes must contain no serious typos or mistakes, should reflect everything we discussed in class, and should be written in complete sentences; you should fill in the skipped proofs where expected.

*Acceptable (8/10)*: your notes should contain enough explanation (between sections, after main theorems) that even a student who missed the class can understand the material reasonably well; the proofs should be succinct, clear, and correct; your notation and numbering should be consistent.

*Exemplary (10/10)*: your notes should form a holistic piece of work and ideally should be instructive. I encourage you to read the related reference books or papers to form a comprehensive understanding of the topic, and you may add your own thoughts, questions, or relevant material to the notes.

Please email me the LaTeX and PDF of your notes, along with any figures included, **within 3 days** after the class for which you were the designated scribe. If your notes fall in the unacceptable region, you will be asked to make revisions.

Topical Outline and Tentative Schedule

Module I: Introduction and Fundamentals (3 weeks)

Learning through the lens of optimization

Basics of convex analysis

Conic programming: LP, SOCP, SDP

Structured optimization: convexity, smoothness, separability, etc.
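To give a concrete flavor of the conic programming topics in Module I, here is a minimal sketch that solves a toy LP with SciPy; the solver choice and problem data are purely illustrative and are not part of the course materials:

```python
import numpy as np
from scipy.optimize import linprog

# Toy LP: maximize x + y subject to x + y <= 1, x >= 0, y >= 0,
# rewritten in SciPy's standard form: minimize c^T z with A_ub z <= b_ub.
c = np.array([-1.0, -1.0])              # minimize -(x + y)
A_ub = np.array([[1.0, 1.0]])
b_ub = np.array([1.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
optimal_value = -res.fun                # optimum of the original maximization
```

SOCPs and SDPs generalize this template by replacing the nonnegative orthant with second-order or semidefinite cones, which is exactly the hierarchy Lecture 6 covers.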

Module II: Smooth Convex Optimization (4 weeks)

Gradient descent method

Accelerated gradient descent methods (Nesterov's method and its variants)

Projection-free methods (Frank-Wolfe)

Coordinate descent methods

Applications in logistic regression, matrix completion, image processing, etc.
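The first two methods above can be sketched in a few lines; the following toy comparison on a quadratic is illustrative only (the step size and momentum schedule are one common choice, not the course's canonical versions):

```python
import numpy as np

def gradient_descent(grad, x0, step, iters):
    # Plain gradient descent: x_{k+1} = x_k - step * grad(x_k)
    x = x0.copy()
    for _ in range(iters):
        x = x - step * grad(x)
    return x

def nesterov_agd(grad, x0, step, iters):
    # Nesterov's accelerated gradient: take the gradient step at an
    # extrapolated point y_k, then update the momentum coefficient t_k.
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(iters):
        x_new = y - step * grad(y)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x

# Toy smooth, strongly convex quadratic: f(x) = 0.5 x^T A x - b^T x
A = np.diag([1.0, 10.0])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b
x_star = np.linalg.solve(A, b)          # closed-form minimizer
x_gd = gradient_descent(grad, np.zeros(2), step=0.1, iters=200)
x_agd = nesterov_agd(grad, np.zeros(2), step=0.1, iters=200)
```

Both iterate with the same step size 1/L; the accelerated variant differs only in the extrapolation step, which is what improves the worst-case rate from O(1/k) to O(1/k^2).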

Module III: Nonsmooth Convex Optimization (4 weeks)

Subgradient descent/Mirror descent

Smoothing techniques

Mirror-Prox method

Primal-Dual algorithms

Proximal algorithms

Applications in support vector machines, sparsity learning, etc.
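As a flavor of the proximal algorithms in Module III, here is a minimal proximal gradient (ISTA) sketch for the LASSO; the test problem uses an orthonormal design purely so the answer is known in closed form:

```python
import numpy as np

def soft_threshold(v, tau):
    # Proximal operator of tau * ||.||_1 (soft-thresholding)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(A, b, lam, step, iters):
    # Proximal gradient (ISTA) for min_x 0.5*||Ax - b||^2 + lam*||x||_1:
    # a gradient step on the smooth part followed by the l1 prox.
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        g = A.T @ (A @ x - b)                      # gradient of smooth part
        x = soft_threshold(x - step * g, step * lam)
    return x

# With an orthonormal design the LASSO solution is soft_threshold(b, lam),
# which the iteration reproduces.
x_hat = ista(np.eye(3), np.array([3.0, 0.5, -2.0]), lam=1.0, step=1.0, iters=50)
```

The same scheme with Nesterov extrapolation gives FISTA, the accelerated variant covered in Lecture 19.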

Module IV: Stochastic and Online Optimization (3 weeks)

Sample average approximation

Stochastic approximation (stochastic gradient descent)

Incremental gradient methods

Online algorithms

Applications in empirical risk minimization, online learning, etc.
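The incremental gradient methods above can be illustrated with a compact SVRG sketch; the 1-D least-squares instance and all parameters below are toy choices for illustration only:

```python
import numpy as np

def svrg(grads, x0, step, epochs, m, seed=0):
    # SVRG for min_x (1/n) sum_i f_i(x); grads[i](x) returns the gradient
    # of f_i at x. Each epoch recomputes a full-gradient snapshot, then
    # takes m variance-reduced stochastic steps.
    n = len(grads)
    rng = np.random.default_rng(seed)
    x_tilde = float(x0)
    for _ in range(epochs):
        full = sum(g(x_tilde) for g in grads) / n      # snapshot gradient
        x = x_tilde
        for _ in range(m):
            i = rng.integers(n)
            v = grads[i](x) - grads[i](x_tilde) + full  # unbiased, low variance
            x -= step * v
        x_tilde = x
    return x_tilde

# Toy 1-D least squares: f_i(x) = 0.5 * (a_i*x - b_i)^2, minimizer x* = 1
a, b = [1.0, 2.0], [1.0, 2.0]
grads = [lambda x, ai=ai, bi=bi: ai * (ai * x - bi) for ai, bi in zip(a, b)]
x_hat = svrg(grads, x0=0.0, step=0.1, epochs=20, m=10)
```

Unlike plain SGD, the correction term `- grads[i](x_tilde) + full` drives the gradient variance to zero near the optimum, which is what yields linear convergence on strongly convex finite sums.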

Module V: Optimization Perspectives for Selective Topics in Machine Learning (2 weeks)

Sparsity learning

Large-scale kernel machines

Reinforcement learning

Deep learning

Project Guidelines

**General guidelines**

You must work individually or in pairs (<=2 students).

The project could be a survey of several papers on a topic, a novel large-scale experiment, work on a concrete open theoretical question, an application of optimization techniques to your own field, the formalization of an interesting new topic, or an attempt to relate several problems.

Your project may overlap with your research agenda or with a class project for another course, but you must emphasize the role of optimization.

The end result should be a 10-15 page written report in LaTeX and a 25-30 minute presentation.

**Deadlines**

**Proposal submission**: Oct 15, 2016

You should submit a 1-page outline with the project topic, the names of team members, a brief description of the background, how you plan to approach the problem, and references.

**Final presentation** (25%): Nov 29, Dec 1, and Dec 6, 2016

**Final report submission** (50%): Dec 16, 2016, midnight

Project Ideas

**Below are some broad areas and a few specific ideas:**

*Literature survey*

(e.g., study 3-5 papers on the same topic)

*Large-scale optimization algorithms and theory*

(e.g., consider a new setting, propose a modified algorithm, and analyze its convergence)

*Optimization modeling and applications*

(e.g., formulate an optimization model for a practical problem in any field and apply algorithms discussed in class to solve it)

*Numerical study and implementation*

(e.g., implement and compare different algorithms and study their practical behavior)


*Theory:*

**Interpretation of Nesterov's accelerated algorithms**

W. Su, S. Boyd, and E. Candes. A Differential Equation for Modeling Nesterov's Accelerated Gradient Method: Theory and Insights. NIPS 2014.

L. Lessard, B. Recht, and A. Packard. Analysis and Design of Optimization Algorithms via Integral Quadratic Constraints. SIAM Journal on Optimization 2016.

A. Wibisono, A. Wilson, and M. Jordan. A Variational Perspective on Accelerated Methods in Optimization. arXiv 2016.

Z. Allen-Zhu and L. Orecchia. Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent. arXiv 2014.

S. Bubeck, Y. T. Lee, and M. Singh. A geometric alternative to Nesterov's accelerated gradient descent. arXiv 2015.

G. Lan and Y. Zhou. An optimal randomized incremental gradient method. arXiv 2015.

N. Flammarion and F. Bach. From Averaging to Acceleration, There is Only a Step-size. arXiv 2015.

W. Krichene, A. Bayen, and P. Bartlett. Accelerated mirror descent in continuous and discrete time. NIPS 2015.

**Beyond convexity: non-convex optimization**

S. Ghadimi and G. Lan. Accelerated Gradient Methods for Nonconvex Nonlinear and Stochastic Programming. Mathematical Programming 2016.

E. Hazan, K. Levy, and S. Shalev-Shwartz. Beyond Convexity: Stochastic Quasi-Convex Optimization. arXiv 2015.

J. Lee, M. Simchowitz, M. Jordan, and B. Recht. Gradient Descent Converges to Minimizers. COLT 2016.

J. Sun, Q. Qu, and J. Wright. When Are Nonconvex Problems Not Scary? arXiv 2015.

R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping From Saddle Points --- Online Stochastic Gradient for Tensor Decomposition. arXiv 2015.

H. Karimi, J. Nutini, and M. Schmidt. Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition. European Conference on Machine Learning and Knowledge Discovery in Databases, 2016.

J. Liang, J. Fadili, and G. Peyré. A Multi-step Inertial Forward--Backward Splitting Method for Non-convex Optimization. NIPS 2016.

**Advances in incremental and stochastic algorithms**

R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. NIPS 2013.

A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. NIPS 2014.

G. Lan and Y. Zhou. An optimal randomized incremental gradient method. arXiv 2015.

S. Reddi, S. Sra, B. Poczos, and A. Smola. Fast Stochastic Methods for Nonsmooth Nonconvex Optimization. NIPS 2016.

H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. NIPS 2015.

**Beyond first-order: second-order methods**

G. Grapiglia and Y. Nesterov. Globally Convergent Second-order Schemes for Minimizing Twice-differentiable Functions. CORE Discussion Paper 2016.

Y. Nesterov and B. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 2006.

R. Byrd, S. Hansen, J. Nocedal, and Y. Singer. A Stochastic Quasi-Newton Method for Large-Scale Optimization. SIAM Journal on Optimization 2016.

A. Rodomanov and D. Kropotov. A Superlinearly-Convergent Proximal Newton-Type Method for the Optimization of Finite Sums. ICML 2016.

N. Agarwal, B. Bullins, and E. Hazan. Second Order Stochastic Optimization in Linear Time. arXiv 2016.

R. Kolte, M. Erdogdu, and A. Özgür. Accelerating SVRG via second-order information. OPT 2015.

**Beyond first-order: zero-order methods**

S. Ghadimi and G. Lan. Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming. SIAM Journal on Optimization, 2013.

J. Duchi, M. Jordan, M. Wainwright, and A. Wibisono. Optimal rates for zero-order convex optimization: the power of two function evaluations. IEEE Transactions on Information Theory, 2015.

A. Belloni, T. Liang, H. Narayanan, and A. Rakhlin. Escaping the Local Minima via Simulated Annealing: Optimization of Approximately Convex Functions. JMLR 2015.

**Distributed and parallel optimization algorithms**

B. Recht, C. Re, S. Wright, and F. Niu. Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. NIPS 2011.

J. Duchi, M. Wainwright, and Y. Zhang. Communication-Efficient Algorithms for Statistical Optimization. JMLR 2013.

O. Shamir, N. Srebro, and T. Zhang. Communication-Efficient Distributed Optimization using an Approximate Newton-type Method. ICML 2014.

*Applications:*

**Optimization in Machine Learning**

L. Bottou, F. Curtis, and J. Nocedal. Optimization Methods for Large-Scale Machine Learning. arXiv 2016.

P. Zhao and T. Zhang. Stochastic Optimization with Importance Sampling for Regularized Loss Minimization. ICML 2015.

C. Wang, Y. Wang, E. Weinan, and R. Schapire. Functional Frank-Wolfe Boosting for General Loss Functions. arXiv 2015.

F. Bach, G. Lanckriet, and M. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. ICML 2004.

**Optimization in Statistical Inference**

X. Nguyen, M. Wainwright, and M. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. NIPS 2008.

A. Juditsky and A. Nemirovski. Nonparametric estimation by convex programming. The Annals of Statistics 2009.

A. Goldenshluger, A. Juditsky, and A. Nemirovski. Hypothesis testing by convex optimization. Electronic Journal of Statistics 2015.

M. Welling and Y. W. Teh. Bayesian Learning via Stochastic Gradient Langevin Dynamics. ICML 2011.

M. Hoffman, D. Blei, C. Wang, and J. Paisley. Stochastic Variational Inference. JMLR 2013.

**Optimization in Signal/Image Processing**

A. Chambolle and T. Pock. A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imaging. JMIV 2011.

E. J. Candes, X. Li, Y. Ma, and J. Wright. Robust Principal Component Analysis? Journal of the ACM, 2011.

Y. Cherapanamjeri, K. Gupta, and P. Jain. Nearly-optimal Robust Matrix Completion. arXiv 2016.

A. Juditsky, F. K. Karzan, and A. Nemirovski. Randomized first order algorithms with applications to L1-minimization. Mathematical Programming, Ser. A, 2012.

**Optimization in Markov Decision Processes**

Y. Chow and M. Ghavamzadeh. Algorithms for CVaR Optimization in MDPs. NIPS 2014.

R. Sutton, H. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora. Fast Gradient-Descent Methods for Temporal-Difference Learning with Linear Function Approximation. ICML 2009.

T. Jaakkola, M. Jordan, and S. Singh. On the Convergence of Stochastic Iterative Dynamic Programming Algorithms. Neural Computation 1994.