IE 598 – BIG DATA OPTIMIZATION (Fall 2016)

TR: 11:00AM – 12:20PM, 206 TB

 

 

Updates

 

§  12-16-2016:  Project report is due at midnight.

 

§  12-06-2016:  Project Presentations III

•    Meghana Bande, Incremental Gradient Methods - SAGA

•    Juho Kim, Optimization Perspectives on Approximate Bayesian Inference

•    Harsh Gupta and Sai Kiran Burle, Fast Streaming Regression

§  12-01-2016:  Project Presentations II

•    Juan Xu and Kaiqing Zhang, Communication-Efficient Algorithms for Distributed Optimization

•    Shripad Gade, Matrix Completion

•    Peter Liu, Second-Order Methods for Large-Scale Problems

•    Joseph Lubars, Approximate Graph Matching with Convex Optimization

§  11-29-2016:  Project Presentations I

•    Samantha Thrust and Hongyi Li, Alternative Views of Accelerated Gradient Descent

•    Jialin Song, Economic Analysis of Public Wi-Fi Monetization and Content Promotion

•    Lucas Buccafusca, AlphaGo: The Program Ahead of Its Time

•    Heran Zhang, Optimization in Deep Reinforcement Learning

§  11-22-2016 and 11-24-2016: Thanksgiving Break!

§  11-17-2016:   Lecture25-Extra Lecture: Optimization for Poisson Likelihood Models

§  11-15-2016:   Lecture24-Summary and Outlook [Slide]

                          Case study on large-scale logistic regression; check the Python demo and Python code here (a minimal stand-alone sketch also appears at the end of the updates list).

§  11-10-2016:   Lecture23-Incremental Gradient Methods for Finite Sum Problems

                          (SAG, SAGA, SVRG, etc)

§  11-08-2016:  Lecture22-Stochastic Optimization II  

                          (Stochastic Mirror Descent)

§  11-03-2016:  Lecture21-Stochastic Optimization I

                          (Sample Average Approximation, Stochastic Gradient Descent)

§  11-01-2016:  Template for final report: download here

§  11-01-2016:  Lecture20-Splitting Algorithms for Nonsmooth + Nonsmooth Problems.

       (Douglas Rachford splitting, Peaceman Rachford splitting, ADMM, etc.)

§  10-27-2016: Lecture19-(Accelerated) Proximal Gradient Method for Smooth + Nonsmooth Problems.

        Case study on LASSO; check the Python demo for LASSO and Python code here.

§  10-25-2016: Lecture18-Mirror Prox for Saddle Point Problems

§  10-20-2016: Lecture17-Smoothing Techniques II & Proximal Point Algorithm

§  10-18-2016: Lecture16-Smoothing Techniques I

§  10-18-2016: 2-round scribing + presentation: Sign up for presentation and scribing here

§  10-13-2016: Lecture15-From Subgradient Descent to Mirror Descent

§  10-11-2016:  Lecture14-Subgradient Method

§  10-06-2016: Lecture13-Case Study on Logistic Regression. Check the Python demo for logistic regression and Python code here.

§  10-04-2016: Lecture12-Coordinate Descent Algorithms

§  09-29-2016: Class cancelled due to the Allerton Conference

§  09-27-2016: Lecture11-Conditional Gradient (a.k.a. Frank Wolfe Algorithm)

§  09-22-2016: Lecture10-Projected Gradient Descent

§  09-20-2016: Lecture9-Gradient Descent and Its Acceleration 

§  09-15-2016:  Lecture8-Gradient Descent

§  09-13-2016:  Lecture7-Introduction to Optimization Algorithms

§  09-08-2016: Lecture6-Conic Programming

§  09-06-2016: Lecture5-Convex Optimization

§  08-29-2016 and 09-01-2016: Lecture3&4-Convex Functions

§  08-29-2016: Scribing: Sign up for scribing here, and find the LaTeX template here

§  08-25-2016:  Lecture2-Convex Sets

§  08-23-2016:  Lecture1-Introduction
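
For reference, since the Python demos linked above (the logistic regression case studies in Lectures 13 and 24) are not reproduced on this page, below is a minimal, self-contained sketch of batch gradient descent for logistic regression on synthetic data. The data, the step size 1/L, and the iteration count are illustrative assumptions only; this is not the course demo.

    import numpy as np

    # Synthetic data: n samples, d features, labels in {0, 1}.
    rng = np.random.default_rng(0)
    n, d = 1000, 20
    X = rng.standard_normal((n, d))
    w_true = rng.standard_normal(d)
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

    def loss_and_grad(w):
        # Average logistic loss (1/n) sum_i [log(1 + exp(x_i'w)) - y_i x_i'w] and its gradient.
        z = X @ w
        loss = np.mean(np.maximum(z, 0) + np.log1p(np.exp(-np.abs(z))) - y * z)  # numerically stable
        grad = X.T @ (1.0 / (1.0 + np.exp(-z)) - y) / n
        return loss, grad

    # Gradient descent with constant step size 1/L, where L = ||X||_2^2 / (4n)
    # bounds the Lipschitz constant of the gradient.
    L = np.linalg.norm(X, 2) ** 2 / (4 * n)
    w = np.zeros(d)
    for _ in range(500):
        _, g = loss_and_grad(w)
        w -= g / L
    print("final average loss:", loss_and_grad(w)[0])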

 

Course Information

Instructor

Niao He

Email: niaohe@illinois.edu

Office: Transportation Building 211

  Office Hours: Tuesdays, 10-11AM, or by appointment via email

Course Description

Nearly every problem in machine learning and high-dimensional statistics can be formulated as the optimization of some function, possibly under constraints. In the wake of the Big Data era, there is an increasingly strong need for efficient and scalable algorithms for large-scale optimization problems, including problems with inherently high dimensions and problems involving truly massive data sets. The key aim of this course is to expose students to a wide range of modern algorithmic developments in optimization (mainly convex optimization) and to bring them near the frontier of research in large-scale optimization and machine learning.

 

Courtesy warning: The course will be theory-oriented and will emphasize a deep understanding of the structure of optimization problems and the computational complexity of numerical algorithms. It will not cover software or tools used for big data analytics and is not an applications course.

Prerequisites

   Students are expected to have strong knowledge of linear algebra, real analysis, and probability theory. Some prior exposure to optimization at a graduate level is preferred.

Textbooks

 No textbook is required, but you are highly encouraged to read the following:

•    Ben-Tal & Nemirovski. Lectures on Modern Convex Optimization, SIAM. 2011

•    Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic. 2004

•    Boyd & Vandenberghe. Convex Optimization. Cambridge University Press. 2004

•    Sra, Nowozin, Wright (eds). Optimization for Machine Learning. MIT Press. 2011

•    Bubeck. Convex Optimization: Algorithms and Complexity. In Foundations and Trends in Machine Learning. 2015

Grading Policy

There will be no homework or exams.  Grades will be based on

•    Scribing (25%): each student will be assigned to scribe one or two lectures.

•    Project (50% report + 25% presentation): the project could be a survey of a topic based on several papers, a novel large-scale experiment, work on a concrete open theoretical question, an application of optimization techniques to your own field, the formalization of an interesting new topic, or an attempt to relate several problems. The end result should be a 10-15 page report and a 25-30 minute presentation. I will provide a few ideas for possible project topics.

Instruction on Scribe Notes

  Remember that you are not scribing the notes for me, but for the entire class! Be responsible and rigorous: take careful notes during class and be a good scribe.

 

  Here is the rubric:

•    Minimal (5/10): your source file should compile; your notes must contain no serious typos or mistakes, should reflect everything we discussed in class, and should be written in complete sentences; you should fill in the skipped proofs where expected.

•    Acceptable (8/10): your notes should contain enough explanation (between sections, after main theorems) that even a student who missed the class can understand the material reasonably well; the proofs should be succinct, clear, and correct; your notation and numbering should be consistent.

•    Exemplary (10/10): your notes should form a holistic, ideally instructive, piece of work. I encourage you to read the related reference books or papers to build a comprehensive understanding of the topic, and you may add your own thoughts, questions, or relevant material to the notes.

 

 Please email me the LaTeX source and PDF of your notes, together with any included figures, within 3 days after the class for which you were the designated scribe. If your notes fall below the acceptable level, you will be asked to make revisions.

 

Topical Outline and Tentative Schedule

Module I: Introduction and Fundamentals (3 weeks)

–      Learning under the lenses of optimization

–      Basics of convex analysis

–      Conic programming: LP, SOCP, SDP

–      Structured optimization: convexity, smoothness, separability, etc.

Module II: Smooth Convex Optimization (4 weeks)

–      Gradient descent method

–      Accelerated gradient descent methods (Nesterov’s method and its variants; see the sketch following this module’s list)

–      Projection-free methods (Frank-Wolfe)

–      Coordinate descent methods

–      Applications in logistic regression, matrix completion, image processing, etc.
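
To make the contrast between plain and accelerated gradient descent concrete, here is a minimal sketch comparing the two on a synthetic least-squares problem. The problem instance, the step size 1/L, and the momentum schedule are standard textbook choices assumed for illustration; they are not taken from the lecture notes.

    import numpy as np

    # Synthetic least-squares problem: f(x) = 0.5 * ||Ax - b||^2.
    rng = np.random.default_rng(0)
    m, d = 200, 50
    A = rng.standard_normal((m, d))
    b = rng.standard_normal(m)
    L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of the gradient

    def grad(x):
        return A.T @ (A @ x - b)

    def gd(iters=300):
        # Plain gradient descent with step size 1/L.
        x = np.zeros(d)
        for _ in range(iters):
            x = x - grad(x) / L
        return x

    def nesterov(iters=300):
        # Accelerated gradient descent (Nesterov/FISTA-style momentum).
        x = y = np.zeros(d)
        t = 1.0
        for _ in range(iters):
            x_next = y - grad(y) / L           # gradient step at the extrapolated point
            t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2
            y = x_next + (t - 1) / t_next * (x_next - x)
            x, t = x_next, t_next
        return x

    f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
    print("gradient descent :", f(gd()))
    print("accelerated      :", f(nesterov()))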

Module III: Nonsmooth Convex Optimization (4 weeks)

–      Subgradient descent/Mirror descent

–      Smoothing techniques

–      Mirror-Prox method

–      Primal-Dual algorithms

–      Proximal algorithms (see the LASSO sketch following this module’s list)

–      Applications in support vector machines, sparsity learning, etc.
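
As a companion to the proximal-algorithm item above (and to the LASSO case study of Lecture 19 in the updates), here is a minimal sketch of the proximal gradient method (ISTA) with soft-thresholding on synthetic data. The instance, the regularization weight lam, and the fixed step size 1/L are illustrative assumptions; the course's linked demo may differ.

    import numpy as np

    # LASSO: minimize 0.5 * ||Ax - b||^2 + lam * ||x||_1.
    rng = np.random.default_rng(0)
    m, p = 200, 500
    A = rng.standard_normal((m, p))
    x_true = np.zeros(p)
    x_true[:10] = rng.standard_normal(10)      # sparse ground truth
    b = A @ x_true + 0.01 * rng.standard_normal(m)
    lam = 0.1
    L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of the smooth part

    def soft_threshold(v, tau):
        # Proximal operator of tau * ||.||_1 (componentwise shrinkage).
        return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

    x = np.zeros(p)
    for _ in range(500):
        g = A.T @ (A @ x - b)                  # gradient of the smooth term
        x = soft_threshold(x - g / L, lam / L) # gradient step, then prox step

    print("approximately nonzero coefficients:", np.count_nonzero(np.abs(x) > 1e-3))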

Module IV: Stochastic and Online Optimization (3 weeks)

–      Sample average approximation

–      Stochastic approximation (stochastic gradient descent; see the sketch following this module’s list)

–      Incremental gradient methods

–      Online algorithms

–      Applications in empirical risk minimization, online learning, etc.
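
The following minimal sketch illustrates stochastic gradient descent with Polyak-Ruppert averaging on a synthetic least-squares (empirical risk minimization) problem. The step-size rule 0.1/sqrt(t), the averaging scheme, and the iteration budget are illustrative assumptions rather than anything prescribed in the lectures.

    import numpy as np

    # Empirical risk minimization with squared loss on synthetic data.
    rng = np.random.default_rng(0)
    n, d = 5000, 20
    X = rng.standard_normal((n, d))
    w_true = rng.standard_normal(d)
    y = X @ w_true + 0.1 * rng.standard_normal(n)

    w = np.zeros(d)
    w_avg = np.zeros(d)                        # Polyak-Ruppert averaged iterate
    for t in range(1, 20001):
        i = rng.integers(n)                    # sample one data point uniformly at random
        g = (X[i] @ w - y[i]) * X[i]           # unbiased stochastic gradient of the risk
        w = w - 0.1 / np.sqrt(t) * g           # diminishing step size
        w_avg += (w - w_avg) / t               # running average of the iterates
    print("error of averaged iterate:", np.linalg.norm(w_avg - w_true))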

Module V: Optimization Perspectives on Selected Topics in Machine Learning (2 weeks)

–      Sparsity learning

–      Large-scale kernel machines

–      Reinforcement learning

–      Deep learning

 

Project Guidelines

   General guidelines

•    You must work individually or in pairs (at most 2 students per team).

•    The project could be a survey of a topic based on several papers, a novel large-scale experiment, work on a concrete open theoretical question, an application of optimization techniques to your own field, the formalization of an interesting new topic, or an attempt to relate several problems.

•    Your project may overlap with your research agenda or with a class project for another course, but it must emphasize the role of optimization.

•    The end result should be a 10-15 page written report in LaTeX and a 25-30 minute presentation.

 Deadlines

•    Proposal submission: Oct 15th, 2016

•    You should submit a 1-page outline containing the project topic, the names of the team members, a brief description of the background, how you plan to approach the problem, and references.

•    Final presentation (25%): Nov 29th, Dec 1st, Dec 6th, 2016

•    Final report submission (50%): Dec 16th, 2016, Midnight

 

Project Ideas

 

 Below are some broad areas and a few specific ideas:

•    Literature survey

(e.g., study 3-5 papers on the same topic)

•    Large-scale optimization algorithm and theory

(e.g., consider a new setting, propose a modified algorithm and analyze convergence)

•    Optimization modeling and applications

(e.g., formulate an optimization model for a practical problem in any field and apply algorithms discussed in class to solve it)

•    Numerical study and implementation

(e.g., implement and compare different algorithms and study their practical behaviors)

 

Theory:

   Interpretation of Nesterov’s accelerated algorithms

W. Su, S. Boyd, and E. Candes. A Differential Equation for Modeling Nesterov’s Accelerated Gradient Method: Theory and Insights. NIPS 2014.

L. Lessard, B. Recht, and A. Packard. Analysis and Design of Optimization Algorithms via Integral Quadratic Constraints. SIAM Journal on Optimization 2016.

A. Wibisono, A. Wilson, and M. Jordan. A Variational Perspective on Accelerated Methods in Optimization. arXiv 2016.

Z. Allen-Zhu and L. Orecchia. Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent. arXiv 2014.

S. Bubeck, Y. T. Lee, and M. Singh. A geometric alternative to Nesterov’s accelerated gradient descent. arXiv 2015.

G. Lan and Y. Zhou. An optimal randomized incremental gradient method. arXiv 2015.

N. Flammarion and F. Bach. From Averaging to Acceleration, There is Only a Step-size. arXiv 2015.

W. Krichene, A. Bayen, and P. Bartlett. Accelerated mirror descent in continuous and discrete time. NIPS 2015.

   Beyond convexity:  non-convex optimization

S. Ghadimi and G. Lan. Accelerated Gradient Methods for Nonconvex Nonlinear and Stochastic Programming. Mathematical Programming 2016.

E. Hazan, K. Levy, and S. Shalev-Shwartz. Beyond Convexity: Stochastic Quasi-Convex Optimization. arXiv 2015.

J. Lee, M. Simchowitz, M. Jordan, and B. Recht. Gradient Descent Converges to Minimizers. COLT 2016.

J. Sun, Q. Qu, and J. Wright. When Are Nonconvex Problems Not Scary? arXiv 2015.

R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping From Saddle Points - Online Stochastic Gradient for Tensor Decomposition. arXiv 2015.

H. Karimi, J. Nutini, and M. Schmidt. Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition. European Conference on Machine Learning and Knowledge Discovery in Databases, 2016. 

J. Liang, J. Fadili, and G. Peyré. A Multi-step Inertial Forward-Backward Splitting Method for Non-convex Optimization. NIPS 2016.

   Advances in incremental and stochastic algorithms

R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. NIPS 2013.

A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. NIPS 2014.

G. Lan and Y. Zhou. An optimal randomized incremental gradient method. arXiv 2015.

S. Reddi, S. Sra, B. Poczos, and A. Smola. Fast Stochastic Methods for Nonsmooth Nonconvex Optimization. NIPS 2016.

H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. NIPS 2015.

   Beyond first-order: second-order methods

G. Grapiglia and Y. Nesterov. Globally Convergent Second-order Schemes for Minimizing Twice-differentiable Functions. CORE Discussion Paper 2016.

Y. Nesterov, B. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 2006.

R. Byrd, S. Hansen, J. Nocedal, and Y. Singer. A Stochastic Quasi-Newton Method for Large-Scale Optimization. SIAM Journal on Optimization 2016.

A. Rodomanov and D. Kropotov. A Superlinearly-Convergent Proximal Newton-Type Method for the Optimization of Finite Sums. ICML 2016.

N. Agarwal, B. Bullins, and E. Hazan. Second Order Stochastic Optimization in Linear Time. arXiv 2016.

R. Kolte, M. Erdogdu, and A. Özgür. Accelerating SVRG via second-order information. OPT 2015.

   Beyond first-order: zero-order methods

S. Ghadimi and G. Lan. Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming. SIAM Journal on Optimization, 2013.

J. Duchi, M. Jordan, M. Wainwright, and A. Wibisono. Optimal rates for zero-order convex optimization: the power of two function evaluations. IEEE Transactions on Information Theory, 2015.

A. Belloni, T. Liang, H. Narayanan, and A. Rakhlin. Escaping the Local Minima via Simulated Annealing: Optimization of Approximately Convex Functions. JMLR 2015.

   Distributed and parallel optimization algorithms

B. Recht, C. Re, S. Wright, and F. Niu. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. NIPS 2011.

J. Duchi, M. Wainwright, and Y. Zhang. Communication-Efficient Algorithms for Statistical Optimization. JMLR 2013.

O. Shamir, N. Srebro, and T. Zhang. Communication-Efficient Distributed Optimization using an Approximate Newton-type Method. ICML 2014.

 

 Applications:

   Optimization in Machine Learning

L. Bottou, F. Curtis, and J. Nocedal. Optimization Methods for Large-Scale Machine Learning. arXiv 2016.

P. Zhao, T. Zhang. Stochastic Optimization with Importance Sampling for Regularized Loss Minimization. ICML 2015.

C. Wang, Y. Wang, W. E, and R. Schapire. Functional Frank-Wolfe Boosting for General Loss Functions. arXiv 2015.

F. Bach, G. Lanckriet, and M. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. ICML 2004.

   Optimization in Statistical Inference

X. Nguyen, M. Wainwright, and M. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. NIPS 2008.

A. Juditsky, A. Nemirovski. Nonparametric estimation by convex programming. The Annals of Statistics 2009.

A. Goldenshluger, A. Juditsky, and A. Nemirovski. Hypothesis testing by convex optimization. Electronic Journal of Statistics 2015.

M. Welling and Y. W. Teh. Bayesian Learning via Stochastic Gradient Langevin Dynamics. ICML 2011.

M. Hoffman, D. Blei, C. Wang, and J. Paisley. Stochastic Variational Inference. JMLR 2013.

   Optimization in Signal/ Image Processing

A. Chambolle, T. Pock. A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imaging. JMIV 2011.

E. J. Candes, X. Li, Y. Ma, and J. Wright. Robust Principal Component Analysis? Journal of the ACM 2011.

Y. Cherapanamjeri, K. Gupta, and P. Jain. Nearly-optimal Robust Matrix Completion. arXiv 2016.

A. Juditsky, F. K. Karzan, and A. Nemirovski. Randomized first order algorithms with applications to L1-minimization. Math. Program., Ser. A 2012.

   Optimization in Markov Decision Processes

Y. Chow, M. Ghavamzadeh. Algorithms for CVaR Optimization in MDPs. NIPS 2014.

R. Sutton, H. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora. Fast Gradient-Descent Methods for Temporal-Difference Learning with Linear Function Approximation. ICML 2009.

T. Jaakkola, M. Jordan, and S. Singh. On the Convergence of Stochastic Iterative Dynamic Programming Algorithms. Neural Computation 1994.