IE 598 – BIG DATA OPTIMIZATION (Spring 2018)

MW: 4:00PM – 5:20PM, 203 TB

 

 

Updates

 

§  04-24-2018:  Final report LaTeX template here.

§  04-16-2018:  Lecture slides on Summary and Nonconvex Optimization

§  04-16-2018:  Python Demo (Subgradient method, PG, Smoothing, ADMM) for LASSO here (a minimal proximal-gradient sketch follows this list).

§  03-28-2018:  Python Demo (SGD, SVRG) for Logistic Regression here.

§  02-07-2018:  Python Demo (GD, AGD, CD) for Logistic Regression here.

§  01-27-2018:  You can sign up your team and presentation paper here.

§  01-17-2018:  Introduction lecture slide

§  01-17-2018:  Welcome to class!
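
To give a flavor of what the linked LASSO demo covers, here is a minimal proximal-gradient (ISTA) sketch in Python. It is an illustration only, not the course demo itself: the helper names (ista, soft_threshold), the random data, the regularization weight lam, and the iteration count are all illustrative choices.

    import numpy as np

    def soft_threshold(z, tau):
        # Proximal operator of tau * ||.||_1 (soft-thresholding).
        return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

    def ista(A, b, lam, n_iters=500):
        # Proximal gradient for min_x 0.5*||Ax - b||^2 + lam*||x||_1,
        # with constant step size 1/L, where L = ||A||_2^2.
        L = np.linalg.norm(A, 2) ** 2
        x = np.zeros(A.shape[1])
        for _ in range(n_iters):
            grad = A.T @ (A @ x - b)              # gradient of the smooth part
            x = soft_threshold(x - grad / L, lam / L)
        return x

    # Illustrative synthetic data (not from the demo): sparse ground truth plus noise.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 50))
    x_true = np.zeros(50)
    x_true[:5] = 1.0
    b = A @ x_true + 0.01 * rng.standard_normal(100)
    x_hat = ista(A, b, lam=0.1)
    print("nonzeros recovered:", int(np.sum(np.abs(x_hat) > 1e-3)))

The subgradient, smoothing, and ADMM variants in the linked demo tackle the same problem with different update rules.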

 

 

Course Information

Instructor

Niao He

Email: niaohe@illinois.edu

Office: Transportation Building 211

  Office Hour: Monday, 3-4PM or by appointment via email

Course Description

Nearly every problem in machine learning and high-dimensional statistics can be formulated as the optimization of some function, possibly under constraints. In the wake of the Big Data era, there is an increasingly strong need for efficient and scalable algorithms for large-scale optimization problems, including problems with inherently high dimensionality and problems involving truly massive data sets. The key aim of this course is to expose students to a wide range of modern algorithmic developments in optimization (mainly convex optimization) and to bring them near the frontier of research in large-scale optimization and machine learning.

 

Courtesy warning: The course will be theory-oriented and will emphasize a deep understanding of the structure of optimization problems and the computational complexity of numerical algorithms. The course will not cover software or tools used for big data analytics and is not an applications course.

Prerequisites

   Students are expected to have strong knowledge of linear algebra, real analysis, and probability theory. Some prior exposure to optimization at a graduate level is preferred.

Lecture Notes

   I will not post lecture notes on the website this semester, but you can refer to my past lecture notes: IE598-BigDataOpt-lecturenotes-fall2016.pdf

Reference Books

No textbook is required, but the following references are highly recommended:

•    Ben-Tal & Nemirovski. Lectures on Modern Convex Optimization, SIAM. 2011

•    Amir Beck. First-Order Methods in Optimization. MOS-SIAM Series on Optimization. 2017

•    Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic. 2004

•    Sra, Nowozin, Wright (eds). Optimization for Machine Learning. MIT Press. 2011

•    Bubeck. Convex Optimization: Algorithms and Complexity. In Foundations and Trends in Machine Learning. 2015

Grading Policy

There will be no homework or exams. Students will be divided into groups of up to 3, depending on class size. Grades will be based on:

•    Paper Presentation (25%): Each group will be assigned at least one paper to present.

•    Final Project (10% proposal + 40% report + 25% presentation): The project may consist of any combination of theoretical analysis, application, and literature survey (possibly all three), and you should aim for a project suitable for submission to a conference or journal. For example, you could conduct a novel large-scale experiment, tackle a concrete open theoretical question, apply optimization techniques to your own field, formalize an interesting new topic, or relate several problems. The end result should be a 10-15 page report in LaTeX in the given format, and a 25-30 minute presentation.

 Deadlines

•    Group decision: Feb 7th, 2018

o   You should email me the names of your team members and the paper(s) you select to present.

•    Proposal Submission: Feb 21st, 2018 (Feb 28th, 2018)

o   You should submit at least a one-page outline including the project topic, the names of team members, a brief description of the background, how you plan to approach the problem, and references.

•    Paper presentation: Mar 5th – Mar 14th, 2018

•    Proposal resubmission (if needed): Mar 14th, 2018

•    Final project presentation: Apr 23rd – May 2nd, 2018

•    Final report submission: May 11th, 2018

 

Final Presentation Schedule (April 23 – May 2)

Date | Group | Members | Paper | Time
Apr. 23 | 1 | Menglong Li | Online Submodular Minimization in Real Space [Slide] | 4:00-4:30PM
Apr. 23 | 2 | Zhikai Guo, Nitin Tangellamudi, Seo Taek Kong | Approximate Hessian Second Order Method [Slide] | 4:30-5:00PM
Apr. 25 | 3 | Myung Hwan Song | Understanding the Generalization Performance of SGD with Warm Restart [Slide] | 4:00-4:30PM
Apr. 25 | 4 | Peijun Xiao, Dawei Li, Junchi Yang | Relative Smoothness and Relative Continuity [Slide] | 4:30-5:00PM
Apr. 25 | 5 | Ravi Raman, Surya Sankagiri | Understanding Acceleration in Stochastic Gradient Descent [Slide] | 5:00-5:30PM
Apr. 30 | 6 | Amber Srivastava, Alireza Askarian | Reduction of Dynamic Graphs | 4:00-4:30PM
Apr. 30 | 7 | Bozun Wang | Predicting Length of Waiting Lists for Hospitals in England | 4:30-5:00PM
Apr. 30 | 8 | Yifan Hu | Conditional SGD | 5:00-5:30PM
May 2 | 9 | Eli Chien, Jianhao Peng, Chao Pan | On the Study of Stochastic Gradient Coordinate Descent (SGCD) | 4:00-4:30PM
May 2 | 10 | Siqi Zhang | Stochastic Mirror Descent Algorithms in Nonconvex Optimization | 4:30-5:00PM
May 2 | 11 | Zheng Liu | Reinforcement Learning for Radiation Source Detection | 5:00-5:30PM

Paper Presentation Schedule (March 5 – March 14)

 

Date | Group | Members | Paper | Time
Mar. 5 | 1 | Eli Chien, Jianhao Peng, Chao Pan | Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization [Slide] | 4:00-4:30PM
Mar. 5 | 2 | Siqi Zhang, Bozun Wang | Adaptive Subgradient Methods for Online Learning and Stochastic Optimization [Slide] | 4:30-5:00PM
Mar. 5 | 3 | Wenjing Yin | Hypothesis Testing by Convex Optimization | 5:00-5:30PM
Mar. 7 | 4 | Amber Srivastava, Alireza Askarian | Stochastic Primal-Dual Methods and Sample Complexity of Reinforcement Learning [Slide] | 4:00-4:30PM
Mar. 7 | 5 | Ravi Kiran Raman, Surya Sankagiri | The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning [Slide] | 4:30-5:00PM
Mar. 7 | 6 | Judy Gan, Yifan Hu | Trust Region Policy Optimization [Slide] | 5:00-5:30PM
Mar. 12 | 7 | Myung Hwan Song | An Overview of Gradient Descent Optimization Algorithms [Slide] | 4:00-4:30PM
Mar. 12 | 8 | Zhikai Guo, Nitin Tangellamudi, Seo Taek Kong | Identifying and Attacking the Saddle Point Problem in High-dimensional Non-convex Optimization [Slide] | 4:30-5:00PM
Mar. 12 | 9 | Peijun Xiao, Dawei Li, Junchi Yang | Theoretical Insights into the Optimization Landscape of Over-parameterized Shallow Neural Networks [Slide] | 5:00-5:30PM
Mar. 14 | 10 | Mariya Vasileva | Generative Adversarial Nets | 4:00-4:30PM
Mar. 14 | 11 | Menglong Li, Zheng Liu | Towards Principled Methods for Training Generative Adversarial Networks [Slide] | 4:30-5:00PM

 


 

 

Topical Outline and Schedule

Review Topics: Introduction and Fundamentals (2 weeks)

–      Learning through the lens of optimization

–      Basics of convex analysis

–      Basics of convex optimization theory

Basic Topics: Large-Scale Convex Optimization (4 weeks)

–      Smooth Optimization (gradient descent and acceleration, coordinate descent, etc.; see the GD/AGD sketch after this outline)

–      Nonsmooth Optimization (subgradient method, mirror descent, smoothing techniques, proximal algorithm, etc.)

–      Stochastic and Online Optimization (sample average approximation, stochastic approximation, online gradient descent, etc.)

Presentation Topics: Optimization for Machine Learning (2 weeks)

–      Optimization for Empirical Risk Minimization

–     Optimization for Deep Learning

–      Optimization for Generative Adversarial Networks

–      Optimization for Reinforcement Learning

–      Optimization for Statistical Inference

Advanced Topics: Recent Advances in Large-Scale Optimization (4 weeks)

–      Projection-free methods

–      Derivative-free methods

–      Variance-reduced methods

–      Saddle point optimization

–      Parallel and distributed optimization

–      Nonconvex optimization

Final Presentation (2 weeks)
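
As a concrete companion to the smooth-optimization topics above (and to the GD/AGD demo in the Updates), here is a small Python sketch contrasting plain gradient descent with Nesterov's accelerated gradient on l2-regularized logistic regression. The function names (logistic_grad, gd, agd), the synthetic data, the regularization weight mu, and the step-size estimate are illustrative assumptions, not the course demo.

    import numpy as np

    def logistic_grad(w, X, y, mu):
        # Gradient of (1/n) * sum_i log(1 + exp(-y_i * x_i^T w)) + (mu/2) * ||w||^2.
        z = y * (X @ w)
        return -(X.T @ (y / (1.0 + np.exp(z)))) / len(y) + mu * w

    def gd(X, y, mu, step, n_iters=300):
        # Plain gradient descent with a constant step size.
        w = np.zeros(X.shape[1])
        for _ in range(n_iters):
            w = w - step * logistic_grad(w, X, y, mu)
        return w

    def agd(X, y, mu, step, n_iters=300):
        # Nesterov's accelerated gradient with the standard t_k momentum schedule.
        w = np.zeros(X.shape[1])
        v = w.copy()
        t = 1.0
        for _ in range(n_iters):
            w_next = v - step * logistic_grad(v, X, y, mu)
            t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
            v = w_next + ((t - 1.0) / t_next) * (w_next - w)
            w, t = w_next, t_next
        return w

    # Illustrative synthetic classification data.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 20))
    y = np.sign(X @ rng.standard_normal(20) + 0.1 * rng.standard_normal(200))
    mu = 1e-2
    step = 1.0 / (0.25 * np.linalg.norm(X, 2) ** 2 / len(y) + mu)   # roughly 1/L for this loss
    w_gd, w_agd = gd(X, y, mu, step), agd(X, y, mu, step)
    print("gap between GD and AGD solutions:", np.linalg.norm(w_gd - w_agd))

The coordinate-descent (CD) variant in the demo uses the same gradient information but updates one coordinate of w at a time.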

 

 

 

Presentation Papers

 

A. Optimization in Empirical Risk Minimization

[1] Bousquet and Bottou. (2008). The Tradeoffs of Large Scale Learning. NIPS.

[2] Shalev-Shwartz and Zhang. (2013). Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization. JMLR.

[3] Zhang and Xiao. (2017). Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization. JMLR.

 B. Optimization in Deep Learning

[1] Ruder. (2017). An Overview of Gradient Descent Optimization Algorithms. arXiv.

[2] Duchi, Hazan, and Singer. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMLR.

[3] Wilson et al. (2017). The Marginal Value of Adaptive Gradient Methods in Machine Learning. NIPS.

[4] Dauphin et al. (2014). Identifying and Attacking the Saddle Point Problem in High-dimensional Non-convex Optimization. NIPS.

[5] Soltanolkotabi et al. (2017). Theoretical Insights into the Optimization Landscape of Over-parameterized Shallow Neural Networks. arXiv.

C. Optimization for Generative Adversarial Networks

[1] Goodfellow et al. (2014). Generative Adversarial Nets. NIPS.

[2] Arjovsky and Bottou. (2017). Towards Principled Methods for Training Generative Adversarial Networks. ICLR.

D. Optimization for Reinforcement Learning

[1] Chen and Wang. (2016). Stochastic Primal-Dual Methods and Sample Complexity of Reinforcement Learning. arXiv.

[2] Borkar and Meyn. (2000). The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning. SIAM Journal on Control and Optimization.

[3] Schulman et al. (2015). Trust Region Policy Optimization. ICML.

E. Optimization for Statistical Inference

[1] Juditsky and Nemirovski. (2009). Nonparametric Estimation by Convex Programming. The Annals of Statistics.

[2] Goldenshluger, Juditsky, and Nemirovski. (2015). Hypothesis Testing by Convex Optimization. Electronic Journal of Statistics.

[3] Hoffman et al. (2013). Stochastic Variational Inference. JMLR.

 

Project Ideas

 

Below are some broad areas and a few specific ideas:

 

General-Purpose Optimization Theory

(e.g., consider a new setting, propose a modified algorithm and analyze convergence)

 

•    Understanding the global convergence of deep learning

 

–      Choromanska, A., LeCun, Y., & Arous, G. B. (2015, June). Open problem: The landscape of the loss surfaces of multilayer networks. COLT, pp. 1756-1760.

–      Soltanolkotabi, M., Javanmard, A., & Lee, J. D. (2017). Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv:1707.04926.

–      Nouiehed, M., & Razaviyayn, M. (2018). Learning Deep Models: Critical Points and Local Openness. ICLR. 

–      Xie, B., Liang, Y., & Song, L. (2016). Diverse neural network learns true target functions. arXiv:1611.03131.

–      Nguyen, Q., & Hein, M. (2018). The loss surface and expressivity of deep convolutional neural networks. ICLR.

–      Yun, C., Sra, S., & Jadbabaie, A. (2018). Global optimality conditions for deep neural networks. ICLR.

 

•    Exploring the convergence of SGD for nonconvex optimization problems

 

–      Ghadimi, S., & Lan, G. (2013). Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4), 2341-2368.

–      Davis, D., & Drusvyatskiy, D. (2018). Stochastic subgradient method converges at the rate $O(k^{-1/4})$ on weakly convex functions. arXiv:1802.02988.

–      Raginsky, M., Rakhlin, A., & Telgarsky, M. (2017). Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis.  arXiv:1702.03849.

 

•    Exploring the acceleration of SGD algorithms

           

–      Allen-Zhu, Z. (2017, June). Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (pp. 1200-1205). ACM.

–      Jain, P., Kakade, S. M., Kidambi, R., Netrapalli, P., & Sidford, A. (2017). Accelerating Stochastic Gradient Descent. arXiv preprint arXiv:1704.08227.

–      Gadat, S., Panloup, F., & Saadane, S. (2018). Stochastic heavy ball. Electronic Journal of Statistics, 12(1), 461-529.

 

•      Interpretation of Nesterov’s accelerated algorithms

 

–       W. Su, S. Boyd, and E. Candès. A Differential Equation for Modeling Nesterov's Accelerated Gradient Method: Theory and Insights. NIPS 2014.

–       L. Lessard, B. Recht, and A. Packard. Analysis and Design of Optimization Algorithms via Integral Quadratic Constraints. SIAM Journal on Optimization 2016.

–       A. Wibisono, A. Wilson, and M. Jordan. A Variational Perspective on Accelerated Methods in Optimization. arXiv 2016.

–       Z. Allen-Zhu and L. Orecchia. Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent. arXiv 2014.

–       S. Bubeck, Y. T. Lee, and M. Singh. A geometric alternative to Nesterov's accelerated gradient descent. arXiv 2015.

–       N. Flammarion and F. Bach. From Averaging to Acceleration, There is Only a Step-size. arXiv 2015.

–       W. Krichene, A. Bayen, and P. Bartlett. Accelerated mirror descent in continuous and discrete time. NIPS 2015.

•      Exploiting random permutation/random shuffling

 

–       Gürbüzbalaban, M., Ozdaglar, A., & Parrilo, P. (2015). Why random reshuffling beats stochastic gradient descent. arXiv preprint arXiv:1510.08560.

–       Wright, S. J., & Lee, C. P. (2017). Analyzing Random Permutations for Cyclic Coordinate Descent. arXiv preprint arXiv:1706.00908.

–       Sun, R., Luo, Z. Q., & Ye, Y. (2015). On the expected convergence of randomly permuted ADMM. arXiv preprint arXiv:1503.06387.

 

•      Bridging control theory with optimization theory 

–       W. Su, S. Boyd, and E. Candès. A Differential Equation for Modeling Nesterov's Accelerated Gradient Method: Theory and Insights. NIPS 2014.

–       L. Lessard, B. Recht, and A. Packard. Analysis and Design of Optimization Algorithms via Integral Quadratic Constraints. SIAM Journal on Optimization 2016.

–       Lee, J. D., Panageas, I., Piliouras, G., Simchowitz, M., Jordan, M. I., & Recht, B. (2017). First-order Methods Almost Always Avoid Saddle Points. arXiv:1710.07406.

–       Wilson, A. C., Recht, B., & Jordan, M. I. (2016). A Lyapunov analysis of momentum methods in optimization. arXiv preprint arXiv:1611.02635.

–       Hardt, M., Ma, T., & Recht, B. (2016). Gradient descent learns linear dynamical systems. arXiv preprint arXiv:1609.05191.

–       Hu, B., & Lessard, L. (2017). Dissipativity Theory for Nesterov's Accelerated Method. arXiv preprint arXiv:1706.04381.

 

Domain-Specific Optimization Theory 

 

•    Improving and stabilizing the optimization of GANs

 

–      Arjovsky and Bottou.  (2017). Towards Principled Methods for Training Generative Adversarial Networks. ICLR. 

–      Berthelot, D., Schumm, T., & Metz, L. (2017). BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717.

–      Sinha, A., Namkoong, H., & Duchi, J. (2017). Certifiable Distributional Robustness with Principled Adversarial Training. arXiv preprint arXiv:1710.10571.

–      Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Klambauer, G., & Hochreiter, S. (2017). GANs trained by a two time-scale update rule converge to a Nash equilibrium. arXiv preprint arXiv:1706.08500.

 

•    Provable convergence for continuous control and reinforcement learning

 

–      Bhatnagar, S., Precup, D., Silver, D., Sutton, R. S., Maei, H. R., & Szepesvári, C. (2009). Convergent temporal-difference learning with arbitrary smooth function approximation. NIPS (pp. 1204-1212).

–      Karmakar, P., & Bhatnagar, S. (2017). Two time-scale stochastic approximation with controlled Markov noise and off-policy temporal-difference learning. Mathematics of Operations Research.

–      Dai, B., Shaw, A., Li, L., Xiao, L., He, N., Chen, J., & Song, L. (2017). Smoothed Dual Embedding Control. arXiv preprint arXiv:1712.10285.

–      Dalal, G., Szorenyi, B., Thoppe, G., & Mannor, S.  (2017). Finite sample analyses for TD(0) with Function Approximation. ArXiv: 1704.01161.

 

•    Iterative algorithms for Bayesian and probabilistic inference

 

–      Dai, B., He, N., Dai, H., & Song, L. (2016, May). Provable Bayesian inference via particle mirror descent. In Artificial Intelligence and Statistics (pp. 985-994).

–      Liu, Q., & Wang, D. (2016). Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems (pp. 2378-2386).

–      Jimenez Rezende, D., & Mohamed, S. (2015). Variational inference with normalizing flows. ICML.

 

 

Modeling and Applications

(e.g., formulate an optimization model from a practical problem in any field, apply algorithms discussed in class to solve it)

 

•    Robust/adversarial machine learning 

–       Madry, A., Makelov, A., Schmidt, L., Tsipras, D., & Vladu, A. (2017). Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.

–       Carlini, N., & Wagner, D. (2017, May). Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on (pp. 39-57).

 

•    Ranking / choice models

–       Radlinski, F., Kleinberg, R., & Joachims, T. (2008, July). Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th international conference on Machine learning (pp. 784-791). ACM.

–       Farias, V., Jagabathula, S., & Shah, D. (2009). A data-driven approach to modeling choice. In Advances in Neural Information Processing Systems (pp. 504-512).

–       Maystre, L., & Grossglauser, M. (2017). ChoiceRank: Identifying Preferences from Node Traffic in Networks. In Proceedings of Machine Learning Research (Vol. 70, No. EPFL-CONF-229164).

 

•    Diffusion networks 

–       Zhou, K., Zha, H., & Song, L. (2013, April). Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes. In Artificial Intelligence and Statistics (pp. 641-649).

–      Du, N., Wang, Y., He, N., Sun, J., & Song, L. (2015). Time-sensitive recommendation from recurrent user activities. In Advances in Neural Information Processing Systems (pp. 3492-3500).

 

 

Numerical Study and Implementations

(e.g., implement and compare different algorithms and study their practical behaviors)

 

•    Implementation of distributed or parallel synchronous and asynchronous algorithms

–       B. Recht, C. Re, S. Wright, and F. Niu. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. NIPS 2011.

–       J. Duchi, M. Wainwright, and Y. Zhang. Communication-Efficient Algorithms for Statistical Optimization. JMLR 2013.

–       O. Shamir, N. Srebro, and T. Zhang. Communication-Efficient Distributed Optimization using an Approximate Newton-type Method. ICML 2014.

•    Computational survey of various variance-reduced and incremental algorithms (a minimal SVRG sketch appears at the end of this section)

 

–       R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. NIPS 2013.

–       Lin, H., Mairal, J., & Harchaoui, Z. (2015). A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems (pp. 3384-3392).

–       A Defazio, F Bach and S Lacoste-Julien. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. NIPS 2014.

–       G. Lan and Y. Zhou. An optimal randomized incremental gradient method. arXiv 2015.

–       S. Reddi, S. Sra, B. Poczos, and A. Smola. Fast Stochastic Methods for Nonsmooth Nonconvex Optimization. NIPS 2016.

•    Utilization of second-order information

 

–       G. Grapiglia and Y. Nesterov. Globally Convergent Second-order Schemes for Minimizing Twice-differentiable Functions. CORE Discussion 2016.  

–       Y. Nesterov, B. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 2006.

–       R. Byrd, S. Hansen, J. Nocedal, and Y. Singer. A Stochastic Quasi-Newton Method for Large-Scale Optimization. SIAM Journal on Optimization 2016.

–       A. Rodomanov and D. Kropotov. A Superlinearly-Convergent Proximal Newton-Type Method for the Optimization of Finite Sums. ICML 2016.

–       N. Agarwal, B. Bullins, and E. Hazan. Second Order Stochastic Optimization in Linear Time. arXiv 2016.

–       R. Kolte, M. Erdogdu, and A. Oz. Accelerating SVRG via second-order information. OPT 2015.
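
For the variance-reduced survey idea above, here is a minimal SVRG sketch in the spirit of Johnson & Zhang (2013), applied to a simple ridge-regression finite sum. The svrg function, its arguments, the synthetic data, the step size, and the inner-loop length are illustrative assumptions rather than a tuned implementation.

    import numpy as np

    def svrg(grad_i, full_grad, w0, n, step, n_epochs=20, m=None, seed=0):
        # grad_i(w, i): gradient of the i-th component f_i; full_grad(w): full gradient.
        rng = np.random.default_rng(seed)
        m = m if m is not None else 2 * n        # inner-loop length (a common choice)
        w_tilde = w0.copy()
        for _ in range(n_epochs):
            g_tilde = full_grad(w_tilde)         # full gradient at the snapshot point
            w = w_tilde.copy()
            for _ in range(m):
                i = rng.integers(n)
                # Variance-reduced stochastic gradient.
                v = grad_i(w, i) - grad_i(w_tilde, i) + g_tilde
                w = w - step * v
            w_tilde = w                          # last inner iterate becomes the new snapshot
        return w_tilde

    # Illustrative finite sum: f_i(w) = 0.5*(x_i^T w - y_i)^2 + (mu/2)*||w||^2.
    rng = np.random.default_rng(0)
    n, d, mu = 500, 20, 1e-2
    X = rng.standard_normal((n, d))
    y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
    grad_i = lambda w, i: (X[i] @ w - y[i]) * X[i] + mu * w
    full_grad = lambda w: X.T @ (X @ w - y) / n + mu * w
    w = svrg(grad_i, full_grad, np.zeros(d), n, step=1e-3)
    print("final full-gradient norm:", np.linalg.norm(full_grad(w)))

A natural computational study would compare this against plain SGD, SAGA, and catalyst-style acceleration on the same problem.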