IE 598 BIG DATA OPTIMIZATION (Spring 2018)
Updates
§ 04/24/2018: Final report LaTeX template here.
§ 04/16/2018: Lecture slides on Summary and Nonconvex Optimization
§ 04/16/2018: Python Demo (Subgradient method, PG, Smoothing, ADMM) for LASSO here.
§ 03/28/2018: Python Demo (SGD, SVRG) for Logistic Regression here.
§ 02/07/2018: Python Demo (GD, AGD, CD) for Logistic Regression here.
§ 01/27/2018: You can sign up your team and presentation paper here.
§ 01/17/2018: Introduction lecture slides
§ 01/17/2018: Welcome to class!
Course Information
Instructor
Niao He
Email: niaohe@illinois.edu
Office: Transportation Building 211
Office Hour: Monday, 3-4PM, or by appointment via email
Course Description
Nearly every problem in machine learning and high-dimensional statistics can be formulated as the optimization of some function, possibly under constraints. In the wake of the Big Data era, there is an increasingly strong need for efficient and scalable algorithms for optimization at large scale, including problems with inherently high dimension and problems involving truly massive data sets. The key aim of this course is to expose students to a wide range of modern algorithmic developments in optimization (mainly convex optimization) and to bring them near the frontier of research in large-scale optimization and machine learning.
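To make the "learning as optimization" viewpoint concrete, here is one canonical formulation, l2-regularized logistic regression over data $(x_i, y_i)$ with labels $y_i \in \{-1,+1\}$ (an illustrative example, not drawn from the course materials):

```latex
\min_{w \in \mathbb{R}^d} \;
  \frac{1}{n} \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i \, w^\top x_i}\right)
  \;+\; \frac{\lambda}{2} \lVert w \rVert_2^2
```

The first term is the empirical risk (average logistic loss over the data); the regularizer $\frac{\lambda}{2}\lVert w\rVert_2^2$ makes the objective strongly convex, which is exactly the kind of structure the algorithms in this course exploit.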
Courtesy warning: The course is theory-oriented and emphasizes a deep understanding of the structure of optimization problems and the computational complexity of numerical algorithms. It will not cover software or tools for big data analytics and is not an application course.
Students are expected to have strong knowledge of linear algebra, real analysis, and probability theory. Some prior exposure to optimization at the graduate level is preferred.
Lecture Notes
I will not post lecture notes on the website this semester,
but you can refer to my past lecture notes: IE598BigDataOptlecturenotesfall2016.pdf
Reference Books
No textbook is required, but you are strongly encouraged to read the following:
Ben-Tal & Nemirovski. Lectures on Modern Convex Optimization. SIAM, 2011.
Amir Beck. First-Order Methods in Optimization. MOS-SIAM Series on Optimization, 2017.
Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic, 2004.
Sra, Nowozin, & Wright (eds). Optimization for Machine Learning. MIT Press, 2011.
Bubeck. Convex Optimization: Algorithms and Complexity. Foundations and Trends in Machine Learning, 2015.
Grading Policy
There will be no homework or exams. Students will be divided into groups of up to 3, depending on class size. Grades will be based on:
Paper Presentation (25%): each group will be assigned at least one paper to present.
Final Project (10% proposal + 40% report + 25% presentation): The project may consist of any combination of theoretical analysis, application, and literature survey (possibly all three), and you should aim for a project suitable for submission to a conference or journal. For example, this could be conducting a novel large-scale experiment, tackling a concrete open theoretical question, applying optimization techniques to your own field, formalizing an interesting new topic, or relating several problems. The end result should be a 10-15 page report in LaTeX in the given format, and a 25-30 minute presentation.
Deadlines
Group decision: Feb 7th, 2018
o You should email me the names of your team members and the paper(s) you select to present.
Proposal submission: Feb 21st, 2018 (Feb 28th, 2018)
o You should submit an outline (at least 1 page) with the project topic, the names of team members, a brief description of the background, how you plan to approach the problem, and references.
Paper presentation: Mar 5th-Mar 14th, 2018
Proposal resubmission (if needed): Mar 14th, 2018
Final project presentation: Apr 23rd-May 2nd, 2018
Final report submission: May 11th, 2018
Final Presentation Schedule (April 23 - May 2)

Date | Group | Members | Paper | Time
Apr. 23 | 1 | Menglong Li | Online Submodular Minimization in Real Space [Slide] | 4:00-4:30PM
Apr. 23 | 2 | Zhikai Guo, Nitin Tangellamudi, Seo Taek Kong | Approximate Hessian Second Order Method [Slide] | 4:30-5:00PM
Apr. 25 | 3 | Myung Hwan Song | Understanding the Generalization Performance of SGD with Warm Restart [Slide] | 4:00-4:30PM
Apr. 25 | 4 | Peijun Xiao, Dawei Li, Junchi Yang | Relative Smoothness and Relative Continuity [Slide] | 4:30-5:00PM
Apr. 25 | 5 | Ravi Raman, Surya Sankagiri | Understanding Acceleration in Stochastic Gradient Descent [Slide] | 5:00-5:30PM
Apr. 30 | 6 | Amber Srivastava, Alireza Askarian | Reduction of Dynamic Graphs | 4:00-4:30PM
Apr. 30 | 7 | Bozun Wang | Predicting Length of Waiting Lists for Hospitals in England | 4:30-5:00PM
Apr. 30 | 8 | Yifan Hu | Conditional SGD | 5:00-5:30PM
May 2 | 9 | Eli Chien, Jianhao Peng, Chao Pan | On the Study of Stochastic Gradient Coordinate Descent (SGCD) | 4:00-4:30PM
May 2 | 10 | Siqi Zhang | Stochastic Mirror Descent Algorithms in Nonconvex Optimization | 4:30-5:00PM
May 2 | 11 | Zheng Liu | Reinforcement Learning for Radiation Source Detection | 5:00-5:30PM
Paper Presentation Schedule (March 5 - March 14)

Date | Group | Members | Paper | Time
Mar. 5 | 1 | Eli Chien, Jianhao Peng, Chao Pan | Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization [Slide] | 4:00-4:30PM
Mar. 5 | 2 | Siqi Zhang, Bozun Wang | Adaptive Subgradient Methods for Online Learning and Stochastic Optimization [Slide] | 4:30-5:00PM
Mar. 5 | 3 | Wenjing Yin | Hypothesis Testing by Convex Optimization | 5:00-5:30PM
Mar. 7 | 4 | Amber Srivastava, Alireza Askarian | Stochastic Primal-Dual Methods and Sample Complexity of Reinforcement Learning [Slide] | 4:00-4:30PM
Mar. 7 | 5 | Ravi Kiran Raman, Surya Sankagiri | The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning [Slide] | 4:30-5:00PM
Mar. 7 | 6 | Judy Gan, Yifan Hu | Trust Region Policy Optimization [Slide] | 5:00-5:30PM
Mar. 12 | 7 | Myung Hwan Song | An Overview of Gradient Descent Optimization Algorithms [Slide] | 4:00-4:30PM
Mar. 12 | 8 | Zhikai Guo, Nitin Tangellamudi, Seo Taek Kong | Identifying and Attacking the Saddle Point Problem in High-dimensional Non-convex Optimization [Slide] | 4:30-5:00PM
Mar. 12 | 9 | Peijun Xiao, Dawei Li, Junchi Yang | Theoretical Insights into the Optimization Landscape of Over-parameterized Shallow Neural Networks [Slide] | 5:00-5:30PM
Mar. 14 | 10 | Mariya Vasileva | Generative Adversarial Nets | 4:00-4:30PM
Mar. 14 | 11 | Menglong Li, Zheng Liu | Towards Principled Methods for Training Generative Adversarial Networks [Slide] | 4:30-5:00PM
Topical Outline and Schedule
Review Topics: Introduction and Fundamentals (2 weeks)
Learning through the lens of optimization
Basics of convex analysis
Basics of convex optimization theory
Basic Topics: Large-Scale Convex Optimization (4 weeks)
Smooth Optimization (gradient descent and acceleration, coordinate descent, etc.)
Nonsmooth Optimization (subgradient method, mirror descent, smoothing techniques, proximal algorithm, etc.)
Stochastic and Online Optimization (sample average approximation, stochastic approximation, online gradient descent, etc.)
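As one concrete instance of the nonsmooth techniques above, here is a minimal sketch of the proximal gradient method (ISTA) applied to the LASSO, in the spirit of the posted Python demos; the function name, step-size choice, and toy data are illustrative assumptions, not taken from the course code:

```python
import numpy as np

def ista(A, b, lam, step, iters=500):
    """Proximal gradient (ISTA) for the LASSO:
    min_x 0.5*||A x - b||^2 + lam*||x||_1."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        z = x - step * (A.T @ (A @ x - b))   # gradient step on the smooth part
        # soft-thresholding = prox operator of lam*||.||_1
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return x

# Tiny synthetic example: recover a sparse vector from noiseless measurements.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20)
x_true[:3] = [1.0, -2.0, 1.5]
b = A @ x_true
step = 1.0 / np.linalg.norm(A, 2) ** 2       # 1/L, L = largest eigenvalue of A^T A
x_hat = ista(A, b, lam=0.1, step=step)
```

The only nonsmooth ingredient is the soft-thresholding step, which is exactly the proximal map of the l1 penalty; swapping it for a projection recovers projected gradient descent.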
Presentation Topics: Optimization for Machine Learning (2 weeks)
Optimization for Empirical Risk Minimization
Optimization for Deep Learning
Optimization for Generative Adversarial Networks
Optimization for Reinforcement Learning
Optimization for Statistical Inference
Advanced Topics: Recent Advances in Large-Scale Optimization (4 weeks)
Projection-free methods
Derivative-free methods
Variance-reduced methods
Saddle point optimization
Parallel and distributed optimization
Nonconvex optimization
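To illustrate the variance-reduced methods listed above, here is a minimal SVRG sketch on a ridge-regression finite sum (an illustrative toy under assumed names and step sizes, not the course's implementation). SVRG keeps a snapshot point, computes one full gradient per epoch, and corrects each stochastic gradient using the snapshot:

```python
import numpy as np

def svrg(A, b, lam, epochs=50, seed=0):
    """SVRG for ridge regression written as a finite sum:
    min_x (1/n) sum_i [ 0.5*(a_i^T x - b_i)^2 ] + 0.5*lam*||x||^2."""
    n, d = A.shape
    # conservative step ~ 1/(4*Lmax), Lmax = max_i ||a_i||^2 + lam
    step = 1.0 / (4.0 * (np.max(np.sum(A * A, axis=1)) + lam))
    x = np.zeros(d)
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        x_snap = x.copy()
        full_grad = A.T @ (A @ x_snap - b) / n + lam * x_snap  # exact gradient at snapshot
        for _ in range(n):
            i = rng.integers(n)
            g = (A[i] @ x - b[i]) * A[i] + lam * x             # stochastic gradient at x
            g_snap = (A[i] @ x_snap - b[i]) * A[i] + lam * x_snap
            x -= step * (g - g_snap + full_grad)               # variance-reduced update
    return x

rng = np.random.default_rng(1)
n, d, lam = 200, 5, 0.1
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)
x_svrg = svrg(A, b, lam)
x_ridge = np.linalg.solve(A.T @ A / n + lam * np.eye(d), A.T @ b / n)  # closed form
```

Because the correction term has zero mean and shrinks as x approaches the snapshot, the gradient variance vanishes at the optimum, which is what lets SVRG use a constant step size and converge linearly on strongly convex sums, unlike plain SGD.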
Final Presentation (2 weeks)
Presentation Papers
A. Optimization in Empirical Risk Minimization
[1] Bousquet and Bottou. (2008). The Tradeoffs of Large Scale Learning. NIPS.
[2] Shalev-Shwartz and Zhang. (2013). Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization. JMLR.
[3] Zhang and Xiao. (2017). Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization. JMLR.
B. Optimization in Deep Learning
[1] Ruder. (2017). An Overview of Gradient Descent Optimization Algorithms. arXiv.
[2] Duchi, Hazan, and Singer. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMLR.
[3] Wilson et al. (2017). The Marginal Value of Adaptive Gradient Methods in Machine Learning. NIPS.
[4] Dauphin et al. (2014). Identifying and Attacking the Saddle Point Problem in High-dimensional Non-convex Optimization. NIPS.
[5] Soltanolkotabi et al. (2017). Theoretical Insights into the Optimization Landscape of Over-parameterized Shallow Neural Networks. arXiv.
C. Optimization for Generative Adversarial Networks
[1] Goodfellow et al. (2014). Generative Adversarial Nets. NIPS.
[2] Arjovsky and Bottou. (2017). Towards Principled Methods for Training Generative Adversarial Networks. ICLR.
D. Optimization for Reinforcement Learning
[1] Chen and Wang. (2016). Stochastic Primal-Dual Methods and Sample Complexity of Reinforcement Learning. arXiv.
[2] Borkar and Meyn. (2000). The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning. SIAM Journal on Control and Optimization.
[3] Schulman et al. (2015). Trust Region Policy Optimization. ICML.
E. Optimization for Statistical Inference
[1] Juditsky and Nemirovski. (2009). Nonparametric Estimation by Convex Programming. The Annals of Statistics.
[2] Goldenshluger, Juditsky, and Nemirovski. (2015). Hypothesis Testing by Convex Optimization. Electronic Journal of Statistics.
[3] Hoffman et al. (2013). Stochastic Variational Inference. JMLR.
Project Ideas
Below are some broad areas and a few specific ideas:
General-Purpose Optimization Theory
(e.g., consider a new setting, propose a modified algorithm, and analyze its convergence)
Understanding the global convergence of deep learning
Choromanska, A., LeCun, Y., & Arous, G. B. (2015, June). Open problem: The landscape of the loss surfaces of multilayer networks. COLT, pp. 1756-1760.
Soltanolkotabi, M., Javanmard, A., & Lee, J. D. (2017). Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv:1707.04926.
Nouiehed, M., & Razaviyayn, M. (2018). Learning Deep Models: Critical Points and Local Openness. ICLR.
Xie, B., Liang, Y., & Song, L. (2016). Diverse neural network learns true target functions. arXiv:1611.03131.
Nguyen, Q., & Hein, M. (2018). The loss surface and expressivity of deep convolutional neural networks. ICLR.
Yun, C., Sra, S., & Jadbabaie, A. (2018). Global optimality conditions for deep neural networks. ICLR.
Exploring the convergence of SGD for nonconvex optimization problems
Ghadimi, S., & Lan, G. (2013). Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4), 2341-2368.
Davis, D., & Drusvyatskiy, D. (2018). Stochastic subgradient method converges at the rate $O(k^{-1/4})$ on weakly convex functions. arXiv:1802.02988.
Raginsky, M., Rakhlin, A., & Telgarsky, M. (2017). Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis. arXiv:1702.03849.
Exploring the acceleration of SGD algorithms
Allen-Zhu, Z. (2017, June). Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (pp. 1200-1205). ACM.
Jain, P., Kakade, S. M., Kidambi, R., Netrapalli, P., & Sidford, A. (2017). Accelerating Stochastic Gradient Descent. arXiv preprint arXiv:1704.08227.
Gadat, S., Panloup, F., & Saadane, S. (2018). Stochastic heavy ball. Electronic Journal of Statistics, 12(1), 461529.
Interpretation of Nesterov's accelerated algorithms
W. Su, S. Boyd, and E. Candès. A Differential Equation for Modeling Nesterov's Accelerated Gradient Method: Theory and Insights. NIPS 2014.
L. Lessard, B. Recht, and A. Packard. Analysis and Design of Optimization Algorithms via Integral Quadratic Constraints. SIAM Journal on Optimization 2016.
A. Wibisono, A. Wilson, and M. Jordan. A Variational Perspective on Accelerated Methods in Optimization. arXiv 2016.
Z. Allen-Zhu and L. Orecchia. Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent. arXiv 2014.
S. Bubeck, Y. T. Lee, and M. Singh. A geometric alternative to Nesterov's accelerated gradient descent. arXiv 2015.
N. Flammarion and F. Bach. From Averaging to Acceleration, There is Only a Step-size. arXiv 2015.
W. Krichene, A. Bayen, and P. Bartlett. Accelerated mirror descent in continuous and discrete time. NIPS 2015.
Exploiting random permutation/random shuffling
Gürbüzbalaban, M., Ozdaglar, A., & Parrilo, P. (2015). Why random reshuffling beats stochastic gradient descent. arXiv preprint arXiv:1510.08560.
Wright, S. J., & Lee, C. P. (2017). Analyzing Random Permutations for Cyclic Coordinate Descent. arXiv preprint arXiv:1706.00908.
Sun, R., Luo, Z. Q., & Ye, Y. (2015). On the expected convergence of randomly permuted ADMM. arXiv preprint arXiv:1503.06387.
Bridging control theory with optimization theory
W. Su, S. Boyd, and E. Candès. A Differential Equation for Modeling Nesterov's Accelerated Gradient Method: Theory and Insights. NIPS 2014.
L. Lessard, B. Recht, and A. Packard. Analysis and Design of Optimization Algorithms via Integral Quadratic Constraints. SIAM Journal on Optimization 2016.
Lee, J. D., Panageas, I., Piliouras, G., Simchowitz, M., Jordan, M. I., & Recht, B. (2017). First-order Methods Almost Always Avoid Saddle Points. arXiv:1710.07406.
Wilson, A. C., Recht, B., & Jordan, M. I. (2016). A Lyapunov analysis of momentum methods in optimization. arXiv preprint arXiv:1611.02635.
Hardt, M., Ma, T., & Recht, B. (2016). Gradient descent learns linear dynamical systems. arXiv preprint arXiv:1609.05191.
Hu, B., & Lessard, L. (2017). Dissipativity Theory for Nesterov's Accelerated Method. arXiv preprint arXiv:1706.04381.
Domain-Specific Optimization Theory
Improving and stabilizing the optimization of GANs
Arjovsky and Bottou. (2017). Towards Principled Methods for Training Generative Adversarial Networks. ICLR.
Berthelot, D., Schumm, T., & Metz, L. (2017). BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717.
Sinha, A., Namkoong, H., & Duchi, J. (2017). Certifiable Distributional Robustness with Principled Adversarial Training. arXiv preprint arXiv:1710.10571.
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Klambauer, G., & Hochreiter, S. (2017). GANs trained by a two timescale update rule converge to a Nash equilibrium. arXiv preprint arXiv:1706.08500.
Provable convergence for continuous control and reinforcement learning
Bhatnagar, S., Precup, D., Silver, D., Sutton, R. S., Maei, H. R., & Szepesvári, C. (2009). Convergent temporal-difference learning with arbitrary smooth function approximation. NIPS (pp. 1204-1212).
Karmakar, P., & Bhatnagar, S. (2017). Two time-scale stochastic approximation with controlled Markov noise and off-policy temporal-difference learning. Mathematics of Operations Research.
Dai, B., Shaw, A., Li, L., Xiao, L., He, N., Chen, J., & Song, L. (2017). Smoothed Dual Embedding Control. arXiv preprint arXiv:1712.10285.
Dalal, G., Szorenyi, B., Thoppe, G., & Mannor, S. (2017). Finite sample analyses for TD(0) with function approximation. arXiv:1704.01161.
Iterative algorithms for Bayesian and probabilistic inference
Dai, B., He, N., Dai, H., & Song, L. (2016, May). Provable Bayesian inference via particle mirror descent. In Artificial Intelligence and Statistics (pp. 985994).
Liu, Q., & Wang, D. (2016). Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems (pp. 2378-2386).
Jimenez Rezende, D., & Mohamed, S. (2015). Variational inference with normalizing flows. ICML.
Modeling and Applications
(e.g., formulate an optimization model from a practical problem in any field and apply algorithms discussed in class to solve it)
Robust/adversarial machine learning
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., & Vladu, A. (2017). Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.
Carlini, N., & Wagner, D. (2017, May). Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on (pp. 39-57).
Ranking/choice models
Radlinski, F., Kleinberg, R., & Joachims, T. (2008, July). Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning (pp. 784-791). ACM.
Farias, V., Jagabathula, S., & Shah, D. (2009). A data-driven approach to modeling choice. In Advances in Neural Information Processing Systems (pp. 504-512).
Maystre, L., & Grossglauser, M. (2017). ChoiceRank: Identifying Preferences from Node Traffic in Networks. In Proceedings of Machine Learning Research (Vol. 70).
Diffusion networks
Zhou, K., Zha, H., & Song, L. (2013, April). Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes. In Artificial Intelligence and Statistics (pp. 641-649).
Du, N., Wang, Y., He, N., Sun, J., & Song, L. (2015). Time-sensitive recommendation from recurrent user activities. In Advances in Neural Information Processing Systems (pp. 3492-3500).
Numerical Study and Implementations
(e.g., implement and compare different algorithms and study their practical behavior)
Implementation of distributed or parallel, synchronous and asynchronous algorithms
B. Recht, C. Re, S. Wright, and F. Niu. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. NIPS 2011.
J. Duchi, M. Wainwright, and Y. Zhang. Communication-Efficient Algorithms for Statistical Optimization. JMLR 2013.
O. Shamir, N. Srebro, and T. Zhang. Communication-Efficient Distributed Optimization using an Approximate Newton-type Method. ICML 2014.
Computational survey of various variance-reduced and incremental algorithms
R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. NIPS 2013.
Lin, H., Mairal, J., & Harchaoui, Z. (2015). A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems (pp. 3384-3392).
A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. NIPS 2014.
G. Lan and Y. Zhou. An optimal randomized incremental gradient method. arXiv 2015.
S. Reddi, S. Sra, B. Poczos, and A. Smola. Fast Stochastic Methods for Nonsmooth Nonconvex Optimization. NIPS 2016.
Utilization of second-order information
G. Grapiglia and Y. Nesterov. Globally Convergent Second-order Schemes for Minimizing Twice-differentiable Functions. CORE Discussion Paper 2016.
Y. Nesterov and B. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 2006.
R. Byrd, S. Hansen, J. Nocedal, and Y. Singer. A Stochastic Quasi-Newton Method for Large-Scale Optimization. SIAM Journal on Optimization 2016.
A. Rodomanov and D. Kropotov. A Superlinearly-Convergent Proximal Newton-Type Method for the Optimization of Finite Sums. ICML 2016.
N. Agarwal, B. Bullins, and E. Hazan. Second Order Stochastic Optimization in Linear Time. arXiv 2016.
R. Kolte, M. Erdogdu, and A. Ozgur. Accelerating SVRG via second-order information. OPT 2015.