Large width limits of neural networks

Behavior of a neural network simplifies as it becomes infinitely wide. Left: a Bayesian neural network with two hidden layers, transforming a 3-dimensional input (bottom) into a two-dimensional output

(y_{1},y_{2})

(top). Right: output probability density function

p(y_{1},y_{2})

induced by the random weights of the network. Video: as the width of the network increases, the output distribution simplifies, ultimately converging to a Neural network Gaussian process in the infinite width limit.

Artificial neural networks are a class of models used in machine learning, and inspired by biological neural networks. They are the core component of modern deep learning algorithms. Computation in artificial neural networks is usually organized into sequential layers of artificial neurons. The number of neurons in a layer is called the layer width. Theoretical analysis of artificial neural networks sometimes considers the limiting case that layer width becomes large or infinite. This limit enables simple analytic statements to be made about neural network predictions, training dynamics, generalization, and loss surfaces. This wide layer limit is also of practical interest, since finite width neural networks often perform strictly better as layer width is increased.^[1]^[2]^[3]^[4]^[5]^[6]

^ Novak, Roman; Bahri, Yasaman; Abolafia, Daniel A.; Pennington, Jeffrey; Sohl-Dickstein, Jascha (2018-02-15). "Sensitivity and Generalization in Neural Networks: an Empirical Study". International Conference on Learning Representations. arXiv:1802.08760. Bibcode:2018arXiv180208760N.
^ Canziani, Alfredo; Paszke, Adam; Culurciello, Eugenio (2016-11-04). "An Analysis of Deep Neural Network Models for Practical Applications". arXiv:1605.07678. Bibcode:2016arXiv160507678C. {{cite journal}}: Cite journal requires |journal= (help)
^ Novak, Roman; Xiao, Lechao; Lee, Jaehoon; Bahri, Yasaman; Yang, Greg; Abolafia, Dan; Pennington, Jeffrey; Sohl-Dickstein, Jascha (2018). "Bayesian Deep Convolutional Networks with Many Channels are Gaussian Processes". International Conference on Learning Representations. arXiv:1810.05148. Bibcode:2018arXiv181005148N.
^ Neyshabur, Behnam; Li, Zhiyuan; Bhojanapalli, Srinadh; LeCun, Yann; Srebro, Nathan (2019). "Towards understanding the role of over-parametrization in generalization of neural networks". International Conference on Learning Representations. arXiv:1805.12076. Bibcode:2018arXiv180512076N.
^ Lawrence, Steve; Giles, C. Lee; Tsoi, Ah Chung (1996). "What size neural network gives optimal generalization? convergence properties of backpropagation". CiteSeerX 10.1.1.125.6019. {{cite journal}}: Cite journal requires |journal= (help)
^ Bartlett, P.L. (1998). "The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network". IEEE Transactions on Information Theory. 44 (2): 525–536. doi:10.1109/18.661502. ISSN 1557-9654.

[:7-1] Novak, Roman; Bahri, Yasaman; Abolafia, Daniel A.; Pennington, Jeffrey; Sohl-Dickstein, Jascha (2018-02-15). "Sensitivity and Generalization in Neural Networks: an Empirical Study". International Conference on Learning Representations. arXiv:1802.08760. Bibcode:2018arXiv180208760N.

[:8-2] Canziani, Alfredo; Paszke, Adam; Culurciello, Eugenio (2016-11-04). "An Analysis of Deep Neural Network Models for Practical Applications". arXiv:1605.07678. Bibcode:2016arXiv160507678C. {{cite journal}}: Cite journal requires |journal= (help)

[:1-3] Novak, Roman; Xiao, Lechao; Lee, Jaehoon; Bahri, Yasaman; Yang, Greg; Abolafia, Dan; Pennington, Jeffrey; Sohl-Dickstein, Jascha (2018). "Bayesian Deep Convolutional Networks with Many Channels are Gaussian Processes". International Conference on Learning Representations. arXiv:1810.05148. Bibcode:2018arXiv181005148N.

[:6-4] Neyshabur, Behnam; Li, Zhiyuan; Bhojanapalli, Srinadh; LeCun, Yann; Srebro, Nathan (2019). "Towards understanding the role of over-parametrization in generalization of neural networks". International Conference on Learning Representations. arXiv:1805.12076. Bibcode:2018arXiv180512076N.

[5] Lawrence, Steve; Giles, C. Lee; Tsoi, Ah Chung (1996). "What size neural network gives optimal generalization? convergence properties of backpropagation". CiteSeerX 10.1.1.125.6019. {{cite journal}}: Cite journal requires |journal= (help)

[6] Bartlett, P.L. (1998). "The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network". IEEE Transactions on Information Theory. 44 (2): 525–536. doi:10.1109/18.661502. ISSN 1557-9654.

[1]

[2]

[3]

[4]

[5]

[6]