In machine learning, backpropagation is a gradient computation method commonly used to train a neural network by computing the gradients needed for its parameter updates.
It is an efficient application of the chain rule to neural networks. Backpropagation computes the gradient of a loss function with respect to the weights of the network for a single input–output example, and does so efficiently, computing the gradient one layer at a time, iterating backward from the last layer to avoid redundant calculations of intermediate terms in the chain rule; this can be derived through dynamic programming.[1][2][3]
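The layer-by-layer application of the chain rule can be made concrete with a small sketch. The following Python code is an illustrative example only, assuming a two-layer network with a tanh hidden layer, a linear output layer, and a mean-squared-error loss; the shapes, variable names, and activation choice are assumptions made for the sketch, not part of any standard definition.

```python
import numpy as np

def forward_backward(x, y, W1, W2):
    # Forward pass: cache the intermediate values needed by the backward pass.
    z1 = W1 @ x                 # hidden-layer pre-activation
    h = np.tanh(z1)             # hidden-layer activation
    y_hat = W2 @ h              # linear output layer
    loss = 0.5 * np.sum((y_hat - y) ** 2)

    # Backward pass: apply the chain rule one layer at a time, starting
    # from the loss and reusing each intermediate gradient instead of
    # recomputing it (the source of backpropagation's efficiency).
    d_yhat = y_hat - y                        # dL/d(y_hat)
    dW2 = np.outer(d_yhat, h)                 # dL/dW2
    d_h = W2.T @ d_yhat                       # dL/dh, propagated backward
    d_z1 = d_h * (1.0 - np.tanh(z1) ** 2)     # chain rule through tanh
    dW1 = np.outer(d_z1, x)                   # dL/dW1
    return loss, dW1, dW2

# Illustrative usage on a single input–output example with random data.
rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=2)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
loss, dW1, dW2 = forward_backward(x, y, W1, W2)
```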
Strictly speaking, the term backpropagation refers only to the algorithm for efficiently computing the gradient, not to how the gradient is used; but the term is often used loosely to refer to the entire learning algorithm – including how the gradient is used, such as by stochastic gradient descent, or as an intermediate step in a more complicated optimizer such as Adaptive Moment Estimation (Adam).[4] Convergence to local minima, exploding gradients, vanishing gradients, and weak control of the learning rate are the main disadvantages of these optimization algorithms. Hessian and quasi-Hessian optimizers address only the local-minimum convergence problem, while making backpropagation run longer. These problems led researchers to develop hybrid[5] and fractional[6] optimization algorithms.
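To illustrate the distinction between computing the gradient and using it, the fragment below continues the sketch above with a single plain stochastic gradient descent step; the learning rate of 0.01 is an arbitrary assumption for the example.

```python
# How an optimizer might consume the gradients produced by backpropagation:
# one stochastic gradient descent step on the weights from the sketch above.
learning_rate = 0.01          # hypothetical step size
W1 -= learning_rate * dW1     # update hidden-layer weights
W2 -= learning_rate * dW2     # update output-layer weights
```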
Backpropagation had multiple discoveries and partial discoveries, with a tangled history and terminology. See the history section for details. Some other names for the technique include "reverse mode of automatic differentiation" or "reverse accumulation".[7]