Transformer (deep learning architecture)

A transformer is a deep learning architecture developed by researchers at Google, based on the multi-head attention mechanism proposed in the 2017 paper "Attention Is All You Need".[1] Text is converted into numerical representations called tokens, and each token is mapped to a vector by looking it up in a word embedding table.[1] At each layer, every token is then contextualized within the scope of the context window against the other (unmasked) tokens via a parallel multi-head attention mechanism, which amplifies the signal from the most relevant tokens and diminishes that from less relevant ones. The 2017 paper builds on the softmax-based attention mechanism proposed by Bahdanau et al. in 2014 for machine translation,[2][3] and on the fast weight controller proposed in 1992, which is similar to a transformer.[4][5][6]
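
The core operation can be illustrated with a minimal sketch of single-head scaled dot-product attention; the function and variable names below are illustrative, and a full transformer applies learned query, key and value projections across many heads in parallel rather than this bare form.

    # Minimal sketch of single-head scaled dot-product attention (illustrative
    # names and shapes; a real transformer uses learned projections and many heads).
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(Q, K, V, mask=None):
        # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)            # pairwise token affinities
        if mask is not None:
            scores = np.where(mask, scores, -1e9)  # hide masked positions
        weights = softmax(scores, axis=-1)         # each row sums to 1
        return weights @ V                         # weighted mix of value vectors

    # Toy example: 3 tokens embedded in 4 dimensions. Here the embeddings stand
    # in for queries, keys and values; real models learn W_Q, W_K, W_V per head.
    X = np.random.default_rng(0).normal(size=(3, 4))
    context = scaled_dot_product_attention(X, X, X)
    print(context.shape)  # (3, 4): one contextualized vector per token

Each output row is a weighted average of the value vectors, so tokens with high affinity to a given query contribute more to its new representation, which is the sense in which important tokens are amplified and less important ones diminished.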

Transformers have the advantage of having no recurrent units, and therefore require less training time than earlier recurrent neural architectures such as long short-term memory (LSTM),[7] because every position in a sequence can be processed in parallel rather than step by step. Later variants have been widely adopted for training large language models (LLMs) on large language datasets, such as the Wikipedia corpus and Common Crawl.[8]
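
This training-time advantage can be seen in a minimal sketch; the names are illustrative, and the recurrent update below is a simplified stand-in for an LSTM cell rather than its actual gate equations. The recurrent layer must be computed one time step after another, while the attention-style layer handles all positions in one batched matrix product.

    # Illustrative contrast between a sequential recurrent update and a
    # parallel attention update (NumPy; shapes and weights are toy values).
    import numpy as np

    T, d = 128, 64                           # sequence length, model width
    X = np.random.default_rng(1).normal(size=(T, d))

    # Recurrent-style layer: step t depends on step t-1, so the loop is
    # inherently sequential (a simplified stand-in for an LSTM cell).
    W_h, W_x = 0.5 * np.eye(d), 0.5 * np.eye(d)
    h, states = np.zeros(d), []
    for t in range(T):
        h = np.tanh(W_h @ h + W_x @ X[t])
        states.append(h)
    recurrent_out = np.stack(states)         # (T, d), built one row at a time

    # Attention-style layer: all pairwise interactions in one matrix product,
    # which parallelizes across the whole sequence on modern hardware.
    scores = X @ X.T / np.sqrt(d)            # (T, T)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    attention_out = weights @ X              # (T, d), computed for all t at once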

This architecture is now used not only in natural language processing and computer vision,[10] but also in audio[11] and multi-modal processing. It has also led to the development of pre-trained systems, such as generative pre-trained transformers (GPTs)[12] and BERT[13] (Bidirectional Encoder Representations from Transformers).

[Figure: Timeline of natural language processing models]

References
  1. ^ a b Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Łukasz; Polosukhin, Illia (2017). "Attention Is All You Need". Advances in Neural Information Processing Systems. 30. arXiv:1706.03762.
  2. ^ Cite error: The named reference inventors was invoked but never defined (see the help page).
  3. ^ Cite error: The named reference inventconfirm was invoked but never defined (see the help page).
  4. ^ Schmidhuber, Jürgen (1992). "Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks". Neural Computation. 4 (1): 131–139.
  5. ^ Schlag, Imanol; Irie, Kazuki; Schmidhuber, Jürgen (2021). "Linear Transformers Are Secretly Fast Weight Programmers". Proceedings of the 38th International Conference on Machine Learning (ICML). arXiv:2102.11174.
  6. ^ Katharopoulos, Angelos; Vyas, Apoorv; Pappas, Nikolaos; Fleuret, François (2020). "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention". Proceedings of the 37th International Conference on Machine Learning (ICML). arXiv:2006.16236.
  7. ^ Hochreiter, Sepp; Schmidhuber, Jürgen (1 November 1997). "Long Short-Term Memory". Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735. ISSN 0899-7667. PMID 9377276. S2CID 1915014.
  8. ^ "Better Language Models and Their Implications". OpenAI. 2019-02-14. Archived from the original on 2020-12-19. Retrieved 2019-08-25.
  9. ^ Vartak, Manasi; Subramanyam, Harihar; Lee, Wei-En; Viswanathan, Srinidhi; Husnoo, Saadiyah; Madden, Samuel; Zaharia, Matei (2016-06-26). "ModelDB: a system for machine learning model management". Proceedings of the Workshop on Human-In-the-Loop Data Analytics. HILDA '16. New York, NY, USA: Association for Computing Machinery: 1–3. doi:10.1145/2939502.2939516. ISBN 978-1-4503-4207-0.
  10. ^ He, Cheng (31 December 2021). "Transformer in CV". Transformer in CV. Towards Data Science. Archived from the original on 16 April 2023. Retrieved 19 June 2021.
  11. ^ Radford, Alec; Jong Wook Kim; Xu, Tao; Brockman, Greg; McLeavey, Christine; Sutskever, Ilya (2022). "Robust Speech Recognition via Large-Scale Weak Supervision". arXiv:2212.04356 [eess.AS].
  12. ^ Wolf, Thomas; Debut, Lysandre; Sanh, Victor; Chaumond, Julien; Delangue, Clement; Moi, Anthony; Cistac, Pierric; Rault, Tim; Louf, Remi; Funtowicz, Morgan; Davison, Joe; Shleifer, Sam; von Platen, Patrick; Ma, Clara; Jernite, Yacine; Plu, Julien; Xu, Canwen; Le Scao, Teven; Gugger, Sylvain; Drame, Mariama; Lhoest, Quentin; Rush, Alexander (2020). "Transformers: State-of-the-Art Natural Language Processing". Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. pp. 38–45. doi:10.18653/v1/2020.emnlp-demos.6. S2CID 208117506.
  13. ^ "Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing". Google AI Blog. 2 November 2018. Archived from the original on 2021-01-13. Retrieved 2019-08-25.
