Transformer (deep learning architecture)

A transformer is a deep learning architecture developed by researchers at Google, based on the multi-head attention mechanism proposed in the 2017 paper "Attention Is All You Need".[1] Text is converted into numerical representations called tokens, and each token is mapped to a vector by looking it up in a word embedding table.[1] At each layer, every token is then contextualized within the scope of the context window against the other (unmasked) tokens via a parallel multi-head attention mechanism, which amplifies the signal from the most relevant tokens and diminishes that from less relevant ones. The 2017 paper builds on the softmax-based attention mechanism proposed by Bahdanau et al. in 2014 for machine translation,[2][3] and on the fast weight controller proposed in 1992, which is similar to a transformer.[4][5][6]
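
The core operation can be illustrated with a minimal sketch of single-head scaled dot-product attention; the function and variable names below are illustrative, and a full transformer applies learned query, key and value projections across many heads in parallel rather than this bare form.

    # Minimal sketch of single-head scaled dot-product attention (illustrative
    # names and shapes; a real transformer uses learned projections and many heads).
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(Q, K, V, mask=None):
        # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)            # pairwise token affinities
        if mask is not None:
            scores = np.where(mask, scores, -1e9)  # hide masked positions
        weights = softmax(scores, axis=-1)         # each row sums to 1
        return weights @ V                         # weighted mix of value vectors

    # Toy example: 3 tokens embedded in 4 dimensions. Here the embeddings stand
    # in for queries, keys and values; real models learn W_Q, W_K, W_V per head.
    X = np.random.default_rng(0).normal(size=(3, 4))
    context = scaled_dot_product_attention(X, X, X)
    print(context.shape)  # (3, 4): one contextualized vector per token

Each output row is a weighted average of the value vectors, so tokens with high affinity to a given query contribute more to its new representation, which is the sense in which important tokens are amplified and less important ones diminished.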

Transformers have the advantage of having no recurrent units, and therefore require less training time than earlier recurrent neural architectures such as long short-term memory (LSTM),[7] because every position in a sequence can be processed in parallel rather than step by step. Later variants have been widely adopted for training large language models (LLMs) on large language datasets, such as the Wikipedia corpus and Common Crawl.[8]
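
This training-time advantage can be seen in a minimal sketch; the names are illustrative, and the recurrent update below is a simplified stand-in for an LSTM cell rather than its actual gate equations. The recurrent layer must be computed one time step after another, while the attention-style layer handles all positions in one batched matrix product.

    # Illustrative contrast between a sequential recurrent update and a
    # parallel attention update (NumPy; shapes and weights are toy values).
    import numpy as np

    T, d = 128, 64                           # sequence length, model width
    X = np.random.default_rng(1).normal(size=(T, d))

    # Recurrent-style layer: step t depends on step t-1, so the loop is
    # inherently sequential (a simplified stand-in for an LSTM cell).
    W_h, W_x = 0.5 * np.eye(d), 0.5 * np.eye(d)
    h, states = np.zeros(d), []
    for t in range(T):
        h = np.tanh(W_h @ h + W_x @ X[t])
        states.append(h)
    recurrent_out = np.stack(states)         # (T, d), built one row at a time

    # Attention-style layer: all pairwise interactions in one matrix product,
    # which parallelizes across the whole sequence on modern hardware.
    scores = X @ X.T / np.sqrt(d)            # (T, T)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    attention_out = weights @ X              # (T, d), computed for all t at once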

This architecture is now used not only in natural language processing and computer vision,[10] but also in audio[11] and multi-modal processing. It has also led to the development of pre-trained systems, such as generative pre-trained transformers (GPTs)[12] and BERT[13] (Bidirectional Encoder Representations from Transformers).

[Figure: Timeline of natural language processing models]

References
  1. ^ a b Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Łukasz; Polosukhin, Illia (2017). "Attention Is All You Need". Advances in Neural Information Processing Systems. 30. arXiv:1706.03762.
  2. ^ Cite error: The named reference inventors was invoked but never defined (see the help page).
  3. ^ Cite error: The named reference inventconfirm was invoked but never defined (see the help page).
  4. ^ Schmidhuber, Jürgen (1992). "Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks". Neural Computation. 4 (1): 131–139.
  5. ^ Schlag, Imanol; Irie, Kazuki; Schmidhuber, Jürgen (2021). "Linear Transformers Are Secretly Fast Weight Programmers". Proceedings of the 38th International Conference on Machine Learning (ICML). arXiv:2102.11174.
  6. ^ Katharopoulos, Angelos; Vyas, Apoorv; Pappas, Nikolaos; Fleuret, François (2020). "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention". Proceedings of the 37th International Conference on Machine Learning (ICML). arXiv:2006.16236.
  7. ^ Hochreiter, Sepp; Schmidhuber, Jürgen (1 November 1997). "Long Short-Term Memory". Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735. ISSN 0899-7667. PMID 9377276. S2CID 1915014.
  8. ^ "Better Language Models and Their Implications". OpenAI. 2019-02-14. Archived from the original on 2020-12-19. Retrieved 2019-08-25.
  9. ^ Vartak, Manasi; Subramanyam, Harihar; Lee, Wei-En; Viswanathan, Srinidhi; Husnoo, Saadiyah; Madden, Samuel; Zaharia, Matei (2016-06-26). "ModelDB: a system for machine learning model management". Proceedings of the Workshop on Human-In-the-Loop Data Analytics. HILDA '16. New York, NY, USA: Association for Computing Machinery: 1–3. doi:10.1145/2939502.2939516. ISBN 978-1-4503-4207-0.
  10. ^ He, Cheng (31 December 2021). "Transformer in CV". Transformer in CV. Towards Data Science. Archived from the original on 16 April 2023. Retrieved 19 June 2021.
  11. ^ Radford, Alec; Jong Wook Kim; Xu, Tao; Brockman, Greg; McLeavey, Christine; Sutskever, Ilya (2022). "Robust Speech Recognition via Large-Scale Weak Supervision". arXiv:2212.04356 [eess.AS].
  12. ^ Wolf, Thomas; Debut, Lysandre; Sanh, Victor; Chaumond, Julien; Delangue, Clement; Moi, Anthony; Cistac, Pierric; Rault, Tim; Louf, Remi; Funtowicz, Morgan; Davison, Joe; Shleifer, Sam; von Platen, Patrick; Ma, Clara; Jernite, Yacine; Plu, Julien; Xu, Canwen; Le Scao, Teven; Gugger, Sylvain; Drame, Mariama; Lhoest, Quentin; Rush, Alexander (2020). "Transformers: State-of-the-Art Natural Language Processing". Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. pp. 38–45. doi:10.18653/v1/2020.emnlp-demos.6. S2CID 208117506.
  13. ^ "Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing". Google AI Blog. 2 November 2018. Archived from the original on 2021-01-13. Retrieved 2019-08-25.
