UTF-8

Standard: Unicode Standard
Classification: Unicode Transformation Format, extended ASCII, variable-length encoding
Extends: ASCII
Transforms / Encodes: ISO/IEC 10646 (Unicode)
Preceded by: UTF-1

UTF-8 is a variable-length character encoding standard used for electronic communication. It is defined by the Unicode Standard, and its name is derived from Unicode Transformation Format – 8-bit.[1]

UTF-8 is capable of encoding all 1,112,064[a] valid Unicode code points using one to four one-byte (8-bit) code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. It was designed for backward compatibility with ASCII: the first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well.
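The properties described above can be checked with a short Python sketch. It relies only on Python's built-in UTF-8 codec. The names `VALID_CODE_POINTS`, `SAMPLES`, and `utf8_length` are illustrative and do not come from the Unicode Standard, and the sample code points are simply values chosen at the edges of each length range. The arithmetic assumes the usual accounting for the 1,112,064 figure: the full range U+0000 to U+10FFFF minus the 2,048 surrogate code points.

```python
# Valid code points: the full range U+0000..U+10FFFF minus the 2,048
# surrogates U+D800..U+DFFF, which UTF-8 does not encode.
TOTAL_CODE_POINTS = 0x110000          # 1,114,112
SURROGATES = 0xE000 - 0xD800          # 2,048
VALID_CODE_POINTS = TOTAL_CODE_POINTS - SURROGATES


def utf8_length(code_point: int) -> int:
    """Return how many bytes Python's UTF-8 codec uses for one code point."""
    return len(chr(code_point).encode("utf-8"))


# Illustrative code points at the edges of each encoded-length range.
SAMPLES = [0x41, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]

for cp in SAMPLES:
    encoded = chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {len(encoded)} byte(s): {encoded.hex()}")

# ASCII text yields the same bytes whether encoded as ASCII or as UTF-8.
ascii_text = "Hello, world!"
print(ascii_text.encode("ascii") == ascii_text.encode("utf-8"))
print(f"{VALID_CODE_POINTS:,} valid code points")
```

In this sketch the byte count grows from one to four as the code point value increases, and the ASCII comparison holds because each of the first 128 code points is encoded as a single byte with the same value as in ASCII.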

UTF-8 was designed as a superior alternative to UTF-1, a proposed variable-length encoding with partial ASCII compatibility that lacked some features, including self-synchronization and fully ASCII-compatible handling of characters such as slashes. Ken Thompson and Rob Pike produced the first implementation for the Plan 9 operating system in September 1992.[2][3] This led to its adoption by X/Open as its specification for FSS-UTF,[4] which was first officially presented at USENIX in January 1993.[5] The Internet Engineering Task Force (IETF) subsequently adopted it in RFC 2277 (BCP 18)[6] for future internet standards work, replacing single-byte character sets such as Latin-1 used in older RFCs.

UTF-8 results in fewer internationalization issues[7][8] than any alternative text encoding. It has been implemented in all modern operating systems, including Microsoft Windows, and in standards such as JSON, where, as is increasingly the case, it is the only allowed form of Unicode.

UTF-8 is the dominant encoding for the World Wide Web (and internet technologies), accounting for 98.2% of all web pages, 99.1% of the top 100,000 pages, and up to 100% for many languages, as of 2024.[9] Virtually all countries and languages have 95% or more use of UTF-8 encodings on the web.

  1. "Chapter 2. General Structure". The Unicode Standard (6.0 ed.). Mountain View, California, US: The Unicode Consortium. ISBN 978-1-936213-01-6.
  2. Pike, Rob (30 April 2003). "UTF-8 history".
  3. Pike, Rob; Thompson, Ken (1993). "Hello World or Καλημέρα κόσμε or こんにちは 世界" (PDF). Proceedings of the Winter 1993 USENIX Conference.
  4. "File System Safe UCS - Transformation Format (FSS-UTF) - X/Open Preliminary Specification" (PDF). unicode.org.
  5. "USENIX Winter 1993 Conference Proceedings". usenix.org.
  6. Alvestrand, Harald T. (January 1998). IETF Policy on Character Sets and Languages. IETF. doi:10.17487/RFC2277. BCP 18. RFC 2277.
  7. Cite error: The named reference "Microsoft GDK" was invoked but never defined.
  8. Cite error: The named reference "whatwg" was invoked but never defined.
  9. Cite error: The named reference "W3TechsWebEncoding" was invoked but never defined.

