A vision transformer (ViT) is a transformer designed for computer vision.[1] A ViT splits an input image into a sequence of patches (rather than breaking up text into tokens), serialises each patch into a vector, and maps each vector to a smaller dimension with a single matrix multiplication. The resulting vector embeddings are then processed by a transformer encoder as if they were token embeddings.
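The patch-splitting and projection steps described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than a reference implementation: the function names `patchify` and `embed_patches`, the 32×32 image, the patch size of 8, and the embedding dimension of 64 are all assumptions chosen for the example, and the projection matrix is random here, whereas in a trained model it would be learned.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping square patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row from the top-left corner.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, C) -> (gh, gw, p, p, C): group the pixels of each patch together.
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Serialise each patch into a single vector.
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def embed_patches(patches, projection):
    """Map each flattened patch to the embedding dimension with one matrix multiplication."""
    return patches @ projection


# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))          # stand-in for a real RGB image
patch_size = 8
embed_dim = 64                           # smaller than the 8 * 8 * 3 = 192 values per patch

patches = patchify(image, patch_size)    # shape (16, 192)
# Random stand-in for the learned projection matrix.
projection = rng.normal(size=(patches.shape[1], embed_dim)) * 0.02
tokens = embed_patches(patches, projection)  # shape (16, 64)
```

Each row of `tokens` plays the role that a token embedding plays in a text transformer, so the array can be passed to a transformer encoder as a sequence of length 16.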
ViTs have found applications in image recognition, image segmentation, and autonomous driving.[citation needed]