Contrastive Language-Image Pre-training (CLIP) is a technique for jointly training a pair of neural network models, an image encoder and a text encoder, with a contrastive objective so that matching image-text pairs are mapped to nearby points in a shared embedding space.[1]
This method has enabled broad applications across multiple domains, including cross-modal retrieval,[2] text-to-image generation,[3] and aesthetic ranking.[4]
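The following is a minimal sketch, not the reference implementation, of the kind of symmetric contrastive (InfoNCE-style) objective described above, written in PyTorch. The function name, the fixed temperature value, and the tensor names are illustrative assumptions; in the original paper the temperature is a learned parameter rather than a constant.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image-text pairs.

    image_features, text_features: tensors of shape (batch_size, embed_dim)
    produced by the image and text encoders; row i of each is a matched pair.
    Note: the fixed temperature here is an illustrative assumption.
    """
    # L2-normalize so the dot product equals cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix between all images and all texts in the batch.
    logits = image_features @ text_features.t() / temperature

    # The matching text for image i sits at index i (and vice versa).
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Minimizing this loss pulls the embeddings of each matched image-text pair together while pushing apart the embeddings of mismatched pairs within the same batch, which is what allows the trained encoders to be reused for the retrieval, generation, and ranking applications listed above.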
^Radford, Alec; Kim, Jong Wook; Hallacy, Chris; Ramesh, Aditya; Goh, Gabriel; Agarwal, Sandhini; Sastry, Girish; Askell, Amanda; Mishkin, Pamela; Clark, Jack; Krueger, Gretchen; Sutskever, Ilya (2021-07-01). Learning Transferable Visual Models From Natural Language Supervision. Proceedings of the 38th International Conference on Machine Learning. PMLR. pp. 8748–8763.
^"Stable Diffusion Repository on GitHub". CompVis - Machine Vision and Learning Research Group, LMU Munich. 17 September 2022. Archived from the original on January 18, 2023. Retrieved 17 September 2022.