Multimodal sentiment analysis

Multimodal sentiment analysis extends traditional text-based sentiment analysis to additional modalities such as audio and visual data.^[1] It can be bimodal, combining any two modalities, or trimodal, incorporating three.^[2] With the large volume of social media data available online in forms such as videos and images, conventional text-based sentiment analysis has evolved into more complex multimodal models,^[3]^[4] which can be applied in the development of virtual assistants,^[5] the analysis of YouTube movie reviews,^[6] the analysis of news videos,^[7] and emotion recognition (sometimes known as emotion detection), for example in depression monitoring,^[8] among others.

As in traditional sentiment analysis, one of the most basic tasks in multimodal sentiment analysis is sentiment classification, which classifies sentiments into categories such as positive, negative, or neutral.^[9] The complexity of analyzing text, audio, and visual features to perform such a task requires the application of different fusion techniques, such as feature-level, decision-level, and hybrid fusion.^[3] The performance of these fusion techniques, and of the classification algorithms applied, is influenced by the type of textual, audio, and visual features employed in the analysis.^[10]
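The difference between feature-level and decision-level fusion can be illustrated with a minimal Python sketch. It is not taken from the cited sources: the function names, the feature values, the per-modality scores, and the weighted-average combination rule are all assumptions chosen for illustration. Feature-level fusion is shown as concatenating the per-modality feature vectors before any classifier sees them, and decision-level fusion as combining the class scores that separate per-modality classifiers would have produced.

```python
from typing import Dict, List, Optional, Sequence

LABELS = ("positive", "negative", "neutral")


def feature_level_fusion(features: Dict[str, Sequence[float]]) -> List[float]:
    """Feature-level (early) fusion: join per-modality vectors into one vector."""
    fused: List[float] = []
    for modality in sorted(features):  # fixed order keeps the layout reproducible
        fused.extend(features[modality])
    return fused


def decision_level_fusion(
    scores: Dict[str, Dict[str, float]],
    weights: Optional[Dict[str, float]] = None,
) -> str:
    """Decision-level (late) fusion: weighted average of per-modality class scores."""
    if weights is None:
        weights = {modality: 1.0 for modality in scores}  # assumption: equal trust
    total = sum(weights[modality] for modality in scores)
    combined = {
        label: sum(weights[m] * scores[m][label] for m in scores) / total
        for label in LABELS
    }
    # Return the label with the highest combined score.
    return max(combined, key=combined.get)


# Hypothetical feature vectors for one video segment (values are made up).
example_features = {"text": [0.2, 0.7], "audio": [0.1], "visual": [0.9, 0.4]}

# Hypothetical class scores from three separate unimodal classifiers.
example_scores = {
    "text": {"positive": 0.7, "negative": 0.2, "neutral": 0.1},
    "audio": {"positive": 0.3, "negative": 0.5, "neutral": 0.2},
    "visual": {"positive": 0.6, "negative": 0.3, "neutral": 0.1},
}

fused_vector = feature_level_fusion(example_features)
fused_label = decision_level_fusion(example_scores)
```

In this sketch the fused vector would be passed to a single classifier, whereas the decision-level function never looks at raw features. A hybrid scheme would combine both ideas, for instance by treating the output of a classifier trained on the fused vector as one more entry in the scores passed to `decision_level_fusion`.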

[1] Soleymani, Mohammad; Garcia, David; Jou, Brendan; Schuller, Björn; Chang, Shih-Fu; Pantic, Maja (September 2017). "A survey of multimodal sentiment analysis". Image and Vision Computing. 65: 3–14. doi:10.1016/j.imavis.2017.08.003. S2CID 19491070.
[2] Karray, Fakhreddine; Milad, Alemzadeh; Saleh, Jamil Abou; Mo Nours, Arab (2008). "Human-Computer Interaction: Overview on State of the Art" (PDF). International Journal on Smart Sensing and Intelligent Systems. 1: 137–159. doi:10.21307/ijssis-2017-283.
[3] Poria, Soujanya; Cambria, Erik; Bajpai, Rajiv; Hussain, Amir (September 2017). "A review of affective computing: From unimodal analysis to multimodal fusion". Information Fusion. 37: 98–125. doi:10.1016/j.inffus.2017.02.003. hdl:1893/25490. S2CID 205433041.
[4] Nguyen, Quy Hoang; Nguyen, Minh-Van Truong; Van Nguyen, Kiet (2024-05-01). "New Benchmark Dataset and Fine-Grained Cross-Modal Fusion Framework for Vietnamese Multimodal Aspect-Category Sentiment Analysis". arXiv:2405.00543 [cs.CL].
[5] "Google AI to make phone calls for you". BBC News. 8 May 2018. Retrieved 12 June 2018.
[6] Wollmer, Martin; Weninger, Felix; Knaup, Tobias; Schuller, Bjorn; Sun, Congkai; Sagae, Kenji; Morency, Louis-Philippe (May 2013). "YouTube Movie Reviews: Sentiment Analysis in an Audio-Visual Context" (PDF). IEEE Intelligent Systems. 28 (3): 46–53. doi:10.1109/MIS.2013.34. S2CID 12789201.
[7] Pereira, Moisés H. R.; Pádua, Flávio L. C.; Pereira, Adriano C. M.; Benevenuto, Fabrício; Dalip, Daniel H. (9 April 2016). "Fusing Audio, Textual and Visual Features for Sentiment Analysis of News Videos". arXiv:1604.02612 [cs.CL].
[8] Zucco, Chiara; Calabrese, Barbara; Cannataro, Mario (November 2017). "Sentiment analysis and affective computing for depression monitoring". 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE. pp. 1988–1995. doi:10.1109/bibm.2017.8217966. ISBN 978-1-5090-3050-7. S2CID 24408937.
[9] Pang, Bo; Lee, Lillian (2008). Opinion mining and sentiment analysis. Hanover, MA: Now Publishers. ISBN 978-1601981509.
[10] Cite error: The named reference s7 was invoked but never defined.

