Trends and Challenges of Multimodal Solutions for Text and Image Context Extraction
Abstract
The volume of published images, texts, and other information in today’s digital space is growing rapidly. The ability to process textual and visual information simultaneously helps to interpret content more accurately and enables the application of artificial intelligence in complex situations, such as contextual analysis and real-time monitoring of social networks. Today’s Large Language Models (LLMs) operate on text data, Object Recognition (OR) models on visual data, and Vision-Language (VL) models on both text and visual data. Combinations of these models can be used to build multimodal solutions for various context-extraction tasks; such solutions take both images and texts as input. This paper systematically reviews the latest research on existing LLM, OR, and VL models. The reviewed articles were sourced from the Web of Science and Google Scholar databases; all are freely accessible and date from 2019 to 2024. The main objective was to summarise the results, tasks, and methods of scientific research on text, image, and image-text data analysis. The types of datasets and the languages of the texts used in the research were also reviewed. Additionally, the results highlight trends and challenges in the context-extraction field that may be useful to other researchers.
