Recently there has been increasing interest in multimodal applications that integrate text with other data modalities, such as images, audio, and video, to enable natural language interaction with AI systems and fully exploit the potential of multimodal models. This could be critical for Remote Sensing (RS) applications such as environmental protection, disaster monitoring, and land planning. Existing solutions, however, either cannot account for temporal changes between multiple observations or are narrowly focused on specific tasks such as classification, captioning, and retrieval, with few foundation models available.
To address this gap, we propose a novel foundation model for bi-temporal RS image pairs, leveraging Contrastive Learning and the LEVIR-CC dataset for both captioning and text-image retrieval. By jointly training a contrastive encoder and a captioning decoder, our model adds text-image retrieval capabilities while maintaining captioning performance comparable to the state of the art.
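To make the joint objective concrete, the following is a minimal PyTorch sketch of one plausible formulation: a CLIP-style InfoNCE loss aligning a fused bi-temporal image embedding with the caption embedding, combined with a standard teacher-forced captioning loss. The module names, the concatenation-based fusion of the two timestamps, and the unweighted sum of the two losses are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiTemporalContrastiveCaptioner(nn.Module):
    """Illustrative joint model (assumption, not the paper's code): a shared
    image encoder maps each timestamp to a feature vector; the two vectors are
    fused into one change-aware embedding used both for CLIP-style contrastive
    alignment with the caption embedding and to condition a captioning decoder."""

    def __init__(self, image_encoder, text_encoder, caption_decoder, dim=512):
        super().__init__()
        self.image_encoder = image_encoder      # one image -> (batch, dim) features
        self.text_encoder = text_encoder        # token ids -> (batch, dim) features
        self.caption_decoder = caption_decoder  # autoregressive decoder over vocab
        self.fuse = nn.Linear(2 * dim, dim)     # fuse "before"/"after" features
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # log(1/0.07), as in CLIP

    def forward(self, img_before, img_after, caption_ids):
        # Encode both acquisitions and fuse them into a single embedding.
        f_before = self.image_encoder(img_before)
        f_after = self.image_encoder(img_after)
        img_emb = F.normalize(self.fuse(torch.cat([f_before, f_after], dim=-1)), dim=-1)
        txt_emb = F.normalize(self.text_encoder(caption_ids), dim=-1)

        # InfoNCE: matched (image pair, caption) items lie on the diagonal
        # of the batch similarity matrix; symmetrize over both directions.
        logits = self.logit_scale.exp() * img_emb @ txt_emb.t()
        targets = torch.arange(logits.size(0), device=logits.device)
        loss_con = (F.cross_entropy(logits, targets)
                    + F.cross_entropy(logits.t(), targets)) / 2

        # Teacher-forced captioning loss conditioned on the fused embedding:
        # predict token t+1 from tokens up to t.
        token_logits = self.caption_decoder(img_emb, caption_ids[:, :-1])
        loss_cap = F.cross_entropy(
            token_logits.reshape(-1, token_logits.size(-1)),
            caption_ids[:, 1:].reshape(-1),
        )
        return loss_con + loss_cap  # joint objective (equal weighting assumed)
```

At inference time, the same fused embedding would serve both tasks: ranked against a gallery of caption embeddings for text-image retrieval, or fed to the decoder to generate a change caption.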
Pretrained weights are available at drive.google.com/RSICRC