Recently there has been increasing interest in multimodal applications that integrate text with other data modalities, such as images, audio, and video, to enable natural language interaction with AI systems and fully exploit the potential of multimodal models. This could be critical for Remote Sensing (RS) applications such as environmental protection, disaster monitoring, and land planning. Existing solutions, however, either cannot account for temporal changes between multiple observations or are narrowly focused on specific tasks such as classification, captioning, and retrieval, with few foundation models available.
To address this gap, we propose a novel foundation model for bi-temporal RS image pairs, leveraging Contrastive Learning and the LEVIR-CC dataset for both captioning and text-image retrieval. By jointly training a contrastive encoder and a captioning decoder, our model adds text-image retrieval capabilities while maintaining captioning performance comparable to the state of the art.
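To make the joint objective concrete, the following is a minimal PyTorch sketch of one plausible formulation: a CLIP-style InfoNCE loss aligning a fused bi-temporal image embedding with the caption embedding, combined with a standard teacher-forced captioning loss. The module names, the concatenation-based fusion of the two timestamps, and the unweighted sum of the two losses are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiTemporalContrastiveCaptioner(nn.Module):
    """Illustrative joint model (assumption, not the paper's code): a shared
    image encoder maps each timestamp to a feature vector; the two vectors are
    fused into one change-aware embedding used both for CLIP-style contrastive
    alignment with the caption embedding and to condition a captioning decoder."""

    def __init__(self, image_encoder, text_encoder, caption_decoder, dim=512):
        super().__init__()
        self.image_encoder = image_encoder      # one image -> (batch, dim) features
        self.text_encoder = text_encoder        # token ids -> (batch, dim) features
        self.caption_decoder = caption_decoder  # autoregressive decoder over vocab
        self.fuse = nn.Linear(2 * dim, dim)     # fuse "before"/"after" features
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # log(1/0.07), as in CLIP

    def forward(self, img_before, img_after, caption_ids):
        # Encode both acquisitions and fuse them into a single embedding.
        f_before = self.image_encoder(img_before)
        f_after = self.image_encoder(img_after)
        img_emb = F.normalize(self.fuse(torch.cat([f_before, f_after], dim=-1)), dim=-1)
        txt_emb = F.normalize(self.text_encoder(caption_ids), dim=-1)

        # InfoNCE: matched (image pair, caption) items lie on the diagonal
        # of the batch similarity matrix; symmetrize over both directions.
        logits = self.logit_scale.exp() * img_emb @ txt_emb.t()
        targets = torch.arange(logits.size(0), device=logits.device)
        loss_con = (F.cross_entropy(logits, targets)
                    + F.cross_entropy(logits.t(), targets)) / 2

        # Teacher-forced captioning loss conditioned on the fused embedding:
        # predict token t+1 from tokens up to t.
        token_logits = self.caption_decoder(img_emb, caption_ids[:, :-1])
        loss_cap = F.cross_entropy(
            token_logits.reshape(-1, token_logits.size(-1)),
            caption_ids[:, 1:].reshape(-1),
        )
        return loss_con + loss_cap  # joint objective (equal weighting assumed)
```

At inference time, the same fused embedding would serve both tasks: ranked against a gallery of caption embeddings for text-image retrieval, or fed to the decoder to generate a change caption.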
Pretrained weights are available at drive.google.com/RSICRC