This repository implements DINO (self-distillation with no labels), a self-supervised learning framework that uses Vision Transformers (ViTs) to learn image representations without labeled data. It extends DINO with an image reconstruction task, inspired by Multi-Concept Self-Supervised Learning (MC-SSL), to investigate whether combining global feature learning with local detail reconstruction improves the learned representations.
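As a rough illustration of how a global distillation objective and a local reconstruction objective can be combined, the sketch below adds a DINO-style cross-entropy term to a masked reconstruction term. It is not taken from this repository's code: the function names, the single-view simplification (no multi-crop), the L1 reconstruction error, and the `recon_weight` parameter are all assumptions made for illustration. The temperatures are the defaults reported in the DINO paper.

```python
import torch
import torch.nn.functional as F


def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the centered, sharpened teacher distribution
    and the student distribution (single pair of views, no multi-crop)."""
    teacher_probs = F.softmax((teacher_logits - center) / teacher_temp, dim=-1).detach()
    student_log_probs = F.log_softmax(student_logits / student_temp, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()


def masked_reconstruction_loss(pred, target, mask):
    """L1 error averaged over masked patches only (assumed loss choice).

    pred, target: (batch, num_patches, patch_dim)
    mask:         (batch, num_patches), 1.0 where the patch was masked
    """
    per_patch = (pred - target).abs().mean(dim=-1)  # (batch, num_patches)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)


def total_loss(student_logits, teacher_logits, center,
               pred, target, mask, recon_weight=1.0):
    """Weighted sum of the two objectives; recon_weight is hypothetical."""
    global_term = dino_loss(student_logits, teacher_logits, center)
    local_term = masked_reconstruction_loss(pred, target, mask)
    return global_term + recon_weight * local_term
```

Only the masked patches contribute to the reconstruction term, so the weighting between the two objectives does not depend on how many patches are left visible.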
- Builds on the DINO framework to learn representations without labels.
- Adds a reconstruction head based on Group Masked Model Learning (GMML) as an auxiliary task.
- Uses a teacher network whose weights are a momentum (exponential moving) average of the student network's weights, which stabilizes training.
- Uses ViT-Tiny to keep training and experimentation inexpensive.
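The momentum-averaged teacher mentioned above can be sketched as an exponential moving average (EMA) over the student's parameters. This is a minimal sketch rather than the code in `main_dino.py`: the function name `update_teacher` and the `nn.Linear` stand-in networks are hypothetical, and the momentum of 0.996 is the base value reported in the DINO paper, which may differ from the value used here.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def update_teacher(student: nn.Module, teacher: nn.Module,
                   momentum: float = 0.996) -> None:
    """In-place EMA update: teacher = m * teacher + (1 - m) * student."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s.detach(), alpha=1.0 - momentum)


# Toy stand-ins for the ViT-Tiny student and teacher (assumption).
student = nn.Linear(4, 2)
teacher = nn.Linear(4, 2)

# The teacher starts as a copy of the student and receives no gradients.
teacher.load_state_dict(student.state_dict())
for p in teacher.parameters():
    p.requires_grad_(False)

# Called once after each optimizer step on the student.
update_teacher(student, teacher)
```

Because the teacher is updated only through this average and never by backpropagation, its outputs change slowly, which is what gives the student a stable target.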
- Python 3.8 or higher
- PyTorch 1.8.0 or higher
- torchvision 0.9.0 or higher
- numpy
- matplotlib
python main_dino.py
python eval_linear.py
- DINO paper: https://arxiv.org/abs/2104.14294
- MC-SSL paper: https://arxiv.org/abs/2111.15340
This project is part of my PhD admission task at the University of Surrey, aiming to explore and enhance self-supervised learning techniques for Vision Transformers. The goal is to investigate the integration of image reconstruction tasks with DINO to improve representation learning.