Ge🌏Galactica

A Scientific Large Language Model in Geoscience

Technical report is HERE!
The data pre-processing toolkits are open sourced on sciparser!

Statement

To avoid potential confusion, we hereby state that the model described in the manuscript "GeoGalactica: A Scientific Large Language Model in Geoscience" is not associated with the DDE Program and is not supported by DDE-related funding. We apologize for any unintentional misunderstanding and inconvenience, and we commit to preventing future misunderstandings.

Introduction

GeoGalactica is obtained by further pre-training Galactica, a top-performing LLM trained on a large number of scientific documents. In this work, we take an initial step toward leveraging LLMs for science through a rather straightforward approach: we specialize an open-source LLM for geoscience by further pre-training it on a vast amount of geoscience text, then applying supervised fine-tuning (SFT) to the resulting model with our custom-collected instruction-tuning dataset. The result is GeoGalactica, a model with 30 billion parameters. To the best of our knowledge, it is the largest language model for the geoscience domain.

Resources

Paper: https://github.com/geobrain-ai/geogalactica
Data: https://huggingface.co/datasets/daven3/geobench, https://huggingface.co/datasets/daven3/geosignal, and https://github.com/zthang/geotools
Model: https://huggingface.co/geobrain-ai/geogalactica
Checkpoints: https://huggingface.co/geobrain-ai/geogalactica-ckpt
Plot: https://github.com/dbylynn/GeoGalactica_Analysis
Sciparser: https://github.com/davendw49/sciparser

Quick Start

A simple script (tools/prediction/demo.py) is provided for predicting the output text for a single input. Running the model requires more than 140GB of memory. The example_data folder shows the data file format used during training.
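The following is a minimal sketch of single-input prediction, not the contents of tools/prediction/demo.py. It assumes the checkpoint at geobrain-ai/geogalactica loads through the Hugging Face transformers auto classes, that torch, transformers, and accelerate are installed, and that half-precision loading is acceptable. The helper names weight_memory_gb and generate are hypothetical. The first helper only does the arithmetic behind the memory requirement: 30 billion parameters at 4 bytes each is about 120GB for the weights alone, before activations and other runtime overhead.

```python
def weight_memory_gb(num_params: int, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in decimal gigabytes.

    Excludes activations, the KV cache, and framework overhead.
    """
    return num_params * bytes_per_param / 1e9


def generate(prompt: str,
             model_id: str = "geobrain-ai/geogalactica",
             max_new_tokens: int = 128) -> str:
    """Generate a continuation for one prompt (hypothetical helper).

    Assumption: the checkpoint is compatible with AutoModelForCausalLM.
    Imports are local so the arithmetic helper works without torch.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,  # assumption: fp16 to halve weight memory
        device_map="auto",          # needs accelerate; spreads layers over devices
    )

    # Pass only input_ids so extra tokenizer outputs are not forwarded to generate().
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling weight_memory_gb(30_000_000_000, 2) gives the half-precision weight footprint, which is a quick way to check whether the available devices can hold the model before calling generate.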

Contributors

This project was founded by Acemap at Shanghai Jiao Tong University. It is led by Zhouhan Lin and carried out by a group of students including Cheng Deng* (student leader), Le Zhou, Tianhang Zhang, Yi Xu, Yutong Xu, Beiya Dai, Qiyuan Chen, Yuanyuan Shi, and Zhongmou He, supervised by Zhouhan Lin, Junxian He, Xinbing Wang, and Chenghu Zhou.

Acknowledgements

GeoGalactica builds on the following open-source projects. We express our gratitude and respect to the researchers behind them.

Facebook Galactica: https://galactica.org/
Facebook LLaMA: https://github.com/facebookresearch/llama
Stanford Alpaca: https://github.com/tatsu-lab/stanford_alpaca
alpaca-lora by @tloen: https://github.com/tloen/alpaca-lora
alpaca-gp4 by Chansung Park: tloen/alpaca-lora#340
K2 by Cheng Deng: https://github.com/davendw49/k2

We would also like to thank the students at CAS for their data processing and annotation work.

License

GeoGalactica is a research preview intended for non-commercial use only, subject to the model license of Galactica and the Terms of Use of the data generated by OpenAI. Please contact us if you find any potential violations. The code is released under the Apache License 2.0. The GeoSignal and GeoBench datasets are open-sourced by K2.
