Autoregressive transformer language model for drug discovery. (Pre)trained on a large SMILES corpus. Evaluated on molecular property prediction and low-data de novo design tasks.
Set up conda and create a new environment from `environment.yml` (if needed, make corresponding edits for GPU compatibility):
```shell
conda env create -f environment.yml
conda activate smiles-gpt
git clone https://github.com/sanjaradylov/smiles-gpt.git
cd smiles-gpt
```
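As a quick sanity check that the environment resolved correctly (only PyTorch and HuggingFace Transformers are checked here, since the checkpoint-loading example below requires both):

```python
# Smoke test for the activated smiles-gpt environment.
import torch
import transformers

print("torch", torch.__version__)
print("transformers", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```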
`notebooks/language-modeling.ipynb` pretrains GPT-2 on the 10M PubChem SMILES data.

`notebooks/selfies-anygpt` introduces AnyGPT for pretraining on 1D molecular data.
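The notebooks contain the actual pretraining pipelines; the sketch below only illustrates the idea of causal-LM pretraining on SMILES with a fresh, small GPT-2. The model sizes, learning rate, and toy corpus are placeholders rather than the benchmark settings, and the tokenizer is simply reused from the released checkpoint.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel, PreTrainedTokenizerFast

# Reuse the released SMILES tokenizer; the model below is freshly initialized.
tokenizer = PreTrainedTokenizerFast.from_pretrained("checkpoints/benchmark-5m")
if tokenizer.pad_token is None:  # assumption: add a pad token if the tokenizer lacks one
    tokenizer.add_special_tokens({"pad_token": "<pad>"})

config = GPT2Config(
    vocab_size=len(tokenizer),
    n_positions=256,  # SMILES are short strings; illustrative value
    n_embd=256, n_layer=4, n_head=8,
)
model = GPT2LMHeadModel(config)

smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]  # toy corpus, not PubChem-10M
batch = tokenizer(smiles, padding=True, return_tensors="pt")
labels = batch["input_ids"].clone()
labels[batch["attention_mask"] == 0] = -100  # ignore padding positions in the loss

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
for _ in range(3):  # a few toy optimization steps
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
print("final toy loss:", loss.item())
```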
`checkpoints/` stores the serialized model, tokenizer, and configuration. Do not modify them. Use the `from_pretrained` method to load HuggingFace objects, e.g.,
```python
from transformers import GPT2Config, GPT2LMHeadModel, PreTrainedTokenizerFast

checkpoint = "checkpoints/benchmark-5m"
config = GPT2Config.from_pretrained(checkpoint)
model = GPT2LMHeadModel.from_pretrained(checkpoint)
tokenizer = PreTrainedTokenizerFast.from_pretrained(checkpoint)
```
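Continuing from the snippet above, the checkpoint can be sampled for new SMILES strings. The BOS-token handling and the sampling hyperparameters here are illustrative assumptions, not the repository's generation settings:

```python
import torch

# Assumes the tokenizer defines a BOS (or at least an EOS) token to start generation from.
bos = tokenizer.bos_token_id if tokenizer.bos_token_id is not None else tokenizer.eos_token_id
pad = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else bos

model.eval()
samples = model.generate(
    torch.tensor([[bos]]),
    do_sample=True,       # stochastic sampling for diverse molecules
    top_p=0.95,           # illustrative nucleus-sampling threshold
    max_length=64,
    num_return_sequences=5,
    pad_token_id=pad,
)
for ids in samples:
    print(tokenizer.decode(ids, skip_special_tokens=True))
```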
`data` stores the Blood-Brain Barrier Penetration classification dataset and a 10K subset of ChemBERTa's PubChem-10M. See Examples.

`output` stores generated SMILES strings.
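For a quick look at the classification data, `data/bbbp.csv` can be inspected with pandas. The `p_np` column (the binary penetration label) appears in the example below; the `smiles` column name is an assumption about the CSV layout:

```python
import pandas as pd

bbbp = pd.read_csv("data/bbbp.csv")      # BBBP classification dataset
print(bbbp.shape)                        # number of molecules and columns
print(bbbp[["smiles", "p_np"]].head())   # assumed column names
print(bbbp["p_np"].value_counts())       # class balance of the binary label
```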
Adapter training for molecular property prediction (replace the `data/bbbp.csv` and `p_np` arguments with your dataset and task name(s), respectively):

```shell
python3 scripts/classification.py checkpoints/benchmark-5m data/bbbp.csv p_np
```
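The script performs adapter training end to end. Purely as an illustration of the underlying task, the sketch below fine-tunes a plain `GPT2ForSequenceClassification` head on the same checkpoint and data; this is a simplified substitute for, not a reproduction of, `scripts/classification.py`, and the column names, subset size, and hyperparameters are assumptions:

```python
import pandas as pd
import torch
from transformers import GPT2ForSequenceClassification, PreTrainedTokenizerFast

checkpoint = "checkpoints/benchmark-5m"
tokenizer = PreTrainedTokenizerFast.from_pretrained(checkpoint)
if tokenizer.pad_token is None:                      # assumption: add pad token if missing
    tokenizer.add_special_tokens({"pad_token": "<pad>"})

model = GPT2ForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
model.resize_token_embeddings(len(tokenizer))        # no-op if nothing was added
model.config.pad_token_id = tokenizer.pad_token_id   # required for padded batches

df = pd.read_csv("data/bbbp.csv")                    # assumed columns: smiles, p_np
batch = tokenizer(list(df["smiles"][:32]), padding=True, truncation=True,
                  max_length=128, return_tensors="pt")
labels = torch.tensor(df["p_np"][:32].tolist())

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
out = model(**batch, labels=labels)                  # cross-entropy over the two classes
out.loss.backward()
optimizer.step()
print("one-step loss:", out.loss.item())
```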
For language model pretraining, see the notebooks.
If you use `smiles-gpt` in your research, please consider citing: