BBB permeability - New dataset #174

Open
devanshamin opened this issue Sep 25, 2022 · 7 comments
Labels: good first issue (Good for newcomers), help-wanted, new-dataset (Request new dataset)

Comments

@devanshamin

Describe the problem
Currently, TDC has the BBB_martins dataset for Blood-Brain Barrier (BBB) permeability, which consists of only 2030 compounds. There is a much larger dataset, the Blood-Brain Barrier Database (B3DB), which consists of 7807 compounds.

Describe the solution you'd like
Include the dataset in the Single-instance Prediction Problem (ADME) and the ADMET Benchmark Group, so that it can be loaded like this:

from tdc.single_pred import ADME
data = ADME(name="B3DB")

Additional context
B3DB - https://github.com/theochem/B3DB
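
A minimal sketch of how the B3DB classification table might be mapped onto a TDC-style tabular layout before it is uploaded. The column names (`compound_name`, `SMILES`, and `BBB+/BBB-` on the B3DB side; `Drug_ID`, `Drug`, and `Y` on the TDC side) and the helper `b3db_to_tdc_rows` are assumptions for illustration and should be checked against the actual files. The sample rows are placeholders, not real B3DB records.

```python
import csv
import io

# Assumed column names; verify against the real B3DB TSV before relying on them.
B3DB_NAME_COL = "compound_name"
B3DB_SMILES_COL = "SMILES"
B3DB_LABEL_COL = "BBB+/BBB-"


def b3db_to_tdc_rows(tsv_text):
    """Convert B3DB-style classification rows into Drug_ID / Drug / Y records."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for record in reader:
        label = record[B3DB_LABEL_COL].strip()
        if label not in ("BBB+", "BBB-"):
            continue  # skip rows without a usable binary label
        rows.append(
            {
                "Drug_ID": record[B3DB_NAME_COL].strip(),
                "Drug": record[B3DB_SMILES_COL].strip(),
                "Y": 1 if label == "BBB+" else 0,  # 1 = permeable, 0 = not
            }
        )
    return rows


# Placeholder input used only to exercise the function.
sample_tsv = (
    "compound_name\tSMILES\tBBB+/BBB-\n"
    "compound_a\tCCO\tBBB+\n"
    "compound_b\tCC(=O)O\tBBB-\n"
    "compound_c\tCCN\t\n"
)
tdc_rows = b3db_to_tdc_rows(sample_tsv)
```

The third placeholder row has no label and is dropped, so `tdc_rows` holds two records.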

@kexinhuang12345
Collaborator

Hi Devansh! Thanks for the pointer! This definitely sounds relevant! Would you like to contribute to TDC? Let us know, thanks!

@kexinhuang12345 added the good first issue, new-dataset, and help-wanted labels on Nov 9, 2022
@marc-gav

I will work on this

@inakineitor

inakineitor commented Dec 7, 2023

@kexinhuang12345 Hi Kexin! I am interested in adding the BBB dataset to TDC. So far the steps I identified are:

  1. Add a bbb.py file to the tdc/single_pred folder. On reflection, BBB belongs to ADME, so no new file is needed in this folder.
  2. Add the appropriate re-export to tdc/single_pred/__init__.py. For the same reason, this step is not necessary.
  3. Download the data and give it to you for storing in Dataverse.
  4. Insert, at line 119 of tdc/metadata.py, the names for the classification and regression versions of the B3DB dataset:
adme_dataset_names = [
    # ...
    "clearance_microsome_az",
    "b3db_classification", # Added
    "b3db_regression", # Added
]
  5. Add the same names to the object in line 627:
name2type = {
    # ...
    "bbb_adenot": "tab",
    "b3db_classification": "tab", # Added
    "b3db_regression": "tab", # Added
    "bbb_martins": "tab",
    # ...
}
  6. I am unsure how to generate the id to put in name2id in line 740. Does one obtain it by adding the dataset to the data server?
  7. Same question, but for name2stats in line 907.
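
The registration steps above can be sketched as a small consistency check that reports which metadata registries still lack an entry for a new dataset name. The dictionaries here are tiny stand-ins for the objects in tdc/metadata.py, and `missing_registrations` is a hypothetical helper, not part of TDC.

```python
# Stand-ins for the registries in tdc/metadata.py; the real ones are much larger.
adme_dataset_names = [
    "clearance_microsome_az",
    "b3db_classification",
    "b3db_regression",
]
name2type = {
    "bbb_adenot": "tab",
    "b3db_classification": "tab",
    "b3db_regression": "tab",
    "bbb_martins": "tab",
}
name2id = {}     # still empty: the ids are the open question in step 6
name2stats = {}  # still empty: the open question in step 7


def missing_registrations(name, registries):
    """Return the labels of all registries that do not yet contain `name`."""
    return [label for label, registry in registries.items() if name not in registry]


registries = {
    "adme_dataset_names": adme_dataset_names,
    "name2type": name2type,
    "name2id": name2id,
    "name2stats": name2stats,
}
todo = missing_registrations("b3db_classification", registries)
print(todo)  # ['name2id', 'name2stats']
```

A check like this makes it easy to see that steps 4 and 5 are done while steps 6 and 7 remain open.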

I am new to the package so any guidance or recommendations would be appreciated.

Looking forward to your response!

@ayushnoori
Member

Hi @kexinhuang12345, we had a conversation back in February 2022 about adding this dataset to TDC so following up here. I'm working with @inakineitor and we would be happy to help get this dataset included (unless @marc-gav has made progress). We can also open a new issue if needed.

Iñaki – Kexin had previously pointed me to the contribution guide.

@kexinhuang12345
Collaborator

Sorry for the late reply - was traveling - this sounds awesome! I think the questions can be answered via the contribution guide pointed out by Ayush. Let me know if you still bump into any questions!

@ayushnoori
Member

Hi Kexin, no worries! All steps are now completed except for name2stats, described as a "mapping from dataset names to statistics." How should the statistics IDs be generated?
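
The thread does not say what the values in name2stats are. If they turn out to be simple per-dataset row counts (an assumption, not something confirmed here), a value could be derived from the prepared file along these lines. `count_data_rows` is a hypothetical helper and the sample text is a placeholder.

```python
import csv
import io


def count_data_rows(tab_text, delimiter="\t"):
    """Count non-empty data rows in delimited text, excluding the header line."""
    reader = csv.reader(io.StringIO(tab_text), delimiter=delimiter)
    next(reader, None)  # skip the header row
    return sum(1 for row in reader if any(cell.strip() for cell in row))


# Placeholder file content; a real run would read the prepared dataset file.
sample_tab = (
    "Drug_ID\tDrug\tY\n"
    "compound_a\tCCO\t1\n"
    "compound_b\tCC(=O)O\t0\n"
    "\n"
    "compound_c\tCCN\t1\n"
)

# Assumption: name2stats maps a dataset name to its number of rows.
candidate_stats = {"b3db_classification": count_data_rows(sample_tab)}
```

The blank line in the sample is ignored, so the count covers only the three data rows.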

@ayushnoori
Member

ayushnoori commented Dec 20, 2023

Please see ayushnoori@ac35e01 at my fork, https://github.com/ayushnoori/TDC.


5 participants