Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

murcko scaffold split #18

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

karinazad
Copy link
Collaborator

@karinazad karinazad commented May 15, 2024

Implements Murcko scaffold split based on RDKit

@0x00b1
Copy link
Collaborator

0x00b1 commented May 15, 2024

@karinazad How do you feel about beignet.subsets?

I believe it’s clearer from a discovery perspective and it matches the underlying data structure.

shuffle: bool = True,
include_chirality: bool = False,
) -> tuple[Subset, Subset]:
"""Split a dataset based on Murcko scaffold splitting based
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prepend r

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate? Is that for the input SMILES strings? Escape characters shouldn't be present in valid SMILES anyway and isn't it that if a string is formed with escape chars without r initially, prepending it won't change the format? Or maybe I'm just misunderstanding

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, preprend r to the docstring.

@0x00b1
Copy link
Collaborator

0x00b1 commented May 15, 2024

@karinazad You’ll also need to add this to the docs.

for ind, s in enumerate(smiles):
mol = Chem.MolFromSmiles(s)
if mol is not None:
scaffold = MurckoScaffoldSmiles(mol=mol, includeChirality=include_chirality)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weird function? It looks like a wrapper around:

Chem.MolToSmiles(GetScaffoldForMol(mol), includeChirality)

No clue about this:

https://github.com/rdkit/rdkit/blob/3457c1eb60846ea821e4a319f3505933027d3cf8/rdkit/Chem/Scaffolds/MurckoScaffold.py#L77-L85

It looks autogenerated?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah GetScaffoldForMol looks strange

@karinazad
Copy link
Collaborator Author

@karinazad How do you feel about beignet.subsets?

I believe it’s clearer from a discovery perspective and it matches the underlying data structure.

should we also change the function name? murcko_scaffold_subset or something like that?

from unittest.mock import MagicMock, patch

import pytest
from beignet.subsets._murcko_scaffold_split import (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from beignet.subsets import (...)

@@ -0,0 +1,71 @@
from importlib.util import find_spec
from unittest.mock import MagicMock, patch
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from unittest.mock import MagicMock
import unittest.mock

Use unittest.mock.patch explicitly.

@@ -0,0 +1,71 @@
from importlib.util import find_spec
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make explicit.

@@ -0,0 +1,122 @@
import math
import random
from collections import defaultdict
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make explicit.

https://doi.org/10.1021/jm9602928
- "RDKit: Open-source cheminformatics. https://www.rdkit.org"
"""
train_idx, test_idx = _murcko_scaffold_split_indices(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inline?


scaffolds = defaultdict(list)

for ind, s in enumerate(smiles):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid abbreviations, e.g., use index and sequence.

Chem, MurckoScaffoldSmiles = None, None


def murcko_scaffold_split(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add generator: Generator arg.

seed: int = 0xDEADBEEF,
shuffle: bool = True,
include_chirality: bool = False,
) -> tuple[Subset, Subset]:
Copy link
Collaborator

@0x00b1 0x00b1 May 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a better design is returning List[Subset] based on a lengths parameter, e.g., see torch.utils.data.random_split.

@0x00b1
Copy link
Collaborator

0x00b1 commented May 20, 2024

Add RDKit as an extra dependency to pyproject.toml.


if shuffle:
if seed is not None:
random.Random(seed).shuffle(scaffolds)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use generator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants