Get SMILES and AA sequences #1

Open
vemonet opened this issue Sep 21, 2023 · 7 comments

@vemonet
Member

vemonet commented Sep 21, 2023

Get SMILES for a PubChem compound (here aspirin, CID 2244):
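
A minimal sketch of one way to do this, assuming PubChem's PUG REST property endpoint. The property name `CanonicalSMILES` and the `PropertyTable`/`Properties` reply layout are assumptions here and may differ from what the service currently returns, so the helper hands back the whole property record instead of one fixed key. The function names are made up for illustration.

```python
import json
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pubchem_smiles_url(cid: int, prop: str = "CanonicalSMILES") -> str:
    """Build the PUG REST URL that asks for one property of one compound."""
    return f"{PUG_REST}/compound/cid/{cid}/property/{prop}/JSON"


def extract_properties(payload: dict) -> dict:
    """Return the property record of the first compound in the reply.

    Assumes the reply looks like {"PropertyTable": {"Properties": [{...}]}}.
    """
    return payload["PropertyTable"]["Properties"][0]


def fetch_smiles(cid: int) -> dict:
    """Query PubChem (network call) and return the property record."""
    with urlopen(pubchem_smiles_url(cid), timeout=30) as response:
        return extract_properties(json.load(response))
```

Calling `fetch_smiles(2244)` would query the aspirin record; inspect the returned dictionary to see which SMILES key the service actually used.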

Get AA sequence for a protein (check the sequence key):
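
A minimal sketch, assuming the EBI Proteins API (the same service linked later in this thread) and a UniProt accession as input. The nested `sequence` → `sequence` layout is an assumption about the reply, and the function names are hypothetical.

```python
import json
from urllib.request import Request, urlopen

PROTEINS_API = "https://www.ebi.ac.uk/proteins/api/proteins"


def proteins_api_url(accession: str) -> str:
    """Build the URL of a single protein entry."""
    return f"{PROTEINS_API}/{accession}"


def extract_sequence(entry: dict) -> str:
    """Return the amino-acid string stored under the entry's sequence key.

    Assumes the entry looks like {"sequence": {"sequence": "MK...", ...}}.
    """
    return entry["sequence"]["sequence"]


def fetch_sequence(accession: str) -> str:
    """Query the API (network call) and return the amino-acid sequence."""
    request = Request(proteins_api_url(accession), headers={"Accept": "application/json"})
    with urlopen(request, timeout=30) as response:
        return extract_sequence(json.load(response))
```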

@sandrine-muller-research

You can get SMILES from MolePro as well. Depending on your input type (IDs or chemical names), you can use some of these endpoints:
https://molepro.broadinstitute.org/molecular_data_provider/assets/lib/swagger-ui/index.html?url=/molecular_data_provider/assets/openapi.json
With a POST query to /compound/by_id, you'll get the following JSON:
We have put in place a curated way to select the best structure for a given chemical name, and some entries have already been curated. The by-name endpoint is still a work in progress and its curation is ongoing, but it works pretty well.

@vemonet
Member Author

vemonet commented Oct 9, 2023

Thanks a lot @sandrine-muller-research !
Supporting only ChEMBL IDs is quite limiting, so I am interested in anything that covers a wider range of IDs. And MolePro seems to have a really nice API.

But I lack knowledge of the SMILES system, maybe you can enlighten me!

For some compounds the MolePro API is returning multiple elements, e.g. for CHEMBL.COMPOUND:CHEMBL535 we get 2 elements:

  • CCN(CC)CCNC(=O)C1=C(C)NC(\\C=C2/C(=O)NC3=C2C=C(F)C=C3)=C1C
  • CCN(CC)CCNC(=O)C1=C(NC(=C1C)/C=C\\2/C3=C(C=CC(=C3)F)NC2=O)C

When I use the EBI API I get 1 "canonical_smiles" for CHEMBL535: CCN(CC)CCNC(=O)c1c(C)[nH]c(/C=C2\\C(=O)Nc3ccc(F)cc32)c1C

Are canonical SMILES different from "regular" SMILES? Can I easily generate a compound's canonical SMILES from the SMILES of its elements?
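
A side note on the strings above, as a hedged sketch: the doubled backslashes look like JSON string escaping and not SMILES syntax (an assumption based on how the values appear to have been pasted). In SMILES, a single `/` or `\` next to a double bond encodes its stereochemistry. Decoding the raw reply with a JSON parser restores the single backslash. The payload shape below is hypothetical; only the SMILES value comes from this thread.

```python
import json

# Raw JSON text as an API might return it (hypothetical payload shape).
raw = r'{"canonical_smiles": "CCN(CC)CCNC(=O)c1c(C)[nH]c(/C=C2\\C(=O)Nc3ccc(F)cc32)c1C"}'

# json.loads turns the escaped "\\" into the single "\" that SMILES expects.
decoded = json.loads(raw)["canonical_smiles"]
print(decoded)
```

When such a string is pasted into Python source, a raw string (`r'...'`) keeps the backslash from being read as an escape character.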

@vemonet
Member Author

vemonet commented Oct 9, 2023

According to chatty jeepity it should be as simple as this:

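from rdkit import Chem

# SMILES strings returned for the same compound (here the two MolePro
# entries for CHEMBL535, with the escaped "\\" reduced to a single "\").
# Raw strings keep the stereo backslashes intact.
smiles_entries = [
    r'CCN(CC)CCNC(=O)C1=C(C)NC(\C=C2/C(=O)NC3=C2C=C(F)C=C3)=C1C',
    r'CCN(CC)CCNC(=O)C1=C(NC(=C1C)/C=C\2/C3=C(C=CC(=C3)F)NC2=O)C',
]

# Generate the canonical SMILES of each entry so they can be compared.
# Concatenating element symbols such as 'C' + 'H' * 4 does not produce
# a parsable SMILES, so each full string is parsed on its own.
for smiles in smiles_entries:
    compound_molecule = Chem.MolFromSmiles(smiles)

    if compound_molecule:
        # isomericSmiles=False drops the stereochemistry from the output
        canonical_smiles = Chem.MolToSmiles(compound_molecule, isomericSmiles=False)
        print(f'Canonical SMILES of the compound: {canonical_smiles}')
    else:
        print('Invalid SMILES for the compound')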

@vemonet
Member Author

vemonet commented Nov 7, 2023

One of the problems we faced: OpenTargets uses Ensembl gene IDs instead of protein IDs, even though most of the interactions they describe are between drugs and proteins, not drugs and genes.

But a gene can code for many proteins, so the interactions shared by OpenTargets are ambiguous and need to be fixed manually. Why couldn't they use protein IDs directly? That's a big question...

Also, the following APIs do not let us send bulk requests to find sequences: PubChem, ChEMBL, Ensembl.

So we need to send around 5000 requests to get sequences for all our drugs/proteins. That is quite intensive for their APIs, and a lot of the requests fail. It would have been so easy for them to implement bulk calls, but it would have reduced the number of queries made to their service, which is probably the number they report to get funding (so they want it to be high, even if it means making their service worse).
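
A hedged sketch of two things that can soften this: grouping identifiers into chunks where a service accepts several per call, and retrying failed calls with a growing pause. PubChem's PUG REST is documented as accepting comma-separated CID lists, but whether that suits this use case is untested here. The helper names, chunk size, and retry settings are all assumptions.

```python
import time
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunked(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def pubchem_batch_url(cids: Sequence[int], prop: str = "CanonicalSMILES") -> str:
    """Build one PUG REST URL covering several CIDs (comma-separated)."""
    joined = ",".join(str(cid) for cid in cids)
    return f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{joined}/property/{prop}/JSON"


def with_retries(call: Callable[[], R], attempts: int = 4, base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> R:
    """Run `call`, retrying on network errors with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except OSError:  # urllib's URLError and HTTPError derive from OSError
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
    raise RuntimeError("unreachable")
```

A caller would loop over `chunked(all_cids, 100)`, wrap each download in `with_retries`, and keep a short pause between chunks to stay within the service's usage policy.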

Not really optimal

@micheldumontier
Collaborator

ya, you can find the relationship between genes and proteins in the targets data: http://ftp.ebi.ac.uk/pub/databases/opentargets/platform/23.09/output/etl/json/targets/

there's a field for proteinIds

are there alternative APIs you could use? does monarch give sequence data?
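
A rough sketch of reading the proteinIds field from those files, assuming they are JSON Lines (one target per line) and that each record carries an Ensembl gene `id` plus a `proteinIds` list of `{"id", "source"}` objects. The record layout and the `uniprot_swissprot` source label are assumptions to check against the downloaded data.

```python
import json
from typing import Dict, Iterable, List, Sequence


def gene_to_proteins(lines: Iterable[str],
                     sources: Sequence[str] = ("uniprot_swissprot",)) -> Dict[str, List[str]]:
    """Map each gene ID to the protein accessions from the chosen sources."""
    mapping: Dict[str, List[str]] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        accessions = [
            protein["id"]
            for protein in record.get("proteinIds") or []
            if protein.get("source") in sources
        ]
        if accessions:
            mapping[record["id"]] = accessions
    return mapping
```

Since an open file iterates line by line, `gene_to_proteins(open(path))` would work on one downloaded part file (`path` being wherever it was saved).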

@vemonet
Member Author

vemonet commented Nov 8, 2023

Ok, too bad they did not do their own work themselves

EBI ChEMBL seems quite all over the place. For example, the Ensembl ID ENSG00000198838 can be matched to more than 12 different proteins: https://www.ebi.ac.uk/proteins/api/proteins/Ensembl:ENSG00000198838?offset=0&size=100&format=json

All matches have the same "submittedName" for the protein: "Ryanodine receptor 3"

But the sequences are completely different:

  • XEDEIQFLRTYIPPDLCVCNFVLEQSLSVRALQEMLANTGENGGEG
  • XLEIAGEEEEDGSLEPASAFAMACASVKRNVTDFLKRATLKNLRKQYRNVKKMTAKELVKVLFSFFWMLFVGLFQLLFTILGGIFQILWSTVFGGGLVEGAKNIRVTKILGDMPDPTQFGIHDDTMEAERAEVMEPGITTELVHFIKGEKGDTDIMSDLFGLHPKKEGSLKHGPEVGLGDLSEIIGKDEPPTLESTVQKKRKAQAAEMKAANEAEGKVESEKADMEDGEKEDKDKEEEQAEYLWTEVTKKKKRRCGQKVEKPEAFTANFFKGLEIYQTKLLPGH
  • XGRCAPEMHLIQTGKGEAIRIRSILRSLVPTEDLVGIISIPLKLPSLNKDGSVSEPDMAANFCPDHKAPMVLFLDRVYGIKDQTFLLHLLEVGFLPDLRASASLDTVSLSTTEAALALNRYICSAVLPLLTRCAPLFAGTEHCTSLIDSTLQTIYRLSKGRSLTKAQRDTIEECLLAICNHLRPSMLQQLLRRLVFDVPQLNEYCKMPLKLLTNHYEQCWKYYCLPSGWGSYGLAVEEELHLTEKLFWGIFDSLSHKKYDPDLFRMALPCLSAIAGALPPDYLDTRITATLEKQISVDADGNFDPKPINTMNFSLPEKLEYIVTKYAEHSHDKWACDKSQSGWKYGISLDENVKTHPLIRPFKTLTEKEKEIYRWPARESLKTMLAVGWTVERTKEGEALVQQRENEKLRSVSQANQGNSYSPAPLDLSNVVLSRELQGMVEVVAENYHNIWAKKKKLELESKGGGSHPLLVPYDTLTAKEKFKDREKAQDLFKFLQVNGIIVSRGMKDMELDASSMEKRFAYKFLKKILKYVDSAQEFIAHLEAIVSSGKTEKSPRDQEIKFFAKVLLPLVDQYFTSHCLYFLSSPLKPLSSSGYASHKEKEMVAGLFCKLAALVRHRISLFGSDSTTMVSCLHILAQTLDTRTVMKSGSELVKAGLRAFFENAAEDLEKTSENLKLGKFTHSRTQIKGVSQNINYTTVALLPILTSIFEHVTQHQFGMDLLLGDVQISCYHILCSLYSLGTGKNIYVERQRPALGECLASLAAAIPVAFLEPTLNRYNPLSVFNTKTPRERSILGMPDTVEDMCPDIPQLEGLMKEINDLAESGARYTEMPHVIEVILPMLCNYLSYWWERGPENLPPSTGPCCTKVTSEHLSLILGNILKIINNNLGIDEASWMKRIAVYAQPIISKARPDLLRSHFIPTLEKLKKKAVKTVQEEEQLKADGKGDTQEAELLILDEFAVLCRDLYAFYPMLIRYVDNNRSNWLKSPDADSDQLFRMVAEVFILWCKSHNFKREEQNFVIQNEINNLAFLTGDSKSKMSKAMQVKVQVKCMTCLFCPSIRGAGLWPPLHCDHHGGGREWIFPPGGPPGLLQGRQLPVKE

And no match in the Monarch API: https://api.monarchinitiative.org/api/bioentity/anatomy/ENSEMBL%3AENSG00000198838/genes?rows=100&facet=false&unselect_evidence=false&exclude_automatic_assertions=false&fetch_objects=false&use_compact_associations=false&direct=false&direct_taxon=false
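
To inspect such a list of matches, here is a small heuristic sketch that prefers a reviewed entry and otherwise falls back to the longest sequence. It assumes each entry carries `info.type` (with the value `Swiss-Prot` for reviewed records) and a nested `sequence.sequence` string. Both field names are assumptions about the reply, and the ranking is only a convenience, not a rule for choosing the intended target.

```python
from typing import Any, Dict, List, Optional

Entry = Dict[str, Any]


def sequence_of(entry: Entry) -> str:
    """Return the amino-acid string of an entry, or '' if it is missing."""
    return entry.get("sequence", {}).get("sequence", "")


def is_reviewed(entry: Entry) -> bool:
    """True when the entry is flagged as a reviewed (Swiss-Prot) record."""
    return entry.get("info", {}).get("type") == "Swiss-Prot"


def pick_entry(entries: List[Entry]) -> Optional[Entry]:
    """Prefer reviewed entries, then the longest sequence; None if empty."""
    if not entries:
        return None
    return max(entries, key=lambda entry: (is_reviewed(entry), len(sequence_of(entry))))
```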

@vemonet
Member Author

vemonet commented Nov 8, 2023

We have no other choice than to use the mappings published by OpenTargets, because only they know which (protein) target they mean when they give a super ambiguous Ensembl ID.

The real question now is: can we trust this dataset now that we have seen how it's been made? I guess that's like Dutch food, "yes but don't expect it to be good quality"
