Get SMILES and AA sequences #1
Comments
You can get SMILES from MolePro as well. Depending on your input type (ID or chemical name), you can use some of these endpoints:
Thanks a lot @sandrine-muller-research! But I lack knowledge of the SMILES system, maybe you can enlighten me! For some compounds the MolePro API returns multiple elements, e.g. for
When I use the EBI API I get 1 "canonical_smiles" for CHEMBL535: Are canonical SMILES different from "regular" SMILES? Can I easily generate a compound's canonical SMILES from the SMILES of its elements?
According to chatty jeepity it should be as simple as this:

from rdkit import Chem

# SMILES representations of the elements
smiles_carbon = 'C'
smiles_hydrogen = 'H'
smiles_oxygen = 'O'

# Combine the SMILES of elements to create a chemical compound
compound_smiles = f'{smiles_carbon}{smiles_hydrogen*4}{smiles_oxygen*2}'

# Generate the canonical SMILES
compound_molecule = Chem.MolFromSmiles(compound_smiles)
if compound_molecule:
    canonical_smiles = Chem.MolToSmiles(compound_molecule, isomericSmiles=False)
    print(f'Canonical SMILES of the compound: {canonical_smiles}')
else:
    print('Invalid SMILES for the compound')
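For what it's worth, the string built above is `CHHHHOO`, which is not valid SMILES: hydrogens are normally implicit or written in brackets, and a SMILES string describes the atoms and bonds of a molecule, not a list of element counts. A canonical SMILES is simply one standardized spelling picked among the many valid SMILES for the same molecule. Here is a minimal sketch with RDKit, assuming the input is already a complete SMILES string for the compound (the helper name `canonical_smiles` is made up for illustration):

```python
from typing import Optional

from rdkit import Chem


def canonical_smiles(smiles: str) -> Optional[str]:
    """Return RDKit's canonical SMILES, or None if the input cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # MolToSmiles produces canonical output by default.
    return Chem.MolToSmiles(mol)


# Two different spellings of ethanol end up as the same canonical string.
print(canonical_smiles("OCC") == canonical_smiles("CCO"))
# The element-concatenation string is rejected by the parser.
print(canonical_smiles("CHHHHOO"))
```

Canonicalization rules differ between toolkits, so RDKit's output may not match the "canonical_smiles" string returned by ChEMBL character for character, even when both describe the same molecule.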
One of the problems we faced: OpenTargets uses Ensembl gene IDs instead of protein IDs, although most of the interactions they describe are between drugs and proteins, not drugs and genes. A gene can code for many proteins, so the interactions shared by OpenTargets are ambiguous and need to be fixed manually. Why couldn't they use protein IDs directly? That's a big question...

Also, the APIs we use (PubChem, ChEMBL, Ensembl) do not let us send bulk requests to fetch sequences, so we need to send around 5000 requests to get the sequences for all our drugs and proteins. That is quite heavy for their APIs, which fail on a lot of the requests. It would have been so easy for them to implement bulk calls, but that would have reduced the number of queries made to their service, which is probably the number they report to get funding (so they want it to be high, even if it means making their service worse). Not really optimal.
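For the failing requests, one common mitigation is to space the calls out and retry failures with exponential backoff. A rough sketch using only the standard library follows; the function names, delays and attempt counts are illustrative assumptions, not values recommended by any of these services:

```python
import time
import urllib.request
from typing import Callable, Dict, Iterable, Optional


def http_get(url: str, timeout: float = 30.0) -> bytes:
    """Plain GET with the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], bytes] = http_get,
    max_attempts: int = 4,
    base_delay: float = 1.0,
    pause: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Optional[bytes]]:
    """Fetch URLs one by one, retrying failures with exponential backoff.

    URLs that still fail after max_attempts map to None so they can be re-run later.
    """
    results: Dict[str, Optional[bytes]] = {}
    for url in urls:
        results[url] = None
        for attempt in range(max_attempts):
            try:
                results[url] = fetch(url)
                break
            except OSError:  # covers urllib's URLError/HTTPError and socket timeouts
                if attempt < max_attempts - 1:
                    sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
        sleep(pause)  # short gap between consecutive requests
    return results
```

Keeping the `None` entries makes it easy to rerun only the failed IDs instead of repeating all 5000 requests.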
Ya, you can find the relationship between genes and proteins in the targets data: http://ftp.ebi.ac.uk/pub/databases/opentargets/platform/23.09/output/etl/json/targets/ There's a field for proteinIds. Are there alternative APIs you could use? Does Monarch give sequence data?
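A possible way to turn the targets files into a gene-to-protein lookup is sketched below. It assumes each file holds one JSON object per line, that `id` is the Ensembl gene ID, and that `proteinIds` is a list of objects carrying an `id` key; these field shapes should be checked against the actual files before relying on them.

```python
import json
from typing import Dict, Iterable, List


def gene_to_proteins(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each target `id` to the protein IDs listed under `proteinIds`."""
    mapping: Dict[str, List[str]] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        target = json.loads(line)
        # Assumed shape: "proteinIds": [{"id": "...", "source": "..."}, ...]
        proteins = [p["id"] for p in (target.get("proteinIds") or []) if "id" in p]
        mapping[target["id"]] = proteins
    return mapping


# Made-up records, only to show the expected input shape.
sample = [
    '{"id": "GENE_A", "proteinIds": [{"id": "PROT_1", "source": "x"}, {"id": "PROT_2", "source": "y"}]}',
    '{"id": "GENE_B"}',
]
print(gene_to_proteins(sample))
```

With real data, `lines` could be an open file handle over one of the downloaded part files.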
Ok, too bad they did not do that work themselves. EBI ChEMBL seems quite all over the place, for example the Ensembl ID
All matches have the same "submittedName" for the protein: "Ryanodine receptor 3". But the sequences are completely different:
And no match in the Monarch API: https://api.monarchinitiative.org/api/bioentity/anatomy/ENSEMBL%3AENSG00000198838/genes?rows=100&facet=false&unselect_evidence=false&exclude_automatic_assertions=false&fetch_objects=false&use_compact_associations=false&direct=false&direct_taxon=false
We have no other choice than to use the mappings published by OpenTargets, because only they know which (protein) target they are talking about when they give a super ambiguous Ensembl ID. The real question now is: can we trust this dataset now that we have seen how it's been made? I guess that's like Dutch food: "yes, but don't expect it to be good quality".
Get SMILES for a PubChem Compound (here for aspirin, CID 2244):
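One way to make such a request from Python, assuming the PubChem PUG REST property endpoint. The property name `CanonicalSMILES` is an assumption and may need adjusting to whatever PubChem currently exposes; the helper names are made up. PUG REST is documented as accepting a comma-separated list of CIDs, so the same sketch can cover several compounds per request.

```python
import json
import urllib.request
from typing import Dict, Iterable

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def smiles_url(cids: Iterable[int], prop: str = "CanonicalSMILES") -> str:
    """Build a PUG REST property URL for one or more CIDs."""
    cid_list = ",".join(str(cid) for cid in cids)
    return f"{PUG_REST}/compound/cid/{cid_list}/property/{prop}/JSON"


def parse_properties(payload: dict) -> Dict[int, dict]:
    """Map each CID to its returned properties (every key except CID)."""
    rows = payload["PropertyTable"]["Properties"]
    return {row["CID"]: {k: v for k, v in row.items() if k != "CID"} for row in rows}


def fetch_smiles(cids: Iterable[int]) -> Dict[int, dict]:
    """Network call: download and parse the properties for the given CIDs."""
    with urllib.request.urlopen(smiles_url(cids), timeout=30) as response:
        return parse_properties(json.load(response))


if __name__ == "__main__":
    print(fetch_smiles([2244]))  # aspirin
```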
Get AA sequence for a protein (check the sequence key):
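As a sketch of what reading the sequence key can look like, the snippet below assumes the EBI Proteins API, where an entry's JSON is expected to nest the residue string under `sequence`. The URL pattern, the nested key names and the helper names are assumptions to verify against a real response.

```python
import json
import urllib.request
from typing import Optional

PROTEINS_API = "https://www.ebi.ac.uk/proteins/api/proteins"


def extract_sequence(entry: dict) -> Optional[str]:
    """Pull the amino-acid string out of a protein entry, if present."""
    seq = entry.get("sequence")
    if isinstance(seq, dict):
        # Assumed shapes: {"sequence": {"sequence": "MK..."}} or {"sequence": {"value": "MK..."}}
        return seq.get("sequence") or seq.get("value")
    return seq if isinstance(seq, str) else None


def fetch_sequence(accession: str) -> Optional[str]:
    """Network call: fetch one protein entry as JSON and return its sequence."""
    request = urllib.request.Request(
        f"{PROTEINS_API}/{accession}",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return extract_sequence(json.load(response))
```

Returning `None` when the key is missing makes it easy to spot entries that need a second look.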