search.json

[
  {
    "objectID": "datasets/retentiontime/PXD028248_RT.html",
    "href": "datasets/retentiontime/PXD028248_RT.html",
    "title": "PXD028248",
    "section": "",
    "text": "Downloads\n\n\n\nDataset Description\nThe dataset contains 1.056.808 PSMs from 441 MaxQuant evidence files filtered at PEP &lt; 0.001 (Contains only pre-selected columns).\n\n\nAttributes\n\ntitle: PXD028248\ndataset tag: retentiontime/PXD028248_RT\ndata publication: Clinical and Translational Medicine\nmachine learning publication:\ndata source identifier: PXD028248\ndata type: retention time\nformat: CSV\ncolumns: Raw file, Sequence, Modifications, Modified sequence, Retention time, Calibrated retention time, PEP\ninstrument: Q. Exactive HF\norganism: Homo sapiens (human)\nfixed modifications: \nvariable modification: unmodified, methionine oxidation, N-terminal acetylation, and carbamidomethyl\nchromatography separation: RP\npeak measurement: \n\n\n\nSample Protocol\nAscites and greater omentum tissue with metastatic lesions were collected from patients with ovarian HGSC undergoing primary surgery at the University Hospital in Marburg. Clinical courses were evaluated by RECIST criteria33 in patients with measurable disease or profiles of serum CA125 levels according to the recommendations by the Gynecologic Cancer InterGroup (GCIG). Only patients with observations periods ≥12 months after first line surgery were included in the survival.\nPeptides were separated using an UHPLC (EASY-nLC 1000, ThermoFisher Scientific) and 20 cm, in-house packed C18 silica columns (1.9 μm C18 beads, Dr. Maisch GmbH, Ammerbuch, Germany) coupled in line to a QExactive HF orbitrap mass spectrometer (ThermoFisher Scientific) using an electrospray ionization source.\n\n\nData Analysis Protocol\nAnalysis by liquid chromatography/tandem mass spectrometry (LC/MS2) was performed and peptide/spectrum matching as well as label free quantitation used the MaxQuant suite of algorithms against the human Uniprot database (canonical and isoforms; downloaded on 2020/02/05; 1888349 entries).\n\n\nComments"
  },
  {
    "objectID": "datasets/retentiontime/DLOmix_RT.html",
    "href": "datasets/retentiontime/DLOmix_RT.html",
    "title": "DLOmix",
    "section": "",
    "text": "Download\n   \n\n\nDataset Description\nThis is a direct subset of the ProteomeTools dataset with computed iRTs based on the PROCAL. The total data contains ~27.200 peptides and is mainly useful for teaching purposes - Training: Containing 27.160 peptides - Validation: Containing 6.800 peptides - Testing: Containing 6.000 peptides - Train/val: Containing 27.200 peptides.\n\n\nAttributes\n\ntitle: DLOmix deep learning in proteomics python framework for retention time\ndataset tag: retentiontime/DLOmix_RT\ndata publication: ProteomeTools\nmachine learning publication: Prosit\ndata source identifier: PXD004732\ndata type: retention time\nformat: CSV\ncolumns: peptide, sequence, iRT, calibrated, retention, time\ninstrument: Orbitrap Fusion ETD\norganism: Homo sapiens (human)\nvariable modification: unmodified\nchromatography separation: \npeak measurement: \n\n\n\nSample Protocol\nTryptic peptides were individually synthesized by solid phase synthesis, combined into pools of ~1,000 peptides and measured on an Orbitrap Fusion mass spectrometer. For each peptide pool, an inclusion list was generated to target peptides for fragmentation in further LC-MS experiments using five fragmentation methods (HCD, CID, ETD, EThCD, ETciD) with ion trap or Orbitrap readout and HCD spectra were recorded at 6 different collision energies.\n\n\nData Analysis Protocol\nThe ProteomeTools project aims to derive molecular and digital tools from the human proteome to facilitate biomedical and life science research. Here, we describe the generation and multimodal LC-MS/MS analysis of &gt;350,000 synthetic tryptic peptides representing nearly all canonical human gene products. This resource will be extended to 1.4 million peptides within two years and all data will be made available to the public in ProteomicsDB.\n\n\nComments\n\nInternal DLOmix tutorial \nDLOmix GitHub"
  },
  {
    "objectID": "datasets/retentiontime/ProteomeTools_RT.html",
    "href": "datasets/retentiontime/ProteomeTools_RT.html",
    "title": "ProteomeTools",
    "section": "",
    "text": "Downloads\n    \n\n\nDataset Descriptions\nThe full data contains 1.000.000 unmodified peptides and 200.000 oxidized peptides all with MaxQuant scores &gt; 100 (as described in Prosit paper) split into five groups.  - Small: Containing 100.000 unmodified peptides (good for teaching)  - Medium: Containing 250.000 unmodified peptides (good for validating)  - Large: Containing 1.000.000 unmodified peptides (good for training)  - Oxidized: Containing 200.000 all oxidized peptides.  - Mixed: Containing 200.000 oxidized and 150.000 unmodified peptides. \n\n\nAttributes\n\ntitle: ProteomeTools synthetic peptides and iRT calibrated retention times\ndataset tag: ProteomeTools_RT\ndata publication: ProteomeTools\nmachine learning publication: Prosit\ndata source identifier: PXD004732\ndata type: retention time\nformat: CSV\ncolumns: raw file, sequence, retention time, modified sequence, modifications\ninstrument: Orbitrap Fusion ETD\norganism: Homo sapiens (human)\nvariable modification: unmodified & oxidation\nchromatography separation: \npeak measurement: \n\n\n\nSample Protocol\nTryptic peptides were individually synthesized by solid phase synthesis, combined into pools of ~1,000 peptides and measured on an Orbitrap Fusion mass spectrometer. For each peptide pool, an inclusion list was generated to target peptides for fragmentation in further LC-MS experiments using five fragmentation methods (HCD, CID, ETD, EThCD, ETciD) with ion trap or Orbitrap readout and HCD spectra were recorded at 6 different collision energies.\n\n\nData Analysis Protocol\nThe ProteomeTools project aims to derive molecular and digital tools from the human proteome to facilitate biomedical and life science research. Here, we describe the generation and multimodal LC-MS/MS analysis of &gt;350,000 synthetic tryptic peptides representing nearly all canonical human gene products. This resource will be extended to 1.4 million peptides within two years and all data will be made available to the public in ProteomicsDB.\n\n\nComments"
  },
  {
    "objectID": "datasets/fragmentation/nist.html",
    "href": "datasets/fragmentation/nist.html",
    "title": "NIST Peptide libraries",
    "section": "",
    "text": "Downloads\n    \n\n\nDataset Description\nThe original dataset is 646 MB (zipped). After parsing the MSP library into a tabular format while only retaining peak intensities for singly charged b- and y-ions, it was randomly split into test (3.4 MB, 27 036 spectra) and train/validation subsets (30 MB, 243 404 spectra). Files with encoded peptides were processed for ML as described in the fragmentation tutorial NIST (part 2): Traditional ML: Gradient boosting.\n\n\nAttributes\n\ntitle: NIST\ndataset tag: fragmentation/nist\ndata publication: Sheetlin et al. 2020\nmachine learning publication: \ndata source identifier: \ndata type: fragmentation intensity\nformat: MSP\ncolumns: \ninstrument: \norganism: Homo sapiens (human)\nfixed modifications: Carbamidomethylation of C\nvariable modification: unmodified & Oxidation of M\ndissociation method: HCD (beam-type CID)\ncollision energy: various\nmass analyzer type: Orbitrap\nspectra encoding: \n\n\n\nSample Protocol\nSee chemdata.nist.gov for more information.\n\n\nData Analysis Protocol\nConsensus spectral libraries generated by NIST, the US National Institute of Standards and Technology.\n\n\nComments\n/"
  },
  {
    "objectID": "datasets/fragmentation/ProteomeTools_FI.html",
    "href": "datasets/fragmentation/ProteomeTools_FI.html",
    "title": "ProteomeTools",
    "section": "",
    "text": "Downloads\n \n\n\nDataset Description\nThe dataset has been divided up into training (4.87GB) and holdout (250 MB) of annotated ms2 spectra.\n\n\nAttributes\n\ntitle: ProteomeTools synthetic peptides\ndataset tag: fragmentation/ProteomeTools_FI\ndata publication: ProteomeTools\nmachine learning publication: Prosit\ndata source identifier: PXD004732\ndata type: fragmentation intensity\nformat: hdf5\ncolumns: sequence_integer, precursor_charge_onehot, intensities_raw, collision_energy_aligned_normed, collision_energy, precursor_charge sequence_maxquant, sequence_length\ninstrument: Orbitrap Fusion ETD\norganism: Homo sapiens (human)\nfixed modifications: \nvariable modification: unmodified\ndissociation method: CID and HCD\ncollision energy: 35 and 28\nmass analyzer type: ion and orbitrap\nspectra encoding: prosit annotation pipeline\n\n\n\nSample protocol description\nTryptic peptides were individually synthesized by solid phase synthesis, combined into pools of ~1,000 peptides and measured on an Orbitrap Fusion mass spectrometer. For each peptide pool, an inclusion list was generated to target peptides for fragmentation in further LC-MS experiments using five fragmentation methods (HCD, CID, ETD, EThCD, ETciD) with ion trap or Orbitrap readout and HCD spectra were recorded at 6 different collision energies.\n\n\nData analysis protocol\nThe ProteomeTools project aims to derive molecular and digital tools from the human proteome to facilitate biomedical and life science research. Here, we describe the generation and multimodal LC-MS/MS analysis of &gt;350,000 synthetic tryptic peptides representing nearly all canonical human gene products. This resource will be extended to 1.4 million peptides within two years and all data will be made available to the public in ProteomicsDB. LC-MS runs were individually analyzed using MaxQuant 1.5.3.30.\n\n\nComments\n\nSubset FigShare\nFull FigShare\nTrained Model FigShare"
  },
  {
    "objectID": "datasets/index.html",
    "href": "datasets/index.html",
    "title": "Datasets",
    "section": "",
    "text": "On ProteomicsML you will find datasets for beginners and experts in the field alike. Download, and explore the intricate nature of mass spectrometry data."
  },
  {
    "objectID": "datasets/index.html#detectability",
    "href": "datasets/index.html#detectability",
    "title": "Datasets",
    "section": "Detectability",
    "text": "Detectability\n\n\n\n\n\n\nTitle\n\n\nDate\n\n\n\n\n\n\nArabidopsis PeptideAtlas Light and Dark Proteome\n\n\nFeb 18, 2024\n\n\n\n\n\n\nNo matching items"
  },
  {
    "objectID": "datasets/index.html#fragmentation",
    "href": "datasets/index.html#fragmentation",
    "title": "Datasets",
    "section": "Fragmentation",
    "text": "Fragmentation\n\n\n\n\n\n\nTitle\n\n\nDate\n\n\n\n\n\n\nNIST Peptide libraries\n\n\nFeb 18, 2024\n\n\n\n\nProteomeTools\n\n\nFeb 18, 2024\n\n\n\n\n\n\nNo matching items"
  },
  {
    "objectID": "datasets/index.html#ion-mobility",
    "href": "datasets/index.html#ion-mobility",
    "title": "Datasets",
    "section": "Ion mobility",
    "text": "Ion mobility\n\n\n\n\n\n\nTitle\n\n\nDate\n\n\n\n\n\n\nMeier et al. TIMS\n\n\nFeb 18, 2024\n\n\n\n\nVan Puyvelde et al. TWIMS\n\n\nFeb 18, 2024\n\n\n\n\n\n\nNo matching items"
  },
  {
    "objectID": "datasets/index.html#retention-time",
    "href": "datasets/index.html#retention-time",
    "title": "Datasets",
    "section": "Retention time",
    "text": "Retention time\n\n\n\n\n\n\nTitle\n\n\nDate\n\n\n\n\n\n\nDLOmix\n\n\nFeb 18, 2024\n\n\n\n\nPXD028248\n\n\nFeb 18, 2024\n\n\n\n\nProteomeTools\n\n\nFeb 18, 2024\n\n\n\n\n\n\nNo matching items"
  },
  {
    "objectID": "datasets/ionmobility/VanPuyvelde_TWIMS.html",
    "href": "datasets/ionmobility/VanPuyvelde_TWIMS.html",
    "title": "Van Puyvelde et al. TWIMS",
    "section": "",
    "text": "Downloads\n\n\n\n\n\n\n\nDataset Description\nThe data consists of 6.268 PSMs.\n\n\nAttributes\n\ntitle: A comprehensive LFQ benchmark dataset on modern day acquisition strategies in proteomics\ndataset tag: ionmobility/VanPuyvelde_TWIMS\ndata publication: Scientific Data\nmachine learning publication: \ndata source identifier: PXD028735\ndata type: ion mobility\nformat: TSV\ncolumns: Modified sequence, Charge, CCS, Ion Mobility, Ion Mobility Units, High Energy, Ion Mobility Offset\ninstrument: maXis, timsTOF Pro,\norganism: Homo sapiens (Human), Saccharomyces cerevisiae (Baker’s yeast), Escherichia coli (E. coli)\nfixed modifications: \nvariable modification:unmodified & oxidation & acetylation & carbamidomethyl\nionmobility type: TWIMS\ncss calibration compounds: \n\n\n\nSample Protocol\nFrom the original paper:\nMass spectrometry-compatible Human K562 (P/N: V6951) and Yeast (P/N: V7461) protein digest extracts were purchased from Promega (Madison, Wisconsin, United States). Lyophilised MassPrep Escherichia.coli digest standard (P/N:186003196) was purchased from Waters Corporation (Milford, Massachusetts, United States). The extracts were reduced with dithiothreitol (DTT), alkylated with iodoacetamide (IAA) and digested with sequencing grade Trypsin(-Lys C) by the respective manufacturers. The digested protein extracts were reconstituted in a mixture of 0.1% Formic acid (FA) in water (Biosolve B.V, Valkenswaard, The Netherlands) and spiked with iRT peptides (Biognosys, Schlieren, Switzerland) at a ratio of 1:20 v/v. Two master samples A and B were created similar to Navarro et al., each in triplicate, as shown in Fig. 1. Sample A was prepared by mixing Human, Yeast and E.coli at 65%, 30% and 5% weight for weight (w/w), respectively. Sample B was prepared by mixing Human, Yeast and E.coli protein digests at 65%, 15%, 20% w/w, respectively. The resulting samples have logarithmic fold changes (log2FCs) of 0, −1 and 2 for respectively Human, Yeast and E.coli. One sixth of each of the triplicate master batches of A and B were mixed to create a QC sample, containing 65% w/w Human, 22.5% w/w Yeast and 12.5% w/w E.coli.\n\n\nData Analysis Protocol\nFrom the original paper:\nAn M-class LC system (Waters Corporation, Milford, MA) was equipped with a 1.7 µm CSH 130 C18 300 µm × 100 mm column, operating at 5 µL/min with a column temperature of 55 °C. Mobile phase A was UPLC-grade water containing 0.1% (v/v) FA and 3% DMSO, mobile phase B was ACN containing 0.1% (v/v) FA. Peptides were separated using a linear gradient of 3−30% mobile phase B over 120 minutes. All experiments were conducted on a Synapt G2-Si mass spectrometer (Waters Corporation, Wilmslow, UK). The ESI Low Flow probe capillary voltage was 3 kV, sampling cone 60 V, source offset 60 V, source temperature 80 °C, desolvation temperature 350 °C, cone gas 80 L/hr, desolvation gas 350 L/hr, and nebulizer pressure 2.5 bar. A lock mass reference signal of GluFibrinopeptide B (m/z 785.8426) was sampled every 30 s.\n\n\nComments:"
  },
  {
    "objectID": "datasets/ionmobility/Meier_TIMS.html",
    "href": "datasets/ionmobility/Meier_TIMS.html",
    "title": "Meier et al. TIMS",
    "section": "",
    "text": "Downloads\n\n\n\n\n\n\n\nDataset Description\nThe data consists of 718.917 PSMs.\n\n\nAttributes\n\ntitle: Deep learning the collisional cross-sections of the peptide universe from a million experimental values\ndataset tag: ionmobility/Meier_TIMS\ndata publication: MSP\nmachine learning publication: Nature Communications\ndata source identifier: PXD010012, PXD019086, PXD017703\ndata type: ion mobility\nformat: CSV\ncolumns: index, Modified sequence, Charge, Mass, Intensity, Retention time, CCS, PT\ninstrument: maXis, timsTOF Pro,\norganism: Homo sapiens (Human), Saccharomyces cerevisiae (Baker’s yeast)\nfixed modifications: \nvariable modification:unmodified & oxidation & acetylation & carbamidomethyl\nionmobility type: TIMS\ncss calibration compounds: \n\n\n\nSample Protocol\nIn bottom-up proteomics, peptides are separated by liquid chromatography with elution peak widths in the range of seconds, while mass spectra are acquired in about 100 microseconds with time-of-fight (TOF) instruments. This allows adding ion mobility as a third dimension of separation. Among several formats, trapped ion mobility spectrometry (TIMS) is attractive due to its small size, low voltage requirements and high efficiency of ion utilization. We have recently demonstrated a scan mode termed parallel accumulation – serial fragmentation (PASEF), which multiplies the sequencing speed without any loss in sensitivity (Meier et al., PMID: 26538118). Here we introduce the timsTOF Pro instrument, which optimally implements online PASEF. It features an orthogonal ion path into the ion mobility device, limiting the amount of debris entering the instrument and making it very robust in daily operation. We investigate different precursor selection schemes for shotgun proteomics to optimally allocate in excess of 100 fragmentation events per second. More than 800,000 fragmentation spectra in standard 120 min LC runs are easily achievable, which can be used for near exhaustive precursor selection in complex mixtures or re-sequencing weak precursors. MaxQuant identified more than 6,000 proteins in single run HeLa analyses without matching to a library, and with high quantitative reproducibility (R &gt; 0.97). Online PASEF achieves a remarkable sensitivity with more than 2,000 proteins identified in 30 min runs of only 10 ng HeLa digest. We also show that highly reproducible collisional cross sections can be acquired on a large scale (R &gt; 0.99). PASEF on the timsTOF Pro is a valuable addition to the technological toolbox in proteomics, with a number of unique operating modes that are only beginning to be explored.\n\n\nData Analysis Protocol\nMS raw files were analyzed with MaxQuant version 1.6.5.0, which extracts 4D isotope patterns (‘features’) and associated MS/MS spectra. The built-in search engine Andromeda74 was used to match observed fragment ions to theoretical peptide fragment ion masses derived from in silico digests of a reference proteome and a list of 245 potential contaminants using the appropriate digestion rules for each proteolytic enzyme (trypsin, LysC or LysN). We allowed a maximum of two missing values and required a minimum sequence length of 7 amino acids while limiting the maximum peptide mass to 4600 Da. Carbamidomethylation of cysteine was defined as a fixed modification, and oxidation of methionine and acetylation of protein N-termini were included in the search as variable modification. Reference proteomes for each organism including isoforms were accessed from UniProt (Homo sapiens: 91,618 entries, 2019/05; E. coli: 4403 entries, 2019/01; C. elegans: 28,403 entries, 2019/01; S. cerevisiae: 6049 entries, 2019/01; D. melanogaster: 23,304 entries, 2019/01). The synthetic peptide library (ProteomeTools54) was searched against the entire human reference proteome. The maximum mass tolerances were set to 20 and 40 ppm for precursor and fragment ions, respectively. False discovery rates were controlled at 1% on both the peptide spectrum match and protein level with a target-decoy approach. The analyses were performed separately for each organism and each set of synthetic peptides (‘proteotypic set’, ‘SRM atlas’, and ‘missing gene set’). To demonstrate the utility of CCS prediction, we re-analyzed three diaPASEF experiments from Meier et al.55 with Spectronaut 14.7.201007.47784 (Biognosys AG), replacing experimental ion mobility values in the spectral library with our predictions. Singly charged peptide precursors were excluded from this analysis as the neural network was exclusively trained with multiply charged peptides.\n\n\nComments"
  },
  {
    "objectID": "datasets/detectability/ArabidopsisLightDarkProteome.html",
    "href": "datasets/detectability/ArabidopsisLightDarkProteome.html",
    "title": "Arabidopsis PeptideAtlas Light and Dark Proteome",
    "section": "",
    "text": "Downloads\n\n\n\nDataset Description\nThe dataset contains 32.674 rows totalling 3.3 MB from the Arabidopsis PeptideAtlas build (http://www.peptideatlas.org/builds/arabidopsis/) we have extracted all the “canonical” proteins, which have been observed with at least 2 uniquely mapping peptides of length 9+AA and providing at least 18AA of coverage. We have also extracted “not observed” proteins that have no peptide detections (that pass PeptideAtlas’s stringent thresholds) at all. Physicochemical properties and RNA-seq-based properties are also computed and provided in the dataset.\n\n\nAttributes\n\ntitle: Arabidopsis PeptideAtlas Light and Dark Proteome\ndataset tag: detectability/ArabidopsisLightDarkProteome\ndata publication: Plant Cell\nmachine learning publication: None\ndata source identifier: 52 PXDs as listed at PeptideAtlas\ndata type: protein detectability\nformat: TSV\ncolumns: protein_identifier, gene_symbol, chromosome, number_of_observations, molecular_weight, gravy_score, isoelectric_point, rna_detected_percent, highest_tpm, protein_description\ninstrument: various\norganism: Arabidopsis thaliana (arabidopsis)\nfixed modifications: various\nvariable modification: various\ndissociation method: CID and HCD\ncollision energy: various\nmass analyzer type: various\n\n\n\nSample Protocol\nNo sample protocol is known for the dataset\n\n\nData analysis protocol\n52 public datasets were downloaded from ProteomeXchange repositories, processed through the PeptideAtlas processing pipeline, and protein categories were computed based on the ensemble data, as described in van Wijk et al. 2021. The number of observations is the number of peptide-spectrum matches in the PeptideAtlas build based on a threshold that aims for a 1% false dicovery rate at the protein level. The molecular weight, gravy score (hydrophobicity), and isoelectric point (pI) are computed in Python via the Pyteomics library. The RNA-seq-based values are computed based on a re-analysis of over 5000 RNA-seq samples as described in Kearly et al. (submitted). The metrics are the percentage of RNA-seq samples with a positive detection of transcripts corresponding to the protein (a measure of how pervasive the transcripts are), and the highest RNA abundance in transcripts per million (TPM) in the highest sample (a measure of the highest possibly abundance at least under some conditions).\n\n\nComments\n\nNone"
  },
  {
    "objectID": "index.html",
    "href": "index.html",
    "title": "Home",
    "section": "",
    "text": "ProteomicsML provides ready-made datasets for machine learning models accompanied by tutorials on how to work with even the most complex data types in the field of proteomics. The resource is set up to evolve together with the field, and we welcome everyone to contribute to the project by adding new datasets and accompanying notebooks.\nProteomicsML was set up as a joint effort of SDU, CompOmics, LUMC, PeptideAtlas, NIST, PRIDE, and MSAID. We believe that ProteomicsML is solid step forward for the field towards more open and reproducible science!\nWant to learn more about the project? Read our publication:\n\nProteomicsML: An Online Platform for Community-Curated Data Sets and Tutorials for Machine Learning in Proteomics. Tobias G. Rehfeldt*, Ralf Gabriels*, Robbin Bouwmeester*, Siegfried Gessulat, Benjamin A. Neely, Magnus Palmblad, Yasset Perez-Riverol, Tobias Schmidt, Juan Antonio Vizcaı́no§, and Eric W. Deutsch§. J. Proteome Res. 2023, 22, 2, 632–636. doi:10.1021/acs.jproteome.2c00629.\n\nLearn 📒 Explore all tutorials and datasets 🙏 Ask or answer questions about the tutorials in Tutorials Q&A\nDiscuss 📄 Discuss the existing datasets or the addition of a new dataset in Dataset Discussions 💬 Join the ProteomicsML General Discussions\nContribute 💡 Have an idea on how to improve the project? Open an issue 🧑‍🔧 Learn how to Contribute 🤝 Read the Code of Conduct"
  },
  {
    "objectID": "tutorials/retentiontime/index.html",
    "href": "tutorials/retentiontime/index.html",
    "title": "Retention time",
    "section": "",
    "text": "Title\n\n\nAuthor\n\n\nDate\n\n\n\n\n\n\nDLOmix embedding of Prosit model on ProteomeTools data\n\n\nTobias Greisager Rehfeldt\n\n\nSep 21, 2022\n\n\n\n\nManual embedding of Bi-LSTM model on ProteomeTools data\n\n\nTobias Greisager Rehfeldt\n\n\nSep 21, 2022\n\n\n\n\nPreparing a retention time data set for machine learning\n\n\nRobbin Bouwmeester\n\n\nSep 23, 2022\n\n\n\n\nTransfer learning with DeepLC\n\n\nRobbin Bouwmeester\n\n\nFeb 3, 2023\n\n\n\n\n\n\nNo matching items"
  },
  {
    "objectID": "tutorials/retentiontime/manual-prosit-rt.html",
    "href": "tutorials/retentiontime/manual-prosit-rt.html",
    "title": "Manual embedding of Bi-LSTM model on ProteomeTools data",
    "section": "",
    "text": "!pip install pandas==1.3.5 sklearn==0.0.post1 tensorflow==2.9.2 numpy==1.21.6 matplotlib==3.2.2 requests==2.23.0 --quiet\n\n\n# Import and normalize/standarize data\nimport pandas as pd\nimport numpy as np\n# Import and normalize the data\ndata = pd.read_csv('https://github.com/ProteomicsML/ProteomicsML/blob/main/datasets/retentiontime/ProteomeTools/small.zip?raw=true', compression='zip')\n\n# shuffle and split dataset into internal (80%) and external (20%) datasets\ndata = data.sample(frac=1)\ntest_data = data[int(len(data)*0.8):]\ndata = data[:int(len(data)*0.8)]\n\n\n# Split the internal dataset into training and validation\n# We have to split the data based on Sequences, to make sure we dont have cross-over sequences in the training and validation splits.\nunique_sequences = list(set(data['sequence']))\n# Shuffle the data to ensure unbiased data splitting\nfrom random import shuffle\nshuffle(unique_sequences)\n# Split sequence 80-10-10 training, validation and testing split\ntrain = unique_sequences[0:int(len(unique_sequences) * 0.8)]\nvalidation = unique_sequences[int(len(unique_sequences) * 0.8):]\n# Transfer the sequence split into data split\ntrain = data[data['sequence'].isin(train)]\nvalidation = data[data['sequence'].isin(validation)]\nprint('Training data points:', len(train),'  Validation data points:',  len(validation),'  Testing data points:',  len(test_data))\n# Here we use test as an external dataset unlike the one used for training.\n\nTraining data points: 64355   Validation data points: 15645   Testing data points: 20000\n\n\n\nnormalize = True\nif normalize:\n  # Normalize\n  train_val_min, train_val_max = min(train['retention time'].min(), validation['retention time'].min()), max(train['retention time'].max(), validation['retention time'].max())\n  train['retention time'] = list((train['retention time'] - train_val_min) / (train_val_max - train_val_min))\n  validation['retention time'] = list((validation['retention time'] - train_val_min) / (train_val_max - train_val_min))\n  test_data['retention time'] = list((test_data['retention time'] - test_data['retention time'].min()) / (test_data['retention time'].max() - test_data['retention time'].min()))\nelse:\n  # Standardize\n  train_val_mean, train_val_std = np.mean(list(train['retention time']) + list(validation['retention time'])), np.std(list(train['retention time']) + list(validation['retention time']))\n  train['retention time'] = (train['retention time'] - train_val_mean) / train_val_std\n  validation['retention time'] = (validation['retention time'] - train_val_mean) / train_val_std\n  test_data['retention time'] = (test_data['retention time'] - np.mean(test_data['retention time'])) / np.std(test_data['retention time'])\n\n/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:5: SettingWithCopyWarning: \nA value is trying to be set on a copy of a slice from a DataFrame.\nTry using .loc[row_indexer,col_indexer] = value instead\n\nSee the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n  \"\"\"\n/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:6: SettingWithCopyWarning: \nA value is trying to be set on a copy of a slice from a DataFrame.\nTry using .loc[row_indexer,col_indexer] = value instead\n\nSee the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n  \n\n\n\n# Setup parameters\nsequence_length = 30\nbatch_size = 64\nepochs=10\n\n\n# Manual sequence embedding\n# Remove sequences longer than our maximum sequence length\ntrain = train[train['sequence'].str.len()&lt;=sequence_length]\nvalidation = validation[validation['sequence'].str.len()&lt;=sequence_length]\ntest_data = test_data[test_data['sequence'].str.len()&lt;=sequence_length]\n\n# Create an alphabet to convert from string to numeric\nAA_alphabet = {\"A\": 1, \"C\": 2, \"D\": 3, \"E\": 4, \"F\": 5, \"G\": 6, \"H\": 7, \"I\": 8, \"K\": 9, \"L\": 10, \"M\": 11, \"N\": 12, \"P\": 13, \"Q\": 14, \"R\": 15, \"S\": 16, \"T\": 17, \"V\": 18, \"W\": 19, \"Y\": 20}\n# Convert sequences from string to numberic\nembedded_sequences_train = [[AA_alphabet[g] for g in f] for f in train['sequence']]\nembedded_sequences_validation = [[AA_alphabet[g] for g in f] for f in validation['sequence']]\nembedded_sequences_test = [[AA_alphabet[g] for g in f] for f in test_data['sequence']]\n\n# Make sure every sequence is the same length\nfrom tensorflow.keras.preprocessing.sequence import pad_sequences\nembedded_sequences_train = pad_sequences(sequences=embedded_sequences_train, maxlen=sequence_length)\nembedded_sequences_validation = pad_sequences(sequences=embedded_sequences_validation, maxlen=sequence_length)\nembedded_sequences_test = pad_sequences(sequences=embedded_sequences_test, maxlen=sequence_length)\n\n\n# Import the needed layers and tensorflow model requirements\nfrom tensorflow.keras.layers import Dense, Embedding, LSTM, Input, Concatenate, Bidirectional, Dropout\nfrom tensorflow.keras.models import Model\n\ninputs = Input(shape=(sequence_length,), name='Input')\n# Embed the sequnces in a 20 x 8 matrix\ninput_embedding = Embedding(input_dim=len(AA_alphabet)+2, output_dim=8, name='Sequence_Embedding')(inputs)\nx = Bidirectional(LSTM(32, return_sequences=True), name='Bi_LSTM_1')(input_embedding)\nx = Dropout(0.25, name='LSTM_Dropout')(x)\nx = Bidirectional(LSTM(32), name='Bi_LSTM_2')(x)\noutput = Dense(1, activation=\"linear\", name='Output')(x)\nmodel = Model(inputs, output)\nmodel.summary()\n\nModel: \"model\"\n_________________________________________________________________\n Layer (type)                Output Shape              Param #   \n=================================================================\n Input (InputLayer)          [(None, 30)]              0         \n                                                                 \n Sequence_Embedding (Embeddi  (None, 30, 8)            176       \n ng)                                                             \n                                                                 \n Bi_LSTM_1 (Bidirectional)   (None, 30, 64)            10496     \n                                                                 \n LSTM_Dropout (Dropout)      (None, 30, 64)            0         \n                                                                 \n Bi_LSTM_2 (Bidirectional)   (None, 64)                24832     \n                                                                 \n Output (Dense)              (None, 1)                 65        \n                                                                 \n=================================================================\nTotal params: 35,569\nTrainable params: 35,569\nNon-trainable params: 0\n_________________________________________________________________\n\n\n\nimport tensorflow as tf\n# Compiling the keras model with loss function, metrics and optimizer\nmodel.compile(loss='mse', metrics=['mae'], optimizer=tf.keras.optimizers.Adam(learning_rate=0.005))\n# Train the model\nhistory = model.fit(x=embedded_sequences_train, y=train['retention time'], epochs=epochs,\n                    batch_size=batch_size, validation_data=(embedded_sequences_validation, validation['retention time']))\n\nEpoch 1/10\n1004/1004 [==============================] - 30s 16ms/step - loss: 0.0078 - mae: 0.0571 - val_loss: 0.0039 - val_mae: 0.0439\nEpoch 2/10\n1004/1004 [==============================] - 13s 13ms/step - loss: 0.0036 - mae: 0.0399 - val_loss: 0.0033 - val_mae: 0.0400\nEpoch 3/10\n1004/1004 [==============================] - 13s 13ms/step - loss: 0.0029 - mae: 0.0349 - val_loss: 0.0027 - val_mae: 0.0323\nEpoch 4/10\n1004/1004 [==============================] - 13s 13ms/step - loss: 0.0026 - mae: 0.0330 - val_loss: 0.0023 - val_mae: 0.0302\nEpoch 5/10\n1004/1004 [==============================] - 14s 14ms/step - loss: 0.0023 - mae: 0.0311 - val_loss: 0.0020 - val_mae: 0.0286\nEpoch 6/10\n1004/1004 [==============================] - 13s 13ms/step - loss: 0.0022 - mae: 0.0301 - val_loss: 0.0020 - val_mae: 0.0292\nEpoch 7/10\n1004/1004 [==============================] - 13s 13ms/step - loss: 0.0022 - mae: 0.0293 - val_loss: 0.0019 - val_mae: 0.0273\nEpoch 8/10\n1004/1004 [==============================] - 13s 13ms/step - loss: 0.0020 - mae: 0.0286 - val_loss: 0.0024 - val_mae: 0.0312\nEpoch 9/10\n1004/1004 [==============================] - 13s 13ms/step - loss: 0.0020 - mae: 0.0280 - val_loss: 0.0020 - val_mae: 0.0277\nEpoch 10/10\n1004/1004 [==============================] - 13s 13ms/step - loss: 0.0019 - mae: 0.0274 - val_loss: 0.0018 - val_mae: 0.0262\n\n\n\nimport matplotlib.pyplot as plt\n# Plotting the training history\nplt.plot(range(epochs), history.history['loss'], '-', color='r', label='Training loss')\nplt.plot(range(epochs), history.history['val_loss'], '--', color='r', label='Validation loss')\nplt.title(f'Training and validation loss across epochs')\nplt.xlabel('Epochs')\nplt.ylabel('Loss')\nplt.legend()\nplt.show()\n\n\n\n\n\n# Initially we trained on just one gradient, to transfer this model to external datasets,\n# we refine the model by using the model we just trained as a pre-trained model, and then further train it with the test/external dataset\nhistory = model.fit(x=embedded_sequences_test, y=test_data['retention time'], epochs=epochs, batch_size=batch_size)\n# The model can now be used for other datasets with the same gradient set-up\n# We then plot the history of this model, and see the initial performance is much better,\n# as the model already has some gradient agnostic knowledge, and it simply has to learn the new gradients\nplt.plot(range(epochs), history.history['loss'], '-', color='r', label='Training loss')\nplt.title(f'Training and validation loss of the refined model')\nplt.xlabel('Epochs')\nplt.ylabel('Loss')\nplt.legend()\nplt.show()\n\nEpoch 1/10\n312/312 [==============================] - 4s 12ms/step - loss: 0.0023 - mae: 0.0291\nEpoch 2/10\n312/312 [==============================] - 4s 12ms/step - loss: 0.0022 - mae: 0.0287\nEpoch 3/10\n312/312 [==============================] - 4s 12ms/step - loss: 0.0021 - mae: 0.0285\nEpoch 4/10\n312/312 [==============================] - 4s 12ms/step - loss: 0.0021 - mae: 0.0281\nEpoch 5/10\n312/312 [==============================] - 4s 12ms/step - loss: 0.0020 - mae: 0.0276\nEpoch 6/10\n312/312 [==============================] - 4s 12ms/step - loss: 0.0019 - mae: 0.0272\nEpoch 7/10\n312/312 [==============================] - 4s 12ms/step - loss: 0.0020 - mae: 0.0277\nEpoch 8/10\n312/312 [==============================] - 4s 12ms/step - loss: 0.0019 - mae: 0.0271\nEpoch 9/10\n312/312 [==============================] - 4s 12ms/step - loss: 0.0020 - mae: 0.0278\nEpoch 10/10\n312/312 [==============================] - 4s 12ms/step - loss: 0.0019 - mae: 0.0272"
  },
  {
    "objectID": "tutorials/retentiontime/dlomix-prosit-rt.html",
    "href": "tutorials/retentiontime/dlomix-prosit-rt.html",
    "title": "DLOmix embedding of Prosit model on ProteomeTools data",
    "section": "",
    "text": "!pip install pandas==1.3.5 sklearn==0.0.post1 tensorflow==2.9.2 dlomix==0.0.3 numpy==1.21.6 matplotlib==3.2.2 requests==2.23.0 --quiet\n\n\n# Import and normalize/standarize data\nimport pandas as pd\nimport numpy as np\n# Import and normalize the data\ndata = pd.read_csv('https://github.com/ProteomicsML/ProteomicsML/blob/main/datasets/retentiontime/ProteomeTools/small.zip?raw=true', compression='zip')\n\n# shuffle and split dataset into internal (80%) and external (20%) datasets\ndata = data.sample(frac=1)\ntest_data = data[int(len(data)*0.8):]\ndata = data[:int(len(data)*0.8)]\n\n\n# Split the internal dataset into training and validation\n# We have to split the data based on Sequences, to make sure we dont have cross-over sequences in the training and validation splits.\nunique_sequences = list(set(data['sequence']))\n# Shuffle the data to ensure unbiased data splitting\nfrom random import shuffle\nshuffle(unique_sequences)\n# Split sequence 80-10-10 training, validation and testing split\ntrain = unique_sequences[0:int(len(unique_sequences) * 0.8)]\nvalidation = unique_sequences[int(len(unique_sequences) * 0.8):]\n# Transfer the sequence split into data split\ntrain = data[data['sequence'].isin(train)]\nvalidation = data[data['sequence'].isin(validation)]\nprint('Training data points:', len(train),'  Validation data points:',  len(validation),'  Testing data points:',  len(test_data))\n# Here we use test as an external dataset unlike the one used for training.\n\nTraining data points: 63955   Validation data points: 16045   Testing data points: 20000\n\n\n\nnormalize = True\nif normalize:\n  # Normalize\n  train_val_min, train_val_max = min(train['retention time'].min(), validation['retention time'].min()), max(train['retention time'].max(), validation['retention time'].max())\n  train['retention time'] = list((train['retention time'] - train_val_min) / (train_val_max - train_val_min))\n  validation['retention time'] = list((validation['retention time'] - train_val_min) / (train_val_max - train_val_min))\n  test_data['retention time'] = list((test_data['retention time'] - test_data['retention time'].min()) / (test_data['retention time'].max() - test_data['retention time'].min()))\nelse:\n  # Standardize\n  train_val_mean, train_val_std = np.mean(list(train['retention time']) + list(validation['retention time'])), np.std(list(train['retention time']) + list(validation['retention time']))\n  train['retention time'] = (train['retention time'] - train_val_mean) / train_val_std\n  validation['retention time'] = (validation['retention time'] - train_val_mean) / train_val_std\n  test_data['retention time'] = (test_data['retention time'] - np.mean(test_data['retention time'])) / np.std(test_data['retention time'])\n\n/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:5: SettingWithCopyWarning: \nA value is trying to be set on a copy of a slice from a DataFrame.\nTry using .loc[row_indexer,col_indexer] = value instead\n\nSee the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n  \"\"\"\n/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:6: SettingWithCopyWarning: \nA value is trying to be set on a copy of a slice from a DataFrame.\nTry using .loc[row_indexer,col_indexer] = value instead\n\nSee the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n  \n\n\n\n# Setup parameters\nsequence_length = 30\nbatch_size = 64\nepochs=5\n\n\n# Setup data\nfrom dlomix.data import RetentionTimeDataset\ntrain_input = RetentionTimeDataset(data_source=tuple([np.array(train['sequence']), np.array(train['retention time'])]),\n                                        seq_length=sequence_length, batch_size=batch_size, test=False).train_data\n\nval_input = RetentionTimeDataset(data_source=tuple([np.array(validation['sequence']), np.array(validation['retention time'])]),\n                                        seq_length=sequence_length, batch_size=batch_size, test=False).train_data\n\ntest_input = RetentionTimeDataset(data_source=tuple([np.array(test_data['sequence']), np.array(test_data['retention time'])]),\n                                        seq_length=sequence_length, batch_size=batch_size, test=False).train_data\n\n# Setup PROSIT model from DLOmix\nfrom dlomix.models.prosit import PrositRetentionTimePredictor\nmodel = PrositRetentionTimePredictor(seq_length=sequence_length)\nmodel.build((None, sequence_length))\nmodel.summary()\n\nModel: \"prosit_retention_time_predictor_2\"\n_________________________________________________________________\n Layer (type)                Output Shape              Param #   \n=================================================================\n string_lookup_2 (StringLook  multiple                 0         \n up)                                                             \n                                                                 \n embedding_2 (Embedding)     multiple                  352       \n                                                                 \n sequential_4 (Sequential)   (None, 30, 512)           1996800   \n                                                                 \n attention_layer_2 (Attentio  multiple                 542       \n nLayer)                                                         \n                                                                 \n sequential_5 (Sequential)   (None, 512)               262656    \n                                                                 \n dense_5 (Dense)             multiple                  513       \n                                                                 \n=================================================================\nTotal params: 2,260,863\nTrainable params: 2,260,863\nNon-trainable params: 0\n_________________________________________________________________\n\n\n\nfrom dlomix.eval.rt_eval import TimeDeltaMetric\nimport tensorflow as tf\n# Compiling the keras model with loss function, metrics and optimizer\nmodel.compile(loss='mse', metrics=['mae', TimeDeltaMetric()], optimizer=tf.keras.optimizers.Adam(learning_rate=0.005))\n# Train the model\nhistory = model.fit(x=train_input, epochs=epochs, batch_size=batch_size, validation_data=val_input)\n\nEpoch 1/5\n998/998 [==============================] - 26s 22ms/step - loss: 0.6175 - mae: 0.1161 - timedelta: 0.1140 - val_loss: 0.0040 - val_mae: 0.0427 - val_timedelta: 0.0484\nEpoch 2/5\n998/998 [==============================] - 21s 21ms/step - loss: 0.0055 - mae: 0.0526 - timedelta: 0.0522 - val_loss: 0.0038 - val_mae: 0.0428 - val_timedelta: 0.0467\nEpoch 3/5\n998/998 [==============================] - 21s 21ms/step - loss: 0.0047 - mae: 0.0474 - timedelta: 0.0464 - val_loss: 0.0039 - val_mae: 0.0459 - val_timedelta: 0.0480\nEpoch 4/5\n998/998 [==============================] - 21s 21ms/step - loss: 0.6041 - mae: 0.2064 - timedelta: 0.1935 - val_loss: 0.0537 - val_mae: 0.1940 - val_timedelta: 0.1972\nEpoch 5/5\n998/998 [==============================] - 21s 21ms/step - loss: 0.0544 - mae: 0.1961 - timedelta: 0.1900 - val_loss: 0.0536 - val_mae: 0.1943 - val_timedelta: 0.1967\n\n\n\nfrom dlomix.reports import RetentionTimeReport\nreport = RetentionTimeReport(output_path=\"./output\", history=history)\n\n\nreport.plot_keras_metric(\"loss\")\n\n\n\n\n\nreport.plot_keras_metric(\"timedelta\")\n\n\n\n\n\ny_real = np.concatenate([y for x, y in val_input], axis=0)\ny_pred = model.predict(validation['sequence'][:len(y_real)])\nreport.plot_residuals(y_real, y_pred, xrange=(-1, 1))\n\n501/501 [==============================] - 3s 3ms/step\n\n\n\n\n\n\nhistory = model.fit(x=test_input, epochs=epochs, batch_size=batch_size)\nimport matplotlib.pyplot as plt\nplt.plot(range(epochs), history.history['loss'], '-', color='r', label='Training loss')\nplt.title(f'Training and validation loss of the refined model')\nplt.xlabel('Epochs')\nplt.ylabel('Loss')\nplt.legend()\nplt.show()\n\nEpoch 1/5\n312/312 [==============================] - 6s 19ms/step - loss: 0.0560 - mae: 0.1987 - timedelta: 0.1993\nEpoch 2/5\n312/312 [==============================] - 6s 19ms/step - loss: 0.0559 - mae: 0.1986 - timedelta: 0.1987\nEpoch 3/5\n312/312 [==============================] - 6s 19ms/step - loss: 0.0559 - mae: 0.1985 - timedelta: 0.1988\nEpoch 4/5\n312/312 [==============================] - 6s 19ms/step - loss: 0.0559 - mae: 0.1985 - timedelta: 0.1991\nEpoch 5/5\n312/312 [==============================] - 6s 19ms/step - loss: 0.0559 - mae: 0.1985 - timedelta: 0.1982"
  },
  {
    "objectID": "tutorials/retentiontime/mq-evidence-to-ml.html#reading-and-formatting-input-data",
    "href": "tutorials/retentiontime/mq-evidence-to-ml.html#reading-and-formatting-input-data",
    "title": "Preparing a retention time data set for machine learning",
    "section": "Reading and formatting input data",
    "text": "Reading and formatting input data\nWe will not need all the columns, define those that might be useful:\n\nsel_columns = ['Raw file', 'Sequence', 'Modifications', 'Modified sequence',\n               'Retention time','Calibrated retention time', 'PEP']\n\nRead the input files, here a csv. If you read the standard txt you need to modify the read_csv with:\npd.read_csv(\"evid_files/PXD028248_evidence_selected_columns.csv\",sep=\"\\t\",low_memory=False)\nFill all the NA values with 0.0 and filter on only the most confident identifications (PEP &lt;= 0.001).\n\nevid_df = pd.read_csv(\"https://github.com/ProteomicsML/ProteomicsML/blob/main/datasets/retentiontime/PXD028248/PXD028248_evidence_selected_columns.zip?raw=true\",compression=\"zip\",low_memory=False)\nevid_df.fillna(0.0,inplace=True)\nevid_df = evid_df[evid_df[\"PEP\"] &lt;= 0.001][sel_columns]\n\nThe file in a pandas dataframe looks like this:\n\nevid_df\n\n\n\n\n\n\n\n\nRaw file\nSequence\nLength\nModifications\nModified sequence\nRetention time\nRetention length\nCalibrated retention time\nCalibrated retention time start\nCalibrated retention time finish\nRetention time calibration\nMatch time difference\nIntensity\nPEP\n\n\n\n\n0\n20191028_SJ_QEx_LC1200_4_Sommerfeld_OC218_ADP_...\nAAAAAAAAAAGAAGGR\n16\nAcetyl (Protein N-term)\n_(Acetyl (Protein N-term))AAAAAAAAAAGAAGGR_\n90.531\n0.58251\n90.531\n90.202\n90.784\n0.000000e+00\n0.0\n40715000.0\n9.822100e-15\n\n\n1\n20191126_SJ_QEx_LC1200_4_Sommerfeld_OC218_Tumo...\nAAAAAAAAAAGAAGGR\n16\nAcetyl (Protein N-term)\n_(Acetyl (Protein N-term))AAAAAAAAAAGAAGGR_\n97.132\n0.51271\n97.132\n96.824\n97.337\n0.000000e+00\n0.0\n19359000.0\n4.269700e-21\n\n\n2\n20191129_SJ_QEx_LC1200_4_Sommerfeld_OC193_Tumo...\nAAAAAAAAAAGAAGGR\n16\nAcetyl (Protein N-term)\n_(Acetyl (Protein N-term))AAAAAAAAAAGAAGGR_\n97.495\n0.82283\n97.495\n96.848\n97.671\n0.000000e+00\n0.0\n173850000.0\n1.198900e-42\n\n\n3\n20191129_SJ_QEx_LC1200_4_Sommerfeld_OC193_Tumo...\nAAAAAAAAAAGAAGGR\n16\nAcetyl (Protein N-term)\n_(Acetyl (Protein N-term))AAAAAAAAAAGAAGGR_\n96.580\n0.46060\n96.580\n96.321\n96.781\n0.000000e+00\n0.0\n10126000.0\n3.280800e-05\n\n\n4\n20191204_SJ_QEx_LC1200_4_Sommerfeld_OC217_Tumo...\nAAAAAAAAAAGAAGGR\n16\nAcetyl (Protein N-term)\n_(Acetyl (Protein N-term))AAAAAAAAAAGAAGGR_\n91.611\n0.58345\n91.611\n91.341\n91.924\n0.000000e+00\n0.0\n16703000.0\n4.950100e-17\n\n\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n\n\n1056802\n20191204_SJ_QEx_LC1200_4_Sommerfeld_OC189_Tumo...\nYYYVCQYCPAGNWANR\n16\nUnmodified\n_YYYVCQYCPAGNWANR_\n93.503\n0.45542\n93.503\n93.299\n93.754\n0.000000e+00\n0.0\n4054800.0\n8.095500e-04\n\n\n1056803\n20191204_SJ_QEx_LC1200_4_Sommerfeld_OC196_Tumo...\nYYYVCQYCPAGNWANR\n16\nUnmodified\n_YYYVCQYCPAGNWANR_\n93.772\n0.57550\n93.772\n93.492\n94.067\n0.000000e+00\n0.0\n13780000.0\n2.618200e-04\n\n\n1056804\n20191204_SJ_QEx_LC1200_4_Sommerfeld_OC217_Tumo...\nYYYVCQYCPAGNWANR\n16\nUnmodified\n_YYYVCQYCPAGNWANR_\n93.183\n0.60296\n93.183\n92.890\n93.493\n-1.421100e-14\n0.0\n9741300.0\n1.367500e-06\n\n\n1056805\n20191210_SJ_QEx_LC1200_4_Sommerfeld_OC221_Tumo...\nYYYVCQYCPAGNWANR\n16\nUnmodified\n_YYYVCQYCPAGNWANR_\n95.546\n0.50088\n95.546\n95.292\n95.793\n-1.421100e-14\n0.0\n7791200.0\n1.803700e-11\n\n\n1056806\n20200229_SJ_QEx_LC1200_4_Sommerfeld_OC221_CAF_...\nYYYVCQYCPAGNWANR\n16\nUnmodified\n_YYYVCQYCPAGNWANR_\n69.370\n0.21532\n69.370\n69.250\n69.466\n0.000000e+00\n0.0\n6157000.0\n7.497600e-04\n\n\n\n\n1056807 rows × 14 columns\n\n\n\nAs you can see in this example there are many of the same peptidoforms (minus charge) for the different runs. We will want to create a single value for each peptidoform per run in a matrix instead of a single peptidoform+run combo per row.\n\nretention_dict = {}\n\n# Group by the raw file\nfor gidx,g in evid_df.groupby(\"Raw file\"):\n    # Group by peptidoform and take the mean for each group\n    retention_dict[gidx]  = g.groupby(\"Modified sequence\").mean()[\"Calibrated retention time\"].to_dict()\n\n#Transform the dictionary in a df where each row is a peptidoform and each column a run\nretention_df = pd.DataFrame(retention_dict)\n\nretention_df\n\n\n\n\n\n\n\n\n20191028_SJ_QEx_LC1200_4_Sommerfeld_0Wert_ADP_73\n20191028_SJ_QEx_LC1200_4_Sommerfeld_0Wert_ADP_74\n20191028_SJ_QEx_LC1200_4_Sommerfeld_0Wert_ADP_75\n20191028_SJ_QEx_LC1200_4_Sommerfeld_0Wert_ADP_76\n20191028_SJ_QEx_LC1200_4_Sommerfeld_0Wert_ADP_77\n20191028_SJ_QEx_LC1200_4_Sommerfeld_0Wert_ADP_78\n20191028_SJ_QEx_LC1200_4_Sommerfeld_0Wert_ADP_79\n20191028_SJ_QEx_LC1200_4_Sommerfeld_0Wert_ADP_80\n20191028_SJ_QEx_LC1200_4_Sommerfeld_0Wert_ADP_81\n20191028_SJ_QEx_LC1200_4_Sommerfeld_0Wert_HPMC_FCS_82\n...\n20200319_SJ_QEx_LC1200_4_Sommerfeld_OC221_HPMC_50Asc_522\n20200319_SJ_QEx_LC1200_4_Sommerfeld_OC222_HPMC_50Asc_523\n20200319_SJ_QEx_LC1200_4_Sommerfeld_OC222_HPMC_50Asc_524\n20200319_SJ_QEx_LC1200_4_Sommerfeld_OC222_HPMC_50Asc_525\n20200319_SJ_QEx_LC1200_4_Sommerfeld_OC222_HPMC_50Asc_526\n20200319_SJ_QEx_LC1200_4_Sommerfeld_OC222_HPMC_50Asc_527\n20200319_SJ_QEx_LC1200_4_Sommerfeld_OC222_HPMC_50Asc_528\n20200319_SJ_QEx_LC1200_4_Sommerfeld_OC222_HPMC_50Asc_529\n20200319_SJ_QEx_LC1200_4_Sommerfeld_OC222_HPMC_50Asc_530\n20200319_SJ_QEx_LC1200_4_Sommerfeld_OC222_HPMC_50Asc_531\n\n\n\n\n_(Acetyl (Protein N-term))ACGLVASNLNLKPGECLR_\n106.540\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n...\nNaN\nNaN\n73.638\nNaN\nNaN\nNaN\n72.596\nNaN\nNaN\nNaN\n\n\n_(Acetyl (Protein N-term))AEEGIAAGGVM(Oxidation (M))DVNTALQEVLK_\n138.075\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n...\nNaN\n121.063333\n121.605\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n_(Acetyl (Protein N-term))AGWNAYIDNLM(Oxidation (M))ADGTCQDAAIVGYK_\n136.850\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n...\nNaN\n107.397250\n107.130\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n_(Acetyl (Protein N-term))SDAAVDTSSEITTK_\n62.757\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n...\nNaN\n34.013000\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n_AADDTWEPFASGK_\n90.319\n89.749\nNaN\n88.939\nNaN\nNaN\nNaN\nNaN\nNaN\n105.24\n...\nNaN\n55.282000\n54.873\n52.801\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n\n\n_TALLTWTEPPVR_\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n...\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n76.442\n\n\n_TQFNNNEYSQDLDAYNTKDK_\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n...\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n45.327\n\n\n_VATGTDLLSGTR_\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n...\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n36.155\n\n\n_VNWMPPPSR_\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n...\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n40.657\n\n\n_VTDIDSDDHQVM(Oxidation (M))YIM(Oxidation (M))K_\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n...\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n32.140\n\n\n\n\n49983 rows × 441 columns\n\n\n\nWe can than have a look at the absence of each peptidoform in all runs (value = absence in that many runs):\n\nprevelence_peptides = retention_df.isna().sum(axis=1)\nprint(prevelence_peptides)\n\n_(Acetyl (Protein N-term))ACGLVASNLNLKPGECLR_                          385\n_(Acetyl (Protein N-term))AEEGIAAGGVM(Oxidation (M))DVNTALQEVLK_       408\n_(Acetyl (Protein N-term))AGWNAYIDNLM(Oxidation (M))ADGTCQDAAIVGYK_    397\n_(Acetyl (Protein N-term))SDAAVDTSSEITTK_                              430\n_AADDTWEPFASGK_                                                        359\n                                                                      ... \n_TALLTWTEPPVR_                                                         440\n_TQFNNNEYSQDLDAYNTKDK_                                                 440\n_VATGTDLLSGTR_                                                         440\n_VNWMPPPSR_                                                            440\n_VTDIDSDDHQVM(Oxidation (M))YIM(Oxidation (M))K_                       440\nLength: 49983, dtype: int64\n\n\nWe can penalize the absence the absence of highly abundant peptidoforms per run (lower = more abundant peptidoforms present) by taking the dot product of presence/absence in the matrix and the above absence scores:\n\nscore_per_run = retention_df.isna().astype(int).T.dot(prevelence_peptides)\nscore_per_run.sort_values(ascending=True)\n\n20200319_SJ_QEx_LC1200_4_Sommerfeld_OC186_HPMC_50Asc_498     18417985\n20200319_SJ_QEx_LC1200_4_Sommerfeld_OC186_HPMC_50Asc_499     18633580\n20200319_SJ_QEx_LC1200_4_Sommerfeld_OC186_HPMC_50Asc_502     18802678\n20200319_SJ_QEx_LC1200_4_Sommerfeld_OC186_HPMC_50Asc_501     18807279\n20191129_SJ_QEx_LC1200_4_Sommerfeld_OC193_Tumor_50Asc_138    18877939\n                                                               ...   \n20191028_SJ_QEx_LC1200_4_Sommerfeld_0Wert_HPMC_FCS_85        21261614\n20191121_SJ_QEx_LC1200_4_Sommerfeld_0Wert_CAF_FCS_100        21262661\n20191028_SJ_QEx_LC1200_4_Sommerfeld_0Wert_HPMC_FCS_83        21263993\n20191028_SJ_QEx_LC1200_4_Sommerfeld_0Wert_HPMC_FCS_82        21265683\n20191028_SJ_QEx_LC1200_4_Sommerfeld_0Wert_HPMC_FCS_87        21268762\nLength: 441, dtype: int64\n\n\nWe will use a single run to align all the first experiments against, this is the one with the lowest penalty score:\n\nrun_highest_overlap_score = score_per_run.sort_values(ascending=True).index[0]"
  },
  {
    "objectID": "tutorials/retentiontime/mq-evidence-to-ml.html#retention-time-alignment-between-runs",
    "href": "tutorials/retentiontime/mq-evidence-to-ml.html#retention-time-alignment-between-runs",
    "title": "Preparing a retention time data set for machine learning",
    "section": "Retention time alignment between runs",
    "text": "Retention time alignment between runs\nThe first step after reading and loading the data is to align retention times between runs. Here we will use splines in a GAM for that. The algorithm below follows these steps:\n\nIterate over all runs, sorted by the earlier defined penalty score\nObtain the overlapping peptidoforms between runs\nIf there are less than 20 peptidoforms skip that run\nDivide the overlapping peptides into equidistant bins and enforce a percentage of the bins to be filled with a least one peptidoform (now 200 bins and 75 % occupancy). If requirements are not met skip that run.\nFit the GAM with splines between the reference set and the selected run\nCalculate the error between aligned and values in the reference set. If selected it will run a second stage of the GAM filtering out any data points that were selected to have an error that is too high\nAssign aligned values to a new matrix\nChange the reference dataset to be the median of all aligned runs and the initial reference run\n\nIn the next code block we will define two kinds of plots, first a performance scatter plot. Here we plot the retention time of the selected set against the reference set; before and after alignment. Next is the residual plot that subtracts the diagonal from the performance scatter plot and essentially shows the errors before and after alignment. The residual plot is generated for both the first and second stage GAM.\n\ndef plot_performance(retention_df,run_highest_overlap_score,align_name,non_na_sel):\n        plt.scatter(\n                    retention_df[run_highest_overlap_score][non_na_sel],\n                    retention_df[align_name][non_na_sel],\n                    alpha=0.05,\n                    s=10,\n                    label=\"Reference+selected set unaligned\"\n                )\n\n        plt.scatter(\n                    retention_df[run_highest_overlap_score][non_na_sel],\n                    gam_model_cv.predict(retention_df[align_name][non_na_sel]),\n                    alpha=0.05,\n                    s=10,\n                    label=\"Reference+selected set aligned\"\n                )\n        plt.plot(\n            [\n            min(retention_df[run_highest_overlap_score][non_na_sel]),\n            max(retention_df[run_highest_overlap_score][non_na_sel])\n\n            ],\n            [\n                min(retention_df[run_highest_overlap_score][non_na_sel]),\n                max(retention_df[run_highest_overlap_score][non_na_sel])\n\n            ],\n            c=\"black\",\n            linestyle=\"--\",\n            linewidth=1.0\n        )\n        plt.xlabel(\"Retention time reference set\")\n        plt.ylabel(\"Retention time selected set\")\n        leg = plt.legend()\n        for lh in leg.legendHandles:\n            lh.set_alpha(1)\n\n        plt.show()\n\n\ndef plot_residual(run_highest_overlap_score,align_name,non_na_sel,title=\"Residual plot\"):\n        plt.scatter(\n                    retention_df[run_highest_overlap_score][non_na_sel],\n                    retention_df[align_name][non_na_sel]-retention_df[run_highest_overlap_score][non_na_sel],\n                    alpha=0.05,\n                    s=10\n                )\n\n        plt.scatter(\n                    retention_df[run_highest_overlap_score][non_na_sel],\n                    gam_model_cv.predict(retention_df[align_name][non_na_sel])-retention_df[run_highest_overlap_score][non_na_sel],\n                    alpha=0.05,\n                    s=10\n                )\n\n        plt.title(title)\n\n        plt.axhline(\n            y = 0.0,\n            color = \"black\",\n            linewidth=1.0,\n            linestyle = \"--\"\n        )\n\n        plt.ylabel(\"Residual\")\n        plt.xlabel(\"Retention time reference\")\n\n        plt.show()\n\n\n#constraints = \"monotonic_inc\"\nconstraints = \"none\"\n\n# Align parameters\nperform_second_stage_robust = True\nerror_filter_perc = 0.005\nnum_splines = 150\nmin_coverage = 0.75\ncoverage_div = 200\nplot_res_every_n = 100\nmin_overlap = 20\n\nrun_highest_overlap_score = score_per_run.sort_values(ascending=True).index[0]\n\nunique_peptides = []\nunique_peptides.extend(list(retention_df[retention_df[run_highest_overlap_score].notna()].index))\n\nretention_df_aligned = retention_df.copy()\n\nkeep_cols = [run_highest_overlap_score]\n\nerror_filter_perc_threshold = max(retention_df[run_highest_overlap_score])*error_filter_perc\n\n# Iterate over runs sorted by a penalty score\n# For version 3.8 or later uncomment the for loop below and comment the other for loop; also uncomment the line after update progressbar\n#for idx,align_name in (pbar := tqdm(enumerate(score_per_run.sort_values(ascending=True)[1:].index))):\nfor idx,align_name in tqdm(enumerate(score_per_run.sort_values(ascending=True)[1:].index)):\n    # Update progressbar\n    #pbar.set_description(f\"Processing {align_name}\")\n\n    # Check overlap between peptidoforms\n    non_na_sel = (retention_df[align_name].notna()) & (retention_df[run_highest_overlap_score].notna())\n\n    # Continue if insufficient overlapping peptides\n    if len(retention_df[run_highest_overlap_score][non_na_sel].index) &lt; min_overlap:\n        continue\n\n    # Check spread of overlapping peptidoforms, continue if not sufficient\n    if (len(set(pd.cut(retention_df[align_name][non_na_sel], coverage_div, include_lowest = True))) / coverage_div) &lt; min_coverage:\n        continue\n\n    # Fit the GAM\n    gam_model_cv = LinearGAM(s(0, n_splines=num_splines), constraints=constraints, verbose=True).fit(\n                                                            retention_df[align_name][non_na_sel],\n                                                            retention_df[run_highest_overlap_score][non_na_sel])\n\n\n    # Plot results alignment\n    if idx % plot_res_every_n == 0 or idx == 0:\n        plot_performance(\n            retention_df,\n            run_highest_overlap_score,\n            align_name,\n            non_na_sel\n        )\n        plot_residual(\n            run_highest_overlap_score,\n            align_name,\n            non_na_sel\n        )\n\n\n    # Calculate errors and create filter that can be used in the second stage\n    errors = abs(gam_model_cv.predict(retention_df[align_name][non_na_sel])-retention_df[run_highest_overlap_score][non_na_sel])\n    error_filter = errors &lt; error_filter_perc_threshold\n\n    # Perform a second stage GAM removing high error from previous fit\n    if perform_second_stage_robust:\n        gam_model_cv = LinearGAM(s(0, n_splines=num_splines), constraints=constraints, verbose=True).fit(\n                                                                retention_df[align_name][non_na_sel][error_filter],\n                                                                retention_df[run_highest_overlap_score][non_na_sel][error_filter])\n\n        if idx % plot_res_every_n == 0  or idx == 0:\n            plot_residual(\n            run_highest_overlap_score,\n            align_name,\n            non_na_sel,\n            title=\"Residual plot second stage GAM\"\n        )\n\n\n    # Write alignment to new matrix\n    retention_df_aligned.loc[retention_df[align_name].notna(),align_name] = gam_model_cv.predict(retention_df.loc[retention_df[align_name].notna(),align_name])\n\n    unique_peptides.extend(list(retention_df[retention_df[align_name].notna()].index))\n\n    keep_cols.append(align_name)\n\n    # Create reference set based on aligned retention times\n    retention_df[\"median_aligned\"] = retention_df_aligned[keep_cols].median(axis=1)\n    run_highest_overlap_score = \"median_aligned\"\n\nProcessing 20200319_SJ_QEx_LC1200_4_Sommerfeld_OC186_HPMC_50Asc_499: : 0it [00:00, ?it/s]Processing 20200109_SJ_QEx_LC1200_4_Sommerfeld_OC221_Tumor_5FCS_340: : 100it [01:23,  1.29it/s]Processing 20200109_SJ_QEx_LC1200_4_Sommerfeld_OC217_Tumor_5FCS_318: : 200it [02:42,  1.42it/s] Processing 20191210_SJ_QEx_LC1200_4_Sommerfeld_OC195_Tumor_50Asc_280: : 300it [03:54,  1.37it/s]Processing 20191028_SJ_QEx_LC1200_4_Sommerfeld_0Wert_HPMC_FCS_87: : 440it [04:40,  1.57it/s]    \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nThe data points acquired looks as the following vector:\n\nretention_df_aligned[keep_cols].median(axis=1)\n\n_(Acetyl (Protein N-term))ACGLVASNLNLKPGECLR_                           70.639004\n_(Acetyl (Protein N-term))AEEGIAAGGVM(Oxidation (M))DVNTALQEVLK_       118.580954\n_(Acetyl (Protein N-term))AGWNAYIDNLM(Oxidation (M))ADGTCQDAAIVGYK_    106.010118\n_(Acetyl (Protein N-term))SDAAVDTSSEITTK_                               28.451475\n_AADDTWEPFASGK_                                                         54.339311\n                                                                          ...    \n_TALLTWTEPPVR_                                                          70.065188\n_TQFNNNEYSQDLDAYNTKDK_                                                  39.432309\n_VATGTDLLSGTR_                                                          30.655512\n_VNWMPPPSR_                                                             35.224075\n_VTDIDSDDHQVM(Oxidation (M))YIM(Oxidation (M))K_                        27.263047\nLength: 49983, dtype: float64\n\n\nIf we look at the standard deviation we can see that this is still relatively large for some peptidoforms:\n\nplt.hist(retention_df_aligned[keep_cols].std(axis=1),bins=500)\nplt.xlabel(\"Standard deviation retention time\")\nplt.show()\n\nplt.hist(retention_df_aligned[keep_cols].std(axis=1),bins=500)\nplt.xlim(0,7.5)\nplt.xlabel(\"Standard deviation retention time (zoomed)\")\nplt.show()\n\n\n\n\n\n\n\nIn addition to the std there is another factor that can play a big role, the amount of times a peptidoform was observed:\n\nplt.hist(retention_df_aligned[keep_cols].notna().sum(axis=1),bins=100)\nplt.xlabel(\"Count peptidoforms across runs\")\nplt.show()\n\nplt.hist(retention_df_aligned[keep_cols].notna().sum(axis=1),bins=100)\nplt.xlabel(\"Count peptidoforms across runs (zoomed)\")\nplt.xlim(0,20)\nplt.show()\n\n\n\n\n\n\n\nIf we plot both values against each other we get the following plot (the lines indicate possible thresholds):\n\nplt.scatter(\n    retention_df_aligned[keep_cols].notna().sum(axis=1),\n    retention_df_aligned[keep_cols].std(axis=1),\n    s=5,\n    alpha=0.1\n)\n\nplt.ylabel(\"Standard deviation retention time\")\nplt.xlabel(\"Count peptidoforms across runs\")\n\nplt.axhline(\n    y = 2.0,\n    color = \"black\",\n    linewidth=1.0,\n    linestyle = \"--\"\n)\n\nplt.axvline(\n    x = 5.0,\n    color = \"black\",\n    linewidth=1.0,\n    linestyle = \"--\"\n)\n\nplt.show()\n\n\n\n\nIf we set a threshold for a minimum of 5 observations and a maximum standard deviation of 2 we get the following final data set. Here we take the median for each peptidoform across all runs:\n\nmin_observations = 5\nmax_std = 2.0\n\nobservation_filter = retention_df_aligned[keep_cols].notna().sum(axis=1) &gt; min_observations\nstd_filter = retention_df_aligned[keep_cols].std(axis=1) &lt; max_std\n\nretention_df_aligned[keep_cols][(observation_filter) & (std_filter)].median(axis=1)\n\n_(Acetyl (Protein N-term))SDAAVDTSSEITTK_     28.451475\n_AADDTWEPFASGK_                               54.339311\n_AAGVNVEPFWPGLFAK_                           104.666562\n_AAPSVTLFPPSSEELQANK_                         64.668819\n_ADDKETCFAEEGKK_                               6.963496\n                                                ...    \n_AEPYCSVLPGFTFIQHLPLSER_                     106.951930\n_AFMTADLPNELIELLEK_                          131.490540\n_IAQLRPEDLAGLAALQELDVSNLSLQALPGDLSGLFPR_     132.956933\n_VETNMAFSPFSIASLLTQVLLGAGENTK_               136.516511\n_VNTFSALANIDLALEQGDALALFR_                   133.205437\nLength: 21588, dtype: float64"
  },
  {
    "objectID": "tutorials/retentiontime/deeplc-transfer-learning.html#transfer-learning-with-deeplc",
    "href": "tutorials/retentiontime/deeplc-transfer-learning.html#transfer-learning-with-deeplc",
    "title": "Transfer learning with DeepLC",
    "section": "Transfer learning with DeepLC",
    "text": "Transfer learning with DeepLC\n\n# install library for transfer learning\n!pip install deeplc\n!pip install deeplcretrainer\n\n\n# import deeplc packages\nfrom deeplc import DeepLC\nfrom deeplcretrainer import deeplcretrainer\n\n# Default\nfrom collections import Counter\nimport os\nimport urllib.request\n\n# specific packages\nimport pandas as pd\nfrom matplotlib import pyplot as plt\nfrom scipy.stats import pearsonr\nimport numpy as np\n\nimport tensorflow as tf\nfrom tensorflow.python.eager import context\n\nimport warnings\nwarnings.filterwarnings('ignore')\n\n\n# obtain three models for deeplc\nurllib.request.urlretrieve(\n    \"https://github.com/compomics/DeepLC/raw/master/deeplc/mods/full_hc_hela_hf_psms_aligned_1fd8363d9af9dcad3be7553c39396960.hdf5\",\n    \"full_hc_train_pxd001468_1fd8363d9af9dcad3be7553c39396960.hdf5\"\n)\n\nurllib.request.urlretrieve(\n    \"https://github.com/compomics/DeepLC/raw/master/deeplc/mods/full_hc_hela_hf_psms_aligned_8c22d89667368f2f02ad996469ba157e.hdf5\",\n    \"full_hc_train_pxd001468_8c22d89667368f2f02ad996469ba157e.hdf5\"\n)\n\nurllib.request.urlretrieve(\n    \"https://github.com/compomics/DeepLC/raw/master/deeplc/mods/full_hc_hela_hf_psms_aligned_cb975cfdd4105f97efa0b3afffe075cc.hdf5\",\n    \"full_hc_train_pxd001468_cb975cfdd4105f97efa0b3afffe075cc.hdf5\"\n)\n\n('full_hc_train_pxd001468_cb975cfdd4105f97efa0b3afffe075cc.hdf5',\n &lt;http.client.HTTPMessage&gt;)\n\n\nIn this tutorial you will learn how to apply transfer learning to DeepLC models. In previous versions of DeepLC the retention time was calibrated to the LC system that the researcher wants to apply the predictions to. This calibration was performed with either a piecewise linear function or in later versions with a GAM. However, this calibration works under the assumption the that elution order is preserved. Transfer learning has been shown to accurately model changes in chromatographic setup while requiring only a small number of peptides.\n\n# read the input csv file\ncombined_df = pd.read_csv(\n    \"https://github.com/ProteomicsML/ProteomicsML/raw/combined_datasets_retention_time/datasets/retentiontime/PRIDE_MQ/PRIDE_MQ.zip?raw=true\",\n    compression=\"zip\",\n    low_memory=False\n)\n\nWe have the following columns in the downloaded file:\n\ncombined_df.columns\n\nIndex(['MQ_seq', 'seq', 'modifications', 'tr', 'project'], dtype='object')\n\n\nIn this file there are multiple projects, some of these have many peptides and retention times associated:\n\ncombined_counter = Counter(combined_df[\"project\"])\ncombined_counter.most_common()[0:10]\n\n[('PXD028028_A_G_I', 148501),\n ('PXD020019', 98447),\n ('PXD010606_IB', 94779),\n ('PXD023559', 91220),\n ('PXD034196_pro', 88077),\n ('PXD030406', 87908),\n ('PXD019362', 87493),\n ('PXD022614', 86643),\n ('PXD034187_pro', 74438),\n ('PXD021742', 74035)]\n\n\nThe smallest projects in the data set still have over 20 000 peptides:\n\ncombined_counter.most_common()[-10:]\n\n[('PXD024045', 21399),\n ('PXD005346_HeLa_pAA_Rep2', 21112),\n ('PXD020987_third_batch', 20890),\n ('PXD023679', 20837),\n ('PXD022149', 20671),\n ('PXD028125', 20636),\n ('PXD012891', 20531),\n ('PXD010248', 20274),\n ('PXD019957', 20267),\n ('PXD002549', 20175)]\n\n\nLets select the project ‘PXD002549’ which is the project with the smallest number of peptides:\n\ndf = combined_df[combined_df[\"project\"] == combined_counter.most_common()[-1][0]]\n\nFrom this project we take 90 % of the data for training and early stopping (5 % of this 90 %). The remaining 10 % is used for testing prediction accuracy on unseen peptides.\n\ndf_train = df.sample(frac=0.9)\ndf_test = df.loc[df.index.difference(df_train.index)]\n\ndf_train.fillna(\"\",inplace=True)\ndf_test.fillna(\"\",inplace=True)"
  },
  {
    "objectID": "tutorials/retentiontime/deeplc-transfer-learning.html#calibration",
    "href": "tutorials/retentiontime/deeplc-transfer-learning.html#calibration",
    "title": "Transfer learning with DeepLC",
    "section": "Calibration",
    "text": "Calibration\nIn this section we will use calibration to predict retention times for our project\n\n%%capture\n\n# The following code is not required in most cases, but here it is used to clear variables that might cause problems\n_ = tf.Variable([1])\n\ncontext._context = None\ncontext._create_context()\n\ntf.config.threading.set_inter_op_parallelism_threads(1)\n\n# Make sure we have no NA in the dataframes\ndf_test['modifications'] = df_test['modifications'].fillna(\"\")\ndf_train['modifications'] = df_train['modifications'].fillna(\"\")\n\n# Call DeepLC with the downloaded models, say that we use GAM calibration\ndlc = DeepLC(\n        path_model=[\"full_hc_train_pxd001468_1fd8363d9af9dcad3be7553c39396960.hdf5\",\n                    \"full_hc_train_pxd001468_8c22d89667368f2f02ad996469ba157e.hdf5\",\n                    \"full_hc_train_pxd001468_cb975cfdd4105f97efa0b3afffe075cc.hdf5\"],\n        batch_num=1024000,\n        pygam_calibration=True\n)\n\n# Perform calibration, make predictions and calculate metrics\ndlc.calibrate_preds(seq_df=df_train)\npreds_calib = dlc.make_preds(seq_df=df_test)\n\nmae_calib = sum(abs(df_test[\"tr\"]-preds_calib))/len(df_test[\"tr\"].index)\nperc95_calib = np.percentile(abs(df_test[\"tr\"]-preds_calib),95)*2\ncor_calib = pearsonr(df_test[\"tr\"],preds_calib)[0]\n\nLets plot the results! These were fitted with a pretrained model and those predictions were calibrated with a GAM model.\n\n%matplotlib inline\n\nplt.title(f\"MAE: {round(mae_calib,2)} 95th percentile: {round(perc95_calib,2)} R: {round(cor_calib,3)}\")\nplt.scatter(df_test[\"tr\"],preds_calib,s=1,alpha=0.5)\nplt.plot([15,115],[15,115],c=\"grey\")\nplt.show()"
  },
  {
    "objectID": "tutorials/retentiontime/deeplc-transfer-learning.html#new-train",
    "href": "tutorials/retentiontime/deeplc-transfer-learning.html#new-train",
    "title": "Transfer learning with DeepLC",
    "section": "New train",
    "text": "New train\nThere are quite a few data points so we might actually be able to train a model from the ground up. This means it starts with random parameters in the network and will change it accordingly to the training data.\nThis can be slow - in order to speed this process up go to edit on the top bar, select notebook settings and select GPU as a hardware accelerator\n\n%%capture\n\n# The following code is not required in most cases, but here it is used to clear variables that might cause problems\n_ = tf.Variable([1])\n\ncontext._context = None\ncontext._create_context()\n\ntf.config.threading.set_inter_op_parallelism_threads(1)\n\n# For training new models we need to use a file, so write the train df to a file\ndf_train.to_csv(\"train.csv\",index=False)\ndf_train_file = \"train.csv\"\n\n# Here we will train a new model so we keep the 'mods_transfer_learning' empty\nmodels_subtr = deeplcretrainer.retrain(\n    [df_train_file],\n    mods_transfer_learning=[],\n    freeze_layers=False,\n    n_epochs=100\n)\n\n# The following code is not required in most cases, but here it is used to clear variables that might cause problems\n_ = tf.Variable([1])\n\ncontext._context = None\ncontext._create_context()\n\ntf.config.threading.set_inter_op_parallelism_threads(1)\n\n# Make a DeepLC object with the models trained previously\ndlc = DeepLC(\n        path_model=models_subtr,\n        batch_num=1024000,\n        pygam_calibration=False\n)\n\n# Perform calibration, make predictions and calculate metrics\ndlc.calibrate_preds(seq_df=df_train)\npreds_newtrain = dlc.make_preds(seq_df=df_test)\n\nmae_newtrain = sum(abs(df_test[\"tr\"]-preds_newtrain))/len(df_test[\"tr\"].index)\nperc95_newtrain = np.percentile(abs(df_test[\"tr\"]-preds_newtrain),95)*2\ncor_newtrain = pearsonr(df_test[\"tr\"],preds_newtrain)[0]\n\nINFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)\nINFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)\nINFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)\n\n\nLets plot the results of the newly trained model. As you will see the results are very comparable to calibration. In most cases with less training data the calibration strategy will outperform a newly trained model. What strategy works best, calibration or newly trained, depends highly on the data set.\n\n%matplotlib inline\n\nplt.title(f\"MAE: {round(mae_newtrain,2)} 95th percentile: {round(perc95_newtrain,2)} R: {round(cor_newtrain,3)}\")\nplt.scatter(df_test[\"tr\"],preds_newtrain,s=1,alpha=0.5)\nplt.plot([15,115],[15,115],c=\"grey\")\nplt.show()"
  },
  {
    "objectID": "tutorials/retentiontime/deeplc-transfer-learning.html#transfer-learning",
    "href": "tutorials/retentiontime/deeplc-transfer-learning.html#transfer-learning",
    "title": "Transfer learning with DeepLC",
    "section": "Transfer learning",
    "text": "Transfer learning\nTransfer learning is a proven strategy when making predictions for a smaller data set that does not have exactly the same objective. Instead of starting with random starting parameters the training starts with previously trained parameters. This means that it can use information from previously trained on data to obtain a better solution for the current data set.\nThis can be slow - in order to speed this process up go to edit on the top bar, select notebook settings and select GPU as a hardware accelerator\n\n%%capture\n\n# The following code is not required in most cases, but here it is used to clear variables that might cause problems\n_ = tf.Variable([1])\n\ncontext._context = None\ncontext._create_context()\n\ntf.config.threading.set_inter_op_parallelism_threads(1)\n\n# For training new models we need to use a file, so write the train df to a file\ndf_train.to_csv(\"train.csv\",index=False)\ndf_train_file = \"train.csv\"\n\n# Here we will apply transfer learning we specify previously trained models in the 'mods_transfer_learning'\nmodels = deeplcretrainer.retrain(\n    [df_train_file],\n    mods_transfer_learning=[\n        \"full_hc_train_pxd001468_1fd8363d9af9dcad3be7553c39396960.hdf5\",\n        \"full_hc_train_pxd001468_8c22d89667368f2f02ad996469ba157e.hdf5\",\n        \"full_hc_train_pxd001468_cb975cfdd4105f97efa0b3afffe075cc.hdf5\"\n    ],\n    freeze_layers=True,\n    n_epochs=10,\n    freeze_after_concat=1\n);\n\n# The following code is not required in most cases, but here it is used to clear variables that might cause problems\n_ = tf.Variable([1])\n\ncontext._context = None\ncontext._create_context()\n\ntf.config.threading.set_inter_op_parallelism_threads(1)\n\n# Make a DeepLC object with the models trained previously\ndlc = DeepLC(\n        path_model=models,\n        batch_num=1024000,\n        pygam_calibration=False\n)\n\n# Perform calibration, make predictions and calculate metrics\ndlc.calibrate_preds(seq_df=df_train)\npreds_transflearn = dlc.make_preds(seq_df=df_test)\n\nmae_transflearn = sum(abs(df_test[\"tr\"]-preds_transflearn))/len(df_test[\"tr\"].index)\nperc95_transflearn = np.percentile(abs(df_test[\"tr\"]-preds_transflearn),95)*2\ncor_transflearn = pearsonr(df_test[\"tr\"],preds_transflearn)[0]\n\nLets have a look at the transfer learning results. As you can see in the following plot the performance is substantially higher compared to calibration or training a new model.\n\n%matplotlib inline\n\nplt.title(f\"MAE: {round(mae_transflearn,2)} 95th percentile: {round(perc95_transflearn,2)} R: {round(cor_transflearn,3)}\")\nplt.scatter(df_test[\"tr\"],preds_transflearn,s=1,alpha=0.5)\nplt.plot([15,115],[15,115],c=\"grey\")\nplt.show()"
  },
  {
    "objectID": "tutorials/fragmentation/index.html",
    "href": "tutorials/fragmentation/index.html",
    "title": "Fragmentation",
    "section": "",
    "text": "Title\n\n\nAuthor\n\n\nDate\n\n\n\n\n\n\nNIST (part 1): Preparing a spectral library for ML\n\n\nRalf Gabriels\n\n\nOct 5, 2022\n\n\n\n\nNIST (part 2): Traditional ML: Gradient boosting\n\n\nRalf Gabriels\n\n\nOct 5, 2022\n\n\n\n\nProsit-style GRU with pre-annotated ProteomeTools data\n\n\nSiegfried Gessulat\n\n\nSep 28, 2022\n\n\n\n\nRaw file processing with PROSIT style annotation\n\n\nTobias Greisager Rehfeldt\n\n\nSep 21, 2022\n\n\n\n\n\n\nNo matching items"
  },
  {
    "objectID": "tutorials/fragmentation/nist-2-traditional-ml-gradient-boosting.html#introduction",
    "href": "tutorials/fragmentation/nist-2-traditional-ml-gradient-boosting.html#introduction",
    "title": "NIST (part 2): Traditional ML: Gradient boosting",
    "section": "1. Introduction",
    "text": "1. Introduction\nThis is the second part in a three-part tutorial. We recommend you to to start with the first section, where the NIST spectral library is parsed and prepared for use in the second and third parts.\n\nPreparing a spectral library for ML\nTraditional ML: Gradient boosting\nDeep learning: BiLSTM\n\nIn this tutorial, you will learn how to build a fragmentation intensity predictor similar to MS²PIP v3 (Gabriels, Martens, and Degroeve 2019) with traditional machine learning (ML) feature engineering and Gradient Boosting (Friedman 2002).\n\n# Installing required python packages\n! pip install rich~=12.5 numpy~=1.21 pandas~=1.3 matplotlib~=3.5 seaborn~=0.11 scikit-learn~=1.0 pyarrow~=15.0 hyperopt~=0.2 --quiet"
  },
  {
    "objectID": "tutorials/fragmentation/nist-2-traditional-ml-gradient-boosting.html#data-preparation",
    "href": "tutorials/fragmentation/nist-2-traditional-ml-gradient-boosting.html#data-preparation",
    "title": "NIST (part 2): Traditional ML: Gradient boosting",
    "section": "2 Data preparation",
    "text": "2 Data preparation\nWe will use the spectral library that was already parsed in part 1 of this tutorial series.\n\nimport pandas as pd\n\ntrain_val_spectra = pd.read_feather(\"http://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomicsml/fragmentation/nist-humanhcd20160503-parsed-trainval.feather\")\ntest_spectra = pd.read_feather(\"http://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomicsml/fragmentation/nist-humanhcd20160503-parsed-test.feather\")\n\n\n2.1 Feature engineering\nIn traditional ML, the input features for the algorithm usually require some engineering. For fragmentation intensity prediction this is not different. Following the MS²PIP methods, we will calculate the distributions of several amino acid properties across the peptide and fragment ion sequences.\nUsing the distribution of these properties instead of the actual properties per amino acid allows MS²PIP to get a fixed length feature matrix for input peptides with varying lengths.\n\nimport numpy as np\nimport pandas as pd\nfrom rich import progress\n\n\namino_acids = list(\"ACDEFGHIKLMNPQRSTVWY\")\nproperties = np.array([\n    [37,35,59,129,94,0,210,81,191,81,106,101,117,115,343,49,90,60,134,104],  # basicity\n    [68,23,33,29,70,58,41,73,32,73,66,38,0,40,39,44,53,71,51,55],  # helicity\n    [51,75,25,35,100,16,3,94,0,94,82,12,0,22,22,21,39,80,98,70],  # hydrophobicity\n    [32,23,0,4,27,32,48,32,69,32,29,26,35,28,79,29,28,31,31,28],  # pI\n])\n\npd.DataFrame(properties, columns=amino_acids, index=[\"basicity\", \"helicity\", \"hydrophobicity\", \"pI\"])\n\n\n\n\n\n\n\n\nA\nC\nD\nE\nF\nG\nH\nI\nK\nL\nM\nN\nP\nQ\nR\nS\nT\nV\nW\nY\n\n\n\n\nbasicity\n37\n35\n59\n129\n94\n0\n210\n81\n191\n81\n106\n101\n117\n115\n343\n49\n90\n60\n134\n104\n\n\nhelicity\n68\n23\n33\n29\n70\n58\n41\n73\n32\n73\n66\n38\n0\n40\n39\n44\n53\n71\n51\n55\n\n\nhydrophobicity\n51\n75\n25\n35\n100\n16\n3\n94\n0\n94\n82\n12\n0\n22\n22\n21\n39\n80\n98\n70\n\n\npI\n32\n23\n0\n4\n27\n32\n48\n32\n69\n32\n29\n26\n35\n28\n79\n29\n28\n31\n31\n28\n\n\n\n\n\n\n\n\ndef encode_peptide(sequence, charge):\n    # 4 properties * 5 quantiles * 3 ion types + 4 properties * 4 site + 2 global\n    n_features = 78\n    quantiles = [0, 0.25, 0.5, 0.75, 1]\n    n_ions = len(sequence) - 1\n\n    # Encode amino acids as integers to index amino acid properties for peptide sequence\n    aa_indices = {aa: i for i, aa in  enumerate(\"ACDEFGHIKLMNPQRSTVWY\")}\n    aa_to_index = np.vectorize(lambda aa: aa_indices[aa])\n    peptide_indexed = aa_to_index(np.array(list(sequence)))\n    peptide_properties = properties[:, peptide_indexed]\n\n    # Empty peptide_features array\n    peptide_features = np.full((n_ions, n_features), np.nan)\n\n    for b_ion_number in range(1, n_ions + 1):\n        # Calculate quantiles of features across peptide, b-ion, and y-ion\n        peptide_quantiles = np.hstack(\n            np.quantile(peptide_properties, quantiles, axis=1).transpose()\n        )\n        b_ion_quantiles = np.hstack(\n            np.quantile(peptide_properties[:,:b_ion_number], quantiles, axis=1).transpose()\n        )\n        y_ion_quantiles = np.hstack(\n            np.quantile(peptide_properties[:,b_ion_number:], quantiles, axis=1).transpose()\n        )\n\n        # Properties on specific sites: nterm, frag-1, frag+1, cterm\n        specific_site_indexes = np.array([0, b_ion_number - 1, b_ion_number, -1])\n        specific_site_properties = np.hstack(peptide_properties[:, specific_site_indexes].transpose())\n\n        # Global features: Length and charge\n        global_features = np.array([len(sequence), int(charge)])\n\n        # Assign to peptide_features array\n        peptide_features[b_ion_number - 1, 0:20] = peptide_quantiles\n        peptide_features[b_ion_number - 1, 20:40] = b_ion_quantiles\n        peptide_features[b_ion_number - 1, 40:60] = y_ion_quantiles\n        peptide_features[b_ion_number - 1, 60:76] = specific_site_properties\n        peptide_features[b_ion_number - 1, 76:78] = global_features\n\n    return peptide_features\n\n\ndef generate_feature_names():\n    feature_names = []\n    for level in [\"peptide\", \"b\", \"y\"]:\n        for aa_property in [\"basicity\", \"helicity\", \"hydrophobicity\", \"pi\"]:\n            for quantile in [\"min\", \"q1\", \"q2\", \"q3\", \"max\"]:\n                feature_names.append(\"_\".join([level, aa_property, quantile]))\n    for site in [\"nterm\", \"fragmin1\", \"fragplus1\", \"cterm\"]:\n        for aa_property in [\"basicity\", \"helicity\", \"hydrophobicity\", \"pi\"]:\n            feature_names.append(\"_\".join([site, aa_property]))\n\n    feature_names.extend([\"length\", \"charge\"])\n    return feature_names\n\nLet’s test it with a single peptide. Feel free to use your own name as a “peptide”; as long as it does not contain any non-amino acid characters.\n\npeptide_features = pd.DataFrame(encode_peptide(\"RALFGARIELS\", 2), columns=generate_feature_names())\npeptide_features\n\n\n\n\n\n\n\n\npeptide_basicity_min\npeptide_basicity_q1\npeptide_basicity_q2\npeptide_basicity_q3\npeptide_basicity_max\npeptide_helicity_min\npeptide_helicity_q1\npeptide_helicity_q2\npeptide_helicity_q3\npeptide_helicity_max\n...\nfragplus1_basicity\nfragplus1_helicity\nfragplus1_hydrophobicity\nfragplus1_pi\ncterm_basicity\ncterm_helicity\ncterm_hydrophobicity\ncterm_pi\nlength\ncharge\n\n\n\n\n0\n0.0\n43.0\n81.0\n111.5\n343.0\n29.0\n41.5\n68.0\n71.5\n73.0\n...\n37.0\n68.0\n51.0\n32.0\n49.0\n44.0\n21.0\n29.0\n11.0\n2.0\n\n\n1\n0.0\n43.0\n81.0\n111.5\n343.0\n29.0\n41.5\n68.0\n71.5\n73.0\n...\n81.0\n73.0\n94.0\n32.0\n49.0\n44.0\n21.0\n29.0\n11.0\n2.0\n\n\n2\n0.0\n43.0\n81.0\n111.5\n343.0\n29.0\n41.5\n68.0\n71.5\n73.0\n...\n94.0\n70.0\n100.0\n27.0\n49.0\n44.0\n21.0\n29.0\n11.0\n2.0\n\n\n3\n0.0\n43.0\n81.0\n111.5\n343.0\n29.0\n41.5\n68.0\n71.5\n73.0\n...\n0.0\n58.0\n16.0\n32.0\n49.0\n44.0\n21.0\n29.0\n11.0\n2.0\n\n\n4\n0.0\n43.0\n81.0\n111.5\n343.0\n29.0\n41.5\n68.0\n71.5\n73.0\n...\n37.0\n68.0\n51.0\n32.0\n49.0\n44.0\n21.0\n29.0\n11.0\n2.0\n\n\n5\n0.0\n43.0\n81.0\n111.5\n343.0\n29.0\n41.5\n68.0\n71.5\n73.0\n...\n343.0\n39.0\n22.0\n79.0\n49.0\n44.0\n21.0\n29.0\n11.0\n2.0\n\n\n6\n0.0\n43.0\n81.0\n111.5\n343.0\n29.0\n41.5\n68.0\n71.5\n73.0\n...\n81.0\n73.0\n94.0\n32.0\n49.0\n44.0\n21.0\n29.0\n11.0\n2.0\n\n\n7\n0.0\n43.0\n81.0\n111.5\n343.0\n29.0\n41.5\n68.0\n71.5\n73.0\n...\n129.0\n29.0\n35.0\n4.0\n49.0\n44.0\n21.0\n29.0\n11.0\n2.0\n\n\n8\n0.0\n43.0\n81.0\n111.5\n343.0\n29.0\n41.5\n68.0\n71.5\n73.0\n...\n81.0\n73.0\n94.0\n32.0\n49.0\n44.0\n21.0\n29.0\n11.0\n2.0\n\n\n9\n0.0\n43.0\n81.0\n111.5\n343.0\n29.0\n41.5\n68.0\n71.5\n73.0\n...\n49.0\n44.0\n21.0\n29.0\n49.0\n44.0\n21.0\n29.0\n11.0\n2.0\n\n\n\n\n10 rows × 78 columns\n\n\n\n\n\n2.2 Getting the target intensities\nThe target intensities are the observed intensities which the model will learn to predict. Let’s first try with a single spectrum.\n\ntest_spectrum = train_val_spectra.iloc[4]\n\npeptide_targets =  pd.DataFrame({\n    \"b_target\": test_spectrum[\"parsed_intensity\"][\"b\"],\n    \"y_target\": test_spectrum[\"parsed_intensity\"][\"y\"],\n})\npeptide_targets\n\n\n\n\n\n\n\n\nb_target\ny_target\n\n\n\n\n0\n0.000000\n0.118507\n\n\n1\n0.229717\n0.079770\n\n\n2\n0.294631\n0.088712\n\n\n3\n0.234662\n0.145900\n\n\n4\n0.185732\n0.205005\n\n\n5\n0.134395\n0.261630\n\n\n6\n0.081856\n0.305119\n\n\n7\n0.043793\n0.296351\n\n\n8\n0.000000\n0.205703\n\n\n9\n0.000000\n0.155991\n\n\n10\n0.000000\n0.000000\n\n\n\n\n\n\n\nThese are the intensities for the b- and y-ions, each ordered from 1 to 9. In MS²PIP, however, a clever trick is applied to reuse the computed features for each fragment ion pair. Doing so makes perfect sense, as both ions in such a fragment ion pair originated from the same fragmentation event. For this peptide, the fragment ion pairs are b1-y9, b2-y8, b3-y7, etc. To match all of the pairs, we simply have to reverse the y-ion series intensities:\n\npeptide_targets =  pd.DataFrame({\n    \"b_target\": test_spectrum[\"parsed_intensity\"][\"b\"],\n    \"y_target\": test_spectrum[\"parsed_intensity\"][\"y\"][::-1],\n})\npeptide_targets\n\n\n\n\n\n\n\n\nb_target\ny_target\n\n\n\n\n0\n0.000000\n0.000000\n\n\n1\n0.229717\n0.155991\n\n\n2\n0.294631\n0.205703\n\n\n3\n0.234662\n0.296351\n\n\n4\n0.185732\n0.305119\n\n\n5\n0.134395\n0.261630\n\n\n6\n0.081856\n0.205005\n\n\n7\n0.043793\n0.145900\n\n\n8\n0.000000\n0.088712\n\n\n9\n0.000000\n0.079770\n\n\n10\n0.000000\n0.118507\n\n\n\n\n\n\n\n\n\n2.3 Bringing it all together\n\nfeatures = encode_peptide(test_spectrum[\"sequence\"], test_spectrum[\"charge\"])\ntargets = np.stack([test_spectrum[\"parsed_intensity\"][\"b\"], test_spectrum[\"parsed_intensity\"][\"y\"][::-1]], axis=1)\nspectrum_id = np.full(shape=(targets.shape[0], 1), fill_value=test_spectrum[\"index\"])  # Repeat id for all ions\n\n\npd.DataFrame(np.hstack([spectrum_id, features, targets]), columns=[\"spectrum_id\"] + generate_feature_names() + [\"b_target\",  \"y_target\"])\n\n\n\n\n\n\n\n\nspectrum_id\npeptide_basicity_min\npeptide_basicity_q1\npeptide_basicity_q2\npeptide_basicity_q3\npeptide_basicity_max\npeptide_helicity_min\npeptide_helicity_q1\npeptide_helicity_q2\npeptide_helicity_q3\n...\nfragplus1_hydrophobicity\nfragplus1_pi\ncterm_basicity\ncterm_helicity\ncterm_hydrophobicity\ncterm_pi\nlength\ncharge\nb_target\ny_target\n\n\n\n\n0\n5.0\n37.0\n37.0\n37.0\n40.0\n343.0\n39.0\n68.0\n68.0\n68.0\n...\n51.0\n32.0\n343.0\n39.0\n22.0\n79.0\n12.0\n2.0\n0.000000\n0.000000\n\n\n1\n5.0\n37.0\n37.0\n37.0\n40.0\n343.0\n39.0\n68.0\n68.0\n68.0\n...\n51.0\n32.0\n343.0\n39.0\n22.0\n79.0\n12.0\n2.0\n0.229717\n0.155991\n\n\n2\n5.0\n37.0\n37.0\n37.0\n40.0\n343.0\n39.0\n68.0\n68.0\n68.0\n...\n51.0\n32.0\n343.0\n39.0\n22.0\n79.0\n12.0\n2.0\n0.294631\n0.205703\n\n\n3\n5.0\n37.0\n37.0\n37.0\n40.0\n343.0\n39.0\n68.0\n68.0\n68.0\n...\n51.0\n32.0\n343.0\n39.0\n22.0\n79.0\n12.0\n2.0\n0.234662\n0.296351\n\n\n4\n5.0\n37.0\n37.0\n37.0\n40.0\n343.0\n39.0\n68.0\n68.0\n68.0\n...\n51.0\n32.0\n343.0\n39.0\n22.0\n79.0\n12.0\n2.0\n0.185732\n0.305119\n\n\n5\n5.0\n37.0\n37.0\n37.0\n40.0\n343.0\n39.0\n68.0\n68.0\n68.0\n...\n51.0\n32.0\n343.0\n39.0\n22.0\n79.0\n12.0\n2.0\n0.134395\n0.261630\n\n\n6\n5.0\n37.0\n37.0\n37.0\n40.0\n343.0\n39.0\n68.0\n68.0\n68.0\n...\n51.0\n32.0\n343.0\n39.0\n22.0\n79.0\n12.0\n2.0\n0.081856\n0.205005\n\n\n7\n5.0\n37.0\n37.0\n37.0\n40.0\n343.0\n39.0\n68.0\n68.0\n68.0\n...\n51.0\n32.0\n343.0\n39.0\n22.0\n79.0\n12.0\n2.0\n0.043793\n0.145900\n\n\n8\n5.0\n37.0\n37.0\n37.0\n40.0\n343.0\n39.0\n68.0\n68.0\n68.0\n...\n80.0\n31.0\n343.0\n39.0\n22.0\n79.0\n12.0\n2.0\n0.000000\n0.088712\n\n\n9\n5.0\n37.0\n37.0\n37.0\n40.0\n343.0\n39.0\n68.0\n68.0\n68.0\n...\n21.0\n29.0\n343.0\n39.0\n22.0\n79.0\n12.0\n2.0\n0.000000\n0.079770\n\n\n10\n5.0\n37.0\n37.0\n37.0\n40.0\n343.0\n39.0\n68.0\n68.0\n68.0\n...\n22.0\n79.0\n343.0\n39.0\n22.0\n79.0\n12.0\n2.0\n0.000000\n0.118507\n\n\n\n\n11 rows × 81 columns\n\n\n\nThe following function applies these steps over a collection of spectra and returns the full feature/target table:\n\ndef generate_ml_input(spectra):\n    tables = []\n    for spectrum in progress.track(spectra.to_dict(orient=\"records\")):\n        features = encode_peptide(spectrum[\"sequence\"], spectrum[\"charge\"])\n        targets = np.stack([spectrum[\"parsed_intensity\"][\"b\"], spectrum[\"parsed_intensity\"][\"y\"][::-1]], axis=1)\n        spectrum_id = np.full(shape=(targets.shape[0], 1), fill_value=spectrum[\"index\"])  # Repeat id for all ions\n        table = np.hstack([spectrum_id, features, targets])\n        tables.append(table)\n\n    full_table = np.vstack(tables)\n    spectra_encoded = pd.DataFrame(full_table, columns=[\"spectrum_id\"] + generate_feature_names() + [\"b_target\",  \"y_target\"])\n    return spectra_encoded\n\nNote that this might take some time, sometimes up to 30 minutes. To skip this step, simple download the file with pre-encoded features and targets, and load in two cells below.\n\ntrain_val_encoded = generate_ml_input(train_val_spectra)\ntrain_val_encoded.to_feather(\"fragmentation-nist-humanhcd20160503-parsed-trainval-encoded.feather\")\n\ntest_encoded = generate_ml_input(test_spectra)\ntest_encoded.to_feather(\"fragmentation-nist-humanhcd20160503-parsed-test-encoded.feather\")\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n# Uncomment this step to load pre-encoded features from a file:\n\n#train_val_encoded = pd.read_feather(\"ftp://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomicsml/fragmentation/nist-humanhcd20160503-parsed-trainval-encoded.feather\")\n#test_encoded = pd.read_feather(\"ftp://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomicsml/fragmentation/nist-humanhcd20160503-parsed-test-encoded.feather\")\n\n\ntrain_val_encoded\n\n\n\n\n\n\n\n\nspectrum_id\npeptide_basicity_min\npeptide_basicity_q1\npeptide_basicity_q2\npeptide_basicity_q3\npeptide_basicity_max\npeptide_helicity_min\npeptide_helicity_q1\npeptide_helicity_q2\npeptide_helicity_q3\n...\nfragplus1_hydrophobicity\nfragplus1_pi\ncterm_basicity\ncterm_helicity\ncterm_hydrophobicity\ncterm_pi\nlength\ncharge\nb_target\ny_target\n\n\n\n\n0\n0.0\n0.0\n37.0\n37.0\n37.0\n191.0\n32.0\n68.0\n68.0\n68.0\n...\n51.0\n32.0\n191.0\n32.0\n0.0\n69.0\n22.0\n2.0\n0.000000\n0.000000\n\n\n1\n0.0\n0.0\n37.0\n37.0\n37.0\n191.0\n32.0\n68.0\n68.0\n68.0\n...\n51.0\n32.0\n191.0\n32.0\n0.0\n69.0\n22.0\n2.0\n0.094060\n0.000000\n\n\n2\n0.0\n0.0\n37.0\n37.0\n37.0\n191.0\n32.0\n68.0\n68.0\n68.0\n...\n51.0\n32.0\n191.0\n32.0\n0.0\n69.0\n22.0\n2.0\n0.180642\n0.000000\n\n\n3\n0.0\n0.0\n37.0\n37.0\n37.0\n191.0\n32.0\n68.0\n68.0\n68.0\n...\n51.0\n32.0\n191.0\n32.0\n0.0\n69.0\n22.0\n2.0\n0.204203\n0.050476\n\n\n4\n0.0\n0.0\n37.0\n37.0\n37.0\n191.0\n32.0\n68.0\n68.0\n68.0\n...\n51.0\n32.0\n191.0\n32.0\n0.0\n69.0\n22.0\n2.0\n0.233472\n0.094835\n\n\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n\n\n3321136\n398372.0\n81.0\n104.0\n104.0\n153.0\n343.0\n39.0\n48.5\n55.0\n55.0\n...\n70.0\n28.0\n343.0\n39.0\n22.0\n79.0\n8.0\n3.0\n0.000000\n0.180938\n\n\n3321137\n398372.0\n81.0\n104.0\n104.0\n153.0\n343.0\n39.0\n48.5\n55.0\n55.0\n...\n98.0\n31.0\n343.0\n39.0\n22.0\n79.0\n8.0\n3.0\n0.000000\n0.203977\n\n\n3321138\n398372.0\n81.0\n104.0\n104.0\n153.0\n343.0\n39.0\n48.5\n55.0\n55.0\n...\n3.0\n48.0\n343.0\n39.0\n22.0\n79.0\n8.0\n3.0\n0.000000\n0.169803\n\n\n3321139\n398372.0\n81.0\n104.0\n104.0\n153.0\n343.0\n39.0\n48.5\n55.0\n55.0\n...\n94.0\n32.0\n343.0\n39.0\n22.0\n79.0\n8.0\n3.0\n0.000000\n0.120565\n\n\n3321140\n398372.0\n81.0\n104.0\n104.0\n153.0\n343.0\n39.0\n48.5\n55.0\n55.0\n...\n22.0\n79.0\n343.0\n39.0\n22.0\n79.0\n8.0\n3.0\n0.000000\n0.169962\n\n\n\n\n3321141 rows × 81 columns\n\n\n\nThis is the data we will use for training. Note that each spectrum comprises of multiple lines: One line per b/y-ion couple."
  },
  {
    "objectID": "tutorials/fragmentation/nist-2-traditional-ml-gradient-boosting.html#training-the-model",
    "href": "tutorials/fragmentation/nist-2-traditional-ml-gradient-boosting.html#training-the-model",
    "title": "NIST (part 2): Traditional ML: Gradient boosting",
    "section": "3 Training the model",
    "text": "3 Training the model\n\nfrom sklearn.ensemble import GradientBoostingRegressor\n\nLet’s first try to train a simple model on the train set and evaluate its performance on the test set.\n\nreg =  GradientBoostingRegressor()\n\nX_train = train_val_encoded.drop(columns=[\"spectrum_id\", \"b_target\",  \"y_target\"])\ny_train = train_val_encoded[\"y_target\"]\nX_test = test_encoded.drop(columns=[\"spectrum_id\", \"b_target\",  \"y_target\"])\ny_test = test_encoded[\"y_target\"]\n\nreg.fit(X_train, y_train)\n\nGradientBoostingRegressor()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.  GradientBoostingRegressor?Documentation for GradientBoostingRegressoriFittedGradientBoostingRegressor() \n\n\n\ny_test_pred = reg.predict(X_test)\nnp.corrcoef(y_test, y_test_pred)[0][1]\n\n0.7545700864493669\n\n\nNot terrible. Let’s see if we can do better after hyperparameters optimization. For this, we can use the hyperopt package.\n\nfrom hyperopt import fmin, hp, tpe, STATUS_OK\n\n\ndef objective(n_estimators):\n    # Define algorithm\n    reg =  GradientBoostingRegressor(n_estimators=n_estimators)\n\n    # Fit model\n    reg.fit(X_train, y_train)\n\n    # Test model\n    y_test_pred = reg.predict(X_test)\n    correlation = np.corrcoef(y_test, y_test_pred)[0][1]\n\n    return {'loss': -correlation, 'status': STATUS_OK}\n\n\nbest_params = fmin(\n  fn=objective,\n  space=10 + hp.randint('n_estimators', 980),\n  algo=tpe.suggest,\n  max_evals=10,\n)\n\n100%|██████████| 10/10 [27:32:34&lt;00:00, 9915.41s/trial, best loss: -0.8238390276421177]   \n\n\n\nbest_params\n\n{'n_estimators': 912}\n\n\nInitially, the default value of 100 estimators was used. According to this hyperopt run, using 912 estimators results in a more performant model.\nNow we can train the model again with this new hyperparameter value:\n\nreg =  GradientBoostingRegressor(n_estimators=946)\n\nX_train = train_val_encoded.drop(columns=[\"spectrum_id\", \"b_target\",  \"y_target\"])\ny_train = train_val_encoded[\"y_target\"]\nX_test = test_encoded.drop(columns=[\"spectrum_id\", \"b_target\",  \"y_target\"])\ny_test = test_encoded[\"y_target\"]\n\nreg.fit(X_train, y_train)\n\nGradientBoostingRegressor(n_estimators=946)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.  GradientBoostingRegressor?Documentation for GradientBoostingRegressoriFittedGradientBoostingRegressor(n_estimators=946) \n\n\n\ny_test_pred = reg.predict(X_test)\nnp.corrcoef(y_test, y_test_pred)[0][1]\n\n0.8245284609614351\n\n\nMuch better already. To get a more accurate view of the model performance, we should calculate the correlation per spectrum, instead of across the full dataset:\n\nprediction_df_y = pd.DataFrame({\n    \"spectrum_id\": test_encoded[\"spectrum_id\"],\n    \"target_y\": y_test,\n    \"prediction_y\": y_test_pred,\n})\nprediction_df_y\n\n\n\n\n\n\n\n\nspectrum_id\ntarget_y\nprediction_y\n\n\n\n\n0\n9.0\n0.000000\n-0.001765\n\n\n1\n9.0\n0.000000\n-0.001524\n\n\n2\n9.0\n0.000000\n-0.000848\n\n\n3\n9.0\n0.000000\n0.000110\n\n\n4\n9.0\n0.000000\n0.004115\n\n\n...\n...\n...\n...\n\n\n367683\n398369.0\n0.000000\n0.157859\n\n\n367684\n398369.0\n0.224074\n0.195466\n\n\n367685\n398369.0\n0.283664\n0.160815\n\n\n367686\n398369.0\n0.185094\n0.138235\n\n\n367687\n398369.0\n0.192657\n0.223232\n\n\n\n\n367688 rows × 3 columns\n\n\n\n\ncorr_y = prediction_df_y.groupby(\"spectrum_id\").corr().iloc[::2]['prediction_y']\ncorr_y.index = corr_y.index.droplevel(1)\ncorr_y = corr_y.reset_index().rename(columns={\"prediction_y\": \"correlation\"})\ncorr_y\n\n\n\n\n\n\n\n\nspectrum_id\ncorrelation\n\n\n\n\n0\n9.0\n0.839347\n\n\n1\n16.0\n0.806933\n\n\n2\n39.0\n0.113225\n\n\n3\n95.0\n-0.078798\n\n\n4\n140.0\n0.744942\n\n\n...\n...\n...\n\n\n27031\n398328.0\n0.856521\n\n\n27032\n398341.0\n0.358528\n\n\n27033\n398342.0\n0.839326\n\n\n27034\n398368.0\n0.452030\n\n\n27035\n398369.0\n0.204123\n\n\n\n\n27036 rows × 2 columns\n\n\n\nMedian correlation:\n\ncorr_y[\"correlation\"].median()\n\n0.8328406567933057\n\n\nCorrelation distribution:\n\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nsns.set_style(\"whitegrid\")\n\n\nsns.catplot(\n    data=corr_y, x=\"correlation\",\n    fliersize=1,\n    kind=\"box\", aspect=4, height=2\n)\nplt.show()\n\n\n\n\nNot bad! With some more hyperparameter optimization (optimizing only the number of trees is a bit crude) a lot more performance gains could be made. Take a look at the Scikit Learn documentation to learn more about the various hyperparameters for the GradientBoostingRegressor. Alternatively, you could switch to the XGBoost algorithm, which is currently used by MS²PIP.\nAnd of course, this model can only predict y-ion intensities. You can repeat the training and optimization steps to train a model for b-ion intensities.\nGood luck!"
  },
  {
    "objectID": "tutorials/fragmentation/nist-1-parsing-spectral-library.html#introduction",
    "href": "tutorials/fragmentation/nist-1-parsing-spectral-library.html#introduction",
    "title": "NIST (part 1): Preparing a spectral library for ML",
    "section": "1 Introduction",
    "text": "1 Introduction\n\n1.1 Fragmentation peak intensities\nIn bottom-up proteomics, a peptide fragmentation spectrum (MS2) is the most central source of information to identify a peptide. In traditional identification workflows, only the presence and location (x-axis) of peaks in the spectrum are used to identify the peptide that generated the spectrum. The intensity of these peaks (y-axis) are, however, seldomly used in a comprehensive manner. At most, traditional approaches naively assume that higher intensity is always better.\nThis lack of usage of fragmentation peak intensity patterns can mainly be attributed to their complexity. While the location of certain peaks (e.g., b- and y-ions) can easily be calculated from the amino acid masses, fragment peak intensity follow complex, yet predictable patterns. This makes fragmentation peak intensity values a perfect candidate for machine learning.\nML-predicted fragmentation peak intensities have proven their value in many applications, for instance, manual spectrum validation, peptide identification (re)scoring, and for generating in silico predicted spectral libraries for data-independant acquisition (DIA) identification.\n\n\n1.2 About this tutorial\nIn this three-part tutorial you will learn the basic steps in developing a machine learning (ML) predictor for peptide fragmentation intensity prediction, using a NIST spectral library. The first part handles the preparation and parsing of training data; the second part handles training a traditional ML model with XGBoost, similar to MS²PIP (Gabriels, Martens, and Degroeve 2019), and the third part handles training a deep learning BiLSTM predictor.\n\nPreparing a spectral library for ML\nTraditional ML: Gradient boosting\nDeep learning: BiLSTM\n\nTo avoid an overly complex tutorial, some aspects to intensity prediction are simplified or not handled. For example, the resulting models will only be able to predict singly charged b- and y-ions for unmodified peptides.\n\n# Installing required python packages (versions tested with Python 3.10.11)\n! pip install rich~=12.5 numpy~=1.21 pandas~=1.3 pyarrow~=15.0 matplotlib~=3.5.0 seaborn~=0.11 scikit-learn~=1.0 spectrum_utils==0.3.5 --quiet"
  },
  {
    "objectID": "tutorials/fragmentation/nist-1-parsing-spectral-library.html#finding-spectral-libraries",
    "href": "tutorials/fragmentation/nist-1-parsing-spectral-library.html#finding-spectral-libraries",
    "title": "NIST (part 1): Preparing a spectral library for ML",
    "section": "2 Finding spectral libraries",
    "text": "2 Finding spectral libraries\nTraining data for peptide fragmentation spectrum intensity prediction consists of spectra that were already identified. The most convenient source of such information are spectral libraries. These are datasets that were compiled from a collection of mass spectrometry runs and usually consist of a single representative spectrum for each peptide that was identified.\nMany precompiled spectral libraries are available online. You can also generate your own from a collection of proteomics experiments, using software such as SpectraST (Lam et al. 2008).\nSpectral libraries can be downloaded, for instance, from NIST, the US National Institute of Standards and Technology . For this part of the practical, we will download the 2020 Human HCD library of “best” tryptic spectra. For ease-of-use, we will download it in the text-based NIST MSP format.\nThe following code cell automatically downloads and extracts the spectral library file.\n\nimport tarfile\nimport urllib\n\nurl = \"https://chemdata.nist.gov/download/peptide_library/libraries/human/HCD/2020_05_19/human_hcd_tryp_best.msp.tar.gz\"\nlibrary_file = \"human_hcd_tryp_best.msp\"\n\n# Download file\n_ = urllib.request.urlretrieve(url, f\"{library_file}.tar.gz\")\n\n# # Extract\nwith tarfile.open(f\"{library_file}.tar.gz\") as f:\n    f.extractall(\".\")\n\nLet’s explore the MSP spectral library file by printing the first 10 lines of the file:\n\nwith open(library_file, \"rt\") as f:\n    for i, line in enumerate(f):\n        print(line.strip())\n        if i &gt; 10:\n            break\n\nName: AAAAAAAAAAAAAAAGAGAGAK/2_0\nComment: Consensus Pep=Tryptic Peptype=&lt;Protein&gt;&lt;Peptide&gt;&lt;Protein&gt; Mods=0 Fullname=R.AAAAAAAAAAAAAAAGAGAGAK.Q Charge=2 Parent=798.9263 CE=42.09 NCE=29.43 Q-value=0.0000 Nprot=1 Protein=\"sp|P55011|S12A2_HUMAN(pre=R,post=Q)\" Nrep=134/200 Theo_mz_diff=1.2ppm Quality=7/7 MC=0 MCtype=Normal Unassigned_all_20ppm=0.1424 Unassigned_20ppm=0.0416 num_unassigned_peaks_20ppm=44 max_unassigned_ab_20ppm=0.41 top_20_num_unassigned_peaks_20ppm=1/20\nNum peaks: 117\n110.0712    259243.2    \"? 143/200\"\n115.0864    97764.4 \"a2/-1.6ppm 145/200\"\n116.0704    26069.5 \"? 80/200\"\n120.0806    208924.4    \"? 148/200\"\n129.0657    25535.9 \"Int/AG/-1.2ppm,Int/GA/-1.2ppm 86/200\"\n129.1021    361336.8    \"IKD/-1.1ppm,y1-H2O/-1.1ppm 172/200\"\n130.0860    120990.5    \"y1-NH3/-1.9ppm 123/200\"\n136.0754    401263.5    \"? 147/200\"\n141.1019    54146.8 \"? 113/200\"\n\n\nThis shows the beginning of the first spectrum in the spectral library. Each spectrum entry consists of a header with identification data and metadata, and a peak list with three columns:\n\nm/z values\nintensity values\npeak annotation info\n\nAs the sequence of the first peptide is AAAAAAAAAAAAAAAGAGAGAK, we can assume that this library is ordered alphabetically. You can read through the file to verify this assumption. When preparing datasets for ML, it is important to be aware of such properties, especially when splitting the data into train, test, and validation sets."
  },
  {
    "objectID": "tutorials/fragmentation/nist-1-parsing-spectral-library.html#parsing-the-msp-spectral-library-file",
    "href": "tutorials/fragmentation/nist-1-parsing-spectral-library.html#parsing-the-msp-spectral-library-file",
    "title": "NIST (part 1): Preparing a spectral library for ML",
    "section": "3 Parsing the MSP spectral library file",
    "text": "3 Parsing the MSP spectral library file\nPyteomics is a Python package for proteomics that contains readers for many proteomics-related file formats (Levitsky et al. 2018). Unfortunately, MSP is not one of the supported formats. So first, we need a custom MSP reader function.\n\nfrom rich import print, progress  # Rich is a pretty cool library. Google it ;)\nimport numpy as np\nimport pandas as pd\n\nThis function iterates over each line in the MSP file. Once it has gathered all information for a single spectrum, it uses yield to return a dictionary. This means that we can iterate over the function using a for loop, and process spectra one-by-one.\nIf you do not fully understand the function, no problem! This is not the important part of the tutorial.\n\ndef read_msp(filename):\n    \"\"\"Iterate over MSP spectral library file and return spectra as dicts.\"\"\"\n    spectrum = {}\n    mz = []\n    intensity = []\n    annotation = []\n\n    with progress.open(filename, \"rt\") as f:\n        for line in f:\n            # `Name: ` is the first line of a new entry in the file\n            if line.startswith(\"Name: \"):\n                if spectrum:\n                    # Finalize and yield previous spectrum\n                    spectrum[\"sequence\"] = spectrum[\"Fullname\"].split(\".\")[1]  # Remove the previous/next amino acids\n                    spectrum[\"mz\"] = np.array(mz, dtype=\"float32\")\n                    spectrum[\"intensity\"] = np.array(intensity, dtype=\"float32\")\n                    spectrum[\"annotation\"] = np.array(annotation, dtype=\"str\")\n                    yield spectrum\n\n                    # Define new spectrum\n                    spectrum = {}\n                    mz = []\n                    intensity = []\n                    annotation = []\n\n                # Extract everything after `Name: `\n                spectrum[\"Name\"] = line.strip()[6:]\n\n            elif line.startswith(\"Comment: \"):\n                # Parse all comment items as metadata\n                metadata = [i.split(\"=\") for i in line[9:].split(\" \")]\n                for item in metadata:\n                    if len(item) == 2:\n                        spectrum[item[0]] = item[1]\n\n            elif line.startswith(\"Num peaks: \"):\n                spectrum[\"Num peaks\"] = int(line.strip()[11:])\n\n            elif len(line.split(\"\\t\")) == 3:\n                # Parse peak list items one-by-one\n                line = line.strip().split(\"\\t\")\n                mz.append(line[0])\n                intensity.append(line[1])\n                annotation.append(line[2].strip('\"'))\n\n    # Final spectrum\n    spectrum[\"sequence\"] = spectrum[\"Fullname\"].split(\".\")[1]  # Remove the previous/next amino acids\n    spectrum[\"mz\"] = np.array(mz, dtype=\"float32\")\n    spectrum[\"intensity\"] = np.array(intensity, dtype=\"float32\")\n    spectrum[\"annotation\"] = np.array(annotation, dtype=\"str\")\n    yield spectrum\n\nLet’s explore the first spectrum:\n\n# break allows us to only stop after the first spectrum is defined\nfor spectrum in read_msp(\"human_hcd_tryp_best.msp\"):\n    print(spectrum[\"Name\"])\n    break\n\n\n\n\n\n\n\n\u001b[2KAAAAAAAAAAAAAAAGAGAGAK/2_0\nReading... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.0/2.0 GB -:--:--\n\n\n\n\n\nWe can format the peak list as a Pandas DataFrame:\n\npd.DataFrame({\n    \"mz\": spectrum[\"mz\"],\n    \"intensity\": spectrum[\"intensity\"],\n    \"annotation\": spectrum[\"annotation\"],\n})\n\n\n\n\n\n\n\n\nmz\nintensity\nannotation\n\n\n\n\n0\n110.071198\n259243.203125\n? 143/200\n\n\n1\n115.086403\n97764.398438\na2/-1.6ppm 145/200\n\n\n2\n116.070396\n26069.500000\n? 80/200\n\n\n3\n120.080597\n208924.406250\n? 148/200\n\n\n4\n129.065704\n25535.900391\nInt/AG/-1.2ppm,Int/GA/-1.2ppm 86/200\n\n\n...\n...\n...\n...\n\n\n112\n1170.621338\n442693.312500\ny16/-1.1ppm 180/200\n\n\n113\n1171.624146\n173247.703125\ny16+i/-1.0ppm 133/200\n\n\n114\n1241.657959\n264065.593750\ny17/-1.3ppm 170/200\n\n\n115\n1242.660156\n112235.101562\ny17+i/-1.8ppm 125/200\n\n\n116\n1312.693848\n74808.500000\ny18/-2.3ppm 116/200\n\n\n\n\n117 rows × 3 columns\n\n\n\nThe right-most column denotes the peak annotation. This tells us which ion generated the peak, according to the search engine or library generation software. Note that many peaks (highlighted with a question mark) are not annotated, even though the spectrum was confidently identified.\nUsing the Python package spectrum_utils (Bittremieux 2019) , we can easily visualize the spectrum:\n\nimport matplotlib.pyplot as plt\n\nimport spectrum_utils.spectrum as sus\nimport spectrum_utils.plot as sup\n\n\nplt.figure(figsize=(10,5))\nsup.spectrum(\n    sus.MsmsSpectrum(\n        identifier=spectrum[\"Name\"],\n        precursor_mz=float(spectrum[\"Parent\"]),\n        precursor_charge=int(spectrum[\"Charge\"]),\n        mz=spectrum[\"mz\"],\n        intensity=spectrum[\"intensity\"]\n    )\n)\nplt.title(spectrum[\"Name\"])\nplt.show()"
  },
  {
    "objectID": "tutorials/fragmentation/nist-1-parsing-spectral-library.html#preparing-spectra-for-training",
    "href": "tutorials/fragmentation/nist-1-parsing-spectral-library.html#preparing-spectra-for-training",
    "title": "NIST (part 1): Preparing a spectral library for ML",
    "section": "4 Preparing spectra for training",
    "text": "4 Preparing spectra for training\nTo use a peptide fragmentation spectrum such as this one as training target for a machine learning model, it needs some preparation and parsing. Usually this comprises of the following steps:\n\nNormalize the intensities\nTransform the intensities\nAnnotate the peaks\nParse the relevant peak intensities to an format suitable for machine learning\n\nFor each of these steps, we will write a function that can be reused later on in the tutorial.\n\n4.1 Normalize the intensities\nDepending on the file format, peak intensities can range from 0 to 1, from 0 to 100, from 0 from 10 000… Machine learning algorithms require the target (and feature) values to be normalized in a specific range. For fragmentation spectra, there are two common options: total ion current (TIC) normalization and base peak normalization. For the former, all intensity values are divided by the total sum of all intensity values in the spectrum. The sum of all normalized intensities will be 1. For the latter, all intensity values are divided by the most intense peak in the spectrum, resulting in that peak to have normalized intensity 1. Here we will implement TIC-normalization.\n\ndef tic_normalize(msp_spectrum):\n    tic = np.sum(msp_spectrum[\"intensity\"])\n    msp_spectrum[\"intensity\"] = msp_spectrum[\"intensity\"] / tic\n\n\n# Before normalization\nspectrum[\"intensity\"][:10]\n\narray([259243.2,  97764.4,  26069.5, 208924.4,  25535.9, 361336.8,\n       120990.5, 401263.5,  54146.8, 259764.2], dtype=float32)\n\n\n\ntic_normalize(spectrum)\n\n# After normalization\nspectrum[\"intensity\"][:10]\n\narray([0.00882945, 0.00332971, 0.00088789, 0.00711566, 0.00086972,\n       0.0123066 , 0.00412076, 0.01366645, 0.00184416, 0.00884719],\n      dtype=float32)\n\n\n\n\n4.2 Transform the intensities\nThe distribution of peak intensities shows us that most peptide fragmentation peaks have a relatively low intensity, while only a few peaks are more intense:\n\nimport seaborn as sns\nsns.set_style(\"whitegrid\")\n\n# Before transform\nsns.displot(spectrum[\"intensity\"], bins=20)\nplt.show()\n\n\n\n\nTo make the intensities follow a more linear distribution — which is better for machine learning algorithms — we can transform the intensity values. Two methods are often used: square root-tranform, and log-transform. While both methods mostly have the same effect, we will here opt for square root transform, as log-transform results in negative values, which can be cumbersome to deal with.\n\ndef sqrt_transform(msp_spectrum):\n    msp_spectrum[\"intensity\"] = np.sqrt(msp_spectrum[\"intensity\"])\n\n\nsqrt_transform(spectrum)\n\n# After transform\nsns.displot(spectrum[\"intensity\"], bins=20)\nplt.show()\n\n\n\n\n\n\n4.3 Annotate the peaks\nWith the NIST spectral libraries, this step is pretty easy, as peak annotations are already present. If this would not be the case, we can make use of spectrum_utils, which can annotate peaks given the peptide sequence and any modifications. See the spectrum_utils documentation for more info.\nHere, we use spectrum_utils to annotate the peaks:\n\nplt.figure(figsize=(12,6))\nsup.spectrum(\n    sus.MsmsSpectrum(\n        identifier=spectrum[\"Name\"],\n        precursor_mz=float(spectrum[\"Parent\"]),\n        precursor_charge=int(spectrum[\"Charge\"]),\n        mz=spectrum[\"mz\"],\n        intensity=spectrum[\"intensity\"],\n        peptide=spectrum[\"sequence\"],\n    ).annotate_peptide_fragments(25, \"ppm\")\n)\nplt.title(spectrum[\"Name\"])\nplt.show()\n\n\n\n\n\n\n4.4 Parse the relevant peak intensities to an format suitable for machine learning\nNote in the visualization above that spectrum_utils only annotated b- and y-ions, while in the MSP file many other ion types are also annotated. For simplicity’s sake, in this tutorial we will train a model to only predict singly charged b- and y-ions.\nLet’s filter the spectrum for only those peaks. This can be done with regular expressions (regex) and numpy. The regex ^(b|y)([0-9]+)\\/ only matches peak annotations for singly charged b- and y-ions.\n\n\n\n\n\n\nTip\n\n\n\nregex101.com is a great website for building and testing regular expressions. You can try out the above mentioned regex at You can investigate it at regex101.com/r/bgZ7EG/1.\n\n\nIn the filter_peaks function below, numpy.vectorize is used. What do you think it does and why do we use it here?\n\nimport re\n\ndef filter_peaks(msp_spectrum):\n    \"\"\"Filter spectrum peaks to only charge 1 b- and y ions.\"\"\"\n    # Generate the boolean mask\n    get_mask = np.vectorize(lambda x: bool(re.match(\"^(b|y)([0-9]+)\\/\", x)))\n    mask = get_mask(msp_spectrum[\"annotation\"])\n\n    # Apply the mask to each peak array\n    msp_spectrum[\"annotation\"] = msp_spectrum[\"annotation\"][mask]\n    msp_spectrum[\"mz\"] = msp_spectrum[\"mz\"][mask]\n    msp_spectrum[\"intensity\"] = msp_spectrum[\"intensity\"][mask]\n\nfilter_peaks(spectrum)\n\n\nplt.figure(figsize=(12,6))\nsup.spectrum(\n    sus.MsmsSpectrum(\n        identifier=spectrum[\"Name\"],\n        precursor_mz=float(spectrum[\"Parent\"]),\n        precursor_charge=int(spectrum[\"Charge\"]),\n        mz=spectrum[\"mz\"],\n        intensity=spectrum[\"intensity\"],\n        peptide=spectrum[\"sequence\"]\n    ).annotate_peptide_fragments(25, \"ppm\")\n)\nplt.title(spectrum[\"Name\"])\nplt.show()\n\n\n\n\nNow, the spectrum indeed only contains singly charged b- and y-ions. Note the nice gausian-like distributions of equally-distanced b- and y-ions. This is a feature specific for this peptide spectrum. Can you guess why? Tip: Take a look at the peptide sequence.\nCurrently, all peaks are listed together in single numpy arrays, sorted by m/z values. For training a machine learning model, we need the intensity values in a more suitable structure. As we are planning to only predict simple singly charged b- and y-ions, we can create two arrays — one for each ion type — with the ions sorted by ion number. For example:\nb: [b1, b2, b3, b4 ... bN]\ny: [y1, y2, y3, y4 ... yN]\nwhere N is the total number of possible fragments for that peptide sequence. Quick question: What value will N have for our peptide with sequence AAAAAAAAAAAAAAAGAGAGAK?\nThe following function builds upon the filter_peaks function to not only filter the correct ion types, but also order them properly:\n\ndef parse_peaks(msp_spectrum, ion_type):\n    # Generate vectorized functions\n    get_ions = np.vectorize(lambda x: bool(re.match(f\"^({ion_type})([0-9]+)\\/\", x)))\n    get_ion_order = np.vectorize(lambda x: re.match(f\"^({ion_type})([0-9]+)\\/\", x)[2])\n\n    # Get mask with requested ion types\n    mask = get_ions(msp_spectrum[\"annotation\"])\n\n    # Create empty array with for all possible ions\n    n_ions = len(msp_spectrum[\"sequence\"]) - 1\n    parsed_intensity = np.zeros(n_ions)\n\n    # Check if any ions of this type are present\n    if mask.any():\n        # Filter for ion type and sort\n        ion_order = get_ion_order(msp_spectrum[\"annotation\"][mask]).astype(int) - 1\n        # Add ions to correct positions in new array\n        parsed_intensity[ion_order] = msp_spectrum[\"intensity\"][mask]\n\n    try:\n        msp_spectrum[\"parsed_intensity\"][ion_type] = parsed_intensity\n    except KeyError:\n        msp_spectrum[\"parsed_intensity\"] = {}\n        msp_spectrum[\"parsed_intensity\"][ion_type] = parsed_intensity\n\nparse_peaks(spectrum, \"b\")\nparse_peaks(spectrum, \"y\")\n\n\nspectrum['parsed_intensity']\n\n{'b': array([0.        , 0.0940595 , 0.18064232, 0.20420307, 0.23347196,\n        0.2457854 , 0.23112106, 0.20064339, 0.16306745, 0.1246587 ,\n        0.08999325, 0.05416884, 0.        , 0.        , 0.        ,\n        0.        , 0.        , 0.        , 0.        , 0.        ,\n        0.        ]),\n 'y': array([0.09027135, 0.03876459, 0.09092397, 0.07086667, 0.1299265 ,\n        0.09038813, 0.15890096, 0.13701038, 0.13768263, 0.14171469,\n        0.15388304, 0.16281605, 0.16425258, 0.15970773, 0.1443574 ,\n        0.12279043, 0.09483507, 0.05047642, 0.        , 0.        ,\n        0.        ])}\n\n\nGreat! These values are now ready to be used as prediction targets for a machine learning algorithm."
  },
  {
    "objectID": "tutorials/fragmentation/nist-1-parsing-spectral-library.html#parsing-the-full-spectral-library",
    "href": "tutorials/fragmentation/nist-1-parsing-spectral-library.html#parsing-the-full-spectral-library",
    "title": "NIST (part 1): Preparing a spectral library for ML",
    "section": "5 Parsing the full spectral library",
    "text": "5 Parsing the full spectral library\nNow that all functions for spectrum preparation are written, we can parse the full spectral library. Let’s first explore some of the basic statistics of this library.\n\n5.1 Exploring basic spectral library statistics\n\nReading the full spectrum file\nLet’s read the full spectrum file to extract some statistics. To limit the amount of data we keep in memory (this full MSP file is almost 2GB!), we can process the intensity values of each spectrum while parsing and only keep the parsed data:\n\nspectrum_list = []\nfor msp_spectrum in read_msp(\"human_hcd_tryp_best.msp\"):\n    # Process intensities\n    tic_normalize(msp_spectrum)\n    sqrt_transform(msp_spectrum)\n    parse_peaks(msp_spectrum, \"b\")  # Adds `parsed_intensity` &gt; `b`\n    parse_peaks(msp_spectrum, \"y\")  # Adds `parsed_intensity` &gt; `y`\n\n    # Parse metadata\n    spectrum = {\n        \"sequence\": msp_spectrum[\"sequence\"],\n        \"modifications\": msp_spectrum[\"Mods\"],\n        \"charge\": int(msp_spectrum[\"Charge\"]),\n        \"nce\": float(msp_spectrum[\"NCE\"]),\n        \"parsed_intensity\": msp_spectrum[\"parsed_intensity\"]\n    }\n\n    # Append to list\n    spectrum_list.append(spectrum)\n\n\n\n\n\n\n\n\n\n\nGenerating a Pandas DataFrame from the list of spectrum dictionaries, allows us to easily explore the full dataset:\n\nspectrum_df = pd.DataFrame(spectrum_list)\nspectrum_df\n\n\n\n\n\n\n\n\nsequence\nmodifications\ncharge\nnce\nparsed_intensity\n\n\n\n\n0\nAAAAAAAAAAAAAAAGAGAGAK\n0\n2\n29.43\n{'b': [0.0, 0.09405950456857681, 0.18064232170...\n\n\n1\nAAAAAAAAAAAAAAAGAGAGAK\n0\n3\n29.22\n{'b': [0.0, 0.21546243131160736, 0.21998108923...\n\n\n2\nAAAAAAAAAAAPPAPPEGASPGDSAR\n0\n2\n27.80\n{'b': [0.0, 0.0, 0.056045547127723694, 0.10302...\n\n\n3\nAAAAAAAAAAAPPAPPEGASPGDSAR\n0\n3\n30.00\n{'b': [0.0, 0.04407356679439545, 0.07545641809...\n\n\n4\nAAAAAAAAAASGAAIPPLIPPR\n0\n3\n0.00\n{'b': [0.0, 0.10330961644649506, 0.15637055039...\n\n\n...\n...\n...\n...\n...\n...\n\n\n398368\nYYYYHR\n0\n2\n30.20\n{'b': [0.0, 0.14489535987377167, 0.0, 0.0, 0.0...\n\n\n398369\nYYYYHR\n0\n3\n37.52\n{'b': [0.018267542123794556, 0.076188296079635...\n\n\n398370\nYYYYMWK\n0\n2\n31.00\n{'b': [0.0, 0.22406582534313202, 0.11588517576...\n\n\n398371\nYYYYMWK\n1(4,M,Oxidation)\n2\n30.00\n{'b': [0.0, 0.14110229909420013, 0.0, 0.0, 0.0...\n\n\n398372\nYYYYWHLR\n0\n3\n36.22\n{'b': [0.0, 0.0886630266904831, 0.0, 0.0, 0.0,...\n\n\n\n\n398373 rows × 5 columns\n\n\n\nMaking a Pandas DataFrame out of spectrum_list is so simple because it is a list of consistent dictionaries.\n\n\nTotal number of specta\nLow-hanging fruit first: How many spectra are in the full library?\n\nlen(spectrum_list)\n\n398373\n\n\n\n\nPrecursor charge state\nA different precursor charge state can heavily alter peptide fragmentation. It is therefore important to have a representative amount of peptide spectra for each charge state in the spectral library.\n\nsns.countplot(data=spectrum_df, x=\"charge\")\nplt.show()\n\n\n\n\n\n\nPeptide length\nIdem for the length of the peptide sequence. It usually makes sense to filter the train dataset for peptides within a certain length range.\n\nsns.kdeplot(spectrum_df[\"sequence\"].str.len())\nplt.xlabel(\"Sequence length\")\nplt.show()\n\n\n\n\n\nspectrum_df[\"sequence\"].str.len().describe()\n\ncount    398373.000000\nmean         15.541467\nstd           6.506968\nmin           6.000000\n25%          11.000000\n50%          14.000000\n75%          19.000000\nmax          50.000000\nName: sequence, dtype: float64\n\n\n\n(spectrum_df[\"sequence\"].str.len() &gt; 35).value_counts(normalize=True)\n\nFalse    0.98759\nTrue     0.01241\nName: sequence, dtype: float64\n\n\nFor this dataset, the minimum peptide length is 6, while the maximum is 50. Nevertheless, only 1.2% have a peptide lenght higher than 35.\n\nPeptide modifications\nLikewise, peptide modifications can influence peptide fragmentation. How many of the spectra in our library come from modified peptides?\n\nmodification_state = (spectrum_df[\"modifications\"] == \"0\").map({True: \"Unmodified\", False: \"Modified\"})\nsns.countplot(x=modification_state)\nplt.show()\n\n\n\n\n\n\n\nCollision energy\nSimilarly, the fragmentation collision energy (CE) might influence the observed fragmentation patterns.\n\nsns.histplot(spectrum_df[\"nce\"], bins=30)\nplt.xlabel(\"NCE\")\nplt.show()\n\n\n\n\nNote the range of the x-axis, which was automatically chosen by the plotting library. It seems to start at 0, which indicates that some values are very low…\n\n(spectrum_df[\"nce\"] == 0.0).value_counts()\n\nFalse    398103\nTrue        270\nName: nce, dtype: int64\n\n\nIndeed, it seems that some peptide spectra have CE 0, which most likely means that the true CE setting is not known. We can either opt to not use CE as a feature for training, or to remove these spectra from the dataset. Including these values would introduce unwanted noise in the training data.\n\n\nDuplicate entries?\nAn important aspect to compiling training data for machine learning is whether or not entries are duplicated. With spectral libraries, matters are complicated by multiple levels of “uniqueness”:\n\nPeptide level: Unique sequence\nPeptidoform level: Unique sequence & modifications\nPrecursor level: Unique sequence & modifications & charge\n\nMore parameters can be included for “uniqueness”, such as instrument and acquisition properties: CE, fragmentation method (beam-type CID (“HCD”), trap-type CID, ETD, EAD…), acquisition method (Orbitrap, ion trap, TOF…). In this tutorial, we are using only HCD Orbitrap data, which makes things a bit simpler. Nevertheless, this will impact the application domain of the final models.\n\ncounts = pd.DataFrame({\n    \"Level\": [\n        \"Full library\",\n        \"Precursor\",\n        \"Peptidoform\",\n        \"Peptide\",\n    ],\n    \"Count\": [\n        spectrum_df.shape[0],\n        spectrum_df[[\"sequence\", \"modifications\", \"charge\"]].drop_duplicates().shape[0],\n        spectrum_df[[\"sequence\", \"modifications\"]].drop_duplicates().shape[0],\n        spectrum_df[\"sequence\"].unique().shape[0],\n    ],\n})\n\n\nsns.barplot(data=counts, x=\"Level\", y=\"Count\")\nplt.show()\n\n\n\n\n\ncounts\n\n\n\n\n\n\n\n\nLevel\nCount\n\n\n\n\n0\nFull library\n398373\n\n\n1\nPrecursor\n398373\n\n\n2\nPeptidoform\n292061\n\n\n3\nPeptide\n257202\n\n\n\n\n\n\n\nSeems like this library was already filtered for uniqueness on the precursor level.\n\n\n\n5.2 Selecting data\nFor selecting training data, we will apply some additional filters:\n\nWhile plain amino acid sequences are straightforward to encode, peptide modifications complicate matters. For simplicity’s sake, we will therefore not open the “can of modifications” in this tutorial.\nAs we might want to use CE as a feature, we can remove the small amount of entries that are missing the a CE value\nTo make the training task a bit less complex, we can limit peptide length to 35. Although the maximum peptide length in this library is 50, only 4944 spectra have a peptide length of over 35.\n\n\nspectrum_df = spectrum_df[\n    (modification_state == \"Unmodified\") &\n    (spectrum_df[\"sequence\"].str.len() &lt;= 35) &\n    (spectrum_df[\"nce\"] != 0)\n]\n\nLet’s see how many spectra we retained:\n\nspectrum_df.shape[0]\n\n270440\n\n\n\n\n5.3 Train / validation / test split\nNow that we have our data, we can filter it to a final set for training and validation and a final set for testing. A small reminder of what these terms mean:\n\nTraining data: For training the model\nValidation data: For validating the model while optimizing hyperparameters\nTesting data: For final testing of model that was trained with the best hyperparameters (according to the validation data), right before deployment\n\nThe testing data cannot be used until a final model is trained, and serves as a last test before deployment. It should not be used before a final model is selected.\n\nfrom sklearn.model_selection import train_test_split\n\nnp.random.seed(42)\n\ntrain_val_peptides, test_peptides = train_test_split(spectrum_df[\"sequence\"].unique(), train_size=0.9)\ntrain_val_spectra = spectrum_df[spectrum_df[\"sequence\"].isin(train_val_peptides)]\ntest_spectra = spectrum_df[spectrum_df[\"sequence\"].isin(test_peptides)]\n\nWhy do we not apply train_test_split() directly on spectrum_df, but instead on spectrum_df[\"sequence\"].unique()?\n\n\n5.4 Saving the parsed library for the next tutorial parts\nWe will be saving the parsed spectral library to Arrow Feather files, a fast and efficient binary format that can easily be read and written from Pandas.\n\ntrain_val_spectra.reset_index().to_feather(\"fragmentation-nist-humanhcd20160503-parsed-trainval.feather\")\ntest_spectra.reset_index().to_feather(\"fragmentation-nist-humanhcd20160503-parsed-test.feather\")\n\nContinue with part 2 of this tutorial: 👉Traditional ML: Gradient boosting"
  },
  {
    "objectID": "tutorials/fragmentation/raw-to-prosit.html",
    "href": "tutorials/fragmentation/raw-to-prosit.html",
    "title": "Raw file processing with PROSIT style annotation",
    "section": "",
    "text": "This notebook contains the simplest steps to turn any raw data into a format thats fragmentation prediction ready. This notebook retrieve a ProteomeTools file from PRIDE to make it as easy to copy as possible, but retrieving the files might take time.\nThis method uses the MaxQuant file to get the modified sequence, charge, and scan number. It then uses fisher_py to interact with the raw files and retrieve the ms2 scans and the mass analyzer.\nThe annotation pipeline comes from the TUM annotation github\n\n%%capture\n# In order to interact with fisher raw files, we need to interact with the python .NET implementation.\n# This requires CONDA on all UNIX systems, and for this reason we need to install conda in the colab.\n# If this is not run on colab do not run this code block, but install conda in the given environment.\n!pip install -q condacolab\nimport condacolab\ncondacolab.install()\n\n\n%%capture\n!conda install pythonnet==2.5.2\n!pip install fisher_py==1.0.10\n!pip install fundamentals@git+https://github.com/wilhelm-lab/spectrum_fundamentals@proteomicsml\n\n\n!wget https://ftp.pride.ebi.ac.uk/pride/data/archive/2017/02/PXD004732/01625b_GA1-TUM_first_pool_1_01_01-DDA-1h-R2.raw\n!wget https://ftp.pride.ebi.ac.uk/pride/data/archive/2017/02/PXD004732/TUM_first_pool_1_01_01_DDA-1h-R2-tryptic.zip\n\n--2022-11-01 10:51:16--  https://ftp.pride.ebi.ac.uk/pride/data/archive/2017/02/PXD004732/01625b_GA1-TUM_first_pool_1_01_01-DDA-1h-R2.raw\nResolving ftp.pride.ebi.ac.uk (ftp.pride.ebi.ac.uk)... 193.62.193.138\nConnecting to ftp.pride.ebi.ac.uk (ftp.pride.ebi.ac.uk)|193.62.193.138|:443... connected.\nHTTP request sent, awaiting response... 200 OK\nLength: 687962554 (656M)\nSaving to: ‘01625b_GA1-TUM_first_pool_1_01_01-DDA-1h-R2.raw’\n\n01625b_GA1-TUM_firs 100%[===================&gt;] 656.09M   676KB/s    in 16m 45s \n\n2022-11-01 11:08:02 (669 KB/s) - ‘01625b_GA1-TUM_first_pool_1_01_01-DDA-1h-R2.raw’ saved [687962554/687962554]\n\n--2022-11-01 11:08:02--  https://ftp.pride.ebi.ac.uk/pride/data/archive/2017/02/PXD004732/TUM_first_pool_1_01_01_DDA-1h-R2-tryptic.zip\nResolving ftp.pride.ebi.ac.uk (ftp.pride.ebi.ac.uk)... 193.62.193.138\nConnecting to ftp.pride.ebi.ac.uk (ftp.pride.ebi.ac.uk)|193.62.193.138|:443... connected.\nHTTP request sent, awaiting response... 200 OK\nLength: 15581179 (15M) [application/zip]\nSaving to: ‘TUM_first_pool_1_01_01_DDA-1h-R2-tryptic.zip’\n\nTUM_first_pool_1_01 100%[===================&gt;]  14.86M   685KB/s    in 23s     \n\n2022-11-01 11:08:25 (671 KB/s) - ‘TUM_first_pool_1_01_01_DDA-1h-R2-tryptic.zip’ saved [15581179/15581179]\n\n\n\n\nfrom zipfile import ZipFile\nimport pandas as pd\nwith ZipFile(f'TUM_first_pool_1_01_01_DDA-1h-R2-tryptic.zip', 'r') as zip_file:\n  msms = pd.read_csv(zip_file.open('msms.txt'), sep='\\t')\n# Current PROSIT pipeline does not accomodate modified peptides, so we remove all of the oxidized peptides\nmsms = msms[msms['Modifications'] == 'Unmodified']\n\n\nfrom fisher_py import RawFile\nraw = RawFile('01625b_GA1-TUM_first_pool_1_01_01-DDA-1h-R2.raw')\n# Get the scan numbers from the msms file and save the scan + info in a dictionary\nfrom fisher_py.data.business import Scan\nimport numpy as np\nscan_mzs = []\nscan_ints = []\nscan_mass_analyzers = []\nscan_collison_energy = []\nfor scan in msms['Scan number']:\n  raw_scan = Scan.from_file(raw._raw_file_access, scan)\n  scan_mzs.append(np.array(raw_scan.preferred_masses))\n  scan_ints.append(np.array(raw_scan.preferred_intensities))\n  scan_mass_analyzers.append(raw_scan.scan_type.split(' + ')[0])\n  frag_infos = [f.split(' ')[0] for f in raw_scan.scan_type.split('@')[1:]]\n  splits = [[i for i, g in enumerate(f) if g.isnumeric()][0] for f in frag_infos]\n  NCEs = [float(frag[split:]) for split, frag in zip(splits, frag_infos)]\n  scan_collison_energy.append(NCEs[0])\n\nWe need to create a sub-set of the MaxQuant dataframe that we can insert into the annotation pipeline. For this we need the have 6 columns (with specific names): MODIFIED_SEQUENCE, PERCURSOR_CHARGE, MASS_ANALYZER, SCAN_NUMBER, MZ, INTENSITIES\n\nannotation_df = pd.DataFrame(msms[['Modified sequence', 'Charge', 'Scan number']].values, columns=['MODIFIED_SEQUENCE', 'PRECURSOR_CHARGE', 'SCAN_NUMBER'])\nannotation_df['MZ'] = scan_mzs\nannotation_df['INTENSITIES'] = scan_ints\nannotation_df['MASS_ANALYZER'] = scan_mass_analyzers\nannotation_df['COLLISION_ENERGY'] = scan_collison_energy\n\nfrom fundamentals.mod_string import maxquant_to_internal\nannotation_df['MODIFIED_SEQUENCE'] = maxquant_to_internal(annotation_df['MODIFIED_SEQUENCE'].values)\n\nfrom fundamentals.annotation.annotation import annotate_spectra\nannotation = annotate_spectra(annotation_df)\n\n2022-11-01 11:49:04,339 - INFO - fundamentals.annotation.annotation::annotate_spectra Removed count    11970.00000\nmean         0.00802\nstd          0.09287\nmin          0.00000\n25%          0.00000\n50%          0.00000\n75%          0.00000\nmax          2.00000\nName: removed_peaks, dtype: float64 redundant peaks\n\n\nThe annotation element contains the annotated intensities nad m/zs, along with the theoretical mass and removed peaks\n\nannotation\n\n\n  \n    \n      \n\n\n\n\n\n\nINTENSITIES\nMZ\nCALCULATED_MASS\nremoved_peaks\n\n\n\n\n0\n[0.36918813165578857, 0.0, -1.0, 0.0, 0.0, -1....\n[175.11929321289062, 0.0, -1.0, 0.0, 0.0, -1.0...\n796.423175\n0\n\n\n1\n[0.028514689782729, 0.0, -1.0, 0.0, 0.0, -1.0,...\n[175.25360107421875, 0.0, -1.0, 0.0, 0.0, -1.0...\n796.423175\n0\n\n\n2\n[0.3452339640378655, 0.0, -1.0, 0.0, 0.0, -1.0...\n[175.11927795410156, 0.0, -1.0, 0.0, 0.0, -1.0...\n796.423175\n0\n\n\n3\n[0.030064791591335877, 0.0, -1.0, 0.0, 0.0, -1...\n[175.16168212890625, 0.0, -1.0, 0.0, 0.0, -1.0...\n796.423175\n0\n\n\n4\n[0.0, 0.0, -1.0, 0.0, 0.0, -1.0, 0.07584115481...\n[0.0, 0.0, -1.0, 0.0, 0.0, -1.0, 262.248901367...\n1370.559481\n0\n\n\n...\n...\n...\n...\n...\n\n\n11965\n[0.009784486409648692, 0.0, -1.0, 0.0, 0.0, -1...\n[147.1424102783203, 0.0, -1.0, 0.0, 0.0, -1.0,...\n914.474935\n0\n\n\n11966\n[0.23857646569260368, 0.0, -1.0, 0.0, 0.0, -1....\n[147.11309814453125, 0.0, -1.0, 0.0, 0.0, -1.0...\n914.474935\n0\n\n\n11967\n[0.012048242613237779, 0.0, -1.0, 0.0, 0.0, -1...\n[147.1204376220703, 0.0, -1.0, 0.0, 0.0, -1.0,...\n914.474935\n0\n\n\n11968\n[0.39071905153057307, 0.0, -1.0, 0.0, 0.0, -1....\n[147.11328125, 0.0, -1.0, 0.0, 0.0, -1.0, 276....\n914.474935\n0\n\n\n11969\n[0.02029996314040732, 0.0, -1.0, 0.0, 0.0, -1....\n[147.19485473632812, 0.0, -1.0, 0.0, 0.0, -1.0...\n914.474935\n0\n\n\n\n\n\n11970 rows × 4 columns\n\n      \n        \n  \n    \n    \n  \n      \n      \n  \n\n      \n    \n  \n  \n\n\nNow we need to combined the necessary information from MaxQuant and the annotation package into a DataFrame mimicing the one found in the “Prosit-style GRU with ProteomeTools data” found here (https://www.proteomicsml.org/tutorials/fragmentation/proteometools-prosit.html) for an easy handover\n\nPROSIT_ALHABET = {\n    \"A\": 1,\n    \"C\": 2,\n    \"D\": 3,\n    \"E\": 4,\n    \"F\": 5,\n    \"G\": 6,\n    \"H\": 7,\n    \"I\": 8,\n    \"K\": 9,\n    \"L\": 10,\n    \"M\": 11,\n    \"N\": 12,\n    \"P\": 13,\n    \"Q\": 14,\n    \"R\": 15,\n    \"S\": 16,\n    \"T\": 17,\n    \"V\": 18,\n    \"W\": 19,\n    \"Y\": 20,\n    \"M(ox)\": 21,\n}\nsequence_integer = [[PROSIT_ALHABET[AA] for AA in sequence] for sequence in msms['Sequence']]\nprecursor_charge_onehot = pd.get_dummies(msms['Charge']).values\ncollision_energy_aligned_normed = annotation_df['COLLISION_ENERGY']\nintensities_raw = annotation['INTENSITIES']\n\n\ndf = pd.DataFrame(list(zip(sequence_integer, precursor_charge_onehot, collision_energy_aligned_normed, intensities_raw)),\n                  columns=['sequence_integer', 'precursor_charge_onehot', 'collision_energy', 'intensities_raw'])\ndf\n\n\n  \n    \n      \n\n\n\n\n\n\nsequence_integer\nprecursor_charge_onehot\ncollision_energy\nintensities_raw\n\n\n\n\n0\n[1, 1, 1, 5, 20, 18, 15]\n[0, 1, 0]\n28.0\n[0.36918813165578857, 0.0, -1.0, 0.0, 0.0, -1....\n\n\n1\n[1, 1, 1, 5, 20, 18, 15]\n[0, 1, 0]\n35.0\n[0.028514689782729, 0.0, -1.0, 0.0, 0.0, -1.0,...\n\n\n2\n[1, 1, 1, 5, 20, 18, 15]\n[0, 1, 0]\n28.0\n[0.3452339640378655, 0.0, -1.0, 0.0, 0.0, -1.0...\n\n\n3\n[1, 1, 1, 5, 20, 18, 15]\n[0, 1, 0]\n35.0\n[0.030064791591335877, 0.0, -1.0, 0.0, 0.0, -1...\n\n\n4\n[1, 1, 5, 17, 4, 2, 2, 14, 1, 1, 3, 9]\n[0, 1, 0]\n35.0\n[0.0, 0.0, -1.0, 0.0, 0.0, -1.0, 0.07584115481...\n\n\n...\n...\n...\n...\n...\n\n\n11965\n[20, 20, 16, 8, 10, 4, 9]\n[0, 1, 0]\n35.0\n[0.009784486409648692, 0.0, -1.0, 0.0, 0.0, -1...\n\n\n11966\n[20, 20, 16, 8, 10, 4, 9]\n[0, 1, 0]\n28.0\n[0.23857646569260368, 0.0, -1.0, 0.0, 0.0, -1....\n\n\n11967\n[20, 20, 16, 8, 10, 4, 9]\n[0, 1, 0]\n35.0\n[0.012048242613237779, 0.0, -1.0, 0.0, 0.0, -1...\n\n\n11968\n[20, 20, 16, 8, 10, 4, 9]\n[0, 1, 0]\n28.0\n[0.39071905153057307, 0.0, -1.0, 0.0, 0.0, -1....\n\n\n11969\n[20, 20, 16, 8, 10, 4, 9]\n[0, 1, 0]\n35.0\n[0.02029996314040732, 0.0, -1.0, 0.0, 0.0, -1....\n\n\n\n\n\n11970 rows × 4 columns"
  },
  {
    "objectID": "tutorials/fragmentation/preannotated-prosit.html#download-and-prepare-training-data",
    "href": "tutorials/fragmentation/preannotated-prosit.html#download-and-prepare-training-data",
    "title": "Prosit-style GRU with pre-annotated ProteomeTools data",
    "section": "Download and prepare training data",
    "text": "Download and prepare training data\n\n# Load ProteomeTools data from Figshare\n!wget https://figshare.com/ndownloader/files/12506534\n!mv 12506534 prosit_2018_holdout.hdf5\n\n--2022-09-27 22:45:17--  https://figshare.com/ndownloader/files/12506534\nResolving figshare.com (figshare.com)... 54.170.191.210, 54.194.117.36, 2a05:d018:1f4:d003:88c8:2c66:47e7:4f07, ...\nConnecting to figshare.com (figshare.com)|54.170.191.210|:443... connected.\nHTTP request sent, awaiting response... 302 Found\nLocation: https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/12506534/holdout_hcd.hdf5?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIYCQYOYV5JSSROOA/20220927/eu-west-1/s3/aws4_request&X-Amz-Date=20220927T224517Z&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=3cae745acb61cdbeed831bae33ea71451bf3becf7404b7423a9fb848f55b3e3c [following]\n--2022-09-27 22:45:18--  https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/12506534/holdout_hcd.hdf5?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIYCQYOYV5JSSROOA/20220927/eu-west-1/s3/aws4_request&X-Amz-Date=20220927T224517Z&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=3cae745acb61cdbeed831bae33ea71451bf3becf7404b7423a9fb848f55b3e3c\nResolving s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)... 52.218.120.88\nConnecting to s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)|52.218.120.88|:443... connected.\nHTTP request sent, awaiting response... 200 OK\nLength: 262781348 (251M) [application/octet-stream]\nSaving to: ‘12506534’\n\n12506534            100%[===================&gt;] 250.61M  13.7MB/s    in 20s     \n\n2022-09-27 22:45:39 (12.4 MB/s) - ‘12506534’ saved [262781348/262781348]\n\n\n\n\n# Import packages\nimport pandas as pd\nimport numpy as np\nimport h5py as h5\n\n# Using the alphabet as defined in Prosit:\n# https://github.com/kusterlab/prosit/blob/master/prosit/constants.py#L21-L43\nPROSIT_ALHABET = {\n    \"A\": 1,\n    \"C\": 2,\n    \"D\": 3,\n    \"E\": 4,\n    \"F\": 5,\n    \"G\": 6,\n    \"H\": 7,\n    \"I\": 8,\n    \"K\": 9,\n    \"L\": 10,\n    \"M\": 11,\n    \"N\": 12,\n    \"P\": 13,\n    \"Q\": 14,\n    \"R\": 15,\n    \"S\": 16,\n    \"T\": 17,\n    \"V\": 18,\n    \"W\": 19,\n    \"Y\": 20,\n    \"M(ox)\": 21,\n}\nPROSIT_INDEXED_ALPHABET = {i: c for c, i in PROSIT_ALHABET.items()}\n\n\n# Read the downloaded data to a dataframe\nwith h5.File('prosit_2018_holdout.hdf5', 'r') as f:\n  KEY_ARRAY = [\"sequence_integer\", \"precursor_charge_onehot\", \"intensities_raw\"]\n  KEY_SCALAR = [\"collision_energy_aligned_normed\", \"collision_energy\"]\n  df = pd.DataFrame({key: list(f[key][...]) for key in KEY_ARRAY})\n  for key in KEY_SCALAR:\n    df[key] = f[key][...]\n\n# Add convenience columns\ndf['precursor_charge'] = df.precursor_charge_onehot.map(lambda a: a.argmax() + 1)\ndf['sequence_maxquant'] = df.sequence_integer.map(lambda s: \"\".join(PROSIT_INDEXED_ALPHABET[i] for i in s if i != 0))\ndf['sequence_length'] = df.sequence_integer.map(lambda s: np.count_nonzero(s))"
  },
  {
    "objectID": "tutorials/fragmentation/preannotated-prosit.html#inspecting-data-distributions",
    "href": "tutorials/fragmentation/preannotated-prosit.html#inspecting-data-distributions",
    "title": "Prosit-style GRU with pre-annotated ProteomeTools data",
    "section": "Inspecting data distributions",
    "text": "Inspecting data distributions\n\ndf['precursor_charge'].hist()\n\n&lt;matplotlib.axes._subplots.AxesSubplot&gt;\n\n\n\n\n\n\ndf['collision_energy'].hist(bins=10)\n\n&lt;matplotlib.axes._subplots.AxesSubplot&gt;\n\n\n\n\n\n\ndf['sequence_length'].hist(bins=30-7)\n\n&lt;matplotlib.axes._subplots.AxesSubplot&gt;"
  },
  {
    "objectID": "tutorials/fragmentation/preannotated-prosit.html#dataset-preparation",
    "href": "tutorials/fragmentation/preannotated-prosit.html#dataset-preparation",
    "title": "Prosit-style GRU with pre-annotated ProteomeTools data",
    "section": "Dataset preparation",
    "text": "Dataset preparation\n\n# Split the data into training, validation, and test\n\nfrom random import shuffle\n\ndef split_dataframe(df,\n                    unique_column,\n                    ratio_training=0.8,\n                    ratio_validation=0.1,\n                    ratio_test=0.1):\n  \"\"\"\n  This function splits the dataframe in three splits and makes sure that values\n  of `unique_column` are unique to each of the splits. This is helpful if, for\n  example, you have non-unique sequence in `unique_column` but want to ensure\n  that a sequence value is unique to one of the splits.\n  \"\"\"\n\n  assert ratio_training + ratio_validation + ratio_test == 1\n\n  unique = list(set(df[unique_column]))\n  n_unique = len(unique)\n  shuffle(unique)\n\n  train_split = int(n_unique * ratio_training)\n  val_split = int(n_unique * (ratio_training + ratio_validation))\n\n  unique_train = unique[:train_split]\n  unique_validation = unique[train_split:val_split]\n  unique_test = unique[val_split:]\n\n  assert len(unique_train) + len(unique_validation) + len(unique_test) == n_unique\n\n  df_train = df[df[unique_column].isin(unique_train)]\n  df_validation = df[df[unique_column].isin(unique_validation)]\n  df_test = df[df[unique_column].isin(unique_test)]\n\n  assert len(df_train) + len(df_validation) + len(df_test) == len(df)\n\n  return df_train, df_validation, df_test\n\ndf_train, df_validation, df_test = split_dataframe(df, unique_column='sequence_maxquant')\n\n\n# Prepare the training data\nINPUT_COLUMNS = ('sequence_integer', 'precursor_charge_onehot', 'collision_energy_aligned_normed')\nOUTPUT_COLUMN = 'intensities_raw'\n\nx_train = [np.vstack(df_train[column]) for column in INPUT_COLUMNS]\ny_train = np.vstack(df_train[OUTPUT_COLUMN])\n\nx_validation = [np.vstack(df_validation[column]) for column in INPUT_COLUMNS]\ny_validation = np.vstack(df_validation[OUTPUT_COLUMN])\n\nx_test = [np.vstack(df_test[column]) for column in INPUT_COLUMNS]\ny_test = np.vstack(df_test[OUTPUT_COLUMN])"
  },
  {
    "objectID": "tutorials/fragmentation/preannotated-prosit.html#model-setup-and-training",
    "href": "tutorials/fragmentation/preannotated-prosit.html#model-setup-and-training",
    "title": "Prosit-style GRU with pre-annotated ProteomeTools data",
    "section": "Model setup and training",
    "text": "Model setup and training\n\n# Setup model and trainig parameters\nDIM_LATENT = 124\nDIM_EMBEDDING_IN = max(PROSIT_ALHABET.values()) + 1  # max value + zero for padding\nDIM_EMBEDDING_OUT = 32\nEPOCHS = 5\nBATCH_SIZE = 256\n\n\nimport tensorflow as tf\nimport numpy as np\nfrom tensorflow.keras.layers import Input, Dense, GRU, Embedding, Multiply\nfrom tensorflow.keras.models import Model\nfrom tensorflow.keras import backend as k\n\n# Build the model with input layers for sequence, precursor charge, and collision energy\nin_sequence = Input(shape=[x_train[0].shape[1]], name=\"in_sequence\")\nin_precursor_charge = Input(shape=[x_train[1].shape[1]], name=\"in_precursor_charge\")\nin_collision_energy = Input(shape=[x_train[2].shape[1]], name=\"in_collision_energy\")\n\nx_s = Embedding(input_dim=DIM_EMBEDDING_IN, output_dim=DIM_EMBEDDING_OUT)(in_sequence)\nx_s = GRU(DIM_LATENT)(x_s)\nx_z = Dense(DIM_LATENT)(in_precursor_charge)\nx_e = Dense(DIM_LATENT)(in_collision_energy)\nx = Multiply()([x_s, x_z, x_e])\nout_intensities = Dense(y_train.shape[1])(x)\n\nmodel = Model([in_sequence, in_precursor_charge, in_collision_energy], out_intensities)\nmodel.summary()\n\nModel: \"model\"\n__________________________________________________________________________________________________\n Layer (type)                   Output Shape         Param #     Connected to                     \n==================================================================================================\n in_sequence (InputLayer)       [(None, 30)]         0           []                               \n                                                                                                  \n embedding (Embedding)          (None, 30, 32)       704         ['in_sequence[0][0]']            \n                                                                                                  \n in_precursor_charge (InputLaye  [(None, 6)]         0           []                               \n r)                                                                                               \n                                                                                                  \n in_collision_energy (InputLaye  [(None, 1)]         0           []                               \n r)                                                                                               \n                                                                                                  \n gru (GRU)                      (None, 124)          58776       ['embedding[0][0]']              \n                                                                                                  \n dense (Dense)                  (None, 124)          868         ['in_precursor_charge[0][0]']    \n                                                                                                  \n dense_1 (Dense)                (None, 124)          248         ['in_collision_energy[0][0]']    \n                                                                                                  \n multiply (Multiply)            (None, 124)          0           ['gru[0][0]',                    \n                                                                  'dense[0][0]',                  \n                                                                  'dense_1[0][0]']                \n                                                                                                  \n dense_2 (Dense)                (None, 174)          21750       ['multiply[0][0]']               \n                                                                                                  \n==================================================================================================\nTotal params: 82,346\nTrainable params: 82,346\nNon-trainable params: 0\n__________________________________________________________________________________________________\n\n\n\ndef masked_spectral_distance(true, pred):\n    # This is the spectral angle implementation as used in Prosit\n    # See https://github.com/kusterlab/prosit/blob/master/prosit/losses.py#L4-L16\n    # Note, fragment ions that cannot exists (i.e. y20 for a 7mer) must have the value  -1.\n    import keras.backend as k\n\n    epsilon = k.epsilon()\n    pred_masked = ((true + 1) * pred) / (true + 1 + epsilon)\n    true_masked = ((true + 1) * true) / (true + 1 + epsilon)\n    pred_norm = k.l2_normalize(true_masked, axis=-1)\n    true_norm = k.l2_normalize(pred_masked, axis=-1)\n    product = k.sum(pred_norm * true_norm, axis=1)\n    arccos = tf.acos(product)\n    return 2 * arccos / np.pi\n\nmodel.compile(optimizer='Adam', loss=masked_spectral_distance)\nhistory = model.fit(x=x_train, y=y_train, epochs=EPOCHS, batch_size=BATCH_SIZE, validation_data=(x_validation, y_validation))\n\nEpoch 1/5\n2362/2362 [==============================] - 30s 10ms/step - loss: 0.5062 - val_loss: 0.4790\nEpoch 2/5\n2362/2362 [==============================] - 17s 7ms/step - loss: 0.4308 - val_loss: 0.4086\nEpoch 3/5\n2362/2362 [==============================] - 17s 7ms/step - loss: 0.3789 - val_loss: 0.3665\nEpoch 4/5\n2362/2362 [==============================] - 17s 7ms/step - loss: 0.3429 - val_loss: 0.3404\nEpoch 5/5\n2362/2362 [==============================] - 17s 7ms/step - loss: 0.3123 - val_loss: 0.3086"
  },
  {
    "objectID": "tutorials/fragmentation/preannotated-prosit.html#model-evaluation",
    "href": "tutorials/fragmentation/preannotated-prosit.html#model-evaluation",
    "title": "Prosit-style GRU with pre-annotated ProteomeTools data",
    "section": "Model evaluation",
    "text": "Model evaluation\n\n# Plotting the training history\n\nimport matplotlib.pyplot as plt\n\nplt.plot(range(EPOCHS), history.history['loss'], '-', color='r', label='Training loss')\nplt.plot(range(EPOCHS), history.history['val_loss'], '--', color='r', label='Validation loss')\nplt.title(f'Training and validation loss across epochs')\nplt.xlabel('Epochs')\nplt.ylabel('Loss')\nplt.legend()\nplt.show()\n\n\n\n\n\ntest_spectral_angle = model.evaluate(x_test, y_test)\ntest_spectral_angle\n\n2341/2341 [==============================] - 7s 3ms/step - loss: 0.3031\n\n\n0.3031081557273865"
  },
  {
    "objectID": "tutorials/index.html",
    "href": "tutorials/index.html",
    "title": "Tutorials",
    "section": "",
    "text": "On ProteomicsML you will find detailed tutorials outlining how to work the latest state-of-the-art machine learning models, and even how to turn your own raw data into a suitable format. Explore all tutorials on ProteomicsML and click the “Open in Colab” badge to interact with the notebooks in a userfriendly coding environment."
  },
  {
    "objectID": "tutorials/index.html#detectability",
    "href": "tutorials/index.html#detectability",
    "title": "Tutorials",
    "section": "Detectability",
    "text": "Detectability\n\n\n\n\n\n\nTitle\n\n\nAuthor\n\n\nDate\n\n\n\n\n\n\nModelling protein detectability with an MLP\n\n\nEric Deutsch\n\n\nSep 21, 2022\n\n\n\n\n\n\nNo matching items"
  },
  {
    "objectID": "tutorials/index.html#fragmentation",
    "href": "tutorials/index.html#fragmentation",
    "title": "Tutorials",
    "section": "Fragmentation",
    "text": "Fragmentation\n\n\n\n\n\n\nTitle\n\n\nAuthor\n\n\nDate\n\n\n\n\n\n\nNIST (part 1): Preparing a spectral library for ML\n\n\nRalf Gabriels\n\n\nOct 5, 2022\n\n\n\n\nNIST (part 2): Traditional ML: Gradient boosting\n\n\nRalf Gabriels\n\n\nOct 5, 2022\n\n\n\n\nProsit-style GRU with pre-annotated ProteomeTools data\n\n\nSiegfried Gessulat\n\n\nSep 28, 2022\n\n\n\n\nRaw file processing with PROSIT style annotation\n\n\nTobias Greisager Rehfeldt\n\n\nSep 21, 2022\n\n\n\n\n\n\nNo matching items"
  },
  {
    "objectID": "tutorials/index.html#ion-mobility",
    "href": "tutorials/index.html#ion-mobility",
    "title": "Tutorials",
    "section": "Ion mobility",
    "text": "Ion mobility\n\n\n\n\n\n\nTitle\n\n\nAuthor\n\n\nDate\n\n\n\n\n\n\nPredicting CCS values for TIMS data\n\n\nRobbin Bouwmeester\n\n\nSep 21, 2022\n\n\n\n\n\n\nNo matching items"
  },
  {
    "objectID": "tutorials/index.html#retention-time",
    "href": "tutorials/index.html#retention-time",
    "title": "Tutorials",
    "section": "Retention time",
    "text": "Retention time\n\n\n\n\n\n\nTitle\n\n\nAuthor\n\n\nDate\n\n\n\n\n\n\nDLOmix embedding of Prosit model on ProteomeTools data\n\n\nTobias Greisager Rehfeldt\n\n\nSep 21, 2022\n\n\n\n\nManual embedding of Bi-LSTM model on ProteomeTools data\n\n\nTobias Greisager Rehfeldt\n\n\nSep 21, 2022\n\n\n\n\nPreparing a retention time data set for machine learning\n\n\nRobbin Bouwmeester\n\n\nSep 23, 2022\n\n\n\n\nTransfer learning with DeepLC\n\n\nRobbin Bouwmeester\n\n\nFeb 3, 2023\n\n\n\n\n\n\nNo matching items"
  },
  {
    "objectID": "tutorials/ionmobility/index.html",
    "href": "tutorials/ionmobility/index.html",
    "title": "Ion mobility",
    "section": "",
    "text": "Title\n\n\nAuthor\n\n\nDate\n\n\n\n\n\n\nPredicting CCS values for TIMS data\n\n\nRobbin Bouwmeester\n\n\nSep 21, 2022\n\n\n\n\n\n\nNo matching items"
  },
  {
    "objectID": "tutorials/ionmobility/meier-tims-ccs.html#introduction",
    "href": "tutorials/ionmobility/meier-tims-ccs.html#introduction",
    "title": "Predicting CCS values for TIMS data",
    "section": "Introduction",
    "text": "Introduction\nIon mobility is a technique to separate ionized analytes based on their size, shape, and physicochemical properties. Initially the techniques for ion mobility propelled the ions with an electric field through a cell with inert gas. The ions collide with the inert gas without fragmentation. Separation is achieved by propelling the ions faster or slower in the electric field (i.e., based on their charge) and are slowed down by the collisions with the gas (i.e., based on shape and size). Trapped ion mobility (TIMS) reverses this operation by trapping the ions in an electric field and forcing them forward by collision with the gas. From any of the different ion mobility techniques you are able to derive the collisional cross section (CCS) in Angstrom squared. In this notebook you can follow a short tutorial on how to train a Machine Learning model for the prediction of these CCS values.\n\nimport pandas as pd\nfrom matplotlib import pyplot as plt\nfrom collections import Counter\nfrom scipy.stats import pearsonr\n\nvol_dict = {\"A\" : 88.6,\n            \"B\" : 0.0,\n            \"O\" : 0.0,\n            \"X\" : 0.0,\n            \"J\" : 0.0,\n            \"R\" : 173.4,\n            \"N\" : 114.1,\n            \"D\" : 111.1,\n            \"C\" : 108.5,\n            \"Q\" : 143.8,\n            \"E\" : 138.4,\n            \"G\" : 60.1,\n            \"H\" : 153.2,\n            \"I\" : 166.7,\n            \"L\" : 166.7,\n            \"K\" : 168.6,\n            \"M\" : 162.9,\n            \"F\" : 189.9,\n            \"P\" : 112.7,\n            \"S\" : 89.0,\n            \"T\" : 116.1,\n            \"W\" : 227.8,\n            \"Y\" : 193.6,\n            \"V\" : 140}\n\naa_to_pos = dict(zip(vol_dict.keys(),range(len(vol_dict.keys()))))"
  },
  {
    "objectID": "tutorials/ionmobility/meier-tims-ccs.html#data-reading-and-preparation",
    "href": "tutorials/ionmobility/meier-tims-ccs.html#data-reading-and-preparation",
    "title": "Predicting CCS values for TIMS data",
    "section": "Data reading and preparation",
    "text": "Data reading and preparation\nRead the training data from Meier et al.\n\nccs_df = pd.read_csv(\"https://github.com/ProteomicsML/ProteomicsML/blob/main/datasets/ionmobility/Meier_IM_CCS/combined_sm.zip?raw=true\", compression=\"zip\", index_col=0)\n\nExecute the cell below to read a smaller data set from Van Puyvelde et al.. Remove all the “#” to read this smaller data set. On for example colab it is recommended to load this smaller data set. Please do note that the description is based on the larger data set. It is expected that more complex models do not benefit at the same rate from the smaller data set (e.g., the deep learning network). Hans Vissers from Waters analyzed this traveling wave IM data:\n\n#ccs_df = pd.read_csv(\n#    \"https://github.com/ProteomicsML/ProteomicsML/raw/main/datasets/ionmobility/VanPuyvelde_TWIMS_CCS/TWIMSpeptideCCS.tsv.gz\",\n#    compression=\"gzip\",\n#    low_memory=False,\n#    sep=\"\\t\"\n#)\n\nA small summarization of the data that was just read:\n\nccs_df.describe()\n\n\n\n\n\n\n\n\nCharge\nMass\nIntensity\nRetention time\nCCS\n\n\n\n\ncount\n718917.000000\n718917.000000\n7.189170e+05\n718917.000000\n718917.000000\n\n\nmean\n2.376747\n1829.771049\n6.716163e+05\n300.215311\n475.545205\n\n\nstd\n0.582843\n606.256496\n2.139819e+06\n940.711797\n109.083740\n\n\nmin\n2.000000\n696.428259\n2.790800e+02\n0.004795\n275.418854\n\n\n25%\n2.000000\n1361.766700\n5.405700e+04\n28.260000\n392.076630\n\n\n50%\n2.000000\n1729.834520\n1.655000e+05\n50.624000\n454.656281\n\n\n75%\n3.000000\n2189.009920\n5.357000e+05\n84.241000\n534.702698\n\n\nmax\n4.000000\n4599.284130\n2.481000e+08\n6897.700000\n1118.786133\n\n\n\n\n\n\n\n\nccs_df\n\n\n\n\n\n\n\n\nModified sequence\nCharge\nMass\nIntensity\nRetention time\nCCS\nPT\n\n\n\n\n0\n_(ac)AAAAAAAAAAGAAGGR_\n2\n1239.63200\n149810.0\n70.140\n409.092529\nFalse\n\n\n1\n_(ac)AAAAAAAAEQQSSNGPVKK_\n2\n1810.91734\n21349.0\n19.645\n481.229248\nTrue\n\n\n2\n_(ac)AAAAAAAGAAGSAAPAAAAGAPGSGGAPSGSQGVLIGDR_\n3\n3144.55482\n194000.0\n3947.700\n772.098083\nFalse\n\n\n3\n_(ac)AAAAAAAGDSDSWDADAFSVEDPVRK_\n2\n2634.18340\n6416400.0\n94.079\n573.213196\nFalse\n\n\n4\n_(ac)AAAAAAAGDSDSWDADAFSVEDPVRK_\n3\n2634.18340\n5400600.0\n94.841\n635.000549\nFalse\n\n\n...\n...\n...\n...\n...\n...\n...\n...\n\n\n718912\n_YYYVCQYCPAGNNM(ox)NR_\n2\n2087.82880\n131230.0\n21.753\n461.667145\nTrue\n\n\n718913\n_YYYVCQYCPAGNWANR_\n2\n2083.86690\n84261.0\n28.752\n459.721191\nTrue\n\n\n718914\n_YYYVPADFVEYEK_\n2\n1684.76609\n382810.0\n92.273\n436.103699\nFalse\n\n\n718915\n_YYYVQNVYTPVDEHVYPDHR_\n3\n2556.17099\n30113.0\n26.381\n580.297058\nTrue\n\n\n718916\n_YYYVQNVYTPVDEHVYPDHR_\n4\n2556.17099\n33682.0\n26.381\n691.901123\nTrue\n\n\n\n\n718917 rows × 7 columns\n\n\n\nPrepare the data to not contain any \"_\" characters or modifications in between [ ]:\n\n# Strip \"_\" from sequence\nccs_df[\"sequence\"] = ccs_df[\"Modified sequence\"].str.strip(\"_\")\n\n# Strip everything between \"()\" and \"[]\" from sequence\nccs_df[\"sequence\"] = ccs_df[\"sequence\"].str.replace(r\"[\\(\\[].*?[\\)\\]]\", \"\", regex=True)\n\nCount the occurence of amino acids, those that did not get detected; repace with 0\n\n# Apply counter to each sequence, fill NA with 0.0, make matrix from counts\nX_matrix_count = pd.DataFrame(ccs_df[\"sequence\"].apply(Counter).to_dict()).fillna(0.0).T\n\n\nX_matrix_count\n\n\n\n\n\n\n\n\nA\nG\nR\nE\nQ\nS\nN\nP\nV\nK\nL\nI\nD\nW\nF\nM\nT\nC\nY\nH\n\n\n\n\n0\n12.0\n3.0\n1.0\n0.0\n0.0\n0.0\n0.0\n0.0\n0.0\n0.0\n0.0\n0.0\n0.0\n0.0\n0.0\n0.0\n0.0\n0.0\n0.0\n0.0\n\n\n1\n8.0\n1.0\n0.0\n1.0\n2.0\n2.0\n1.0\n1.0\n1.0\n2.0\n0.0\n0.0\n0.0\n0.0\n0.0\n0.0\n0.0\n0.0\n0.0\n0.0\n\n\n2\n17.0\n9.0\n1.0\n0.0\n1.0\n4.0\n0.0\n3.0\n1.0\n0.0\n1.0\n1.0\n1.0\n0.0\n0.0\n0.0\n0.0\n0.0\n0.0\n0.0\n\n\n3\n9.0\n1.0\n1.0\n1.0\n0.0\n3.0\n0.0\n1.0\n2.0\n1.0\n0.0\n0.0\n5.0\n1.0\n1.0\n0.0\n0.0\n0.0\n0.0\n0.0\n\n\n4\n9.0\n1.0\n1.0\n1.0\n0.0\n3.0\n0.0\n1.0\n2.0\n1.0\n0.0\n0.0\n5.0\n1.0\n1.0\n0.0\n0.0\n0.0\n0.0\n0.0\n\n\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n...\n\n\n718912\n1.0\n1.0\n1.0\n0.0\n1.0\n0.0\n3.0\n1.0\n1.0\n0.0\n0.0\n0.0\n0.0\n0.0\n0.0\n1.0\n0.0\n2.0\n4.0\n0.0\n\n\n718913\n2.0\n1.0\n1.0\n0.0\n1.0\n0.0\n2.0\n1.0\n1.0\n0.0\n0.0\n0.0\n0.0\n1.0\n0.0\n0.0\n0.0\n2.0\n4.0\n0.0\n\n\n718914\n1.0\n0.0\n0.0\n2.0\n0.0\n0.0\n0.0\n1.0\n2.0\n1.0\n0.0\n0.0\n1.0\n0.0\n1.0\n0.0\n0.0\n0.0\n4.0\n0.0\n\n\n718915\n0.0\n0.0\n1.0\n1.0\n1.0\n0.0\n1.0\n2.0\n4.0\n0.0\n0.0\n0.0\n2.0\n0.0\n0.0\n0.0\n1.0\n0.0\n5.0\n2.0\n\n\n718916\n0.0\n0.0\n1.0\n1.0\n1.0\n0.0\n1.0\n2.0\n4.0\n0.0\n0.0\n0.0\n2.0\n0.0\n0.0\n0.0\n1.0\n0.0\n5.0\n2.0\n\n\n\n\n718917 rows × 20 columns\n\n\n\nA fairly rudimentary technique is to use the volume of each amino acid and sum these volumes:\n\ndef to_predicted_ccs(row):\n    vol_sum = sum([vol_dict[k]*v for k,v in row.to_dict().items()])\n    return vol_sum\n\nccs_df[\"predicted_CCS_vol_based\"] = X_matrix_count.apply(to_predicted_ccs,axis=1)\n\nLets see the results:\n\nif len(ccs_df.index) &lt; 1e4:\n    set_alpha = 0.2\n    set_size = 3\nelse:\n    set_alpha = 0.05\n    set_size = 1\n\nfor c in range(2,5):\n    plt.scatter(\n        ccs_df.loc[ccs_df[\"Charge\"]==c,\"CCS\"],\n        ccs_df.loc[ccs_df[\"Charge\"]==c,\"predicted_CCS_vol_based\"],\n        alpha=set_alpha,\n        s=set_size,\n        label=\"Z=\"+str(c)\n    )\n\nlegend = plt.legend()\n\nfor lh in legend.legendHandles:\n    lh.set_sizes([25])\n    lh.set_alpha(1)\n\nplt.xlabel(\"Observed CCS (Angstrom^2)\")\nplt.xlabel(\"Predicted CCS (Angstrom^2)\")\n\nplt.show()\n\n\n\n\nClear correlation, but seems we need to change the intercepts of each curve and make seperate predictions for each peptide charge state. In addition to these observations it seems that higher charge states have higher errors. This likely influenced by a large part by the relation between higher charge states and longer peptides. These longer peptides can deviate more from each other in terms of structures (and CCS). Instead of spending more time on this, lets have a look at a more ML-based approach."
  },
  {
    "objectID": "tutorials/ionmobility/meier-tims-ccs.html#training-a-linear-regression-model-for-ccs-prediction",
    "href": "tutorials/ionmobility/meier-tims-ccs.html#training-a-linear-regression-model-for-ccs-prediction",
    "title": "Predicting CCS values for TIMS data",
    "section": "Training a linear regression model for CCS prediction",
    "text": "Training a linear regression model for CCS prediction\n\nfrom sklearn.linear_model import LinearRegression\nimport numpy as np\nimport random\n\nIn this section we will fit a linear regression model. This model is only able to fit a linear function between the features (sequence) and target (CCS). This linear model can be expressed as the following equation:\n$ Y = _0 + _1 X$\nWhere \\(Y\\) is a vector (/list) of all CCS values and X a matrix (/2-dimensional list) of all the amino acids counts. The intercept and weights of each features are learned so the predicted value (\\(\\hat{Y}\\)) is close to the observed outcome (\\(Y\\)). What is considered close and how this closeness between predictions and observations are minimized is not further discussed here. However, there is a rich amount of information available on the internet (e.g., https://www.coursera.org/learn/machine-learning).\nFirst, we will split the matrix into 90% training peptides and 10% testing peptides. These testing peptides are very valuable in estimating model performance. Since the model has not seen these sequences before it cannot overfit on these particular examples.\n\n# Get all the index identifiers\nall_idx = list(X_matrix_count.index)\nrandom.seed(42)\n\n# Shuffle the index identifiers so we can randomly split them in a testing and training set\nrandom.shuffle(all_idx)\n\n# Select 90 % for training and the remaining 10 % for testing\ntrain_idx = all_idx[0:int(len(all_idx)*0.9)]\ntest_idx = all_idx[int(len(all_idx)*0.9):]\n\n# Get the train and test indices and point to new variables\nccs_df_train = ccs_df.loc[train_idx,:]\nccs_df_test = ccs_df.loc[test_idx,:]\n\n# Also for the feature matrix get the train and test indices\nX_matrix_count_train = X_matrix_count.loc[train_idx,:]\nX_matrix_count_test = X_matrix_count.loc[test_idx,:]\n\nNow lets start training the models. Although we could encode the charge as a feature here we separate all models to counter any charge to composition specific patterns.\n\n# Initialize a model object\nlinear_model_z2 = LinearRegression()\n\n# Fit the initialized model object to our training data (only charge 2)\nlinear_model_z2.fit(\n    X=X_matrix_count_train.loc[ccs_df_train[\"Charge\"]==2,:],\n    y=ccs_df_train.loc[ccs_df_train[\"Charge\"]==2,\"CCS\"]\n)\n\n# Repeat for the other two charge states\nlinear_model_z3 = LinearRegression()\nlinear_model_z3.fit(\n    X=X_matrix_count_train.loc[ccs_df_train[\"Charge\"]==3,:],\n    y=ccs_df_train.loc[ccs_df_train[\"Charge\"]==3,\"CCS\"]\n)\n\nlinear_model_z4 = LinearRegression()\nlinear_model_z4.fit(\n    X=X_matrix_count_train.loc[ccs_df_train[\"Charge\"]==4,:],\n    y=ccs_df_train.loc[ccs_df_train[\"Charge\"]==4,\"CCS\"]\n)\n\nLinearRegression()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.LinearRegressionLinearRegression()\n\n\nNow we can have a look at the coefficients \\(\\beta_1\\) learned. These should be highly correlated with the previous experimentally determined volumetric observations for each amino acid:\n\n# Scatter plot the coefficients of each amino acid against their experimentally determined volumes\nplt.scatter(\n    linear_model_z2.coef_,\n    [vol_dict[v] for v in X_matrix_count.columns]\n)\n\n# Plot a diagonal line we expect the points to be on\nplt.plot(\n    [6.0,26.0],\n    [60.0,260],\n    c=\"grey\",\n    zorder=0\n)\n\n# Annotate each point with their respective amino acids\nfor v,x,y in zip(X_matrix_count.columns,\n                 linear_model_z2.coef_,\n                 [vol_dict[v] for v in X_matrix_count.columns]):\n\n    plt.annotate(v,(x+0.1,y+5.0))\n\nplt.show()\n\n\n\n\nObservations are very similar. There are differences that could be cause by a multitude of reasons. For example, the difference between volumetric observations in the CCS cell is different or being part of a polypeptide chain changes the volume of the amino acid.\nNext we will plot the predictions of the test set and compare them with observational data. Note that we apply each charge model seperately.\n\nif len(ccs_df.index) &lt; 1e4:\n    set_alpha = 0.2\n    set_size = 3\nelse:\n    set_alpha = 0.05\n    set_size = 1\n\n# Scatter plot the observations on the test set against the predictions on the same set (z=2)\nplt.scatter(\n    linear_model_z2.predict(X=X_matrix_count_test.loc[ccs_df[\"Charge\"]==2,:]),\n    ccs_df_test.loc[ccs_df[\"Charge\"]==2,\"CCS\"],\n    alpha=set_alpha,\n    s=set_size,\n    label=\"Z=2\"\n)\n\n# Scatter plot the observations on the test set against the predictions on the same set (z=3)\nplt.scatter(\n    linear_model_z3.predict(X=X_matrix_count_test.loc[ccs_df[\"Charge\"]==3,:]),\n    ccs_df_test.loc[ccs_df[\"Charge\"]==3,\"CCS\"],\n    alpha=set_alpha,\n    s=set_size,\n    label=\"Z=3\"\n)\n\n# Scatter plot the observations on the test set against the predictions on the same set (z=4)\nplt.scatter(\n    linear_model_z4.predict(X=X_matrix_count_test.loc[ccs_df[\"Charge\"]==4,:]),\n    ccs_df_test.loc[ccs_df[\"Charge\"]==4,\"CCS\"],\n    alpha=set_alpha,\n    s=set_size,\n    label=\"Z=4\"\n)\n\n# Plot a diagonal the points should be one\nplt.plot([300,1100],[300,1100],c=\"grey\")\n\n# Add a legend for the charge states\nlegend = plt.legend()\n\n# Make sure the legend labels are visible and big enough\nfor lh in legend.legendHandles:\n    lh.set_sizes([25])\n    lh.set_alpha(1)\n\n# Get the predictions and calculate performance metrics\npredictions = linear_model_z2.predict(X_matrix_count_test.loc[ccs_df[\"Charge\"]==3,:])\nmare = round(sum((abs(predictions-ccs_df_test.loc[ccs_df[\"Charge\"]==3,\"CCS\"])/ccs_df_test.loc[ccs_df[\"Charge\"]==3,\"CCS\"])*100)/len(predictions),3)\npcc = round(pearsonr(predictions,ccs_df_test.loc[ccs_df[\"Charge\"]==3,\"CCS\"])[0],3)\nperc_95 = round(np.percentile((abs(predictions-ccs_df_test.loc[ccs_df[\"Charge\"]==3,\"CCS\"])/ccs_df_test.loc[ccs_df[\"Charge\"]==3,\"CCS\"])*100,95)*2,2)\n\nplt.title(f\"Linear model - PCC: {pcc} - MARE: {mare}% - 95th percentile: {perc_95}% (z3 model for z3 observations)\")\n\nax = plt.gca()\nax.set_aspect('equal')\n\nplt.xlabel(\"Observed CCS (^2)\")\nplt.ylabel(\"Predicted CCS (^2)\")\n\nplt.show()\n\n\n\n\nIt is clear that the predictions and observations are on the diagonal. This means that they are very similar. However, there are still some differences between observations and predictions.\nIn the previous example we trained models for charge state seperately. This is slightly inconvenient and other charge states might still be able to provide useful training examples. As long as the model corrects for the right charge state of course. In the next example we add charge state to the feature matrix. The linear model should be (partially…) able to account for the charge states of peptides.\n\n# Make a new copy of feature matrix and add charge as a feature\nX_matrix_count_charge_train = X_matrix_count_train.copy()\nX_matrix_count_charge_train[\"charge\"] = ccs_df_train[\"Charge\"]\n\nX_matrix_count_charge_test = X_matrix_count_test.copy()\nX_matrix_count_charge_test[\"charge\"] = ccs_df_test[\"Charge\"]\n\n\n# Fit the linear model, but this time with the charge as a feature\nlinear_model = LinearRegression()\n\nlinear_model.fit(\n    X=X_matrix_count_charge_train,\n    y=ccs_df_train.loc[:,\"CCS\"]\n)\n\nLinearRegression()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.LinearRegressionLinearRegression()\n\n\n\nif len(ccs_df.index) &lt; 1e4:\n    set_alpha = 0.2\n    set_size = 3\nelse:\n    set_alpha = 0.05\n    set_size = 1\n\n# Scatter plot the observations on the test set against the predictions on the same set\nplt.scatter(\n    linear_model.predict(X=X_matrix_count_charge_test.loc[ccs_df[\"Charge\"]==2,:]),\n    ccs_df_test.loc[ccs_df[\"Charge\"]==2,\"CCS\"],\n    alpha=set_alpha,\n    s=1,\n    label=\"Z=2\"\n)\n\nplt.scatter(\n    linear_model.predict(X=X_matrix_count_charge_test.loc[ccs_df[\"Charge\"]==3,:]),\n    ccs_df_test.loc[ccs_df[\"Charge\"]==3,\"CCS\"],\n    alpha=set_alpha,\n    s=1,\n    label=\"Z=3\"\n)\n\nplt.scatter(\n    linear_model.predict(X=X_matrix_count_charge_test.loc[ccs_df[\"Charge\"]==4,:]),\n    ccs_df_test.loc[ccs_df[\"Charge\"]==4,\"CCS\"],\n    alpha=set_alpha,\n    s=set_size,\n    label=\"Z=4\"\n)\n\n# Plot a diagonal the points should be one\nplt.plot([300,1100],[300,1100],c=\"grey\")\n\nlegend = plt.legend()\n\nfor lh in legend.legendHandles:\n    lh.set_sizes([25])\n    lh.set_alpha(1)\n\n# Get the predictions and calculate performance metrics\npredictions = linear_model.predict(X=X_matrix_count_charge_test)\nmare = round(sum((abs(predictions-ccs_df_test.loc[:,\"CCS\"])/ccs_df_test.loc[:,\"CCS\"])*100)/len(predictions),3)\npcc = round(pearsonr(predictions,ccs_df_test.loc[:,\"CCS\"])[0],3)\nperc_95 = round(np.percentile((abs(predictions-ccs_df_test.loc[:,\"CCS\"])/ccs_df_test.loc[:,\"CCS\"])*100,95)*2,2)\n\nplt.title(f\"Linear model - PCC: {pcc} - MARE: {mare}% - 95th percentile: {perc_95}%\")\n\nax = plt.gca()\nax.set_aspect('equal')\n\nplt.xlabel(\"Observed CCS (^2)\")\nplt.ylabel(\"Predicted CCS (^2)\")\n\nplt.show()\n\n\n\n\nWith this model we are capable to predict CCS values for all three charge states (maybe more; be careful with extrapolation). However, it also shows that both z3 and z4 are not optimally predicted. Especially z4 we can probably draw a line manually that provides better performance than the current model. The incapability of the model to correctly predict some of these values is largely due to the linear algorithm. With this algorithm we can only fit “simple” linear relations, but more complex relations are not modeled correctly. In the next section we will fit a non-linear model that is able to capture these complex relations better. However, keep in mind that more complex models are usually also able to overfit data better, resulting in poorer generalization performance."
  },
  {
    "objectID": "tutorials/ionmobility/meier-tims-ccs.html#training-an-rf-non-linear-regression-model-for-ccs-prediction",
    "href": "tutorials/ionmobility/meier-tims-ccs.html#training-an-rf-non-linear-regression-model-for-ccs-prediction",
    "title": "Predicting CCS values for TIMS data",
    "section": "Training an RF (non-linear) regression model for CCS prediction",
    "text": "Training an RF (non-linear) regression model for CCS prediction\nIn this section we will fit a random forest (RF) regression model. We hope to fit some of the non-linear relations present in the data. The RF algorithm fits multiple decision trees, but what makes these trees different is the random selection of instances (peptides) and/or features (amino acid count). The predictions between the forest of trees can be averaged to obtain a single prediction per peptide (instead of multiple for the same peptide). Later we will see that the algorithm might actually not be suitable for fitting this type of data.\n\nfrom sklearn.ensemble import RandomForestRegressor\n\n\n# Make a new copy of feature matrix and add charge as a feature\nX_matrix_count_charge_train = X_matrix_count_train.copy()\nX_matrix_count_charge_train[\"charge\"] = ccs_df_train[\"Charge\"]\n\nX_matrix_count_charge_test = X_matrix_count_test.copy()\nX_matrix_count_charge_test[\"charge\"] = ccs_df_test[\"Charge\"]\n\n\n# Initialize a RF object, note the hyperparameters that the model will follow\nrf_model = RandomForestRegressor(\n                max_depth=20,\n                n_estimators=50,\n                n_jobs=-1\n)\n\n# Fit the RF model\nrf_model.fit(\n    X=X_matrix_count_charge_train,\n    y=ccs_df_train.loc[:,\"CCS\"]\n)\n\nRandomForestRegressor(max_depth=20, n_estimators=50, n_jobs=-1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.RandomForestRegressorRandomForestRegressor(max_depth=20, n_estimators=50, n_jobs=-1)\n\n\n\nif len(ccs_df.index) &lt; 1e4:\n    set_alpha = 0.2\n    set_size = 3\nelse:\n    set_alpha = 0.05\n    set_size = 1\n\n# Scatter plot the observations on the test set against the predictions on the same set\nplt.scatter(\n    rf_model.predict(X=X_matrix_count_charge_test.loc[ccs_df_test[\"Charge\"]==2,:]),\n    ccs_df_test.loc[ccs_df_test[\"Charge\"]==2,\"CCS\"],\n    alpha=set_alpha,\n    s=set_size,\n    label=\"Z=2\"\n)\n\nplt.scatter(\n    rf_model.predict(X=X_matrix_count_charge_test.loc[ccs_df_test[\"Charge\"]==3,:]),\n    ccs_df_test.loc[ccs_df_test[\"Charge\"]==3,\"CCS\"],\n    alpha=set_alpha,\n    s=set_size,\n    label=\"Z=3\"\n)\n\nplt.scatter(\n    rf_model.predict(X=X_matrix_count_charge_test.loc[ccs_df_test[\"Charge\"]==4,:]),\n    ccs_df_test.loc[ccs_df_test[\"Charge\"]==4,\"CCS\"],\n    alpha=set_alpha,\n    s=set_size,\n    label=\"Z=4\"\n)\n\n# Plot a diagonal the points should be one\nplt.plot([300,1100],[300,1100],c=\"grey\")\n\nlegend = plt.legend()\n\nfor lh in legend.legendHandles:\n    lh.set_sizes([25])\n    lh.set_alpha(1)\n\n# Get the predictions and calculate performance metrics\npredictions = rf_model.predict(X=X_matrix_count_charge_test)\nmare = round(sum((abs(predictions-ccs_df_test.loc[:,\"CCS\"])/ccs_df_test.loc[:,\"CCS\"])*100)/len(predictions),3)\npcc = round(pearsonr(predictions,ccs_df_test.loc[:,\"CCS\"])[0],3)\nperc_95 = round(np.percentile((abs(predictions-ccs_df_test.loc[:,\"CCS\"])/ccs_df_test.loc[:,\"CCS\"])*100,95)*2,2)\n\nplt.title(f\"RF - PCC: {pcc} - MARE: {mare}% - 95th percentile: {perc_95}%\")\n\nax = plt.gca()\nax.set_aspect('equal')\n\nplt.xlabel(\"Observed CCS (^2)\")\nplt.ylabel(\"Predicted CCS (^2)\")\n\nplt.show()\n\n\n\n\nAs can be observed the problem with z=4 splitting up is gone, probably due to the capability of RF to fit non-linear relations. However, we see quite a large deviation on the diagonal. One of the major causes of this problem is the exclusion of amino acid counts for the decision trees. Although this is fundamental to the inner workings of RF, it means that we cannot take the excluded amino acids into account and these values are likely to be replaced by average expected volume to other (non-excluded) amino acids. RF performs very well when features correlate, and predictions are not fully dependent on the inclusion of all features. Next we will look at a decision tree algorithm (XGBoost) that does not rely on the exclusion of features.\nPS note that you might be able to fit a much better model by using a much larger number of trees, but overall the problem largely remains, and it is better to choose an algorithm that respects/fits your data best."
  },
  {
    "objectID": "tutorials/ionmobility/meier-tims-ccs.html#training-a-xgboost-non-linear-regression-model-for-ccs-prediction",
    "href": "tutorials/ionmobility/meier-tims-ccs.html#training-a-xgboost-non-linear-regression-model-for-ccs-prediction",
    "title": "Predicting CCS values for TIMS data",
    "section": "Training a XGBoost (non-linear) regression model for CCS prediction",
    "text": "Training a XGBoost (non-linear) regression model for CCS prediction\nIn this section we will fit a XGBoost regression model. This algorithm works by training a sequence of underfitted models. Each model in the sequence receives the output of the previous decision tree models. This combination of trees allows to fit the data well without greatly overfitting it.\n\nfrom xgboost import XGBRegressor\n\n\n# Make a new copy of feature matrix and add charge as a feature\nX_matrix_count_charge_train = X_matrix_count_train.copy()\nX_matrix_count_charge_train[\"charge\"] = ccs_df_train[\"Charge\"]\n\nX_matrix_count_charge_test = X_matrix_count_test.copy()\nX_matrix_count_charge_test[\"charge\"] = ccs_df_test[\"Charge\"]\n\n\n# Initialize the XGB object\nxgb_model = XGBRegressor(\n                max_depth=12,\n                n_estimators=250\n)\n\n# Fit the XGB model\nxgb_model.fit(\n    X_matrix_count_charge_train,\n    ccs_df_train.loc[:,\"CCS\"]\n)\n\nXGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,\n             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,\n             early_stopping_rounds=None, enable_categorical=False,\n             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',\n             importance_type=None, interaction_constraints='',\n             learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,\n             max_delta_step=0, max_depth=12, max_leaves=0, min_child_weight=1,\n             missing=nan, monotone_constraints='()', n_estimators=250, n_jobs=0,\n             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,\n             reg_lambda=1, ...)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.XGBRegressorXGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,\n             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,\n             early_stopping_rounds=None, enable_categorical=False,\n             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',\n             importance_type=None, interaction_constraints='',\n             learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,\n             max_delta_step=0, max_depth=12, max_leaves=0, min_child_weight=1,\n             missing=nan, monotone_constraints='()', n_estimators=250, n_jobs=0,\n             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,\n             reg_lambda=1, ...)\n\n\n\nif len(ccs_df.index) &lt; 1e4:\n    set_alpha = 0.2\n    set_size = 3\nelse:\n    set_alpha = 0.05\n    set_size = 1\n\n# Scatter plot the observations on the test set against the predictions on the same set\nplt.scatter(\n    ccs_df_test.loc[ccs_df_test[\"Charge\"]==2,\"CCS\"],\n    xgb_model.predict(X=X_matrix_count_charge_test.loc[ccs_df_test[\"Charge\"]==2,:]),\n    alpha=set_alpha,\n    s=set_size,\n    label=\"Z=2\")\n\nplt.scatter(\n    xgb_model.predict(X=X_matrix_count_charge_test.loc[ccs_df_test[\"Charge\"]==3,:]),\n    ccs_df_test.loc[ccs_df_test[\"Charge\"]==3,\"CCS\"],\n    alpha=set_alpha,\n    s=set_size,\n    label=\"Z=3\"\n)\n\nplt.scatter(\n    xgb_model.predict(X=X_matrix_count_charge_test.loc[ccs_df_test[\"Charge\"]==4,:]),\n    ccs_df_test.loc[ccs_df_test[\"Charge\"]==4,\"CCS\"],\n    alpha=set_alpha,\n    s=set_size,\n    label=\"Z=4\"\n)\n\n# Plot a diagonal the points should be one\nplt.plot([300,1100],[300,1100],c=\"grey\")\n\nlegend = plt.legend()\n\nfor lh in legend.legendHandles:\n    lh.set_sizes([25])\n    lh.set_alpha(1)\n\n# Get the predictions and calculate performance metrics\npredictions = xgb_model.predict(X_matrix_count_charge_test)\nmare = round(sum((abs(predictions-ccs_df_test.loc[:,\"CCS\"])/ccs_df_test.loc[:,\"CCS\"])*100)/len(predictions),3)\npcc = round(pearsonr(predictions,ccs_df_test.loc[:,\"CCS\"])[0],3)\nperc_95 = round(np.percentile((abs(predictions-ccs_df_test.loc[:,\"CCS\"])/ccs_df_test.loc[:,\"CCS\"])*100,95)*2,2)\n\nplt.title(f\"XGBoost - PCC: {pcc} - MARE: {mare}% - 95th percentile: {perc_95}%\")\n\nax = plt.gca()\nax.set_aspect('equal')\n\nplt.xlabel(\"Observed CCS (^2)\")\nplt.ylabel(\"Predicted CCS (^2)\")\n\nplt.show()"
  },
  {
    "objectID": "tutorials/ionmobility/meier-tims-ccs.html#training-a-deep-learning-lstm-model-for-ccs-prediction",
    "href": "tutorials/ionmobility/meier-tims-ccs.html#training-a-deep-learning-lstm-model-for-ccs-prediction",
    "title": "Predicting CCS values for TIMS data",
    "section": "Training a deep learning LSTM model for CCS prediction",
    "text": "Training a deep learning LSTM model for CCS prediction\nThe deviation on the diagonal has been decreased significantly. But… A decision tree based algorithm is usually not the best for a regression model. Since the target data is continuous a model that can respect this structure is likely to perform better. Furthermore, up till now we simply counted amino acids, but structure is important. So to get the most out of the data we need to use the exact positions of amino acids.\nAlso… We have a lot of data it makes sense to use deep learning (DL). DL models are usually capable of learning more complex relations than traditional algorithms. Furthormore, for traditional ML algorithms we usually need to engineer features, while DL can usually work directly from raw data. DL is able to construct its own features.\n\nfrom tensorflow.keras.layers import Dense, concatenate, Input, Bidirectional, LSTM\nfrom tensorflow.keras.models import Model\nimport tensorflow as tf\n\nAs mentioned before, we want to use features that can also tell us something about the potential structure of the peptide. This means we need to take the sequence of the peptide into account and not just the amino acid counts. For this we will use a ‘one-hot encoding’, in this matrix each position in the peptide are the columns (number of columns equals the length of the peptide) and each amino acid per position has its own row (for the standard amino acids this is 20). So as a result we create a matrix that is the length of the peptide by the amount of unique amino acids in the whole data set. For each position we indicate the presence with a ‘1’ and absence with ‘0’. As a result the sum of each columnn is ‘1’ and the sum of the whole matrix equals the length of the peptide.\n\ndef aa_seq_to_one_hot(seq,padding_length=60):\n    # Although padding is not needed for an LSTM, we might need it if we for example apply a CNN\n    # Calculate how much padding is needed\n    seq_len = len(seq)\n    if seq_len &gt; padding_length:\n        seq = seq[0:padding_length]\n        seq_len = len(seq)\n\n    # Add padding for peptides that are too short\n    padding = \"\".join([\"X\"] * (padding_length - len(seq)))\n    seq = seq + padding\n\n    # Initialize all feature matrix\n    matrix_hc = np.zeros(\n        (len(aa_to_pos.keys()), len(seq)), dtype=np.int8)\n\n    # Fill the one-hot matrix, when we encounter an 'X' it should be the end of the sequence\n    for idx,aa in enumerate(seq):\n        if aa == \"X\":\n            break\n        matrix_hc[aa_to_pos[aa],idx] = 1\n\n    return matrix_hc\n\n\n# Calculate the one-hot matrices and stack them\n# Result is a 3D matrix where the first dimension is each peptide, and then the last two dims are the one-hot matrix\none_hot_encoded_train = np.stack(ccs_df_train[\"sequence\"].apply(aa_seq_to_one_hot).values)\none_hot_encoded_test = np.stack(ccs_df_test[\"sequence\"].apply(aa_seq_to_one_hot).values)\n\n\nif len(ccs_df.index) &lt; 1e4:\n    epochs = 100\n    num_lstm = 12\n    batch_size = 128\nelse:\n    batch_size = 1024\n    epochs = 10\n    num_lstm = 64\n\nbatch_size = 128\nv_split    = 0.1\noptimizer  = \"adam\"\nloss       = \"mean_squared_error\"\n\n# The architecture chosen consists of two inputs: (1) the one-hot matrix and (2) the charge\n# The first part is a biderectional LSTM (a), in paralel we have dense layers containing the charge (b)\n# Both a and b are concatenated to go through several dense layers (c)\ninput_a = Input(shape=(None, one_hot_encoded_train.shape[2]))\na = Bidirectional(LSTM(num_lstm,return_sequences=True))(input_a)\na = Bidirectional(LSTM(num_lstm))(a)\na = Model(inputs=input_a, outputs=a)\n\ninput_b = Input(shape=(1,))\nb = Dense(5, activation=\"relu\")(input_b)\nb = Model(inputs=input_b, outputs=b)\n\nc = concatenate([a.output, b.output],axis=-1)\n\nc = Dense(64, activation=\"relu\")(c)\nc = Dense(32, activation=\"relu\")(c)\nc = Dense(1, activation=\"relu\")(c)\n\n# Create the model with specified inputs and outputs\nmodel = Model(inputs=[a.input, b.input], outputs=c)\n\nmodel.compile(optimizer=optimizer, loss=loss)\n\n# Fit the model on the training data\nhistory = model.fit(\n            (one_hot_encoded_train,ccs_df_train.loc[:,\"Charge\"]),\n            ccs_df_train.loc[:,\"CCS\"],\n            epochs=epochs,\n            batch_size=batch_size,\n            validation_split=v_split\n)\n\nEpoch 1/10\n4550/4550 [==============================] - 108s 23ms/step - loss: 5536.7256 - val_loss: 467.1984\nEpoch 2/10\n4550/4550 [==============================] - 102s 22ms/step - loss: 436.4136 - val_loss: 403.9187\nEpoch 3/10\n4550/4550 [==============================] - 100s 22ms/step - loss: 395.7551 - val_loss: 384.5470\nEpoch 4/10\n4550/4550 [==============================] - 103s 23ms/step - loss: 376.4315 - val_loss: 382.0145\nEpoch 5/10\n4550/4550 [==============================] - 102s 23ms/step - loss: 364.8819 - val_loss: 395.8338\nEpoch 6/10\n4550/4550 [==============================] - 106s 23ms/step - loss: 357.4092 - val_loss: 355.8185\nEpoch 7/10\n4550/4550 [==============================] - 104s 23ms/step - loss: 342.5216 - val_loss: 312.9571\nEpoch 8/10\n4550/4550 [==============================] - 104s 23ms/step - loss: 275.4517 - val_loss: 274.8963\nEpoch 9/10\n4550/4550 [==============================] - 110s 24ms/step - loss: 253.2955 - val_loss: 252.0118\nEpoch 10/10\n4550/4550 [==============================] - 105s 23ms/step - loss: 240.6064 - val_loss: 260.1613\n\n\n\n# Predict CCS values test set\nccs_df_test[\"LSTM_predictions\"] = model.predict((one_hot_encoded_test,ccs_df_test.loc[:,\"Charge\"]))\n\n2247/2247 [==============================] - 20s 8ms/step\n\n\n\nif len(ccs_df.index) &lt; 1e4:\n    set_alpha = 0.2\n    set_size = 3\nelse:\n    set_alpha = 0.05\n    set_size = 1\n\n# Scatter plot the observations on the test set against the predictions on the same set\nplt.scatter(\n    ccs_df_test.loc[ccs_df_test[\"Charge\"]==2,\"CCS\"],\n    ccs_df_test.loc[ccs_df_test[\"Charge\"]==2,\"LSTM_predictions\"],\n    alpha=set_alpha,\n    s=set_size,\n    label=\"Z=2\"\n)\n\nplt.scatter(\n    ccs_df_test.loc[ccs_df_test[\"Charge\"]==3,\"CCS\"],\n    ccs_df_test.loc[ccs_df_test[\"Charge\"]==3,\"LSTM_predictions\"],\n    alpha=set_alpha,\n    s=set_size,\n    label=\"Z=3\"\n)\n\nplt.scatter(\n    ccs_df_test.loc[ccs_df_test[\"Charge\"]==4,\"CCS\"],\n    ccs_df_test.loc[ccs_df_test[\"Charge\"]==4,\"LSTM_predictions\"],\n    alpha=set_alpha,\n    s=set_size,\n    label=\"Z=4\"\n)\n\n# Plot a diagonal the points should be one\nplt.plot([300,1100],[300,1100],c=\"grey\")\n\nlegend = plt.legend()\n\nfor lh in legend.legendHandles:\n    lh.set_sizes([25])\n    lh.set_alpha(1)\n\n# Get the predictions and calculate performance metrics\npredictions = ccs_df_test[\"LSTM_predictions\"]\nmare = round(sum((abs(predictions-ccs_df_test.loc[:,\"CCS\"])/ccs_df_test.loc[:,\"CCS\"])*100)/len(predictions),3)\npcc = round(pearsonr(predictions,ccs_df_test.loc[:,\"CCS\"])[0],3)\nperc_95 = round(np.percentile((abs(predictions-ccs_df_test.loc[:,\"CCS\"])/ccs_df_test.loc[:,\"CCS\"])*100,95)*2,2)\n\nplt.title(f\"LSTM - PCC: {pcc} - MARE: {mare}% - 95th percentile: {perc_95}%\")\n\nax = plt.gca()\nax.set_aspect('equal')\n\nplt.xlabel(\"Observed CCS (^2)\")\nplt.ylabel(\"Predicted CCS (^2)\")\n\nplt.show()\n\n\n\n\nIt is clear that the performance of this model is much better. But… Performance can be improved a lot more by for example tuning hyperparameters like the network architecture or number of epochs.\nHope you enjoyed this tutorial! Feel free to edit it and make a pull request!"
  },
  {
    "objectID": "tutorials/detectability/index.html",
    "href": "tutorials/detectability/index.html",
    "title": "Detectability",
    "section": "",
    "text": "Title\n\n\nAuthor\n\n\nDate\n\n\n\n\n\n\nModelling protein detectability with an MLP\n\n\nEric Deutsch\n\n\nSep 21, 2022\n\n\n\n\n\n\nNo matching items"
  },
  {
    "objectID": "tutorials/detectability/modeling-protein-detectability.html",
    "href": "tutorials/detectability/modeling-protein-detectability.html",
    "title": "Modelling protein detectability with an MLP",
    "section": "",
    "text": "Introduction\nWhen subjecting whole cell lysates to mass spectrometry-based proteomics analysis, some proteins are easily detected while others are not seen. The proteins that are never detected are often colloquially called the dark proteome. There are many reasons for not detecting proteins. Some proteins may only be found in certain cell types or in certain developmental stages. Comprehensive accumulation of datasets from different cell types and developmental stages can overcome this limitation. Other reasons such as the physicochemical properties of the proteins may hinder detection. Here we explore the “light and dark proteome” based on proteins that are observed and not observed in the Arabidopsis PeptideAtlas, which has been assembled by search over 200 million MS/MS spectra from 100 different datasets.\nFirst we import some needed libraries\n\n! pip install numpy~=1.21 pandas~=1.3 matplotlib~=3.5 scikit-learn~=1.0 --quiet\n\n\nimport numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\nfrom sklearn.neural_network import MLPClassifier\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.metrics import roc_curve, auc\n\nRead input data and extract the columns to train on. We will attempt to train on the protein molecular weight, protein physicochemical properties gravy score (a measure of hydrophobicity), isoelectric point (pI), and then two metrics from RNA-seq analysis: the percentage of RNA-seq experiments that detect a transcript for the given protein, and the highest TPM (transcripts per million, i.e. abundance) in any one dataset.\n\nproteins = pd.read_csv('http://www.peptideatlas.org/builds/arabidopsis/light_and_dark_protein_list.tsv', sep=\"\\t\")\nlearning_values = proteins[ ['molecular_weight', 'gravy', 'pI', 'rna_detected_percent', 'highest_tpm' ] ].copy()\n\nNormalize the data to have ranges like 0 to 1\n\nlearning_values.loc[ :, 'molecular_weight'] = learning_values['molecular_weight'] / 100\nlearning_values.loc[ learning_values[ 'molecular_weight'] &gt; 1, 'molecular_weight'] = 1.0\nlearning_values.loc[ :, 'gravy'] = ( learning_values['gravy'] + 2 ) / 4\nlearning_values.loc[ :, 'pI'] = ( learning_values['pI'] - 4 ) / 8\n\nlearning_values.loc[ :, 'rna_detected_percent'] = learning_values['rna_detected_percent'] / 100\nlearning_values.loc[ :, 'highest_tpm'] = learning_values['highest_tpm'] / 300\nlearning_values.loc[ learning_values[ 'highest_tpm'] &gt; 1, 'highest_tpm'] = 1.0\n\nSet the classifications to 0 and 1\n\nclasses = proteins['status'].copy()\nclasses[ classes == 'canonical' ] = 1\nclasses[ classes == 'not observed' ] = 0\n\nSplit into 75% train and 25% test\n\nX_train, X_test, y_train, y_test = train_test_split(learning_values, classes, test_size=0.25)\n\nTrain the classifier on the training set\n\nclf = MLPClassifier(solver='lbfgs', max_iter=1000, hidden_layer_sizes=(100,), alpha=1e-4, random_state=1)\nclf.fit(X_train, list(y_train))\n\nC:\\Program Files\\Python310\\lib\\site-packages\\sklearn\\neural_network\\_multilayer_perceptron.py:549: ConvergenceWarning: lbfgs failed to converge (status=1):\nSTOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n\nIncrease the number of iterations (max_iter) or scale the data as shown in:\n    https://scikit-learn.org/stable/modules/preprocessing.html\n  self.n_iter_ = _check_optimize_result(\"lbfgs\", opt_res, self.max_iter)\n\n\nMLPClassifier(max_iter=1000, random_state=1, solver='lbfgs')\n\n\nPredict for all the test set\n\npredictions = clf.predict(X_test)\nprobabilities = clf.predict_proba(X_test)\n\nMake a ROC curve\n\nprobabilities_list = list(probabilities[ :, 1])\nfpr, tpr, thresholds = roc_curve(np.ravel(list(y_test)), np.ravel(probabilities_list))\nroc_auc = auc(fpr, tpr)\nplt.figure()\nplt.plot(fpr,tpr,color=\"darkorange\",lw=2,label=\"ROC curve (area = %0.2f)\" % roc_auc)\nplt.plot([0, 1], [0, 1], color=\"navy\", lw=2, linestyle=\"--\")\nplt.xlim([0.0, 1.0])\nplt.ylim([0.0, 1.05])\nplt.xlabel(\"False positive rate\")\nplt.ylabel(\"True positive rate\")\nplt.title(\"ROC plot for canonical predictions\")\nplt.legend(loc=\"lower right\")\nplt.grid(True)\nplt.show()\n\n\n\n\nPredict for all protein and write out the table with learned results\n\nprobabilities = clf.predict_proba(learning_values)\nproteins['learned_canonical_prob'] = probabilities[ :, 1]\nproteins.to_csv('light_and_dark_protein_list_trained.tsv', sep=\"\\t\", index=False)"
  },
  {
    "objectID": "publication.html",
    "href": "publication.html",
    "title": "ProteomicsML: An Online Platform for Community-Curated Data Sets and Tutorials for Machine Learning in Proteomics",
    "section": "",
    "text": "Published in the Third Special Issue on Software Tools and Resources of Journal of Proteome Research.\n\nProteomicsML: An Online Platform for Community-Curated Data Sets and Tutorials for Machine Learning in Proteomics. Tobias G. Rehfeldt*, Ralf Gabriels*, Robbin Bouwmeester*, Siegfried Gessulat, Benjamin A. Neely, Magnus Palmblad, Yasset Perez-Riverol, Tobias Schmidt, Juan Antonio Vizcaı́no§, and Eric W. Deutsch§. J. Proteome Res. 2023, 22, 2, 632–636. doi:10.1021/acs.jproteome.2c00629.\n\n\nIntroduction\nComputational predictions of analyte behavior in the context of mass spectrometry (MS) data have been explored for nearly five decades, with early rudimentary predictions dating back to 1983. (Heijne 1983) With the rise of technology and computational power, machine learning (ML) approaches were introduced into the field of proteomics in 1998 (Nielsen, Brunak, and Heijne 1999) and ML-based models quickly overtook human accuracy. Since then, dozens of articles have described efforts to train models for a multitude of physicochemical properties associated with the field of high-throughput proteomics, as reviewed by Neely et al. (Neely et al. 2023) Some of the most-commonly studied properties are retention time and fragmentation spectrum intensities, while a large range of lesser explored properties exists as well. For an exhaustive review of the current undertakings, see Wen et al. and Bouwmeester et al. (Wen et al. 2020; Bouwmeester et al. 2020) While many of these efforts are still in the realm of basic exploratory research, ML approaches are increasingly being incorporated into mainstream tools and standalone predictive resources. (Wen et al. 2020; Gessulat et al. 2019; Bouwmeester et al. 2021; Meyer 2021)\nWhen training any ML model, it is crucial to obtain suitable training and evaluation data sets. Likewise, in many fields of research where ML is applied, it is common to have a range of educational data sets, such as the MNIST (Modified National Institute of Standards and Technology) (Deng 2012) or IRIS data sets, allowing newcomers to the field to easily learn common ML methodologies. Likewise, state-of-the-art models can use benchmark data sets such as ImageNet or those available on the UCI Machine Learning Repository to compare their predictive capabilities. Similar to the utility of benchmark data sets, such as the number of survivors on the Titanic, which has been modeled more than 54 000 times (kaggle.com/competitions/titanic), we seek to define proteomics data sets that can provide an entry point for ML modeling.\nAlthough there have been numerous efforts to explore the predictive capabilities of models, there are barriers that limit widespread adoption in the field of predictive proteomics. First, there are considerable difficulties in accessing data sets in a suitable form for ML applications. A substantial effort is required to prepare raw proteomics data sets into a format usable for ML, as this demands extensive knowledge of the multitude of proteomics file formats and postprocessing methods. MS data also has a tendency to be fraught with missing metadata, making it challenging to compare across data sets. Furthermore, most ML frameworks in proteomics implement dedicated postprocessing pipelines to prepare the files for ML algorithms. Recently, tools such as ppx (Fondrie, Bittremieux, and Noble 2021) and MS2AI (Rehfeldt et al. 2021) were created to facilitate this process, but they are still limited to certain use cases due to the complex nature of liquid chromatography coupled to mass spectrometry (LC-MS) data.\nSecond, while some ML-ready data sets are available on platforms such as Kaggle (Kaggle.com, n.d.) or in supplementary tables of publications, they are often difficult to find and lack long-term maintenance and support postpublication. While there is no formal consensus in the field, there are certain data sets that are often used for training such as ProteomeTools. (Zolg et al. 2017) Nevertheless, there are no widely used data sets used to compare the performance of tools developed by different researchers, making it difficult for new algorithms to be evaluated and compared to older tools. This issue is only further exacerbated by individual groups relying on different pre- and postprocessing protocols, such as differences in normalization of measurements or in the implementation of model performance metrics.\nAs an outcome of the 2022 Lorentz Center Workshop on Proteomics and Machine Learning (Leiden, The Netherlands, March 2022), we have created a web platform to facilitate the application of ML approaches to the field of MS-based proteomics. The resource is intended to provide a central focal point for curating and disseminating data sets that are ready to use for ML research, and to encourage new entrants into the field through expert-driven tutorials. Here we describe how ProteomicsML has been developed using commonly available tools and designed for future ease of maintenance. We provide a brief overview of the data sets that are currently available at ProteomicsML and how it can be expanded in the future with more data. We also describe the initial set of tutorials that can be used as an introduction to the field of ML in proteomics.\n\n\nThe ProteomicsML Platform\nThe primary entry point for the resource is the ProteomicsML web site (www.proteomicsml.org). It contains general introductory data sets that are already preprocessed and ready for training or evaluation, and contains educational resources in the form of tutorials for those new to ML in proteomics. The code base for the Web site is maintained via a GitHub repository, and is therefore easy to maintain and amenable to outside contributions from the community. On the GitHub repository, researchers can open pull requests (proposals for adding or changing information) for new data sets or tutorials. These pull requests are then reviewed by the maintainers, currently the authors of this paper, in line with the guidelines in the contributing section of the ProteomicsML Web site. Data sets and tutorials hosted as part of the GitHub repository fall under the CC BY 4.0 license, as indicated on both the repository and the Web site. The PRIDE database infrastructure (Perez-Riverol et al. 2022) is also used to store larger data sets on an FTP server dedicated to ProteomicsML.\nA key goal of ProteomicsML is to advance with the field, which is why we provide a platform with detailed documentation, including a contributing guide on how to upload data sets and tutorials for specific ML workflows or algorithms. After curation by the maintainers, the contributions have to pass a build test in order to maintain integrity of the platform, and, if passed, are automatically published on the Web site and are freely accessible to other researchers.\nFor many LC-MS properties, such as retention time and fragmentation intensity, well-performing ML models have already been published. We aim to provide suitable data sets and tutorials to easily reproduce these results in an educational fashion. All data sets on the platform are organized by data type, and should ideally be provided in a simple data format that is suitable for direct import into ML toolkits. Each data type can contain one or more data sets for different purposes, and each data set should be sufficiently annotated with metadata (e.g., its origin, how it was processed, and the relevant literature citations). Along with well-annotated data sets, the platform provides users with in-depth tutorials on how to download, import, handle, and train various ML models. Many of the LC-MS data types require certain, sometimes complex, preprocessing steps in order to be fully compatible with ML frameworks. For this reason, we believe it is crucial to provide guidelines on these processes to ultimately lower the entry barriers for new users to the field. Tutorials on ProteomicsML can be attribute- or data set-specific, allowing new tutorial submissions to focus on either the direct interactions with specific ML models or methodologies, or on a certain aspect of data preprocessing.\nOften when new modeling approaches are published, they are accompanied by data sets with novel pre- and postprocessing steps. Using ProteomicsML, the new data can be uploaded to the site along with a unified metadata entry and an accompanying tutorial that improves reproducibility of the work and facilitates benchmarking by the community.\n\n\nData Sets and Tutorials\nThe original raw data for proteomics data sets currently included in ProteomicsML have already been made publicly available through ProteomeXchange, (Deutsch et al. 2020) mostly via the PRIDE database. (Perez-Riverol et al. 2022) Here, the data hosted at ProteomicsML are provided in an ML-ready format, with links to original metadata and raw files for full provenance. Even though the data sets at ProteomicsML do not contain raw files, we do provide users with extensive tutorials on how to process raw data into ML-ready formats. ProteomicsML currently contains data sets and tutorials for fragmentation intensity, ion mobility (IM), retention time, and protein detectability. More data types can easily be added in the future, as the platform evolves along with the field.\n\nRetention time. Due to retention time playing a major role in modern peptide identification workflows, it is one of the most explored properties in predictive proteomics. (Wen et al. 2020) While some data sets for predicting retention time already exists, such as the publicly available data set from Kaggle kaggle.com/datasets/kirillpe/proteomics-retention-time-prediction and the DLOmix data sets, we have also compiled new multitiered ML-ready data sets from the ProteomeTools synthetic peptide library, (Zolg et al. 2017) in three specific sizes: 100 000 data points (small), well suited for new practitioners; (ii) 250 000 data points (medium), and (iii) 1 million data points (large), well suited for larger-scale ML training or benchmarking. As amino acid modifications can complicate the application of ML in proteomics, these three tiers do not contain any modified peptides except for carbamidomethylation of cysteine. Nevertheless, to train models for more real-life applications, we have also included an additional data set tier containing 200 000 oxidized peptides, as well as a mixed data set containing 200 000 oxidized and 200 000 unmodified peptides. These data sets require minimal data preparation, although we still provide two distinct tutorials on methods to incorporate these data sets into deep learning (DL)-based models. In addition to preprocessed data, we also provide a detailed tutorial that combines and aligns retention times between runs from MaxQuant evidence files. (Tyanova, Temu, and Cox 2016) The output of this tutorial is a fully ML-ready file for retention time prediction.\nFragmentation intensity. While it is easy to calculate the m/z values of theoretical peptide spectra, fragment ion peak intensities follow complex patterns that can be hard to predict. Nevertheless, these intensities can play a key role in accurate peptide identification. (C Silva et al. 2019) For this reason, fragment ion intensity prediction is likely the second most explored topic for prediction purposes, for which comprehensive data sets and tutorials exist within ProteomicsML. As there are many attributes of peptides that affect their fragmentation patterns, the preprocessing steps of fragmentation data are more complex, and can be substantially different from lab to lab. For this reason, we have composed two separate tutorials, one that mimics the Prosit (Gessulat et al. 2019) data processing approach on the ProteomeTools (Zolg et al. 2017) data sets, which consists of 745 000 annotated spectra, and one that mimics the MS2PIP data process on a consensus human spectral library from the National Institute of Standards and Technology, which consists of 270 440 annotated spectra. (Gabriels, Martens, and Degroeve 2019) For data sets in this category it is difficult to provide a simple format with unified columns, as the handling and preprocessing steps differ significantly from model to model. Currently, there is one tutorial available on ProteomicsML describing the data processing pipeline from raw file to Prosit-style annotation, and we believe that with future additions we can provide users with tutorials for additional processing approaches.\nIon mobility. Ion mobility is a technique to separate ionized analytes based on their size, shape, and physicochemical properties. (Dodds and Baker 2019) Techniques for ion mobility are generally based on propelling or trapping ions with an electric field in an ion mobility cell. Peptides are then separated by colliding them with an inert gas without fragmentation. Indeed, peptides with a larger area to collide will be more affected by the collisions, resulting in a higher measured collisional cross section (CCS). Historically, most methods predicting ion mobility were based on molecular dynamics models that calculate the CCS from first-principles in physics. (Larriba-Andaluz and Prell 2020) Lately the field has generated multiple ML and DL approaches for both peptide and metabolite CCS prediction. (Zhou, Xiong, and Zhu 2017; Broeckling et al. 2021; Meier et al. 2021) The tutorials made available in ProteomicsML use both trapping (trapped ion mobility, (Michelmann et al. 2015) TIMS) and propelling ion mobility (traveling wave ion mobility, (Shvartsburg and Smith 2008) TWIMS) data, where the large TIMS data set was sourced from Meier et al. (Meier et al. 2021) (718 917 data points) and the TWIMS data was sourced from Puyvelde et al. (Van Puyvelde et al. 2022) (6268 data points). The tutorial is a walkthrough for training various model types, ranging from simple linear models to more complex nonlinear models (e.g., DL-based networks) showing advantages and disadvantages of various learning algorithms for CCS prediction.\nProtein detectability. Modern proteomics methods and instrumentation are now routinely detecting and quantifying the majority of proteins thought to be encoded by the genome of a given species. (Hebert et al. 2014) Yet even after gathering enormous amounts of data, there is always a subset of proteins that remains refractory to detection. For example, even though tremendous effort has been focused on the human proteome, the fraction of unobserved proteins has been pushed just below 10%. (Adhikari et al. 2020; Omenn et al. 2021) It remains unclear why certain proteins remain undetected, although ML has been applied to explore which properties most strongly influence detectability (as reviewed within). (Dincer et al. 2022) One can compute a set of properties for a proteome and then train a model using those properties based on real world observations of the proteins that are detected and the proteins that are not detected. The model can be trained to learn which properties separate the detected from the undetected. Such a model has further utility to highlight proteins with properties that should sort them into the detected group, yet are not, as well as proteins that should belong to the undetected group, and yet they are detected. To facilitate this we have included the Arabidopsis PeptideAtlas data set, which is based on an extensive study of a single proteome. (Wijk et al. 2021) This data set is based on the 2021 build, which has 52 data sets reprocessed to yield 40 million peptide-spectrum matches and a good overall coverage of the Arabidopsis thaliana proteome. Proteins in the data set are categorized as either “canonical”, having the strongest evidence of detection, or “not observed”, for which no peptides are identified. Along with these class labels, the data set contains various protein properties such as molecular weight, hydrophobicity, and isoelectric point, which could be crucial for classification purposes. The data set has an accompanying tutorial that illustrates how to analyze the data with a classification model for the observability of peptides.\n\nOverall, these initial data set submissions and tutorials leave room for future expansion, until the community resource contains data sets for all properties previously and currently being explored in the field of proteomics. It is also open for user submissions, allowing researchers to upload their data in a standardized fashion, along with in-depth tutorials on their data handling and ML methodologies, resulting in more reproducible science. Our expectation is that this will shape the future of predictive proteomics, in favor of being more accessible, standardized, and reproducible.\nAdditionally, we have compiled a list of proteomics publications that utilize ML, along with a list of ProteomeXchange data sets used by each of the publications (Supplementary Table 1). Each of these ProteomeXchange data sets have been given a set of tags to indicate the nature of the usage in the publications (e.g., benchmarking, retention time, deep learning, etc.) as shown in Supplementary Table 2. Furthermore, these tags have also been added to the respective PRIDE data sets, which allows the tags to be easily searched, and for users to compile their ideal data set, if ProteomicsML does not already contain one.\n\n\nConclusion\nWe have presented ProteomicsML, a comprehensive resource of data sets and tutorials for every ML practitioner in the field of MS-based proteomics. ProteomicsML contains multiple data sets on a range of LC-MS peptide properties, allowing computational proteomics researchers to compare new algorithms to state-of-the-art models, as well as providing newcomers to the field with an accessible starting point, without requiring immediate in-depth knowledge of the entire proteomics analysis pipeline. We believe that this resource will aid the next generation of ML practitioners, and provide a stepping stone for more open and more reproducible science in the field.\n\n\nSupporting Information\nThe Supporting Information is available free of charge at pubs.acs.org/doi/10.1021/acs.jproteome.2c00629.\n\nSupplementary Table 1: Proteomics ML publications along with links to the ProteomeXchange data sets used for training or testing (XLSX)\nSupplementary Table 2: Public ProteomeXchange data sets that have been used for ML training or benchmarking (XLSX)\n\n\n\nNotes\nThe authors declare the following competing financial interest(s): Tobias Schmidt and Siegfried Gessulat are employees of MSAID. MSAID makes ML-based software modules that are sold as part of Proteome Discoverer and also offers contract research. All other authors declare no competing financial interest.\nIdentification of certain commercial equipment, instruments, software, or materials does not imply recommendation or endorsement by the National Institute of Standards and Technology (NIST), nor does it imply that the products identified are necessarily the best available for the proposed purpose.\n\n\nAcknowledgments\nWe thank Wassim Gabriel and Mathias Wilhelm for consultations on the Prosit annotation pipeline. The 2022 Lorentz Center workshop on Proteomics and Machine Learning was funded by the Dutch Research Council (NWO) with generous support from the Leiden University Medical Center, Thermo Fisher Scientific and Journal of Proteome Research (ACS). We also thank the staff at the Lorentz Center for helping make the hybrid workshop a success in pandemic times. T.G.R. acknowledges funding from the Velux Foundation [00028116]. R.G. acknowledges funding from the Research Foundation Flanders (FWO) [12B7123N]. R.B. acknowledges funding from the Vlaams Agentschap Innoveren en Ondernemen [HBC.2020.2205]. J.A.V. acknowledges funding from EMBL core funding, Wellcome [grant 223745/Z/21/Z], EU H2020 [823839], and BBSRC [BB/S01781X/1; BB/V018779/1]. E.W.D. acknowledges funding from the National Institutes of Health [R01 GM087221; R24 GM127667; U19 AG023122], and from the National Science Foundation [DBI-1933311; IOS-1922871].\n\n\n\n\n\nReferences\n\nAdhikari, Subash, Edouard C Nice, Eric W Deutsch, Lydie Lane, Gilbert S Omenn, Stephen R Pennington, Young-Ki Paik, et al. 2020. “A High-Stringency Blueprint of the Human Proteome.” Nat. Commun. 11 (1): 5301. https://doi.org/10.1038/s41467-020-19045-9.\n\n\nBouwmeester, Robbin, Ralf Gabriels, Niels Hulstaert, Lennart Martens, and Sven Degroeve. 2021. “DeepLC Can Predict Retention Times for Peptides That Carry as-yet Unseen Modifications.” Nat. Methods 18 (11): 1363–69. https://doi.org/10.1038/s41592-021-01301-5.\n\n\nBouwmeester, Robbin, Ralf Gabriels, Tim Van Den Bossche, Lennart Martens, and Sven Degroeve. 2020. “The Age of Data-Driven Proteomics: How Machine Learning Enables Novel Workflows.” PROTEOMICS 20 (21-22): 1900351. https://doi.org/https://doi.org/10.1002/pmic.201900351.\n\n\nBroeckling, Corey D, Linxing Yao, Giorgis Isaac, Marisa Gioioso, Valentin Ianchis, and Johannes P C Vissers. 2021. “Application of Predicted Collisional Cross Section to Metabolome Databases to Probabilistically Describe the Current and Future Ion Mobility Mass Spectrometry.” J. Am. Soc. Mass Spectrom. 32 (3): 661–69. https://doi.org/10.1021/jasms.0c00375.\n\n\nC Silva, Ana S, Robbin Bouwmeester, Lennart Martens, and Sven Degroeve. 2019. “Accurate Peptide Fragmentation Predictions Allow Data Driven Approaches to Replace and Improve Upon Proteomics Search Engine Scoring Functions.” Bioinformatics 35 (24): 5243–48. https://doi.org/10.1093/bioinformatics/btz383.\n\n\nDeng, Li. 2012. “The MNIST Database of Handwritten Digit Images for Machine Learning Research [Best of the Web].” IEEE Signal Processing Magazine 29 (6): 141–42. https://doi.org/10.1109/MSP.2012.2211477.\n\n\nDeutsch, Eric W, Nuno Bandeira, Vagisha Sharma, Yasset Perez-Riverol, Jeremy J Carver, Deepti J Kundu, David Garcı́a-Seisdedos, et al. 2020. “The ProteomeXchange Consortium in 2020: Enabling ’Big Data’ Approaches in Proteomics.” Nucleic Acids Res. 48 (D1): D1145–52. https://doi.org/10.1093/nar/gkz984.\n\n\nDincer, Ayse B., Yang Lu, Devin K. Schweppe, Sewoong Oh, and William Stafford Noble. 2022. “Reducing Peptide Sequence Bias in Quantitative Mass Spectrometry Data with Machine Learning.” Journal of Proteome Research 21 (7): 1771–82. https://doi.org/10.1021/acs.jproteome.2c00211.\n\n\nDodds, James N, and Erin S Baker. 2019. “Ion Mobility Spectrometry: Fundamental Concepts, Instrumentation, Applications, and the Road Ahead.” J. Am. Soc. Mass Spectrom. 30 (11): 2185–95. https://doi.org/10.1007/s13361-019-02288-2.\n\n\nFondrie, William E, Wout Bittremieux, and William S Noble. 2021. “ppx: Programmatic Access to Proteomics Data Repositories.” J. Proteome Res. 20 (9): 4621–24. https://doi.org/10.1021/acs.jproteome.1c00454.\n\n\nGabriels, Ralf, Lennart Martens, and Sven Degroeve. 2019. “Updated MS²PIP Web Server Delivers Fast and Accurate MS² Peak Intensity Prediction for Multiple Fragmentation Methods, Instruments and Labeling Techniques.” Nucleic Acids Res. 47 (W1): W295–99. https://doi.org/10.1093/nar/gkz299.\n\n\nGessulat, Siegfried, Tobias Schmidt, Daniel Paul Zolg, Patroklos Samaras, Karsten Schnatbaum, Johannes Zerweck, Tobias Knaute, et al. 2019. “Prosit: Proteome-Wide Prediction of Peptide Tandem Mass Spectra by Deep Learning.” Nat. Methods 16 (6): 509–18. https://doi.org/10.1038/s41592-019-0426-7.\n\n\nHebert, Alexander S, Alicia L Richards, Derek J Bailey, Arne Ulbrich, Emma E Coughlin, Michael S Westphall, and Joshua J Coon. 2014. “The One Hour Yeast Proteome.” Mol. Cell. Proteomics 13 (1): 339–47. https://doi.org/10.1074/mcp.M113.034769.\n\n\nHeijne, G von. 1983. “Patterns of Amino Acids Near Signal-Sequence Cleavage Sites.” Eur. J. Biochem. 133 (1): 17–21. https://doi.org/10.1111/j.1432-1033.1983.tb07424.x.\n\n\nKaggle.com. n.d. Kaggle. https://www.kaggle.com/datasets?search=proteomics.\n\n\nLarriba-Andaluz, Carlos, and James S Prell. 2020. “Fundamentals of Ion Mobility in the Free Molecular Regime. Interlacing the Past, Present and Future of Ion Mobility Calculations.” Int. Rev. Phys. Chem. 39 (4): 569–623. https://doi.org/10.1080/0144235X.2020.1826708.\n\n\nMeier, Florian, Niklas D Köhler, Andreas-David Brunner, Jean-Marc H Wanka, Eugenia Voytik, Maximilian T Strauss, Fabian J Theis, and Matthias Mann. 2021. “Deep Learning the Collisional Cross Sections of the Peptide Universe from a Million Experimental Values.” Nat. Commun. 12 (1): 1185. https://doi.org/10.1038/s41467-021-21352-8.\n\n\nMeyer, Jesse G. 2021. “Deep Learning Neural Network Tools for Proteomics.” Cell Rep Methods 1 (2): 100003. https://doi.org/10.1016/j.crmeth.2021.100003.\n\n\nMichelmann, Karsten, Joshua A Silveira, Mark E Ridgeway, and Melvin A Park. 2015. “Fundamentals of Trapped Ion Mobility Spectrometry.” J. Am. Soc. Mass Spectrom. 26 (1): 14–24. https://doi.org/10.1007/s13361-014-0999-4.\n\n\nNeely, Benjamin A., Viktoria Dorfer, Lennart Martens, Isabell Bludau, Robbin Bouwmeester, Sven Degroeve, Eric W. Deutsch, et al. 2023. “Toward an Integrated Machine Learning Model of a Proteomics Experiment.” Journal of Proteome Research 22 (3): 681–96. https://doi.org/10.1021/acs.jproteome.2c00711.\n\n\nNielsen, H, S Brunak, and G von Heijne. 1999. “Machine Learning Approaches for the Prediction of Signal Peptides and Other Protein Sorting Signals.” Protein Eng. 12 (1): 3–9. https://doi.org/10.1093/protein/12.1.3.\n\n\nOmenn, Gilbert S, Lydie Lane, Christopher M Overall, Young-Ki Paik, Ileana M Cristea, Fernando J Corrales, Cecilia Lindskog, et al. 2021. “Progress Identifying and Analyzing the Human Proteome: 2021 Metrics from the HUPO Human Proteome Project.” J. Proteome Res. 20 (12): 5227–40. https://doi.org/10.1021/acs.jproteome.1c00590.\n\n\nPerez-Riverol, Yasset, Jingwen Bai, Chakradhar Bandla, David Garcı́a-Seisdedos, Suresh Hewapathirana, Selvakumar Kamatchinathan, Deepti J Kundu, et al. 2022. “The PRIDE Database Resources in 2022: A Hub for Mass Spectrometry-Based Proteomics Evidences.” Nucleic Acids Res. 50 (D1): D543–52. https://doi.org/10.1093/nar/gkab1038.\n\n\nRehfeldt, Tobias Greisager, Konrad Krawczyk, Mathias Bøgebjerg, Veit Schwämmle, and Richard Röttger. 2021. “MS2AI: Automated Repurposing of Public Peptide LC-MS Data for Machine Learning Applications.” Bioinformatics, October. https://doi.org/10.1021/acs.analchem.9b01262.\n\n\nShvartsburg, Alexandre A, and Richard D Smith. 2008. “Fundamentals of Traveling Wave Ion Mobility Spectrometry.” Anal. Chem. 80 (24): 9689–99. https://doi.org/10.1021/ac8016295.\n\n\nTyanova, Stefka, Tikira Temu, and Juergen Cox. 2016. “The MaxQuant Computational Platform for Mass Spectrometry-Based Shotgun Proteomics.” Nature Protocols 11 (12): 2301–19. https://doi.org/10.1038/nprot.2016.136.\n\n\nVan Puyvelde, Bart, Simon Daled, Sander Willems, Ralf Gabriels, Anne Gonzalez de Peredo, Karima Chaoui, Emmanuelle Mouton-Barbosa, et al. 2022. “A Comprehensive LFQ Benchmark Dataset on Modern Day Acquisition Strategies in Proteomics.” Sci Data 9 (1): 126. https://doi.org/10.1038/s41597-022-01216-6.\n\n\nWen, Bo, Wen-Feng Zeng, Yuxing Liao, Zhiao Shi, Sara R Savage, Wen Jiang, and Bing Zhang. 2020. “Deep Learning in Proteomics.” Proteomics 20 (21-22). https://doi.org/10.1002/pmic.201900335.\n\n\nWijk, Klaas J van, Tami Leppert, Qi Sun, Sascha S Boguraev, Zhi Sun, Luis Mendoza, and Eric W Deutsch. 2021. “The Arabidopsis PeptideAtlas: Harnessing Worldwide Proteomics Data to Create a Comprehensive Community Proteomics Resource.” Plant Cell 33 (11): 3421–53. https://doi.org/10.1093/plcell/koab211.\n\n\nZhou, Zhiwei, Xin Xiong, and Zheng-Jiang Zhu. 2017. “MetCCS Predictor: A Web Server for Predicting Collision Cross-Section Values of Metabolites in Ion Mobility-Mass Spectrometry Based Metabolomics.” Bioinformatics 33 (14): 2235–37. https://doi.org/10.1093/bioinformatics/btx140.\n\n\nZolg, Daniel P, Mathias Wilhelm, Karsten Schnatbaum, Johannes Zerweck, Tobias Knaute, Bernard Delanghe, Derek J Bailey, et al. 2017. “Building ProteomeTools Based on a Complete Synthetic Human Proteome.” Nat. Methods 14 (3): 259–62. https://doi.org/10.1038/nmeth.4153."
  },
  {
    "objectID": "code-of-conduct.html",
    "href": "code-of-conduct.html",
    "title": "Contributor Covenant Code of Conduct",
    "section": "",
    "text": "We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.\nWe pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community.\n\n\n\nExamples of behavior that contributes to a positive environment for our community include:\n\nDemonstrating empathy and kindness toward other people\nBeing respectful of differing opinions, viewpoints, and experiences\nGiving and gracefully accepting constructive feedback\nAccepting responsibility and apologizing to those affected by our mistakes, and learning from the experience\nFocusing on what is best not just for us as individuals, but for the overall community\n\nExamples of unacceptable behavior include:\n\nThe use of sexualized language or imagery, and sexual attention or advances of any kind\nTrolling, insulting or derogatory comments, and personal or political attacks\nPublic or private harassment\nPublishing others’ private information, such as a physical or email address, without their explicit permission\nOther conduct which could reasonably be considered inappropriate in a professional setting\n\n\n\n\nCommunity leaders are responsible for clarifying and enforcing our standards of acceptable behavior and will take appropriate and fair corrective action in response to any behavior that they deem inappropriate, threatening, offensive, or harmful.\nCommunity leaders have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, and will communicate reasons for moderation decisions when appropriate.\n\n\n\nThis Code of Conduct applies within all community spaces, and also applies when an individual is officially representing the community in public spaces. Examples of representing our community include using an official e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event.\n\n\n\nInstances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community leaders responsible for enforcement at eric.deutsch@isbscience.org. All complaints will be reviewed and investigated promptly and fairly.\nAll community leaders are obligated to respect the privacy and security of the reporter of any incident.\n\n\n\nCommunity leaders will follow these Community Impact Guidelines in determining the consequences for any action they deem in violation of this Code of Conduct:\n\n\nCommunity Impact: Use of inappropriate language or other behavior deemed unprofessional or unwelcome in the community.\nConsequence: A private, written warning from community leaders, providing clarity around the nature of the violation and an explanation of why the behavior was inappropriate. A public apology may be requested.\n\n\n\nCommunity Impact: A violation through a single incident or series of actions.\nConsequence: A warning with consequences for continued behavior. No interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, for a specified period of time. This includes avoiding interactions in community spaces as well as external channels like social media. Violating these terms may lead to a temporary or permanent ban.\n\n\n\nCommunity Impact: A serious violation of community standards, including sustained inappropriate behavior.\nConsequence: A temporary ban from any sort of interaction or public communication with the community for a specified period of time. No public or private interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, is allowed during this period. Violating these terms may lead to a permanent ban.\n\n\n\nCommunity Impact: Demonstrating a pattern of violation of community standards, including sustained inappropriate behavior, harassment of an individual, or aggression toward or disparagement of classes of individuals.\nConsequence: A permanent ban from any sort of public interaction within the community.\n\n\n\n\nThis Code of Conduct is adapted from the Contributor Covenant, version 2.0, available at https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.\nCommunity Impact Guidelines were inspired by Mozilla’s code of conduct enforcement ladder.\nFor answers to common questions about this code of conduct, see the FAQ at https://www.contributor-covenant.org/faq. Translations are available at https://www.contributor-covenant.org/translations."
  },
  {
    "objectID": "code-of-conduct.html#our-pledge",
    "href": "code-of-conduct.html#our-pledge",
    "title": "Contributor Covenant Code of Conduct",
    "section": "",
    "text": "We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.\nWe pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community."
  },
  {
    "objectID": "code-of-conduct.html#our-standards",
    "href": "code-of-conduct.html#our-standards",
    "title": "Contributor Covenant Code of Conduct",
    "section": "",
    "text": "Examples of behavior that contributes to a positive environment for our community include:\n\nDemonstrating empathy and kindness toward other people\nBeing respectful of differing opinions, viewpoints, and experiences\nGiving and gracefully accepting constructive feedback\nAccepting responsibility and apologizing to those affected by our mistakes, and learning from the experience\nFocusing on what is best not just for us as individuals, but for the overall community\n\nExamples of unacceptable behavior include:\n\nThe use of sexualized language or imagery, and sexual attention or advances of any kind\nTrolling, insulting or derogatory comments, and personal or political attacks\nPublic or private harassment\nPublishing others’ private information, such as a physical or email address, without their explicit permission\nOther conduct which could reasonably be considered inappropriate in a professional setting"
  },
  {
    "objectID": "code-of-conduct.html#enforcement-responsibilities",
    "href": "code-of-conduct.html#enforcement-responsibilities",
    "title": "Contributor Covenant Code of Conduct",
    "section": "",
    "text": "Community leaders are responsible for clarifying and enforcing our standards of acceptable behavior and will take appropriate and fair corrective action in response to any behavior that they deem inappropriate, threatening, offensive, or harmful.\nCommunity leaders have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, and will communicate reasons for moderation decisions when appropriate."
  },
  {
    "objectID": "code-of-conduct.html#scope",
    "href": "code-of-conduct.html#scope",
    "title": "Contributor Covenant Code of Conduct",
    "section": "",
    "text": "This Code of Conduct applies within all community spaces, and also applies when an individual is officially representing the community in public spaces. Examples of representing our community include using an official e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event."
  },
  {
    "objectID": "code-of-conduct.html#enforcement",
    "href": "code-of-conduct.html#enforcement",
    "title": "Contributor Covenant Code of Conduct",
    "section": "",
    "text": "Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community leaders responsible for enforcement at eric.deutsch@isbscience.org. All complaints will be reviewed and investigated promptly and fairly.\nAll community leaders are obligated to respect the privacy and security of the reporter of any incident."
  },
  {
    "objectID": "code-of-conduct.html#enforcement-guidelines",
    "href": "code-of-conduct.html#enforcement-guidelines",
    "title": "Contributor Covenant Code of Conduct",
    "section": "",
    "text": "Community leaders will follow these Community Impact Guidelines in determining the consequences for any action they deem in violation of this Code of Conduct:\n\n\nCommunity Impact: Use of inappropriate language or other behavior deemed unprofessional or unwelcome in the community.\nConsequence: A private, written warning from community leaders, providing clarity around the nature of the violation and an explanation of why the behavior was inappropriate. A public apology may be requested.\n\n\n\nCommunity Impact: A violation through a single incident or series of actions.\nConsequence: A warning with consequences for continued behavior. No interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, for a specified period of time. This includes avoiding interactions in community spaces as well as external channels like social media. Violating these terms may lead to a temporary or permanent ban.\n\n\n\nCommunity Impact: A serious violation of community standards, including sustained inappropriate behavior.\nConsequence: A temporary ban from any sort of interaction or public communication with the community for a specified period of time. No public or private interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, is allowed during this period. Violating these terms may lead to a permanent ban.\n\n\n\nCommunity Impact: Demonstrating a pattern of violation of community standards, including sustained inappropriate behavior, harassment of an individual, or aggression toward or disparagement of classes of individuals.\nConsequence: A permanent ban from any sort of public interaction within the community."
  },
  {
    "objectID": "code-of-conduct.html#attribution",
    "href": "code-of-conduct.html#attribution",
    "title": "Contributor Covenant Code of Conduct",
    "section": "",
    "text": "This Code of Conduct is adapted from the Contributor Covenant, version 2.0, available at https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.\nCommunity Impact Guidelines were inspired by Mozilla’s code of conduct enforcement ladder.\nFor answers to common questions about this code of conduct, see the FAQ at https://www.contributor-covenant.org/faq. Translations are available at https://www.contributor-covenant.org/translations."
  },
  {
    "objectID": "contributing.html",
    "href": "contributing.html",
    "title": "Contributing",
    "section": "",
    "text": "This document describes how to contribute to the ProteomicsML resource by adding new or updating existing tutorials and/or datasets.\n\n\nAt ProteomicsML, we pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community. By interacting with or contributing to ProteomicsML at https://github.com/ProteomicsML or at https://proteomicsml.org, you agree to our Code of Conduct. Violation of our Code of Conduct may ultimately lead to a permanent ban from any sort of public interaction within the community. 🤝 Read the Code of Conduct\nIf you have an idea for a new tutorial or dataset, or found a mistake, you are welcome to communicate it with the community by opening a discussion thread in GitHub Discussions or by creating an issue. 💬 Start a discussion thread 💡 Open an issue\n\n\n\nProteomicsML uses the Quarto system to publish a static website from markdown and Jupyter IPython notebook files. All source files are maintained at ProteomicsML/ProteomicsML. Upon each commit on the main branch (after merging a pull request), the website is rebuilt on GitHub Actions and pushed to the ProteomicsML/proteomicsml.github.io repository, where it is hosted with GitHub Pages on the ProteomicsML.org website. See Website deployment for the full deployment workflow.\n\n\n\n\n\n\nFork ProteomicsML/ProteomicsML on GitHub to make your changes.\nClone your fork of the repository to your local machine.\nInstall Quarto to build the website on your machine.\nTo preview the website while editing, run: quarto preview . --render html\n\nMaintainers with write access to the repository can skip the first two steps and make a new local branch instead. Direct commits to the main branch are not allowed.\n\n\n\nProteomicsML tutorials are educational Jupyter notebooks that combine fully functional code cells and descriptive text cells. The end result should be a notebook that is easy to comprehend to anyone with a basic understanding of proteomics, programming, and machine learning. When adding or updating a tutorial, please follow these rules and conventions:\n\nTitle, filename, metadata, and subheadings\n\nTutorials are grouped by data type: Detectability, Fragmentation, Ion mobility, and Retention time. Place your tutorial notebook in the appropriate directory in the repository. E.g., tutorials/fragmentation. If your tutorial is part of a new data type group, please open a new discussion thread first.\nThe filename should be an abbreviated version of the tutorial title, formatted in kebab case (lowercase with - replacing spaces), for instance title-of-tutorial.ipynb.\nThe following front matter metadata items are required (see the Quarto Documentation for more info):\n\ntitle: A descriptive sentence-like title\nauthors: All authors that significantly contributed to the tutorial\ndate: Use last-modified to automatically render the correct date\n\n\n\n\n\n\n\nNote\n\n\n\nUnfortunately, YAML front matter is not rendered by Google Colab. Instead it is interpreted as plain markdown and the first cell of the notebook might look out of place when loaded into Google Colab. Nevertheless, the front matter results in a clean header on ProteomicsML.org, the primary platform for viewing tutorials.\n\n\nQuarto will render the title automatically from the metadata. Therefore, only subheadings should be included as markdown, starting at the second heading level (##).\nAdd an Open with Colab badge directly after the front matter metadata. The badge should be hyperlinked to open the notebook in Colab directly from GitHub. This can be achieved by replacing https://github.com/ with https://colab.research.google.com/github/ in the full URL to the file on GitHub. Additionally, in this URL the filename should be prefixed with an underscore (_); see point 2 in Website deployment for more info on notebook copies for Colab.\nFor example:\n[![](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ProteomicsML/ProteomicsML/blob/main/tutorials/fragmentation/_nist-1-parsing-spectral-library.ipynb)\nrenders as\n\n\n\n\n\n\n\n\n\n\n\nNote\n\n\n\nThe URL will not work (or be updated) until the pull request adding or updating the notebook is merged into the main branch.\n\n\n\nSubject and contents\n\nEach tutorial should clearly show and describe one or more steps in a certain machine learning workflow for proteomics.\nSufficiently describe each code cell and each step in the workflow.\nTutorials should ideally be linked to a single ProteomicsML dataset from the same group.\nWhile multiple tutorials can be added for a single data type, make sure that each tutorial is sufficiently different from the others in terms of methodology and/or datasets used.\nAll original publications that describe the methodologies, datasets, or tools that are used in the tutorial should be properly cited following scientific authoring conventions. To add a citation, add a bibtex entry to references.bib and use the Quarto citation tag. For example, [@ProteomicsML2022] renders to: (Rehfeldt et al. 2022). More info can be found in the Quarto documentation.\n\n\n\n\n\n\nTip\n\n\n\nUse doi2bib.org to easily get bibtex entries for any given publication.\n\n\n\nCode cells and programming language\n\nTutorials should work on all major platforms (Linux, Windows, macOS). An exception to this rule can be made if one or more tools central to the tutorial is not cross-platform.\nPer ProteomicsML convention, tutorials should use the Python programming language. Exceptions may be allowed if the other language is essential to the tutorial or methodology.\nProteomicsML recommends Google Colab to interactively use tutorial notebooks. Therefore, all code should be backwards compatible with the Python version used by Google Colab. At time of writing, this is Python 3.7.\nDependencies should ideally be installable with pip. A first code cell can be used to install all requirements using the Jupyter shell prefix !. For instance: ! pip install pandas.\nCode should be easy to read. For Python, follow the PEP8 style guide where possible.\nUpon pull request (PR) creation, all expected output cells should be present. When rendering the ProteomicsML website, notebooks are not rerun. Therefore, as a final step before submitting your PR, restart the kernel, run all cells from start to finish, and save the notebook. See point 2 in Website deployment for more info on notebook copies for Colab.\n\n\n\n\n\nProteomicsML datasets are community-curated proteomics datasets fit for machine learning. Ideally, each dataset is accompanied by a tutorial. When adding or updating a dataset, please follow these rules and conventions:\n\nDataset description and data files:\n\nEach dataset is represented as a single markdown file describing the dataset.\nThe data itself can be added in one of three ways:\n\nIf the dataset itself consists of one or more files, each smaller than 50 MB, they can be added in a subfolder with the same name as the markdown file. These files should be individually gzipped to save space and to prevent line-by-line tracking by Git.\n\n\n\n\n\n\nNote\n\n\n\nGzipped CSV files can very easily be read by Pandas into a DataFrame. Simply use the filename with the .gz suffix in the pandas.read_csv() function and Pandas will automatically unzip the file while reading.\n\n\nLarger files can be added to the ProteomicsML FTP file server by the project maintainers. Please request this in your pull request.\nFiles that are already publicly and persistently stored elsewhere, can be represented by solely the markdown file. In this case, all tutorials using this dataset should start from the file(s) as is and include any required preprocessing steps.[TODO: List supported platforms]\n\n\nTitle, filename, and metadata:\n\nDatasets are grouped by data type: Fragmentation, Ion mobility, Detectability, or Retention time. Place your dataset and markdown description in the appropriate directory in the repository. E.g., tutorials/fragmentation. If your dataset is part of a new data type group, please open a new discussion thread first.\nThe filename / directory name should be an abbreviated version of the dataset title, formatted in kebab case (lowercase with - replacing spaces), for instance title-of-dataset.md / title-of-dataset/.\nThe following front matter metadata items are required (see the Quarto Documentation for more info):\n\ntitle: A descriptive sentence-like title\ndate: Use last-modified to automatically render the correct date\n\nQuarto will render the title automatically from the metadata. Therefore, only subheadings should be included as markdown, starting at the second heading level (##).\n\nDataset description\nDownload the readme template, fill out the details, and add download links\n\nRetention time\nFragmentation Intensity\nIon Mobility\nDetectability\n\n\n\n\n\n\nCommit and push your changes to your fork.\nOpen a pull request with these changes. Choose the pull request template that fits your changes best.\nThe pull request should pass all the continuous integration tests which are automatically run by GitHub Actions.\nAll pull requests should be approved by at least two maintainers before they can be merged.\n\n\n\n\n\nIf you would like to become a maintainer and review pull requests by others, please start a discussion thread to let us know!\n\n\n\nWhen a pull request has been opened, the following GitHub Action is triggered: Test website rendering: The full website is rendered to check that no errors occur. This action should already have been run successfully for the pull request that implemented the changes. Nevertheless, merging could also introduce new issues.\nWhen a pull request has been marked “ready for review”, the following GitHub Action is triggered: Update notebook copies: A script is run to make copies of all tutorial notebooks with all output removed. The filenames of these copies are prepended with an underscore and should be used to open the notebooks interactively, e.g., in Google Colab. This script adds a commit to the pull request branch, which can only be accepted once this action has run successfully.\nWhen a pull request is merged with the main branch, the following GitHub Action is triggered: Publish website: Quarto is used to render the static website, which is then force-pushed to the ProteomicsML/proteomicsml.github.io repository. This repository is served on proteomicsml.org through GitHub Pages."
  },
  {
    "objectID": "contributing.html#before-you-begin",
    "href": "contributing.html#before-you-begin",
    "title": "Contributing",
    "section": "",
    "text": "At ProteomicsML, we pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community. By interacting with or contributing to ProteomicsML at https://github.com/ProteomicsML or at https://proteomicsml.org, you agree to our Code of Conduct. Violation of our Code of Conduct may ultimately lead to a permanent ban from any sort of public interaction within the community. 🤝 Read the Code of Conduct\nIf you have an idea for a new tutorial or dataset, or found a mistake, you are welcome to communicate it with the community by opening a discussion thread in GitHub Discussions or by creating an issue. 💬 Start a discussion thread 💡 Open an issue"
  },
  {
    "objectID": "contributing.html#the-proteomicsml-infrastructure",
    "href": "contributing.html#the-proteomicsml-infrastructure",
    "title": "Contributing",
    "section": "",
    "text": "ProteomicsML uses the Quarto system to publish a static website from markdown and Jupyter IPython notebook files. All source files are maintained at ProteomicsML/ProteomicsML. Upon each commit on the main branch (after merging a pull request), the website is rebuilt on GitHub Actions and pushed to the ProteomicsML/proteomicsml.github.io repository, where it is hosted with GitHub Pages on the ProteomicsML.org website. See Website deployment for the full deployment workflow."
  },
  {
    "objectID": "contributing.html#how-to-contribute",
    "href": "contributing.html#how-to-contribute",
    "title": "Contributing",
    "section": "",
    "text": "Fork ProteomicsML/ProteomicsML on GitHub to make your changes.\nClone your fork of the repository to your local machine.\nInstall Quarto to build the website on your machine.\nTo preview the website while editing, run: quarto preview . --render html\n\nMaintainers with write access to the repository can skip the first two steps and make a new local branch instead. Direct commits to the main branch are not allowed.\n\n\n\nProteomicsML tutorials are educational Jupyter notebooks that combine fully functional code cells and descriptive text cells. The end result should be a notebook that is easy to comprehend to anyone with a basic understanding of proteomics, programming, and machine learning. When adding or updating a tutorial, please follow these rules and conventions:\n\nTitle, filename, metadata, and subheadings\n\nTutorials are grouped by data type: Detectability, Fragmentation, Ion mobility, and Retention time. Place your tutorial notebook in the appropriate directory in the repository. E.g., tutorials/fragmentation. If your tutorial is part of a new data type group, please open a new discussion thread first.\nThe filename should be an abbreviated version of the tutorial title, formatted in kebab case (lowercase with - replacing spaces), for instance title-of-tutorial.ipynb.\nThe following front matter metadata items are required (see the Quarto Documentation for more info):\n\ntitle: A descriptive sentence-like title\nauthors: All authors that significantly contributed to the tutorial\ndate: Use last-modified to automatically render the correct date\n\n\n\n\n\n\n\nNote\n\n\n\nUnfortunately, YAML front matter is not rendered by Google Colab. Instead it is interpreted as plain markdown and the first cell of the notebook might look out of place when loaded into Google Colab. Nevertheless, the front matter results in a clean header on ProteomicsML.org, the primary platform for viewing tutorials.\n\n\nQuarto will render the title automatically from the metadata. Therefore, only subheadings should be included as markdown, starting at the second heading level (##).\nAdd an Open with Colab badge directly after the front matter metadata. The badge should be hyperlinked to open the notebook in Colab directly from GitHub. This can be achieved by replacing https://github.com/ with https://colab.research.google.com/github/ in the full URL to the file on GitHub. Additionally, in this URL the filename should be prefixed with an underscore (_); see point 2 in Website deployment for more info on notebook copies for Colab.\nFor example:\n[![](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ProteomicsML/ProteomicsML/blob/main/tutorials/fragmentation/_nist-1-parsing-spectral-library.ipynb)\nrenders as\n\n\n\n\n\n\n\n\n\n\n\nNote\n\n\n\nThe URL will not work (or be updated) until the pull request adding or updating the notebook is merged into the main branch.\n\n\n\nSubject and contents\n\nEach tutorial should clearly show and describe one or more steps in a certain machine learning workflow for proteomics.\nSufficiently describe each code cell and each step in the workflow.\nTutorials should ideally be linked to a single ProteomicsML dataset from the same group.\nWhile multiple tutorials can be added for a single data type, make sure that each tutorial is sufficiently different from the others in terms of methodology and/or datasets used.\nAll original publications that describe the methodologies, datasets, or tools that are used in the tutorial should be properly cited following scientific authoring conventions. To add a citation, add a bibtex entry to references.bib and use the Quarto citation tag. For example, [@ProteomicsML2022] renders to: (Rehfeldt et al. 2022). More info can be found in the Quarto documentation.\n\n\n\n\n\n\nTip\n\n\n\nUse doi2bib.org to easily get bibtex entries for any given publication.\n\n\n\nCode cells and programming language\n\nTutorials should work on all major platforms (Linux, Windows, macOS). An exception to this rule can be made if one or more tools central to the tutorial is not cross-platform.\nPer ProteomicsML convention, tutorials should use the Python programming language. Exceptions may be allowed if the other language is essential to the tutorial or methodology.\nProteomicsML recommends Google Colab to interactively use tutorial notebooks. Therefore, all code should be backwards compatible with the Python version used by Google Colab. At time of writing, this is Python 3.7.\nDependencies should ideally be installable with pip. A first code cell can be used to install all requirements using the Jupyter shell prefix !. For instance: ! pip install pandas.\nCode should be easy to read. For Python, follow the PEP8 style guide where possible.\nUpon pull request (PR) creation, all expected output cells should be present. When rendering the ProteomicsML website, notebooks are not rerun. Therefore, as a final step before submitting your PR, restart the kernel, run all cells from start to finish, and save the notebook. See point 2 in Website deployment for more info on notebook copies for Colab.\n\n\n\n\n\nProteomicsML datasets are community-curated proteomics datasets fit for machine learning. Ideally, each dataset is accompanied by a tutorial. When adding or updating a dataset, please follow these rules and conventions:\n\nDataset description and data files:\n\nEach dataset is represented as a single markdown file describing the dataset.\nThe data itself can be added in one of three ways:\n\nIf the dataset itself consists of one or more files, each smaller than 50 MB, they can be added in a subfolder with the same name as the markdown file. These files should be individually gzipped to save space and to prevent line-by-line tracking by Git.\n\n\n\n\n\n\nNote\n\n\n\nGzipped CSV files can very easily be read by Pandas into a DataFrame. Simply use the filename with the .gz suffix in the pandas.read_csv() function and Pandas will automatically unzip the file while reading.\n\n\nLarger files can be added to the ProteomicsML FTP file server by the project maintainers. Please request this in your pull request.\nFiles that are already publicly and persistently stored elsewhere, can be represented by solely the markdown file. In this case, all tutorials using this dataset should start from the file(s) as is and include any required preprocessing steps.[TODO: List supported platforms]\n\n\nTitle, filename, and metadata:\n\nDatasets are grouped by data type: Fragmentation, Ion mobility, Detectability, or Retention time. Place your dataset and markdown description in the appropriate directory in the repository. E.g., tutorials/fragmentation. If your dataset is part of a new data type group, please open a new discussion thread first.\nThe filename / directory name should be an abbreviated version of the dataset title, formatted in kebab case (lowercase with - replacing spaces), for instance title-of-dataset.md / title-of-dataset/.\nThe following front matter metadata items are required (see the Quarto Documentation for more info):\n\ntitle: A descriptive sentence-like title\ndate: Use last-modified to automatically render the correct date\n\nQuarto will render the title automatically from the metadata. Therefore, only subheadings should be included as markdown, starting at the second heading level (##).\n\nDataset description\nDownload the readme template, fill out the details, and add download links\n\nRetention time\nFragmentation Intensity\nIon Mobility\nDetectability\n\n\n\n\n\n\nCommit and push your changes to your fork.\nOpen a pull request with these changes. Choose the pull request template that fits your changes best.\nThe pull request should pass all the continuous integration tests which are automatically run by GitHub Actions.\nAll pull requests should be approved by at least two maintainers before they can be merged."
  },
  {
    "objectID": "contributing.html#becoming-a-maintainer",
    "href": "contributing.html#becoming-a-maintainer",
    "title": "Contributing",
    "section": "",
    "text": "If you would like to become a maintainer and review pull requests by others, please start a discussion thread to let us know!"
  },
  {
    "objectID": "contributing.html#website-deployment",
    "href": "contributing.html#website-deployment",
    "title": "Contributing",
    "section": "",
    "text": "When a pull request has been opened, the following GitHub Action is triggered: Test website rendering: The full website is rendered to check that no errors occur. This action should already have been run successfully for the pull request that implemented the changes. Nevertheless, merging could also introduce new issues.\nWhen a pull request has been marked “ready for review”, the following GitHub Action is triggered: Update notebook copies: A script is run to make copies of all tutorial notebooks with all output removed. The filenames of these copies are prepended with an underscore and should be used to open the notebooks interactively, e.g., in Google Colab. This script adds a commit to the pull request branch, which can only be accepted once this action has run successfully.\nWhen a pull request is merged with the main branch, the following GitHub Action is triggered: Publish website: Quarto is used to render the static website, which is then force-pushed to the ProteomicsML/proteomicsml.github.io repository. This repository is served on proteomicsml.org through GitHub Pages."
  }
]