Merge branch 'master' of https://github.com/bigbio/ibaqpy into dev
ypriverol committed Jun 15, 2024
2 parents 044ac38 + a489b4a commit aa51699
Showing 9 changed files with 120 additions and 10 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -76,7 +76,7 @@ A peptidoform is a combination of a `PeptideSequence(Modifications) + Charge + B
> Note: At the moment, ibaqpy computes the ibaq values only based on unique peptides. Shared peptides are discarded. However, if a group of proteins share the same unique peptides (e.g., Pep1 -> Prot1;Prot2 and Pep2 -> Prot1;Prot2), the intensity of the proteins is summed and divided by the number of proteins in the group.
#### Calculate the IBAQ Value
First, the peptide intensity dataframe is grouped by protein name, sample name and condition, and the intensities in each group are summed. Because a protein may be missing peptides in some samples, the number of peptides detected for it varies across samples. To account for this, the intensity is normalized within each group using `sum(peptides) / n`, where `n` is the number of detected peptides. Finally, the sum of the intensity of the protein is divided by the number of theoretical peptides. See details in `peptides2proteins`.
First, the peptide intensity dataframe is grouped by protein name, sample name and condition, and the intensities in each group are summed. Because a protein may be missing peptides in some samples, the number of peptides detected for it varies across samples. To account for this, the intensity is normalized within each group using `sum(peptides) / n`, where `n` is the number of detected peptides. Finally, the normalized intensity of the protein is divided by the number of theoretical peptides. See details in `peptides2proteins`.
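
The calculation described in the paragraph above can be sketched with pandas as follows. This is a minimal illustration and not the code in `peptides2proteins`: the column names (`ProteinName`, `SampleID`, `Condition`, `Intensity`), the function name `ibaq_sketch` and the `theoretical` lookup are assumptions made for the example.

```python
import pandas as pd


def ibaq_sketch(peptides: pd.DataFrame, theoretical_peptides: dict) -> pd.DataFrame:
    """Illustrative iBAQ calculation; column names are hypothetical."""
    keys = ["ProteinName", "SampleID", "Condition"]
    # Sum the peptide intensities and count the detected peptides per group.
    out = (
        peptides.groupby(keys)["Intensity"]
        .agg(total="sum", n_detected="count")
        .reset_index()
    )
    # Normalize within the group: sum(peptides) / n.
    out["normalized"] = out["total"] / out["n_detected"]
    # Divide by the number of theoretical peptides of each protein.
    out["ibaq"] = out["normalized"] / out["ProteinName"].map(theoretical_peptides)
    return out


peptides = pd.DataFrame(
    {
        "ProteinName": ["P1", "P1", "P1", "P2"],
        "SampleID": ["S1", "S1", "S2", "S1"],
        "Condition": ["A", "A", "A", "A"],
        "Intensity": [10.0, 30.0, 8.0, 6.0],
    }
)
theoretical = {"P1": 4, "P2": 2}  # hypothetical theoretical peptide counts
result = ibaq_sketch(peptides, theoretical)
print(result[["ProteinName", "SampleID", "ibaq"]])
```

In this toy input, P1 has two detected peptides in S1 but only one in S2, so dividing by `n_detected` keeps the two samples comparable before the division by the theoretical peptide count.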

> Note: In all scripts and result files, *uniprot accession* is used as the protein identifier.
119 changes: 114 additions & 5 deletions benchmarks/README.md
@@ -56,7 +56,12 @@ In summary, both datasets were searched with three search engines _SAGE_, _COMET

#### Coefficient of Variation (CV)

Coefficient of variation for all samples in both experiments using `quantile`, `median`, `median-cov`. We extracted human proteins common to 11 samples from IBAQ data. The mean of the coefficient of variation of all proteins in 11 samples was then calculated.
Coefficient of variation for all samples in both experiments using `quantile`, `median`, `median-cov`.
- `quantile`: During data preprocessing, the samples are adjusted so that the mean and variance of all samples are equal. Finally, the sum of the intensity of the protein is divided by the number of theoretical peptides.
- `median`: During data preprocessing, the samples are adjusted so that the medians of all samples are equal. Finally, the sum of the intensity of the protein is divided by the number of theoretical peptides.
- `median-cov`: During data preprocessing, the samples are adjusted so that the medians of all samples are equal. Because a protein may be missing peptides in some samples, the number of peptides detected for it varies across samples. To account for this, the intensity is normalized within each group using `sum(peptides) / n`, where `n` is the number of detected peptides. Finally, the normalized intensity of the protein is divided by the number of theoretical peptides.
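
As a rough illustration of the `median` preprocessing step in the list above, the sketch below rescales each sample so that all sample medians coincide. The column names, the function name `median_normalize` and the choice of target (the mean of the per-sample medians) are assumptions for the example; ibaqpy may use a different target or scaling.

```python
import pandas as pd


def median_normalize(
    df: pd.DataFrame, sample_col: str = "SampleID", value_col: str = "Intensity"
) -> pd.DataFrame:
    """Scale every sample so that all sample medians are equal (illustrative)."""
    # Median of each sample, broadcast back to every row of that sample.
    sample_medians = df.groupby(sample_col)[value_col].transform("median")
    # Assumed common target: the mean of the per-sample medians.
    target = df.groupby(sample_col)[value_col].median().mean()
    out = df.copy()
    out[value_col] = df[value_col] / sample_medians * target
    return out


peptides = pd.DataFrame(
    {
        "SampleID": ["S1", "S1", "S1", "S2", "S2", "S2"],
        "Intensity": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    }
)
normalized = median_normalize(peptides)
medians = normalized.groupby("SampleID")["Intensity"].median()
print(medians)
```

After scaling, both samples in the toy input share the same median, while the relative intensities within each sample are unchanged.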

We extracted human proteins common to 11 samples from IBAQ data. The mean of the coefficient of variation of all proteins in 11 samples was then calculated.
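
The mean CV described above could be computed along these lines. This is a sketch under assumptions: the input is taken to be a protein-by-sample matrix of iBAQ values, "common to all samples" is implemented by dropping proteins with any missing value, and the sample standard deviation (`ddof=1`) is used; the benchmark scripts may differ on each point.

```python
import pandas as pd


def mean_cv(ibaq: pd.DataFrame) -> float:
    """Mean coefficient of variation over proteins quantified in every sample."""
    # Keep only proteins present in all samples (rows without missing values).
    complete = ibaq.dropna(axis=0, how="any")
    # CV per protein: standard deviation across samples divided by the mean.
    cv = complete.std(axis=1, ddof=1) / complete.mean(axis=1)
    return float(cv.mean())


# Rows are proteins, columns are samples (hypothetical values).
ibaq = pd.DataFrame(
    {
        "S1": [10.0, 1.0, 5.0],
        "S2": [10.0, 2.0, None],
        "S3": [10.0, 3.0, 7.0],
    },
    index=["P1", "P2", "P3"],
)
result = mean_cv(ibaq)
print(result)
```

In the toy matrix, P3 is missing in S2 and is excluded, so the mean is taken over P1 and P2 only.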

Compared to `quantile`, `median` and `median-cov` have a smaller coefficient of variation. `median-cov` has the smallest CV in the LFQ experiment.

@@ -97,6 +102,36 @@ The following boxplot shows the coefficient of variation for the 11 samples for
<img src='images/PXD007683-TMTvsLFQ-boxplot.png' style="flex:1;height:600px;" />
</center>

#### Correlation between Ibaq and MaxQuant(Ibaq)

For `PXD007683-LFQ`, we normalize the MaxQuant iBAQ value of each protein by dividing it by the total sum of that sample. We then use quantmsio to convert the results report from quantms and apply the `median-cov` method to calculate iBAQ values from that report. Finally, we compute the correlation between the log of these values and the log of IbaqNorm.
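
For one sample, the comparison could be sketched as below. Several points are assumptions: IbaqNorm is taken to mean the MaxQuant iBAQ values divided by the sample total, Pearson correlation and base-10 logarithms are used (the text does not name either), and the function name `log_correlation` and the input series are hypothetical.

```python
import numpy as np
import pandas as pd


def log_correlation(ibaqpy_values: pd.Series, maxquant_values: pd.Series) -> float:
    """Pearson correlation of log10 values for proteins found in both inputs."""
    # Assumed definition of IbaqNorm: MaxQuant iBAQ divided by the sample total.
    ibaq_norm = maxquant_values / maxquant_values.sum()
    # Align on protein accession and keep proteins quantified by both tools.
    paired = pd.concat({"ibaqpy": ibaqpy_values, "maxquant": ibaq_norm}, axis=1).dropna()
    # The logarithm is only defined for positive values.
    paired = paired[(paired > 0).all(axis=1)]
    logged = np.log10(paired)
    return float(logged["ibaqpy"].corr(logged["maxquant"]))


ibaqpy_values = pd.Series({"P1": 0.1, "P2": 0.01, "P3": 0.001})
maxquant_values = pd.Series({"P1": 1000.0, "P2": 100.0, "P3": 10.0, "P4": 5.0})
r = log_correlation(ibaqpy_values, maxquant_values)
print(r)
```

Dividing by the sample total only shifts the log values by a constant, so it does not change the correlation within a single sample; it matters when values from several samples are pooled or compared.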

<div style="display:flex;justify-content:center">
<img src='images/PXD007683-LFQ-ibaq-vs-maxquant-density.png' style="heigit:600px;" />
</div>
<div style="display:flex;justify-content:center">
<img src='images/PXD007683-LFQ-11samples-ibaq-vs-maxquant-density.png' style="height:1200px;" />
</div>

Next, we recalculated the iBAQ values from the MaxQuant peptide table using `ibaqpy`, and then computed the correlation between the log of these values and the log of IbaqNorm. `Cov` is used to reduce the impact of missing values; it adds an extra step compared with calculating the iBAQ values directly, which leads to some differences between its iBAQ values and those calculated directly.

<div style="display:flex;justify-content:center">
<img src='images/PXD007683-LFQ-ibaq-ibaqpy-and-maxquant.png' style="heigit:600px;" />
</div>
<div style="display:flex;justify-content:center">
<img src='images/PXD007683-LFQ-11samples-ibaq-ibaqpy-and-maxquant.png' style="height:1200px;" />
</div>

If we do not use `cov` to normalize proteins and instead calculate the iBAQ values directly from the MaxQuant peptide table, the iBAQ values calculated by ibaqpy are very close to those obtained from MaxQuant.

<div style="display:flex;justify-content:center">
<img src='images/PXD007683-LFQ-no_cov.png' style="heigit:600px;" />
</div>
<div style="display:flex;justify-content:center">
<img src='images/PXD007683-LFQ-11samples-no_cov.png' style="height:1200px;" />
</div>


#### LFQ missing values

Number of peptides missing in LFQ experiments.
@@ -179,8 +214,82 @@ We will normalize the MaxLFQ values of the proteins in the DIANN report by divid
</center>

### Performance testing
The [PXD030304](https://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/absolute-expression/PXD030304/) project collected mass spectrometry data from 949 cancer cell lines, which was reanalyzed using the DIANN analysis pipeline within the quantms platform. The `diann_report.tsv` file is 167GB; after conversion to a parquet file with quantmsio, it is 15.8GB. We conducted performance testing in a 128GB memory environment.

| Project | Samples | Size(diann report) | Size(parquet file) | Run time |
|--------|---------|----------|----------|----------|
| PXD030304 | 2013 | 167G | 15.8G | 2.75h |
We ran performance tests on the three methods. Because `median` and `median-cov` differ only in the iBAQ calculation step, both are reported as `median` below. The `median` method works at the sample level: instead of reading all data at once as `quantile` does, it reads the data in batches (20 samples at a time by default), which greatly reduces memory consumption. For PXD030304, for example, `median` needed 13.1 G of memory and 2.75 h, whereas `quantile` exceeded 128 G. On the smaller projects, `median` used less memory but took somewhat longer to run.
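
The batching idea can be illustrated with a small generator. This is only a sketch of the principle, not ibaqpy's reader: the function name `sample_batches` and the sample identifiers are hypothetical, and only the default of 20 samples per batch is taken from the text above.

```python
from typing import Iterator, List, Sequence


def sample_batches(samples: Sequence[str], batch_size: int = 20) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `batch_size` sample names."""
    for start in range(0, len(samples), batch_size):
        yield list(samples[start:start + batch_size])


samples = [f"S{i}" for i in range(45)]  # hypothetical sample identifiers
batches = list(sample_batches(samples))
for batch in batches:
    # In a real pipeline, only the rows of these samples would be loaded,
    # processed and written out before the next batch is read.
    print(len(batch), batch[0], batch[-1])
```

Because only one batch of samples is held in memory at a time, peak memory depends on the batch size rather than on the total number of samples, which is consistent with the memory figures reported for `median` in the table.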

<table align="center">
<thead>
<tr>
<th>Project</th>
<th>File size(original)</th>
<th>File size(transform)</th>
<th>Ms runs</th>
<th>Samples</th>
<th>Method</th>
<th>Memory</th>
<th>Run time</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan=2>PXD016999.1</td>
<td rowspan=2>5.7 G</td>
<td rowspan=2>292 M</td>
<td rowspan=2>336</td>
<td rowspan=2>280</td>
<td>quantile</td>
<td>36.4 G</td>
<td>14 min</td>
</tr>
<tr>
<td>median</td>
<td>8.4 G</td>
<td>20 min</td>
</tr>
<tr>
<td rowspan=2>PXD019909</td>
<td rowspan=2>1.9 G</td>
<td rowspan=2>171 M</td>
<td rowspan=2>43</td>
<td rowspan=2>43</td>
<td>quantile</td>
<td>7.9 G</td>
<td>30 s</td>
</tr>
<tr>
<td>median</td>
<td>4.0 G</td>
<td>1.4 min</td>
</tr>
<tr>
<td rowspan=2>PXD010154</td>
<td rowspan=2>1.9 G</td>
<td rowspan=2>287 M</td>
<td rowspan=2>1367</td>
<td rowspan=2>38</td>
<td>quantile</td>
<td>32.1 G</td>
<td>8 min</td>
</tr>
<tr>
<td>median</td>
<td>16.2 G</td>
<td>12 min</td>
</tr>
<tr>
<td rowspan=2>PXD030304</td>
<td rowspan=2>167 G</td>
<td rowspan=2>15.8 G</td>
<td rowspan=2>6862</td>
<td rowspan=2>2013</td>
<td>quantile</td>
<td>> 128 G</td>
<td>> 2 days</td>
</tr>
<tr>
<td>median</td>
<td>13.1 G</td>
<td>2.75 h</td>
</tr>
</tbody>
</table>
Binary file added benchmarks/images/PXD007683-LFQ-no_cov.png
9 changes: 5 additions & 4 deletions ibaqpy/ibaq/peptide_normalization.py
@@ -194,9 +194,9 @@ def data_common_process(data_df: pd.DataFrame, min_aa: int) -> pd.DataFrame:
data_df = data_df[data_df["Condition"] != "Empty"]

# Filter peptides with less amino acids than min_aa (default: 7)
data_df = data_df[
data_df.apply(lambda x: len(x[PEPTIDE_CANONICAL]) >= min_aa, axis=1)
]
data_df.loc[:,'len'] = data_df[PEPTIDE_CANONICAL].apply(len)
data_df = data_df[data_df['len']>=min_aa]
data_df.drop(['len'],inplace=True,axis=1)
data_df[PROTEIN_NAME] = data_df[PROTEIN_NAME].apply(parse_uniprot_accession)
if FRACTION not in data_df.columns:
data_df[FRACTION] = 1
@@ -561,7 +561,8 @@ def peptide_normalization(
technical_repetitions, label, sample_names, choice = analyse_sdrf(sdrf)
else:
technical_repetitions, label, sample_names, choice = feature.experimental_inference
low_frequency_peptides = feature.low_frequency_peptides
if remove_low_frequency_peptides:
low_frequency_peptides = feature.low_frequency_peptides
header = False
if not skip_normalization and pnmethod == "globalMedian":
med_map = feature.get_median_map()
