Merge pull request #61 from zprobot/master
update: readme
ypriverol authored Jun 2, 2024
2 parents 9d763eb + 4555a8f commit 741de8f
Showing 3 changed files with 65 additions and 83 deletions.
61 changes: 22 additions & 39 deletions README.md
@@ -18,11 +18,9 @@ As mentioned before, ibaq values are calculated by dividing the total precursor

- _Total precursor intensities_: the total intensity of a protein is calculated by summing the intensities of all peptides that belong to the protein. The intensity values are obtained from the feature parquet file in [quantms.io](https://github.com/bigbio/quantms.io).

## TODO @PingZheng: Can you double-check that for protein group we multiply the intensity or divided?
> Note: If protein-group exists in the peptide intensity dataframe, the intensity of all proteins in the protein-group is summed based on the above steps, and then divided by the number of proteins in the protein-group.
> Note: If protein-group exists in the peptide intensity dataframe, the intensity of all proteins in the protein-group is summed based on the above steps, and then multiplied by the number of proteins in the protein-group.
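
As a rough illustration of this aggregation, the pandas sketch below sums peptide intensities per protein (or protein group) and sample. The column names (`protein`, `sample`, `peptide`, `intensity`) are placeholders and do not match the real quantms.io feature columns.

```python
import pandas as pd

# Toy feature table; the real input is the quantms.io feature parquet file,
# whose column names differ from these placeholders.
features = pd.DataFrame(
    {
        "protein": ["P12345", "P12345", "P67890;Q11111", "P67890;Q11111"],
        "sample": ["S1", "S1", "S1", "S1"],
        "peptide": ["PEPTIDEA", "PEPTIDEB", "PEPTIDEC", "PEPTIDED"],
        "intensity": [1.0e6, 2.5e6, 4.0e5, 6.0e5],
    }
)

# Total precursor intensity per protein (or protein group) and sample:
# sum the intensities of all peptides assigned to that protein.
protein_intensity = features.groupby(["protein", "sample"], as_index=False)["intensity"].sum()
print(protein_intensity)
```
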
#### Other normalized IBAQ values
### Other normalized IBAQ values

- `IbaqNorm` - normalize the ibaq values using the total ibaq of the sample `ibaq / sum(ibaq)`, the sum is applied for proteins in the same _sample + condition_.
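
A minimal sketch of the `IbaqNorm` formula, assuming a table with placeholder columns `sample`, `condition`, and `ibaq`:

```python
import pandas as pd

# Toy protein iBAQ table (placeholder column names).
ibaq = pd.DataFrame(
    {
        "sample": ["S1", "S1", "S2", "S2"],
        "condition": ["ctrl", "ctrl", "treat", "treat"],
        "ibaq": [10.0, 30.0, 5.0, 15.0],
    }
)

# IbaqNorm: ibaq / sum(ibaq), where the sum is taken over the same sample + condition.
ibaq["IbaqNorm"] = ibaq["ibaq"] / ibaq.groupby(["sample", "condition"])["ibaq"].transform("sum")
print(ibaq)
```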

@@ -52,49 +50,34 @@ In summary, each feature is the unique combination of a peptide sequence includi

#### Data preprocessing

## TODO @PingZheng: Can you confirm that the following steps are correct?

In this section, ibaqpy will do:
In this section (`features2peptides`), ibaqpy will do:
- Parse the identifier of proteins and retain only unique peptides.
- Remove lines where intensity or study condition is empty: This could happen in the following cases:
  - The DIA pipeline sometimes reports intensities with a value of 0 for some features.
  - The quantms.io file does not contain feature information for some conditions. This extreme case could happen when no ID/Quant was found for a given condition during the analysis.
- Filter peptides with fewer amino acids than `min_aa`.
- Remove low-confidence proteins according to a threshold on the number of unique peptides: we use a threshold of 2 unique peptides to consider a protein high-confidence. This default of 2 is applied unless the user changes it with the `--min_unique` parameter (a sketch of this filter appears after this list).
- Filter decoys, contaminants, and entrapment sequences: proteins with the prefixes `DECOY`, `CONTAMINANT`, or `ENTRAPMENT` can be removed; by default, this filter is not applied. If users want to remove these proteins, they can use the `--remove_decoy_contaminants` parameter.
- Filter user-specified proteins: the user can provide a list of protein identifiers to remove from the analysis using the `--remove_ids` parameter. This removes proteins that could potentially influence the intensity normalization. For example, ALBU_HUMAN may be over-expressed in human tissues, which is why we may want to remove it when analyzing tissue data.
- Normalize at the feature level between MS runs (technical repetitions):
  - When `MS runs > 1` in a sample, the `mean` of the per-run averages (`mean`, `median`, or `iqr`) is calculated (SampleMean).
  - The ratio between the SampleMean and each MS run's average is used to scale the original intensities.
- Merge peptidoforms across fractions and technical repetitions: combine technical replicates and fractions from the same sample.
- Normalize the data at the sample level:
  - `globalMedian`: adjusts the median of all samples to a global median.
  - `conditionMedian`: adjusts all samples under the same condition to the median value of that condition.
- Remove peptides with low frequency if `sample number > 1`: this filter is applied when the user specifies the `--remove_low_frequency_peptides` parameter; by default, peptides detected in fewer than 20% of the samples are removed.
- Assemble peptidoforms into peptides:
  A peptidoform is a combination of `PeptideSequence(Modifications) + Charge + BioReplicate + Fraction` (among other features), and a peptide is a combination of `PeptideSequence(Canonical) + BioReplicate`. ibaqpy will:
  - Select the peptidoform with the highest intensity across different modifications, fractions, and technical replicates.
  - Merge peptidoforms across different charges and combine them into peptides. To merge peptidoforms, the package applies the `sum` of their intensity values.
- Intensity transformation to log: The user can specify the `--log2` parameter to transform the peptide intensity values to log2 before normalization.

> Note: At the moment, ibaqpy computes iBAQ values based only on unique peptides; shared peptides are discarded. However, if a group of proteins shares the same unique peptides (e.g., Pep1 -> Prot1;Prot2 and Pep2 -> Prot1;Prot2), the intensity of the proteins is summed and divided by the number of proteins in the group.
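
As an illustration of the unique-peptide and `min_unique` filters described in the list above (placeholder column names; the real implementation uses the quantms.io column constants):

```python
import pandas as pd

MIN_UNIQUE = 2  # default threshold; adjustable via --min_unique

# Toy peptide table with placeholder column names.
peptides = pd.DataFrame(
    {
        "protein": ["P1", "P1", "P2", "P3", "P3", "P3"],
        "peptide_canonical": ["PEPA", "PEPB", "PEPC", "PEPD", "PEPE", "PEPF"],
        "unique": [1, 1, 1, 1, 1, 0],
        "intensity": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    }
)

# Keep only peptides mapping to a single protein (unique == 1).
peptides = peptides[peptides["unique"] == 1]

# Keep proteins supported by at least MIN_UNIQUE distinct canonical peptides.
peptides = peptides.groupby("protein").filter(
    lambda g: g["peptide_canonical"].nunique() >= MIN_UNIQUE
)
print(peptides)  # P2 is dropped: only one unique peptide
```
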
#### Feature normalization

## TODO @PingZheng: Can you confirm that the following steps are correct?

In this section, ibaqpy corrects the intensity of each feature across MS runs in the sample, eliminating between-run effects (including technical repetitions). It will:

- When `MS runs > 1` in the sample, the `mean` of the per-run averages (`mean`, `median`, or `iqr`) is calculated (SampleMean).
- The ratio between the SampleMean and each MS run's average is used to scale the original intensities, as in the sketch below.
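
A minimal sketch of this ratio-based scaling, using the run `median` as the per-run average and placeholder column names:

```python
import pandas as pd

# Toy feature table for one sample measured in two MS runs (placeholder column names).
df = pd.DataFrame(
    {
        "run": ["r1", "r1", "r1", "r2", "r2", "r2"],
        "intensity": [10.0, 20.0, 30.0, 40.0, 80.0, 120.0],
    }
)

# Per-run average; 'median' is used here, but 'mean' or an IQR-based value
# could be chosen instead (--nmethod).
run_avg = df.groupby("run")["intensity"].median()

# SampleMean: the mean of the per-run averages.
sample_mean = run_avg.mean()

# Scale each run so that its average matches the SampleMean.
df["norm_intensity"] = df["intensity"] * (sample_mean / df["run"].map(run_avg))
print(df)
```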

#### Features to peptides

A peptidoform is a combination of `PeptideSequence(Modifications) + Charge + BioReplicate + Fraction` (among other features), and a peptide is a combination of `PeptideSequence(Canonical) + BioReplicate`. ibaqpy will:

- Select the peptidoform with the highest intensity across different modifications, fractions, and technical replicates.
- Merge peptidoforms across different charges and combine them into peptides. To merge peptidoforms, the package applies the `sum` of their intensity values, as sketched below.
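
A small sketch of these two steps, with placeholder column names; the exact grouping keys used by ibaqpy may differ:

```python
import pandas as pd

# Toy peptidoform table (placeholder column names).
df = pd.DataFrame(
    {
        "peptide_canonical": ["PEPTIDEK"] * 4,
        "modified_sequence": ["PEPTIDEK", "PEP(Oxidation)TIDEK", "PEPTIDEK", "PEPTIDEK"],
        "charge": [2, 2, 3, 3],
        "fraction": [1, 1, 1, 2],
        "sample": ["S1"] * 4,
        "intensity": [5.0, 3.0, 2.0, 4.0],
    }
)

# Step 1: for each canonical peptide + charge in a sample, keep the single most
# intense peptidoform across modifications, fractions and technical replicates.
best = df.sort_values("intensity", ascending=False).drop_duplicates(
    subset=["peptide_canonical", "charge", "sample"]
)

# Step 2: merge the retained peptidoforms across charges by summing their
# intensities, yielding one value per canonical peptide and sample.
peptide = best.groupby(["peptide_canonical", "sample"], as_index=False)["intensity"].sum()
print(peptide)
```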

#### Peptide Intensity Normalization

Finally, peptide intensities are normalized globally using `globalMedian` or `conditionMedian`. Within each sample, each protein is then combined with its unique peptides.
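
A minimal sketch of both normalization modes, assuming ratio-to-median scaling and placeholder column names; the exact reference values ibaqpy computes may differ:

```python
import pandas as pd

# Toy peptide table (placeholder column names): three samples, two conditions.
df = pd.DataFrame(
    {
        "sample": ["S1", "S1", "S2", "S2", "S3", "S3"],
        "condition": ["ctrl", "ctrl", "ctrl", "ctrl", "treat", "treat"],
        "intensity": [10.0, 30.0, 40.0, 60.0, 100.0, 300.0],
    }
)

sample_median = df.groupby("sample")["intensity"].transform("median")

# globalMedian: rescale every sample so its median matches the global median.
df["norm_global"] = df["intensity"] * (df["intensity"].median() / sample_median)

# conditionMedian: rescale each sample to the median of its own condition.
condition_median = df.groupby("condition")["intensity"].transform("median")
df["norm_condition"] = df["intensity"] * (condition_median / sample_median)
print(df)
```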

#### Computing Ibaq and TPA

The iBAQ value is calculated as the intensity of the protein divided by the theoretical number of unique peptides produced by enzymatic cleavage. This command also normalizes iBAQ values as riBAQ. See details in `peptides2proteins`.
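
As an illustration of the denominator, the sketch below counts theoretical peptides with a simple tryptic rule (cleavage after K/R, not before P, no missed cleavages, length 7–30). These digestion rules are assumptions for the example; ibaqpy's actual in-silico digestion settings may differ.

```python
import re


def count_theoretical_peptides(sequence: str, min_len: int = 7, max_len: int = 30) -> int:
    """Count unique tryptic peptides for a protein sequence.

    Assumes cleavage C-terminal to K/R, not before P, no missed cleavages,
    and a simple length filter; these rules are assumptions for this example.
    """
    peptides = re.split(r"(?<=[KR])(?!P)", sequence)
    return len({p for p in peptides if min_len <= len(p) <= max_len})


protein_intensity = 3.2e7  # summed peptide intensity of the protein in one sample
n_theoretical = count_theoretical_peptides(
    "MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFKALVLIAFAQYLQQCPFEDHVK"
)
ibaq = protein_intensity / max(n_theoretical, 1)
# riBAQ then rescales each iBAQ by the summed iBAQ of its sample (see IbaqNorm above).
print(n_theoretical, ibaq)
```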

Compute TPA values, protein copy numbers, and concentrations from the output file of the `commands/features2peptides` script.

**Merge projects**: Merge ibaq results from multiple datasets. It consists of three steps: missing value imputation, outlier removal, and batch effect removal.
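
As a toy illustration of the first step only (missing-value imputation), one could impute a samples × proteins iBAQ matrix with scikit-learn's `KNNImputer`; this is not necessarily the method ibaqpy uses:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy samples x proteins iBAQ matrix from two merged datasets, with missing values.
ibaq_matrix = pd.DataFrame(
    {
        "P1": [5.1, 5.3, np.nan, 4.9],
        "P2": [7.2, np.nan, 7.0, 7.4],
        "P3": [3.3, 3.1, 3.2, np.nan],
    },
    index=["D1_S1", "D1_S2", "D2_S1", "D2_S2"],
)

# K-nearest-neighbour imputation of the missing iBAQ values (illustrative choice only).
imputed = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(ibaq_matrix),
    index=ibaq_matrix.index,
    columns=ibaq_matrix.columns,
)
print(imputed)
```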

#### Calculate the IBAQ Value
First, the peptide intensity dataframe is grouped by protein name, sample name, and condition, and the protein intensity of each group is summed. Depending on the experiment, the same protein may have missing peptides in different samples, so the number of peptides detected for a protein can vary across samples. To handle this difference, values within the same group can be normalized using the formula `sum(peptides) / n` (where n is the number of detected peptides). Finally, the summed protein intensity is divided by the number of theoretical peptides. See details in `peptides2proteins`.
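
A compact sketch that chains these steps, with placeholder column names and made-up theoretical peptide counts:

```python
import pandas as pd

# Toy peptide-level table (placeholder column names) and made-up theoretical
# peptide counts per protein from an in-silico digest.
peptides = pd.DataFrame(
    {
        "protein": ["P1", "P1", "P2", "P1", "P2"],
        "sample": ["S1", "S1", "S1", "S2", "S2"],
        "condition": ["ctrl", "ctrl", "ctrl", "treat", "treat"],
        "intensity": [2.0e6, 1.0e6, 5.0e5, 1.5e6, 8.0e5],
    }
)
theoretical = pd.Series({"P1": 12, "P2": 5}, name="n_theoretical")

# Group by protein, sample and condition, then sum the peptide intensities.
protein = peptides.groupby(["protein", "sample", "condition"], as_index=False)["intensity"].sum()

# iBAQ: the summed protein intensity divided by the number of theoretical peptides.
protein = protein.join(theoretical, on="protein")
protein["ibaq"] = protein["intensity"] / protein["n_theoretical"]
print(protein)
```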

> Note: In all scripts and result files, *uniprot accession* is used as the protein identifier.
### How to install ibaqpy
Expand Down Expand Up @@ -144,7 +127,7 @@ E.g. http://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/absolute
#### Features to peptides

```asciidoc
ibaqpy features2peptides -p tests/PXD003947/PXD003947-featrue.parquet -s tests/PXD003947/PXD003947.sdrf.tsv --remove_ids data/contaminants_ids.tsv --remove_decoy_contaminants --remove_low_frequency_peptides --output tests/PXD003947/PXD003947-peptides-norm.csv --log2 --save_parquet
ibaqpy features2peptides -p tests/PXD003947/PXD003947-featrue.parquet -s tests/PXD003947/PXD003947.sdrf.tsv --remove_ids data/contaminants_ids.tsv --remove_decoy_contaminants --remove_low_frequency_peptides --output tests/PXD003947/PXD003947-peptides-norm.csv
```
```asciidoc
Usage: features2peptides.py [OPTIONS]
@@ -167,11 +150,11 @@ Options:
properties for normalization
--skip_normalization Skip normalization step
--nmethod TEXT Normalization method used to normalize
feature intensities for all samples
feature intensities for tec
(options: mean, median, iqr, none)
--pnmethod TEXT Normalization method used to normalize
peptides intensities for all samples
(options: globalMedian,conditionMedian)
(options: globalMedian,conditionMedian,none)
--log2 Transform to log2 the peptide intensity
values before normalization
--save_parquet Save normalized peptides to parquet
```
Binary file modified benchmarks/images/missing_peptides_by_sample.png
87 changes: 43 additions & 44 deletions ibaqpy/ibaq/peptide_normalization.py
@@ -553,56 +553,28 @@ def peptide_normalization(
med_map = F.get_median_map_to_condition()
for samples, df in F.iter_samples():
for sample in samples:
## TODO: Perform data preprocessing on every sample
# Perform data preprocessing on every sample
print(f"{str(sample).upper()}: Data preprocessing...")
dataset_df = df[df["sample_accession"] == sample].copy()
# Step1: Parse the identifier of proteins and retain only unique peptides.
dataset_df = dataset_df[dataset_df["unique"] == 1]
if not skip_normalization:
if pnmethod == "globalMedian":
dataset_df.loc[:, "intensity"] = (
dataset_df["intensity"] / med_map[sample]
)
elif pnmethod == "conditionMedian":
con = dataset_df["condition"].unique()[0]
dataset_df.loc[:, "intensity"] = (
dataset_df["intensity"] / med_map[con][sample]
)
dataset_df = dataset_df[PARQUET_COLUMNS]
dataset_df = parquet_common_process(dataset_df, label, choice)
# Step2: Remove lines where intensity or study condition is empty.
# Step3: Filter peptides with less amino acids than min_aa.
dataset_df = data_common_process(dataset_df, min_aa)
# Only proteins with unique peptides number greater than min_unique (default: 2) are retained
# Step4: Delete low-confidence proteins.
dataset_df = dataset_df.groupby(PROTEIN_NAME).filter(
lambda x: len(set(x[PEPTIDE_CANONICAL])) >= min_unique
)
dataset_df.rename(columns={INTENSITY: NORM_INTENSITY}, inplace=True)
## TODO: @PingZheng I think the removal of these features must happen before the normalization step.
## TODO: The entire point to catch any outliers and remove them before normalization.

# Remove high abundant entrapment's, contaminants, proteins and the outliers
if remove_ids is not None:
dataset_df = remove_protein_by_ids(dataset_df, remove_ids)
# Step5: Filter decoy, contaminants, entrapment
if remove_decoy_contaminants:
dataset_df = remove_contaminants_entrapments_decoys(dataset_df)

if remove_low_frequency_peptides and len(sample_names) > 1:
dataset_df.set_index(
[PROTEIN_NAME, PEPTIDE_CANONICAL], drop=True, inplace=True
)
dataset_df = dataset_df[
~dataset_df.index.isin(low_frequency_peptides)
].reset_index()
print(
f"{str(sample).upper()}: Peptides after remove low frequency peptides: {len(dataset_df.index)}"
)

if len(dataset_df[FRACTION].unique().tolist()) > 1:
print(f"{str(sample).upper()}: Merge features across fractions.. ")
dataset_df = merge_fractions(dataset_df)
print(
f"{str(sample).upper()}: Number of features after merging fractions: {len(dataset_df.index)}"
)

# TODO: Normalize at feature level between ms runs (technical repetitions)
# Step6: Filter user-specified proteins
if remove_ids is not None:
dataset_df = remove_protein_by_ids(dataset_df, remove_ids)
dataset_df.rename(columns={INTENSITY: NORM_INTENSITY}, inplace=True)
# Step7: Normalize at feature level between ms runs (technical repetitions)
if (
not skip_normalization
and nmethod != "none"
@@ -613,21 +585,48 @@
print(
f"{str(sample).upper()}: Number of features after normalization: {len(dataset_df.index)}"
)

## TODO: Assembly features to peptides
# Merge peptidoforms across fractions and technical repetitions
# Step8: Merge peptidoforms across fractions and technical repetitions
dataset_df = get_peptidoform_normalize_intensities(dataset_df)
print(
f"{str(sample).upper()}: Number of peptides after peptidofrom selection: {len(dataset_df.index)}"
)
if len(dataset_df[FRACTION].unique().tolist()) > 1:
print(f"{str(sample).upper()}: Merge features across fractions.. ")
dataset_df = merge_fractions(dataset_df)
print(
f"{str(sample).upper()}: Number of features after merging fractions: {len(dataset_df.index)}"
)
# Step9: Normalize the data.
if not skip_normalization:
if pnmethod == "globalMedian":
dataset_df.loc[:, NORM_INTENSITY] = (
dataset_df[NORM_INTENSITY] / med_map[sample]
)
elif pnmethod == "conditionMedian":
con = dataset_df[CONDITION].unique()[0]
dataset_df.loc[:, NORM_INTENSITY] = (
dataset_df[NORM_INTENSITY] / med_map[con][sample]
)

# Step10: Remove peptides with low frequency.
if remove_low_frequency_peptides and len(sample_names) > 1:
dataset_df.set_index(
[PROTEIN_NAME, PEPTIDE_CANONICAL], drop=True, inplace=True
)
dataset_df = dataset_df[
~dataset_df.index.isin(low_frequency_peptides)
].reset_index()
print(
f"{str(sample).upper()}: Peptides after remove low frequency peptides: {len(dataset_df.index)}"
)

# Assembly peptidoforms to peptides
# Step11: Assembly peptidoforms to peptides.
print(f"{str(sample).upper()}: Sum all peptidoforms per Sample...")
dataset_df = sum_peptidoform_intensities(dataset_df)
print(
f"{str(sample).upper()}: Number of peptides after selection: {len(dataset_df.index)}"
)

# Step12: Intensity transformation to log.
if log2:
dataset_df[NORM_INTENSITY] = np.log2(dataset_df[NORM_INTENSITY])

