
Benchmark by='source' does not seem to work correctly #149

Open
IBMB-MFP opened this issue Aug 13, 2024 · 3 comments

@IBMB-MFP

Hi, I am trying to satisfy reviewers by providing benchmark statistics for individual TFs.

According to decoupler-py's Read the Docs page, this should be possible using the argument by='source' in the benchmark function.

When I run benchmark with by='experiment', it runs fine and gives me the expected output matrix of benchmarking statistics.

When I run with by='source', the verbose output is identical:

Extracting inputs...
Formating net...
Removed 1 experiments without sources in net.
Running 9 experiments for 5 unique sources.
Running methods...
29194 features of mat are empty, they will be removed.
Running mlm on mat with 9 samples and 17729 targets for 487 sources.
Calculating metrics...
Computing metrics...
Done.

but I get an empty data frame returned, and I don't understand why.

Here is the code used to run this analysis - the files obs_limit.txt, bench_limit.txt are attached (they are limited subsets of the full benchmarking set for the purpose of the reproducible example), as is the GRN (CelEsT_GRN.txt).

#%%
import os
from pathlib import Path

import pandas as pd
import decoupler as dc

#%%
home_directory = os.path.expanduser("~")
working_directory = os.path.join(home_directory, "Cel_GRN_revisions")
os.chdir(working_directory)

output_directory = Path("output/benchmark_out")
output_directory.mkdir(parents=True, exist_ok=True)

#%%
# Perturbation expression matrix (limited subset of the full benchmarking set)
benchRAPToR = pd.read_csv("bench_limit.txt", sep='\t', index_col=0)

#%%
# Metadata for the benchmark experiments (limited subset)
obs1 = pd.read_csv("obs_limit.txt", sep='\t', index_col=0,
                   encoding='unicode_escape')

#%%
# GRN passed as the network (third positional) argument of dc.benchmark
mat = pd.read_table(os.path.join(home_directory, "CelEsT_GRN.txt"))

decouple_kws = {
    'methods': ['mlm'],
    'consensus': False,
}

#%%
# Runs fine and returns the expected matrix of benchmarking statistics
notbysource_successful = dc.benchmark(benchRAPToR, obs1, mat,
                                      perturb='target_gseq', by='experiment',
                                      sign=-1, verbose=True,
                                      decouple_kws=decouple_kws)

#%%
# Same call with by='source' returns an empty data frame
bysource_doesntwork = dc.benchmark(benchRAPToR, obs1, mat,
                                   by='source', perturb='target_gseq',
                                   sign=-1, verbose=True,
                                   decouple_kws=decouple_kws)

I am using:

  • Python 3.8.18
  • decoupler 1.6.0 (having trouble updating)

obs_limit.txt
bench_limit.txt
CelEsT_GRN.txt

@IBMB-MFP IBMB-MFP added the bug Something isn't working label Aug 13, 2024
@PauBadiaM PauBadiaM self-assigned this Aug 14, 2024
@PauBadiaM
Member

Hi @IBMB-MFP,

This is caused by the interaction between by='source' and the default min_exp=5. It seems that none of your TFs have more than 5 perturbation experiments, which is why the benchmark returns nothing. If you set min_exp=3 you do get results, but keep in mind that the AUC is then built from a ranking of just 3 elements, which may not be very meaningful. That is why we enforce at least 5 experiments by default.
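A minimal sketch of that change, reusing the call from the reproducible example above (whether min_exp=3 is acceptable is your judgment call):

# Sketch: same call as above, but relaxing the minimum number of perturbation
# experiments required per TF from the default of 5 down to 3.
bysource_minexp3 = dc.benchmark(benchRAPToR, obs1, mat,
                                by='source',
                                min_exp=3,  # default is 5
                                perturb='target_gseq',
                                sign=-1,
                                verbose=True,
                                decouple_kws=decouple_kws)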

Another alternative is to use the new recall metric implemented in 1.7.0, which checks how often the perturbed TF is correctly predicted as perturbed (sign-consistent score and significant p-value). You can run it by source.
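A hedged sketch with the same inputs, assuming decoupler >= 1.7.0 so that 'recall' is available as a metric:

# Sketch: compute the recall metric per source instead of the AUC-based metrics.
recall_bysource = dc.benchmark(benchRAPToR, obs1, mat,
                               metrics=['recall'],
                               by='source',
                               perturb='target_gseq',
                               sign=-1,
                               verbose=True,
                               decouple_kws=decouple_kws)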

Sorry that the verbose output was confusing, I will update it to make it clearer what is going on under the hood, and good luck with your revisions! Let me know if you need anything else ;)

@PauBadiaM PauBadiaM added documentation Improvements or additions to documentation question User question: anything that's not obviously a bug and removed bug Something isn't working labels Aug 14, 2024
@IBMB-MFP
Author

Hi Pau,

Thanks as always for the swift reply.

This makes sense; I hadn't noticed that default argument of the function.

I have updated to decoupler 1.8 and am trying to run the 'recall' metric, but I am quite confused about the output. As with auroc/auprc it is responsive to the min_exp argument - for the purposes of this I have set min_exp to 1, but what I describe below happens whatever min_exp is set to (just with more or fewer TFs).

I add metrics = ['auprc', 'recall'] to the benchmark function call as below:

output = dc.benchmark(benchRAPToR, obs1, mat, metrics = ['auprc', 'recall'], by = 'source', min_exp = 1, perturb = 'target_gseq', sign = -1, verbose = True, decouple_kws = decouple_kws)

However, although the AUPRC values take a continuous range of values for different TFs, the recall metric is always 1... the output is attached.

Is there an alternative way of running the recall metric to serve as a useful alternative here?

recall_bysource.txt

@PauBadiaM
Member

Hi @IBMB-MFP,

Sorry for the late reply, I was on holiday. You can use the use_pval argument of the dc.benchmark function and set it to a desired significance threshold such as 0.05. This determines whether a TF has significant activity or not.
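For instance, reusing the call from the previous comment (a sketch; 0.05 is just one possible threshold):

# Sketch: only count a TF as recalled when its predicted activity is
# sign-consistent and significant at the chosen p-value threshold.
output = dc.benchmark(benchRAPToR, obs1, mat,
                      metrics=['auprc', 'recall'],
                      by='source',
                      min_exp=1,
                      perturb='target_gseq',
                      sign=-1,
                      use_pval=0.05,  # significance threshold for recall
                      verbose=True,
                      decouple_kws=decouple_kws)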
