
sc.metrics module (add confusion matrix & Geary's C methods) #915

Merged (24 commits, Mar 4, 2021)

Conversation

ivirshup
Member

This PR adds a module sc.metrics for functions that are useful calculations but wouldn't modify an anndata object. I'm basing this on sklearn.metrics – namely, how sklearn separates transformers (sc.tl) from measurements. I've started it with two functions, confusion_matrix and gearys_c, but think there are more use cases (e.g. modularity). I'm open to this not being a module, but I think these methods should be available, and I'm not sure where they'd fit within the current API.

My vision for this module is to make it easier to calculate metrics on the values you'd get from the scanpy ecosystem. Methods included would be either (a) not available in other libraries (gearys_c) or (b) available, but with difficult interfaces (confusion_matrix).

sc.metrics.confusion_matrix

Creates a confusion matrix for comparing categorical labels. This is based on sklearn.metrics.confusion_matrix, but is easier to use and returns an object with labels. I think this is mostly done, though I'm considering changing the calling convention. Here's an example of usage:

import scanpy as sc
import seaborn as sns

pbmc = sc.datasets.pbmc68k_reduced()
sc.tl.leiden(pbmc)
sns.heatmap(sc.metrics.confusion_matrix("bulk_labels", "leiden", pbmc.obs))

[image: heatmap of the confusion matrix]

I've copied seaborn's calling convention here, but I think that could change. Right now the above call is equivalent to:

sc.metrics.confusion_matrix(pbmc.obs["bulk_labels"], pbmc.obs["leiden"])

But I wonder if it would make more sense to have the DataFrame go first if it's provided. I've also based the API around my usage of confusion matrices, so I'm very open to more general feedback on this. My reason for including it here was the amount of code it took wrapping sklearn.metrics.confusion_matrix to get useful output.
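For reference, the core tabulation a confusion matrix performs can be sketched in plain Python (a hypothetical minimal version for illustration – not scanpy's or sklearn's implementation):

```python
from collections import Counter

def confusion_matrix_sketch(orig_labels, new_labels):
    """Tabulate how often each original label co-occurs with each new label.

    Rows are original labels, columns are new labels -- a minimal stand-in
    for the labeled output the PR builds on top of sklearn.
    """
    orig_cats = sorted(set(orig_labels))
    new_cats = sorted(set(new_labels))
    counts = Counter(zip(orig_labels, new_labels))
    return {o: {n: counts[o, n] for n in new_cats} for o in orig_cats}

cm = confusion_matrix_sketch(["a", "a", "b"], ["x", "y", "x"])
print(cm)  # {'a': {'x': 1, 'y': 1}, 'b': {'x': 1, 'y': 0}}
```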

sc.metrics.gearys_c (Wiki page)

Calculates autocorrelation of a measure on a network. Used in VISION for ranking gene sets. This is useful for finding out whether some per-cell measure is correlated with the structure of a connectivity graph. In practice, I've found it useful for identifying features that look good on a UMAP:

import numpy as np
pbmc.layers["logcounts"] = pbmc.raw.X

%time gearys_c = sc.metrics.gearys_c(pbmc, layer="logcounts")
# CPU times: user 496 ms, sys: 3.88 ms, total: 500 ms
# Wall time: 74.9 ms
to_plot = pbmc.var_names[np.argsort(gearys_c)[:4]]
sc.pl.umap(pbmc, color=to_plot, ncols=2)

[image: UMAP plots colored by the four top-ranked features]

It can also be useful to rank components of dimensionality reductions (example with ICA: #767 (comment)).
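To make the definition concrete, here is Geary's C computed directly from its formula on a tiny dense graph (a pure-Python sketch for illustration; the PR's implementation operates on the sparse neighbor graph and is far faster):

```python
def gearys_c_dense(w, x):
    """Geary's C straight from the definition.

    w: dense (n, n) edge-weight matrix as nested lists; x: length-n values.
    Illustrative only -- not the PR's implementation.
    """
    n = len(x)
    x_bar = sum(x) / n
    big_w = sum(sum(row) for row in w)  # total edge weight W
    num = sum(
        w[i][j] * (x[i] - x[j]) ** 2
        for i in range(n) for j in range(n)
    )
    den = sum((xi - x_bar) ** 2 for xi in x)
    return (n - 1) * num / (2 * big_w * den)

# A 4-node path graph; values varying smoothly along it are highly
# autocorrelated with the graph structure, so C is well below 1.
w = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
print(gearys_c_dense(w, [1.0, 2.0, 3.0, 4.0]))  # 0.3
```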

@LuckyMD
Contributor

LuckyMD commented Nov 13, 2019

The confusion matrix metric looks very similar to what sc.marker_genes_overlap() does. It might be worth moving that function over as well and potentially merging the functions. What do you think?

@flying-sheep
Member

Nice! Can you please explain your rationale for why they shouldn’t a) be normal tools and b) be saved into the AnnData object?

sc.tl.gearys_c(pbmc, layer="logcounts")
to_plot = pbmc.var_names[np.argsort(pbmc.var.gearys_c)[:4]]

Sure, adding more and more features is a good point at which to think about the API – I’d just like to hear why all current analysis tools belong in tl and these two don’t!

@ivirshup
Member Author

@LuckyMD

There are definitely some similarities, especially for the overlap_count case. I'm planning on adding support for categories encoded as boolean sparse matrices, which would make the calculation even more similar. I don't think they have enough of a common use case to be merged, however, especially since calculation is such a small part of both functions. sc.tl.marker_genes_overlap seems specific to differential expression, while confusion_matrix is specific to comparing sets of labels.

@flying-sheep

Why sc.metrics and not sc.tl? It doesn't feel like it fits. My expectation for most things in sc.tl and sc.pp is that they will take an anndata object and modify it. One of these functions takes an anndata object (optionally) and neither modifies it. I would agree with @LuckyMD that sc.tl.marker_gene_overlap could also fit in sc.metrics. There are possibly more philosophical points to be made about the difference between a "measure" and a "representation" but I'd have to think on that.

As to not modifying the AnnData object: it's not obvious to me how these functions would modify the object in a helpful way. I don't expect the outputs of these functions to have a similar shape to the AnnData object particularly often. Admittedly, the example I included does, but I'm not sure that'll be a common use case. You could whack it in .uns, but then there's either some default key or the user has to specify one. It'd be pretty straightforward for the user to do adata.uns["some key"] = sc.metrics.some_func.

I guess I'd summarize this as: I don't see added value from adding the results to the object – but I do see more complexity in implementing that.

Member

@flying-sheep flying-sheep left a comment

Great! Love your clean implementations.

scanpy/metrics/_gearys_c.py (outdated review thread)
Comment on lines 216 to 232
# TODO: Document better
# TODO: Have scanpydoc work with multipledispatch
Member

Changes to scanpydoc are online fast – do you have something in mind?

Member Author

I'm pretty open to suggestions on this.

We could do something similar to R? There, they give all the signatures and then list the arguments for all of them together.

Is it possible to have multiple Params sections in Sphinx? Otherwise, maybe we find functions which have had multiple dispatch applied, then know to stitch their docstrings together somehow?

I feel like languages that use dispatch more heavily (and not like object.method) tend to have more narrative docs with examples of usage.
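For comparison, the stdlib's functools.singledispatch (single dispatch rather than the multipledispatch library used here) keeps a separate __doc__ on each registered implementation, which a doc tool could collect and stitch together – a small sketch with hypothetical names:

```python
from functools import singledispatch

@singledispatch
def gearys_c_demo(arg, vals=None):
    """Generic entry point (hypothetical): dispatches on the first argument."""
    raise NotImplementedError(type(arg))

@gearys_c_demo.register(list)
def _(arg, vals=None):
    """Implementation for plain lists."""
    return "list implementation"

# Each registered implementation keeps its own __doc__, which a Sphinx
# extension could render as one combined docstring.
for cls, impl in gearys_c_demo.registry.items():
    print(cls.__name__, "->", impl.__doc__)
```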

Member

What’s the current output? Do we at least have the docstring and signature of one of them?

Member Author

@ivirshup ivirshup Nov 29, 2019

?sc.metrics.gearys_c looks like:

Dispatched docstring
Signature:   sc.metrics.gearys_c(*args, **kwargs)
Type:        Dispatcher
String form: <dispatched gearys_c>
File:        /usr/local/lib/python3.7/site-packages/multipledispatch/dispatcher.py
Docstring:
Multiply dispatched method: gearys_c

Inputs: <AnnData>
------------------
Calculate `Geary's C <https://en.wikipedia.org/wiki/Geary's_C>`_, as used
    by `VISION <https://doi.org/10.1038/s41467-019-12235-0>`_.

    Geary's C is a measure of autocorrelation for some measure on a graph. This
    can be used to assess whether measures are correlated between neighboring
    cells. Lower values indicate greater correlation.

    .. math::

        C =
        \frac{
            (N - 1)\sum_{i,j} w_{i,j} (x_i - x_j)^2
        }{
            2W \sum_i (x_i - \bar{x})^2
        }

    Params
    ------
    adata
    vals
        Values to calculate Geary's C for. If this is two dimensional, should
        be of shape `(n_features, n_cells)`. Otherwise should be of shape
        `(n_cells,)`. This matrix can be selected from elements of the anndata
        object by using key word arguments: `layer`, `obsm`, `obsp`, or
        `use_raw`.
    use_graph
        Key to use for graph in anndata object. If not provided, default
        neighbors connectivities will be used instead.
    layer
        Key for `adata.layers` to choose `vals`.
    obsm
        Key for `adata.obsm` to choose `vals`.
    obsp
        Key for `adata.obsp` to choose `vals`.
    use_raw
        Whether to use `adata.raw.X` for `vals`.

    Returns
    -------
    If `vals` is two dimensional, returns a 1-dimensional ndarray. Returns
    a scalar if `vals` is 1d.

Other signatures:
    spmatrix, spmatrix
    csr_matrix, csr_matrix
    spmatrix, ndarray
    spmatrix, DataFrame
    spmatrix, Series

So not that much of a signature. I added a doc-string to one of the other methods and got this:

Two doc-strings
Signature:   sc.metrics.gearys_c(*args, **kwargs)
Type:        Dispatcher
String form: <dispatched gearys_c>
File:        /usr/local/lib/python3.7/site-packages/multipledispatch/dispatcher.py
Docstring:  
Multiply dispatched method: gearys_c

Inputs: <spmatrix, ndarray>
----------------------------
Params
    ------
    g
        Connectivity graph as a scipy sparse matrix. Should have shape:
        `(n_obs, n_obs)`.
    vals
        Values to calculate Geary's C for. If one dimensional, should have
        shape `(n_obs,)`. If two dimensional (i.e calculating Geary's C for
        multiple variables) should have shape `(n_vars, n_obs)`.

Inputs: <AnnData>
------------------
Calculate `Geary's C <https://en.wikipedia.org/wiki/Geary's_C>`_, as used
    by `VISION <https://doi.org/10.1038/s41467-019-12235-0>`_.

    Geary's C is a measure of autocorrelation for some measure on a graph. This
    can be used to assess whether measures are correlated between neighboring
    cells. Lower values indicate greater correlation.

    .. math::

        C =
        \frac{
            (N - 1)\sum_{i,j} w_{i,j} (x_i - x_j)^2
        }{
            2W \sum_i (x_i - \bar{x})^2
        }

    Params
    ------
    adata
    vals
        Values to calculate Geary's C for. If this is two dimensional, should
        be of shape `(n_features, n_cells)`. Otherwise should be of shape
        `(n_cells,)`. This matrix can be selected from elements of the anndata
        object by using key word arguments: `layer`, `obsm`, `obsp`, or
        `use_raw`.
    use_graph
        Key to use for graph in anndata object. If not provided, default
        neighbors connectivities will be used instead.
    layer
        Key for `adata.layers` to choose `vals`.
    obsm
        Key for `adata.obsm` to choose `vals`.
    obsp
        Key for `adata.obsp` to choose `vals`.
    use_raw
        Whether to use `adata.raw.X` for `vals`.

    Returns
    -------
    If `vals` is two dimensional, returns a 1-dimensional ndarray. Returns
    a scalar if `vals` is 1d.

Other signatures:
    spmatrix, spmatrix
    csr_matrix, csr_matrix
    spmatrix, DataFrame
    spmatrix, Series

Member Author

Scanpydoc throws an error if I try to add this to the docs. The full log is below, but I think these are the relevant parts of the traceback:

  File "/usr/local/lib/python3.7/site-packages/sphinx/events.py", line 103, in emit
    results.append(callback(self.app, *args))
  File "/usr/local/lib/python3.7/site-packages/scanpydoc/elegant_typehints/return_tuple.py", line 46, in process_docstring
    ret_types = get_tuple_annot(get_type_hints(obj).get("return"))
  File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/typing.py", line 996, in get_type_hints
    'or function.'.format(obj))
TypeError: <dispatched gearys_c> is not a module, class, method, or function.
Full error
# Sphinx version: 2.2.1
# Python version: 3.7.5 (CPython)
# Docutils version: 0.15.2 release
# Jinja2 version: 2.10.1
# Last messages:
#   loading pickled environment...
#   done
#   [autosummary] generating autosummary for: api/index.rst, api/plotting.rst, api/scanpy.Neighbors.compute_eigen.rst, api/scanpy.Neighbors.compute_neighbors.rst, api/scanpy.Neighbors.compute_transitions.rst, api/scanpy.Neighbors.connectivities.rst, api/scanpy.Neighbors.distances.rst, api/scanpy.Neighbors.distances_dpt.rst, api/scanpy.Neighbors.eigen_basis.rst, api/scanpy.Neighbors.eigen_values.rst, ..., external/scanpy.external.tl.cyclone.rst, external/scanpy.external.tl.palantir.rst, external/scanpy.external.tl.phate.rst, external/scanpy.external.tl.phenograph.rst, external/scanpy.external.tl.sandbag.rst, index.rst, installation.rst, references.rst, release_notes.rst, tutorials.rst
#   [autosummary] generating autosummary for: /Users/isaac/github/scanpy/docs/api/scanpy.metrics.confusion_matrix.rst, /Users/isaac/github/scanpy/docs/api/scanpy.metrics.gearys_c.rst
#   building [mo]: targets for 0 po files that are out of date
#   building [html]: targets for 6 source files that are out of date
#   updating environment:
#   [extensions changed ('5')]
#   242 added, 0 changed, 0 removed
#   reading sources... [  0%] api/index
# Loaded extensions:
#   sphinx.ext.mathjax (2.2.1) from /usr/local/lib/python3.7/site-packages/sphinx/ext/mathjax.py
#   sphinxcontrib.applehelp (1.0.1) from /usr/local/lib/python3.7/site-packages/sphinxcontrib/applehelp/__init__.py
#   sphinxcontrib.devhelp (1.0.1) from /usr/local/lib/python3.7/site-packages/sphinxcontrib/devhelp/__init__.py
#   sphinxcontrib.htmlhelp (1.0.2) from /usr/local/lib/python3.7/site-packages/sphinxcontrib/htmlhelp/__init__.py
#   sphinxcontrib.serializinghtml (1.1.3) from /usr/local/lib/python3.7/site-packages/sphinxcontrib/serializinghtml/__init__.py
#   sphinxcontrib.qthelp (1.0.2) from /usr/local/lib/python3.7/site-packages/sphinxcontrib/qthelp/__init__.py
#   alabaster (0.7.12) from /usr/local/lib/python3.7/site-packages/alabaster/__init__.py
#   sphinx.ext.autodoc (2.2.1) from /usr/local/lib/python3.7/site-packages/sphinx/ext/autodoc/__init__.py
#   sphinx.ext.intersphinx (2.2.1) from /usr/local/lib/python3.7/site-packages/sphinx/ext/intersphinx.py
#   sphinx.ext.doctest (2.2.1) from /usr/local/lib/python3.7/site-packages/sphinx/ext/doctest.py
#   sphinx.ext.coverage (2.2.1) from /usr/local/lib/python3.7/site-packages/sphinx/ext/coverage.py
#   sphinx.ext.napoleon (2.2.1) from /usr/local/lib/python3.7/site-packages/sphinx/ext/napoleon/__init__.py
#   sphinx.ext.autosummary (2.2.1) from /usr/local/lib/python3.7/site-packages/sphinx/ext/autosummary/__init__.py
#   sphinx_autodoc_typehints (unknown version) from /usr/local/lib/python3.7/site-packages/sphinx_autodoc_typehints.py
#   scanpydoc.autosummary_generate_imported (0.4.4) from /usr/local/lib/python3.7/site-packages/scanpydoc/autosummary_generate_imported.py
#   scanpydoc.definition_list_typed_field (0.4.4) from /usr/local/lib/python3.7/site-packages/scanpydoc/definition_list_typed_field.py
#   scanpydoc.elegant_typehints (0.4.4) from /usr/local/lib/python3.7/site-packages/scanpydoc/elegant_typehints/__init__.py
#   scanpydoc.rtd_github_links (0.4.4) from /usr/local/lib/python3.7/site-packages/scanpydoc/rtd_github_links.py
#   scanpydoc (0.4.4) from /usr/local/lib/python3.7/site-packages/scanpydoc/__init__.py
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/sphinx/cmd/build.py", line 276, in build_main
    app.build(args.force_all, filenames)
  File "/usr/local/lib/python3.7/site-packages/sphinx/application.py", line 346, in build
    self.builder.build_update()
  File "/usr/local/lib/python3.7/site-packages/sphinx/builders/__init__.py", line 299, in build_update
    len(to_build))
  File "/usr/local/lib/python3.7/site-packages/sphinx/builders/__init__.py", line 311, in build
    updated_docnames = set(self.read())
  File "/usr/local/lib/python3.7/site-packages/sphinx/builders/__init__.py", line 418, in read
    self._read_serial(docnames)
  File "/usr/local/lib/python3.7/site-packages/sphinx/builders/__init__.py", line 439, in _read_serial
    self.read_doc(docname)
  File "/usr/local/lib/python3.7/site-packages/sphinx/builders/__init__.py", line 479, in read_doc
    doctree = read_doc(self.app, self.env, self.env.doc2path(docname))
  File "/usr/local/lib/python3.7/site-packages/sphinx/io.py", line 326, in read_doc
    pub.publish()
  File "/usr/local/lib/python3.7/site-packages/docutils/core.py", line 217, in publish
    self.settings)
  File "/usr/local/lib/python3.7/site-packages/sphinx/io.py", line 114, in read
    self.parse()
  File "/usr/local/lib/python3.7/site-packages/docutils/readers/__init__.py", line 77, in parse
    self.parser.parse(self.input, document)
  File "/usr/local/lib/python3.7/site-packages/sphinx/parsers.py", line 95, in parse
    self.statemachine.run(inputlines, document, inliner=self.inliner)
  File "/usr/local/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 171, in run
    input_source=document['source'])
  File "/usr/local/lib/python3.7/site-packages/docutils/statemachine.py", line 239, in run
    context, state, transitions)
  File "/usr/local/lib/python3.7/site-packages/docutils/statemachine.py", line 460, in check_line
    return method(match, context, next_state)
  File "/usr/local/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 2771, in underline
    self.section(title, source, style, lineno - 1, messages)
  File "/usr/local/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 327, in section
    self.new_subsection(title, lineno, messages)
  File "/usr/local/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 395, in new_subsection
    node=section_node, match_titles=True)
  File "/usr/local/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 282, in nested_parse
    node=node, match_titles=match_titles)
  File "/usr/local/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 196, in run
    results = StateMachineWS.run(self, input_lines, input_offset)
  File "/usr/local/lib/python3.7/site-packages/docutils/statemachine.py", line 239, in run
    context, state, transitions)
  File "/usr/local/lib/python3.7/site-packages/docutils/statemachine.py", line 460, in check_line
    return method(match, context, next_state)
  File "/usr/local/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 2771, in underline
    self.section(title, source, style, lineno - 1, messages)
  File "/usr/local/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 327, in section
    self.new_subsection(title, lineno, messages)
  File "/usr/local/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 395, in new_subsection
    node=section_node, match_titles=True)
  File "/usr/local/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 282, in nested_parse
    node=node, match_titles=match_titles)
  File "/usr/local/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 196, in run
    results = StateMachineWS.run(self, input_lines, input_offset)
  File "/usr/local/lib/python3.7/site-packages/docutils/statemachine.py", line 239, in run
    context, state, transitions)
  File "/usr/local/lib/python3.7/site-packages/docutils/statemachine.py", line 460, in check_line
    return method(match, context, next_state)
  File "/usr/local/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 2344, in explicit_markup
    nodelist, blank_finish = self.explicit_construct(match)
  File "/usr/local/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 2356, in explicit_construct
    return method(self, expmatch)
  File "/usr/local/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 2099, in directive
    directive_class, match, type_name, option_presets)
  File "/usr/local/lib/python3.7/site-packages/docutils/parsers/rst/states.py", line 2148, in run_directive
    result = directive_instance.run()
  File "/usr/local/lib/python3.7/site-packages/sphinx/ext/autosummary/__init__.py", line 240, in run
    items = self.get_items(names)
  File "/usr/local/lib/python3.7/site-packages/sphinx/ext/autosummary/__init__.py", line 349, in get_items
    documenter.add_content(None)
  File "/usr/local/lib/python3.7/site-packages/sphinx/ext/autodoc/__init__.py", line 471, in add_content
    for i, line in enumerate(self.process_doc(docstrings)):
  File "/usr/local/lib/python3.7/site-packages/sphinx/ext/autodoc/__init__.py", line 441, in process_doc
    self.options, docstringlines)
  File "/usr/local/lib/python3.7/site-packages/sphinx/application.py", line 450, in emit
    return self.events.emit(event, *args)
  File "/usr/local/lib/python3.7/site-packages/sphinx/events.py", line 103, in emit
    results.append(callback(self.app, *args))
  File "/usr/local/lib/python3.7/site-packages/scanpydoc/elegant_typehints/return_tuple.py", line 46, in process_docstring
    ret_types = get_tuple_annot(get_type_hints(obj).get("return"))
  File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/typing.py", line 996, in get_type_hints
    'or function.'.format(obj))
TypeError: <dispatched gearys_c> is not a module, class, method, or function.

Member

I can’t really see the signature because of the corrupted ANSI escape sequences. Can you copy it from a terminal that lets you copy plain text?

the fix in scanpydoc seems easy!

Member Author

I updated the headers for the docstrings in the examples I gave.

Member

OK, that looks good! Can you test it with theislab/scanpydoc#12?

Member Author

I've responded on that PR.

scanpy/tests/test_metrics.py (two review threads, one outdated)
scanpy/metrics/_metrics.py (two outdated review threads)
scanpy/metrics/_gearys_c.py (three outdated review threads)
Member

@flying-sheep flying-sheep left a comment

OK great! TODO:

  • get multipledispatch support into scanpydoc
  • add changelog entry

@ivirshup
Member Author

get multipledispatch support into scanpydoc

Would you mind handling that? I have zero experience with sphinx extensions.

@ivirshup
Member Author

Also, two API thoughts:

For sc.metrics.gearys_c(a: "array", b: "array"), a 2d b is expected to have a shape like (n_variables, n_cells) – the ufunc shape signature would be (m,m),(n,m)->(n,). This is because the computation needs fast access to each variable, so variables correspond to rows. The length of the returned array also depends on the first axis of the input. Is this intuitive, or should the input be transposed?
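Concretely, under this convention a 2d input has variables along the first axis and the output has one entry per variable – a trivial pure-Python illustration with a stand-in reduction (not the real Geary's C computation):

```python
# vals has shape (n_variables, n_cells): each row is one variable.
def per_variable_scores(vals):
    # Stand-in per-row reduction; the real function would compute
    # Geary's C for each row against a shared connectivity graph.
    return [sum(row) / len(row) for row in vals]

vals = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # 2 variables, 3 cells
print(per_variable_scores(vals))  # [2.0, 5.0] -- length matches the first axis
```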

Second, for confusion_matrix, I'm thinking I should make it singly dispatched on the first argument. This way, if a dataframe is passed, the next two arguments could correspond to keys in that dataframe; otherwise, vectors can be passed directly. Under that, these calls would be equivalent:

sc.metrics.confusion_matrix(adata.obs, "sample_labels", "leiden")
sc.metrics.confusion_matrix(adata.obs["sample_labels"], adata.obs["leiden"])

Right now it has the seaborn-style argument handling shown at the top of this PR. I'm not sure that's really caught on in other packages or fits with scanpy.
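A single-dispatch version of that calling convention could be sketched like this (hypothetical names, not the PR's code; a plain dict stands in for the DataFrame to keep the sketch dependency-free):

```python
from functools import singledispatch

@singledispatch
def confusion_matrix_api(x, y):
    """Fallback: two label vectors passed directly."""
    return list(zip(x, y))

@confusion_matrix_api.register(dict)  # stand-in for pd.DataFrame
def _(data, x_key, y_key):
    """If a mapping comes first, the next two arguments are keys into it."""
    return confusion_matrix_api(data[x_key], data[y_key])

obs = {"sample_labels": ["a", "b"], "leiden": ["0", "1"]}
# The two proposed call styles give the same result:
assert (confusion_matrix_api(obs, "sample_labels", "leiden")
        == confusion_matrix_api(obs["sample_labels"], obs["leiden"]))
```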

@LuckyMD
Contributor

LuckyMD commented Dec 1, 2019

Do you reckon it makes sense to make sc.tl.marker_genes_overlap() use this code internally to reduce code redundancy?

@ivirshup
Member Author

ivirshup commented Dec 2, 2019

Nah, I think it'd end up being at least the same amount of code converting values into different types.

@adamgayoso
Member

adamgayoso commented Jan 20, 2021

Any updates on this PR? @ivirshup, I know some people in the lab really like your Geary's C implementation. Any thoughts on making it a small standalone package?

Also I do like the idea of the sc.metrics module if it makes more sense here.

@ivirshup ivirshup added this to the 1.8.0 milestone Jan 20, 2021
@ivirshup
Member Author

@adamgayoso, I should definitely get around to merging this. I think I can pretty much do it as is, and open a second issue for getting the docs looking good. I'd like to target an initial metrics module for 1.8 (we're working on upping the release cadence as well).

Question for your lab: are our implementations equivalent? I haven't actually gotten around to testing against the VISION R/C++ version.

@adamgayoso
Member

I'd have to check with @deto or @mattjones315 about equivalency.

@adamgayoso
Member

Also, as an aside, would it be appropriate to include some of @LuckyMD's scIB integration metrics here? It would give people easier access and probably expand general use.

@LuckyMD
Contributor

LuckyMD commented Jan 20, 2021

At the moment we're trying to clean up scIB so that it becomes easier to use. We're still not certain how best to deal with metrics that rely on R and C++ code, though. The current plan is to make a more usable PyPI package where some metrics give you a warning about additional requirements/manual C++ compilation. Apologies for the usability mess that a package that also assesses usability has become ^^. I'd prefer to keep it separate for now to facilitate maintenance and citation, though.

That being said, maybe we could think about an optional requirement on scIB to integrate them? At least once we've cleaned up our side of things.

@ivirshup
Member Author

ivirshup commented Jan 21, 2021

@LuckyMD, I've actually been meaning to ask you about this. I'd like to include at least some of those metrics here. In some cases this would mean writing numba implementations (which would ultimately be easier to distribute). There would definitely be a prioritization by general usefulness of the metric.

What do you think about that?

@LuckyMD
Contributor

LuckyMD commented Jan 21, 2021

A numba reimplementation of some of the metrics sounds pretty awesome, actually. That's out of scope for scIB at the moment. We didn't bother with parallelization for most of the metrics (beyond what was already implemented in sc.tl.louvain and the sklearn dependencies), as the slowest ones were in R anyway (and now also C++ with our LISI update). I'd really welcome that. I can help where I can, although I'm not so familiar with numba.

@adamgayoso
Member

adamgayoso commented Jan 21, 2021

Just want to say that even a first release with some of the easiest-to-implement metrics could lead to more widespread use and IMO would generally be appreciated by the community. Besides, it seems like a perfect fit for this scanpy module as I understand it.

Though I do understand the citation issue. Maybe it's time for a global citation table and each function can add to the table if there is an appropriate citation?! Maybe it could be accessed with sc.citation_table and displays which function calls used which paper's methods.

@ivirshup
Member Author

ivirshup commented Jan 21, 2021

So, a question from a user standpoint:

Is it worth it for us to include the really, really easy-to-implement metrics – the ones where we'd basically just be wrapping scikit-learn? I think this fits with the idea of scanpy's contents being curatorial to some extent.

Though I do understand the citation issue.

It's definitely good to have a citation in the docstring for each function.

For the docs of the metrics module, I think there would be a subsection for "Integration metrics" which could definitely point to scIB as a more comprehensive package for evaluating integration.

Maybe it's time for a global citation table and each function can add to the table if there is an appropriate citation?!

Are you suggesting that the table would be added to at runtime (when a function is called)? I think this may be better addressed by a broader solution to "what has been done to this dataset?". I'm not sure how this could be done without buy-in from third-party libraries. It has also been discussed a bit previously: #472.

@adamgayoso
Member

Are you suggesting that the table would be added to at runtime (when a function is called)? I think this may be better addressed by a broader solution to "what has been done to this dataset?". I'm not sure how this could be done without buy in from third party libraries. Also has been discussed a bit previously: #472.

Yeah, I suppose, though I can see how this gets complicated by the fact that I imagine more people than just me use multiple h5ads throughout their analysis of the same dataset. Though maybe something simpler would be a global table of functions and citations that gives you the bibtex. But if there's any interest in it (a bit of a weird idea, I understand), I can make an issue for it.

@ivirshup
Member Author

Though maybe something simpler is to be able to access a global table of functions and citations, and it gives you the bibtex.

@adamgayoso From my point of view, references are already available (source, rendered) and linked to in the documentation. I'm not against the idea; I'm just not seeing how it makes this information more accessible/prominent. A separate issue would be a good place for more discussion on the topic.


Would really welcome that. I can help where I can, although not so familiar with numba.

@LuckyMD, meant to say, numba is super easy, it's really just python. Next best thing to Julia. Definitely worth some time to learn, but also it won't take that much time to learn.

Add a module for computing useful metrics. Started off with Geary's C since I'm using it and finding it useful. I've also got a fairly fast way to calculate it worked out.

Unfortunately my implementation runs into some issues with some global configs set by umap (see lmcinnes/umap#306), so I'm going to see if that can be resolved before changing it.
* Work around lmcinnes/umap#306 by not
  calling out to kernel function. That code has been kept, but commented
  out.
* Increase numerical stability by casting data to system width. Tests
  were failing due to instability.
@ivirshup
Member Author

ivirshup commented Mar 2, 2021

I think this is currently blocked by numba/numba#6774. It's a weird bug: some code just doesn't execute, unless I swap the prange out for a range, in which case it errors. Unless I add an expression that does nothing – then it can work, except it's doing the expensive computation again 🤯.

It looks like this won't be solved by the next numba release, so working around it will be necessary for timely inclusion.

@ivirshup ivirshup changed the title [RFC] sc.metrics module (add confusion matrix & Geary's C methods) sc.metrics module (add confusion matrix & Geary's C methods) Mar 3, 2021
@ivirshup ivirshup merged commit b3a2a6d into scverse:master Mar 4, 2021
Zethson pushed a commit that referenced this pull request Mar 15, 2021
* Add `sc.metrics` with `gearys_c`

Add a module for computing useful metrics. Started off with Geary's C since I'm using it and finding it useful. I've also got a fairly fast way to calculate it worked out.

Unfortunately my implementation runs into some issues with some global configs set by umap (see lmcinnes/umap#306), so I'm going to see if that can be resolved before changing it.

* Add sc.metrics.confusion_matrix

* Better tests and output for confusion_matrix

* Workaround umap<0.4 and increase numerical stability of gearys_c

* Work around lmcinnes/umap#306 by not
  calling out to kernel function. That code has been kept, but commented
  out.
* Increase numerical stability by casting data to system width. Tests
  were failing due to instability.

* Split up gearys_c tests

* Improved unexpected error message

* gearys_c working again. Sadly, a bit slower

* One option for doc strings

* Simplify implementation to use single dispatch

* release notes
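The `confusion_matrix` addition listed above can be sketched in plain Python (a hypothetical standalone version for illustration; the scanpy function wraps sklearn.metrics.confusion_matrix and returns a labeled DataFrame):

```python
from collections import Counter

def confusion_matrix(orig, new):
    """Return sorted row labels, sorted column labels, and a matrix
    where entry [i][j] counts observations labeled rows[i] in `orig`
    and cols[j] in `new`."""
    counts = Counter(zip(orig, new))  # missing pairs count as 0
    rows = sorted(set(orig))
    cols = sorted(set(new))
    matrix = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, matrix
```

Keeping the labels attached to the counts is the main convenience over calling sklearn directly, where the row/column order has to be tracked separately.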
ivirshup added a commit that referenced this pull request Mar 18, 2021
* add flake8 pre-commit

Signed-off-by: Zethson <[email protected]>

* fix pre-commit

Signed-off-by: Zethson <[email protected]>

* add E402 to flake8 ignore

Signed-off-by: Zethson <[email protected]>

* revert neighbors

Signed-off-by: Zethson <[email protected]>

* fix flake8

Signed-off-by: Zethson <[email protected]>

* address review

Signed-off-by: Zethson <[email protected]>

* fix comment character in .flake8

Signed-off-by: Zethson <[email protected]>

* fix test

Signed-off-by: Zethson <[email protected]>

* black

Signed-off-by: Zethson <[email protected]>

* review round 2

Signed-off-by: Zethson <[email protected]>

* review round 3

Signed-off-by: Zethson <[email protected]>

* readded double comments

Signed-off-by: Zethson <[email protected]>

* Ignoring E262 & reverted comment

Signed-off-by: Zethson <[email protected]>

* using self for obs_tidy

Signed-off-by: Zethson <[email protected]>

* Restore setup.py

* rm call of black test (#1690)

* Fix print_versions for python<3.8 (#1691)

* add codecov so we can have a badge to point to (#1693)

* Attempt server-side search (#1672)

* Fix paga_path (#1047)

Fix paga_path

Co-authored-by: Isaac Virshup <[email protected]>

* Switch to flit

This reverts commit d645790

* add setup.py while leaving it ignored

* Update install instructions

* Circumvent new pip check (see pypa/pip#9628)

* Go back to regular pip (#1702)

* Go back to regular flit

Co-authored-by: Isaac Virshup <[email protected]>

* codecov comment (#1704)

* Use joblib for parallelism in regress_out (#1695)

* Use joblib for parallelism in regress_out

* release note

* fix link in release notes

* Add todo for resource test

* Add sparsification step before sparse-dependent Scrublet calls (#1707)

* Add sparsification step before sparse-dependent Scrublet calls

* Apply sparsification suggestion

Co-authored-by: Isaac Virshup <[email protected]>

* Fix imports

Co-authored-by: Isaac Virshup <[email protected]>

* Fix version on Travis (#1713)

By default, Travis does `git clone --depth=50` which means the version can’t be detected from the git tag.

* `sc.metrics` module (add confusion matrix & Geary's C methods) (#915)


* Fix clipped images in docs (#1717)

* Cleanup normalize_total (#1667)

* Cleanup normalize_total

* Add modification tests and copy kwarg for normalize_total

* Test that 'layers' argument is deprecated

* Added more mutation checks for normalize_total

* release note

* Error message

* deprecate scvi (#1703)

* deprecate scvi

* Update .azure-pipelines.yml

Co-authored-by: Isaac Virshup <[email protected]>

* remove :func: links to scvi in release notes

* remove tildes in front of scvi in release notes

* Update docs/release-notes/1.5.0.rst

Co-authored-by: Michael Jayasuriya <[email protected]>
Co-authored-by: Isaac Virshup <[email protected]>

* updated ecosystem.rst to add triku (#1722)

* Minor addition to contributing docs (#1726)

* Preserve category order when groupby is a list (#1735)

Preserve category order when groupby is a list

* Asymmetrical diverging colormaps and vcenter (#1551)

Add vcenter and norm arguments to plotting functions

* add flake8 pre-commit

Signed-off-by: Zethson <[email protected]>

* add E402 to flake8 ignore

Signed-off-by: Zethson <[email protected]>

* revert neighbors

Signed-off-by: Zethson <[email protected]>

* address review

Signed-off-by: Zethson <[email protected]>

* black

Signed-off-by: Zethson <[email protected]>

* using self for obs_tidy

Signed-off-by: Zethson <[email protected]>

* rebased

Signed-off-by: Zethson <[email protected]>

* rebasing

Signed-off-by: Zethson <[email protected]>

* rebasing

Signed-off-by: Zethson <[email protected]>

* rebasing

Signed-off-by: Zethson <[email protected]>

* add flake8 to dev docs

Signed-off-by: Zethson <[email protected]>

* add autopep8 to pre-commits

Signed-off-by: Zethson <[email protected]>

* add flake8 ignore docs

Signed-off-by: Zethson <[email protected]>

* add exception todos

Signed-off-by: Zethson <[email protected]>

* add ignore directories

Signed-off-by: Zethson <[email protected]>

* reinstated lambdas

Signed-off-by: Zethson <[email protected]>

* fix tests

Signed-off-by: Zethson <[email protected]>

* fix tests

Signed-off-by: Zethson <[email protected]>

* fix tests

Signed-off-by: Zethson <[email protected]>

* fix tests

Signed-off-by: Zethson <[email protected]>

* fix tests

Signed-off-by: Zethson <[email protected]>

* Add E741 to allowed flake8 violations.

Co-authored-by: Isaac Virshup <[email protected]>

* Add F811 flake8 ignore for tests

Co-authored-by: Isaac Virshup <[email protected]>

* Fix mask comparison

Co-authored-by: Isaac Virshup <[email protected]>

* Fix mask comparison

Co-authored-by: Isaac Virshup <[email protected]>

* fix flake8 config file

Signed-off-by: Zethson <[email protected]>

* readded autopep8

Signed-off-by: Zethson <[email protected]>

* import Literal

Signed-off-by: Zethson <[email protected]>

* revert literal import

Signed-off-by: Zethson <[email protected]>

* fix scatterplot pca import

Signed-off-by: Zethson <[email protected]>

* false comparison & unused vars

Signed-off-by: Zethson <[email protected]>

* Add cleaner level determination

Co-authored-by: Isaac Virshup <[email protected]>

* Fix comment formatting

Co-authored-by: Isaac Virshup <[email protected]>

* Add smoother dev documentation

Co-authored-by: Isaac Virshup <[email protected]>

* fix flake8

Signed-off-by: Zethson <[email protected]>

* Readd long comment

Co-authored-by: Isaac Virshup <[email protected]>

* Assuming X as array like

Co-authored-by: Isaac Virshup <[email protected]>

* fix flake8

Signed-off-by: Zethson <[email protected]>

* fix flake8 config

Signed-off-by: Zethson <[email protected]>

* reverted rank_genes

Signed-off-by: Zethson <[email protected]>

* fix disp_mean_bin formatting

Co-authored-by: Isaac Virshup <[email protected]>

* fix formatting

Signed-off-by: Zethson <[email protected]>

* add final todos

Signed-off-by: Zethson <[email protected]>

* boolean checks with is

Signed-off-by: Zethson <[email protected]>

* _dpt formatting

Signed-off-by: Zethson <[email protected]>

* literal fixes

Signed-off-by: Zethson <[email protected]>

* links to leafs

Signed-off-by: Zethson <[email protected]>

* revert paga variable naming

Co-authored-by: Philipp A <[email protected]>
Co-authored-by: Sergei Rybakov <[email protected]>
Co-authored-by: Isaac Virshup <[email protected]>
Co-authored-by: Jonathan Manning <[email protected]>
Co-authored-by: mjayasur <[email protected]>
Co-authored-by: Michael Jayasuriya <[email protected]>
Co-authored-by: Alex M. Ascensión <[email protected]>
Co-authored-by: Gökçen Eraslan <[email protected]>