Tracking operations performed on AnnData objects #472

Open
afrendeiro opened this issue Feb 10, 2019 · 6 comments
@afrendeiro

When exploring various options for preprocessing data, I try to avoid having several copies of AnnData objects in memory if they're not sparse, so I save them to h5ad at key steps. Alas, after a few iterations I sometimes overwrite things and forget which operations have been performed on my "X" (particularly in the preprocessing steps).

So, because being lazy makes me creative, I started tracking these in the object itself (see example https://gist.github.com/afrendeiro/7ccaf324bfdbff042ae36f734f544860) by decorating the preprocessing functions post hoc (this could easily be extended to save the values of the kwargs passed, too).

I wonder if an internal implementation of this would be of broad interest, particularly for functions which modify "X" in place? Of course this would be no replacement for proper documentation of one's steps, but I thought it could be an interesting addition to scanpy in any case.
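
In spirit, the idea is a decorator along these lines (a minimal sketch, not the gist's exact code; the `.uns` key and string format are illustrative):

from functools import wraps

def tracked(func):
    @wraps(func)
    def wrapper(adata, *args, **kwargs):
        # Record the call in the object itself, under .uns, so the record
        # travels with the AnnData when it's written to h5ad. Stored as
        # plain strings so it serializes cleanly.
        adata.uns.setdefault("operations", []).append(
            f"{func.__name__}(kwargs={kwargs!r})"
        )
        return func(adata, *args, **kwargs)
    return wrapper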

@ivirshup
Member

I like this idea, but I think it could be expanded on a bit. Some advantages of approaching this through logging:

  • Global record. If a copy is made or an AnnData is split, you could figure out which object came from where.
  • Control over level of detail. What kind of information is recorded can be customized: maybe the user wants provenance, maybe they want performance information.

What if tracking were done through logging? Here are a couple of quick examples of what I mean:

Simple example: logs the `AnnData` used, the function called, and the time elapsed:
from anndata import AnnData
from datetime import datetime
from functools import wraps
from structlog import get_logger
from time import sleep
import uuid

logger = get_logger()

def logged(func):
    @wraps(func)
    def func_wrapper(*args, **kwargs):
        call_id = uuid.uuid4() # So we can always match call start with call end
        call_start_record = dict(call_id=call_id, called_func=func.__name__)
        if args and isinstance(args[0], AnnData):  # guard against calls with no positional args
            call_start_record["adata_id"] = id(args[0])
        logger.msg("call", **call_start_record)

        t0 = datetime.now()
        output = func(*args, **kwargs)
        dt = datetime.now() - t0

        call_finish_record = dict(called_func=func.__name__, elapsed=dt)
        if isinstance(output, AnnData):
            call_finish_record["returned_adata_id"] = id(output)
        logger.msg("call_finish", **call_finish_record, call_id=call_id)
        return output
    return func_wrapper

# Usage

@logged
def foo(adata, x, copy=False):
    sleep(0.5)
    if copy: return adata.copy()

import scanpy as sc
pbmcs = sc.datasets.pbmc68k_reduced()

foo(pbmcs, 1)
# 2019-02-13 19:27.58 call                           adata_id=4937049368 call_id=UUID('82f3944c-08c1-470a-9d39-03dcabc091a2') called_func=foo
# 2019-02-13 19:27.58 call_finish                    call_id=UUID('82f3944c-08c1-470a-9d39-03dcabc091a2') called_func=foo elapsed=datetime.timedelta(microseconds=500777)
foo(pbmcs, 1, copy=True);
# 2019-02-13 19:28.02 call                           adata_id=4937049368 call_id=UUID('986f57e4-656a-41b1-9c7c-a7c5ad5b01fc') called_func=foo
# 2019-02-13 19:28.03 call_finish                    call_id=UUID('986f57e4-656a-41b1-9c7c-a7c5ad5b01fc') called_func=foo elapsed=datetime.timedelta(microseconds=505970) returned_adata_id=4940502352

More complicated example, with argument value logging:
from anndata import AnnData
from copy import copy
from datetime import datetime
from functools import wraps
import inspect
from itertools import chain
from structlog import get_logger
from time import sleep
import uuid

logger = get_logger()

def logged(logged_args=None):
    """
    Params
    ------
    logged_args : list[str], optional (default: `None`)
        Names of arguments to log.
    """
    if logged_args is None:
        logged_args = []
    def logged_decorator(func):
        argnames = inspect.getfullargspec(func).args
        @wraps(func)
        def func_wrapper(*args, **kwargs):
            call_id = uuid.uuid4() # So we can always match call start with call end
            logged_params = {}

            for param, val in chain(zip(argnames, args), kwargs.items()):
                if isinstance(val, AnnData):
                    logged_params[param] = id(val)
                elif param in logged_args:
                    logged_params[param] = copy(val) # Probably need to consider how these values get logged

            logger.msg("call", called_func=func.__name__,
                       logged_args=logged_params, call_id=call_id)

            t0 = datetime.now()
            output = func(*args, **kwargs)
            dt = datetime.now() - t0

            call_finish_record = dict(
                called_func=func.__name__, elapsed=dt,
            )
            if isinstance(output, AnnData):
                call_finish_record["returned_adata_id"] = id(output)
            logger.msg("call_finish", **call_finish_record, call_id=call_id)
            return output
        return func_wrapper
    return logged_decorator

# Usage

@logged(logged_args=["x"])
def foo(adata, x, copy=True):
    sleep(0.5)
    if copy: return adata.copy()

import scanpy as sc
pbmcs = sc.datasets.pbmc68k_reduced()

foo(pbmcs, 1, copy=True)
# 2019-02-13 19:35.48 call                           call_id=UUID('f7623504-31c5-4afa-ae26-4f58fc5341a8') called_func=foo logged_args={'adata': 4476410456, 'x': 1}
# 2019-02-13 19:35.48 call_finish                    call_id=UUID('f7623504-31c5-4afa-ae26-4f58fc5341a8') called_func=foo elapsed=datetime.timedelta(microseconds=507880) returned_adata_id=5064494384

@afrendeiro
Author

Interesting ideas. I actually hadn't noticed that scanpy has no logging implemented - this would indeed be useful and could already solve half the problem. However, I doubt the best way to go about this is post hoc with decorators etc.; rather, it should be built intrinsically into the various API functions.

Regardless of logging, I still think that having something intrinsically attached to the object would have the advantage of knowing the exact set of operations solely from the h5ad file/AnnData object itself. I don't know whether people out there are actually sharing these objects or not, but it could be useful from that perspective too.
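
For example, assuming records were stored under `.uns` as in the sketch above, anyone receiving the file could recover them directly (file name hypothetical):

import scanpy as sc

adata = sc.read("01_simple_process.h5ad")  # the shared file
print(adata.uns.get("operations", []))     # the records travel with the object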

@ivirshup
Member

Scanpy does have logging implemented (examples: neighbors, highly variable genes), but it's not that widely used. I think this is because it has to be implemented manually in the code (not sure if this is what you mean by "intrinsic"?), which means it takes some effort to implement and not all contributors are aware of it.

I think using a decorator would be nice for abstracting out the process. This would have the benefits of consistent usage (by making it easy), consistent log messages, and a separation of concerns between computation and tracking.

I also think you'd be able to know the exact set of operations from this approach. Assuming all top-level functions have been wrapped with a decorator like the one I presented above, this code:

adata = sc.read_10x_h5("./10x_run/outs/filtered_gene_matrix.h5")
sc.pp.normalize_per_cell(adata, 1000)
sc.pp.log1p(adata)
sc.pp.pca(adata)
adata.write("./cache/01_simple_process.h5ad")

should result in a set of (pseudo-)records like:

# Where id(1) is a stand-in for a value like `id(adata)`
{"call": "read_10x_h5", "args": {"filename": "./10x_run/outs/filtered_gene_matrix.h5"}, "returned_adata": id(1)}
{"call": "normalize_per_cell", "args": {"counts_per_cell_after": 1000}, "adata_id": id(1)}
{"call": "log1p", "adata_id": id(1)}
{"call": "pca", "adata_id": id(1)}
{"call": "write", "args": {"filename": "./cache/01_simple_process.h5ad"}, "adata_id": id(1)}

It's pretty trivial to go through these logs and figure out what happened to the AnnData, and this could be made accessible through helper functions. Maybe they'd look like `sc.logging.get_operations(adata_id=id(adata))` or `sc.logging.get_operations(written_to="./cache/01_simple_process.h5ad")`. There could also be a helper function to add the relevant records to some field in `.uns` of the relevant AnnData object, or a setting which has a log handler do that automatically.
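
As a sketch of what the first helper could look like (hypothetical; it just filters a list of record dicts shaped like the pseudo-records above):

def get_operations(records, adata_id=None, written_to=None):
    if written_to is not None:
        # Resolve which object was written to that file
        adata_id = next(
            r["adata_id"] for r in records
            if r["call"] == "write"
            and r.get("args", {}).get("filename") == written_to
        )
    # Keep records where the object was the input or the output of the call
    return [
        r for r in records
        if r.get("adata_id") == adata_id or r.get("returned_adata") == adata_id
    ]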

Is there some set of information this wouldn't capture?

@afrendeiro
Author

Sorry, I missed the logging. I also didn't see the `sc.settings.logfile` option, which obviously makes absolute sense and is convenient for keeping persistent records when working interactively with AnnData objects. Just having more consistent logging across scanpy functions would already be really great.
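
For reference, routing scanpy's log output to a file is a one-liner (path illustrative):

import scanpy as sc
sc.settings.logfile = "./analysis.log"  # log messages now go to this file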

Something like `sc.logging.get_operations(adata_id=id(adata))` would also be super cool, but would it be able to retrieve records of operations performed across rounds of object serialization?

e.g.:

adata = sc.read_10x_h5("./10x_run/outs/filtered_gene_matrix.h5")
sc.pp.normalize_per_cell(adata, 1000)
sc.pp.log1p(adata)
sc.pp.pca(adata)
adata.write("./cache/01_simple_process.h5ad")

adata = sc.read("./cache/01_simple_process.h5ad")
sc.pp.scale(adata)
adata.write("./cache/01_simple_process.h5ad")
print(sc.logging.get_operations(adata_id=id(adata)))

would probably forget the first set of operations and return only:

# Where id(1) is a stand-in for a value like `id(adata)`
{"call": "scale", "adata_id": id(1)}
{"call": "write", "args": {"filename": "./cache/01_simple_process.h5ad"}, "adata_id": id(1)}

I guess one solution would be to follow the path of ids up the log to retrieve them all, which seems doable, so this could be a good system.
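
A rough sketch of that id-following, over a chronological list of record dicts like the pseudo-records above (helper name hypothetical; assumes `read` records log the filename and the returned id):

def get_operations_exhaustive(records, adata_id):
    ops = []
    for idx in range(len(records) - 1, -1, -1):  # walk the log backwards
        rec = records[idx]
        if rec.get("adata_id") == adata_id or rec.get("returned_adata") == adata_id:
            ops.append(rec)
            if rec["call"] == "read" and rec.get("returned_adata") == adata_id:
                # The object came from a file: find the last write to that file
                # before this read, and follow the earlier object's id instead.
                filename = rec["args"]["filename"]
                write_rec = next(
                    r for r in reversed(records[:idx])
                    if r["call"] == "write" and r["args"]["filename"] == filename
                )
                adata_id = write_rec["adata_id"]
    return list(reversed(ops))  # oldest operation first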

The one thing this wouldn't cover, though, is persistence within the h5ad object itself. This would be useful when sharing the object with someone, for example. As I mentioned before, I'm not sure this is a widespread use case yet, but it could be useful.

@ivirshup
Member

Yeah, that's what I was thinking for tracking between serializations. I figure there could be a boolean argument like `exhaustive` which would signify whether you want this particular AnnData only, or all the previous AnnDatas it could be derived from in the logs.

I think it'll be possible to write the logs to some field in the object. There is a question of how complicated this would be to implement, which I haven't quite figured out yet. Maybe you'd add a reference to the AnnData to the logging context, and make a custom logger which decides where to write based on that? Alternatively, maybe this just gets handled by some logic in the decorator, so that after a method is called, a flag decides whether to add records to the modified object.

Of course, nothing is logged persistently by default, so enabling it is already an extra step. Sharing could then just need two extra steps: "enable persistent logging" and "send them the logs".
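
As a sketch of the custom-logger variant: a structlog processor could pop the bound AnnData off the event and mirror the record into `.uns` (the `adata` key and the `.uns` field are assumptions):

import structlog

def uns_writer(logger, method_name, event_dict):
    # If the decorator bound the AnnData itself into the event, mirror the
    # record into the object and emit only its id in the log message.
    adata = event_dict.pop("adata", None)
    if adata is not None:
        event_dict["adata_id"] = id(adata)
        # Store a plain-string copy so it survives an h5ad round trip
        adata.uns.setdefault("log", []).append(repr(event_dict))
    return event_dict

structlog.configure(
    processors=[uns_writer, structlog.processors.KeyValueRenderer()]
)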

@afrendeiro
Author

`sc.logging.get_operations` with `exhaustive` would be great, but if one could also find a way to store the same records persistently, or in the object itself at the user's request, that would cover all the ground.
