Tracking operations performed on AnnData objects #472
I like this idea, but think it could be expanded on a bit. I think there are benefits to approaching this through logging: consistency of usage, consistency of the logged messages, and separation of concerns between the computation and the tracking.
What if tracking was done through logging? Here are a couple of quick examples of what I mean.

**Simple example.** Logs the `anndata` used, the function called, and the time elapsed:

```python
from anndata import AnnData
from datetime import datetime
from functools import wraps
from structlog import get_logger
from time import sleep
import uuid

logger = get_logger()

def logged(func):
    @wraps(func)
    def func_wrapper(*args, **kwargs):
        call_id = uuid.uuid4()  # So we can always match call start with call end
        call_start_record = dict(call_id=call_id, called_func=func.__name__)
        if type(args[0]) is AnnData:
            call_start_record["adata_id"] = id(args[0])
        logger.msg("call", **call_start_record)
        t0 = datetime.now()
        output = func(*args, **kwargs)
        dt = datetime.now() - t0
        call_finish_record = dict(called_func=func.__name__, elapsed=dt)
        if type(output) is AnnData:
            call_finish_record["returned_adata_id"] = id(output)
        logger.msg("call_finish", **call_finish_record, call_id=call_id)
        return output
    return func_wrapper

# Usage
@logged
def foo(adata, x, copy=False):
    sleep(0.5)
    if copy:
        return adata.copy()

import scanpy as sc
pbmcs = sc.datasets.pbmc68k_reduced()

foo(pbmcs, 1)
# 2019-02-13 19:27.58 call adata_id=4937049368 call_id=UUID('82f3944c-08c1-470a-9d39-03dcabc091a2') called_func=foo
# 2019-02-13 19:27.58 call_finish call_id=UUID('82f3944c-08c1-470a-9d39-03dcabc091a2') called_func=foo elapsed=datetime.timedelta(microseconds=500777)

foo(pbmcs, 1, copy=True)
# 2019-02-13 19:28.02 call adata_id=4937049368 call_id=UUID('986f57e4-656a-41b1-9c7c-a7c5ad5b01fc') called_func=foo
# 2019-02-13 19:28.03 call_finish call_id=UUID('986f57e4-656a-41b1-9c7c-a7c5ad5b01fc') called_func=foo elapsed=datetime.timedelta(microseconds=505970) returned_adata_id=4940502352
```

**More complicated example, with argument value logging:**

```python
from anndata import AnnData
from copy import copy
from datetime import datetime
from functools import wraps
import inspect
from itertools import chain
from structlog import get_logger
from time import sleep
import uuid

logger = get_logger()

def logged(logged_args=None):
    """
    Params
    ------
    logged_args : list[str], optional (default: `None`)
        Names of arguments to log.
    """
    if logged_args is None:
        logged_args = []

    def logged_decorator(func):
        argnames = inspect.getfullargspec(func).args

        @wraps(func)
        def func_wrapper(*args, **kwargs):
            call_id = uuid.uuid4()  # So we can always match call start with call end
            logged_params = {}
            for param, val in chain(zip(argnames, args), kwargs.items()):
                if type(val) is AnnData:
                    logged_params[param] = id(val)
                elif param in logged_args:
                    logged_params[param] = copy(val)  # Probably need to consider how these values get logged
            logger.msg("call", called_func=func.__name__,
                       logged_args=logged_params, call_id=call_id)
            t0 = datetime.now()
            output = func(*args, **kwargs)
            dt = datetime.now() - t0
            call_finish_record = dict(
                called_func=func.__name__, elapsed=dt,
            )
            if type(output) is AnnData:
                call_finish_record["returned_adata_id"] = id(output)
            logger.msg("call_finish", **call_finish_record, call_id=call_id)
            return output
        return func_wrapper
    return logged_decorator

# Usage
@logged(logged_args=["x"])
def foo(adata, x, copy=True):
    sleep(0.5)
    if copy:
        return adata.copy()

import scanpy as sc
pbmcs = sc.datasets.pbmc68k_reduced()

foo(pbmcs, 1, copy=True)
# 2019-02-13 19:35.48 call call_id=UUID('f7623504-31c5-4afa-ae26-4f58fc5341a8') called_func=foo logged_args={'adata': 4476410456, 'x': 1}
# 2019-02-13 19:35.48 call_finish call_id=UUID('f7623504-31c5-4afa-ae26-4f58fc5341a8') called_func=foo elapsed=datetime.timedelta(microseconds=507880) returned_adata_id=5064494384
```
Interesting ideas. I actually hadn't noticed that scanpy has no logging implemented; this would indeed be useful and could already solve half the problem. However, I doubt the best way to go about this would be post hoc with decorators etc., but rather intrinsically throughout the various API functions. Regardless of logging, I still think that having something which is intrinsically attached to the object would have the advantage of knowing the exact set of operations solely from the h5ad file/AnnData object itself. I don't know whether people out there are actually sharing these objects or not, but it could be useful from that perspective too.
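To make the "intrinsically attached" idea concrete: a minimal sketch of a decorator that records each call on the object's own `.uns` dict, so the history travels (and could be serialized) with the object. `FakeAnnData`, `tracked`, and `normalize` are hypothetical stand-ins so the sketch runs without `anndata` installed; the real thing would wrap scanpy's functions and a real `AnnData`.

```python
from datetime import datetime, timezone
from functools import wraps

class FakeAnnData:
    """Minimal stand-in for AnnData: just the ``.uns`` metadata dict."""
    def __init__(self):
        self.uns = {}

def tracked(func):
    """Append a record of each call to the object's own ``.uns`` field."""
    @wraps(func)
    def wrapper(adata, *args, **kwargs):
        record = {
            "func": func.__name__,
            "time": datetime.now(timezone.utc).isoformat(),
            "kwargs": dict(kwargs),
        }
        adata.uns.setdefault("operations", []).append(record)
        return func(adata, *args, **kwargs)
    return wrapper

@tracked
def normalize(adata, target_sum=1000):
    pass  # actual computation elided

adata = FakeAnnData()
normalize(adata, target_sum=500)
print([r["func"] for r in adata.uns["operations"]])  # ['normalize']
```

Since `.uns` is written to the h5ad file, records stored this way would survive serialization for free.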
Scanpy does have logging implemented (examples: neighbors, highly variable genes), but it's not that widely used. I think this is because it has to be implemented manually in the code (not sure if this is what you mean by "intrinsic"?), which makes it take some effort to implement, and not all contributors are aware of it. I think using a decorator would be nice for abstracting out the process. This would have the benefits of consistency of usage by making it easy, consistency of the logged messages, and separation of concerns between computation and tracking.

I also think you'd be able to know the exact set of operations from this approach. Assuming all top-level functions have been wrapped with a decorator like the one I presented above, this code:

```python
adata = sc.read_10x_h5("./10x_run/outs/filtered_gene_matrix.h5")
sc.pp.normalize_per_cell(adata, 1000)
sc.pp.log1p(adata)
sc.pp.pca(adata)
adata.write("./cache/01_simple_process.h5ad")
```

should result in a set of (pseudo-)records, one per call. It's pretty trivial to go through these logs and figure out what happened to the AnnData, and this could be made accessible through helper functions. Is there some set of information this wouldn't capture?
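As a sketch of what those pseudo-records and a helper might look like: the record shapes follow the decorator above, but the ids and the `get_operations` helper are hypothetical, purely for illustration.

```python
# Hypothetical records, as the decorator above would emit for that snippet.
# read_10x_h5 returns a new AnnData; the rest operate on it in place.
records = [
    {"event": "call_finish", "called_func": "read_10x_h5", "returned_adata_id": 101},
    {"event": "call_finish", "called_func": "normalize_per_cell", "adata_id": 101},
    {"event": "call_finish", "called_func": "log1p", "adata_id": 101},
    {"event": "call_finish", "called_func": "pca", "adata_id": 101},
    {"event": "call_finish", "called_func": "write", "adata_id": 101},
]

def get_operations(records, adata_id):
    """Hypothetical helper: list the functions applied to a given AnnData id."""
    return [
        r["called_func"]
        for r in records
        if r.get("adata_id") == adata_id or r.get("returned_adata_id") == adata_id
    ]

print(get_operations(records, 101))
# ['read_10x_h5', 'normalize_per_cell', 'log1p', 'pca', 'write']
```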
Sorry, I missed the logging. I also didn't see an obvious way to recover the history across write/read cycles. Something like this, e.g.:

```python
adata = sc.read_10x_h5("./10x_run/outs/filtered_gene_matrix.h5")
sc.pp.normalize_per_cell(adata, 1000)
sc.pp.log1p(adata)
sc.pp.pca(adata)
adata.write("./cache/01_simple_process.h5ad")

adata = sc.read("./cache/01_simple_process.h5ad")
sc.pp.scale(adata)
adata.write("./cache/01_simple_process.h5ad")

print(sc.logging.get_operations(adata_id=id(adata)))
```

would probably forget the first set of operations, since re-reading the file yields a new object with a new id. I guess one solution would be to follow the path of ids up the log to retrieve them all, which seems doable, so this could be a good system. The one thing this wouldn't cover, though, is persistence within the h5ad object itself. This would be useful in the case of sharing the object with someone, for example. As I mentioned before, I'm not sure this is a widespread use case yet, but it could be useful.
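"Following the path of ids up the log" could be sketched like this. The records and the `full_history` helper are hypothetical; the key assumption is that any call producing a new object logs both the parent's `adata_id` and the child's `returned_adata_id`, so the lineage can be chained backwards.

```python
# Hypothetical log: `copy` derives object 2 from object 1, then 2 is scaled.
records = [
    {"called_func": "normalize_per_cell", "adata_id": 1},
    {"called_func": "log1p", "adata_id": 1},
    {"called_func": "copy", "adata_id": 1, "returned_adata_id": 2},
    {"called_func": "scale", "adata_id": 2},
]

def full_history(records, adata_id):
    """Walk the log backwards, jumping to the parent id whenever the
    current object was returned by a call on another object."""
    ops = []
    current = adata_id
    for r in reversed(records):
        if r.get("returned_adata_id") == current and "adata_id" in r:
            ops.append(r["called_func"])
            current = r["adata_id"]  # follow the link to the parent object
        elif r.get("adata_id") == current:
            ops.append(r["called_func"])
    return list(reversed(ops))

print(full_history(records, 2))
# ['normalize_per_cell', 'log1p', 'copy', 'scale']
```

Bridging a write/read cycle would need one more link of the same kind, e.g. matching `write` and `read` records on the file path.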
Yeah, that's what I was thinking for tracking between serializations. I figure there could be a boolean argument to turn it on. I think it'll be possible to write the logs to some field in the object. There is a question of how complicated this would be to implement, which I haven't quite figured out yet. Maybe you'd add a reference to the AnnData to the logging context, and make a custom logger which decides where to write based on that? Alternatively, maybe this just gets handled by some logic in the decorator: after a method is called, a flag determines whether to add records to the modified object. Of course, nothing is logged persistently by default, so it's already an extra step to enable. It's possible sharing could just need two extra steps: "enable persistent logging" and "send them the logs".
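The "custom logger decides where to write" idea could look roughly like a processor in the logging pipeline. This is a simplified single-argument sketch with a stand-in class (structlog's real processors take `(logger, method_name, event_dict)`, and none of these names come from structlog or anndata):

```python
class FakeAnnData:  # stand-in so the sketch runs without anndata installed
    def __init__(self):
        self.uns = {}

def persist_to_object(record):
    """If a log record carries an object reference, append a copy of the
    record to that object's own storage, then strip the reference."""
    adata = record.pop("adata_ref", None)
    if adata is not None:
        adata.uns.setdefault("log", []).append(dict(record))
    return record  # pass the (now reference-free) record down the pipeline

adata = FakeAnnData()
persist_to_object({"event": "call", "called_func": "scale", "adata_ref": adata})
print(adata.uns["log"])  # [{'event': 'call', 'called_func': 'scale'}]
```

The decorator-flag variant would do the same append, but conditionally, right after the wrapped call returns.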
When exploring various options for preprocessing data, I try to avoid having several copies of AnnData objects in memory if they're not sparse, so I save them to h5ad at key steps. Sometimes, alas, after a few iterations I re-write stuff and forget what operations have been performed on my "X" (particularly in the preprocessing steps).

So, because being lazy makes me creative, I started tracking these in the object itself (see example https://gist.github.com/afrendeiro/7ccaf324bfdbff042ae36f734f544860) by decorating the preprocessing functions post hoc (this could even easily be extended to save the values of the kwargs passed).

I wonder if an internal implementation of this would be of broad interest, particularly for functions which modify "X" in place?

Of course this would be no replacement for proper documentation of one's steps, etc., but I thought it could be an interesting addition to scanpy in any case.
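A minimal sketch of the post-hoc decoration idea, in the spirit of the linked gist (the gist itself has the real version; `FakeAnnData`, `track_inplace`, and the `"X_operations"` key here are hypothetical names, and the point is that no change to scanpy is required):

```python
from functools import wraps

class FakeAnnData:  # stand-in so the sketch runs without anndata installed
    def __init__(self):
        self.uns = {}

def track_inplace(func):
    """Record the operation name and kwargs on the object after each call."""
    @wraps(func)
    def wrapper(adata, **kwargs):
        func(adata, **kwargs)
        adata.uns.setdefault("X_operations", []).append((func.__name__, kwargs))
    return wrapper

def log1p(adata):  # pretend this is sc.pp.log1p, modifying X in place
    pass

log1p = track_inplace(log1p)  # decorate post hoc, without touching scanpy

adata = FakeAnnData()
log1p(adata)
print(adata.uns["X_operations"])  # [('log1p', {})]
```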