
Support function specific dependencies #3439

Open
drorata opened this issue Mar 4, 2020 · 48 comments
Assignees
Labels: discussion (requires active participation to reach a conclusion), feature request (Requesting a new feature), p1-important (Important, aka current backlog of things to do)

Comments

@drorata commented Mar 4, 2020

Assume we have 4 stages:

  • Stages 0 and 2 depend on the function foo
  • Stages 1 and 3 depend on the function bar

In order to reduce code duplication we want to put the utilities foo and bar in one place; let them be in utils.py. So, utils.py becomes a dependency of all four stages.

Now... after some time we realize we have to update something in bar. So we edit utils.py. This, in turn, invalidates all four stages. Obviously, a simple solution is to have utils_0_2.py, which will be a dependency of stages 0 and 2, and similarly utils_1_3.py. But this solution conflicts with good practices of coding.

A possible solution could be to adopt some notation like the one used in pytest and specify the function inside utils.py. For example, something like:

deps:
- md5: 409a438ee6ac70fc1f911631412325ea
  path: utils.py::foo

This is a continuation of this discussion.
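The sub-file granularity being requested can be illustrated with a short sketch (not anything DVC implements today): hash a single function's source using Python's standard ast module, so that an edit to bar leaves foo's checksum untouched.

```python
import ast
import hashlib


def function_md5(source: str, name: str) -> str:
    """md5 of one top-level function's source within a module (sketch only).

    A real implementation would also need to handle decorators, methods,
    nesting, and non-Python languages.
    """
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef) and node.name == name:
            return hashlib.md5(
                ast.get_source_segment(source, node).encode()
            ).hexdigest()
    raise KeyError(f"function {name!r} not found")
```

With this, editing bar inside utils.py changes function_md5(src, "bar") but not function_md5(src, "foo"), which is exactly the invalidation behaviour requested above.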

triage-new-issues bot added the "triage" label Mar 4, 2020
@pared added the "discussion" and "feature request" labels Mar 4, 2020
triage-new-issues bot removed the "triage" label Mar 4, 2020
@pared (Contributor) commented Mar 4, 2020

Ok, so the first obstacle I see is that we are trying to keep DVC language agnostic, so the specification of how to define this function dependency and how to calculate its checksum would have to be quite abstract. Having said that, it still looks like a reasonable feature request.
@iterative/engineering what do you think?

@casperdcl (Contributor)

We'd need specific per-language support (based on file extension) to extract the source code of functions/classes and md5-hash it. Python would be the obvious first language to support, but in general we'd need one plugin/parser per language.

@shcheklein (Member)

An alternative to this is to go all-Python. We've seen examples of companies using the DVC core API to manipulate stage files from Python. It usually means those users do not use dvc run, etc.; they use custom decorators that internally handle stage files, and so on.

@casperdcl (Contributor)

Not sure whether going "all Python" would be a different implementation from supporting other langs. We'd probably still have to check file extensions/types and print errors if it's a binary file, etc...

@jorgeorpinel (Contributor) commented Mar 4, 2020

Note: this may be a duplicate of #1577.

@casperdcl I think Ivan means an alternative for @drorata is to integrate with DVC using the Python package (pip installation) instead of using dvc run and the CLI in general. We don't have any docs about it yet, though (see iterative/dvc.org/issues/998), and the docstrings are sometimes very complete, sometimes nonexistent (see #3176), so that's probably a corner solution for rather advanced users.

@casperdcl (Contributor)

Well we could support a user-defined external filter tool (similar to git):

deps:
- md5: 409a438ee6ac70fc1f911631412325ea
  path: utils.py
  filter: python extract_function.py --name foo
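For what it's worth, the hypothetical extract_function.py in the example above could be tiny. A sketch (the script name and --name flag are just the placeholders from the example, not an existing tool), reading the module from stdin so it works as a POSIX-style pipe stage:

```python
"""Hypothetical `extract_function.py`: print one function's source to stdout.

Usage sketch: cat utils.py | python extract_function.py --name foo | md5sum
"""
import argparse
import ast
import sys


def extract(source: str, name: str) -> str:
    """Return the source of the named top-level function."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef) and node.name == name:
            return ast.get_source_segment(source, node)
    raise LookupError(f"no function named {name!r}")


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--name", required=True)
    args = parser.parse_args()
    print(extract(sys.stdin.read(), args.name))

# wired up as a script with the usual: if __name__ == "__main__": main()
```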

@fabiosantoscode (Contributor) commented Mar 6, 2020

I agree with @casperdcl here. I don't think we should be maintaining a whole bunch of tools for this. Maybe we can write just one, for python, and distribute it on pypi to "set the tone" for others to provide other tools with similar arguments (and possibly command name prefixes like dvc-extractfn-python --name foo).

Users of other languages will want to distribute this as a package in their own language. And as long as it's in the PATH and has the same signature, we're fine with it.

Edit: same argument spec and same command prefix aren't mandatory but sure would be nice for user experience.

@fabiosantoscode (Contributor) commented Mar 6, 2020

If anyone in the world wants one of these for JavaScript I can write it, it should be 50 lines or less using a pre-existing parser.

@elgehelge (Contributor)

Are you sure you really want such detailed dependency management, @drorata? Will you remember to update the dependency when somebody comes along and, for instance, splits the function in two?

def foo(bar):
    return [ba + 1 for ba in bar]

to

def foo(bar):
    return [fo(ba) for ba in bar]

def fo(ba):
    return ba + 1

Do you have an example of "this solution conflicts with good practices of coding" to make it a little less abstract?

@casperdcl (Contributor) commented Mar 11, 2020

@elgehelge good point, but I think it's fine for dvc to give users double-edged swords with caveat emptor warnings. If we go down the "support user-defined filter: " route then users could define a really clever filter which even analyses the whole code base and pulls in a function's dependencies. Probably not too hard considering:

it should be 50 lines or less using a pre-existing parser

I see this as part of DVC's "Makefiles-but-better" bit, where there's sub-file granularity.

The question is, does this feature make bad practice so easy for end-users to slip into that DVC should stay away from it? I don't think so.

If nothing else the end-user filter could be a simple dos2unix | sed -i '/^$/d' to minimise re-running things after whitespace changes...

btw I assume filter is a POSIX-style pipe, i.e. cat path | filter | md5sum is the md5 saved in the DVC-file.

CC @efiop

@drorata (Author) commented Mar 12, 2020

@elgehelge Good point indeed, but still, I believe you can face many cases where this conflict between good coding practices and dependencies can happen. For example, assume that stages 1 and 3 need to fetch data from S3 and stages 2 and 3 need to fetch data from MongoDB. Now you have two utility functions that take care of fetching the data from the two different sources. From the codebase-structure point of view, I would argue that it makes sense to keep these two utility functions inside the same utils.py.

@dmpetrov (Member)

Well we could support a user-defined external filter tool (similar to git):

It is a very good idea to unify this at the command level. That way, DVC can support dependencies for any language, or tables in databases (#1577).

There is also an opportunity to go further in the unification direction and make it general for other types of "conditional" dependencies, such as hyper-parameters from a params file (#3393) and Python dependencies as a specially important use case (as @shcheklein mentioned).

The unification can look something like this:

deps:
- md5: 97215532a21d13daa5bf932f93a7d518        # <-- regular file
  path: users.csv
- type: cmd                              # <-- command (can be code deps or DB check)
  checksum: 409a438ee6ac70fc1f911631412325ea
  cmd: node db_table_last_update.js --table users
- type: python                         # <-- native python deps
  checksum: 1df9d34e2031bb7e8
  path: utils.py
  function: utils.check_db
- type: params
  params: "depth=17,L1=0.25,eta=0.2,rotate=[10,20,35]"   # <-- hyper-parameter
  path: params.conf

The question is, does this feature make bad practice so easy for end-users to slip into that DVC should stay away from it? I don't think so.

I don't see any issues so far (I just don't like the term filter :) ).

btw I assume filter is a POSIX-style pipe, i.e. cat path | filter | md5sum is the md5 saved in the DVC-file.

Right. cmd type should support pipes.

Will you remember to update the dependency when somebody comes along and for instance splits the function into two?

Very good point, @elgehelge! I don't think it makes sense to go too deep into the function code and all its dependencies (it's okay to trigger the stage if code was refactored without any semantic changes, as in your example). Also, for simplicity, we can agree to ignore changes in a function's dependencies (so, no checks for fo in your example, or for math.exp() if you use that). In your example, a proper check would require depending on both Python functions: foo() and fo().

@casperdcl (Contributor)

I would recommend against a type: python.

- type: python                         # <-- native python deps
  checksum: 1df9d34e2031bb7e8
  path: utils.py
  function: utils.check_db

Should be:

- type: cmd
  checksum: 1df9d34e2031bb7e8
  cmd: dvc extract utils.py::utils.check_db

@dmpetrov (Member)

I would recommend against a type: python.

Type python is "syntactic sugar". So, I agree, it is better not to start with it, and to implement it down the road if needed.

@dmpetrov (Member)

In addition... type: params is also "syntactic sugar". However, it is so fundamental to DVC that it has to be supported in dvc-files. I'd even say it is more important than the command type.

Some external automation tools (hyperparameter automation/visualization, CD4ML, DVC itself - dvc repro -p learning_rate=0.04) will use that type from the dvc-file and the param files (see #3393).

@casperdcl (Contributor)

So we're down to 2 options:

type: cmd  # or potentially other things
cmd: custom_filter

or

filter: custom_filter

I still prefer the latter (and, as a minor point, also prefer the word "filter", since that's what Git calls it and what users will easily understand). However, I understand the need for a type, with an implicit default of type: file if unspecified.

So really it seems like there's only one option:

type: file  # optional, assumed `file` by default
cmd: custom_filter  # optional

  • we don't need an explicit type: cmd as it's inferred from the existence of cmd: custom_filter
  • we only need to debate whether we call this cmd: or filter: ...

@dmpetrov (Member)

We need to specify how the checks are made and what the requirements are for the user's commands/cmd. DVC needs some checksum (or timestamp, or another signal) from cmd to identify if the cmd needs to be reproduced.

The requirement for the user code/cmd can be like:

DVC reproduces the stage with a special/conditional command dependency ('type: cmd') only if:
1. The command returns zero return code - success.
2. The command output string does not match the string (checksum) from the last run. (with some reasonable buffer limit like 4KB)
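The two conditions above can be sketched in a few lines of Python (an illustration of the proposed semantics, not existing DVC code; the stored checksum is assumed to come from the dvc-file):

```python
import subprocess


def dep_changed(cmd: str, last_checksum: str) -> tuple:
    """Apply the two conditions above to a 'type: cmd' dependency.

    Returns (changed, current_output). A non-zero exit code means the
    check itself failed, so no repro decision can be made at all.
    """
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    if result.returncode != 0:  # condition 1: only trust a successful run
        raise RuntimeError(f"dependency command failed: {cmd}")
    current = result.stdout.strip()
    # condition 2: reproduce only if the output differs from the stored value
    return current != last_checksum, current
```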

Also, I don't quite understand the analogy with Git filters. Is it about git-filter-branch? How does this apply to the conditional commands? @casperdcl, I'd appreciate it if you could explain.

@casperdcl (Contributor)

identify if the cmd needs to be reproduced

I was intending it to always be run (unconditionally), but good point that this may be inefficient. Maybe we need a hash both before and after cmd, and only re-run cmd if the before-hash has changed.

path: utils.py
md5: 1q2w3e4r
cmd: custom_filter  # if `md5` changed, will be run to produce `cmd_md5`
cmd_md5: 7u8i9o0p

or:

path: utils.py
path_md5: 1q2w3e4r
cmd: custom_filter  # if `path_md5` changed, will be run to produce `md5`
md5: 7u8i9o0p
  1. The command returns zero return code - success.
  2. The command output string does not match the string (checksum) from the last run. (with some reasonable buffer limit like 4KB)

I disagree with both options. I'd expect a modification of 2: cat path | cmd | md5sum is the checksum, ergo there is no buffer limit.

analogy with Git filters

I refer to e.g. smudge and clean filters https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes#filters_a

@dmpetrov (Member)

identify if the cmd needs to be reproduced

I was intending it to always be run (unconditionally) but good point that this may be inefficient.

There might be some misunderstanding... The whole point of the specific dependencies (this issue, as well as the hyper-param issue #3393) is not to run the stage when the user's code (cmd) can identify that the result will be the same despite the dependency file change.

Maybe we need a hash both before and after cmd and only re-run cmd if the hash before has changed.

Kind of...

  • before the run: what you already have in the dvc-file (might be empty).
  • after the run: what cmd returns; then you update the old value to the new one. So, there is no need to store both together.

I disagree with both options. I'd expect a modification of 2: cat path | cmd | md5sum is the checksum, ergo there is no buffer limit.

These are not options; they are the 2 steps you need to run to check whether a repro of a stage is needed. I didn't get your modification (except the limit, which is an implementation detail):

  1. Will you accept the result if cmd failed (see step 1)?
  2. Are you forcing users to run exactly the same commands with md5sum etc. (md5sum does not exist on macOS, btw)? What if a user would like to run db_table_last_update.js, which returns a table's last-modification timestamp?

@casperdcl (Contributor) commented Mar 13, 2020

@dmpetrov definite misunderstanding here. In this issue I'm referring to filter as a command which extracts a subset of the file defined in path.

This is completely and utterly different from the usual (current) definition of cmd which is why I'd prefer a different name. To be explicit:

md5: fd75c91bb60baf9bf36e4f64f0c1c7e0
cmd: python train_model.py  # <-- user cmd
deps:
- md5: 53184cb9a5fcf12de2460a1545ea50d4  # <-- from `cat path | md5sum`
  path: train_model.py
- md5: 809776eaeef854443d39e4f76bd64c84  # <-- from `cat path | filter | md5sum`
  path: utils.py
  filter: python extract_function.py --name check_db  # <-- user filter
  md5_orig: 500f914dd06cf1f588b8e281c8fb3dd8  # <-- optional, from `cat path | md5sum`,
  # used to determine if we should bother re-running `filter`
outs:
- md5: 47b3781954c760e1bd19ac9e56e141b1
  path: model/mymodel.pkl

Where "user filter" is equivalent to cat path | filter | md5sum to generate the md5 associated with the path. In this case: cat utils.py | python extract_function.py --name check_db | md5sum. The normal (no-filter) behaviour (md5 based on path) would be equivalent to filter: cat (or is it actually filter: head -c 4M for speed purposes at the cost of correctness? Off-topic question).

I also prefer to do without md5_orig (i.e. instead we should run filter unconditionally) as it's a minimal overhead/inefficiency which can be dealt with in a different issue. Running filter unconditionally will re-calculate path's md5 to be exactly the same, and thus cmd and other dependees won't re-run.
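The cat path | filter | md5sum convention reads naturally as a few lines of Python (a sketch of the semantics under discussion, not DVC internals; with no filter it degrades to a plain file hash, i.e. the filter: cat case):

```python
import hashlib
import subprocess


def filtered_md5(path: str, filter_cmd: str = "") -> str:
    """md5 of `cat path | filter_cmd`; hash of the raw file if no filter."""
    with open(path, "rb") as f:
        data = f.read()
    if filter_cmd:
        # pipe the file contents through the user's filter command
        data = subprocess.run(
            filter_cmd, shell=True, input=data,
            capture_output=True, check=True,
        ).stdout
    return hashlib.md5(data).hexdigest()
```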

Are you forcing users to run exactly the same commands with md5sum etc. (md5sum does not exist on macOS, btw)?

No, the user never runs md5sum, I only use it as shorthand for whatever dvc uses internally (basically python -c 'import sys, hashlib; print(hashlib.md5(sys.stdin.read().encode("U8")).hexdigest())').

What if a user'd like to run db_table_last_update.js which returns a table last modification timestamp

Yes, exactly what filter supports (the timestamp will be md5summed by dvc, though, so not particularly human-readable).


About the confusion, I think you were probably talking about something completely different along the lines of simplifying:

md5: fd75c91bb60baf9bf36e4f64f0c1c7e0
cmd: python train_model.py
deps:
- md5: 53184cb9a5fcf12de2460a1545ea50d4
  path: train_model.py
- md5: 809776eaeef854443d39e4f76bd64c84
  path: /tmp/__check_db
outs:
- md5: 47b3781954c760e1bd19ac9e56e141b1
  path: model/mymodel.pkl

...
md5: 34984b7d96f970d23bb38cda7d7b3f1a
cmd: cat utils.py | python extract_function.py --name check_db  > /tmp/__check_db
deps:
- md5: 500f914dd06cf1f588b8e281c8fb3dd8
  path: utils.py
outs:
- md5: 809776eaeef854443d39e4f76bd64c84
  path: /tmp/__check_db

@dmpetrov (Member) commented Mar 15, 2020

@casperdcl thank you for the clarification. It looks like we have less misunderstanding than I thought. And now I understand where your filters come from - filter a part of the source code file.

"user filter" is as equivalent to cat path | filter | md5sum to generate the md5

Anyway, we need an output from the user-defined function (a good opportunity to introduce the term UDF to DVC 😀). The misunderstanding was whether we should calculate md5 from the output or not. If we do md5:

  • [pros] it unifies the checksum representations and we don't need to invent limits (a size limit like 4K, or a format limit like printable symbols only)
  • [cons] the user can come up with a better, human-readable representation of the checksum: last-update timestamp, number of records in a DB table/csv file, file timestamp (this is what DVC does today in its internal DB).

Either of these approaches can work just fine. I'm good with md5, and we can easily extend this later if needed. One more addition: in either approach, DVC has to check the command's return code, as I mentioned before - no need to run the md5 check if cmd failed.

I also prefer to do without md5_orig (i.e. instead we should run filter unconditionally)

Totally agree. The check/cmd has to be lightweight and run for each repro/deps check.

So, it looks like the only disagreement here is the name. I got your analogy with filtering function code out of a code file, but the code check is not the only option here.

Even for source-code dependency checks, it is not easy to imagine that someone will be filtering actual code lines. I'm sure that even for simple checks we will be dealing with bytecode or some form of abstract syntax tree (AST). I guess this is why I didn't get the filtering analogy, and I'm not sure other users will understand it either.

I suggest using cmd, not filter. Other ideas are also welcome (see UDF at the beginning).

@dmpetrov (Member) commented Mar 15, 2020

There is one more subject to discuss - do we need the checksums from user-defined functions in the dvc-file, or can we store them in the DVC DB (.dvc/state)?

  • [pros] No additional checksums in dvc-files which improves human readability.
  • [cons] The changes will be localized to a user's machine (a repo clone). As a result, any user-defined function has to run in a new repo clone at least once (the first time), even if there are no changes in the "global state" (source code, DB table).

Implementation details

If someone does not know about the DVC DB: the .dvc/state file is an SQLite database with some internal information. Today DVC stores FS timestamps for large data files in the DB. This is an optimization - if the timestamp didn't change for a file, then DVC does not recalculate the md5 for that file. But this md5 calculation has to run at least once in each clone of a repo for each data file.

There is an analogy between the current FS timestamp optimization and this one. And we can generalize this kind of optimization, which should improve the codebase.

If you try to imagine how a dvc-file would look without the DVC DB, you will see something like:

deps:
- md5: 53184cb9a5fcf12de2460a1545ea50d4
  path: users.csv
  type: fs_timestamp_check
  timestamp: 123456789012

That means that DVC needs to check the file timestamp, and if it does not match, it should calculate a new checksum and save it with the timestamp in the dvc-file.
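The timestamp optimization described above can be sketched in a few lines (an in-memory stand-in for the SQLite .dvc/state DB, purely illustrative):

```python
import hashlib
import os

# stand-in for .dvc/state: path -> (mtime, md5)
_state = {}


def cached_md5(path: str) -> str:
    """Recompute the md5 only when the filesystem timestamp has changed."""
    mtime = os.path.getmtime(path)
    entry = _state.get(path)
    if entry is not None and entry[0] == mtime:
        return entry[1]  # timestamp unchanged: skip hashing
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    _state[path] = (mtime, digest)
    return digest
```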

The idea of moving the checksum to DVC DB seems elegant to me. I don’t see any concerns for now. @iterative/engineering could you please take a look at this approach. Is there something that I missed?

@casperdcl (Contributor)

When you say

checksums from user defined functions

Do you mean the hash of the UDF/filter itself (rather than its output)? Similar to the md5 for cmd right now?

In which case do we need to store checksums for filters at all? I thought we'd agreed (for now) that they should be run unconditionally so I don't see what said checksums could be used for.

@dmpetrov (Member) commented Mar 16, 2020

Do you mean the hash of the UDF/filter itself (rather than its output)? Similar to the md5 for cmd right now?

Yes.

In which case do we need to store checksums for filters at all?

🤔 when you run the UDF (unconditionally) and it returns a checksum, how do you decide whether the dependency changed and the stage needs to be re-run (if you have only the actual dependency checksum but not the UDF checksum)?

Let's take one step back. I don't think we can use the UDF checksum as the file checksum (md5), because some DVC algorithms (status, gc) are based on the assumption that the file checksum is an actual one and can be recalculated purely from the file content. The UDF checksum is just a "proxy" for optimization purposes (this is why I'm thinking it needs to be moved out of the dvc-file). In theory, we can rethink this principle, but it would be a big design change and a really good reason is needed.

@casperdcl (Contributor) commented Mar 16, 2020

Hmm, I see your concern, but we definitely need to save the filter's output checksum (i.e. the subset-of-file checksum) in the DVC-file for reproducibility across systems. How about it appearing twice? gc probably won't even need patching with this:

md5: fd75c91bb60baf9bf36e4f64f0c1c7e0
cmd: python train_model.py  # <-- user cmd
deps:
- md5: 809776eaeef854443d39e4f76bd64c84  # == `cat path | filter | md5sum`
  path: utils.py
  filter: python extract_function.py --name check_db  # <-- user filter
- md5: fd2b169580906cd21350d47b041461b4  # `cat path | md5sum`
  path: utils.py

@dmpetrov (Member) commented Mar 16, 2020

@casperdcl before we jump to cross-system reproducibility, let's make sure we are on the same page... It looks like you left the UDF checksum out of the file (md5 alone is not enough, as I said). Didn't you?

EDITED: I expected it to have something like udf_hash:

md5: fd75c91bb60baf9bf36e4f64f0c1c7e0
cmd: python train_model.py  # <-- user cmd
deps:
- md5: 809776eaeef854443d39e4f76bd64c84  # == `cat path | filter | md5sum`
  path: utils.py
  filter: python extract_function.py --name check_db  # <-- user filter
  udf_hash: d5964e5fb17c88bbed0d8
- md5: fd2b169580906cd21350d47b041461b4  # `cat path | md5sum`
  path: utils.py

EDITED 2: the last checksum fd2b169580906cd21350d47b041461b4 is supposed to be removed because of udf_hash

@casperdcl (Contributor) commented Mar 16, 2020

when you run UDF (unconditionally) and it returns a checksum how you decide if the dependency changed and the stage needs to be re-run

The stage's checksum is the UDF's output's checksum. We run the UDF unconditionally and anything that depends on it will "see" a different checksum (and thus a "changed stage") only if the checksum has changed.

Am I missing something super basic about DVC's use of checksums? I still don't see a need for md5_udf (a.k.a. udf_hash).

@dmpetrov (Member)

I got it. I misinterpreted the phrase "How about appearing twice?". You suggest keeping both checksums, but in different places in the dvc-file. Yeah, it is the same as udf_hash but with some duplication:

md5: fd75c91bb60baf9bf36e4f64f0c1c7e0
cmd: python train_model.py  # <-- user cmd
deps:
- md5: 809776eaeef854443d39e4f76bd64c84  # == `cat path | filter | md5sum`
  path: utils.py
  filter: python extract_function.py --name check_db  # <-- user filter
- md5: fd2b169580906cd21350d47b041461b4  # `cat path | md5sum`
  path: utils.py

or

md5: fd75c91bb60baf9bf36e4f64f0c1c7e0
cmd: python train_model.py  # <-- user cmd
deps:
- md5: fd2b169580906cd21350d47b041461b4  # `cat path | md5sum`
  path: utils.py
  udf: python extract_function.py --name check_db  # <-- user filter
  udf_hash: 809776eaeef854443d39e4f76bd64c84  # == `cat path | filter | md5sum`

Ok, it does not matter at this point.

@dmpetrov (Member)

we definitely need to save the filter's output checksum (i.e. subset of file checksum) in the DVC-file for reproducibility across systems.

Let's return to this cross-system checksum question.

First, storing checksums in a local DVC DB does not affect reproducibility across systems. However, it affects performance a bit - in a freshly cloned repo you have to rerun every stage "protected" by a UDF, because these udf-checksums are not transferable from machine to machine. In exchange, we improve dvc-file readability by getting rid of these UDF checksums.

Second, once the build/run-cache is implemented (#1871 (comment), originally #1234), we can transfer the udf-checksums through the cache (and remote storages) without polluting dvc-files.

The biggest concern I have today regarding udf-checksums in the DVC DB is how well it fits the current DVC architecture.

@casperdcl (Contributor) commented Mar 17, 2020

storing checksums in a local DVC DB does not affect reproducibility across systems. However, it affects performance

Yes, that's what I meant; should have been more precise.

I also think my suggestion (splitting into path/md5 and path/md5/filter, where path is duplicated) misses the point, as there would be a dependency on the whole file, which is explicitly what we/@drorata do not want.

some of DVC algorithms (status, gc) are based on the assumption that file checksum is an actual one

That's a bit of an issue. I presume going down the route of adding filter/md5_filter (a.k.a. udf/udf_hash) to the current path, as you suggest, would not break status/gc while also allowing us to implement special handling in run/repro to check for and use filters?

This takes us all the way back to my first suggestion (albeit now we fully understand it? 🙂) #3439 (comment)

- md5: 809776eaeef854443d39e4f76bd64c84  # `cat $path | $filter | md5sum`
  path: utils.py
  filter: python extract_function.py --name check_db  # user filter
  md5_orig: 500f914dd06cf1f588b8e281c8fb3dd8  # <-- optional, from `md5sum $path`,
  # used to determine if we should bother re-running `$filter`

which needs to be tweaked to:

- md5_filter: 809776eaeef854443d39e4f76bd64c84  # `cat $path | $filter | md5sum`
  path: utils.py
  filter: python extract_function.py --name check_db  # user filter
  md5: 500f914dd06cf1f588b8e281c8fb3dd8  # `md5sum $path`

or even:

- filter_hash: 'def check_db():\n    pass'  # `cat $path | filter`
  filter_type: raw  # default `md5`. Using `raw` would skip the md5 sum,
  # (e.g. leave user-readable time stamp)
  path: utils.py
  filter: python extract_function.py --name check_db  # user filter
  md5: 500f914dd06cf1f588b8e281c8fb3dd8  # `md5sum $path`

@dmpetrov (Member) commented Mar 19, 2020

This takes us all the way back to my first suggestion (albeit now we fully understand it? 🙂) #3439 (comment)

- md5: 809776eaeef854443d39e4f76bd64c84  # `cat $path | $filter | md5sum`
  path: utils.py
  filter: python extract_function.py --name check_db  # user filter
  md5_orig: 500f914dd06cf1f588b8e281c8fb3dd8  # <-- optional, from `md5sum $path`,
  # used to determine if we should bother re-running `$filter`

Right, but it needs to be inverted. The key has to be the file checksum, not UDF.

- md5: 500f914dd06cf1f588b8e281c8fb3dd8  # `md5sum $path`
  path: utils.py
  udf: python extract_function.py --name check_db  # user filter
  udf_md5: 809776eaeef854443d39e4f76bd64c84  # `cat $path | $filter | md5sum`

@casperdcl (Contributor) commented Mar 19, 2020

Yes that's what I said :)

EDIT: Ah I only swapped the key but not the comment. So not quite what I said, but what I meant to say.

So... fine for me to be assigned unless someone else wants to tackle this.

@dmpetrov (Member)

@casperdcl one of the important questions left is where we store the checksum: dvc-file, DVC DB, or run/build-cache (which is not implemented yet). Initially, we can just save it in the dvc-file. But I'd love to hear @iterative/engineering's opinion.

@casperdcl (Contributor) commented Mar 19, 2020

I'd prefer DVC-file for now.

The analogy between timestamp→path_checksum and path_checksum→filter_checksum (ordering important) means that filter_checksum shouldn't be put in the DVC DB, as far as I can see.

Not sure if this has been discussed but when you say run/build cache, do you mean having different classes of cache (e.g. not synced with a remote, manually syncable, etc)?

@dmpetrov (Member)

I don't see any issues with the DVC DB except the warm-up run (the first mandatory run after cloning a repo). If we can get rid of checksums in dvc-files completely - even better.

The run/build cache is about caching run/file checksums globally and storing them in a remote. Today, DVC can make the repro/no-repro decision based only on the last run (not on runs a few commits back in Git history). Also, this is a way of moving (some of) the checksums away from dvc-files, making them more human-readable. The most recent discussion related to the run/build-cache: #1871 (comment)

@casperdcl (Contributor) commented Mar 19, 2020

I think some users would care about the warm-up. Run/build cache sounds great. Different issue though.

@shcheklein (Member)

My 2cs:

  • it also reminds me more of a filter than a udf (mostly because it's kind of a pipe, not a function definition that could be applied in another place)
  • also, it's not clear why we would need udf_md5 or filter_md5 - it feels like we can definitely start without them: a very simple one-field filter: cmd is enough. Later we can add caching if we see use cases where filters run too long, etc. Probably, I would prefer to keep them locally (the DB or some .dvc file) vs saving to Git.

Looks cool overall.

@shcheklein (Member)

A few questions to consider/answer:

  • does it make sense to start providing some very common functions as import dvc.filters?
  • we need to think about forward compatibility a bit - what will happen to a new project, with new DVC-files, running on an old DVC version - I hope some schema exception.
  • are we going to provide some CLI support for this? (dvc run)
  • do we need this for dvc add-ed files - effectively any dependency? Probably we won't support external deps/outs at first, right?

@shcheklein (Member)

Had a private chat with @dmpetrov. Summary: we both agree that neither udf nor filter is perfect.

The biggest issue with filter, no matter how much I like it, is that it's too low level. There is a risk the name won't resonate with users looking for a solution to certain use cases. It's probably hard to come up with a full list of those, but think about stuff that is a bit outside of just reading the file and transforming it into something. Let's imagine we want to count the number of records in a DB, or compute a custom hash - to my mind those could still technically be called filters, but are there other human-friendly alternatives?

@casperdcl (Contributor)

suggestions (in order of preference):

  1. filter
  2. filter
  3. filter
  4. map(per)
  5. trans(form(er))
  6. extract(or)
  7. morph(er)
  8. hasher
  9. process(or)
  10. cmd

I think this is also a ranking by word familiarity to non-programmers, so really the best all round.

@elleobrien
Chiming in because I get notifications from this post - "filter" has a very specific meaning to many data scientists, particular to relational algebra. It looks like you are referring to a broader set of functions being encompassed by filter, so I would agree with @shcheklein's last comment.

@elgehelge (Contributor)

"modifier"?

@casperdcl self-assigned this Jun 27, 2020
@casperdcl (Contributor) commented Jul 18, 2020

The API has changed a bit now in v1 dvc.yaml...

deps:
- md5: 409a438ee6ac70fc1f911631412325ea
  path: utils.py

is now simply

deps:
- utils.py

This complicates things a bit in terms of how to save the "filter" command, seeing as we now only have a deps list rather than a dict.

I can't really think of anything better than having some sort of "virtual" dep, i.e. a dep which is a command to execute and whose stdout to read. Either:

deps:
- 'cmd:python extract_function.py --name check_db' # `cmd:` prefix signifies UDF/filter

or:

deps: []
deps_cmd:
- python extract_function.py --name check_db

Or we can go with supporting a mix of string & dict types (maybe easiest?):

deps:
- some_other_file.py
- path: utils.py
  udf: python extract_function.py --name check_db

@dmpetrov (Member)

Good point re dvc 1.0, @casperdcl

Or we can go with supporting a mix of string & dict types (maybe easiest?):

deps:
- some_other_file.py
- path: utils.py
  udf: python extract_function.py --name check_db

This looks like the best option so far.

@Suor (Contributor) commented Jul 20, 2020

People will start to use it to make some external things like db state a dependency.

@casperdcl (Contributor) commented Jul 25, 2020

after discussion with @efiop:

path: utils.py
udf: python extract_function.py --name check_db

should mean either python extract_function.py --name check_db utils.py | md5sum - or just python extract_function.py --name check_db | md5sum -. It should not mean cat utils.py | ....

Also, a better spec seems to be:

path: utils.py
udf:
  cmd: python extract_function.py --name check_db
  #md5: s0m3h45h  # in dvc.lock but not dvc.yaml

or even (similar approach to params):

# in dvc.yaml
utils.py:
  cmd: python extract_function.py --name check_db

# in dvc.lock
path: utils.py
cmd: python extract_function.py --name check_db
md5: s0m3h45h

@ijlyttle commented Sep 25, 2023

I'm interested in this issue, so forgive me for asking whether it is still under consideration.

For me, the use case would be to filter for the source fields of code cells in Jupyter notebooks - to avoid re-running if there are only changes in prose, execution-counts, or the like.
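As a sketch of that notebook use case (a hypothetical helper, not an existing DVC or Jupyter API): hash only the source of code cells in the .ipynb JSON, ignoring markdown, outputs, and execution counts.

```python
import hashlib
import json


def notebook_code_md5(nb_json: str) -> str:
    """md5 over the concatenated source of code cells only."""
    nb = json.loads(nb_json)
    sources = [
        "".join(cell.get("source", []))
        for cell in nb.get("cells", [])
        if cell.get("cell_type") == "code"
    ]
    return hashlib.md5("\n".join(sources).encode()).hexdigest()
```

With this, prose-only edits leave the dependency checksum unchanged, so the stage would not re-run.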

@petebachant commented Oct 2, 2024

Going to +1 this as well. I typically like to organize projects around larger modules with a CLI, and defining dependencies inside them would be helpful. Syntax like this, similar to the outs spec, seems natural:

my_stage:
  deps:
    - mymodule.py:
        - filter: python -c "import inspect, mymodule; print(inspect.getsource(mymodule.myfunc))"
    - the-plan.csv:  # I only want to pay attention to a single row of this CSV
        - filter: python -c "import pandas as pd; print(pd.read_csv('the-plan.csv').iloc[3])"

So, stdout from the filter command is then used to compute the MD5. However, in this case the path doesn't actually matter. I guess it's there for readability?

@shcheklein added the p1-important label Oct 2, 2024