Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] making compute CLI command usable inside Python #245

Closed
wants to merge 3 commits into from

Conversation

luizirber
Copy link
Member

@luizirber luizirber commented May 20, 2017

I'm opening this PR for requesting comments about how to properly support calling CLI commands from the Python API, without being too verbose.

This is similar to the design I'm going for in sourmash-utils. (Old PR comments, now this is adapted to #785 ideas)

I'll add inline comments with details, but the rationale is:
We focus on exposing the CLI, but if I want to calculate a signature using the Python API there is quite a bit of boilerplate involved:

import screed
minhashes = []
for g in filenames:
    E = sourmash.MinHash(n=500, ksize=31)
    for record in screed.open(g):
        E.add_sequence(record.sequence, True)
    minhashes.append(E)

what I want to do:

from sourmash.command_compute import compute
sigs = compute(filenames)

(or something similar).

The idea of this PoC is to make arguments explicit in the function signature. The CLI exposes commands and options from subparsers, and function arguments are 'fake-positional': only use kwargs, but leave the required ones (like filenames in the compute(filenames) example) in the beginning to avoid having to define explicitly all the keywords.

Checklist

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

commands[cmd].add_args(subp)
subp.set_defaults(cmd=commands[cmd])

args = parser.parse_args()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

subparsers for all commands must be added before calling this. That's why I'm going for some sort of command.add_args function for each command.

cmd = args.cmd
args_dict = vars(args)
args_dict.pop('cmd')
cmd(**args_dict)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dirty trick with vars to convert the args (a Namespace object?) to a dictionary that can be expanded for the function call.

def compute(filenames=None, check_sequence=False, ksizes=(21,), dna=True, singleton=False,
email='', scaled=10000, force=False, output=None, num_hashes=500, protein=False,
name_from_first=False, seed=42, input_is_protein=False, merge=None,
track_abundance=False):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible issue: keeping defaults here and in the argparse code the same. Probably leave the add_args function before the command definition, so at least they are close to each other...

click uses decorators for this:

import click

@click.command()
@click.option('--count', default=1, help='Number of greetings.')
@click.option('--name', prompt='Your name',
              help='The person to greet.')
def hello(count, name):
    """Simple program that greets NAME for a total of COUNT times."""
    for x in range(count):
        click.echo('Hello %s!' % name)

if __name__ == '__main__':
    hello()

notify('WARNING: input is protein, turning off DNA hashing')
args.dna = False
args.protein = True
dna = False
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of changes in this diff is just removing args. from the arguments...

help='set resolution of signature size')
parser.add_argument('--seed', type=int, help='hash seed',
default=sourmash_lib.DEFAULT_SEED)
compute.add_args = compute_add_args
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dirty, horrible hack. Sorry. Will think about something better than this later 😄

@luizirber
Copy link
Member Author

luizirber commented May 20, 2017

I talked with @standage yesterday and he has another approach for kevlar, where you build a list of CLI-like arguments and then parse with argparse, passing the resulting args object into the main function: https://github.com/dib-lab/kevlar/blob/3130636d5b3822502708084e547b63b671fb2cc2/kevlar/tests/test_dump.py#L36
(and he is also using subparsers:
https://github.com/dib-lab/kevlar/blob/3130636d5b3822502708084e547b63b671fb2cc2/kevlar/cli/__init__.py#L53 )

@luizirber
Copy link
Member Author

@ctb
Copy link
Contributor

ctb commented May 21, 2017

@luizirber
Copy link
Member Author

@luizirber luizirber force-pushed the refactor/subparsers branch 4 times, most recently from d8cbe5e to c265796 Compare August 22, 2018 21:52
@luizirber luizirber changed the base branch from feature/utils_subcmd to master August 22, 2018 22:50
@luizirber
Copy link
Member Author

I was talking with @taylorreiter and it seems like this is a good PR to start going more into sourmash code, so I updated it to the current master.

After this PR was created there were code changes (for search and gather) where @ctb took some code away from sourmash/commands.py and into function that are easier to call in Python. The issue of exposing the CLI commands as Python functions is that they take A LOT of arguments, but I think it is still good to have them available.

else:
cmd = commands.get(args.command)
return cmd(sys.argv[2:])

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once all commands are converted we can remove this if block and keep only the first part (the else is the current implementation for subcommands)

from sourmash_utils.__main__ import main as utils_main
UTILS_AVAILABLE = True
except ImportError:
UTILS_AVAILABLE = False
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is here because the original PR was against #236 and I changed it to be against master to have Travis run the tests...

@ctb
Copy link
Contributor

ctb commented Aug 24, 2019

I updated in this PR in #722, which resolves the conflicts. Still have some kind of bug in there but it's safe to merge.

@codecov
Copy link

codecov bot commented Nov 2, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@f6ad66c). Click here to learn what that means.
The diff coverage is 96%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master     #245   +/-   ##
=========================================
  Coverage          ?   89.47%           
=========================================
  Files             ?       29           
  Lines             ?     4570           
  Branches          ?       46           
=========================================
  Hits              ?     4089           
  Misses            ?      478           
  Partials          ?        3
Impacted Files Coverage Δ
sourmash/__main__.py 92.68% <88.88%> (ø)
sourmash/command_compute.py 96.55% <98.24%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f6ad66c...61f07ea. Read the comment docs.

@luizirber
Copy link
Member Author

@ctb can you take a look at this commit: 193748c

This is adding a ksize_parser function for the args (that converts strings into a ksize list), and also modifying the cli test to also run with the compute function. If it looks OK, I would go modify the other tests (for sourmash compute) to match the modified one.

@luizirber
Copy link
Member Author

I'm also starting to feel like this should move from [RFC] to something actually mergeable.

prepare compute
test command
fix default ksize parsing
failing test: if output is a string or path, should we open it as a file?
@luizirber
Copy link
Member Author

Revamped this on top of #785, and updated the PR description. Comments @ctb @standage?

Copy link
Contributor

@ctb ctb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like everything about this, except for the final command "path" of sourmash.command_compute.compute, which seems ...long. No immediate solution (and happy to merge this!) but let's think about how to simplify that!

@luizirber
Copy link
Member Author

luizirber commented Jan 8, 2020

I like everything about this, except for the final command "path" of sourmash.command_compute.compute, which seems ...long. No immediate solution (and happy to merge this!) but let's think about how to simplify that!

yeah, the command_compute thing is... long.

What I've been doing in the Rust crate is creating submodules over time. Example: at the moment the src dir is

.
├── cmd.rs
├── errors.rs
├── ffi/
├── from.rs
├── index/
├── lib.rs
├── signature.rs
├── sketch/
└── wasm.rs

ffi/, index/ and sketch/ started as single files (like cmd.rs, signature.rs and wasm.rs are now). As more features were implemented I expand the single file into a submodule, but still using the top-level name (index, for example) to control what to export (for the public API).

On the Python side, it would translate to commands.py becoming a dir module, and importing/exposing items from its submodules.

Currently:

sourmash
├── cli/
├── command_compute.py
├── commands.py
...
└── utils.py

Proposed (like what sourmash.cli is doing now):

sourmash
├── cli/
├── commands/
│   ├── __init__.py
│   └── compute.py
...
└── utils.py

This way we can start with a massive implementation file (like commands.py is now) and keep the API working when we reorganize into a submodule and split the code (because both could be accessed by from sourmash.commands import compute, for example). Doing sourmash/command_compute.py goes against this (and I know you like flat structures =]), but it will also start to get hard to find files if all of them are in sourmash/ (which is similar to the sourmash/sbt*.py situation)

Before merging this, I think it is good to have some guidelines on how to proceed for all commands. From the top of my head:

  • Required arguments in the CLI -> positional arguments in the commands.py function
    • Ideally we could use * in the function signature and make some args keyword-only, but that's not valid in Python 2 =(
  • Keeping defaults from argparse in sync with defaults in the commands.py function
  • Commands taking output (like sourmash compute -o out input): how to map out into an argument? Accept strings and opened files (argparse open a file)? (This is the failing test in this PR, for example)
  • docstrings for each command (Add docstrings to functions in sourmash/commands.py #828)
  • testing: update all tests calling utils.runscript to also test the API command? That's what I did with test_do_sourmash_compute and test_do_sourmash_compute_output_valid_file in this PR.
    • Better: modify utils.runscript to do that, if it is passed an extra arg?

@ctb
Copy link
Contributor

ctb commented Jan 21, 2020

@luizirber has thus declared: "#853 should fix the commands-taking-output problem mentioned above." yay w00t!

@luizirber
Copy link
Member Author

Still think this is useful, especially considering pyodide #2433 and using it in a browser repl, but lots to update... And so I'm closing it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants