
Consistent interface for algorithms & single training script #79

Open
AdamGleave opened this issue Aug 14, 2019 · 17 comments
Labels
enhancement New feature or request
Milestone

Comments

@AdamGleave
Member

Currently we don't have any CLI script for behavioral cloning or the density baseline. I envisage this codebase as being particularly useful for rapidly benchmarking against a wide variety of algorithms. For this to be possible, we need (as far as possible) a consistent interface between the different imitation learning algorithms, just as Stable Baselines provides a consistent interface across all the RL algorithms it implements. Ideally we'd then also have imitation.scripts.train work for all of them, although there will clearly need to be some algorithm-specific configs.

@shwang
Member

shwang commented Aug 28, 2021

We now have an SB3-like .train(total_timesteps) interface for GAIL, AIRL, DAgger, and BC. We also have Sacred scripts for each of these algorithms, though they are not unified into a single train script.
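For concreteness, roughly what that looks like from the user's side (a sketch rather than exact constructor signatures; `rollouts`, `venv`, `rl_algo` and `reward_net` are assumed to be set up elsewhere):

```python
# Sketch only: argument names are abbreviated and may not match the library exactly.
from imitation.algorithms.adversarial.gail import GAIL

trainer = GAIL(
    demonstrations=rollouts,  # expert trajectories
    demo_batch_size=64,
    venv=venv,                # vectorized training environment
    gen_algo=rl_algo,         # e.g. a stable_baselines3 PPO instance
    reward_net=reward_net,
)
trainer.train(total_timesteps=100_000)  # analogous to SB3's .learn(total_timesteps)
```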

@AdamGleave
Member Author

We've made a lot of progress since this issue was first opened, but there's still room for improvement, so I'm leaving this open for now.

@ernestum
Collaborator

ernestum commented Jan 24, 2023

I think this issue is required before we can address #587 and start a new take on #602.
I reviewed the scripts and these are some points we should address:

This is the current relationship between scripts and algorithms:

| Algo | Script | Comments |
| --- | --- | --- |
| DAgger, BC | train_imitation | BC is a "hidden feature", enabled by setting use_dagger=False |
| MCE IRL | - | Do we need a script for this? |
| AIRL, GAIL | train_adversarial | Throwing them together into one script results in lots of weird constructs, such as an algorithm_specific configuration that is merged into the main config using dubious hacks, and configuration being modified at runtime. |
| PreferenceComparisons | train_preference_comparisons | - |

There is a reward ingredient that should probably be called reward_net. It takes care of edge cases for AIRL, which I think is a bad idea. It also has some named configs that merely enable/disable individual options; I think this is something that should be handled by a config_hook.
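For concreteness, a rough Sacred sketch of what I have in mind (all names and defaults here are made up, and the exact config-hook semantics would need checking):

```python
from sacred import Experiment, Ingredient

# Hypothetical ingredient: it only describes the reward network and carries
# no AIRL-specific edge cases.
reward_net_ingredient = Ingredient("reward_net")

@reward_net_ingredient.config
def reward_net_defaults():
    net_cls = "BasicRewardNet"  # illustrative default
    normalize_input = True

train_adversarial_ex = Experiment(
    "train_adversarial", ingredients=[reward_net_ingredient]
)

@train_adversarial_ex.config_hook
def airl_adjustments(config, command_name, logger):
    # Algorithm-specific adjustments live in a config hook on the script,
    # not inside the reward_net ingredient itself.
    updates = {}
    if config.get("algorithm") == "airl":
        updates["reward_net"] = {"net_cls": "BasicShapedRewardNet"}
    return updates
```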

There is a train ingredient which should be split up into a policy and an evaluation ingredient. Ingredients are objects, not verbs.

We should probably have an ingredient for each algorithm. This makes constructing more complex experiments (such as warm-starting) easier.
I also would vote for one train_<algo> script per algorithm, following the Unix philosophy that each tool should do only one job.
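To illustrate the kind of layout I'm imagining, here is a rough sketch of a thin per-algorithm script composed from reusable ingredients (all names and config fields are made up):

```python
from sacred import Experiment, Ingredient

policy_ingredient = Ingredient("policy")          # the object being trained
evaluation_ingredient = Ingredient("evaluation")  # how the result is evaluated
bc_ingredient = Ingredient("bc")                  # algorithm-specific settings

@policy_ingredient.config
def policy_defaults():
    policy_cls = "MlpPolicy"

@evaluation_ingredient.config
def evaluation_defaults():
    n_episodes_eval = 10

@bc_ingredient.config
def bc_defaults():
    batch_size = 32

# The script itself only wires the ingredients together, so its config
# contains nothing that is specific to any other algorithm.
train_bc_ex = Experiment(
    "train_bc",
    ingredients=[policy_ingredient, evaluation_ingredient, bc_ingredient],
)

@train_bc_ex.main
def train_bc(policy, evaluation, bc):
    # Each argument resolves to the config of the corresponding ingredient.
    print(policy, evaluation, bc)

if __name__ == "__main__":
    train_bc_ex.run_commandline()
```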

@AdamGleave
Member Author

AdamGleave commented Jan 31, 2023

> We should probably have an ingredient for each algorithm. This makes constructing more complex experiments (such as warm-starting) easier.

Note sacred.Experiment is-a sacred.Ingredient (parent class), so I think we already have this more or less? EDIT: After seeing #663 I do think there's a case for making BC an ingredient if we want to use it to warm-start a variety of other algorithms, and because it already has a compositional relationship with DAgger. BC might be a bit of a special case in that regard though.

> I also would vote for one train_<algo> script per algorithm, following the Unix philosophy that each tool should do only one job.

I feel torn here. It's nice to be able to do benchmarks with a variety of algorithms using a consistent interface. Having one script with different subcommands seems more natural here. You could split it up into different scripts, but it seems like a bit of a nightmare to maintain a consistent interface across many files, although perhaps doable if we can factor out most things into ingredients?

However, the algorithms' fundamental properties differ enough that we can't literally have the same interface for all of them. The current approach was a compromise, grouping similar algorithms together: e.g. AIRL/GAIL are almost identical except for the parametric form of the reward network. But I agree the way they (and BC/DAgger) are currently merged is pretty ad hoc.

Before we do a complete rewrite of the scripts, I do think it's worth reflecting on whether we want to keep building on top of Sacred. I do like it, but the project is sadly basically abandonware at this point.

@ernestum
Collaborator

Good point that each Experiment is actually an Ingredient. I will double-check where it makes sense to turn algorithms into ingredients. It would be pointless to have e.g. a DAggerExperiment that only has a single DAggerIngredient.

IMO the consistent interface should result from a consistent design of the ingredients and their configuration. The scripts just pull the ingredients together to form experiments. They allow you to run an experiment with the given ingredients, but also to explore what configuration values it needs. When I run python -m imitation.scripts.<some_script> print_config I expect it to only show me the configuration specific to what <some_script> does. E.g. right now we use train_imitation to train both BC and DAgger. If I am only interested in training BC, I don't want to be bothered with the configuration values for DAgger in the print_config output.

Throwing AIRL and GAIL together eventually resulted in some weird design choices (see the comments in the table above). It is tempting to fuse them since they are so similar, but I feel like the decision to generalize here (which is a good idea on its own) required so many hacks that they outweighed the gained elegance/reduction of redundancy.

I totally agree that we should review the usage of Sacred before we put too much time into refactoring the scripts.
Why do you believe that Sacred is abandoned? Looking at their releases page, it does not seem too bad.
Alternatives to consider are Guild AI and MLflow. After a quick glance, Guild AI looks more promising, but we should investigate this more in depth.

@AdamGleave
Member Author

> E.g. right now we use train_imitation to train both BC and DAgger. If I am only interested in training BC, I don't want to be bothered with the configuration values for DAgger in the print_config output.

That's a good point; I think I'm convinced it's worth splitting up the scripts, so long as we can avoid substantial code duplication by splitting things into ingredients.

> Throwing AIRL and GAIL together eventually resulted in some weird design choices (see the comments in the table above). It is tempting to fuse them since they are so similar, but I feel like the decision to generalize here (which is a good idea on its own) required so many hacks that they outweighed the gained elegance/reduction of redundancy.

I agree there are a lot of hacks. I still feel a bit uncertain about this; it's probably easier to discuss with a concrete idea of how to split them up, what can be shared via ingredients, etc. It would be nice to have an easy way to share configs between AIRL/GAIL, as the right hyperparameters are often quite similar, although this use case is less common now that we're using automatically tuned hyperparameters (the configs saved in benchmarks/).

> I totally agree that we should review the usage of Sacred before we put too much time into refactoring the scripts. Why do you believe that Sacred is abandoned? Looking at their releases page, it does not seem too bad. Alternatives to consider are Guild AI and MLflow. After a quick glance, Guild AI looks more promising, but we should investigate this more in depth.

Abandoned was too strong a word. The original maintainers of Sacred stepped back from the project and did not hand over cleanly. Currently @thequilo has been doing a heroic job reviewing and fixing outstanding issues, and things are improving now that he's got access to make PyPI releases. However, the situation is a bit precarious: if @thequilo leaves there's no one else, and he still doesn't have full permissions on the project. IDSIA/sacred#871 (comment) and IDSIA/sacred#879 have some relevant discussion.

Thanks for looking into alternatives. From a quick skim of Guild AI's docs I agree it seems promising, though there are some design decisions I feel hesitant about (e.g. treating global constants as flags by default...).

@thequilo

@AdamGleave is not wrong: there is a risk of sacred becoming unmaintained when I step away. As I am currently a Ph.D. student, and it's completely unclear what will happen after I finish my Ph.D., this might happen in the not-so-far future. But since my lab relies heavily on sacred for all experiment tracking, I'm sure we'll find someone from my lab to continue maintaining sacred, given that Qwlouse gives us access.

@AdamGleave
Member Author

Thanks for clarifying @thequilo -- good to know someone else in your lab is likely to pick it up. And good luck wrapping up your PhD!

@ernestum
Collaborator

I had a deeper look at Guild AI, MLflow and neptune.ai. They are all mostly geared toward logging, analysis and reproducibility. Compared to sacred, they lack a (hierarchical) configuration management system with experiments and (reusable) ingredients.

I think we have two options here (I would be interested in your opinion, @AdamGleave; I lean towards option 1):

1. Stick with sacred

... and hope that it stays maintained. Maybe put some of our time into maintaining it if necessary, maybe push the rl-baselines-zoo to also use sacred.

Pro

  • avoids lots of code duplication
  • easy to add new parameters ad-hoc (no need to change config files for that)

Con

  • we depend on sacred and it is unclear for how long it will be maintained
  • the nested nature of experiments/ingredients invites (often unnecessary) complexity in the experiment scripts, which might sometimes outweigh the benefits of reduced code duplication

2. Use any of the other frameworks

... (probably Guild AI) and use their configuration system (or something else like argparse/click)

Pro

  • We don't depend on sacred (and if using Guild AI, we even minimize our dependency on any one particular framework)
  • Other tools support things like HP tuning/grid runs out of the box (no need to maintain our own parallel.py script)

Con

  • We need to rewrite all our scripts
  • Configuration of experiments will become harder (we need to do everything with argparse/click/global variables and there is no support for nested config scopes)

@AdamGleave
Member Author

I'll try to look at this next week. @qxcv, any thoughts on the above? I think you've used Sacred a fair bit.

@qxcv
Member

qxcv commented Mar 16, 2023

I don't have a strong opinion on MLFlow/Guild AI/neptune.ai. I have used and like click (for simple stuff), especially over argparse. As for Sacred, I only used it for the EIRLI project. I've since moved away from using Sacred because (1) I only needed config parsing, and not the more sophisticated features (like tracking git revisions, logging to S3/wandb/whatever, etc.), (2) I couldn't get the configs-as-functions stuff to play well with type checking, and (3) I found argument parsing/option override behaviour hard to reason about.

My current go-to is Hydra with structured configs. Structured configs mean that your config is just a dataclass. This was an improvement for me, but still has shortcomings, like poor support for union types and a slightly awkward API for turning on/off "magic" features. Also it may not do some of the things you want if you're already using all the features in Sacred. It has 7k GitHub stars & is used/maintained by Facebook, though, so I expect that it will get better over time. It also has a surprising number of plugins/integrations (submitit/Slurm, Ray, etc.).
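A minimal example of the structured-config style (the config fields here are just placeholders, not a proposal for imitation's actual config):

```python
from dataclasses import dataclass, field

import hydra
from hydra.core.config_store import ConfigStore

@dataclass
class BCConfig:
    batch_size: int = 32
    n_epochs: int = 10

@dataclass
class TrainConfig:
    env_name: str = "CartPole-v1"
    bc: BCConfig = field(default_factory=BCConfig)

cs = ConfigStore.instance()
cs.store(name="train_config", node=TrainConfig)

@hydra.main(version_base=None, config_name="train_config")
def main(cfg: TrainConfig) -> None:
    # cfg behaves like the nested dataclass; command-line overrides such as
    # `bc.batch_size=64` are validated against the declared fields.
    print(cfg.bc.batch_size)

if __name__ == "__main__":
    main()
```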

@ernestum
Collaborator

I think we mostly use sacred for configuration management and then W&B for logging? @AdamGleave, you probably know better which other features of sacred are used, but in the code it is mostly about configuration.

I will have a look at Hydra, thanks for the hint! What made you move from click to Hydra?

@qxcv
Member

qxcv commented Mar 17, 2023

> What made you move from click to Hydra?

Mostly the fact that I could declare my config as a (nested) dataclass & pass around the appropriate dataclass for each part of the code. Doing that with Click would require a lot of code duplication (repeating arguments as both Click options and dataclass members) and some extra parsing magic (to handle the nested options).
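The nice part is that each component only takes the slice of the config it actually needs, e.g. (reusing the made-up BCConfig from the earlier sketch):

```python
from dataclasses import dataclass

@dataclass
class BCConfig:
    batch_size: int = 32
    n_epochs: int = 10

def train_bc(cfg: BCConfig) -> None:
    # Only the BC-specific slice of the config is visible here, and a type
    # checker knows exactly which fields exist.
    for _ in range(cfg.n_epochs):
        ...

# The Hydra entry point would call train_bc(cfg.bc), so the nesting of the
# command-line overrides mirrors the nesting of the dataclasses.
train_bc(BCConfig(batch_size=64))
```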

The Sacred-like experiment management features in Hydra (automatically creating experiment directories, automatically setting up the logging module) are sometimes mildly convenient, but honestly something like this library would have solved my problems just as well.

@ernestum
Collaborator

ernestum commented Mar 20, 2023

I think the idea of putting configs in dataclasses is interesting.
I could implement a draft port of one of the scripts to Hydra to explore any unforeseen difficulties.
What do you think, @AdamGleave?

Edit: looks like Hydra would let us get rid of the parallel.py script!

@AdamGleave
Member Author

I'm in favor of trying to port one script to Hydra to try it out.

Eliminating parallel.py would be nice, I wrote that as a temporary hacky solution and somehow it's stuck around.

One feature Hydra seems to be missing is logging the Git commit hash and other things needed for reproducibility. That said, it seems like one could add that via callbacks, and WandB already does it, so I don't think we need our config framework to do it.

ernestum mentioned this issue Apr 14, 2023
@ernestum
Collaborator

Feel free to have a look at #703 to see how it turned out.

@qxcv
Member

qxcv commented Apr 26, 2023

Wow, thanks for all that work converting things to Hydra! It's using a lot of patterns that I didn't even know existed, like instantiating objects with _target_. I haven't reviewed #703 in detail, but the PR seems very cool from just a skim.
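For anyone else following along, the `_target_` pattern is roughly this (the target class and arguments below are just an example):

```python
from hydra.utils import instantiate
from omegaconf import OmegaConf

cfg = OmegaConf.create(
    {
        "reward_net": {
            "_target_": "torch.nn.Linear",  # any importable callable can go here
            "in_features": 4,
            "out_features": 1,
        }
    }
)

# instantiate() imports the target and calls it with the remaining keys,
# i.e. this is equivalent to torch.nn.Linear(in_features=4, out_features=1).
reward_net = instantiate(cfg.reward_net)
print(reward_net)
```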

ernestum added this to the Release v1 milestone May 24, 2023
ernestum modified the milestones: Release v1.0, Release v1.x Jul 28, 2023