Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: configuring modules #46

Closed
karlicoss opened this issue May 10, 2020 · 15 comments
Closed

Discussion: configuring modules #46

karlicoss opened this issue May 10, 2020 · 15 comments

Comments

@karlicoss
Copy link
Owner

karlicoss commented May 10, 2020

Discussion around the design decisions here

@purarue
Copy link
Contributor

purarue commented Aug 27, 2020

Just some thoughts on configuring/modifying this.

This is heavily biased as someone who is familiar with python, and I expected that I'd have to fork/modify HPI to fit my needs.

Edit: Though, not sure if forking/modifying is the common case, so take this with a grain of salt. Is just my though process on using this so far.

In general, I'm happy with configuration/tooling that this has provided around a layer to load data. Splitting up the configuration into ~/.config/my and using dataclasses to validate attributes makes sense once I've seen it a few times.

Though, when configuration is programmatic (python), and theres a lack of documentation/dataclass (e.g. my.coding.commits), its unclear what exactly to do. I was able to figure out what fields were necessary by inference, but in a larger namespace/module that might not be obvious.

(sidenote; I believe its this, feel free to correct me if I'm wrong)

class commits:
    names: List[str] = ["Sean Breckenridge"]
    emails: List[str] = ["[email protected]", "email two...."]
    roots: Paths = [Path(environ["REPOS"])];

I'd think that everyone has different modules/needs. Not everyone is going to use the same hardware, social media, or systems for note taking, so if your objective is to have programmatic access to all of your data, I don't think pip install hpi would work in most cases. Either pip install -e . or with_my seem like fine solutions to someone who wants to heavily configure/modify this.

I feel like using HPI is closer to the process of curating a vim/emacs/dotfiles configuration. This provides a useful starting point/core for handling data, and the exporters (rexport/ghexport) can be plugged in for commonly used services. Then (optionally), you write your own custom modules for your own data sources. Its slightly more annoying in this case, because of all the parsing/normalization required for ever-changing APIs/Data-Sources, but when/if exporters/DALs are somewhat standardized/pushed to pypi, I think a configuration file/object in ~/.config/my that specifies which modules are active/inactive - I'm a fan of how doom emacs does this - could be useful. Similar to doom/spacemacs, each module/namespace package could (optionally) have a doctor.py which would do checks (run tests or check if packages (like cachew/gitpython/pytz) are installed), model.py for the NamedTuple representation, and/or __init__.py/<modulename>.py, which would be the python module one would import from the REPL to get access to the data.

I'm not saying this should be extended to the extent that spacemacs has been, just that splitting it into namespace packages with an option to enable/disable 'modules' with common files makes it easier for someone to understand/edit, and it provides more structure than the current solution (I'd prefer to not mess with core on my fork whenever possible).

Commenting on the other thread, I'm sort of weary of the idea that users who are less familiar with python could place custom python files as modules in ~/.config/my/plugins. For a newer user, it would also make possible code reuse across namespaces more complicated. Personally, I prefer the idea of cloning/editing this, and then using with_my/pip install -e . to allow import my... in a REPL/python script, but its possible it could work either way.

As a side note, the random TODO and general thought-process comments have been helpful in modifying/creating modules of my own.

@karlicoss
Copy link
Owner Author

karlicoss commented Aug 30, 2020

Hey, thanks for taking time into digging into the code, really appreciated!

Though, not sure if forking/modifying is the common case

It's hard to tell what's a common case, the whole thing is pretty experimental! Would be great to get the "APIs" right so they can suit everyone, but I always keep in mind that you can't really suit anyone, so forking and modifying is definitely one of the usecases I'd like to work for people!

I feel like using HPI is closer to the process of curating a vim/emacs/dotfiles configuration.

Yes!

Though, when configuration is programmatic (python), and theres a lack of documentation/dataclass (e.g. my.coding.commits), its unclear what exactly to do. I was able to figure out what fields were necessary by inference, but in a larger namespace/module that might not be obvious.

Yep, definitely agree! For an example that's hopefully a bit easier to follow and similar to what you suggest, check out Polar module, for example: https://github.com/karlicoss/HPI/blob/master/my/reading/polar.py#L24-L30
, and the corresponding bit of documentation, autoextracted from the code.

I'm writing a bit more about the approach I'm currently taking here, (after the "My conclusion was using a combined approach" line). I should add there an example with default attributes, perhaps.

Similar to doom/spacemacs, each module/namespace package could (optionally) have a doctor.py which would do checks (run tests or check if packages (like cachew/gitpython/pytz) are installed).

Yep, good idea! Currently hpi doctor searches for stats() function in the module, and cals it, e.g.

def stats():
from ..common import stat
return stat(measurements)

So in a sense this does end-to-end testing of the module, along with giving a short summary of your data.

~ ✔ ❯ hpi doctor my.bluemaestro --verbose
✅ config file: /home/karlicos/.config/my/my/config/__init__.py
✅ mypy check: success
   Success: no issues found in 4 source files
✅ OK  : my.bluemaestro
✅     - stats: {'measurements': {'count': 1151314, 'last': datetime.datetime(2020, 8, 26, 10, 45, 6, 988000)}}

~ ✔ ❯ hpi doctor my.emfit --verbose
✅ config file: /home/karlicos/.config/my/my/config/__init__.py
✅ mypy check: success
   Success: no issues found in 4 source files
✅ OK  : my.emfit
✅     - stats: {'pre_dataframe': {'count': 691, 'errors': 9}}

Another nice thing about this is that it these stats can be used as a part of script that regularly checks that new data gets pulled in, similar to "metrics" idea described here.

While this will obviously error if some optional PIP package is missing, definitely agree that ideally these checks should be a bit more user friendly. Also would be nice to automatically intergrate it with the setup script.

Ideally I think I'd like to make it at least possible to keep everything in a single file (to make writing new modules as easy as possible). But these are fine details, and it could be a combined approach, e.g. if doctor.py doesn't exist, search for class Doctor in __init__.py, or something.

enabling/disabling modules

Hm, can you elaborate on 'enabling' modules? In Emacs this makes sense because many modules are on hooks and depend on the file mode, whereas in HPI, modules are imported explicitly, so if you don't import it, you won't even know about the module!

I guess personally I can see couple of points where explicit enabling/disabled could be useful:

  • hpi doctor: perhaps useful to restrict modules when it's run without any arguments to avoid visual spam from the modules one is not using
  • I was thinking of maybe providing an automatic interface where one could explore all their data in a browser (based on NamedTuples/pandas DataFrame, etc). Also could be useful to only import stuff that you want.

And it provides more structure than the current solution (I'd prefer to not mess with core on my fork whenever possible)

Ah, definitely agree! The thing in util.py is hardly a solution, more of a temporary hack for hpi doctor CLI till we figure out something better. It only ended up in the core because the CLI is in the core :)

As a side note, the random TODO and general thought-process comments have been helpful in modifying/creating modules of my own.

Thanks! Glad that my habit of random todos is useful not just for me.

@purarue
Copy link
Contributor

purarue commented Aug 31, 2020

Hm, can you elaborate on 'enabling' modules? In Emacs this makes sense because many modules are on hooks and depend on the file mode, whereas in HPI, modules are imported explicitly, so if you don't import it, you won't even know about the module!

May have been naive optimism on my part, dynamic code loading like this isn't something I've ever dealt with before.

To elaborate on my thought process a bit when initially modifying this, on master (your repo), there are lots of things that I'm not using. I don't use twitter, or blemaestro, or lastfm, and I have my own modules that I'd replace that with.

So, for example, when I'm modifying my.body.weight, (I don't use emacs as much, though org mode would probably be the reason I would), instead of using org mode, I wrote a little TUI to read/write from a CSV file instead (unrelated, but ended up creating an entire package to create those TUIs for me, so I can manually log lots of other things, like food, weight, water etc., since I don't have org mode tables/tags). But, in your my.body.weight, it imports from ..notes import orgmode, so this means that modules are loading in functionality from across modules (that aren't core), and they're not always independent.

Modules not being independent is fine, but them not being independent is what led me to decide to fork this and modify it by removing/adding my own modules, instead of trying to figure out some way to add to ~/.config/my/plugins as described here.

Perhaps modules using other modules should be should be put behind a block like:

try:
	from ..notes import orgmode
except ImportError as e:
    # so that this appears in the `hpi doctor` output in addition to the `my.config` import error.
    warnings.warn("Could not import orgmode from notes, perhaps you've modified that?")

# func below could check if `orgmode` `NameErrors` before calling `stats` (if a `stats` function existed in `my.body.weight`)

Its not really that bad technically, because hpi doctor gives me the --verbose flag, and I can go ad add/edit those imports if I want, but that wasn't obvious to me to begin with. I think it'd just be nice to have from the perspective of someone modifying this, to better understanding how the module system works/doctor works.

Currently, a bunch of the doctor calls fail:

❗ FAIL: my.pdfs                        loading failed; pass --verbose to print more information
❗ FAIL: my.photos                      loading failed; pass --verbose to print more information
...
✅ OK  : my.reddit
✅     - stats: {'saved': {'count': 50, 'last': datetime.datetime(2015, 7, 30, 5, 27, 54, tzinfo=datetime.timezone.utc)}, 'comments': {'count': '100+'}, 'submissions': {'count': 39}, 'upvoted': {'count': '100+'}}
❗ FAIL: my.rss.all                     loading failed; pass --verbose to print more information
❗ FAIL: my.rss.feedbin                 loading failed; pass --verbose to print more information

(presumably because I don't have the blocks in my.config)

(Yeah)

If you only have few modules set up, lots of them will error for you, which is expected, so check the ones you expect to work.

I'm just leaving what my thought process was here, not sure if the catch ImportError blocks are totally needed? Since, HPI in general has to be personally customized anyway. Its not strange that modules aren't completely independent, and perhaps that's fine.

For an example that's hopefully a bit easier to follow and similar to what you suggest, check out Polar module

Yeah, makes sense to me.

I'll take a look at polar and perhaps add it back, I still don't have a great solution for reading/documents/pdfs.

But, when I was initially trying to parse through the modules, I wasn't sure if polar is something that I'd want to use or not, but since I wasn't sure all these different modules were interconnected somehow like my.notes was to my.body.weight, I deleted some of the ones I wouldn't end up using.

Ah, definitely agree! The thing in util.py is hardly a solution, more of a temporary hack for hpi doctor CLI till we figure out something better. It only ended up in the core because the CLI is in the core :)

Ah, yeah. I think this was the major misunderstanding for me.

It seems that it already picks up the modules I've defined, the exclude list in util.py made me think it wouldn't.

>>> from my.core import util
>>> util.get_modules()
['my.body', 'my.body.weight', 'my.body.weight_prompt', 'my.browsing', 'my.coding.commits', 'my.demo', 'my.github.all', 'my.github.common', 'my.github.gdpr', 'my.github.ghexport', 'my.google.takeout.html', 'my.google.takeout.paths', 'my.media.movies', 'my.media.youtube', 'my.pdfs', 'my.photos', 'my.reading.polar', 'my.reddit', 'my.rss.all', 'my.rss.feedbin', 'my.rss.feedly', 'my.smscalls', 'my.stackexchange', 'my.todotxt', 'my.zsh']

I didn't realize that was already what doctor did, searching for the stats function and directly calling it, makes more sense now. I guess what I wanted wouldn't be that hard to implement.

perhaps useful to restrict modules when it's run without any arguments to avoid visual spam from the modules one is not using

This is mostly what I meant by enabling/disabling. One would still be able to import my.module, even if module is 'ignored'. Disabling a module would mean it doesnt show up in hpi modules or hpi doctor, and perhaps for any future dashboard/browser-like projects that you describe (I eventually plan to do something similar)

Or perhaps hpi modules shows:

$ hpi modules
- my.bluemaestro [disabled] # with terminal color red?
- my.body
- my.body.weight

Perhaps util.py could optionally import a user_config from my.config, that looks something like:

class modules:
    ignored = [
		"bluemaestro",
       "polar"
    ]

class github:
    ....

and those chould be added to the values ignored returns, which influences hpi doctor/hpi modules

The reference to doom emacs was also partly because it provides me with a list of modules when I'm starting out.

Perhaps hpi modules could be extended or something else added to generate something like the output of this?

$ hpi modules | sed -e 's/my\.//' -e 's/^- / # "/' -e 's/$/",/' | awk 'BEGIN {print "class modules:\n ignored = ["} {print $0} END {print "]"}'

class modules:
  ignored = [
    # "body",
    # "body.weight",
    # "body.weight_prompt",
    # "browsing",
    # "coding.commits",
    # "demo",
....

(or just generate a list and put it in the doc folder somewhere?)

That way I have a hook into core.util, even if it is a hack for now. That would mean I could disable polar, bluemaestro from showing up in doctor and anything else I wouldn't be using for a while, without explicitly deleting the file.

Of lesser importance, but may also mean that when I'm merging from upstream on my fork, might have less patch issues/merge conflicts, since I havent removed those files.

I know I could just edit the list in excluded function myself, but I think it'd be nicer if it was configurable without editing core.

at least possible to keep everything in a single file (to make writing new modules as easy as possible).

Yeah, I think this is fine as well. agree that keeping everything in a single file is nice when starting modules.

e.g. if doctor.py doesn't exist, search for class Doctor in __init__.py, or something.

Yeah, sounds good.

I think how to modify doctor (and perhaps the stats function/Doctor class) could be described in more detail somewhere here or here? I think thats what led to most of my confusion.

https://brokensandals.net/technical/backup-tooling/making-backup-validation-easier/

Yeah, I can see how that'd be a nice way of providing a snapshot of the amount of data by using doctor. Will probably do the same for my modules now that I know it works like that.

Has cleared up a couple of my issues, thanks.

@karlicoss
Copy link
Owner Author

So, for example, when I'm modifying my.body.weight

Yeah, agree it would be a problem for modules like that! Just like you mentioned for food, weight, water -- people use gazillion of different formats/methods for logging it, and it's unlikely it can be handled it in a generic way (even I used many different methods for that throughout my life!). Perhaps this is where people would have to write their own modules to handle this, unless they are using some widespread app (e.g. myfitnesspal or something).

I guess I only shared my org-mode my.body.weight module as a demonstration. I think later, I'll move it away completely to my 'personal' overlay, so there is less confusion!

Ideally I think, my.body.weight would only contain a 'common' interface with a "reasonable", minimalist schema. For example, with the ft: datetime and weight: float fields. Tools that consume HPI weight data do it through this common interface, as an example of such tools, I'm working on sharing a quantified self dashboard I'm using (some screnshots here ).
Then, whoever wants to connect their weight data sources with the dashboard, would need to write a small adapter to this schema, and that's it. It would also allow use multiple data sources, and combine/merge them with just few lines of code.

To be more specific, I tried this with some existing modules:

This is pretty experimental, there are some issues with this approach, so very open to suggestions if you have some thoughts on this!

I wrote a little TUI to read/write from a CSV file instead (unrelated, but ended up creating an entire package to create those TUIs for me, so I can manually log lots of other things, like food, weight, water etc., since I don't have org mode tables/tags)

Nice 👍 I landed on org-capture (which can quickly record an org-mode table row), but I like the automatic namedtuple conversions. I had some ideas regarding type safety for org-capture, but so far there are only few such datasources for me, so I put it on hold.

Since, HPI in general has to be personally customized anyway. Its not strange that modules aren't completely independent, and perhaps that's fine.

Yep, as I said above, this particular modules (my.body.weight) is pretty specific to my usecase, so that's how it ended up with a dependency. I think it's possible for modules to have meaningful dependencies that would be useful for everyone, e.g. if it's something to do with the filesystem perhaps? In a sense, my.core is one such common dependency. But yeah, mostly I'd expect such complicated modules to be mostly private/personal. The most common usecase I imagine is similar to my.notes -- someone could have Google Keep, or Nomie, or something else instead as their datapoints storage, and having a common module that extracts data helps.

This is mostly what I meant by enabling/disabling
Perhaps hpi modules could be extended or something else added to generate something like the output of this?

Agreed, makes sense. I think there could be both enabled and disabled config sections; so depending on which one you specify it either only enables modules you listed explicitly, or enables everything and disables only the ones you listed. I guess disabled/ignored is mostly useful to me at the moment rather than other people, but hopefully doesn't hurt to have :)

Of lesser importance, but may also mean that when I'm merging from upstream on my fork, might have less patch issues/merge conflicts, since I havent removed those files.

Yes! Definitely wouldn't want people to remove files just because they get in the way.

think how to modify doctor (and perhaps the stats function/Doctor class) could be described in more detail somewhere here or here?

Yep! stats is somewhat recent, so I was reluctant to document it until it's "stable", but agree that ultimately it's useful to add to the docs.

P.S. Sorry for lag in replying, just got back from holiday and things piled up. Normally I reply quicker!

@karlicoss
Copy link
Owner Author

karlicoss commented Sep 28, 2020

Some initial work on enabled/disabled modules, would be happy for feedback! #85

@purarue
Copy link
Contributor

purarue commented Sep 30, 2020

After some fiddling to get it to work on my fork, Works great!

@karlicoss
Copy link
Owner Author

relevant issue: #211
importing a module when you don't have a config section results in a somewhat obscure import error, wonder if we can improve it

@purarue
Copy link
Contributor

purarue commented Feb 9, 2022

Yeah... may be possible to inspect the error to see if its importing from my.config?

can see if thats easy to do

@karlicoss
Copy link
Owner Author

Yeah, ideally would be nice to achieve lazy configs with some magic, so attributes are evaluated on the first use
But perhaps parsing the exception is nice to start with

karlicoss added a commit that referenced this issue Feb 9, 2023
properly reload/unload the relevant modules so hopefully no more weird hacks should be required

relevant
- karlicoss/promnesia#340
- #46
karlicoss added a commit that referenced this issue Feb 9, 2023
properly reload/unload the relevant modules so hopefully no more weird hacks should be required

relevant
- karlicoss/promnesia#340
- #46
karlicoss added a commit that referenced this issue Feb 9, 2023
properly reload/unload the relevant modules so hopefully no more weird hacks should be required

relevant
- karlicoss/promnesia#340
- #46
karlicoss added a commit that referenced this issue Feb 9, 2023
properly reload/unload the relevant modules so hopefully no more weird hacks should be required

relevant
- karlicoss/promnesia#340
- #46
karlicoss added a commit that referenced this issue Feb 9, 2023
properly reload/unload the relevant modules so hopefully no more weird hacks should be required

relevant
- karlicoss/promnesia#340
- #46
@karlicoss
Copy link
Owner Author

So, recently I've been thinking about HPI module configuration again -- relevant doc is here. Would be interested what you think @seanbreckenridge before I incorporate this into the docs!

To recap, currently most modules (let's say my.module):

  1. define their class config as a top level definition in the module
  2. wrap it in @dataclass
  3. inherit from the corresponding my.config.user_config

I find that (3) specifically has actually turned out to be somewhat problematic.
Since the config is in the top level, my.config.user_config also has to be imported in the top level, so my.config is imported and evaluated during the import of my.module. This introduces some unfortunate side effects, making much harder to

  • test my.module -- if you want to run the module against a 'test config' instead of the real one
  • hack/monkey patch my.module, e.g. dynamically modify the behaviour in your personal overlay

While it's kind of possible (e.g. via tmp_config decorator), it uses some heavy sys.modules manipulation -- turned out to be extremely unintuitive and flaky, imposes some artificial restrictions on where we should import stuff from HPI in general, etc.

Instead we should aim to avoid importing the user config until the moment it's actually needed, i.e. first runtime use of the actual config.

So I tinkered a bit, did some experiments and investigated other options, want to share them here to potentially discuss before making them 'official' documentation.

The main goal of experiment was to make sure we can defer user config evaluation until the config is actually used, while at the same time:

  • make sure it's still possible to check configs with mypy, in particular flagging any missing required attriburtes (i.e. export_path in my examples)
  • make sure proposed changes don't break backwards compatibility with existing user configs
  • make sure it doesn't add (too much) boilerplate

Proposal: avoid accessing config attributes as class variables

First thing is that we should avoid accessing config via the type, preferring config instances. Currently there are many places where we do access config attributes via the class, e.g.

HPI/my/pinboard.py

Lines 18 to 29 in 5ec3579

@dataclass
class config(my.config.pinboard): # TODO rename to pinboard.pinbexport?
# TODO rename to export_path?
export_dir: Paths
# TODO not sure if should keep this import here?
Bookmark = pinbexport.Bookmark
def inputs() -> Sequence[Path]:
return get_files(config.export_dir)

Upsides of using instance vs type:

  • less general confusion and more consistency

  • easier for testing modules -- much easier to create a custom instance of a config instead of hacking class attributes

  • we need to instantiate configs anyway if we want to apply backwards compatibility migrations

  • for 'lazy' config attributes, we can utilize properties. @property is a builtin in python, unlike @classproperty which is deprecated since 3.11

  • there is a mypy bug/quirk when handling class variables in dataclasses.

    Consider:

    @dataclasss
    class user_config:
        # whoops typo.. this was meant to be export_path
        export_dir: str = '/path/to/data'
    
    @dataclass
    class config(user_config):
        export_path: str

    If we now use config.export_path in code, it will crash in runtime when accessing it.

    However, mypy wouldn't actually complain: https://mypy-play.net/?mypy=master&python=3.12&flags=strict&gist=2c0cec417b726435303ab8b25c07cccf
    (TODO link to issue tracker if I create one? asked on mypy gitter for now)

    However, if we instantiate the config (e.g. cfg = config()), mypy would complain that export_path is a required attribute and it's missing, just as we want it to.

Downsides: hopefully none?

Notes:

  • There is actually a my.core.make_config helper that does instantiate the config (alongside with potential config migrations), but it's easy to forget to use it

Experiments

I experimented in isolation to make things simpler. The user config is defined here: https://github.com/karlicoss/HPI/blob/wip-config-rethink/doc/experiments_with_config/src/pkg/config.py -- in four different 'flavours'.Of course, ideally we'd support all possible ways of defining user config or at least provide runtime backwards compatibility.

The comments contain results of experiments against three different ways of 'mixing in' the user configuration:

Proposal: use regular class for config definition with properties (possibly abstract ones)

I quickly refactored one of the modules on a branch as an example how would it look like: master...wip-config-rethink#diff-bb8c3e5fdb4fe8a0b9f92f66ee2092d7178d7071c6b94a060526ff1fb1b90304 .

Here's is how it vaguely would look like:

from abc import abstractmethod

class config:
    @property
    @abstractmethod
    def export_path(self) -> str:
        raise NotImplementedError

    @property
    def cache_path(self) -> str | None:
        return None

def make_config() -> config:
    from pkg.config import module_config as user_config

    class combined_config(user_config, config): ...

    return combined_config()

Some explanations:

  • having config definition wrapped in @dataclass (the way we're doing it now) can actually be somewhat problematic,

    After some testing, it seems like we can guarantee better mypy/runtime behaviour if we define config attributes as properties (abstract properties if the attribute is required, non-abstract if it has default value).

  • each module defines a helper function to actually mix in the user config

    This way my.config is only imported on first make_config call, which makes it much easier to setup a test/mock environment with a custom config.

    In addition, this structure sort of forces us to instantiate the config, making it harder to access it via class variables by accident.

Upsides:

  • solves the early user config import issue
  • achieves even better mypy safety (e.g. no more false negatives with missing dataclass attributes mentioned above), allows using both regular attributes and properties while staying mypy checkable
  • achieves even more flexibility comparing to using dataclasses -- no restrictions on order of default/required attributes, no hard requirement for user config to have type annotations

Downsides:

  • all the decorators are a little bit of a mouthful (comparing to current dataclass-style definitions)

    It also makes it a little harder to extract documentation for modules automatically.

    I'll experiment more, perhaps it's possible to write some mypy-friendly helper functions, e.g. something like

    from my.core.cfg import required, optional
    class config:
        export_path = required(str)
        cache_path  = optional(str | None, default=None)
  • make_config adds extra 1-2 LOC of boilerplate, e.g. combined_config definition

Overall

In general, these proposals are not really 'binding' in any way.
It doesn't require changing all the modules at once, it's possible to quickly write a new module without all that abstractmethod business.
Perhaps it's more of a 'goal' of what ideal module would look like, e.g. when you start hardening it so it's usable by other people, tested, etc.

@purarue
Copy link
Contributor

purarue commented Aug 21, 2024

avoid importing the user config until the moment it's actually needed

yeah, I think this is generally a good idea.

I think having the from my.config import something as user_config indented in a function is good, but am just trying to figure out how this would be used by other functions.

As a sidenote, the more I've used definitions like these:

config = make_config(...)

def history(paths: Callable[[], Iterator[Path] = inputs) -> Results:
     # use inputs or some other function that the user passed

... the less I like them. mypy tends to have issues with them, I think I'd prefer if you can keep the 'top-level' history function as small as possible. It just takes the config, passes it to a function like _parse_history that takes one file as input, and maybe does a more_itertools.unique_everseen wrapper.

That way if you want to create some custom script, you can just use the internal _parse_file function, instead of trying to modify the config at runtime etc.

Possible Example

Just to wrap my head around how we might use a config that is not loaded when the file loads, converted my zsh.py file to just see how it might work, does this seem correct?

from abc import abstractmethod

class config:
    @property
    @abstractmethod
    def export_path(self) -> str:
        raise NotImplementedError

def make_config() -> config:
    from my.config import zsh as user_config

    class combined_config(user_config, config): ...

    return combined_config()


def history() -> Results:
    config = make_config()
    files = get_files(config.export_path)
    yield from unique_everseen(
        chain(*map(_parse_file, files)),
        key=lambda e: (
            e.dt,
            e.command,
        ),
    )

def _parse_file(histfile: Path) -> Results:
    # do actual parsing here...

Just trying to figure out if we have a strategy for caching or if we want to avoid the problem entirely. As it stands, having the config = make_config(...) at the global level sort of acts like a cache, so it would be a change if its all behind a function.

May impact perf a bit, but it probably isnt a big deal either way.

It may actually be a welcome change that the history function recomputes config, letting you run history with some default config, then patch/modify some envvar that changes the computed export_path returns, letting you get different output from history.

If someone wanted to do something expensive in their @property, they might use a @cached_property, as far as I've played around with it, mypy seems to work well with those, but I haven't used them in anything this complicated (with the abstract classes and all).

@karlicoss
Copy link
Owner Author

As a sidenote, the more I've used definitions like these, the less I like them

Yeah makes sense!

does this seem correct?

Yep, looks good!

May impact perf a bit, but it probably isnt a big deal either way.

Yep, I think in 90% of cases, config/inputs are retrieved once per module anyway, so doesn't really matter. And even if it is called multiple times, usually parsing etc would be much slower anyway.

And yeah there is always cached_property/lru_cache. Even if it gets in the way of testing/dynamic hacking, it's far easier to flush than dynamically reloading modules, there is a cache_clear function available on the decorated method.

@karlicoss
Copy link
Owner Author

So I was trying to implement this

from my.core.cfg import required, optional
class config:
    export_path = required(str)
    cache_path  = optional(str | None, default=None)

, ended up digging in mypy source code, and actually turned out Protocol supports default methods/attributes! Came as a complete surprise to me

from typing import assert_type

from abc import abstractmethod
class config_via_abstract_properties:
    @property
    @abstractmethod
    def required_attr(self) -> int:
        raise NotImplementedError

    @property
    def optional_attr(self) -> str:
        return 'default value'


from typing import Protocol
class config_via_protocol(Protocol):
    required_attr: int

    optional_attr: str = 'default value'


# config = config_via_abstract_properties
config = config_via_protocol


def check(cfg: config) -> None:
    print(cfg)
    print(cfg.required_attr)
    print(cfg.optional_attr)


###
class user_config_1:
    required_attr = 123

class combined_1(user_config_1, config): ...

# ok, mypy doesn't complain!
cfg1 = combined_1()
check(cfg1)
# types are inferred correctly
assert_type(cfg1.required_attr, int)
assert_type(cfg1.optional_attr, str)
###

print()

###
class user_config_2:
    optional_attr = 'user_config value'

class combined_2(user_config_2, config): ...

# ok, mypy complains about missing property
cfg2 = combined_2()  # type: ignore[abstract]
check(cfg2)
# types are inferred correctly
assert_type(cfg2.required_attr, int)
assert_type(cfg2.optional_attr, str)
###

The only difference is that it produces slightly different error in runtime if default attribute is missing.

With abstract property:

<__main__.combined_1 object at 0x790795707c20> 123 default value

Traceback (most recent call last):
  File "/code/hpi_dev/experiments/config_property/run.py", line 56, in <module>
    check(cfg2)
  File "/code/hpi_dev/experiments/config_property/run.py", line 29, in check
    print(cfg, cfg.required_attr, cfg.optional_attr)
               ^^^^^^^^^^^^^^^^^
  File "/code/hpi_dev/experiments/config_property/run.py", line 10, in required_attr
    raise NotImplementedError
NotImplementedError

With Protocol:

<__main__.combined_1 object at 0x759bae503aa0> 123 default value

Traceback (most recent call last):
  File "/code/hpi_dev/experiments/config_property/run.py", line 56, in <module>
    check(cfg2)
  File "/code/hpi_dev/experiments/config_property/run.py", line 29, in check
    print(cfg, cfg.required_attr, cfg.optional_attr)
               ^^^^^^^^^^^^^^^^^
AttributeError: 'combined_2' object has no attribute 'required_attr'

Arguably both are equally cryptic, so if we really want can add some sort of wrapper to present a nicer error to user.

So looks like we can just use Protocol + combined config, so not that much boilerplate at all after all!

Tried on my.whatsapp.android, works nicely!

class Config(Protocol):
# paths[s]/glob to the exported sqlite databases
export_path: Paths
my_user_id: Optional[str] = None
def make_config() -> Config:
import my.config as user_config
class combined_config(user_config.whatsapp.android, Config): ...
return combined_config()

@karlicoss
Copy link
Owner Author

Sigh.. so I did even more digging, and perhaps it'll have to be @property after all.
In runtime it all works anyway, which is nice since preserves backwards compatibility -- the only issue is with mypy.

Basically you can't override a regular attribute in base_config with a @property in user_config, because properties are read only, whereas regular attributes are read/write, so allowing that wouldn't be sound.

So this:

class base_config:
    value: int

class user_config:
    @property
    def value(self) -> int:
        # do some long computation..
        return 123

class combined(user_config, base_config): ...

cfg = combined()
print(cfg.value)

results in:

yy.py: note: In class "combined":
yy.py:10:1:10:45: error: Definition of "value" in base class "user_config" is incompatible with definition in base class "base_config"  [misc]
    class combined(user_config, base_config): ...

Seems that all the standard docs also recomment properties/abstract for these:

I tried to figure out what mypy actually does (actually ran out of the box from master branch, totally didn't expect it! just need to pass --no-incremental to avoid caching) -- so down to this bit https://github.com/python/mypy/blob/fe15ee69b9225f808f8ed735671b73c31ae1bed8/mypy/checker.py#L2777-L2791 . Basically it allows coercing regular attribute in derived class to property in base class -- but not the other way around.

So from this mypy code, seems like at the moment there are two options:

  • somehow convince mypy that attribute in the base class is actually a property -- then no matter what is in user config, it's gonna work

    This doesn't seem like an option, property decorators seem to be baked into mypy AST analysis, so not that you can define a type that returns a property (to be exact, there is one, but it's not generic, it's for cases when you do a bare call like prop = property(something)

    Here's some relevant discussion about potentially richer attribute types (e.g. value: ReadOnly[int]), but doesn't look like there is any work on this happening really https://discuss.python.org/t/need-a-way-to-type-hint-attributes-that-is-compatible-with-duck-typing/41137/18

  • second option is to somehow convince mypy the property in derived class is not actually a property as far as mypy is concerned

    So I came up with this:

    from typing import TYPE_CHECKING
    if TYPE_CHECKING:
        from typing import Any, Callable, TypeVar
        T = TypeVar('T')
        def lazy(_: Callable[[Any], T]) -> T:
            ...
    else:
        lazy = property
    
    
    class user_config:
        @lazy
        def value(self) -> int:
            return 123

    lazy of course can just live in core, and only needs to be imported once per config. Arguably it even indicates the intent better?

    The downsides are

    • the users would need to migrate to it for type checking to pass
    • there might be some other unintended type checking effects from 'shadowing' actual @property

So I guess

  • safer option is to just use @property (possibly abstract) in base configs for now
  • I might play with @lazy thing in my personal overlay (or modules that are unlikely to be used by anyone else)
  • on a longer term maybe chase on mypy issue tracker and present a case for ReadOnly wrapper

karlicoss added a commit that referenced this issue Aug 22, 2024
@karlicoss
Copy link
Owner Author

Experimented with migrating a couple of modules, quite happy about it!

The tests are much nicer now -- no random reset_modules and strategically placed imports, things just work with top level imports. I had to add a temporary ignore to reset_modules, but that's because other tests reset everything 😅 -- once we migrate more things it won't be necessary at all.

Some nice side effect: instead of a dynamic migration via untyped Attrs, it's actually possible to define a mixin class migration and inherit from it in combined_config! https://github.com/karlicoss/HPI/pull/384/files#diff-e18b4b97030e7947741957258414f22ba370e9e9e652bd3a846a9b5c175b1d7eR35-R50 So looks way cleaner and type checked.

karlicoss added a commit that referenced this issue Aug 25, 2024
Repository owner locked and limited conversation to collaborators Aug 26, 2024
@karlicoss karlicoss converted this issue into discussion #385 Aug 26, 2024

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants