merge rexport, pushshift and gdpr reddit data #89
Nice job! Just checked it out too (although I don't have as many comments to need it), and it works! Wondering, how do you approach Python entry points/executable scripts while developing? E.g. one could either:
These have a benefit of not messing with installed packages at all, which I feel is nice, but maybe not that big of a deal?
Regarding the merge function: looks good! I guess ideally I'd see it organized similarly to other multi-source data providers, e.g. something like:

The main goal in this structure IMO is to keep
Recently, a new thought process: if I'm developing something that I'd run once and that exports to JSON/some nice format, and I don't have to do any sort of normalization when loading it in (for example, my recent forum_parser and steam_scraper projects), then I just use individual Python scripts with no packaging, which requires you to clone and run directly. If it's something that requires a DAL, or it would be nice to have library usage to be able to merge exports together (so, like ffexport), I prefer to just use

I would agree with what you say for

If something is installed with setuptools:
It is quite annoying that this isn't a supported use case for Python development. Once I'm done, I'll reinstall using
That re-runs a script/file I currently have installed/runs the
I don't really think of it as cluttering user packages. I agree the way pip handles installing into a bin directory isn't ideal. Newer users might not want to mess with the

But, as long as the package name isn't

If cluttering the namespace is that much of an issue, a virtualenv could potentially solve that, but I think virtualenvs themselves are another layer of complexity that people might be turned off by.

Deciding whether or not to push a setuptools package to

Regarding dynamic imports, I think specifying a directory for the
Right. It would probably be a breaking change anyways, to convert
The upper

Converting to a package has other benefits, like someone can also do something like:
I think allowing people to import from rexport using

If the namespace package names are unique enough, it could potentially let you handle multiple providers in a nicer way, like:
and then for each provider, you could have:
Sounds good. I'm still currently trying to figure out my HPI process, been doing this for less than 2 weeks. Once I'm a bit more comfortable with this and I have things I'd like to merge back in, I'll patch changes into this fork and make some PRs. I think my own fork has diverged so much that it's not possible to do so anymore.
Yep, agree, that's mostly what I've been doing so far.
Oh yeah, it's really nice! I should remember to use
Yeah agree. I had wtfs maybe a couple of times when something was used as an editable, and not cleaned up properly (I think?), but apart from that no real issues.
Yep! It would certainly be optional, just thinking that potentially might be useful if the number of data providers gets out of hand. I guess another similarity to Emacs packages, having them all in a single directory (even potentially under a massive git repository) helps with quick experiments and rollback.
Yep, that's the plan. At first I had this urge to impose as little structure as possible, but it seems that conforming to setuptools is worth it. The only price to pay is adding a minimal
Yep, this as well, certainly easier than carefully cloning at the correct location or copy pasting the path into the config.
That way could work too, yeah. I guess one downside is that it feels too defensive, e.g. if something is misconfigured, you might fall back onto an empty data provider without realizing it. But on the other hand breaking things is more annoying. Maybe it could benefit from different 'strategies' for handling missing data providers, e.g.
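A minimal sketch of what such a 'strategy' could look like. The `import_source` name and the `policy` argument are hypothetical illustrations, not an existing HPI API: a decorator that either re-raises a missing-dependency error or degrades to an empty iterator with a warning.

```python
import warnings
from functools import wraps
from typing import Callable, Iterator, TypeVar

T = TypeVar('T')

def import_source(policy: str = 'warn') -> Callable[[Callable[..., Iterator[T]]], Callable[..., Iterator[T]]]:
    """Wrap a data provider so a missing dependency either raises or yields nothing."""
    def decorator(factory: Callable[..., Iterator[T]]) -> Callable[..., Iterator[T]]:
        @wraps(factory)
        def wrapper(*args, **kwargs) -> Iterator[T]:
            try:
                yield from factory(*args, **kwargs)
            except ModuleNotFoundError as e:
                if policy == 'error':
                    raise
                # defensive default: warn and emit no data
                warnings.warn(f"{factory.__name__}: {e}; yielding no data")
        return wrapper
    return decorator

@import_source(policy='warn')
def comments() -> Iterator[str]:
    import rexport_not_installed  # hypothetical missing data-provider dependency
    yield from rexport_not_installed.comments()
```

With `policy='warn'`, `list(comments())` returns `[]` plus a warning; with `policy='error'` the `ModuleNotFoundError` propagates, which would suit a stricter 'developer mode'.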
Yeah no problem, having fun with the code is more important :) Can always just copy-paste the code onto a fresh fork later.
I've added the link to the readme and moved the issue to HPI (most of the discussion is relevant to it), hope you don't mind!
Total misclick there, sounds good.
Once I've merged changes from rexport and #83 into my fork... would it be possible to merge some of this (pushshift)? Would probably be optional, like:

```python
@dataclass
class reddit(uconfig):
    ...
    # path[s]/glob to one or more comment exports
    # https://github.com/seanbreckenridge/pushshift_comment_export
    pushshift_export_path: Optional[Paths]
    ...

# if the user has set the pushshift comment export path...
from pushshift_comment_export.dal import PComment, read_file
```

I can probably work on a PR at some point if that's something you're interested in.
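For the merge itself, one possible sketch: both rexport and pushshift expose comment ids, so deduplicate on id, listing the richer rexport source first so its entries win. `MergedComment` and `merge_comments` are illustrative names here, not actual rexport/pushshift APIs.

```python
from typing import Iterable, Iterator, NamedTuple, Set

class MergedComment(NamedTuple):
    id: str
    text: str

def merge_comments(*sources: Iterable[MergedComment]) -> Iterator[MergedComment]:
    # earlier sources take priority, so pass the richest export first
    emitted: Set[str] = set()
    for source in sources:
        for c in source:
            if c.id in emitted:
                continue
            emitted.add(c.id)
            yield c

rexport_comments = [MergedComment('c1', 'full text from rexport')]
pushshift_comments = [MergedComment('c1', 'older copy'), MergedComment('c0', 'ancient comment')]
merged = list(merge_comments(rexport_comments, pushshift_comments))
```

The duplicate `c1` from pushshift is dropped, while the pushshift-only `c0` survives, which is the point of keeping the optional pushshift path around.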
Just merged #83 ! Let me know if you have any issues :) Absolutely, would be great to merge!
P.S. Just recalled, I also had some experiments/notes about possible ways of overriding/extending modules, will dig them out and post somewhere for the discussion.
All good on my end, been merging commits as you've been merging them into master. Though, I'm only using
Would that work with
Feels sort of wrong from a
Yeah, this is better organized, I'll follow how it's done in
I think so? As far as I understand, nested classes in Python are merely a syntactic thing (e.g. even if you want to access an attribute of the enclosing class, you'd have to specify a fully qualified name)
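A tiny illustration of that point: the nested class gets no implicit access to the enclosing class's attributes, so a fully qualified name is needed.

```python
class Outer:
    shared = 'x'

    class Inner:
        # `shared` is NOT in scope here; referencing it directly in a method
        # or in this body would raise NameError -- use Outer.shared instead
        pass

# access requires the qualified name; Inner inherits nothing from Outer
assert Outer.shared == 'x'
assert not hasattr(Outer.Inner, 'shared')
```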
Yeah. Regarding backward compatibility, also thought that
Ah yeah. Ideally, if the config has an empty path, the list of inputs is empty, and the module just emits nothing and everything just works. The issue now is that the modules (e.g.

Something I can think of on the spot: reorganize
, it could perhaps be somehow configurable, for example, respect
I'll think about it more.. maybe I'll come up with a way of lazily importing module dependencies so it's mypy friendly.
Recently created an

To keep this more mypy-friendly, I ended up creating a top-level 'Location' NamedTuple, even though each data source has its own 'Location' NamedTuple; one could probably get around that by standardizing the interface/duck typing. Each source looks like:

```python
def _google_locations() -> Iterator[Location]:
    yield from map(
        lambda gl: Location(lng=gl.lng, lat=gl.lat, dt=gl.at),
        filter(
            lambda g: isinstance(g, GoogleLocation),
            google_events(),
        ),
    )
```

... filtering, and then mapping onto the common NamedTuple representation. (Imports are at the top since I don't expect anyone else to be using this, but they could be put in each function.) Otherwise, I could've defined some type like:
...but I was already writing

If someone was using this for the first time, I think the benefits of the 'X' function which adds the fallback/warning message outweigh the downsides of it being boilerplatey. If a user wanted to customize

I can't think of any reason the 'X' function couldn't be abstracted away into
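The per-source shape described above (filter a mixed event stream, then map onto the shared NamedTuple) can be sketched end-to-end like this; `Location` and the `_*_locations` helpers are stand-ins for the real accessors, not HPI code.

```python
from datetime import datetime, timezone
from itertools import chain
from typing import Iterator, NamedTuple

class Location(NamedTuple):
    lng: float
    lat: float
    dt: datetime

# stand-ins for the real per-source accessor functions
def _google_locations() -> Iterator[Location]:
    yield Location(lng=0.1, lat=51.5, dt=datetime(2020, 1, 2, tzinfo=timezone.utc))

def _ip_locations() -> Iterator[Location]:
    yield Location(lng=2.3, lat=48.8, dt=datetime(2020, 1, 1, tzinfo=timezone.utc))

def locations() -> Iterator[Location]:
    # chain every source, then emit in timestamp order
    yield from sorted(chain(_google_locations(), _ip_locations()), key=lambda loc: loc.dt)
```

Each new source only has to add one `_x_locations` function and one entry in the `chain` call, which keeps the top-level `locations()` accessor stable.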
Yeah.... I've done some weird things before to try and get around this. Elixir has spoiled me there: https://github.com/seanbreckenridge/mint/blob/master/budget/budget/cleandata/transactions/__init__.py#L8-L16
Yep, or
Good point! I guess having a helper in the core would also allow making it configurable, e.g. defensive + warning by default and error/throw in 'developer mode'.
Just as an update: will probably attempt to merge a PR using an
Yeah, I guess it makes sense -- people are more likely to start with just one data source, so it would be nicer to make this easier. And then I, or generally more paranoid people, could have an explicit policy to always throw instead of handling defensively. Also there is
Hmm, regarding
Was trying to localize the

However, it being more composable means it's more magical, and mypy might not be able to type it statically? This code doesn't work, it's just me trying to get something to work, trying subclassing/

diff --git a/my/core/cfg.py b/my/core/cfg.py
index b23fa86..7e6190c 100644
--- a/my/core/cfg.py
+++ b/my/core/cfg.py
@@ -1,25 +1,41 @@
-from typing import TypeVar, Type, Callable, Dict, Any
+from typing import TypeVar, Type, Callable, Dict, Any, Set
+
Attrs = Dict[str, Any]
C = TypeVar('C')
+# extract the names of nested classes from a dataclass object
+# (see reddit config for example)
+def _nested_classes(cls: Type[C]) -> Set[str]:
+ class_names: Set[str] = set()
+ import inspect
+ for name, _ in inspect.getmembers(cls, inspect.isclass):
+ if name == "__class__": # ignore own class type
+ continue
+ class_names.add(name)
+ return class_names
+
+
# todo not sure about it, could be overthinking...
# but short enough to change later
# TODO document why it's necessary?
def make_config(cls: Type[C], migration: Callable[[Attrs], Attrs]=lambda x: x) -> C:
user_config = cls.__base__
old_props = {
- # NOTE: deliberately use gettatr to 'force' lcass properties here
+ # NOTE: deliberately use gettatr to 'force' class properties here
k: getattr(user_config, k) for k in vars(user_config)
}
new_props = migration(old_props)
from dataclasses import fields
+ allowed_keys = {f.name for f in fields(cls)}.union(_nested_classes(cls))
params = {
k: v
for k, v in new_props.items()
- if k in {f.name for f in fields(cls)}
+ if k in allowed_keys
}
+ print(params)
+ import pdb; pdb.set_trace()
# todo maybe return type here?
return cls(**params) # type: ignore[call-arg]
diff --git a/my/reddit/__init__.py b/my/reddit/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/my/reddit/all.py b/my/reddit/all.py
new file mode 100644
index 0000000..e69de29
diff --git a/my/reddit/common.py b/my/reddit/common.py
new file mode 100644
index 0000000..9071b59
--- /dev/null
+++ b/my/reddit/common.py
@@ -0,0 +1,59 @@
+from abc import ABCMeta
+from dataclasses import dataclass, field
+
+from my.core.common import Paths
+from my.config import reddit as uconfig
+
+
+# TODO: move to core? create a shared 'datasource' superclass?
+# need to think more about how to handle nested classes with make_config
+
+@dataclass
+class datasource(metaclass=ABCMeta):
+ export_path: Paths
+
+
+# empty strings for Paths objects resolve to empty data
+@dataclass
+class reddit(uconfig):
+ rexport: datasource = field(default_factory=lambda: datasource(''))
+ pushshift: datasource = field(default_factory=lambda: datasource(''))
+
+
+from my.core.cfg import make_config, Attrs
+# hmm, also nice thing about this is that migration is possible to test without the rest of the config?
+def migration(attrs: Attrs) -> Attrs:
+ # check for legacy path names
+ for deprecated_attr in ['export_path', 'export_dir']:
+ if deprecated_attr in attrs:
+ # assume the user is trying to use this attribute to set rexport Paths
+ @dataclass
+ class rexport:
+ export_path: Paths = attrs[deprecated_attr]
+
+ @dataclass
+ class pushshift:
+ export_path: Paths = ''
+
+ attrs['rexport'] = rexport
+ attrs['pushshift'] = pushshift
+
+ from my.core.warnings import high
+ high(f'''"{deprecated_attr}" is deprecated! Please update your reddit configuration to look like:
+
+class reddit:
+
+ class rexport:
+ # path[s]/glob[s] to data
+ export_path: Paths = {repr(attrs[deprecated_attr])}
+
+ class pushshift:
+ # path[s]/glob[s] to data
+ export_path: Paths
+
+''')
+ break
+ return attrs
+
+config = make_config(reddit, migration=migration)
+
diff --git a/my/reddit/pushshift.py b/my/reddit/pushshift.py
new file mode 100644
index 0000000..84d1af9
--- /dev/null
+++ b/my/reddit/pushshift.py
@@ -0,0 +1,9 @@
+
+@dataclass
+class pushshift_config(datasource):
+ '''
+ Uses [[https://github.com/seanbreckenridge/pushshift_comment_export][pushshift_comment_export]] output.
+ '''
+ # path[s]/glob to the exported JSON data
+ export_path: Paths
+
diff --git a/my/reddit.py b/my/reddit/rexport.py
old mode 100755
new mode 100644
similarity index 89%
rename from my/reddit.py
rename to my/reddit/rexport.py
index bbafe92..de28944
--- a/my/reddit.py
+++ b/my/reddit/rexport.py
@@ -1,26 +1,37 @@
"""
Reddit data: saved items/comments/upvotes/etc.
"""
+
REQUIRES = [
'git+https://github.com/karlicoss/rexport',
]
-from .core.common import Paths
-from my.config import reddit as uconfig
+
+from .common import datasource
from dataclasses import dataclass
+from my.core.common import Paths
+
+rexport_config: datasource
+try:
+ from my.config import reddit as user_config
+ rexport_config = user_config.rexport # type: ignore[attr-defined]
+except (ImportError, AttributeError):
+ from my.config import reddit as user_config
+ rexport_config = user_config
+
+# localize possible migrations for rexport to this module
@dataclass
-class reddit(uconfig):
+class rexport(metaclass=rexport_config):
'''
Uses [[https://github.com/karlicoss/rexport][rexport]] output.
'''
-
# path[s]/glob to the exported JSON data
export_path: Paths
-from .core.cfg import make_config, Attrs
+from my.core.cfg import make_config, Attrs
# hmm, also nice thing about this is that migration is possible to test without the rest of the config?
def migration(attrs: Attrs) -> Attrs:
export_dir = 'export_dir'
@@ -29,34 +40,31 @@ def migration(attrs: Attrs) -> Attrs:
from .core.warnings import high
high(f'"{export_dir}" is deprecated! Please use "export_path" instead."')
return attrs
-config = make_config(reddit, migration=migration)
-###
-# TODO not sure about the laziness...
+config = make_config(rexport, migration=migration)
+
try:
from rexport import dal
except ModuleNotFoundError as e:
- from .core.compat import pre_pip_dal_handler
+ from my.core.compat import pre_pip_dal_handler
dal = pre_pip_dal_handler('rexport', e, config, requires=REQUIRES)
# TODO ugh. this would import too early
# but on the other hand we do want to bring the objects into the scope for easier imports, etc. ugh!
# ok, fair enough I suppose. It makes sense to configure something before using it. can always figure it out later..
# maybe, the config could dynamically detect change and reimport itself? dunno.
-###
-############################
+from pathlib import Path
from typing import List, Sequence, Mapping, Iterator
-from .core.common import mcachew, get_files, LazyLogger, make_dict
+from my.core.common import mcachew, get_files, LazyLogger, make_dict
logger = LazyLogger(__name__, level='debug')
-from pathlib import Path
def inputs() -> Sequence[Path]:
- return get_files(config.export_path)
+ return get_files(config.rexport.export_path)
Sid = dal.Sid
@@ -223,9 +231,6 @@ def stats():
}
-##
-
-
def main() -> None:
for e in events(parallel=False):
        print(e)

Will try again this weekend, just posting in case you had any ideas.
Similar problem described in the location issue I raised a while back -- essentially 'where does the configuration for each module live?' I may be repeating points already said above or in other discussions, just wanted to get this down somewhere. Ideally I'm leaning towards the following, as it means there are fewer problems overlaying/adding additional modules, and it's slightly more mypy-friendly -- as we're not defining some expected nested configuration object (which the user may change if they add an additional source). With the two data sources here -- the only interaction

```python
try:
    from my.config import reddit_rexport as user_config
except ImportError:
    from my.config import reddit as user_config
    # send deprecation warning

class rexport(user_config):
    export_path: Path

# handle migration/make_config and possible deprecation for the rexport configuration block
```

... so that any deprecation warnings/configuration issues specific to

This does mean the configuration is a bit more verbose (
Agh. Seems like pushshift is going through a bit of a restructure to improve their API / has been down a bit recently (i.e. now, when I'm testing). Going to remind myself to look back here in case it's still having issues in a while; may need to update
Hmm. Yeah -- agree it would be nice to split up. The only thing to keep in mind is that backwards compatibility (with

Re: dataclasses -- right, yeah, nestedness seems like it would be hard to handle.
Ahh... for some reason I had it in my head that we needed to have the nested config as a dataclass somewhere in the module code and call make_config on it to handle migrations, but that isn't really necessary -- that can just be handled in
yeah, don't think there is a workaround that allows overlaying code over
Yeah, that sounds good. I think all the patterns here work for the
Cool! Guess it would be nice to compile a proper doc eventually so we remember our own past discussions xD. Btw, even if either of
The nested config reminds me a bit of a nested JSON/YAML-like object, which is typically used for configuration anyway, so it works well in that sense. Also, another random thought I had recently -- if the user is using the deprecated version of the config, we could probably use the
Would probably be a good idea to update the docs some, yeah... we've had some long discussions in PRs before. Could try doing some PRs, would give me practice in org mode.

Am currently writing up a readme that expands a bit on the problems with editable/namespace packages (and their usage in HPI as a plugin system, by extension) as part of a naive script to re-order my easy-install.pth; can probably link to that in
Also worth mentioning that this seems to exist now: https://www.reddit.com/settings/data-request Seems pretty complete; may be able to use that instead of pushshift. Unsure of the amount of data in each, will try to parse the GDPR export and see. Would probably be combined, like the github module, into
Been trying to come up with a common protocol across rexport and the GDPR export, but the data is quite different. For example, saves look something like this in a CSV file in the GDPR export:
No title, datetime, or text, unlike rexport, so it's a bit difficult to merge the data, unlike when combining pushshift and rexport (those have enough raw metadata that it makes sense). I don't really want to remove a bunch of attributes on the shared classes just so that it's consistent across rexport/gdpr, but I also don't really want to make the attributes nullable, as that propagates a bunch of breaking changes across promnesia and anyone else who is using this. Currently looks a bit like:

```python
"""
This defines Protocol classes, which make sure that each different
type of Comment/Save has a standard interface
"""
from typing import Dict, Any, TYPE_CHECKING
from datetime import datetime

Json = Dict[str, Any]

if TYPE_CHECKING:
    try:
        from typing import Protocol
    except ImportError:
        # hmm -- does this need to be installed on 3.6 or is it already here?
        from typing_extensions import Protocol  # type: ignore[misc]


class Save(Protocol):
    created: datetime  # gdpr exports don't have
    title: str  # gdpr exports don't have
    raw: Json  # don't have

    @property
    def sid(self) -> str: ...

    @property
    def url(self) -> str: ...

    @property
    def text(self) -> str: ...  # Note: gdpr exports don't have text here

    @property
    def subreddit(self) -> str: ...
```

...but since half of those attributes aren't present in the GDPR exports, I'm almost leaning towards not including

To sort of illustrate, what I mean is, when the fields don't match, the data in each
If you wanted upvote/save information from the gdpr export, you'd have to import those functions from

Then, lots of additional functions that gdpr provides, which have no additional complications;
Let me know if you can think of something better here
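One hedged way to express the 'not including' idea: slim the shared `Save` protocol down to the attributes both exports actually have, and keep the richer fields on the source-specific classes. `GdprSave` and its constructor below are illustrative stand-ins, not the real parser.

```python
from typing import Protocol, runtime_checkable  # Python 3.8+

@runtime_checkable
class Save(Protocol):
    """Only the fields present in both rexport and the GDPR export."""
    @property
    def sid(self) -> str: ...
    @property
    def url(self) -> str: ...
    @property
    def subreddit(self) -> str: ...

class GdprSave:
    """Hypothetically built from one CSV row of the GDPR export."""
    def __init__(self, sid: str, permalink: str, subreddit: str) -> None:
        self._sid, self._permalink, self._sub = sid, permalink, subreddit

    @property
    def sid(self) -> str:
        return self._sid

    @property
    def url(self) -> str:
        return self._permalink

    @property
    def subreddit(self) -> str:
        return self._sub

gs = GdprSave('s1', 'https://reddit.com/r/python/...', 'python')
```

Anything richer (title, created, text) would then come from rexport-specific functions rather than being nullable on the shared type, so promnesia-style consumers of the shared protocol stay unbroken.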
I've totally missed your message from April about the official reddit GDPR export, thanks, that's exciting! Requested mine as well. Yeah, I see the problem with missing attributes... I guess it makes sense to discuss here, since it's kind of a generic discussion about combining data sources. I also see options
Considering neither of these options is particularly great, it could be configurable (fits nicely into allowing different 'defensiveness policies' for different modules). But I guess it makes it pretty complicated too, and it's unclear if there are any real benefits... So what you suggested with combining different sources for different data makes sense for now, and we can change it later if we come up with something better.
Will leave this open for now, since
…itching to namespace packages for now just a manual ad-hoc test, will try to set it up on CI later relevant to the discussion here: https://memex.zulipchat.com/#narrow/stream/279601-hpi/topic/extending.20HPI/near/270465792 also potentially relevant to - #89 (will try to apply to this to reddit/__init__.py later) - #102
Was able to get pushshift as mentioned in the README working to export old comments. Thought I'd mention it here.

It's possible to use pushshift to get data further back, but I'm not sure if it should be part of this project, since some of the older comments don't have the same JSON structure; I can only assume pushshift is getting data from multiple sources the further it goes back. It requires some normalization, like here and here.

The only data I was missing due to the 1000 limit on queries from using rexport were comments. It exported the last 1000, but I have about 5000 on reddit in total.

Regarding HPI: wrote a simple package to request/save that data, with a dal (whose PComment NamedTuple has similar @property attributes to rexport's DAL), and a merge function, and now:

It's possible that one could write enough @property wrappers to handle the differences in the JSON representations of old pushshift data; unsure if that's something you want to pursue here.