
Faster loading of Anthology class #835

Open · mjpost opened this issue May 25, 2020 · 5 comments
@mjpost (Member) commented May 25, 2020

Instantiating the Anthology class in the Python API takes 30 seconds or so while all the files are loaded and parsed. This is inconvenient when testing. It would be nice if the expensive data structures could be built lazily, or cached in some manner.

@mbollmann (Member) commented:
Caching

I've tried adding serialization functionality to the Anthology class, so it could potentially be instantiated from a cached file instead of loading from the XML/YAML. This adds quite a bit of serialization code that also needs to be maintained whenever the classes change; you can see this in commit 7604717.

Results on my system are:

  • Loading from XML/YAML takes around 22 seconds
  • Serializing to the cache file takes around 2.2 seconds
  • Loading from the cache file takes around 6.6 seconds
    • 2.2s deserializing the data format
    • 4.4s instantiating all the Python objects

6–7 seconds is a lot less than 22 seconds, but still not quite the performance I was hoping for. I used msgpack as the data format and lz4 for fast compression; there may be faster options, but my timings suggest that instantiating the Python objects is the larger bottleneck here anyway.
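For reference, a minimal sketch of the round-trip I mean, assuming the data has already been flattened into plain dicts; the function and variable names are illustrative, not the actual Anthology code:

```python
import msgpack
import lz4.frame

def save_cache(records, path):
    # Pack the plain-Python data into msgpack bytes, then compress with lz4.
    packed = msgpack.packb(records, use_bin_type=True)
    with open(path, "wb") as f:
        f.write(lz4.frame.compress(packed))

def load_cache(path):
    # Decompress and unpack -- the fast ~2.2s step; rebuilding the actual
    # Python objects from these plain dicts is the slower ~4.4s step.
    with open(path, "rb") as f:
        packed = lz4.frame.decompress(f.read())
    return msgpack.unpackb(packed, raw=False)
```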

Lazy loading

This could certainly be an alternative; however, I tried the caching route first as this would also cut down on regular build times (since the Anthology object is instantiated by multiple scripts in the process). Lazy loading, on the other hand, would only help with scripts outside the build chain (that don't need to access all the info) or while developing.

Access to anthology.volumes or anthology.papers could easily be made to lazily load the respective XML files (as long as their name can be uniquely inferred from the ID, which is currently the case). The same goes for anthology.venues and anthology.sigs, which are instantiated from YAML files.
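As an illustration, lazy access to a volume could look roughly like the following; the directory layout and the ID-to-filename mapping are assumptions based on the description above, not the real code:

```python
import os
from lxml import etree

class LazyVolumes:
    """Parses a collection's XML file only when one of its volumes is accessed."""

    def __init__(self, xml_dir):
        self.xml_dir = xml_dir
        self._cache = {}

    def __getitem__(self, volume_id):
        # Infer the file name from the ID (e.g. "P19-1" -> "P19.xml"),
        # which is currently possible, as noted above.
        collection_id = volume_id.split("-")[0]
        if collection_id not in self._cache:
            path = os.path.join(self.xml_dir, collection_id + ".xml")
            self._cache[collection_id] = etree.parse(path)
        return self._cache[collection_id]
```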

The only thing that can't really be lazily instantiated is the person index (anthology.pindex), as accessing information about a given person requires loading in all the XML files to see where that name appears. A possible compromise could therefore be to pre-compute this index as an XML or YAML file, maybe? I'm not sure if that should be done as a file we keep in the repo (and update with a script when XML files change), or as a cached file managed by the AnthologyIndex.
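To make that compromise concrete, here is a rough sketch of pre-computing such an index and dumping it to YAML. The XML shape assumed here (paper elements with author/first/last children) is simplified, and a real index would need full Anthology IDs and name-variant handling:

```python
import glob
from collections import defaultdict

import yaml
from lxml import etree

def build_person_index(xml_dir):
    # Map "First Last" -> list of paper IDs by scanning every XML file once.
    index = defaultdict(list)
    for path in glob.glob(f"{xml_dir}/*.xml"):
        for paper in etree.parse(path).iter("paper"):
            for author in paper.iter("author"):
                first = author.findtext("first") or ""
                last = author.findtext("last") or ""
                index[f"{first} {last}".strip()].append(paper.get("id"))
    return index

# Dump once; AnthologyIndex could then load this file instead of all the XML.
with open("person_index.yaml", "w") as f:
    yaml.safe_dump(dict(build_person_index("data/xml")), f)
```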

@mjpost (Member, Author) commented Jun 5, 2020

Thanks for this investigation!

Caching to improve the build time makes sense, but I worry about adding maintenance burden. It looks like you have to manually specify how to serialize and deserialize each object. Let's say I add a class member and then forget to add it to the serialization code—will the mistake be obvious?

My main interest was really speed for external use of the API, so the lazy loading is appealing. But why can't the person index be lazily loaded? I am thinking of use cases where I instantiate Anthology but then never access the person index. It would be nice not to have to pay that overhead until I try to create a person object.

I like the idea of lazy loading and caching the person index only, which would presumably add only a bit of complexity (versus caching for everything).

@mbollmann (Member) commented:

Some more thoughts (mainly for myself :)) and investigation:

As a first step, I now think the most promising route is to start optimizing the code we already have; profiling the code with pyinstrument gives some interesting insights (full profiler output here):

[Image: pyinstrument output for instantiating the Anthology class]

This reveals that we spend 30% of the time instantiating the Anthology class – 7 seconds out of 23 total on my machine – on resolving names, so our current name-resolution code is super costly! There's also a lot of cumulative time going into building full Anthology IDs, formatting titles, and inferring URLs, all of which seem like things that could probably be done lazily, on demand.
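For anyone wanting to reproduce this, a profile like the one above can be generated with pyinstrument's Profiler API; the Anthology import path and constructor argument below are assumptions about the repo layout:

```python
from pyinstrument import Profiler

from anthology import Anthology  # import path is an assumption

profiler = Profiler()
profiler.start()
anthology = Anthology(importdir="data")  # argument name is an assumption
profiler.stop()

# Print the same tree-style report shown above.
print(profiler.output_text(unicode=True, color=True))
```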

Second, to be able to compute more things lazily, I'd like to refactor the code to get rid of the universal .attrib dict of metadata that many objects hold. The way it's done now, the .attrib dict is filled, immediately upon parsing the objects from the XML, with all kinds of information needed for building the Anthology website. The build scripts then simply copy .attrib as a starting point. This effectively prevents lazily computing these attributes when needed, and also obscures in the build scripts what metadata attributes there actually are (something that has already annoyed me in the past).
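The direction I have in mind looks roughly like this: attributes move out of .attrib and become properties computed (and cached) on first access. Class and attribute names here are illustrative only, and the URL pattern is just an example:

```python
from functools import cached_property  # Python 3.8+

class Paper:
    def __init__(self, xml_element, volume_id):
        self._xml = xml_element
        self.volume_id = volume_id

    @cached_property
    def full_id(self):
        # Built on first access instead of eagerly during XML parsing.
        return f"{self.volume_id}-{self._xml.get('id')}"

    @cached_property
    def pdf_url(self):
        # URL inference deferred until something actually asks for it.
        return f"https://aclanthology.org/{self.full_id}.pdf"
```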

At that point, it's probably best to look at the profiler graphs again to see what bottlenecks remain and how to address them.

@mbollmann (Member) commented Sep 2, 2021

Here's an updated report from pyinstrument after merging #1473 (caveat: I don't remember which device I ran the previous test on :)):

[Image: updated pyinstrument report]

Besides already looking much faster, I suspect that refactoring the name handling (re #1530) might shave off another second or so.

@mjpost (Member, Author) commented Sep 2, 2021

Fantastic—thanks for the report and the merge.
