
More efficient datacollection #575

Closed
Corvince opened this issue Aug 23, 2018 · 2 comments

Comments

@Corvince
Contributor

As discussed on the mailing list, the current implementation of DataCollector.collect can be improved in terms of speed and memory usage. I spent a good amount of time over the last couple of days profiling the function and trying to come up with a better one. For model reporters, performance is not an issue, since they are only evaluated once per step and not once per agent. Agent reporters, however, can be improved. Here is a comparison between no data collection, the current data collection, and my proposed function:

[Figure: dc_compare — runtime comparison of no data collection, the current data collection, and the proposed function]

As you can see, the speed overhead is much smaller now, especially for small agent counts (<1000).

Here is the downside, though: the performance gains are only substantial if the agent_reporters consist solely of agent attributes given as string values (e.g. "x": "x"). With custom functions and/or a mix of the two, the performance (and the code) remain almost the same.

However, judging by the examples, custom functions are rarely used for agent reporters, so most models can benefit from the new code.
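To illustrate the idea, here is a minimal sketch of such a fast path. The names (`collect_agent_vars`, the toy `Agent` class) are hypothetical, not Mesa's actual internals: when every reporter is a string, all attributes can be fetched with a single `operator.attrgetter` per agent instead of invoking one lookup or function call per reporter.

```python
import operator

class Agent:
    """Toy stand-in for a Mesa agent, for illustration only."""
    def __init__(self, uid, x, y):
        self.unique_id = uid
        self.x = x
        self.y = y

def collect_agent_vars(agents, agent_reporters):
    """Collect one row per agent.

    agent_reporters maps column name -> attribute name (str) or callable.
    """
    if all(isinstance(r, str) for r in agent_reporters.values()):
        # Fast path: one attrgetter fetches all attributes in a single call.
        getter = operator.attrgetter(*agent_reporters.values())
        return [(a.unique_id,) + getter(a) for a in agents]
    # Fallback: mixed or callable reporters, evaluated one by one.
    return [
        tuple([a.unique_id]
              + [r(a) if callable(r) else getattr(a, r)
                 for r in agent_reporters.values()])
        for a in agents
    ]

agents = [Agent(i, i * 2, i * 3) for i in range(3)]
rows = collect_agent_vars(agents, {"X": "x", "Y": "y"})
# rows == [(0, 0, 0), (1, 2, 3), (2, 4, 6)]
```

With only string reporters, the attribute check and the `attrgetter` construction can even be hoisted out of the per-step loop, which is where the measured gains come from.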

This might also be related to #574, since (at least collecting) the attribute state now has minimal overhead.

@Corvince
Contributor Author

New approach online: instead of calculating reporter values directly, we can use map() to "attach" functions to agents. This returns a generator that uses lazy evaluation to calculate its values.

This means the downside for non-string-based reporters mentioned above is circumvented, and we get the result

[Figure: no_overhead — no measurable collection overhead]

for ANY reporter function.

However, the work is only shifted to the get_agent_vars_dataframe() method. If we evaluate this function at the end of a model run, the total runtime will be the same as in the updated data collection above (again, with the potential fast path for attribute reporters).

But if we are not interested in the collected data (for example for interactive runs) the collect method no longer slows down model runs.
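The split described above can be sketched as follows. This is an illustrative mock-up, not Mesa's actual implementation (`LazyCollector`, `collect`, and `get_agent_vars` are hypothetical names): `collect()` only builds a lazy `map()` iterator per step, and all reporter evaluation is deferred until the data is requested.

```python
class Agent:
    """Toy stand-in for a Mesa agent, for illustration only."""
    def __init__(self, uid, wealth):
        self.unique_id = uid
        self.wealth = wealth

class LazyCollector:
    def __init__(self, reporter):
        self.reporter = reporter
        self._raw = []  # one lazy iterator per collected step

    def collect(self, agents):
        # map() returns a lazy iterator; no reporter is evaluated here,
        # so this call is cheap regardless of the reporter's cost.
        self._raw.append(map(self.reporter, list(agents)))

    def get_agent_vars(self):
        # Evaluation happens only now, when the data is actually requested.
        return [list(step) for step in self._raw]

agents = [Agent(i, 10 * i) for i in range(3)]
dc = LazyCollector(lambda a: a.wealth)
dc.collect(agents)              # cheap: nothing evaluated yet
vars_ = dc.get_agent_vars()     # all evaluation happens here
# vars_ == [[0, 10, 20]]
```

If the collected data is never requested (e.g. during interactive runs), the reporters are never evaluated at all, which is where the per-step overhead disappears.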

Two things to discuss:
1) In the current implementation, accessing the generator functions "empties" them, meaning we can only call get_agent_vars_dataframe() once. This is easy to prevent, but it could be a desirable side effect for freeing up memory (e.g. dumping the data into a database every step). I don't know which version to prefer, although I suspect the emptying would be unexpected for most users (and so should be prevented).
2) We could use the same approach for model reporters.
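The emptying behaviour from point 1 is just the standard exhaustion of Python iterators, as a plain map() shows; materializing the values once (e.g. into a list) keeps them reusable at the cost of holding them in memory:

```python
# A map object is an iterator: it can be consumed exactly once.
values = map(lambda a: a * 2, [1, 2, 3])
first = list(values)   # consumes the iterator
second = list(values)  # already exhausted, yields nothing
# first == [2, 4, 6], second == []

# Materializing once keeps the data available for repeated access:
cached = list(map(lambda a: a * 2, [1, 2, 3]))
# cached == [2, 4, 6], however often it is read
```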

@jackiekazil
Member

@Corvince Ty! This is great! <3

Re: Discussion --

  1. I am thinking hold it, but build emptying as an option? Because memory could be an issue.
  2. Sounds good.... maybe separate PR?

@dmasad do you have any feels about this?
