
More efficient datacollection #575

Closed
Corvince opened this issue Aug 23, 2018 · 2 comments

Comments

@Corvince
Contributor

As discussed on the mailing list, the current implementation of DataCollector.collect can be improved in terms of speed and memory usage. I spent a good amount of time over the last couple of days profiling the function and trying to come up with a better one. For model reporters, performance is not an issue, since they are only evaluated once per step and not once per agent. Agent reporters, however, can be improved. Here is a comparison between no data collection, the current data collection, and my proposed function:

[Figure: dc_compare — runtime comparison of no data collection, the current data collection, and the proposed function]

As you can see, the speed overhead is much smaller now, especially for small agent counts (<1000).

Here is the downside, though: the performance gains are only substantial if the agent_reporters consist solely of agent attributes given as string values (e.g. "x": "x"). With custom functions and/or a mix of the two, the performance (and the code) remain almost the same.

However, judging by the examples, custom functions are rarely used for agent reporters, so most models can benefit from the new code.
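To illustrate the idea, here is a minimal sketch of such a fast path. The names (`collect_agent_vars`, the toy `Agent` class) are hypothetical, not Mesa's actual internals: when every reporter is a string, all attributes can be fetched with a single `operator.attrgetter` per agent instead of invoking one lookup or function call per reporter.

```python
import operator

class Agent:
    """Toy stand-in for a Mesa agent, for illustration only."""
    def __init__(self, uid, x, y):
        self.unique_id = uid
        self.x = x
        self.y = y

def collect_agent_vars(agents, agent_reporters):
    """Collect one row per agent.

    agent_reporters maps column name -> attribute name (str) or callable.
    """
    if all(isinstance(r, str) for r in agent_reporters.values()):
        # Fast path: one attrgetter fetches all attributes in a single call.
        getter = operator.attrgetter(*agent_reporters.values())
        return [(a.unique_id,) + getter(a) for a in agents]
    # Fallback: mixed or callable reporters, evaluated one by one.
    return [
        tuple([a.unique_id]
              + [r(a) if callable(r) else getattr(a, r)
                 for r in agent_reporters.values()])
        for a in agents
    ]

agents = [Agent(i, i * 2, i * 3) for i in range(3)]
rows = collect_agent_vars(agents, {"X": "x", "Y": "y"})
# rows == [(0, 0, 0), (1, 2, 3), (2, 4, 6)]
```

With only string reporters, the attribute check and the `attrgetter` construction can even be hoisted out of the per-step loop, which is where the measured gains come from.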

This might also be related to #574, since (at least collecting) the attribute state now has minimal overhead.

@Corvince
Contributor Author

New approach online: instead of calculating reporter values directly, we can use map() to "attach" functions to agents. This returns a generator that uses lazy evaluation to calculate its values.

This means the downside for non-string-based reporters mentioned above is circumvented, and we get the result

[Figure: no_overhead — no measurable collection overhead]

for ANY reporter function.

However, the work is only shifted to the get_agent_vars_dataframe() method. If we evaluate this function at the end of a model run, the total runtime will be the same as in the updated data collection above (again, with the potential fast path for attribute reporters).

But if we are not interested in the collected data (for example for interactive runs) the collect method no longer slows down model runs.
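The split described above can be sketched as follows. This is an illustrative mock-up, not Mesa's actual implementation (`LazyCollector`, `collect`, and `get_agent_vars` are hypothetical names): `collect()` only builds a lazy `map()` iterator per step, and all reporter evaluation is deferred until the data is requested.

```python
class Agent:
    """Toy stand-in for a Mesa agent, for illustration only."""
    def __init__(self, uid, wealth):
        self.unique_id = uid
        self.wealth = wealth

class LazyCollector:
    def __init__(self, reporter):
        self.reporter = reporter
        self._raw = []  # one lazy iterator per collected step

    def collect(self, agents):
        # map() returns a lazy iterator; no reporter is evaluated here,
        # so this call is cheap regardless of the reporter's cost.
        self._raw.append(map(self.reporter, list(agents)))

    def get_agent_vars(self):
        # Evaluation happens only now, when the data is actually requested.
        return [list(step) for step in self._raw]

agents = [Agent(i, 10 * i) for i in range(3)]
dc = LazyCollector(lambda a: a.wealth)
dc.collect(agents)              # cheap: nothing evaluated yet
vars_ = dc.get_agent_vars()     # all evaluation happens here
# vars_ == [[0, 10, 20]]
```

If the collected data is never requested (e.g. during interactive runs), the reporters are never evaluated at all, which is where the per-step overhead disappears.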

Two things to discuss:
1) In the current implementation, accessing the generator functions "empties" them, meaning we can only call get_agent_vars_dataframe() once. This is easy to prevent, but it could be a desirable side effect for freeing up memory (e.g. dumping the data into a database every step). I don't know which version to prefer, although I suspect the emptying would be unexpected for most users (and so should be prevented).
2) We could use the same approach for model reporters.
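The emptying behaviour from point 1 is just the standard exhaustion of Python iterators, as a plain map() shows; materializing the values once (e.g. into a list) keeps them reusable at the cost of holding them in memory:

```python
# A map object is an iterator: it can be consumed exactly once.
values = map(lambda a: a * 2, [1, 2, 3])
first = list(values)   # consumes the iterator
second = list(values)  # already exhausted, yields nothing
# first == [2, 4, 6], second == []

# Materializing once keeps the data available for repeated access:
cached = list(map(lambda a: a * 2, [1, 2, 3]))
# cached == [2, 4, 6], however often it is read
```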

@jackiekazil
Member

@Corvince Ty! This is great! <3

Re: Discussion --

  1. I am thinking hold it, but build emptying as an option? Because memory could be an issue.
  2. Sounds good.... maybe separate PR?

@dmasad do you have any feels about this?
