Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #1419. DataCollector accepts an arbitrary schedule at creation (d… #1481

Conversation

woolgathering
Copy link

…efaults to model.schedule otherwise) and will return None if an attribute is not found instead of throwing an AttributeError.

Fix #1419. DataCollector now accepts an optional fourth argument schedule which is an arbitrary schedule in the model. If none is provided, the DataCollector defaults to previous behavior, assuming that there exists an attribute to each agent called model.scheudle.

This also does not result in an AttributeError when an attribute does not exist for an agent. It instead returns None.

…creation (defaults to model.schedule otherwise) and will return None if an attribute is not found instead of throwing an AttributeError.
@woolgathering
Copy link
Author

I'm unclear why the black hook is failing. Running flake8 . --ignore=F403,E501,E123,E128,W504,W503 --exclude=docs,build on my machine returns no errors. Please advise.

@Tortar
Copy link
Contributor

Tortar commented Oct 29, 2022

You need to apply black formatter to your changed files after pip installing it

@woolgathering
Copy link
Author

Whoop, yes done.

@codecov
Copy link

codecov bot commented Oct 30, 2022

Codecov Report

Base: 91.34% // Head: 91.13% // Decreases project coverage by -0.20% ⚠️

Coverage data is based on head (d486f2e) compared to base (952857c).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1481      +/-   ##
==========================================
- Coverage   91.34%   91.13%   -0.21%     
==========================================
  Files          15       15              
  Lines        1306     1309       +3     
  Branches      223      226       +3     
==========================================
  Hits         1193     1193              
- Misses         80       81       +1     
- Partials       33       35       +2     
Impacted Files Coverage Δ
mesa/datacollection.py 95.29% <100.00%> (-2.39%) ⬇️
mesa/space.py 95.78% <0.00%> (-0.45%) ⬇️
mesa/batchrunner.py 92.88% <0.00%> (-0.04%) ⬇️
mesa/visualization/TextVisualization.py 39.47% <0.00%> (ø)
mesa/visualization/ModularVisualization.py 78.88% <0.00%> (ø)
mesa/visualization/modules/ChartVisualization.py 91.66% <0.00%> (ø)
mesa/visualization/modules/HexGridVisualization.py 96.29% <0.00%> (ø)
mesa/visualization/modules/NetworkVisualization.py 100.00% <0.00%> (ø)
...esa/visualization/modules/BarChartVisualization.py 88.88% <0.00%> (ø)
... and 3 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

return agent_records

@staticmethod
def _get_reports(collector, steps, agent):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method name is too generic. It could have meant getting the model report.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not used elsewhere in the code. Why not put the function back in _record_agents?

Copy link
Author

@woolgathering woolgathering Oct 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrongly assumed that by avoiding redefining the same function every call we'd get a performance improvement. This was not the case.

I used the introductory tutorial at "Collecting Data" and removed the model_reporters, then timed this:

from timeit import default_timer as timer # not called while timing, just here to show what package I was using
start = timer()
model.datacollector.collect(model)
end = timer()
print(end - start) # time in seconds

For each variation:

6.0040998505428433e-05 # original
0.00016220801626332104 # this commit
0.00014566699974238873 # leaving the function in _record_agents
4.683298175223172e-05 # leaving the function in _record_agents and removing the model argument in _record_agents that is no longer used

So actually I was mistaken but we can get it to be pretty fast by leaving the func inside and removing the unused arg. The functions in question for the last timed test are:

def _record_agents(self, schedule):
    """Record agents data in a mapping of functions and agents."""
    rep_funcs = self.agent_reporters.values()

    def get_reports(agent):
        _prefix = (schedule.steps, agent.unique_id)
        reports = tuple(rep(agent) for rep in rep_funcs)
        return _prefix + reports

    agent_records = map(get_reports, schedule.agents)
    return agent_records

def collect(self, model):
    """Collect all the data for the given model object."""
    if self.schedule is None:
        schedule = model.schedule
    else:
        schedule = self.schedule

    if self.model_reporters:

        for var, reporter in self.model_reporters.items():
            # Check if Lambda operator
            if isinstance(reporter, types.LambdaType):
                self.model_vars[var].append(reporter(model))
            # Check if model attribute
            elif isinstance(reporter, partial):
                self.model_vars[var].append(reporter(model))
            # Check if function with arguments
            elif isinstance(reporter, list):
                self.model_vars[var].append(reporter[0](*reporter[1]))
            else:
                self.model_vars[var].append(self._reporter_decorator(reporter))

    if self.agent_reporters:
        agent_records = self._record_agents(schedule)
        self._agent_records[schedule.steps] = list(agent_records)

If we like that, I can commit whichever is preferred. Thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having the function defined inside _record_agents is clearer. This way, you can have the name remain short, get_reports.

@rht
Copy link
Contributor

rht commented Oct 30, 2022

The only reason we need to pass in model in collect is to access its scheduler. I wonder if we should make model an optional argument in collect.

@woolgathering
Copy link
Author

woolgathering commented Oct 30, 2022

The only reason we need to pass in model in collect is to access its scheduler. I wonder if we should make model an optional argument in collect.

I thought about this but I decided against it because that would negate what I was trying to do: namely, to have an arbitrary schedule be data-collected (really the agents in the schedule) where a model can have several schedules with different types of agents. I was also able to mix the agents together if I wanted them all intermingled during a step() with a simple schedule subclass which did not interfere with the data collecting.

@rht
Copy link
Contributor

rht commented Oct 31, 2022

But you can still have a data collector collect receiving arbitrary schedule without model. Making it optional doesn't negate this capability.

@tpike3 tpike3 added this to the v1.2.0 Taylor milestone Oct 31, 2022
@woolgathering
Copy link
Author

To make sure I'm understanding, we're suggesting that collect have two arguments: model (optional) and schedule (optional), correct?

I noticed that we pass model to the reporter() when we loop over the reporters. If schedule is defined in collect, we would just do reporter(schedule.model). But if model is defined in collect, we then do reporter(model), yes? I just want to make sure I understand before I start making commits.

I'm also unclear on why we would want to pass in a schedule to a datacollector at the collect stage. This to me would imply that someone would want to change the collected schedule while the model is running which seems weird. I would think that a datacollector is "tied" to a specific schedule with specific agents and that we wouldn't want to change it while the model is running, hence having the datacollector have a schedule attribute.

@Corvince
Copy link
Contributor

Great discussion, now that we talk about it I wonder if we should require a model or schedule at all.

I think we could achieve a clearer separation of concerns if we keep the data collection as independent as possible.

We could also just pass in a list of agents to the collect method and then collect over these agents. I am not sure anymore about the advantage of baking them in. Expanding on this idea we could gain maximal flexibility by also providing the reporter functions per collect call.

@woolgathering
Copy link
Author

woolgathering commented Oct 31, 2022

We could also just pass in a list of agents to the collect method and then collect over these agents. I am not sure anymore about the advantage of baking them in.

This is nice: to be able to collect on a subset of agents (a representative sample, say). My question is then: what about model level reporters? Would collect accept a list of agents and a model on which to collect model level info? I also can't think of an instance in which I'd want to have more than one model running in a simulation, though it's entirely possible that I just haven't encountered the use-case yet.

Expanding on this idea we could gain maximal flexibility by also providing the reporter functions per collect call.

Is there a reason one would want to change the reporter functions during the execution of a model? I also don't know enough about CS to know whether or not this would create performance issues. What're your thoughts?

@Corvince
Copy link
Contributor

We could also just pass in a list of agents to the collect method and then collect over these agents. I am not sure anymore about the advantage of baking them in.

This is nice: to be able to collect on a subset of agents (a representative sample, say). My question is then: what about model level reporters? Would collect accept a list of agents and a model on which to collect model level info? I also can't think of an instance in which I'd want to have more than one model running in a simulation, though it's entirely possible that I just haven't encountered the use-case yet.

Yeah, I didn't really think about model level parameters. I don't have an idea on the top of my head, but I don't think we need to consider multi model simulations. I don't think this is a common use case (although maybe one should think about it, maybe it is just something that was feasible until now).

Expanding on this idea we could gain maximal flexibility by also providing the reporter functions per collect call.

Is there a reason one would want to change the reporter functions during the execution of a model? I also don't know enough about CS to know whether or not this would create performance issues. What're your thoughts?

Na I don't think reporters would change, I was thinking in the direction that you could use different reporters for different sets of agents. ( So some reporters a, b for type A, some other reporters c, d for type B)

1 similar comment
@Corvince
Copy link
Contributor

We could also just pass in a list of agents to the collect method and then collect over these agents. I am not sure anymore about the advantage of baking them in.

This is nice: to be able to collect on a subset of agents (a representative sample, say). My question is then: what about model level reporters? Would collect accept a list of agents and a model on which to collect model level info? I also can't think of an instance in which I'd want to have more than one model running in a simulation, though it's entirely possible that I just haven't encountered the use-case yet.

Yeah, I didn't really think about model level parameters. I don't have an idea on the top of my head, but I don't think we need to consider multi model simulations. I don't think this is a common use case (although maybe one should think about it, maybe it is just something that was feasible until now).

Expanding on this idea we could gain maximal flexibility by also providing the reporter functions per collect call.

Is there a reason one would want to change the reporter functions during the execution of a model? I also don't know enough about CS to know whether or not this would create performance issues. What're your thoughts?

Na I don't think reporters would change, I was thinking in the direction that you could use different reporters for different sets of agents. ( So some reporters a, b for type A, some other reporters c, d for type B)

@woolgathering
Copy link
Author

woolgathering commented Oct 31, 2022

Yeah, I didn't really think about model level parameters. I don't have an idea on the top of my head, but I don't think we need to consider multi model simulations. I don't think this is a common use case (although maybe one should think about it, maybe it is just something that was feasible until now).

Agreed. I suppose one new thing that would be possible would be the possibility of the same agents acting with different models in the same simulation. But even then I'm unsure how this could not be accomplished by just using different groups of agents that have different characteristics.

Na I don't think reporters would change, I was thinking in the direction that you could use different reporters for different sets of agents. ( So some reporters a, b for type A, some other reporters c, d for type B)

Makes sense. My way of solving this was to just have different schedules (which are then passed to different datacollectors, potentially) and mixing them when stepping through the model. I was able to make a simple subclass of the BaseScheduler that took in multiple schedules and mixed the agents in them. So sort of like RandomActivation but it could mix agents from multiple schedules. This achieves the same functionality as what you're suggesting if all the agents are the same type, if I'm understanding correctly.

I suppose I'm also weary of breaking changes which might be the case if we made it possible to send in a list of agents to collect data from and completely separated the DataCollector out. I'd leave that up to the existing contributors to decide if/when that is appropriate.

@rht
Copy link
Contributor

rht commented Nov 1, 2022

I forgot that model is needed in collect for the model reporters. Let's keep the scope of this PR small for now, i.e. to allow passing an explicit schedule object in collect. The use case is specifically when you have multiple schedulers in one model.

@rht
Copy link
Contributor

rht commented Nov 1, 2022

to allow passing an explicit schedule object in collect.

This is not accurate. It should be "to allow associating a DataCollector object with a custom scheduler, in case a model has multiple schedulers"

@woolgathering
Copy link
Author

@rht Great, shall I commit with the changes above; i.e. keeping get_reports defined inside _record_agents and passing in model to collect()? I thought it was pretty clean overall and it ran pretty quick.

@rht
Copy link
Contributor

rht commented Nov 1, 2022

Yes, sounds good.

@EwoutH
Copy link
Member

EwoutH commented Nov 1, 2022

Thanks for this effort! When you've processed the review comments, could you add a (few) usage examples?

Maybe we also can mention it in one of the tutorials.

@woolgathering
Copy link
Author

Thanks for this effort! When you've processed the review comments, could you add a (few) usage examples?

Yes, no problem. Should that go in examples/ or shall I update some of the docs?

…() now accepts a scheudle instead of a model. Attributes that do not exist return None instead of an AttributeError.
@rht
Copy link
Contributor

rht commented Nov 2, 2022

I think for the scope of this PR, adding a test case for a custom schedule object would be sufficient.

As for the doc/example, we don't have a documentation .rst that describes the complete features of the DataCollector yet. Having to make an example simulation code for a particular feature is overkill, considering that an example usually consists of multiple files. Maybe a short doc in https://github.com/projectmesa/mesa/tree/main/docs/useful-snippets would do for now. And we can defer this documentation to a separate PR.

@woolgathering
Copy link
Author

woolgathering commented Nov 2, 2022

Looking at test_datacollector.py as an example, it looks like I'd need to edit the initialize_data_collector() function in the Model class since it would assign the DataCollector to self.datacollector in every case (i.e. overwrite if I called it multiple times for multiple collectors and schedules).

To be completely honest, I've never written a unit test in my life so I suppose I'm looking for some guidance here. Should I just write a simple script, semi-based on test_datacollector.py with a bunch of assertions?

[EDIT: Ended up just adding some sample code to the useful-snippets file. Seemed easiest and most appropriate since I figure most people won't end up using this functionality regularly and it feels weird to write unit tests for custom classes that don't appear natively in the project.]

@EwoutH
Copy link
Member

EwoutH commented Nov 2, 2022

Ended up just adding some sample code to the useful-snippets file.

Content wise this is perfect and really useful, especially since Mesa is used a lot as a teaching tool. Good, accessible documentation and examples are essential for that, so thanks!

The document doesn't seem to render correctly however, I think you need to add a white line after the .. code:: python part.

We also might have to mention (link to) this snippet in a tutorial or example, but that can indeed be done in another PR. Maybe we can expand the wolf_sheep example, since it already uses different agent types. (CC @tpike3)

@tpike3
Copy link
Member

tpike3 commented Nov 5, 2022

How close are we to a merge on this? I can use this to show snippets as I rerecord session 6

@rht
Copy link
Contributor

rht commented Nov 5, 2022

It's almost there. Just need to tidy the code snippet.

@jacob-thrackle
Copy link

Will have it done by this afternoon PST.

@@ -0,0 +1,49 @@
from random import shuffle
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is accidentally committed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solved! Sorry, oversight.

@rht
Copy link
Contributor

rht commented Nov 8, 2022

OK, pending the discussion result of #1506.

@EwoutH
Copy link
Member

EwoutH commented Nov 10, 2022

Hi @woolgathering! @rht, @tpike3, @jackiekazil and I discussed multiple agent types for the DataCollector. We discussed multiple options, including adding a type attribute to the Agent class (consistent, but conditional filtering would have take a large performance impact) and layering the DataCollector with a dict key for each agent type like in #1142 (not very scalable to other use cases).

The problem to we converged to was that Mesa supports a single scheduler and a single datacollector by default. This PR adds multiple schedulers with multiple datacollectors. However, there is also a use case for a single scheduler (for example RandomActivationByType) but with multiple datacollectors.

Do you think this PR supports (or can be adapted to) that case of single scheduler with multiple data collectors, and if so would you be willing to add an additional example for that?

@rht
Copy link
Contributor

rht commented Nov 10, 2022

Do you think this PR supports (or can be adapted to) that case of single scheduler with multiple data collectors, and if so would you be willing to add an additional example for that?

RandomActivationByType.agents_by_type can be considered to be a collection of schedulers (which keys are static), so I think it would work out of the box with the solution in this PR.

Use cases to consider:

  • Currently, RandomActivationByType groups agents by their agent class, obtained via type(agent). But sometimes, one might want to group by various agent attributes. It might be possible to have 2 agent classes sharing the same attribute, hence, grouped in the same schedule. Or the other way around: 2 agents with the same agent class, but have different attributes
  • An agent attributes may change over time, and so, the agent might be moved to a different step stage as the simulation progresses.

@EwoutH
Copy link
Member

EwoutH commented Nov 10, 2022

Very good cases!

Aside from examples, how complex would it be to add tests for these cases?

@rht
Copy link
Contributor

rht commented Nov 11, 2022

Aside from examples, how complex would it be to add tests for these cases?

I think we can do this in a separate PR, because it requires a scheduler that groups not by type(agent). And once the 1 scheduler with multiple data collectors has been documented, it would require a little tweak to accommodate for the scheduler with custom grouping.

@EwoutH
Copy link
Member

EwoutH commented Nov 12, 2022

@woolgathering, what do you think?

@EwoutH
Copy link
Member

EwoutH commented Nov 14, 2022

I feel we now have some momentum on this issue, and I would really like not to lose it again this time.

@woolgathering have you seen the comment above and what do you think of it? Especially about the single-scheduler multi-datacollector case/example?

If you don't have time to work on it this is also perfectly fine, then please let it know, then we can make another plan forward.

@woolgathering
Copy link
Author

I'm generally of the opinion that PRs should be fairly limited in scope so adding the test cases would probably be best in another PR. I'm honestly not sure how common those cases will be.

I was thinking about the single scheduler, multiple collector this weekend. It seems that the cleanest way would be to just pass a list to the collector of the agents one wants to track. The problem there, of course, is that this seems it would require relatively significant reworking of the DataCollector class and would complicate adding/removing agents. Given the way this fix is currently written, I can't think of a simple way of doing it.

Suppose I have the problem here: that I have 1000 agents but I only want to track 100 specific agents. This would be a case where one could put all the agents into a single scheduler, then only capture in the data collector the 100 desired agents. This is my understanding of the problem. However... the same thing could be accomplished using multiple schedulers mixed together as in the example I wrote in the snippets. This could even be adapted to the case where one wants a specific order (in a single scheduler, all agents have a unique_id and the mix could be ordered according to that), or to activate them by type (no different than having multiple schedulers with different types executed in series).

So I think overall I agree with @rht in that those cases could work out of the box and get the same end result, even if on the execution layer it looks different.

@EwoutH
Copy link
Member

EwoutH commented Nov 14, 2022

Thanks for your extensive reply and the time thinking about this!

While I feel using multiple schedulers is not the most intuitive for users, it’s an improvement over the current situation when that means the datacollector can be used for each scheduler. If you and @rht think another example with that use case is out of the scope of this PR, I would say go ahead and merge it.

How close are we to a merge on this? I can use this to show snippets as I rerecord session 6

@tpike3 this opportunity window has probably passed right?

@rht
Copy link
Contributor

rht commented Nov 15, 2022

I sketched an example for 3 data collectors, 1 scheduler (RandomActivationByType) in https://github.com/rht/mesa/tree/test_dc_schedule. It didn't work, because the data collector strictly assumes that the object being passed has an attribute agents (see https://github.com/rht/mesa/blob/dbb29de3decb350fac406ba985fb846e14597960/mesa/datacollection.py#L167).

@woolgathering
Copy link
Author

Hmmm, I wonder if it would be best to be able to pass a list of agents to the DataCollector after all... then if it's None it would just default to schedule.agents (all the agents). It would then be on the user to make sure that if they're adding/removing agents to do it in both places.

That may not be as hard, though, especially if the list were passed on each collect. It just feels kind of cumbersome and bloated. Thoughts?

@rht
Copy link
Contributor

rht commented Nov 16, 2022

It could as well be a dict instead of a list, or maybe have DataCollector.__getitem__, not sure.

I think for now, to solve the Sugarscape G1MT for the complexity tutorial use case (1 scheduler, multiple data collectors), we should just implement a custom filter. The mental model with the custom filter is much simpler. Also, it's only 200 agents at most, and so should be fine. We can have a more performant option later.

@rht
Copy link
Contributor

rht commented Nov 16, 2022

The mental model with the custom filter is much simpler.

I should add, that this is convenient for rapid prototyping a model, before scaling it up.

@tpike3 tpike3 modified the milestones: v1.2.0 Taylor, Mesa 2.0 Dec 8, 2022
@tpike3 tpike3 modified the milestones: Mesa 2.0, Milestone 1.2 Jan 8, 2023
@tpike3 tpike3 modified the milestones: Milestone 1.2, Mesa 2.0 Feb 5, 2023
@tpike3 tpike3 removed this from the Mesa 2.0 milestone Jun 19, 2023
@tpike3
Copy link
Member

tpike3 commented Jun 19, 2023

Inspired other updates to datacollector

@tpike3 tpike3 closed this Jun 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature Request: Agent DataCollection Can't Handle Different Attributes in ActivationByType
8 participants