Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Efficient iteration over SWIFTGalaxy objects #14

Merged
merged 99 commits into from
Oct 18, 2024
Merged

Efficient iteration over SWIFTGalaxy objects #14

merged 99 commits into from
Oct 18, 2024

Conversation

kyleaoman
Copy link
Member

@kyleaoman kyleaoman commented Sep 16, 2024

Closes #13

If we want to iterate over many galaxies that lie in the same top-level cell(s) in the snapshot then creating separate SWIFTGalaxy objects will be inefficient because each time a SWIFTGalaxy is created the same particles will be read from disk, and then all but those corresponding to the halo of interest will be discarded. Instead we would like to do the expensive disk i/o once and then temporarily mask out particles that don't belong to the current galaxy of interest while iterating over galaxies.

We can do this cleverly using existing SWIFTGalaxy functionality: it's already easy to return a SWIFTGalaxy that is a subset (masked) of a SWIFTGalaxy. Draft workflow looks like:

  • Create a SWIFTGalaxyIterator helper class. It will need to know the locations and (maximum) sizes of the halos to iterate. Combining the cell size and the target galaxy sizes it will calculate an optimal grid of sub-regions to work with. Grid aligns with top-level cell boundaries. Galaxies could be on the boundary of the grid - shift the grid by half a grid spacing to handle these (the sizes are chosen to guarantee that all galaxies fit in one of the two grids). Assign each galaxy to a location in one of the two grids. Think about load balancing here as it makes sense to parallelize over the grid elements. For every grid element we generate a SWIFTGalaxy centred on the grid element and masked to contain all particles in the grid cell.
  • The SWIFTGalaxy produced by the iterator behaves as a "server" that will copy-and-mask and recentre itself to produce the actual SWIFTGalaxy's corresponding to each galaxy of interest. Should automatically avoid a copy of the entire memory (because each array is masked then written to the copy). We will assume that if we can afford to store the entire particle content of the grid element we can afford the entire particle content of the grid element + a copy of the particles belonging to one galaxy as it's being iterated over.
  • To be efficient, we can't load data too lazily. The "server" SWIFTGalaxy needs to pre-load everything that will be wanted by the SWIFTGalaxy's that it spawns to avoid duplicating i/o. Accept user input of all fields to be read, and warn user if a spawned SWIFTGalaxy reads something not already loaded (it will work, just inefficiently).
  • The optimal efficient iteration order will not be the same as the order of the input target list (and we might break it into parallel pieces). Need to collect results and keep a record of where the results go to match up with the input list. The SWIFTGalaxyIterator can help here: it should (i) be able to expose its generator to allow passing off generated objects to parallel processes and (ii) accept an arbitrary function that operates on the "spawned" SWIFTGalaxy's and returns an arbitrary object. The output of the function will map the input target list to an output list.
  • I think that this can be implemented in pieces. Can start with a naive, inefficient SWIFTGalaxyIterator class that at least takes a target list, generates SWIFTGalaxy's for each galaxy (just do the i/o and lazy-load for each one) and takes a function to map to output. This at least gets an API started, can work on making it more functional and efficient in parts.

All of the basics are now in place! Some further nice-to-have stuff:

  • UI now needs some attention?
  • Warn on reading something not pre-loaded and on failing to pre-load anything.
  • Everything so far is SOAP-focused. Look at other halo finders. Thinking of just explicitly not supporting Velociraptor with SWIFTGalaxies (keep support in SWIFTGalaxy for now).
  • SWIFTGalaxies.map(f) to apply a function f to each SWIFTGalaxy and collect the results.
  • Tests for map.

And of course:

  • Go through comments, tidy up.
  • Sanity-check variable names.
  • Tests. Updating the dummy data will be "fun".
  • Type hints.
  • Docstrings.
  • Narrative docs.
  • Cookbook.
  • Bump version for release.

Pending issues:

  • soap_index and similar work fine with list input, but fail with a numpy or unyt array.
  • Warning when loading something that should have been preloaded should include particle type in message.
  • Testing on live Colibre data fails to get galaxy particles!?
  • Catalogue index attribute disagrees between sg and sgs.
  • Can two catalogue mask variables be refactored into one and just set differently for SOAP?
  • SOAP catalogue attributes are different in sg and sgs
  • Accidentally zipped things out of order in map.

@kyleaoman kyleaoman added the enhancement New feature or request label Sep 16, 2024
@kyleaoman kyleaoman self-assigned this Sep 16, 2024
@kyleaoman kyleaoman marked this pull request as ready for review October 18, 2024 19:21
@kyleaoman kyleaoman merged commit 2eb0e78 into main Oct 18, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Iterate SWIFTGalaxy's efficiently
1 participant