Too much opening and closing files #153

Open
bwvdnbro opened this issue Dec 6, 2022 · 8 comments
@bwvdnbro
Member

bwvdnbro commented Dec 6, 2022

Last week I got told off by Peter Draper for overloading the meta-data server of the cosma7 file system. The reason: I was running 192 parallel instances of a script that uses swiftsimio to extract galaxies from a snapshot for analysis. Apparently the meta-data server is queried whenever a file is opened, even if you have only just closed that same file after a previous read. In swiftsimio, the snapshot file is opened (and almost immediately closed again):

  • when reading header information
  • when reading the list of datasets
  • when reading units meta-data
  • when reading basically every bit of meta-data for a dataset (the dataset values themselves are read in one go, but a separate open is used for each bit of meta-data)
  • when reading the cosmology
  • when reading the named columns
  • when reading individual datasets

That is a lot. I understand the reason for reading datasets separately, but I don't see a good reason to open and close the file for each piece of meta-data; that can and should all be read in one go. That might already help a bit.
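
For illustration, a minimal sketch of the "read all meta-data in one go" idea, using plain h5py rather than swiftsimio's own metadata classes; the group names follow the usual SWIFT snapshot layout but should be treated as assumptions here:

```python
import h5py


def read_all_metadata(filename):
    """Read header, unit and cosmology attributes with a single file open.

    Sketch only: the group names below follow the usual SWIFT snapshot
    layout and are assumptions, not swiftsimio's internal API.
    """
    metadata = {}
    with h5py.File(filename, "r") as handle:
        for group in ("Header", "Units", "Cosmology"):
            if group in handle:
                metadata[group] = dict(handle[group].attrs)
    return metadata
```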

Even if the number of reads per load() can be reduced, the more fundamental issue is that, right now, we need to do a separate load() whenever the mask changes (because you want to read a different galaxy, for example). That is great for parallelisation, since it essentially means you can spawn lots of processes for different galaxies without needing to worry about parallel HDF5 (each process uses its own h5py instance). But it means a huge overhead (in file accesses and also in the amount of data being read) when you are trying to read a significant fraction of the snapshot. We need a better mechanism to deal with that. Ideally, we would have a mechanism that reads large chunks of datasets in few operations (which Lustre file systems like), but hides this from the user.

@JBorrow any thoughts on this? We really want to use swiftsimio for the morphology pipeline we plan to use for COLIBRE, but without some way to improve performance for loops over galaxies, we will never be able to run this on a large box.
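
As a rough illustration of the "large chunks, few operations" idea, one option is to read a single contiguous slab that covers many galaxies and slice it in memory, instead of issuing one masked read per galaxy. A hedged sketch, assuming the regions are plain (start, stop) index ranges into one dataset rather than swiftsimio's mask API:

```python
import h5py


def read_regions_batched(filename, dataset_path, regions):
    """Serve many (start, stop) index ranges from one large contiguous read.

    Sketch only: the "PartType0/Coordinates"-style dataset path and the
    region format are assumptions for illustration.
    """
    lo = min(start for start, _ in regions)
    hi = max(stop for _, stop in regions)
    with h5py.File(filename, "r") as handle:
        slab = handle[dataset_path][lo:hi]  # one large, Lustre-friendly read
    # Cheap in-memory slices, one per galaxy.
    return [slab[start - lo : stop - lo] for start, stop in regions]
```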

@JBorrow
Member

JBorrow commented Dec 6, 2022

We could keep a persistent handle for the snapshot around and re-use it. I've avoided doing that so far because the current approach seemed cleaner to me, and it raises the question of how on earth we would ever give the handle back. I guess we could do that in the destructor of the SWIFT dataset, but you would still need to query the metadata server once per galaxy (though that doesn't seem too bad to me).

Specifically, you could add a handle as a property of the SWIFTDataset.
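
Something like this, perhaps (a sketch of the idea only, not the actual SWIFTDataset code; the class name is an illustrative stand-in):

```python
import h5py


class SWIFTDatasetSketch:
    """Illustrative stand-in: keep one handle alive for the dataset's lifetime."""

    def __init__(self, filename):
        self.filename = filename
        self._handle = None

    @property
    def handle(self):
        # Open lazily on first use, then re-use the same handle afterwards.
        if self._handle is None:
            self._handle = h5py.File(self.filename, "r")
        return self._handle

    def __del__(self):
        # "Give back" the handle when the dataset object is destroyed.
        if self._handle is not None:
            self._handle.close()
```

With this pattern the metadata server is still queried once per dataset object (i.e. once per galaxy), but not once per attribute access.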

@JBorrow
Member

JBorrow commented Dec 6, 2022

From Slack: Is this not a massive concern for distributed snapshots? There is surely no other way of doing this than swapping out file handles; otherwise we'll have thousands of handles hanging around. Maybe @jchelly has some insight here? Does HDF5 do some magic under the covers?

@jchelly
Contributor

jchelly commented Dec 6, 2022

Maybe swiftsimio needs an API that allows reading multiple arrays in one operation? That's how I usually deal with multi-file snapshots: make a list of everything I need from each file, then open the files in sequence. It should be possible to get the number of file opens down to a single open of file 0 to read all of the metadata, plus, each time a region is requested, one open per file with particles in that region. You could also use something like functools.lru_cache to keep the last few files around if it's likely the same files will be needed again (see the sketch below).

I'm not sure if HDF5 does anything clever with the virtual file. It seems to delay opening sub-files until you request data from them, but I don't know how it decides to close them, or what happens if there are more files than the maximum number of file handles.
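
For the functools.lru_cache idea, a minimal sketch of what a cached opener could look like (not swiftsimio code; note that handles evicted from the cache are not closed automatically, which a real implementation would have to deal with):

```python
from functools import lru_cache

import h5py


@lru_cache(maxsize=8)
def open_snapshot_file(filename):
    """Return a read-only handle, keeping the last few files open and cached."""
    # Caveat: when an entry is evicted from the cache, the file object is
    # dropped without an explicit close().
    return h5py.File(filename, "r")


def read_datasets(requests):
    """requests: iterable of (filename, dataset_path) pairs, grouped by file.

    Sketch only: repeated requests for the same file re-use the cached handle
    instead of hitting the metadata server again.
    """
    return [open_snapshot_file(fname)[path][...] for fname, path in requests]
```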

@JBorrow
Member

JBorrow commented Dec 7, 2022

So what yt does is hold onto a single file handle. Indeed, John, that may be a nice way to solve this, but it would ruin our lovely attribute-based API.

I guess as long as we hold onto the file handle, maybe HDF5 will do that magic under the covers? I've always hated the idea of distributed snapshots, but I guess people have to use them...

@JBorrow
Member

JBorrow commented Dec 7, 2022

The best way to do this would be to add an @property to SWIFTMetadata that either returns the currently open file handle or opens the file. This would require a slight re-write of the way we handle the SWIFTUnits object.
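
Roughly along these lines, perhaps (a sketch of the proposal rather than the eventual implementation; the classes are illustrative stand-ins and the "Units" group name is an assumption based on the usual SWIFT snapshot layout):

```python
import h5py


class SWIFTMetadataSketch:
    """Stand-in for SWIFTMetadata: owns the file handle for everything else."""

    def __init__(self, filename):
        self.filename = filename
        self._handle = None

    @property
    def handle(self):
        # Return the currently open handle, or (re-)open the file on demand.
        # h5py file objects evaluate as falsy once they have been closed.
        if self._handle is None or not self._handle:
            self._handle = h5py.File(self.filename, "r")
        return self._handle


class SWIFTUnitsSketch:
    """Stand-in for SWIFTUnits: reads via the shared handle, never re-opens the file."""

    def __init__(self, metadata):
        self.units = dict(metadata.handle["Units"].attrs)
```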

@JBorrow
Member

JBorrow commented Dec 7, 2022

Then, when loading multiple regions, you should be able to re-use the same SWIFTMetadata (the same file handle) even if you destroy the SWIFTDataset?

@MatthieuSchaller
Member

Can we load all the meta-data once and for all? That would already reduce the number of reads and is light-weight.

@JBorrow
Member

JBorrow commented Dec 16, 2022

I've proposed a fix in #155.
