Too much opening and closing files #153

Open
bwvdnbro opened this issue Dec 6, 2022 · 8 comments
@bwvdnbro
Member

bwvdnbro commented Dec 6, 2022

Last week I got told off by Peter Draper for overloading the meta-data server of the cosma7 file system. The reason: I was running 192 parallel instances of a script that uses swiftsimio to extract galaxies from a snapshot for analysis. Apparently the meta-data server is queried whenever a file is opened, even if you have only just closed that same file after a previous read. In swiftsimio, the snapshot file is opened (and almost immediately closed again):

  • when reading header information
  • when reading the list of datasets
  • when reading units meta-data
  • when reading basically every bit of meta-data for a dataset (the dataset values themselves are read in one go, but a separate open is used for each bit of meta-data)
  • when reading the cosmology
  • when reading the named columns
  • when reading individual datasets

That is a lot. I understand the reason for reading datasets separately, but I don't see a good reason to open and close the file for each piece of meta-data; that can and should all be read in one go. That might already help a bit.
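
For illustration, a minimal sketch of the "read all meta-data in one go" idea, using plain h5py rather than swiftsimio's own metadata classes; the group names follow the usual SWIFT snapshot layout but should be treated as assumptions here:

```python
import h5py


def read_all_metadata(filename):
    """Read header, unit and cosmology attributes with a single file open.

    Sketch only: the group names below follow the usual SWIFT snapshot
    layout and are assumptions, not swiftsimio's internal API.
    """
    metadata = {}
    with h5py.File(filename, "r") as handle:
        for group in ("Header", "Units", "Cosmology"):
            if group in handle:
                metadata[group] = dict(handle[group].attrs)
    return metadata
```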

Even if the number of reads per load() can be reduced, the more fundamental issue is that, right now, we need to do a separate load() whenever the mask changes (because you want to read a different galaxy, for example). That is great for parallelisation, since it essentially means you can spawn lots of processes for different galaxies without needing to worry about parallel HDF5 (each process uses its own h5py instance). But it means a huge overhead (in file accesses and also in the amount of data being read) when you are trying to read a significant fraction of the snapshot. We need a better mechanism to deal with that. Ideally, we would have a mechanism that reads large chunks of datasets in few operations (which Lustre file systems like), but hides this from the user.

@JBorrow any thoughts on this? We really want to use swiftsimio for the morphology pipeline we plan to use for COLIBRE, but without some way to improve performance for loops over galaxies, we will never be able to run this on a large box.
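
As a rough illustration of the "large chunks, few operations" idea, one option is to read a single contiguous slab that covers many galaxies and slice it in memory, instead of issuing one masked read per galaxy. A hedged sketch, assuming the regions are plain (start, stop) index ranges into one dataset rather than swiftsimio's mask API:

```python
import h5py


def read_regions_batched(filename, dataset_path, regions):
    """Serve many (start, stop) index ranges from one large contiguous read.

    Sketch only: the "PartType0/Coordinates"-style dataset path and the
    region format are assumptions for illustration.
    """
    lo = min(start for start, _ in regions)
    hi = max(stop for _, stop in regions)
    with h5py.File(filename, "r") as handle:
        slab = handle[dataset_path][lo:hi]  # one large, Lustre-friendly read
    # Cheap in-memory slices, one per galaxy.
    return [slab[start - lo : stop - lo] for start, stop in regions]
```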

@JBorrow
Member

JBorrow commented Dec 6, 2022

We could keep a persistent handle for the snapshot around and re-use it. I've avoided doing that so far because the current approach seemed cleaner to me, and it raises the question of how on earth we would ever give the handle back. I guess we could do that in the destructor of the SWIFT dataset, but you would still need to query the metadata server once per galaxy (though that doesn't seem too bad to me).

Specifically, you could add a handle as a property of the SWIFTDataset.
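
Something like this, perhaps (a sketch of the idea only, not the actual SWIFTDataset code; the class name is an illustrative stand-in):

```python
import h5py


class SWIFTDatasetSketch:
    """Illustrative stand-in: keep one handle alive for the dataset's lifetime."""

    def __init__(self, filename):
        self.filename = filename
        self._handle = None

    @property
    def handle(self):
        # Open lazily on first use, then re-use the same handle afterwards.
        if self._handle is None:
            self._handle = h5py.File(self.filename, "r")
        return self._handle

    def __del__(self):
        # "Give back" the handle when the dataset object is destroyed.
        if self._handle is not None:
            self._handle.close()
```

With this pattern the metadata server is still queried once per dataset object (i.e. once per galaxy), but not once per attribute access.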

@JBorrow
Member

JBorrow commented Dec 6, 2022

From Slack: Is this not a massive concern for distributed snapshots? There is surely no other way of doing this than swapping out file handles; otherwise we'll have thousands of handles hanging around. Maybe @jchelly has some insight here? Does HDF5 do some magic under the covers?

@jchelly
Contributor

jchelly commented Dec 6, 2022

Maybe swiftsimio needs an API that allows reading multiple arrays in one operation? That's how I usually deal with multi-file snapshots: make a list of everything I need from each file, then open the files in sequence. It should be possible to get the number of file opens down to a single open of file 0 to read all of the metadata, plus, each time a region is requested, one open per file with particles in that region. You could also use something like functools.lru_cache to keep the last few files around if it's likely the same files will be needed again (see the sketch below).

I'm not sure if HDF5 does anything clever with the virtual file. It seems to delay opening sub-files until you request data from them, but I don't know how it decides to close them, or what happens if there are more files than the maximum number of file handles.
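
For the functools.lru_cache idea, a minimal sketch of what a cached opener could look like (not swiftsimio code; note that handles evicted from the cache are not closed automatically, which a real implementation would have to deal with):

```python
from functools import lru_cache

import h5py


@lru_cache(maxsize=8)
def open_snapshot_file(filename):
    """Return a read-only handle, keeping the last few files open and cached."""
    # Caveat: when an entry is evicted from the cache, the file object is
    # dropped without an explicit close().
    return h5py.File(filename, "r")


def read_datasets(requests):
    """requests: iterable of (filename, dataset_path) pairs, grouped by file.

    Sketch only: repeated requests for the same file re-use the cached handle
    instead of hitting the metadata server again.
    """
    return [open_snapshot_file(fname)[path][...] for fname, path in requests]
```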

@JBorrow
Member

JBorrow commented Dec 7, 2022

So what yt does is hold onto a single file handle. Indeed, John, that may be a nice way to solve this, but it would ruin our lovely attribute-based API.

I guess as long as we hold onto the file handle, maybe HDF5 will do that magic under the covers? I've always hated the idea of distributed snapshots, but I guess people have to use them...

@JBorrow
Member

JBorrow commented Dec 7, 2022

The best way to do this would be to add an @property to SWIFTMetadata that either returns the currently open file handle or opens the file. This would require a slight re-write of the way we handle the SWIFTUnits object.
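
Roughly along these lines, perhaps (a sketch of the proposal rather than the eventual implementation; the classes are illustrative stand-ins and the "Units" group name is an assumption based on the usual SWIFT snapshot layout):

```python
import h5py


class SWIFTMetadataSketch:
    """Stand-in for SWIFTMetadata: owns the file handle for everything else."""

    def __init__(self, filename):
        self.filename = filename
        self._handle = None

    @property
    def handle(self):
        # Return the currently open handle, or (re-)open the file on demand.
        # h5py file objects evaluate as falsy once they have been closed.
        if self._handle is None or not self._handle:
            self._handle = h5py.File(self.filename, "r")
        return self._handle


class SWIFTUnitsSketch:
    """Stand-in for SWIFTUnits: reads via the shared handle, never re-opens the file."""

    def __init__(self, metadata):
        self.units = dict(metadata.handle["Units"].attrs)
```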

@JBorrow
Member

JBorrow commented Dec 7, 2022

Then, when loading multiple regions, you should be able to re-use the same SWIFTMetadata (the same file handle) even if you destroy the SWIFTDataset?

@MatthieuSchaller
Member

Can we load all the meta-data once and for all? That would already reduce the number of reads and is light-weight.

@JBorrow
Member

JBorrow commented Dec 16, 2022

I've proposed a fix in #155.
