New readdir features: lazy and filetypes #33478
base: master
Conversation
Any guidance and suggestions about this PR are very welcome. I could not avoid the changes to …

The ability to list a directory as a stream can be especially useful when dealing with directories containing a large number of files, avoiding the overhead of holding them all in memory at the same time. While this may not be a very common situation, it can be very relevant for those who need it.
I think this would be better exposed as a …
This isn't a problem I've specifically had with Julia, but in a project I worked on (supervised chess learning), we had to create meaningless subfolders to reduce the number of files per folder, as many applications do not deal well with this.
This started with me scratching my own itch: I had to deal with a directory with millions of files, and I knew I could use e.g. Python's os.scandir, or even a Scala prompt, to iterate over them instead of having to wait forever while everything gets loaded and sorted before I can look at a single file. I decided to give Julia a try and found out …

The argument for this PR really cannot be so much about performance gains, and it doesn't really seem to improve rm, chmod, etc. It's more about being able to do a directory traversal with limited memory and latency. I put some experiments in this gist: …

The channel-based lazyreaddir can work twice as fast as the other versions, but only with as many as 2.5M files in the same directory. A lazy approach that appends everything to an array (without sorting) has only a small speed gain compared to the original readdir, at most 15%. I took this example from the Python os.scandir PEP, where they actually mention that there can only be some performance gains if getting the file size has little cost.

I'm not sure there's any other argument for having a similar function, and I don't know if this can be further improved. I'm just generally interested in this kind of stream processing, and I'm happy it worked. File system processing is maybe not an exciting application, true. Who knows, maybe there is some special FUSE file system where this could make sense? I just find it weird to always have to read and sort a full directory before doing anything.
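To make the "channel-based lazyreaddir" shape concrete, here is a rough sketch of what such a function could look like. The low-level helpers `_opendir`, `_readdir_batch`, and `_closedir` are hypothetical names standing in for ccall wrappers around libuv's directory functions; they are not part of this patch or of Base.

```julia
# Sketch only: a Channel-fed lazy readdir. The helpers _opendir,
# _readdir_batch and _closedir are hypothetical stand-ins for the
# low-level libuv wrappers (uv_fs_opendir / uv_fs_readdir).
function lazyreaddir(path::AbstractString; bufsize::Integer=1024)
    Channel{String}(bufsize) do ch
        dir = _opendir(path)
        try
            while true
                names = _readdir_batch(dir)   # one batch of entry names
                isempty(names) && break
                foreach(n -> put!(ch, n), names)
            end
        finally
            _closedir(dir)                    # release the handle even on error
        end
    end
end

# Consumption is then streaming: only one batch is in memory at a time.
# for name in lazyreaddir("/some/huge/dir")
#     ...
# end
```

The bounded Channel is what limits memory: the feeder task blocks once the buffer is full, so at most `bufsize` names are held at once.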
Hope this is not too much information. I made a slightly more interesting experiment. Because getting the file size from stat is a big overhead, a more relevant task for testing is to simply count the files in a directory. This gist shows two programs to count files using both libuv readdir and scandir, in C and Julia (this branch): https://gist.github.com/nlw0/0549c9fafa5e6c259e47a73e913c1185

The C program shows that with millions of files the lazy … In Julia, using the lazy readdir turned out to be twice as slow in this test. I'm thinking there's probably some overhead on …
To summarize and see if I'm understanding this:
Even if the array of names is very large, say 10 million, and the names are quite long, say 100 bytes each, that's still only about a gigabyte of memory, which is not much these days. One possible direction here is to have …
In libuv's present nomenclature, …

It is true that working with big directories is not that prohibitive nowadays. I just find it nice to follow the design principle that it should be possible to do an unsorted and memory-limited directory traversal, and to offer a way to do it, avoiding an intrinsic limitation. The lazy …

I put a new experiment here: https://gist.github.com/nlw0/e55933632d857481b106d170305c736c

With this function we can now create a file count in Julia that is faster than doing …

Is there an interest in having something like this in the standard library? Should I reformulate the PR with the callback-based implementation? This is of course a very low-level function, not really intended for every user, although it might be a useful tool for implementing other functions like the file count example.
Alternative API idea: …
The problem I see is that laziness kind of precludes …
Adding another function is too ugly. It's fine for …
I'm still working on a new patch, but I thought I could share this. I've had this topic on my mind for ages, but only today I found out it's perfectly possible to implement it all just with ccall! https://gist.github.com/nlw0/f9e0a4c6ebeb11b91e5b92f53b283f64

Then I'm actually not so sure about the callback anymore. The thing about the callback is that these libuv functions can take a callback argument themselves, and maybe that's something to explore later. But it looks like we could just make an iterator without that first. Is that OK? What's really missing, though, is that it would be nice to preserve the file type information that these functions return in the …
It would be nice to have an option for passing the dirent to the callback. One possibility is to decide what to pass to the callback based on the callback's signature. I've done this in packages to maintain compatibility and it works quite nicely. I suspect that @JeffBezanson might hate it, though. To spell that out: if the callback …
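The signature-based dispatch described here can be sketched with `hasmethod`. Everything below is illustrative: `foreach_direntry` is a hypothetical helper, and the `(name, kind)` pairs stand in for whatever the low-level readdir wrapper would produce.

```julia
# Sketch: pass the dirent type to the callback only if its signature
# accepts a second argument. The hasmethod check runs once, outside
# the per-entry loop. All names here are hypothetical.
function foreach_direntry(f, entries)
    wants_type = hasmethod(f, Tuple{String, Symbol})
    for (name, kind) in entries          # kind e.g. :file, :dir, :link
        wants_type ? f(name, kind) : f(name)
    end
end

# A one-argument callback gets only the name:
foreach_direntry([("a.txt", :file), ("src", :dir)]) do name
    println(name)
end

# A two-argument callback also receives the entry type:
foreach_direntry([("a.txt", :file), ("src", :dir)]) do name, kind
    println(name, " => ", kind)
end
```

This keeps one entry point while letting callers opt into the extra information, at the cost of the somewhat magical behavior hinted at above.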
Ah, I see from your example code that …
Yeah, we could map the integer into some enum or symbols, but is there nothing like that available already?
I don't know if there is or not. You could look around for it or @vtjnash might know.
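For what it's worth, mapping the raw integer to an enum could look like the sketch below. This is not something Base currently exposes; the numeric values are taken from the declaration order of `uv_dirent_type_t` in uv.h and are worth double-checking against the headers.

```julia
# Sketch: mirror libuv's uv_dirent_type_t as a Julia enum.
# Values assumed from uv.h's declaration order -- verify before use.
@enum DirEntryType begin
    DIRENT_UNKNOWN = 0
    DIRENT_FILE    = 1
    DIRENT_DIR     = 2
    DIRENT_LINK    = 3
    DIRENT_FIFO    = 4
    DIRENT_SOCKET  = 5
    DIRENT_CHAR    = 6
    DIRENT_BLOCK   = 7
end

# Convert the raw integer from the dirent struct into the enum:
direnttype(raw::Integer) = DirEntryType(raw)
```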
Trying out a different approach. First of all, the … Second, …
Finally put everything in a workable state, and added minimal testing. Hope I can soon make some tests to see whether this scandir iterator offers a performance advantage or not. I still think it's nice to offer a limited-memory alternative anyway. The main new feature here might actually be the file types...
Force-pushed the branch from f20b2f3 to 1d0282e.
I feel this is pretty close to done, or let's say a working patch. Here's how it's working right now: new flag … Flags …

One idea I'm having now, seeing that I have essentially duplicated a lot of code, is that we could rely just on the iterator in any case, and then simply return a vector, sorted or not, to replicate the current … Then in the end it would be a dirent iterator. On top of that we implement … And then in the end having a callback would be something like …
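The consolidation idea above, keeping a single dirent iterator and building the eager, sorted readdir on top of it, might be sketched like this. `lazyentries` is a hypothetical name for the underlying iterator, not part of the patch:

```julia
# Sketch: replicate the current eager readdir behavior on top of a
# lazy dirent iterator. `lazyentries` is a hypothetical stand-in for
# the iterator discussed above.
function readdir_eager(path::AbstractString; sorted::Bool=true)
    names = String[e.name for e in lazyentries(path)]
    sorted && sort!(names)
    return names
end
```

This way the eager and lazy paths would share one traversal implementation, with sorting as a thin layer on top, removing the duplication.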
Force-pushed the branch from d689d37 to 5daaa3a.
Apparently the strings are de-allocated at the …
I'm very strongly against having this return an iterator. It should take a callback instead. Making this an iterator means that you need to allocate an iterator object which then may have to do substantial cleanup, which doesn't happen immediately when iteration is done, but rather only when GC collects it. That adds unnecessary burden on GC/finalization and, worse, not finalizing filesystem resources immediately is a mess, ranging from leaking limited file descriptors to locking resources so that no one else can do anything with them (I'm looking at you, Windows).
I agree there's a problem right now, which is that the iterator must run to the end. The cleanup is the …

The eager code is essentially holding and releasing a resource after it's finished with its for-loop. Providing the eager function with a callback does help limit memory, so I'm on board with that. I would hope we can find a solution to properly holding and releasing a system resource, though, as this is a common pattern. I see two alternatives: using a …
What do you mean that it isn't lazy anymore? My impression was that the concern here was always that allocating a huge array of names could take up a lot of memory. The callback version avoids allocating that array of names. What more do you mean by "lazy"? Are you talking about terminating before having processed all the names?
By lazy I mean essentially having an iterator object, or something like …

Being able to break the loop might actually be a great example of what we still lack. If we are writing a for-loop in C and calling …

I hope I'm not being too picky. My greatest concern is really just ensuring memory-limited operation, so I'm glad if I can get that. But on top of that, I think it's great to offer something more like an iterator. It might enable things I cannot even tell you right now. I just find it to be a generally good thing to have, and I'm curious to see how we might implement it here, as I'm not familiar with any similar examples in Julia either. I'm not too concerned about it, though, and I'd be glad to just refactor the code the other way. I've actually drafted it somewhere already.
There are two approaches:
The first one has to work anyway: the implementation is just buggy if it doesn't handle that correctly; it has to wrap the callback invocation in a try/catch and clean up properly.
Are finalizers generally considered something we shouldn't use? With finalizers, following a "RAII" approach, we could do something like …

As far as I know it should play well with things like exceptions; I'm just not familiar with what issues would preclude that. Otherwise, following the idea of using a do-block to handle a resource, we could still have an iterator with something like this: …

Not sure it's missing anything to properly handle exceptions. I'm just curious to hear: would any of these options be considered fine? I'll make a patch later just for taking a function to …
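The do-block idea alluded to above could look roughly like the following, mirroring how `open(f, path)` manages file handles. `scandir`, `ScanDirIterator`, and `closedir` are hypothetical names used only for illustration:

```julia
# Sketch: scope the iterator inside a do-block so cleanup is
# guaranteed, in the same spirit as open(f, path). All names here
# are hypothetical, not part of Base.
function scandir(f::Function, path::AbstractString)
    it = ScanDirIterator(path)   # holds the open directory handle
    try
        return f(it)
    finally
        closedir(it)             # runs on normal exit, break, or exception
    end
end

# Usage: the handle is released as soon as the block exits,
# even if the loop breaks early or throws.
# scandir("/some/dir") do entries
#     for e in entries
#         startswith(e.name, ".") && continue
#         println(e.name)
#     end
# end
```

The try/finally makes the cleanup deterministic without relying on a finalizer, while still exposing an iterable inside the block.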
Delaying the de-allocation or finalization might not necessarily be a bad thing. Of course, if it relates to some resource that needs to be freed, it's a different story; you need something strict. But in a case like this it's just about memory de-allocation, so there's no conflict. Eagerly de-allocating memory can actually be detrimental to performance. This is often done in languages such as C or C++ just because there's no good memory-management library available out of the box. The ability to postpone memory de-allocation is often one way modern languages can provide performance benefits. I like to rely on it whenever possible.
I'm getting the impression that you really want to do this with an iterable object, but as I've explained, that's a bad idea. It's a different story in languages where finalization is guaranteed to happen when scope exits (and it's a runtime error for the resource to leak), but Julia is not one of those languages. Finalizers are ok but should be avoided when possible in general, and they are especially bad here because of what I've already explained. If the lazy API looks like this:

```julia
for name in readdir(dir, lazy=true)
    # do something with name
end
```

That means that … The callback version looks like this:

```julia
readdir(dir) do name
    # do something with name
end
```

This does not need to allocate an iterable object and guarantees cleanup as soon as processing exits by any means. I'm happy to help with an implementation of the callback-based API that I've already said is the way to go, but I'm going to ignore any further discussion of an iterator.
I've already conceded to not having an iterator interface. I am interested in having a conversation about these topics, though, and I find it useful to talk over real code. Thanks for putting up with me. I'm learning a lot, and I hope this can be helpful for others.

I just pushed a new version. The current patch merely refactors …
With this patch we have a viable way to perform limited-memory directory traversal that is acknowledged, documented, and available out of the box. It solves my feature request. This might also be re-used in other system functions such as …

Hopefully this design is not too far from what you had in mind. How can we further improve it?
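As a usage illustration, the file-count experiment from the gists could be written against the callback form roughly like this. The `readdir(f, dir)` do-block signature is the proposed API from this PR, not current Base:

```julia
# Sketch: counting entries with the proposed callback-based readdir,
# without materializing the full (sorted) name vector.
# readdir(f, dir) is the proposed signature, not current Base.
function countfiles(dir::AbstractString)
    n = 0
    readdir(dir) do name    # callback invoked once per entry
        n += 1
    end
    return n
end
```

Memory use stays constant regardless of how many entries the directory has, which is exactly the limited-memory property argued for above.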
Sorry, I interpreted the ongoing discussion of returning an iterator as insistence on that approach. I'm pretty short on time these days, so I'm afraid I don't have much time to discuss approaches that we're not going to use. I'll try to take a look at this version some time today.
@StefanKarpinski no worries, sorry I'm nerding out. The current patch is a pretty simple refactoring of the current code.
The original libuv readdir one day became scandir, which is today the basis for Julia's `readdir`. The way that function works means all directory contents must be held in memory and sorted before processing. By utilizing the new readdir function in libuv we can instead process directory contents in a streaming fashion. This patch implements a `lazyreaddir` method to do this, delivering the directory contents through a `Channel`.