Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New readdir features: lazy and filetypes #33478

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

nlw0
Copy link
Contributor

@nlw0 nlw0 commented Oct 5, 2019

The original libuv readdir one day became scandir, what is today the basis for Julia's readdir. The way that function works means all directory contents must be held in memory and sorted before processing. By utilizing the new readdir function in libuv we can instead process directory contents in a streaming fashion. This patch implements a lazyreaddir method to do this, delivering the directory contents trough a Channel.

@nlw0
Copy link
Contributor Author

nlw0 commented Oct 5, 2019

Any guidance and suggestions about this PR are very welcome. I could not avoid the changes to sys.c, and it would be great to learn a way to do the same thing just with Julia and ccall.

The ability to list a directory as a stream can be especially useful to deal with directories containing a large number of files, preventing the overhead of holding it all in memory at the same time. While this may not be a very usual situation, this can be very relevant for those in need.

@StefanKarpinski
Copy link
Sponsor Member

I think this would be better exposed as a lazy::Bool=false keyword option to readdir, in which case it should return an iterator. I'm still not entirely convinced that this is necessary. In theory there could exist directories large enough that this is a problem, but in practice we seem to be quite short on actual cases where that's an issue—has anyone actually encountered this problem?

@oscardssmith
Copy link
Member

This isn't a problem I've specifically had with Julia, but in a project one worked on (supervised chess learning), we had to make meaningless subfolders to reduce the number of files per folder, as many applications do not deal well with this.

@nlw0
Copy link
Contributor Author

nlw0 commented Oct 8, 2019

This started with me scratching my own itch: I had to deal with a directory with millions of files, and I knew I could use e.g. Python's os.scandir or even a Scala prompt to iterate over them instead of having to wait forever while everything gets loaded and sorted before I look at a single file (using ls -U | head is too easy).

I decided to give Julia a try and found out readdir works the way it does. Then I went ahead trying to find a way to do a lazy directory traversal in Julia, preferably using libuv, and this is what I came up with. Even if this doesn't get merged, I think it would be nice to have this available in a package with a pure ccall implementation (none of this sys.c stuff).

The argument for this PR really cannot be so much about performance gains, and it doesn't really seem to improve rm, chmod, etc. It's more about being able to have a directory traversal with limited memory and latency. I put some experiments in this gist:
https://gist.github.com/nlw0/991b4f3cfd89f945471e4f38bdce9c93

The channel-based lazyreaddir can work twice as fast as the other versions only with as much as 2.5M files in the same directory. A lazy approach that appends everything to an array (without sorting) has only a small speed gain compared to the original readdir, at most 15%.

I took this example from the Python os.scandir PEP, where they actually mention there can only be some performance gains getting the file size has little cost.

I'm not sure there's any other argument for having a similar function, and I don't know if this can be further improved. I'm just generally interested in this kind of stream processing, and I'm happy it worked. File system processing is maybe not an exciting application, true. Who knows, maybe there is some special FUSE file system where this could make sense? I just find it weird to always have to read and sort a full directory before doing anything.

@nlw0
Copy link
Contributor Author

nlw0 commented Oct 10, 2019

Hope this is no too much information. I made a slightly more interesting experiment.

Because getting the file size from stat is a big overhead, a more relevant task for testing is to simply count the files in a directory. This gist shows two programs to count files using both libuv readdir and scandir, in C and Julia (this branch).

https://gist.github.com/nlw0/0549c9fafa5e6c259e47a73e913c1185

The C program shows that with millions of files the lazy readdir can be 3x faster, so it does pay off significantly in some situation, even though it's just a couple of seconds too. readdir is still 15% faster even for few files.

In Julia using the lazy readdir turned out to be twice as slow in this test. I'm thinking there's probably some overhead on ccall, which is called more times in the function I wrote. I guess that means there's more to figure out, so even if having a readdir function available apart from scandir, there hopefully a better way to take the most out of it.

@StefanKarpinski
Copy link
Sponsor Member

StefanKarpinski commented Oct 11, 2019

To summarize and see if I'm understanding this:

  • eager readdir is always faster
  • lazy scandir is always slower but avoids allocating a large array of names

Even if the array of names is very large, say 10 million, and the names are quite long, say 100 bytes each, that's still only about a gigabyte of memory, which is not much these days.

One possible direction here is to have scandir as a lazy alternative to readdir, but one which returns already-stat'd files or something like that to avoid some other overhead. Also note that walkdir uses readdir but could be transparently changed to use scandir instead, which could potentially be more efficient since it needs to stat each file as well. Yet another reason that walkdir is not a great interface: it pays the cost of stat'ing each file (multiple times) but doesn't give you access to the result of the stat call in case you want to use it.

@nlw0
Copy link
Contributor Author

nlw0 commented Oct 12, 2019

In libuv's present nomenclature, readdir is the lazy one, and scandir is eager and also sorts the data. Julia's Filesystem.readdir is based on scandir. I guess this mismatch is due to the rename in libuv of the old readdir to scandir.

It is true that working with big directories is not that prohibitive nowadays. I just find it nice to follow the design principle that it should be possible to do an unsorted and memory-limited directory traversal, and to offer a way to do it, avoiding to have an intrinsic limitation.

The lazy readdir can outperform scandir, if for any reason because it skips the sorting step. The C benchmarks show this, I just had not achieved it with Julia yet.

In this new experiment here https://gist.github.com/nlw0/e55933632d857481b106d170305c736c
I made a C function that takes a Julia callback. The C function implements the libuv readdir traversal, calling the callback for each file with the available entry name and type information. The callback can be used to implement different things such as merely counting the number of files, or even creating a Channel with full path names.

With this function we can now create a file count in Julia that is faster than doing length(readdir(path)) by a factor of 2. This result should diminish any doubts that there is something to be gained by exploring this approach.

Is there an interest in having something like this in the standard library? Should I reformulate the PR with the callback-based implementation? This is of course a very low-level function, not really intended for every user, although it might be a useful tool for implementing other functions like the file count example.

@StefanKarpinski
Copy link
Sponsor Member

Alternative API idea: readdir(callback::Function, dir::AbstractString=pwd(); join::Bool = false, sort::Bool = true) where each path is passed to the callback and nothing is returned. The callback variation uses the lazy API (if the combination of platform and sort keyword allows) and thus avoids allocating an array of names, instead yielding them one at a time. I'm not sure why that didn't occur to me originally since it seems like the obvious way to do this.

@nlw0
Copy link
Contributor Author

nlw0 commented Apr 5, 2022

The problem I see is that laziness kind of precludes sort=true, unless we make it lazy for sort=false and use the eager readdir otherwise. A saparate method makes it clear it's a different beast. Then we could even have eg lazyreaddir(::Function), readdir(::Function, sort=false) and readdir(::Function, sort=true) (plus dir and join arguments for all)

@StefanKarpinski
Copy link
Sponsor Member

Adding another function is too ugly. It's fine for readdir(f, dir) to only be lazy when the sort keyword is false and document that. Using the lazy API is basically then a "memory optimization" that is used when it can be. There may also be OSes where directory iteration is already sorted in Unicode ordering, in which case the lazy memory optimization can be done.

@nlw0
Copy link
Contributor Author

nlw0 commented Apr 6, 2022

I'm still working on a new patch, but I thought I could share this. I've had this topic on my mind for ages, but only today I found out it's perfectly possible to implement it all just with ccall!

https://gist.github.com/nlw0/f9e0a4c6ebeb11b91e5b92f53b283f64

Then I'm actually not so sure about the callback anymore. The thing about the callback is that these libuv functions can take a callback argument themselves, and maybe that's something to explore later. But it's looking like we could just make an iterator without that first. Is that OK? What's really missing, though, is that it would be nice to preserve the information of file type that these functions return in the uv_dirent_t structure. Another topic for later, maybe?

@StefanKarpinski
Copy link
Sponsor Member

It would be nice to have an option for passing the dirent to the callback. One possibility is to decide what to pass to the callback based on the callback's signature. I've done this in packages to maintain compatibility and it works quite nicely. I suspect that @JeffBezanson might hate it though. To spell that out, if the callback hasmethod(callback, Tuple{String,DirEnt}) then you call it with both arguments. That brings up issues with how to present uv_dirent_t in Julia. We don't currently expose it as a type. We do have DiskStat but I'm guessing those aren't equivalent.

@StefanKarpinski
Copy link
Sponsor Member

StefanKarpinski commented Apr 7, 2022

Ah, I see from your example code that uv_dirent_t is just a name and a file type. I think we'd want to expose that as a symbol: :file, :directory, etc.; see also Tar, which uses symbols for entry types. Or maybe as some enum? Although I think that Julia 1.0 enums were really botched and should be avoided.

@nlw0
Copy link
Contributor Author

nlw0 commented Apr 7, 2022

Yeah, we could map the integer it into some enum or symbols, but is there nothing like that available already?

@StefanKarpinski
Copy link
Sponsor Member

I don't know if there is or not. You could look around for it or @vtjnash might know.

@nlw0
Copy link
Contributor Author

nlw0 commented Apr 8, 2022

Trying out a different approach. First of all, the uv_fs_readdir function has this ominous warning about it being non-thread-safe, so perhaps it's better if we steer away from it.

Second, uv_fs_scandir does have a lazy interface itself, so I thought maybe we just offer an iterator based on that instead of just reading everything in one go. Plus I'm also returning the file type. Still need to make a flag to control this file type output.

@nlw0
Copy link
Contributor Author

nlw0 commented Apr 9, 2022

Finally put everything in a workable state, and added a minimal testing. Hope I can soon make some tests to see if this scandir iterator offers some performance advantage or not. I still think it's nice to offer a limited-memory alternative anyways. The main new feature here might actually be the filetypes...

@nlw0 nlw0 changed the title Implements lazyreaddir based on uv_fs_readdir New readdir features, lazy and filetypes Apr 9, 2022
@nlw0 nlw0 changed the title New readdir features, lazy and filetypes New readdir features: lazy and filetypes Apr 9, 2022
@nlw0 nlw0 force-pushed the nic/lazyreaddir branch 2 times, most recently from f20b2f3 to 1d0282e Compare April 10, 2022 10:29
@nlw0
Copy link
Contributor Author

nlw0 commented Apr 10, 2022

I feel this is pretty close to done, or let's say a working patch. Here's how it's working right now: New flag filetypes makes the code return tuples with ("filename", :filetype) in any case. It's essentially a separate new feature.

Flags lazy=true, sort=false make the function return an iterator. It's based on the same underlying libuv calls, because the original readdir is already taking entries one by one in a loop.

One idea I'm having now, seeing I have essentially duplicated a lot of code, is that we could rely just on the iterator in any case, and then simply return a vector, sorted or not, to replicate the current readdir behavior. Furthermore, we could make this iterator return the dirents, because all the other options are just transforming this data. So this minimizes the code where there's any connection to libuv. One thing I actually still need to understand is what happens to the memory from these C strings? Is libuv managing this internally?

Then in the end it would be a dirent iterator. On top of that we implement filetypes and join essentially as a function taking a dirent argument, like we were taking about. And then we collect and sort according to the lazy and sort flags. Because there's no sort argument going to libuv anyways.

And then in the end having a callback would be something like readdir(f, dir;...) = map(f, readdir(dir; lazy=true,...)). I'm not sure what exactly would be worth implementing or not with that interface. Perhaps we could leave this for a separate patch?

@nlw0 nlw0 force-pushed the nic/lazyreaddir branch 2 times, most recently from d689d37 to 5daaa3a Compare April 11, 2022 06:18
@nlw0
Copy link
Contributor Author

nlw0 commented Apr 11, 2022

One thing I actually still need to understand is what happens to the memory from these C strings? Is libuv managing this internally?

Apparently the strings are de-allocated at the cleanup. What I was thinking is we might have a hardcore version of this where the strings are not copied. I'm not sure how much might be the performance gains, but it's such a clear case of avoiding memory copies...

@StefanKarpinski
Copy link
Sponsor Member

I'm very strongly against having this return an iterator. It should take a callback instead. Making this an iterator means that you need to allocate an iterator object which then may have to do substantial cleanup, which doesn't happen immediately when iteration is done, but rather only happens when GC collects it. That adds unnecessary burden on GC/finalization and worse, not finalizing filesystem resources immediately is a mess, ranging from leaking limited file descriptors, to locking resources so that no one else can do anything with them (I'm looking at you, Windows).

@nlw0
Copy link
Contributor Author

nlw0 commented Apr 12, 2022

I agree there's a problem right now, that is the iterator must run to the end. The cleanup is the Base.Filesystem.uv_fs_req_cleanup(req) and Libc.free(req) called before iterate returns nothing. It's all the same as the eager code, but we are "opening the loop" to produce the values.

The eager code is essentially holding and releasing a resource after it's finished with its for-loop. Providing the eager function with a callback does help limiting memory, so I'm down with that. I would hope we can find a solution to properly holding and releasing a system resource, though, as this is a common pattern.

I see two alternatives, using a finalizer as a destructor, or following the same pattern as open() do. That's pretty much what a readdir() do would look like, I suppose. So it's interesting we are doing it mostly for the sake of the resource management. I'm not sure it would be lazy anymore, though. What would be an alternative for that?

@StefanKarpinski
Copy link
Sponsor Member

I see two alternatives, using a finalizer as a destructor, or following the same pattern as open() do. That's pretty much what a readdir() do would look like, I suppose. So it's interesting we are doing it mostly for the sake of the resource management. I'm not sure it would be lazy anymore, though. What would be an alternative for that?

What do you mean that it isn't lazy anymore? My impression was that the concern here was always that allocating a huge array of names could take up a lot of memory. The callback version avoids allocating that array of names. What more do you mean by "lazy"? Are you talking about terminating before having processed all the names?

@nlw0
Copy link
Contributor Author

nlw0 commented Apr 12, 2022

By lazy I mean essentially having and iterator object, or something ilke uv_fs_scandir_next itself. I get a new value whenever I want with some next function, inside my own for-loop. If I give you a function and it runs inside the for-loop in the julia readdir, that's nice. This pretty much delivers the memory-limited feature request. It's still eagerly running the whole for-loop, though.

Being able to break the loop might actually be a great example of what we still lack. If we are writing a for-loop in C and calling scandir_next, we can just break at some point, and then do our cleanup. To do that based on a function would require either some state mutation, or implementing some special logic that would probably look a lot like an iterator protocol.

I hope I'm not being too picky. My greatest concern is really just ensuring memory-limited operation, so I'm glad if I can get that. But on top of that, I think it's great to offer something more like and iterator. It might enable things I cannot even tell you right now. I just find it to be a generally good thing to have, and I'm curious to see how we might implement it here, as I'm not familiar with any similar examples in Julia as well. I'm not too concerned about it, though, and I'd be glad to just refactor the code the other way. I've actually drafted it somewhere already.

@StefanKarpinski
Copy link
Sponsor Member

There's two approaches:

  1. Throw an exception from the callback to stop early;
  2. Have the callback return true to continue or false to stop.

The first one has to work anyway: the implementation is just buggy if it doesn't handle that correctly, it has to wrap the callback invocation in a try/catch and cleanup properly.

@nlw0
Copy link
Contributor Author

nlw0 commented Apr 12, 2022

Are finalizers generally considered something we shouldn't use? With finalizers, following a "RAII" approach we could do something like

mutable struct LazyReadDir
    dir::AbstractString
    req::Ptr{Nothing}
    function LazyReadDir(dir)
        req = Libc.malloc(Base.Filesystem._sizeof_uv_fs)
        lrd = new(dir, req)
        finalizer(lrd) do x
            Base.Filesystem.uv_fs_req_cleanup(x.req)
            Libc.free(x.req)
        end
    end
end
Base.eltype(::Type{LazyReadDir}) = Base.Filesystem.uv_dirent_t
Base.IteratorEltype(::Type{<:LazyReadDir}) = Base.HasEltype()
Base.IteratorSize(::Type{<:LazyReadDir}) = Base.SizeUnknown()

function Base.iterate(lrd::LazyReadDir)
    err = ccall(:uv_fs_scandir, Int32, (Ptr{Cvoid}, Ptr{Cvoid}, Cstring, Cint, Ptr{Cvoid}),
                C_NULL, lrd.req, lrd.dir, 0, C_NULL)
    err < 0 && Base.Filesystem.uv_error("readdir($(repr(lrd.dir)))", err)    
    return iterate(lrd, nothing)
end

function Base.iterate(lrd::LazyReadDir, state::Nothing)
    ent = Ref{Base.Filesystem.uv_dirent_t}()
    err = ccall(:uv_fs_scandir_next, Cint, (Ptr{Cvoid}, Ptr{Base.Filesystem.uv_dirent_t}), lrd.req, ent)
    if Base.UV_EOF != err
        return ent[], nothing
    else
        return nothing
    end
end

function mylist(dir)
    for (n,x) in enumerate(LazyReadDir(dir))
        println(unsafe_string(x.name))
        if n>=3 break end
    end
end
julia> mylist("/home/user/src/julia")
.buildkite-external-version
.clang-format
.codecov.yml

As far as I know it should play well with things like exceptions. I'm just not familiar with what issues would preclude that.

Otherwise, following the idea of using a do-block to handle a resource, we could still have an iterator with something like this

struct LazyReadDir
    dir::AbstractString
    req::Ptr{Nothing}
    function LazyReadDir(dir)
        req = Libc.malloc(Base.Filesystem._sizeof_uv_fs)
        new(dir, req)
    end
end
Base.eltype(::Type{LazyReadDir}) = Base.Filesystem.uv_dirent_t
Base.IteratorEltype(::Type{<:LazyReadDir}) = Base.HasEltype()
Base.IteratorSize(::Type{<:LazyReadDir}) = Base.SizeUnknown()

function Base.iterate(lrd::LazyReadDir)
    err = ccall(:uv_fs_scandir, Int32, (Ptr{Cvoid}, Ptr{Cvoid}, Cstring, Cint, Ptr{Cvoid}),
                C_NULL, lrd.req, lrd.dir, 0, C_NULL)
    err < 0 && uv_error("readdir($(repr(lrd.dir)))", err)    
    return iterate(lrd, nothing)
end

function Base.iterate(lrd::LazyReadDir, state::Nothing)
    ent = Ref{Base.Filesystem.uv_dirent_t}()
    err = ccall(:uv_fs_scandir_next, Cint, (Ptr{Cvoid}, Ptr{Base.Filesystem.uv_dirent_t}), lrd.req, ent)
    if Base.UV_EOF != err
        return ent[], nothing
    else
        return nothing
    end
end

function withreq(f, dir)
    lrd = LazyReadDir(dir)
    f(lrd)
    Base.Filesystem.uv_fs_req_cleanup(lrd.req)
    Libc.free(lrd.req)    
end

function mylist(dir)
    withreq(dir) do lrd
        for (n,x) in enumerate(lrd)
            println(unsafe_string(x.name))
            if n>=3 break end
        end
    end
end

Not sure it's missing anything to properly handle any exceptions. I'm just curious to hear, would any of these options be considered fine?

I'll make a patch later just for taking a function to readdir, and the filetypes argument, not sure it's worth it pursuing anything else.

@nlw0
Copy link
Contributor Author

nlw0 commented Apr 13, 2022

Delaying the de-allocation or finalization might not necessarily be a bad thing. Of course if it relates to some resource that needs to be free, it's a different story, you need something strict. But in a case like this, it's about just memory de-allocation, there's no conflict. Eagerly de-allocating memory can actually be detrimental to performance. This is often done in languages such as C or C++ just because there's no good memory-management library available out of the box. The ability to postpone memory de-allocation if often one way how more modern languages can provide performance benefits. I like to rely on it whenever possible.

@StefanKarpinski
Copy link
Sponsor Member

StefanKarpinski commented Apr 13, 2022

I'm getting the impression that you really want to do this with an iterable object, but as I've explained, that's a bad idea. It's a different story in languages where finalization is guaranteed to happen when scope exits (and it's a runtime error for the resource to leak), but Julia is not one of those languages. Finalizers are ok but should be avoided when possible in general, and especially bad here because of what I've already explained. If the lazy API looks like this:

for name in readdir(dir, lazy=true)
     # do something with name
end

That means that readdir returns a heap-allocated, mutable, iterable object and relies on the garbage collector to eventually collect that object and clean up all the state required to do lazy directory listing. That is not good. The alternative API I've suggested is this:

readdir(dir) do name
    # do something with name
end

This does not need to allocate an iterable object and guarantees cleanup as soon as processing exits by any means. I'm happy to help with an implementation of the callback based API that I've already said is the way to go, but I'm going to ignore any further discussion of an iterator.

@nlw0
Copy link
Contributor Author

nlw0 commented Apr 13, 2022

I've already conceded to not having an iterator interface. I am interested in having a conversation about these topics, though, and I find it useful to talk over real code. Thanks for putting up with me. I'm learning a lot, and I hope this can be helpful for others.

I just pushed a new version. The current patch merely refactors readdir moving all the libuv calls and the loop to a _readdir function, that takes a callback. readdir was re-implemented on top of that to avoid code duplication. readdir is also responsible for the join and sort features, that do not affect _readdir.

_readdir is not exported, but is documented as an alternative for limited-memory directory traversal. The main reason not to export _readdir would be the need to deal with uv_dirent_t and UV_FS_FILETYPES. It seems like a function that is good enough for implementing library functions, but not really intended for a large public.

With this patch we have a viable way to perform limited-memory directory traversal that is acknowledged, documented and available out of the box. It solves my feature request. This might also be re-used in other system functions such as rm or walkdir, potentially saving a few calls to functions like isdir and a couple of string copies.

Hopefully this design is not too far from what you had in mind. How can we further improve it?

@StefanKarpinski
Copy link
Sponsor Member

StefanKarpinski commented Apr 14, 2022

Sorry—I interpreted the ongoing discussion of returning an iterator as insistence on that approach. I'm pretty short on time these days, so I'm afraid I don't have much time to discuss approaches that we're not going to use. I'll try to take a look at this version some time today.

@nlw0
Copy link
Contributor Author

nlw0 commented Apr 14, 2022

@StefanKarpinski no worries, sorry I'm nerding out. The current patch is a pretty simple refactoring of the current code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants