Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move DirEntry::ino method to an extension trait, Support more metadata for sort_by #53

Closed
wants to merge 8 commits into from

Conversation

jeremielate
Copy link
Contributor

@jeremielate jeremielate commented Jun 17, 2017

Is it right to break the api when making improvements ? If so, do I need to change the version number ?

@jeremielate jeremielate changed the title Move DirEntry::ino method to an extension trait Move DirEntry::ino method to an extension trait, Support more metadata for sort_by Jun 17, 2017
@KodrAus
Copy link
Contributor

KodrAus commented Jun 19, 2017

Thanks @jeremielate! No need to update the version number yet. You're right this is a breaking change, but I think we'll update the version number later, because there are a few breaking changes in other PRs to work through.

Just so we've got some links between issues, you've covered:

EDIT: My mistake, you've referenced the issues in commits, I didn't notice :)

I'll spend some time digging through this change (there's a fair bit there), but @BurntSushi will probably come along at some point with feedback.

Copy link
Contributor

@KodrAus KodrAus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the hard work @jeremielate!

I think your sort_by implementation is a great reference for discussion. We can work on more details in this PR conversation.

src/lib.rs Outdated
@@ -842,6 +843,15 @@ impl DirEntry {
}
}

#[cfg(unix)]
impl std::os::unix::fs::DirEntryExt for DirEntry {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'd think this is a perfect reason to implement DirEntryExt in the standard library but we shouldn't do that :( Those traits are going away at some point, so we need to define our own DirEntryExt with an ino method and implement that.

It'll be easier to put this trait and its implementation for DirEntry in a separate unix.rs file.

src/lib.rs Outdated
@@ -191,7 +190,8 @@ struct WalkDirOptions {
max_open: usize,
min_depth: usize,
max_depth: usize,
sorter: Option<Box<FnMut(&OsString,&OsString) -> Ordering + 'static>>,
sorter: Option<Box<FnMut((&OsStr, Option<Metadata>),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we haven't got the Metadata on hand already I think we'll need to consider the implications of fetching it multiple times. It might be a non-issue, but it might also be expensive.

At the very least, I think it's worth pulling these arguments into a struct, something like:

pub struct DirEntrySort {
    file_name: OsString,
    metadata: Option<fs::Metadata>,
}

impl DirEntrySort {
    pub fn file_name(&self) -> &OsStr {
        &self.file_name
    }

    pub fn metadata(&self) -> Option<&fs::Metadata> {
        self.metadata.as_ref()
    }
}

cc: @BurntSushi

@@ -792,3 +792,18 @@ fn walk_dir_sort_small_fd_max() {
assert_eq!(got,
["", "/foo", "/foo/abc", "/foo/abc/fit", "/foo/bar", "/foo/faz"]);
}

#[test]
fn walk_dir_send_sync_traits() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! 👍

…rt_by callback.

Trait DirEntryExt copied and implemented from std::os::unix::fs to a local module.
Copy link
Contributor

@KodrAus KodrAus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks again! 👍

I've just left 2 really minor comments.

I'm happy with your new sort_by implementation, but will leave the rest to @BurntSushi

src/lib.rs Outdated
@@ -842,6 +861,15 @@ impl DirEntry {
}
}

#[cfg(unix)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move this impl into the new unix module too.

src/unix.rs Outdated
@@ -0,0 +1,8 @@

/// Unix specific extension methods.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Do you think we should mirror the docs on the DirEntryExt trait in the standard library?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how to properly link std docs, does the link to this trait may break in the future if change occur in the std ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just meant copying the docs in the standard DirEntryExt, rather than linking to that trait. So our walkdir::unix::DirEntryExt trait looks like:

/// Unix-specific extension methods for `walkdir::DirEntry`
pub trait DirEntryExt {
    /// Returns the underlying `d_ino` field in the contained `dirent` structure.
    fn ino(&self) -> u64;
}

@jeremielate
Copy link
Contributor Author

I have added a commit referencing #38 while an older pull request has already corrected this issue, do i need to remove this last commit ?

@KodrAus
Copy link
Contributor

KodrAus commented Jun 19, 2017

@jeremielate That's all good, we'll coordinate it all when merging PRs in 👍

Let's keep the rest of this PR focused on the new sort_by though, so it's easier to discuss any specifics around that. I understand this particular feature touches a lot of other parts of walkdir so you're running into these other issues as you go.

Please feel free to open other PRs for any other issues you want to tackle! You've done some awesome work already.

…DirEntryExt.

Moved trait implementation for DirEntry in module unix.
Copy link
Owner

@BurntSushi BurntSushi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeremielate Thanks so much for working on this!

I think this API is going to be tricky to get right. There are many competing concerns, and I am pretty sure that the implementation in this PR is probably not the right balance. Here are my thoughts:

  1. This implementation ignores errors if the metadata call fails and instead just returns None. This probably isn't acceptable.
  2. Even if the caller only sorts by OsStr, they are still forced to pay the cost of an extra stat call because metadata is called unconditionally.
  3. This introduces a new public type just for the sorting API, but there's no real obvious reason why we can't just use DirEntry instead. If we passed &DirEntry to the sort callback, then we could remove the extra public type, and the caller could avoid the extra stat call when they don't need it. The downside of this approach is that if the caller calls metadata in the comparator, then each call to metadata will result in another stat call, which is unfortunate for performance reasons, but also because each call to metadata could yield a different result. And therefore, you'll end up sorting data that's changing during the sort, which is bad.

So I'm not sure what the right path is here. Technically, we could solve all of the problems by introducing a new public type that lightly wraps DirEntry by caching any metadata retrievals. But this seems clunky and unfortunate.

On top of all of that, passing &DirEntry to the comparator will require some not-so-nice refactoring inside walkdir. Right now, the actual DirEntry is constructed as late as possible. To make this work, we probably want to change DirList::Closed to hold a vec::IntoIter<Result<DirEntry>> instead of a vec::IntoIter<Result<fs::DirEntry>>. And that in turns probably means threading the current depth through some more code.

Thoughts? Ideas? Other simple solutions that I'm missing?

&self.file_name
}

/// Optional metadata.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you insert a blank /// line after this?

@KodrAus
Copy link
Contributor

KodrAus commented Jun 28, 2017

@BurntSushi Maybe we can start by listing some requirements for the sort_by function, so we're all on the same page about what an appropriate solution will look like?

From what I can gather:

  • We don't want to call metadata unless a caller actually needs it because:
    • It's expensive
  • We want to avoid multiple calls to metadata because:
    • It's expensive
    • You might get different results
  • We don't want to introduce much internal complexity in walkdir because:
    • It might not be necessary and who wants to deal with that?

I need to go back and refresh my memory of how this all works, but it sounds like either caching metadata calls, or not calling sort_by until after we've got a DirEntry sounds like a possible approach. Caching might mean introducing something sort of like an Arc<RwLock<Option<DirEntry>>> though, where we build the DirEntry if it's None and stash it. That's not as elegant as not calling sort_by until we've already got a DirEntry on hand, but I'm not sure what that would involve just yet.

I'm saying this out of ignorance of walkdir's current internals though, so please correct me if I've mistaken something :)

@BurntSushi
Copy link
Owner

@KodrAus Sounds like you've got the requirements right, although I'm actually not too concerned about internal complexity. I think we can deal with that. What I'm worried about is introducing new public types that complicate the API that end users need to understand.

The code should be able to be refactored such that we build the DirEntry early enough to pass to sort_by. It's not a pleasant refactoring, but I don't see why we can't do it. Once we do that, though, if we hand the sorter the DirEntry's directly and the sorter calls metadata, then we wind up with the "extra cost + sorting by changing data" problem. You might think we could solve this by caching metadata inside the DirEntry itself, but I feel like that violates the principle of least surprise in two distinct ways:

  1. If we used RefCell internally: "What do you mean DirEntry doesn't satisfy Sync?" Alternatively, if we used a lock: "What do you mean DirEntry has a mutex inside of it?"
  2. Calling metadata wouldn't necessarily correspond to actually fetching the metadata. I would normally think that this is bad, but std's DirEntry has platform specific behavior that might not result in re-fetching the data, so I suppose we would have some precedent there.

Assuming we're OK with (2), does that mean we need to convince ourselves to be OK with (1)?

@KodrAus
Copy link
Contributor

KodrAus commented Jun 28, 2017

Ah my mistake, for some reason I was thinking DirEntrys were expensive to build because their metadata was already cached, and that it was always cheap to call metadata on them. That's not the case.

So I was basically already expecting (2), but I guess it's a bit of a subtle deviation from std, I think we'd need to call out that the metadata is a snapshot in the docs. I'd personally prefer calling it out in the method signature, but cached_metadata or metadata_snapshot might be a bit much.

For (1) I'd be more surprised about DirEntry not being Sync than DirEntry containing a lock, because at least that implementation detail doesn't then leak out to its public API via the Sync trait. But I guess a Mutex would also technically leak if you wanted to return borrowed Metadata instead of cloning.

So at this stage I don't find myself too surprised by those points, but I don't think I'm really representative of walkdir users.

@BurntSushi
Copy link
Owner

@alexcrichton Do you have any thoughts here? I think you did a good chunk of std::fs, so you might have some additional context that can inform the design here. I'll try to summarize the issue briefly for you.

The overall problem we're trying to solve here is to support the sort_by method on WalkDir. This method is special because it's used internally while walking directories, which makes it more efficient than just collecting the entire list of entries into memory and sorting them. The issue is this: what parameters should we pass to the comparator given by the caller?

Today, we pass a pair of &OsStrings, which permits the caller to only sort by the base file name, but ideally, the caller should be able to sort by any attribute of the file. The most natural solution to this is to pass a pair of &DirEntrys. The problem here is that if the comparator calls metadata, then technically, the result of metadata could change over time which seems bad to do during a sort. We could cache this inside DirEntry itself, but using a RefCell or a Mutex inside a DirEntry seems strange. It's also strange in that subsequent calls to metadata yield a cached view instead of what it does today: always ask for metadata. What do you think?

Note that the plan is to release 2.0, so breaking changes are OK.

@alexcrichton
Copy link

Yeah I agree that my expectation here is that a pair of &DirEntry would be passed in that has cheap access (no syscalls) to the filename and to the stuff like file type if it's available.

If I understand it right then calling metadata seems fine to me. I think you're worried about the case of while you're iterating a directory someone's also changing it concurrently? If walkdir has not-perfect results in that situation that seems reasonable to me.

@BurntSushi
Copy link
Owner

I think you're worried about the case of while you're iterating a directory someone's also changing it concurrently?

Right. That, and also performance concerns. The metadata call really only needs to be called once, but the comparator can be called multiple times with the same DirEntry. I worry that once we commit to this API, we're going to get a bug report that says, "the sort isn't optimal and it slows down traversal, what should I do?" And I think our answer is basically, "there's nothing you can do."

I guess I'm OK just moving forward and passing DirEntry as is, but if we do that, we should simultaneously decide that the bug report I'm anticipating is something we explicitly won't support.

@alexcrichton
Copy link

In theory the user could have a cache in this closure, right? That way you could use metadata if you wanted and also cache metadata lookups?

@BurntSushi
Copy link
Owner

Yeah, I suppose they could.

@BurntSushi
Copy link
Owner

With #70 merged, I think we might want to pair this PR down to just the extension trait. What do you think @jeremielate?

@KodrAus
Copy link
Contributor

KodrAus commented Aug 1, 2017

Hi @jeremielate, do you still have time to work on this? If not that's totally fine, but we appreciate your contribution and want you to have the chance to get it merged in.

This was referenced Aug 3, 2017
@BurntSushi BurntSushi closed this Feb 21, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants