Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add sort_by_file_name_case_insensitive method #174

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

lucatrv
Copy link

@lucatrv lucatrv commented Feb 8, 2023

No description provided.

@ChrisDenton
Copy link

ChrisDenton commented Feb 9, 2023

Not to be too nitpicky but maybe this should be sort_by_file_name_lowercase? Unicode make a distinction between case folding and lowercasing. From the Unicode specification (3.13 Default Case Algorithms):

Case folding is related to case conversion. However, the main purpose of case folding is to contribute to caseless matching of strings, whereas the main purpose of case conversion is to put strings into a particular cased form.

Lowercase_Mapping(C) is used for most characters, but there are instances in which the folding must be based on Uppercase_Mapping(C), instead. In particular, the addition of lowercase Cherokee letters as of Version 8.0 of the
Unicode Standard, together with the stability guarantees for case folding, require that Cherokee letters be case folded to their uppercase counterparts. As a result, a case folded string is not necessarily lowercase.

@BurntSushi
Copy link
Owner

Hmmm right, I forgot that std doesn't have a case folding API.

I'm unfortunately inclined to say no to this and instead add some docs that shows how to do it with sort_by and something like the caseless crate.

@lucatrv
Copy link
Author

lucatrv commented Feb 9, 2023

How about renaming the method sort_by_file_name_lowercase to make it clear what it's doing? Or is there a better way to deal with this?

@BurntSushi
Copy link
Owner

I'm not a fan of that. It makes it too easy to do the wrong thing. If we're going to have a case insensitive API then we either do it right (which means either implementing case folding ourselves or introducing a dependency) or we don't do it at all and point folks in the right direction with docs. I'm not inclined to maintain the former at this time, so the latter is the path forward here.

@lucatrv
Copy link
Author

lucatrv commented Feb 10, 2023

IMHO since file names are typically at least alphanumeric (as opposite to numeric only, if not even Unicode), also the current sort_by_file_name method makes already easy to do the wrong thing. In my opinion to make this matter clear to the user, the best would be to add the sort_by_file_name_lowercase method, clarifying in its documentation what it does and relative limitations, and how to reach full case folding when needed. Otherwise the two options (just lowercase or full case folding) could be described in the documentation of the sort_by_file_name method (better with examples).

@lucatrv
Copy link
Author

lucatrv commented Feb 21, 2023

After reading the following interesting references:

https://stackoverflow.com/questions/40250988/how-can-i-case-fold-a-string-in-rust/75526819
https://stackoverflow.com/questions/45745661/lower-vs-casefold-in-string-matching-and-converting-to-lowercase/74702121#74702121
https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf
https://www.w3.org/TR/charmod-norm/#definitionCaseFolding
https://sts10.github.io/2023/01/29/sorting-words-alphabetically-rust.html
https://icu4x.unicode.org/doc/icu_collator/index.html
https://unicode-org.github.io/icu/userguide/collation/
https://www.unicode.org/reports/tr35/tr35-collation.html

now I agree with your point. So please let me know if you prefer to add the sort_by_file_name_lowercase method and describe in its documentation when it falls short, including a better example using icu_collator, or otherwise if you want all this described directly in the sort_by_file_name method's documentation.

@lucatrv
Copy link
Author

lucatrv commented Mar 18, 2023

Here is the example that I wanted to provide, using icu_collator:

use icu_collator::{Collator, CollatorOptions, Strength};

fn main() {
    let mut options = CollatorOptions::new();
    options.strength = Some(Strength::Primary);
    let collator: Collator =
        Collator::try_new_unstable(&icu_testdata::unstable(), &Default::default(), options)
            .unwrap();
    for entry in WalkDir::new("test").sort_by(|a, b| {
        collator.compare(
            &a.file_name().to_string_lossy(),
            &b.file_name().to_string_lossy(),
        )
    }) {
        println!("{}", entry.unwrap().into_path().display());
    }
}

Unfortunately it does not compile with error:

error[E0277]: `Rc<Box<[u8]>>` cannot be shared between threads safely
   --> src\main.rs:64:47
    |
64  |       for entry in WalkDir::new("test").sort_by(|a, b| {
    |  _______________________________________-------_^
    | |                                       |
    | |                                       required by a bound introduced by this call
65  | |         collator.compare(
66  | |             &a.file_name().to_string_lossy(),
67  | |             &b.file_name().to_string_lossy(),
68  | |         )
69  | |     }) {
    | |_____^ `Rc<Box<[u8]>>` cannot be shared between threads safely
    |
    = help: within `Collator`, the trait `Sync` is not implemented for `Rc<Box<[u8]>>`
    = note: required because it appears within the type `Cart`
    = note: required because it appears within the type `Option<Cart>`
    = note: required because it appears within the type `Yoke<CollationDataV1<'static>, Option<Cart>>`
    = note: required because it appears within the type `DataPayload<CollationDataV1Marker>`
    = note: required because it appears within the type `Collator`
    = note: required for `&Collator` to implement `Send`
note: required because it's used within this closure
   --> src\main.rs:64:47
    |
64  |     for entry in WalkDir::new("test").sort_by(|a, b| {
    |                                               ^^^^^^
note: required by a bound in `WalkDir::sort_by`
   --> C:\Users\trevisal\.cargo\registry\src\index.crates.io-6f17d22bba15001f\walkdir-2.3.3\src\lib.rs:396:54
    |
396 |         F: FnMut(&DirEntry, &DirEntry) -> Ordering + Send + Sync + 'static,
    |                                                      ^^^^ required by this bound in `WalkDir::sort_by`

For more information about this error, try `rustc --explain E0277`.

Can you please give me some advice on how to make it work?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants