Feature Request - Normalize Names #272

Open
mthelm85 opened this issue Jul 27, 2019 · 7 comments

mthelm85 commented Jul 27, 2019

CSV.jl has a normalizenames argument that can be set to true when reading a file. This option replaces invalid identifier characters (such as spaces) with underscores. I think it would be nice to add this sort of feature to Query, but I would take it a step further and remove all leading/trailing whitespace (rather than replacing it with underscores). From the CSV.jl documentation:

"When a column name is not a single atom Julia identifier, this is inconvenient, because f.column one is not valid, so I would have to manually call getproperty(f, Symbol("column one"))"

Julia's built-in strip and replace functions should do the job. I'd love to make an attempt to write this myself if you can provide a basic roadmap for me to get started.

Thanks!!

@davidanthoff (Member)

Alright, here is how I think we should do this:

First, we add a function like normalize_names to here. It should simply take a named tuple as an input argument, and return a named tuple in which all the names are normalized with the same values. This function will have to be type stable, and to achieve that it will have to be a generated function.

With that function alone, one should already be able to normalize names a la source |> @map(normalize_names(_)) |> ....

And then we simply add a new macro to here that just translates @normalize_names() into @map(normalize_names(_)).

I think you would start with the first part, get that all sorted out, and then in a second step we can add the macro.

If you haven't written a generated function before, let me know, and I'll give more tips on how to do that. You can also take a look at some of the other functions in that file for patterns.
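
As a rough illustration of the first step, here is a minimal sketch of a function that takes a named tuple and returns one with the same values under normalized names. It is deliberately not a generated function, and the names simple_normalize and rename_fields are hypothetical, chosen only for this example. Because the new names are computed at run time, the compiler generally cannot infer the return type, which is the type-stability problem that the generated-function approach is meant to avoid.

```julia
# Hypothetical helper for illustration only: trims surrounding whitespace
# and replaces the remaining spaces with underscores.
simple_normalize(name::Symbol) = Symbol(replace(strip(string(name)), " " => "_"))

# Rebuild the named tuple with the new names and the original values.
# The names are computed at run time, so the result type is generally
# not inferable; this is NOT the type-stable version asked for above.
rename_fields(nt::NamedTuple) = NamedTuple{map(simple_normalize, keys(nt))}(values(nt))

nt = NamedTuple{(Symbol(" queryverse rocks"), :b)}((1, 2.0))
renamed = rename_fields(nt)

@assert keys(renamed) == (:queryverse_rocks, :b)   # names are normalized
@assert values(renamed) == (1, 2.0)                # values are unchanged
```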

davidanthoff added this to the Backlog milestone Jul 27, 2019

mthelm85 commented Jul 28, 2019

I took a look at the NamedTupleUtilities module and there is quite a bit of syntax that looks foreign to me 😨. I have a lot of researching/learning to do before I'll be able to write a function that can accomplish the task at hand. Also, I checked out the CSV.jl source code and there is quite a bit going on there that I also don't understand (it's much more than simply replacing spaces, they are replacing other kinds of characters as well).

I didn't get far at all before getting completely stuck 😳:

function normalize_names(a::NamedTuple)
    normalize(name) = replace(strip(string(name)), " " => "_")
    names_normalized = normalize.(keys(a))
    return names_normalized
end

mynames = (Symbol(" queryverse rocks"), :b, :c)
myvalues = [1, 2.0, "hello world"]
my_namedtuple = NamedTuple{mynames}(myvalues)

normalize_names(my_namedtuple)

Output:
("queryverse_rocks", "b", "c")

This function works, but it obviously doesn't do anything close to what we want it to : )

I haven't written generated functions before and I actually haven't written any code that deals with NamedTuples. In looking at the select function, for example:

@generated function select(a::NamedTuple{an}, ::Val{bn}) where {an, bn}
    names = ((i for i in an if i == bn)...,)
    types = Tuple{(fieldtype(a, n) for n in names)...}
    vals = Expr[:(getfield(a, $(QuoteNode(n)))) for n in names]
    return :(NamedTuple{$names,$types}(($(vals...),)))
end

I don't really understand what an and bn are (or how they work), and I also am unfamiliar with the comma after the splat operator ...,, and again in ($(vals...),).

My (mis)interpretation of the select function goes something like this:

The a argument that the function accepts is a NamedTuple with the names being stored in the tuple an. I guess that ::Val{bn} is a name provided by the user, but I'm really not sure what it is. I'm also unsure about what the where {an,bn} bit does. From there, I think you are creating a names variable that checks to see if the name(s) bn exist(s) in the NamedTuple a. You then create a variable to store the types of each field and another variable vals for storing an array of the NamedTuple's values.

Lastly, the function returns a NamedTuple of just the names/values that the user wanted, as specified by ::Val{bn}. Unfortunately, it all appears very cryptic to a novice like myself!!

I'll keep researching and tinkering with my code to see how much closer I can get to the desired outcome.
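
For readers with the same questions, the individual pieces of the select function can be tried in isolation. The sketch below uses only Base Julia; the helper names field_names and wanted_name are hypothetical and exist only for this illustration. One correction to the reading above: inside a generated function the argument a refers to the argument's type rather than its value, so vals holds expressions of the form getfield(a, :name) that are evaluated later, not the values themselves.

```julia
nt = (a = 1, b = 2.0)

# A NamedTuple type carries two parameters: a tuple of names and a Tuple type
# describing the values.
@assert typeof(nt) == NamedTuple{(:a, :b), Tuple{Int, Float64}}

# `where {an}` binds `an` to the names parameter of whatever named tuple is passed.
field_names(::NamedTuple{an}) where {an} = an
@assert field_names(nt) == (:a, :b)

# `Val(:b)` wraps the symbol :b in a type, so a method can read it as the
# type parameter `bn` without looking at any run-time value.
wanted_name(::Val{bn}) where {bn} = bn
@assert wanted_name(Val(:b)) == :b

# `(gen...,)` splats a generator into a tuple. The trailing comma is what makes
# the parentheses build a tuple, even when only one element survives the filter.
matches = ((i for i in (:a, :b) if i == :b)...,)
@assert matches == (:b,)

# Without the comma, parentheses only group an expression.
@assert (1) == 1
@assert (1,) isa Tuple
```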

@davidanthoff (Member)

Yeah, this stuff is not easy :) The following code might be a useful template:

function normalize_name(x)
    return uppercase(x)
end

@generated function normalize_names(x::NamedTuple{NAMES,TYPES}) where {NAMES, TYPES}
    new_names = (Symbol(normalize_name(string(i))) for i in NAMES)
    return :(NamedTuple{$(tuple(new_names...))}(values(x)))
end

The normalize_names function should be complete as written, so all that remains is to turn normalize_name into something useful: it should take a string containing a name and return a string containing the normalized name (instead of just uppercasing everything).

If CSV does a lot more in terms of normalizing the actual name than just replacing spaces, maybe we can just copy the code from there?

I can also handle the macro; that is mostly boilerplate code that is tricky to get right if you are unfamiliar with it, but it should be really easy for me. If you can write the actual normalization code (plus tests, docs, etc.), that is already a huge help!
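
A quick way to try the template as posted, with the placeholder uppercase rule still in place, might look like the following sketch. The sample named tuple is an assumption made for illustration. @inferred comes from the Test standard library and throws an error if the return type cannot be inferred, which gives a simple check of the type-stability requirement.

```julia
using Test  # standard library; provides @inferred

# Placeholder rule from the template above: just uppercase the name.
normalize_name(x) = uppercase(x)

# The body runs once per named tuple type and sees only the type parameters.
# It computes the new names up front and returns an expression that rebuilds
# the named tuple from the original values.
@generated function normalize_names(x::NamedTuple{NAMES,TYPES}) where {NAMES, TYPES}
    new_names = (Symbol(normalize_name(string(i))) for i in NAMES)
    return :(NamedTuple{$(tuple(new_names...))}(values(x)))
end

nt = (a = 1, b = 2.0)          # sample input, assumed for illustration
result = normalize_names(nt)

@assert result == (A = 1, B = 2.0)
@inferred normalize_names(nt)  # throws if the return type is not inferable
```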


mthelm85 commented Jul 28, 2019

Thanks, David! The below seems to be working fine for me (I copied the code from CSV.jl and merged it with your example above):

using Unicode

const RESERVED = Set(["local", "global", "export", "let",
    "for", "struct", "while", "const", "continue", "import",
    "function", "if", "else", "try", "begin", "break", "catch",
    "return", "using", "baremodule", "macro", "finally",
    "module", "elseif", "end", "quote", "do"])

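function normalize_name(name::String)::Symbol
    # Apply Unicode normalization, then trim leading/trailing whitespace.
    uname = strip(Unicode.normalize(name))
    # Replace every character that cannot appear in an identifier with '_'.
    id = Base.isidentifier(uname) ? uname : map(c->Base.is_id_char(c) ? c : '_', uname)
    # Prefix '_' if the name is empty, starts with an invalid character, or is a reserved word.
    cleansed = string((isempty(id) || !Base.is_id_start_char(id[1]) || id in RESERVED) ? "_" : "", id)
    # Collapse runs of underscores into a single underscore.
    return Symbol(replace(cleansed, r"(_)\1+"=>"_"))
end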

@generated function normalize_names(x::NamedTuple{NAMES,TYPES}) where {NAMES, TYPES}
    new_names = (Symbol(normalize_name(string(i))) for i in NAMES)
    return :(NamedTuple{$(tuple(new_names...))}(values(x)))
end

With this, I can do the following:

mynames = (Symbol(" queryverse rocks"), :b, :c) # note that " queryverse rocks" is not normalized
myvalues = [1, 2.0, "hello world"]
my_namedtuple = NamedTuple{mynames}(myvalues)

normalize_names(my_namedtuple)

Output:
(queryverse_rocks = 1, b = 2.0, c = "hello world") # yay! it's normalized!

I will attempt to write a test for the above and then get back to you in a couple of days. I can definitely handle writing the documentation so I will also take care of that and get it to you in the next couple of days.

@davidanthoff (Member)

Cool! I think we’ll need tests in QueryOperators, but no docs there. But then we’ll need docs in Query for the macro (and tests there as well).


mthelm85 commented Jul 29, 2019

Here's an attempt at the documentation:

The @normalize_names command

The @normalize_names command has the form source |> @normalize_names(). source can be any source that can be queried. The command will normalize column names by replacing invalid identifier characters with underscores to ensure each column is a valid Julia identifier.

Example

using Query

names = (Symbol(" queryverse rocks"), Symbol("¡column #2!"), :c)
values = [1, 2.0, "hello world"]
source = NamedTuple{names}(values)

q = source |> @normalize_names() |> collect

println(q)

# output

(queryverse_rocks = 1, _column_2! = 2.0, c = "hello world")

@mthelm85 (Author)

I'm not sure exactly how to go about the unit tests, but here's something that works:

names = (Symbol(" queryverse rocks"), Symbol("¡column #2!"), :c)
values = [1, 2.0, "hello world"]
source = NamedTuple{names}(values)

@test QueryOperators.NamedTupleUtilities.normalize_names(source) == (queryverse_rocks = 1, _column_2! = 2.0, c = "hello world")
@inferred QueryOperators.NamedTupleUtilities.normalize_names(source)
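
One possible way to organize these checks is a self-contained test set, sketched below. It assumes the normalize_name and normalize_names definitions posted earlier in this thread (repeated here so the block runs on its own, without the QueryOperators module prefix). The extra cases for a reserved word, a leading digit, and repeated spaces are assumptions about what would be worth covering, not something requested in the thread.

```julia
using Test, Unicode

const RESERVED = Set(["local", "global", "export", "let",
    "for", "struct", "while", "const", "continue", "import",
    "function", "if", "else", "try", "begin", "break", "catch",
    "return", "using", "baremodule", "macro", "finally",
    "module", "elseif", "end", "quote", "do"])

# Definitions as posted earlier in the thread.
function normalize_name(name::String)::Symbol
    uname = strip(Unicode.normalize(name))
    id = Base.isidentifier(uname) ? uname : map(c->Base.is_id_char(c) ? c : '_', uname)
    cleansed = string((isempty(id) || !Base.is_id_start_char(id[1]) || id in RESERVED) ? "_" : "", id)
    return Symbol(replace(cleansed, r"(_)\1+"=>"_"))
end

@generated function normalize_names(x::NamedTuple{NAMES,TYPES}) where {NAMES, TYPES}
    new_names = (Symbol(normalize_name(string(i))) for i in NAMES)
    return :(NamedTuple{$(tuple(new_names...))}(values(x)))
end

@testset "normalize_names sketch" begin
    # Cases taken from the examples in this thread.
    @test normalize_name(" queryverse rocks") == :queryverse_rocks
    @test normalize_name("¡column #2!") == Symbol("_column_2!")

    # Additional cases (assumed to be worth covering).
    @test normalize_name("for") == :_for              # reserved word gets a prefix
    @test normalize_name("1st place") == :_1st_place  # leading digit gets a prefix
    @test normalize_name("a  b") == :a_b              # repeated underscores collapse

    # Whole named tuple, plus an inference check for type stability.
    nt = NamedTuple{(Symbol(" queryverse rocks"), :c)}((1, "hello world"))
    @test normalize_names(nt) == (queryverse_rocks = 1, c = "hello world")
    @inferred normalize_names(nt)
end
```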
