Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warn about use of $ in aes() #2693

Closed
hadley opened this issue Jun 9, 2018 · 11 comments · Fixed by #3346
Closed

Warn about use of $ in aes() #2693

hadley opened this issue Jun 9, 2018 · 11 comments · Fixed by #3346

Comments

@hadley
Copy link
Member

hadley commented Jun 9, 2018

Quite a few people seem to do this, and it introduces potentially subtle bad behaviour when you start facetting. I think we now have sufficient understanding to detect this correctly, although we’ll need to carefully consider how to override $ so ok usage is not caught.

@clauswilke
Copy link
Member

clauswilke commented Jun 9, 2018

I think what needs to be caught are vectors of constants in aes(). The same problem exists for parameters, but it may be more difficult to check systematically. (Maybe there is a geom somewhere that uses vectors of constants as parameters for valid reasons.)

Here is a slightly modified reprex from the off-topic discussion in #2660, to highlight the issues:

library(ggplot2) 

df <- data.frame(x = 1:4,
                 y = 1:4,
                 label = c("a", "b", "a", "b"),
                 colour =  c("magenta", "green", "magenta", "green"))

# using a data column inside aes causes problems, the colors don't match within facets
ggplot(df, aes(x, y, colour = df$colour)) +
  geom_point(size = 3) +
  facet_wrap(~label) +
  scale_colour_identity()

# directly handing the colors to `geom_point()` as a parameter is also wrong
ggplot(df, aes(x, y)) +
  geom_point(size = 3, colour = df$colour) +
  facet_wrap(~label)

# this is valid code, however, if a bit weird
df2 <- data.frame(point = "point", label = "label")
ggplot(df, aes(x, y)) +
  geom_point(aes(colour = df2$point), size = 3) +
  geom_text(aes(x = x + 0.2, label = label, colour = df2$label))

Created on 2018-06-09 by the reprex package (v0.2.0).

@baptiste
Copy link
Contributor

baptiste commented Jun 9, 2018

On StackOverflow I used to refer to this example

@hadley
Copy link
Member Author

hadley commented Jun 10, 2018

Possible implementations:

  • In aes(), inspect the AST for uses of $. This is simple, but there are legitimate uses of $ not related to retrieving a column from a data frame. However, we'd probably catch 90% of bad usage only by looking for top-level usage of $, i.e. aes(df$x) not aes(df$x + 2). Hopefully that would be enough of a hint to steer people in the right direction generally.

  • Evaluate aesthetics in an environment where $.data.frame throws an error. This is probably going to be more challenging as R moves away from lexical scoping for S3 methods and towards pre-registration, but I'm sure we could make it work with sufficient ingenuity. The downside is that you wouldn't be able to refer to columns within a dataframe-column (which I think over the next few years the tidyverse is going to move towards supporting)

Neither of those techniques would catch this related problem:

df <- data.frame(x = 1:3)
y <- 3:1

ggplot(df, aes(x, y)) + geom_point()

Or code that's handy, but potentially problematic like aes(x, resid(mod)).

I wonder if we should rethink how the facetting code works to ensure that we don't mess up vectors that don't come from the data? It might be possible to fix by evaluating the aesthetics in a different way (although that might

Then identifying $ usage would just be a style hint, not something that is critical for correct ggplot2 code.

@clauswilke
Copy link
Member

The problem arises with the reordering of the data frame on this line:

data[order(data$PANEL), ]

It's not immediately clear to me that the reordering does anything other than make the created data frame look nice. When I replace that line with just data things seem to work. Only two regression tests fail, and those explicitly assume a particular order in the calculated data frame.

library(ggplot2) # patched line 202 in facet-wrap.r

df <- data.frame(x = 1:4,
                 y = 1:4,
                 label = c("a", "b", "a", "b"),
                 colour =  c("magenta", "green", "magenta", "green"))

# works
ggplot(df, aes(x, y, colour = df$colour)) +
  geom_point(size = 3) +
  facet_wrap(~label) +
  scale_colour_identity()

# works
ggplot(df, aes(x, y)) +
  geom_point(size = 3, colour = df$colour) +
  facet_wrap(~label) +
  scale_colour_identity()

Created on 2018-06-10 by the reprex package (v0.2.0).

@thomasp85
Copy link
Member

FWIW I think it is a simple design mistake (can't remember if I introduced it during the rewrite, but it doesn't matter anyway). map_data should just add a PANEL column to the data - nothing else

@hadley
Copy link
Member Author

hadley commented Jun 10, 2018

It's been there a long time, at least since 687ba6e.

Let's remove in the next version, and then think about how to make a warning message steering people away from $ usage.

@clauswilke
Copy link
Member

I just made a pull request, so the required changes are recorded. I didn't encounter any problems with geom_point(), but more extensive testing would be good.

@paleolimbot
Copy link
Member

If this is still something worth fixing, I think there are some possibilities. First, I think it is possible to inspect the AST and look for calls to $ where the target is the data:

library(rlang)
aes <- ggplot2::aes

is_dollar_sign_call <- function(call) {
  is.call(call) && identical(call[[1]], quote(`$`))
}

is_double_bracket_call <- function(call) {
  is.call(call) && identical(call[[1]], quote(`[[`))
}

is_extract_call <- function(call) {
  is_dollar_sign_call(call) || is_double_bracket_call(call)
}

check_extract_usage_expr <- function(x, data, env = emptyenv()) {
  if (is_extract_call(x)) {
    data_eval <- eval_tidy(x[[2]], data, env)
    if(identical(tracemem(data_eval), tracemem(data))) {
      bad_call <- x
      good_call <- x
      good_call[[2]] <- quote(.data)
      warning(
        "Use of `", format(bad_call), "` is discouraged. ",
        "Use `", format(good_call),  "` instead.",
        call. = FALSE
      )
    }
  } else if (is.call(x)) {
    lapply(x, check_extract_usage_expr, data, env)
  } else if (is.pairlist(x)) {
    lapply(x, check_extract_usage_expr, data, env)
  }
  
  invisible()
}

check_extract_usage_quo <- function(quosure, data) {
  check_extract_usage_expr(get_expr(quosure), data, get_env(quosure))
}

check_extract_usage <- function(mapping, data) {
  lapply(mapping, check_extract_usage_quo, data)
  invisible(mapping)
}

# data
returns_x <- function() "x"
df <- tibble::tibble(x = 1:5, nested_df = tibble::tibble(x = 6:10))

# good
check_extract_usage(aes(x), df)
check_extract_usage(aes(.data$x), df)
check_extract_usage(aes(.data[["x"]]), df)
check_extract_usage(aes(.data[[!!quo("x")]]), df)
check_extract_usage(aes(.data[[returns_x()]]), df)
check_extract_usage(aes(!!sym("x")), df)
check_extract_usage(aes(x * 10), df)
check_extract_usage(aes(nested_df$x), df)
check_extract_usage(aes(nested_df[["x"]]), df)
check_extract_usage(aes(.data[[c("nested_df", "x")]]), df)
check_extract_usage(aes(.data[[c(2, 1)]]), df)
check_extract_usage(aes(.data[[1]]), df)


# bad: use of extraction
check_extract_usage(aes(df$x), df)
#> Warning: Use of `df$x` is discouraged. Use `.data$x` instead.
check_extract_usage(aes(df[["x"]]), df)
#> Warning: Use of `df[["x"]]` is discouraged. Use `.data[["x"]]` instead.

Created on 2019-05-28 by the reprex package (v0.2.1)

Similarly, it is possible to look for places where no columns from data were mapped at all. This may help with #2689 (in that it gives some measure of which names were referred to), although I'm less sure of the corner cases on this one.

library(rlang)
aes <- ggplot2::aes

is_dollar_sign_call <- function(call) {
  is.call(call) && identical(call[[1]], quote(`$`))
}

is_double_bracket_call <- function(call) {
  is.call(call) && identical(call[[1]], quote(`[[`))
}

is_extract_call <- function(call) {
  is_dollar_sign_call(call) || is_double_bracket_call(call)
}

is_pronoun_operation <- function(call) {
  is.call(call) && identical(call[[2]], quote(.data))
}

index_value_to_name <- function(index, data) {
  if (is.character(index)) {
    index[1]
  } else if (is.numeric(index)) {
    names(data)[index[1]]
  } else {
    NULL
  }
}

find_data_references <- function(x, data, env = emptyenv()) {
  if (is.name(x) && (as.character(x) %in% names(data))) {
    as.character(x)
  } else if (is_double_bracket_call(x) && is_pronoun_operation(x)) {
    index_value_to_name(eval_tidy(x[[3]], data, env), data)
  } else if (is_dollar_sign_call(x) && is_pronoun_operation(x)) {
    as.character(x[[3]])
  } else if(is_dollar_sign_call(x)) {
    find_data_references(x[[2]], data, env)
  } else if (is.call(x)) {
    new_names <- lapply(x, find_data_references, data, env)
    unlist(new_names)
  } else if (is.pairlist(x)) {
    new_names <- lapply(x, find_data_references, data, env)
    unlist(new_names)
  }
}

find_data_references_quo <- function(quosure, data) {
  find_data_references(get_expr(quosure), data, get_env(quosure))
}

find_mapped_cols <- function(mapping, data) {
  intersect(unlist(lapply(mapping, find_data_references_quo, data)), names(data))
}

check_mapping <- function(mapping, data) {
  data_name <- as_label(enquo(data))
  cols <- find_mapped_cols(mapping, data)
  if (length(cols) == 0) {
    stop("At least one column in `", data_name, "` must be mapped", call. = FALSE)
  }
  
  invisible(mapping)
}

# data
returns_x <- function() "x"
df <- tibble::tibble(x = 1:5, nested_df = tibble::tibble(x = 6:10))

# good
check_mapping(aes(x), df)
check_mapping(aes(.data$x), df)
check_mapping(aes(.data[["x"]]), df)
check_mapping(aes(.data[[!!quo("x")]]), df)
check_mapping(aes(.data[[returns_x()]]), df)
check_mapping(aes(!!sym("x")), df)
check_mapping(aes(x * 10), df)
check_mapping(aes(nested_df$x), df)
check_mapping(aes(nested_df[["x"]]), df)
check_mapping(aes(.data[[c("nested_df", "x")]]), df)
check_mapping(aes(.data[[c(2, 1)]]), df)
check_mapping(aes(.data[[1]]), df)


# bad: use of extraction
check_mapping(aes(df$x), df)
#> Error: At least one column in `df` must be mapped
check_mapping(aes(df[["x"]]), df)
#> Error: At least one column in `df` must be mapped

# bad: no columns were mapped
check_mapping(aes(not_a_column_in_df), df)
#> Error: At least one column in `df` must be mapped
check_mapping(aes(not_a_column_in_df$x), df)
#> Error: At least one column in `df` must be mapped
check_mapping(aes(), df) 
#> Error: At least one column in `df` must be mapped

# error during evaluation
check_mapping(aes(.data[[not_a_function()]]), df)
#> Error in not_a_function(): could not find function "not_a_function"

Created on 2019-05-28 by the reprex package (v0.2.1)

@hadley
Copy link
Member Author

hadley commented May 30, 2019

I think that overall approach has legs, so I think it's worth turning into a PR so I can give it full review. A few quick thoughts before you do that:

  • You can collapse (e.g.) is_dollar_sign_call() to is_call(x, "$") which I think means you can eliminate the named functions

  • Use is_reference() instead of identical() + tracemem()

  • I don't think we should be routinely warning people to use .data. I think in most cases we need to tell them to just use the variable name

  • I'm worried that check_mapping() is going to be too strict. Maybe we should also allow constants, and only make it a warning? (But then how do you turn the warning off?)

@clauswilke
Copy link
Member

I'm worried that check_mapping() is going to be too strict. Maybe we should also allow constants, and only make it a warning? (But then how do you turn the warning off?)

Yes. It's important that examples such as the following continue to work. Not sure whether this would work with the current implementation of check_mapping().

library(ggplot2)

ggplot(mtcars, aes(disp, mpg)) +
  geom_point(aes(color = "data")) +
  geom_smooth(aes(color = "loess smooth"), se = FALSE)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Created on 2019-05-30 by the reprex package (v0.2.1)

@lock
Copy link

lock bot commented Dec 1, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Dec 1, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants