Column data types not being parsed under v0.8.0-dev #29
Hey David! Can you share a sanitized version of the CSV? By default, Drill will treat all columns in a CSV as `VARCHAR`. For example:

```r
write_csv(mtcars, "~/Data/mtcars.csv")

httr::POST(
  "http://localhost:8047/query.json",
  encode = "json",
  body = list(
    queryType = "SQL",
    query = "SELECT * FROM dfs.d.`/mtcars.csv` LIMIT 1"
  )
) -> res

cat(
  httr::content(res, as = "text")
)
```

causes this to be passed on down the wire:

```json
{
  "queryId" : "23ccb7be-d896-51d5-c1d3-7e93a32a769f",
  "columns" : [ "mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb" ],
  "rows" : [ {
    "carb" : "4",
    "mpg" : "21",
    "qsec" : "16.46",
    "cyl" : "6",
    "hp" : "110",
    "am" : "1",
    "disp" : "160",
    "wt" : "2.62",
    "vs" : "0",
    "drat" : "3.9",
    "gear" : "4"
  } ],
  "metadata" : [ "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR" ]
}
```

Would an …
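Everything in `rows` arrives as a string, matching the all-`VARCHAR` `metadata` array, so any typing has to happen downstream in the client. A minimal standalone sketch (assuming the `jsonlite` package; the JSON is a trimmed copy of the response above) of what the client actually sees:

```r
# Minimal sketch (assumes jsonlite): parse a trimmed-down copy of the REST
# response above and confirm every column comes back as character.
library(jsonlite)

resp <- fromJSON('{
  "columns"  : ["mpg", "cyl", "qsec"],
  "metadata" : ["VARCHAR", "VARCHAR", "VARCHAR"],
  "rows"     : [ { "mpg": "21", "cyl": "6", "qsec": "16.46" } ]
}')

unique(resp$metadata)   # everything is VARCHAR
class(resp$rows$mpg)    # "character" -- numbers arrive as strings
```

So whatever coerces the columns into proper R types (CASTs in the query, or guessing on the client) has nothing to go on from the wire format itself for plain CSVs.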
AUGHH!!! They've deprecated the …
Hrm. 🤔 My inexperience with drill may be showing here. I've been using CTAS queries to convert CSVs (CSVH, technically) to Parquet. Your comments about drill always returning … There does seem to be a difference in how v0.5.0 and the dev version behave:

```r
library(tidyverse)
library(sergeant)

data(flights, package = "nycflights13")

flights %>% write_csv(here::here("data/flights.csvh"))

ds <- sergeant::src_drill()
dat <- tbl(ds, "dfs.root.`/data/flights.csvh`")

glimpse(dat)
```

I may have been relying inadvertently on …
Aye. I inadvertently pushed the new code to … You could also do:

```r
dat <- tbl(ds, sql("
SELECT
  CAST(`year` AS INT) `year`,
  CAST(`month` AS INT) `month`,
  CAST(`day` AS INT) `day`,
  CAST(dep_time AS INT) dep_time,
  CAST(sched_dep_time AS INT) sched_dep_time,
  CAST(dep_delay AS DOUBLE) dep_delay,
  CAST(arr_time AS INT) arr_time,
  CAST(sched_arr_time AS INT) sched_arr_time,
  CAST(arr_delay AS DOUBLE) arr_delay,
  carrier,
  CAST(flight AS VARCHAR) flight,
  tailnum,
  origin
FROM dfs.data.`flights.csvh`
"))

glimpse(dat)
```

```
## Observations: ??
## Variables: 13
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, ...
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, 558, 558, 558, 559, 559, 559, 600, ...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, 600, 600, 600, 600, 559, 600, 600, ...
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1, 0, -1, 0, 0, 1, -8, -3, -4, -4, 0...
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849, 853, 924, 923, 941, 702, 854, 851,...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851, 856, 917, 937, 910, 706, 902, 858,...
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -14, 31, -4, -8, -7, 12, -6, -8, 16, ...
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "AA", "B6", "B6", "UA", "UA", "AA", ...
## $ flight         <dbl> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 49, 71, 194, 1124, 707, 1806, 1187, ...
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39463", "N516JB", "N829AS", "N593JB", ...
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA", "JFK", "LGA", "JFK", "JFK", "JFK",...
```
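A hedged alternative to hand-writing every `CAST` (my sketch, not from the thread): for small result sets you can `collect()` the all-character result and let `readr::type_convert()` guess the types client-side. The `raw` data frame below stands in for a collected result:

```r
# Hedged sketch: coerce an all-character result client-side with readr's
# type guesser instead of CASTing in SQL. Fine for small results; the CAST
# approach keeps the conversion work inside Drill.
library(readr)

raw <- data.frame(
  dep_time  = c("517", "533"),
  dep_delay = c("2", "4"),
  carrier   = c("UA", "UA"),
  stringsAsFactors = FALSE
)

typed <- suppressMessages(type_convert(raw))
sapply(typed, class)   # dep_time/dep_delay become numeric, carrier stays character
```

The trade-off is that all rows come over the wire as strings first, so this only makes sense when the result set is already small.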
The above is likely the best way to do it (apart from running …).
Sounds like a plan. I don't suppose you know of any … After a few columns of manual query creation my typo probability increases. 😬

If the rebase item you mention in #29 (comment) isn't something you need to track, please do feel free to close this issue. Appreciate the package!
Something like…

```r
ctas_profile <- function(x, new_table_name = "CHANGE____ME") {

  stopifnot(inherits(x, "tbl_drill"))

  # pull a few rows locally and let readr guess proper column types
  vals <- collect(head(x))
  vals <- suppressMessages(type_convert(vals))

  # map R column classes to Drill SQL types
  data_type <- function(x) {
    switch(
      class(x)[1],
      integer64 = "BIGINT",
      logical = "BOOLEAN",
      integer = "INTEGER",
      numeric = "DOUBLE",
      factor = "VARCHAR",
      character = "VARCHAR",
      Date = "DATE",
      POSIXct = "TIMESTAMP"
    )
  }

  field_types <- vapply(vals, data_type, character(1))

  sprintf(
    " CAST(`%s` AS %s) AS `%s`",
    names(field_types),
    field_types,
    names(field_types)
  ) -> casts

  casts <- unlist(strsplit(paste0(casts, collapse = ",\n"), "\n"))

  # flag date/time casts that will need a proper format string
  case_when(
    grepl("AS TIMESTAMP", casts) ~ sprintf("%s -- ** Change this to TO_TIMESTAMP() with a proper format string ** (will eventually auto-convert)", casts),
    grepl("AS DATE", casts) ~ sprintf("%s -- ** Change this to TO_DATE() with a proper format string ** (will eventually auto-convert)", casts),
    TRUE ~ casts
  ) -> casts

  orig_query <- x$ops$x

  if (!grepl("select", orig_query, ignore.case = TRUE)) {
    orig_query <- sprintf("SELECT * FROM %s", orig_query)
  }

  sprintf(
    "CREATE TABLE %s AS\nSELECT\n%s\nFROM (%s)\n",
    new_table_name,
    paste0(casts, collapse = "\n"),
    orig_query
  )

}
```

A couple of light tests.

**Bare data source**

```r
flt1 <- tbl(db, "dfs.d.`/flights.csvh`")

cat(
  ctas_profile(flt1)
)
```

```
CREATE TABLE CHANGE____ME AS
SELECT
 CAST(`year` AS DOUBLE) AS `year`,
 CAST(`month` AS DOUBLE) AS `month`,
 CAST(`day` AS DOUBLE) AS `day`,
 CAST(`dep_time` AS DOUBLE) AS `dep_time`,
 CAST(`sched_dep_time` AS DOUBLE) AS `sched_dep_time`,
 CAST(`dep_delay` AS DOUBLE) AS `dep_delay`,
 CAST(`arr_time` AS DOUBLE) AS `arr_time`,
 CAST(`sched_arr_time` AS DOUBLE) AS `sched_arr_time`,
 CAST(`arr_delay` AS DOUBLE) AS `arr_delay`,
 CAST(`carrier` AS VARCHAR) AS `carrier`,
 CAST(`flight` AS DOUBLE) AS `flight`,
 CAST(`tailnum` AS VARCHAR) AS `tailnum`,
 CAST(`origin` AS VARCHAR) AS `origin`,
 CAST(`dest` AS VARCHAR) AS `dest`,
 CAST(`air_time` AS DOUBLE) AS `air_time`,
 CAST(`distance` AS DOUBLE) AS `distance`,
 CAST(`hour` AS DOUBLE) AS `hour`,
 CAST(`minute` AS DOUBLE) AS `minute`,
 CAST(`time_hour` AS TIMESTAMP) AS `time_hour` -- ** Change this to TO_TIMESTAMP() with a proper format string ** (will eventually auto-convert)
FROM (SELECT * FROM dfs.d.`/flights.csvh`)
```

**Carved out fields with …**
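The class-to-Drill-type mapping at the heart of `ctas_profile()` can be sanity-checked on its own, with no Drill connection. `drill_type()` below is just a standalone copy of the inner `data_type()` helper:

```r
# Standalone copy of ctas_profile()'s R-class -> Drill-SQL-type mapping,
# exercised against a small data.frame (no Drill connection needed).
drill_type <- function(x) {
  switch(
    class(x)[1],
    logical   = "BOOLEAN",
    integer   = "INTEGER",
    numeric   = "DOUBLE",
    factor    = "VARCHAR",
    character = "VARCHAR",
    Date      = "DATE",
    POSIXct   = "TIMESTAMP"
  )
}

df <- data.frame(
  n = 1L,          # integer   -> INTEGER
  x = 2.5,         # numeric   -> DOUBLE
  s = "a",         # character -> VARCHAR
  d = Sys.Date(),  # Date      -> DATE
  stringsAsFactors = FALSE
)

vapply(df, drill_type, character(1))
# n = "INTEGER", x = "DOUBLE", s = "VARCHAR", d = "DATE"
```

Note this is why the flights output above shows `DOUBLE` for `year`, `month`, etc.: `type_convert()` guesses `numeric` (not `integer`) for those columns, and `numeric` maps to `DOUBLE`.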
I put this WIP …
Under sergeant master (20190106), csvh files being read in over DBI (`src_drill()`) are coming across as character columns, while the same file parsed under CRAN v0.5.0 sergeant is properly coerced into R data types.

Apologies for the lack of a reprex here. I'm in the thick of a drill-powered project and wanted to throw this early report up in case this was something related to the `rapidjsonr` conversion progress and the new dev branches. If previously unknown, I'll generate a reprex when my hair is a bit less on fire. ;)