
Feature Request - Feather format for cache files #225

Open
2 of 5 tasks
rsangole opened this issue Mar 14, 2018 · 19 comments

@rsangole
Collaborator

Report an Issue / Request a Feature

I'm submitting a (check one with "x"):

  • [ ] bug report
  • [x] feature request

Issue Severity Classification (check one with "x"):

  • [ ] 1 - Severe
  • [ ] 2 - Moderate
  • [ ] 3 - Low
Expected Behavior

I recommend adding the option to use the feather file format when caching dataframe objects. Read about feather here and here.

Advantages of feather for dataframe objects:

  • BLAZING fast compared to RData in R or pickles in Python. I've been using feather at work, and it's been extremely fast.
  • Cached files can be accessed directly from Python (and vice versa), since feather was developed jointly by Wes McKinney and Hadley Wickham. This helps when collaborating on projects with Python components.
Possible Solution

Potential call:

cache('dataframe_object', feather = TRUE)

should save an object called dataframe_object.feather.

@Hugovdberg
Collaborator

Since the cache files shouldn't be accessed by users directly, I would rather stick to the convention-over-configuration philosophy and just introduce it as the default, provided it offers all the functionality we need. It sounds pretty awesome, and like something I could really use in my day-to-day work 👍
Also, we really need a reader for feather files in that case.

@rsangole
Collaborator Author

@Hugovdberg the feather package has the simple functions write_feather and read_feather, which work great.

I don't disagree with keeping the current default for cache (feather only handles dataframes; it cannot serialize other R objects), but a choice for users of the package who have to deal with large datasets would be extremely useful. Thus I'm proposing that cache('name', feather = FALSE) be the default.
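For reference, a minimal round trip with those two functions looks like this (the file name is purely illustrative):

```r
library(feather)

df <- data.frame(x = 1:3, y = letters[1:3])
write_feather(df, "df.feather")    # columnar, language-agnostic on-disk format
df2 <- read_feather("df.feather")  # note: always returns a tibble, not a data.frame
```

The tibble-on-read behavior matters for the cache discussion below, since a cached plain data.frame would not come back with an identical class.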

@Hugovdberg
Collaborator

Better performance shouldn't be an option, it should just be implemented, right? ;-) Some checks for object types have to be built in anyway to make this possible, so I suggest we use a tryCatch to attempt write_feather first and fall back to base::save if it fails. The cached objects are stored in a data.frame anyway, so perhaps we can actually cache everything using feather.

I can try to implement this sometime soon, probably over the weekend.

@rsangole
Collaborator Author

Always want better performance 👍

If cache always stores data.frame, then feather is an excellent alternative.

Here is my benchmark comparing base::save and feather::write_feather on my Intel Xeon 2.6 GHz with 128 GB RAM and a Toshiba 2 TB 7200 RPM SATA3 64 MB hard drive. It shows feather is ~18x faster than base for 10 million rows (~10x faster for 1 million rows). You'll see similar performance for the read functions as well.

# 10-million-row test tibble covering common column types
nrows <- 1e7
fake_data <- dplyr::tibble(
    dates = as.POSIXct(lubridate::today()) + 1:nrows,
    random_numers = runif(nrows),
    booleans = sample(c(TRUE, FALSE), size = nrows, replace = TRUE),
    strings = sample(letters, size = nrows, replace = TRUE)
)
# sprinkle 1,000 NAs at random positions
na_table <- dplyr::tibble(
    na_rows = sample(1:nrow(fake_data), 1e3, replace = FALSE),
    na_cols = sample(1:ncol(fake_data), 1e3, replace = TRUE)
)
for (i in 1:nrow(na_table)) {
    fake_data[na_table$na_rows[i],
              na_table$na_cols[i]] <- NA
}
head(fake_data)
head(fake_data)
#> # A tibble: 6 x 4
#>   dates               random_numers booleans strings
#>   <dttm>                      <dbl> <lgl>    <chr>  
#> 1 2018-03-14 20:00:01       0.00834 TRUE     k      
#> 2 2018-03-14 20:00:02       0.00219 FALSE    l      
#> 3 2018-03-14 20:00:03       0.652   FALSE    q      
#> 4 2018-03-14 20:00:04       0.0718  FALSE    w      
#> 5 2018-03-14 20:00:05       0.775   FALSE    z      
#> 6 2018-03-14 20:00:06       0.212   FALSE    t


result_of_timing_test <- microbenchmark::microbenchmark(
    base=save(fake_data, file = 'fake_data.RData'),
    feather=feather::write_feather(fake_data, 'fake_data.feather'),
    times = 10
)

print(result_of_timing_test, signif = 2)
#> Unit: seconds
#>     expr  min   lq mean median   uq  max neval
#>     base 24.0 24.0 24.0   24.0 24.0 26.0    10
#>  feather  1.1  1.1  1.3    1.2  1.4  1.6    10

microbenchmark::autoplot.microbenchmark(result_of_timing_test)
#> Loading required namespace: ggplot2

[autoplot of the microbenchmark results]

@KentonWhite
Owner

feather is a good idea for caching. I'm hearing that making it a suggested package rather than a hard dependency is the way forward? What do we do about migrating projects? Keep .RData unless a user runs migrate.project()? Silently update the cache to feather? Allow a mix of .RData and feather files in the cache? Ask the user if they want to upgrade their project?

@Hugovdberg
Collaborator

I was just looking into using feather for the cache, but my proposed tryCatch solution would create a problem with loading from the cache. The cache should return items exactly as they were written to disk, but read_feather always returns a tibble.
Also, I was mistaken that all data is cached inside a data.frame, and even if it were, it wouldn't help because the feather package only allows atomic column types.

I did some benchmarks comparing as.data.table(read_feather()) to load(), and feather is a lot faster even with the conversion, but I'm not sure all uses of data.table are compatible.

My suggestion therefore would be to do the following (sketched in R):

# Write to cache: feather for plain data.frames/tibbles, save() otherwise
if (identical(class(x), "data.frame") || tibble::is_tibble(x)) {
    feather::write_feather(x, cache_file)
} else {
    save(x, file = cache_file)
}

# Read from cache: dispatch on the file extension
if (tools::file_ext(cache_file) == "feather") {
    assign(varname, feather::read_feather(cache_file))
} else {
    load(cache_file)
}

The major advantage of this is that there is no backward incompatibility. Files should not be manually saved to or read from the cache directory, so we can accept a hybrid state, even just writing new files to feather while keeping old .RData files until the variable is cleared from the cache at one point.

Regarding just implementing or making it optional: I don't feel bad about adding a dependency on feather, but if you guys do we should make it optional.

@KentonWhite
Owner

This looks good. If feather is an optional dependency, we will need a way to select feather or RData in the config, with RData the default.

@rsangole
Collaborator Author

@Hugovdberg just following up on this. Are you taking care of this feature, or would you like someone else to pick it up? cheers!

@Hugovdberg
Collaborator

As mentioned in #191 we should investigate performance of feather and fst in relation to speed and compatibility with data types.

@rsangole
Collaborator Author

Agreed, thanks for bringing fst to this discussion.

A quick look at these packages leads me to state...

  • fst is highly optimized for SSDs. We should investigate whether it performs similarly on non-SSD drives.
  • Both fst and feather are intended for dataframes only.
  • feather gives the user interoperability with Python code; I'm not sure whether fst does as well.

Why can't we support both as options to cache()? We can keep the default as whichever is generally faster for most users, while users with specific needs can cache to the format they prefer.

@KentonWhite
Owner

We should choose one. ProjectTemplate is meant to be opinionated. While fst is faster in some circumstances, it still returns a data.frame. We've made the decision to move towards tibbles, which is what feather returns. My opinion is we support feather. Thoughts?

@rsangole
Collaborator Author

If we had to pick one, I would also pick feather. Apart from the tibble returned, I also enjoy the fact that it's cross-platform compatible, which enables data scientists to mix Python and R code efficiently. Since it's being developed by Wickham and McKinney, it'll enjoy long-term support too.

@Hugovdberg
Collaborator

I agree feather is probably the nicest, although I disagree with the argument that interoperability with Python is a pro: the cache isn't meant to be read by other programs.
I'm trying to get this to work, but there are a few more hiccups. The is_tibble function also returns TRUE for sf objects (spatial dataframes), yet write_feather will issue a warning and write an incomplete file. How do you feel about a construction like this (again sketched in R):

old_warn <- options(warn = 2)  # promote all warnings to errors
tryCatch({
    feather::write_feather(variable, cache.file.feather)
    # verify the round trip; any mismatch triggers the fallback
    stopifnot(identical(variable, feather::read_feather(cache.file.feather)))
}, error = function(e) {
    if (file.exists(cache.file.feather)) {
        file.remove(cache.file.feather)
    }
    save(variable, file = cache.file.rdata)
}, finally = {
    options(old_warn)
})

This means that whenever feather does not give us back exactly what we tried to write, we fall back to the standard save functionality.

@Hugovdberg mentioned this issue Apr 28, 2018
@KentonWhite
Owner

Has this issue been raised with the feather maintainers?

I like checking that the result is the same. A bit concerned that there will be a mixture of .RData and feather files in the cache. But I'm OK with this construction.

@rsangole
Collaborator Author

although I disagree with your argument that interoperability with Python is a pro. The cache isn't meant to be read by other programs....

@Hugovdberg this is actually a use case I face every day within the same project. When working with large dataframes (10M+, 100M+ rows), R's interactive visualization methods (plotly, ggplotly, etc.) are painfully slow. An example workflow:

  • Save to feather in cache/
  • Use Python code saved in src/python/ to interact with the data using glueviz

Another colleague does something similar, where Python's dictionaries and hashtable support lead to a combined R+Python approach. Thus feather+cache has come in very handy.


Re: the pseudocode, that seems fine, and it'll work well for smaller datasets. The stopifnot(identical(variable, read_feather(cache.file))) check might roughly double execution time. I don't have a solution for you, but eventually we can figure something out.

@gisler
Contributor

gisler commented Jul 11, 2020

Hi,

I would like to bring a rather new package to this discussion: qs.

According to its Using qs vignette, it seems to be fast and able to serialize just about any R object.
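Unlike feather, qs is not limited to dataframes; its basic API is a qsave/qread pair (the object and path here are illustrative):

```r
library(qs)

# arbitrary R objects serialize, not just dataframes
obj <- list(model = lm(mpg ~ wt, data = mtcars), notes = "anything goes")
qsave(obj, "cache/obj.qs")     # fast serialization to disk
obj2 <- qread("cache/obj.qs")  # restores the object with its original class
```

Because it round-trips arbitrary objects, it would avoid the tibble-vs-data.frame mismatch discussed earlier in this thread.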

@gisler
Contributor

gisler commented Nov 22, 2020

Hi all,

I would like to implement the qs format (see my comment above) as a (possible) replacement for the RData format for cache files.

My questions now are:

  • Is this okay with you?
  • Given it is okay with you, how should we design compatibility with old projects? One possibility would be to add a cache file format option to the configuration file, allowing one to select the format. Another would be to "upgrade" the RData files on project migration, saving them as qs files for future use but leaving the RData files untouched.

What do you think?

@KentonWhite
Owner

Hi,

This sounds like a really great idea! What I would like to do for compatibility is a staged approach:

  1. Start by making this an option in the config with the default set to the old format.
  2. Then after it is stable, we set the default for new projects to this format.
  3. Then offer a migration path for upgrading with the option to not upgrade if you don't want to.
  4. Then if that is stable we remove the old way and use the new way in all cases.

I like these staged rollouts because they make it easier to find and fix errors. In Stage 1 we get bug reports from people who know what they are doing, which helps us more easily debug problems with the qs format. In Stage 2 we start to get the newbie problems, since someone downloads ProjectTemplate, sets up a new project, and then runs into issues with qs. Stage 3 lets us discover migration issues before rolling into Stage 4.

Does this plan work for you?

@gisler
Contributor

gisler commented Nov 22, 2020

Sure, this plan works very well for me. And I'm really glad you like the idea.

I'll add cache_file_format: RData to the configs. The other option will be "qs".

Once I'm done, I'll open a pull request.
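As a sketch, the new key would sit alongside the existing entries in ProjectTemplate's config/global.dcf (the default shown is the one proposed above; set it to qs to opt in to the new format):

```
cache_file_format: RData
```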
