[Feature Request] Save Pins to Databricks #839

viv-analytics · 2024-08-31T17:50:13Z

Dear Development Team,

Given the increasing collaboration between Posit and Databricks, I believe that the ability to store pins on Databricks, as is already possible on other platforms such as S3 and Azure, could be an appealing feature for enterprise clients.

Sincerely,

juliasilge · 2024-09-03T16:46:40Z

Thanks for this suggestion! 🙌

Can you share some specifics about how and what you would like to store in Databricks, perhaps highlighting what is different from the workflows supported by sparklyr? Like this:
https://spark.posit.co/deployment/databricks-connect.html
Using sparklyr has some real benefits over something like pins, such as being able to execute SQL queries.

wklimowicz · 2024-09-04T20:31:47Z

I have a use case for this, although I'm interested in suggestions if there's a better solution.

I work a lot with survey data that comes in the SPSS .sav file format. The .sav format is great because haven reads in the labels as ordered factors, so it retains the information behind the order of responses, like c("Strongly agree", "Agree", ...) or c("Yes", "Maybe", "No"). I want to keep it in a form where the factors are retained, which rules out standard Databricks tables.

There are Unity Catalog Volumes, but I can't figure out how to store factors on Databricks while retaining read access from my local machine. You can save and read .sav files from a Volume, but only when using a notebook in the browser. When running sparklyr from my local machine, it only seems to work for spark_read_csv. There is no spark_read_sav function, and if I convert to an intermediate format like parquet, it only seems to work from a notebook (i.e. running spark_read_parquet from my local machine errors out; the documentation is a bit sparse, so I'm not sure if this is a bug or unsupported behaviour).

Pins in Databricks would solve this, because I could write the data directly to board_databricks as Rds or qs. I would also want to use it as the default location for pins anyway: my organisation has board_ms365 disabled, and the permissions on the shared network drive are a bit opaque. This solution would also be in the Databricks spirit of having all your data and governance on a single platform.

Thanks for all your work on this package!

jmbarbone · 2024-09-18T03:43:14Z

In general, I think accessing/storing information in Databricks Volumes provides some great benefits:

* The storage location is closer to the processing/working location (we're not jumping out of Databricks for files)
* Volumes are backed by Unity Catalog, so orgs can control access to information we'd want to write to {pins}
* Volumes can hold unstructured data
* Volumes should look like file systems; volumes can hold multiple files and folders
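Because a Volume behaves like a file system, a pin could plausibly map onto one folder per pin name with one subfolder per version. The sketch below (Python, standard library only) builds such a path. The `/Volumes/<catalog>/<schema>/<volume>` prefix is how Unity Catalog exposes volumes; the `<pin>/<version>/<file>` layout, the helper name `volume_pin_path`, and the example values are assumptions for illustration, not necessarily what a Databricks board in {pins} would do.

```python
import posixpath


def volume_pin_path(catalog, schema, volume, pin, version, filename):
    """Build a POSIX-style path to a pin file inside a Unity Catalog Volume.

    The <pin>/<version>/<file> layout is a hypothetical convention.
    """
    parts = [catalog, schema, volume, pin, version, filename]
    for part in parts:
        # reject empty components and components that would add extra levels
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return posixpath.join("/Volumes", *parts)


# made-up names and version label, for illustration only
print(volume_pin_path("main", "default", "pins",
                      "mtcars", "20241008T145519Z-abc12", "mtcars.rds"))
```

Keeping each version in its own folder means older versions stay readable next to newer ones, and access to all of them is still governed by the Volume's Unity Catalog permissions.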

Through the databricks-sdk you can manage (read/write) Volumes: https://docs.databricks.com/en/dev-tools/sdk-python.html#files-in-volumes

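A pseudo example for reading in a directory of .yaml files:

from databricks.sdk import WorkspaceClient

# placeholders: fill in your own catalog, database (schema), and volume names
name = f"{catalog}.{database}.{volume_name}"
wc = WorkspaceClient()
volume = wc.volumes.read(name)
# `spark` is an existing SparkSession, e.g. in a Databricks notebook
spark.read.text(
    paths=volume.storage_location,  # directory
    wholetext=True,  # each file is read as a single row
    pathGlobFilter="*.yaml",
)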

A fuller example in R via {reticulate}:

library(reticulate)

# https://github.com/databrickslabs/databricks-sdk-r
# package for using the REST API in R
library(databricks) 

client <- DatabricksClient()

# this can also be accomplished through reticulate
volume <- 
  client |> 
  read_volume("{catalog}.{database}.{volume}")

location <- volume$storage_location

# grab a cluster that gives me access to more data
clusters <- 
  client |> 
  list_clusters() |> 
  subset(startsWith(creator_user_name, "jbarbone")) 

cluster <- clusters$cluster_id[1]  

# requires the Python packages databricks-sdk and databricks-connect
db <- import("databricks.sdk")
connect <- import("databricks.connect")

w <- db$WorkspaceClient()
volume <- w$volumes$read("{catalog}.{database}.{volume}")
location <- volume$storage_location
pyspark <- import("pyspark")
spark <- 
    connect $
    DatabricksSession $
    builder $
    profile("DEFAULT") $
    clusterId(cluster) $ 
    getOrCreate()

content <- spark$read$text(
  paths = location,
  wholetext = TRUE,
  pathGlobFilter = "*.yaml"
)

Through the REST API you can find the storage location (https://docs.databricks.com/api/workspace/volumes/list), but you may need to access the Spark context to read the data. The R example seems to work fine for me, and locally I just have a ~/.databrickscfg file with some profiles ("DEFAULT" used above). The cluster/user permissions would need to be set by the user, so what {pins} needs may be pretty minimal?

* Ports host and token functions
* Starts board_databricks
* Starts pin list
* Removes pipes
* Centralizes content retrieval, adds pin_exists
* Starts pin_meta function
* Simplifies arguments, renames token and host functions
* First pass at pin_store
* Fixes hashed subfolder
* Adds pin_versions
* Improvements to cache path
* Adds download file helper
* Adds download step to meta, fixes cache discovery
* Adds pin_fetch
* Adds pin_delete, and all supporting functions
* Assigns proper file rights to local cache
* Passes all tests
* Adds board_deparse
* Adds required_pkgs
* Starts testing
* Avoids failing when checking contents of a folder, needed for prefix
* Passes all tests
* Fixes a pkg check finding
* Starts documentation
* Completes documentation
* Adds NEWS item
* Small fix to documentation, adds some instructions to tests
* Properly handles lack of host or token
* Fixes pkg_down failure, addresses oldrel-4 issue by reverting to older mode of `purrr`, and improves some tests
* More consistent filename
* More consistent filename
* Edits to docs
* Redocument
* Update R/board_databricks.R Co-authored-by: Julia Silge <[email protected]>
* Update R/board_databricks.R Co-authored-by: Julia Silge <[email protected]>
* Removes reference to bucket, and re-documents
* Little more doc refining
* Try running tests in CI
* Update snapshot

---------

Co-authored-by: Julia Silge <[email protected]>

juliasilge · 2024-10-03T01:17:35Z

Many thanks to @edgararuiz for his work implementing board_databricks() in #841! If any of you are interested in taking this new functionality for a spin by installing the dev version of pins, we would be very interested in hearing how it goes for you.

viv-analytics · 2024-10-08T05:41:49Z

Great work, thank you @juliasilge and @edgararuiz for bringing this feature to the package.

I've tested this using version 1.4.0 for various file types and sizes, all worked smoothly.

jmbarbone · 2024-10-08T14:55:19Z

I've run into a small issue within my workflow. We track multiple Databricks profiles, with separate hosts and tokens, configured in a .databrickscfg file. The envvars are then only set temporarily, so they don't persist in the session. This leads to a less-than-intuitive error when trying to write:

some_custom_board_function <- function(profile, ...) {
  # simplified
  cfg <- ini::read.ini("~/.databrickscfg")
  profile <- mark::match_param(profile, names(cfg))
  config <- cfg[[profile]]
  withr::local_envvar(c(
    DATABRICKS_HOST = config$host, 
    DATABRICKS_TOKEN = config$token
  ))
  # the envvars set above are restored when this function returns
  board <- board_databricks(...)
  board
}

board <- some_custom_board_function()
pin_write(board, mtcars)
#> Using `name = 'mtcars'`
#> Guessing `type = 'rds'`
#> Error in `purrr::keep()`:
#> ℹ In index: 1.
#> ℹ With name: message.
#> Caused by error in `.x$is_directory`:
#> ! $ operator is invalid for atomic vectors

We have custom board creating and pin writing wrappers; everything still works with a few extra steps.
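One way to read the failure above, assuming the board looks up its credentials when it is used rather than when it is created (the thread does not confirm how board_databricks() behaves internally): the environment variables are gone by the time the write happens. The sketch below reproduces that pattern in Python with the standard library only. `temp_env`, `LazyBoard`, `EagerBoard`, and `make_board` are hypothetical names, not part of any Databricks or pins API.

```python
import os
from contextlib import contextmanager


@contextmanager
def temp_env(**values):
    """Set environment variables inside the block, then restore the old state."""
    saved = {key: os.environ.get(key) for key in values}
    os.environ.update(values)
    try:
        yield
    finally:
        for key, old in saved.items():
            if old is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = old


class LazyBoard:
    """Hypothetical board that reads the token each time it is used."""

    def token(self):
        return os.environ.get("DATABRICKS_TOKEN")


class EagerBoard:
    """Hypothetical board that captures the token when it is created."""

    def __init__(self):
        self._token = os.environ.get("DATABRICKS_TOKEN")

    def token(self):
        return self._token


def make_board(board_class, token):
    # the variable only exists while the board is being constructed
    with temp_env(DATABRICKS_TOKEN=token):
        return board_class()


# assuming DATABRICKS_TOKEN is not otherwise set in the session:
# the lazy board no longer sees the token, the eager board kept its own copy
print(make_board(LazyBoard, "abc").token())
print(make_board(EagerBoard, "abc").token())
```

Under that assumption, either capturing the credentials at creation time or passing them explicitly would sidestep the problem with temporarily scoped variables.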

It would be nice to look for these settings in .databrickscfg when they are neither explicitly passed nor set as envvars. This is currently how Databricks handles authentication across tools: Default methods for client unified authentication.
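The lookup order suggested above (explicit arguments first, then environment variables, then a profile in .databrickscfg) could be sketched as follows. This is a minimal illustration in Python using only the standard library. The function name `resolve_databricks_auth` and its parameters are hypothetical, and it covers only `host` and `token`, not the other methods Databricks unified authentication supports.

```python
import configparser
import os
from pathlib import Path


def resolve_databricks_auth(host=None, token=None, profile="DEFAULT",
                            cfg_path="~/.databrickscfg", env=None):
    """Return (host, token): explicit args > environment variables > config profile."""
    env = os.environ if env is None else env
    host = host or env.get("DATABRICKS_HOST")
    token = token or env.get("DATABRICKS_TOKEN")
    if host and token:
        return host, token

    # interpolation=None so that a '%' inside a token is read literally
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(Path(cfg_path).expanduser())  # a missing file is skipped silently

    # configparser does not list [DEFAULT] as a regular section
    if profile == parser.default_section:
        section = parser.defaults()
    elif parser.has_section(profile):
        section = parser[profile]
    else:
        section = {}

    host = host or section.get("host")
    token = token or section.get("token")
    if not (host and token):
        raise ValueError(
            f"no Databricks host/token found in arguments, environment, "
            f"or profile {profile!r} of {cfg_path}"
        )
    return host, token
```

A board constructor could call a helper like this once and keep the result, which would also let it fail early with a clear message when no credentials can be found.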

Still very excited for this update and looking forward to using it more.

juliasilge · 2024-10-08T16:55:05Z

Thanks for sharing that @jmbarbone! I have opened #848 to track additional auth needs for Databricks; please add any additional thoughts there. 🙌

juliasilge added the "feature" (a feature request or enhancement) and "boards 🧑‍🏫" labels Sep 19, 2024

edgararuiz mentioned this issue Sep 30, 2024

Adds Databricks integration (#839) #841

Merged

juliasilge closed this as completed in #841 Oct 3, 2024

juliasilge mentioned this issue Oct 8, 2024

Support additional auth options for pins on Databricks #848

Open
