Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[r] Add iterator classes #1274

Merged
merged 21 commits into from
May 26, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions apis/r/NAMESPACE
Original file line number Diff line number Diff line change
Expand Up @@ -44,8 +44,11 @@ export(SOMAOpen)
export(SOMASparseNDArray)
export(SOMASparseNDArrayCreate)
export(SOMASparseNDArrayOpen)
export(SOMASparseNDArrayRead)
export(SOMATileDBContext)
export(ScalarMap)
export(SparseReadIter)
export(TableReadIter)
export(TileDBArray)
export(TileDBCreateOptions)
export(TileDBGroup)
Expand Down
62 changes: 62 additions & 0 deletions apis/r/R/ReadIter.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,62 @@
#' SOMA Read Iterator Base class
#'
#' Class that allows for read iteration of SOMA reads.

ReadIter <- R6::R6Class(
classname = "ReadIter",

public = list(

#' @description Create (lifecycle: experimental)
#' @param sr soma read pointer
initialize = function(sr) {
private$soma_reader_pointer <- sr
},

#' @description Check if iterated read is complete or not. (lifecycle: experimental)
#' @return logical
read_complete = function() {
if (is.null(private$soma_reader_pointer)) {
TRUE
} else {
sr_complete(private$soma_reader_pointer)
}
},

#' @description Read the next chunk of an iterated read. (lifecycle: experimental).
#' If read is complete, retunrs `NULL` and raises warning.
#' @return \code{NULL} or one of arrow::\link[arrow]{Table}, \link{matrixZeroBasedView}
read_next = function() {
if (is.null(private$soma_reader_pointer)) {
NULL
} else {
if (self$read_complete()) {
warning("Iteration complete, returning NULL")
NULL
} else {
rl <- sr_next(private$soma_reader_pointer)
return(private$soma_reader_transform(rl))
}
}
},

#' @description Concatenate remainder of iterator
# to be refined in derived classes
concat = function() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the very belated review -- @aaronwolen asked me to help review, and I'm happy to -- I believe my review is speaking for the both of us

There are definitely multiple ways to go here. However, given that on the Python side we have a single read, options of .tables, etc., and concat -- e.g. sdf.read().tables().concat() -- we should do the same here. It won't be ideal for everyone but it will work, it will parallel the Python implementation, and we can add some keystroke-saving syntactic-sugar functions on top of these later if we want.

Let's get this concat going, and get rid of the iterated argument to read. Then read will always return an iterator (low complexity), and tables() et al. will transform an iterator of one type to an iterator of another, and concat will have the single job of doing concatenation.

Note that apis/r/R/utils-readerTransformers.R on this PR already has reader-transformers, so (I think and hope) it should be easy to connect them as methods on the iterator objects.

Likewise, concat() exists in non-stub form elsewhere on this PR and should be able to be connected as a method on the iterator objects.

If read_next remains, it needs to be remain as a helper method, not as public API.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with everything and I will try to get this done tonight.

If read_next remains, it needs to be remain as a helper method, not as public API.

@johnkerl I'm not sure that the suggestion here is? Unfortunately in R there's no generic next() function, so I'm not sure how we could give users the ability to iterate themselves unless we:

  1. make read_next a public method
  2. OR we create function next() that calls internally read_next

The second one is a little less intuitive imo.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read will always return an iterator

I'm not sure I understand the benefit to this. While R does have package that implements iterators, it isn't widely used (or at all in the single-cell world). Because iterators don't exist in R, all this does is add complexity to R users. While I understand the desire to make the R API to act like the Python API, we do need to keep in mind that R is not Python and there are things that Python handles natively that just don't work as well in R

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the key is here:

we can add some keystroke-saving syntactic-sugar functions on top of these later if we want

and imo it's not a "if we want" but more a "we need to". I think we can keep the low-level functionality in the R6 level as proposed and then we provide wrapper functions like as_arrow_table(soma_obj) see here, as(soma_obj, "dgCMatrix"), as.data.frame(soma_obj), etc

Copy link
Member Author

@pablo-gar pablo-gar May 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johnkerl I have addressed your comments with the exception of:

If read_next remains, it needs to be remain as a helper method, not as public API.

I also updated the usage section of top-level comment. The branch still needs to merge main but wanted to get your early thoughts on this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pablo-gar looking this morning -- thank you! :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pablo-gar sorry to confuse.

This looks good to me -- thank you!! :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! I will update the docs and merge main into the PR branch

.NotYetImplemented()
}

),

private = list(

# Internal 'external pointer' object used for iterated reads
soma_reader_pointer = NULL,

# to be refined in derived classes
soma_reader_transform = function(x) {
.NotYetImplemented()
}

)
)
47 changes: 0 additions & 47 deletions apis/r/R/SOMAArrayBase.R
Original file line number Diff line number Diff line change
Expand Up @@ -17,36 +17,6 @@ SOMAArrayBase <- R6::R6Class(
}
),

public = list(

#' @description Check if iterated read is complete or not. (lifecycle: experimental)
read_complete = function() {
private$check_open_for_read()

if (is.null(private$soma_reader_pointer)) {
TRUE
} else {
sr_complete(private$soma_reader_pointer)
}
},

#' @description Read the next chunk of an iterated read. (lifecycle: experimental)
read_next = function() {
private$check_open_for_read()

if (is.null(private$soma_reader_pointer)) {
NULL
} else {
if (sr_complete(private$soma_reader_pointer)) {
invisible(NULL)
} else {
rl <- sr_next(private$soma_reader_pointer)
private$soma_reader_transform(rl)
}
}
}

),

private = list(

Expand All @@ -64,23 +34,6 @@ SOMAArrayBase <- R6::R6Class(
meta[[SOMA_OBJECT_TYPE_METADATA_KEY]] <- self$class()
meta[[SOMA_ENCODING_VERSION_METADATA_KEY]] <- SOMA_ENCODING_VERSION
self$set_metadata(meta)
},

# Internal 'external pointer' object used for iterated reads
soma_reader_pointer = NULL,

# Instantiate soma_reader_pointer with a soma_array_reader object
soma_reader_setup = function() {
private$soma_reader_pointer <- sr_setup(
self$uri,
config=as.character(tiledb::config(self$tiledbsoma_ctx$context()))
)
},

## to be refined in derived classes
soma_reader_transform = function(x) {
x
}

)
)
47 changes: 14 additions & 33 deletions apis/r/R/SOMADataFrame.R
Original file line number Diff line number Diff line change
Expand Up @@ -177,8 +177,8 @@ SOMADataFrame <- R6::R6Class(
#' @template param-result-order
#' @param iterated Option boolean indicated whether data is read in call (when
#' `FALSE`, the default value) or in several iterated steps.
#' @param log_level Optional logging level with default value of `"auto"`.
#' @return An [`arrow::Table`].
#' @param log_level Optional logging level with default value of `"warn"`.
#' @return arrow::\link[arrow]{Table} or \link{TableReadIter}
johnkerl marked this conversation as resolved.
Show resolved Hide resolved
read = function(coords = NULL,
column_names = NULL,
value_filter = NULL,
Expand All @@ -191,7 +191,7 @@ SOMADataFrame <- R6::R6Class(
result_order <- match_query_layout(result_order)
uri <- self$uri
arr <- self$object # need array (schema) to properly parse query condition

## if unnamed set names
if (!is.null(coords)) {
if (!is.list(coords))
Expand All @@ -215,30 +215,16 @@ SOMADataFrame <- R6::R6Class(
value_filter <- parsed@ptr
}

if (isFALSE(iterated)) {
cfg <- as.character(tiledb::config(self$tiledbsoma_ctx$context()))
rl <- soma_array_reader(uri = uri,
colnames = column_names, # NULL dealt with by soma_array_reader()
qc = value_filter, # idem
dim_points = coords, # idem
loglevel = log_level, # idem
config = cfg)
private$soma_reader_transform(rl)
} else {
## should we error if this isn't null?
if (!is.null(private$soma_reader_pointer)) {
warning("Reader pointer not null, skipping")
rl <- NULL
} else {
private$soma_reader_setup()
rl <- list()
while (!self$read_complete()) {
## soma_reader_transform() applied inside read_next()
rl <- c(rl, self$read_next())
}
}
invisible(rl)
}
cfg <- as.character(tiledb::config(self$tiledbsoma_ctx$context()))
sr <- sr_setup(uri = self$uri,
config = cfg,
colnames = column_names,
qc = value_filter,
dim_points = coords,
loglevel = log_level)

TableReadIter$new(sr)

}

),
Expand Down Expand Up @@ -275,12 +261,7 @@ SOMADataFrame <- R6::R6Class(
}

schema
},

## refined from base class
soma_reader_transform = function(x) {
arrow::as_arrow_table(arrow::RecordBatch$import_from_c(x[[1]], x[[2]]))
}

)
)
126 changes: 42 additions & 84 deletions apis/r/R/SOMADenseNDArray.R
Original file line number Diff line number Diff line change
Expand Up @@ -114,80 +114,44 @@ SOMADenseNDArray <- R6::R6Class(
#' length equal to the number of values to read. If `NULL`, all values are
#' read. List elements can be named when specifying a subset of dimensions.
#' @template param-result-order
#' @param iterated Option boolean indicated whether data is read in call (when
#' `FALSE`, the default value) or in several iterated steps.
#' @param log_level Optional logging level with default value of `"auto"`.
#' @param log_level Optional logging level with default value of `"warn"`.
#' @return An [`arrow::Table`].
read_arrow_table = function(
coords = NULL,
result_order = "auto",
iterated = FALSE,
log_level = "auto"
log_level = "warn"
) {
private$check_open_for_read()

uri <- self$uri

result_order <- map_query_layout(match_query_layout(result_order))

if (!is.null(coords)) {
## ensure coords is a named list, use to select dim points
stopifnot("'coords' must be a list" = is.list(coords),
"'coords' must be a list of vectors or integer64" =
all(vapply_lgl(coords, is_vector_or_int64)),
"'coords' if unnamed must have length of dim names, else if named names must match dim names" =
(is.null(names(coords)) && length(coords) == length(self$dimnames())) ||
(!is.null(names(coords)) && all(names(coords) %in% self$dimnames()))
)

## if unnamed (and test for length has passed in previous statement) set names
if (is.null(names(coords))) names(coords) <- self$dimnames()

## convert integer to integer64 to match dimension type
coords <- lapply(coords, function(x) if (inherits(x, "integer")) bit64::as.integer64(x) else x)
coords <- private$convert_coords(coords)
johnkerl marked this conversation as resolved.
Show resolved Hide resolved
}

private$dense_matrix <- FALSE

if (isFALSE(iterated)) {
cfg <- as.character(tiledb::config(self$tiledbsoma_ctx$context()))
rl <- soma_array_reader(uri = uri,
dim_points = coords, # NULL dealt with by soma_array_reader()
result_order = result_order,
loglevel = log_level, # idem
config = cfg)
private$soma_reader_transform(rl)
} else {
## should we error if this isn't null?
if (!is.null(self$soma_reader_pointer)) {
warning("pointer not null, skipping")
rl <- NULL
} else {
private$soma_reader_setup()
rl <- list()
while (!self$read_complete()) {
## soma_reader_transform() applied inside read_next()
rl <- c(rl, self$read_next())
}
}
invisible(rl)
}
cfg <- as.character(tiledb::config(self$tiledbsoma_ctx$context()))
rl <- soma_array_reader(uri = uri,
dim_points = coords, # NULL dealt with by soma_array_reader()
result_order = result_order,
loglevel = log_level, # idem
config = cfg)

soma_array_to_arrow_table(rl)
},

#' @description Read as a dense matrix (lifecycle: experimental)
#' @param coords Optional `list` of integer vectors, one for each dimension, with a
#' length equal to the number of values to read. If `NULL`, all values are
#' read. List elements can be named when specifying a subset of dimensions.
#' @template param-result-order
#' @param iterated Option boolean indicated whether data is read in call (when
#' `FALSE`, the default value) or in several iterated steps.
#' @param log_level Optional logging level with default value of `"auto"`.
#' @param log_level Optional logging level with default value of `"warn"`.
#' @return A `matrix` object
read_dense_matrix = function(
coords = NULL,
result_order = "ROW_MAJOR",
iterated = FALSE,
log_level = "auto"
log_level = "warn"
) {
private$check_open_for_read()

Expand All @@ -198,23 +162,12 @@ SOMADenseNDArray <- R6::R6Class(
all.equal(c("soma_dim_0", "soma_dim_1"), names(dims)),
"Array must contain column 'soma_data'" = all.equal("soma_data", names(attr)))

if (isFALSE(iterated)) {
tbl <- self$read_arrow_table(coords = coords, result_order = result_order, log_level = log_level)
m <- matrix(as.numeric(tbl$GetColumnByName("soma_data")),
nrow = length(unique(as.numeric(tbl$GetColumnByName("soma_dim_0")))),
ncol = length(unique(as.numeric(tbl$GetColumnByName("soma_dim_1")))),
byrow = result_order == "ROW_MAJOR")
} else {
## should we error if this isn't null?
if (!is.null(self$soma_reader_pointer)) {
warning("pointer not null, skipping")
} else {
private$soma_reader_setup()
private$dense_matrix <- TRUE
private$result_order <- result_order
}
invisible(NULL)
}
tbl <- self$read_arrow_table(coords = coords, result_order = result_order, log_level = log_level)
m <- matrix(as.numeric(tbl$GetColumnByName("soma_data")),
nrow = length(unique(as.numeric(tbl$GetColumnByName("soma_dim_0")))),
ncol = length(unique(as.numeric(tbl$GetColumnByName("soma_dim_1")))),
byrow = result_order == "ROW_MAJOR")

},

#' @description Write matrix data to the array. (lifecycle: experimental)
Expand Down Expand Up @@ -251,22 +204,27 @@ SOMADenseNDArray <- R6::R6Class(
),

private = list(

## refined from base class
soma_reader_transform = function(x) {
tbl <- arrow::as_arrow_table(arrow::RecordBatch$import_from_c(x[[1]], x[[2]]))
if (isTRUE(private$dense_matrix)) {
m <- matrix(as.numeric(tbl$GetColumnByName("soma_data")),
nrow = length(unique(as.numeric(tbl$GetColumnByName("soma_dim_0")))),
ncol = length(unique(as.numeric(tbl$GetColumnByName("soma_dim_1")))),
byrow = private$result_order == "ROW_MAJOR")
} else {
tbl
}
},

## internal state variable for dense matrix vs arrow table return
dense_matrix = TRUE,
result_order = "ROW_MAJOR"

#' @description Converts a list of vectors corresponding to coords to a
#' format acceptable for sr_setup and soma_array_reader
convert_coords = function(coords) {

## ensure coords is a named list, use to select dim points
stopifnot("'coords' must be a list" = is.list(coords),
"'coords' must be a list of vectors or integer64" =
all(vapply_lgl(coords, is_vector_or_int64)),
"'coords' if unnamed must have length of dim names, else if named names must match dim names" =
(is.null(names(coords)) && length(coords) == length(self$dimnames())) ||
(!is.null(names(coords)) && all(names(coords) %in% self$dimnames()))
)

## if unnamed (and test for length has passed in previous statement) set names
if (is.null(names(coords))) names(coords) <- self$dimnames()

## convert integer to integer64 to match dimension type
coords <- lapply(coords, function(x) if (inherits(x, "integer")) bit64::as.integer64(x) else x)

coords
}
)
)
Loading