
[AIR] Success criteria for numpy narrow waist #28346

Closed
17 of 21 tasks
jiaodong opened this issue Sep 7, 2022 · 17 comments

jiaodong (Member) commented Sep 7, 2022

E2E Successful User Journey

For a sample computer vision workload for which we already have examples / benchmarks, where training and batch prediction are currently Pandas based:

  • Pure numpy preprocessor UDF for image preprocessing (see the sketch after this list)
  • Pure numpy predictor subclassing for object detection
  • Verify e2e throughput and UX with existing ML platform users (p13n and vision)
    • Minimal time spent in the training loop with higher GPU saturation
  • Updated docs and benchmarks to compare UX + performance
  • Observability and tracking of data conversions / time spent
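
A hedged sketch of the pure-numpy preprocessor path this targets (assuming the BatchMapper numpy batch format from the milestones below; the UDF and column name are illustrative, not a committed API):

```python
import numpy as np
from ray.data.preprocessors import BatchMapper

def preprocess_images(batch: dict) -> dict:
    # With a numpy batch format, `batch` maps column names to np.ndarray,
    # so the UDF never touches pandas.
    batch["image"] = batch["image"].astype(np.float32) / 255.0
    return batch

preprocessor = BatchMapper(preprocess_images, batch_format="numpy")
```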

Milestones and current progress

  • [ETA: EOW 9/23] BatchMapper and Preprocessor change with numpy format
  • [ETA: EOW 9/30] BatchPredictor uses numpy first path
  • [ETA: EOW 10/7] Benchmarking and perf comparison vs. pandas
  • [ETA: EOW 10/21] Cherrypicks - Dogfooding, documentation and updates of existing pandas UDF workload

Relevant docs:

https://docs.google.com/document/d/1Er4df6OK78Mt_wiM80WNZr1SKXUEs_D-WNpKyiFwmd8/edit

Data:

Processor:

Training:

  • iter_tf_batches / iter_torch_batches can yield numpy batches of the right batch size without conversion in user code / the training loop (see the sketch below)
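
A hedged illustration of that contract (toy dataset; assumes `ds` is an already-preprocessed Ray Dataset):

```python
import ray

ds = ray.data.from_items([{"image": [0.1, 0.2, 0.3], "label": 0}] * 256)

# iter_torch_batches should hand the training loop correctly sized
# torch tensors directly, with no per-batch conversion in user code.
for batch in ds.iter_torch_batches(batch_size=64):
    images, labels = batch["image"], batch["label"]
    # ... forward / backward pass ...
```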

Prediction:

@clarkzinzow @amogkam I put down a tentative list of success criteria, please take another look and edit as needed!

@jiaodong jiaodong self-assigned this Sep 7, 2022
@jiaodong jiaodong added the air label Sep 7, 2022
@jiaodong jiaodong changed the title [AIR] Create issue list of numpy narrow waist in 2.1 [AIR] Success criteria for numpy narrow waist Sep 7, 2022
amogkam (Contributor) commented Sep 7, 2022

Nice, overall LGTM!

Just adding a couple more things we discussed in our meeting:

  • Remove _transform_arrow from Preprocessors and Predictors. Arrow should not be user- or developer-facing.
  • For preprocessors to work with single numpy arrays, column names should no longer be a required argument. For multi-column data, omitting column names applies the preprocessor to all columns (see the sketch below).
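
A minimal sketch of that proposed contract (names are illustrative, not a final API):

```python
from typing import Dict, Union
import numpy as np

NumpyBatch = Union[np.ndarray, Dict[str, np.ndarray]]

def scale_batch(batch: NumpyBatch) -> NumpyBatch:
    if isinstance(batch, np.ndarray):
        # Single-array data: no column names required.
        return batch / 255.0
    # Multi-column data with no columns specified: apply to every column.
    return {col: arr / 255.0 for col, arr in batch.items()}
```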

Also might be out of scope, but in general we should make sure that preprocessors are using Datasets correctly (and taking advantage of stage fusion when applicable).

Also is there anything for ragged tensors?

jiaodong (Member, Author) commented Sep 7, 2022

Sounds good, this is basically what we discussed last time :) I'm assuming ragged tensors will be supported for this, given that the PR is likely to be merged sooner than the rest; it's also captured as part of TensorArray, IIUC.

I think making the Preprocessor class a first-class UX is important, but it's a bit orthogonal to this. We should track it, though.

@amogkam we didn't have enough time to finish the discussion on the predictor's Any output and our stance on a Python object fallback yet; we should update it here.

clarkzinzow (Contributor) commented Sep 7, 2022

@jiaodong All of the data items should reflect the current state (each of which has been validated), so I don't know if including them as part of the success criteria makes sense, but I might be missing something. I think all of the success criteria should be at the AIR level, around the preprocessors, training, and prediction?

amogkam (Contributor) commented Sep 7, 2022

What are the risks with falling back to Python object for Predictors? Off the top of my head I can think of:

  1. Users' custom predict functions fall back to Python objects even when it's not necessary and the output type could be represented via Dict[str, np.ndarray]. Basically, it might hand the user a footgun.
  2. Falling back to Python objects means the Predictor cannot be used as a Preprocessor and chained with other preprocessors.

The benefit is that we can support the "long tail" of models with unique output types.

clarkzinzow (Contributor):

+1 to having the fallback, i.e. a call_model output type of Union[TensorType, Dict[str, TensorType], List[Any]], where List[Any] is the fallback. This should work quite nicely for Datasets, which will convert List[Any] into a well-structured (e.g. tabular) dataset if possible.

@amogkam how common is risk (1)? I feel like that wouldn't be very common. As for (2), we could add List[Any] as a supported type to DataBatchType, but I'm not sure how that _transform_* delegation would work... could we require that the user reformat List[Any] to a preprocessor batch type, but only when using a predictor preprocessor?
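
For concreteness, a hedged sketch of that signature (TensorType stands in for the framework tensor type; np.ndarray just keeps the sketch self-contained):

```python
from typing import Any, Dict, List, Union
import numpy as np

TensorType = np.ndarray  # placeholder for torch.Tensor / tf.Tensor / np.ndarray

def call_model(inputs: TensorType) -> Union[TensorType, Dict[str, TensorType], List[Any]]:
    # Well-structured outputs use TensorType / Dict[str, TensorType];
    # List[Any] is the escape hatch for odd model outputs, e.g. one
    # arbitrary Python object (here a dict) per input row.
    return [{"boxes": np.zeros((0, 4)), "labels": np.zeros((0,))} for _ in inputs]
```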

amogkam (Contributor) commented Sep 7, 2022

Actually, would np.ndarray with object dtype work as a fallback instead of List[Any]?

```python
def call_model(...) -> Union[TensorType, Dict[str, TensorType], np.ndarray, Dict[str, np.ndarray]]
```

Then the outputs can still go through the _transform_numpy codepath of preprocessors. And we still keep the current contract that the output type is the same as the input type (which Serve currently depends on).
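
To illustrate (a hedged sketch, not Predictor code): an object-dtype ndarray can hold one arbitrary Python value per row, so variable-shaped detection outputs still fit the numpy contract:

```python
import numpy as np

# Two detection outputs with different shapes, stored row-wise in an
# object-dtype ndarray rather than a List[Any].
outputs = np.empty(2, dtype=object)
outputs[0] = {"boxes": np.zeros((3, 4)), "labels": np.array([1, 2, 3])}
outputs[1] = {"boxes": np.zeros((1, 4)), "labels": np.array([7])}
```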

jiaodong (Member, Author) commented Sep 7, 2022

> I think all of the success criteria should be at the AIR level, around the preprocessors, training, and prediction?

@clarkzinzow so I think there are both perf and API aspects to it. On the API side, most items should be AIR-only; for example, we had a working PR for the numpy narrow waist in the batch predictor before 2.0, using batch_format="numpy" with corresponding preprocessor / predictor changes, which suggests we already have the pieces we need for a numpy path (though I didn't vet all possible cases, so we may discover more corners later on).

I'm putting in a Data section for perf to ensure the numpy path doesn't trigger more unintended perf costs via concat / copy. Do you have other items in mind for 2.1 regarding the numpy narrow waist?

jiaodong (Member, Author) commented Sep 7, 2022

> Falling back to Python objects means the Predictor cannot be used as a Preprocessor and chained with other preprocessors.

@amogkam I think you mentioned chaining predictors / using a predictor as a preprocessor in the meeting earlier this week; wouldn't falling back to Python objects break what you planned, or is that not an issue?

Otherwise I vote for adding the Python object fallback too; it makes supporting arbitrary predictor output easier.

amogkam (Contributor) commented Sep 7, 2022

@jiaodong it will break, but if we have np.ndarray with object dtype as the fallback instead of List[Any], I think that should be fine.

clarkzinzow (Contributor) commented Sep 8, 2022

> I'm putting in a Data section for perf to ensure the numpy path doesn't trigger more unintended perf costs via concat / copy.

@jiaodong The NumPy narrow waist will result in strictly fewer data copies than the Pandas or Arrow narrow waist (that's why it was added in the first place), so I don't think we need that as a call-out, but we can have a single validation item for it if you'd like. Since there are no known Datasets code changes needed, how about we axe the Data items and have an AIR-level validation item ensuring that our NumPy narrow waist isn't resulting in unnecessary copies anywhere?

There's also a P1 for adding a new "zero_copy" batch format that allows preprocessors/predictors to delegate to format-specific optimized implementations of their transformation functions, which would be a Datasets-level item, but I'm guessing that we're considering that out-of-scope for now?

jiaodong (Member, Author) commented Sep 8, 2022

Sounds good! I trust your judgment on this given all the context you have on the data side. How about we keep the Data items as placeholders for now and cross off the items in this list that are already the status quo; Amog and I will start adding the numpy path in AIR / the predictor, and if we discover any corner cases (which I assume we certainly will), we can sync with the data team on the best course of action.

clarkzinzow (Contributor):

> Actually, would np.ndarray with object dtype work as a fallback instead of List[Any]?

@amogkam Hmm interesting... I feel like that Union[TensorType, Dict[str, TensorType], np.ndarray, Dict[str, np.ndarray]] return type is weirder/messier than Union[TensorType, Dict[str, TensorType], List[Any]], since the former makes you ask "why do we only take ML tensors but can return ML tensors or NumPy ndarrays?", and I think we can easily document the latter List[Any] as an escape hatch for odd model outputs.

> And we still keep the current contract that the output type is the same as the input type (which Serve currently depends on).

Would it be possible for Serve to not impose this restriction? It doesn't really make sense for model outputs, which may be in an arbitrary format, and I think we'll also want to support BatchMapper being able to change the batch format (we have Datasets users that rely on this behavior of .map_batches()).
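
A hedged sketch of the format-changing .map_batches() behavior referenced here (toy data):

```python
import pandas as pd
import ray

ds = ray.data.from_pandas(pd.DataFrame({"x": [1.0, 2.0, 3.0]}))

# A pandas batch goes in; a dict-of-ndarrays batch comes out.
def to_numpy(batch: pd.DataFrame):
    return {"x": batch["x"].to_numpy() * 2}

out = ds.map_batches(to_numpy, batch_format="pandas")
```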

richardliaw (Contributor):

cc @simon-mo on the serve comment

amogkam (Contributor) commented Sep 8, 2022

> the former makes you ask "why do we only take ML tensors but can return ML tensors or NumPy ndarrays?"

Isn't this also the case with List[Any]?

> I think we can easily document the latter List[Any] as an escape hatch for odd model outputs.

Can't we also document np.ndarray as an escape hatch?

My reasoning is that np.ndarray handles these 2 requirements while List[Any] does not.

  1. Keeping the contract that the input type is the same as the output type. If we change this behavior, then it's a significant API change we should discuss.
  2. It allows predictors to be used as preprocessors, which unlocks a nice feature that a few users have brought up.

clarkzinzow (Contributor) commented Sep 8, 2022

@amogkam I was thinking that we could make List[Any] our canonical AIR escape hatch since it already serves that purpose within Datasets, via the simple block representation. This has the added benefit that Datasets will try to convert it into a well-formatted dataset (e.g. a tabular dataset if it's a list of dicts, as is the case for SSD300_VGG16 object detection outputs), only falling back to the simple block representation if that fails.
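
A hedged illustration of that conversion (toy detection outputs; assumes ragged columns are handled by the tensor extension):

```python
import numpy as np
import ray

# SSD300_VGG16-style per-image outputs as a list of dicts; Datasets can
# lift such a list into a tabular dataset with one column per key.
detections = [
    {"labels": np.array([1, 5]), "scores": np.array([0.9, 0.7])},
    {"labels": np.array([3]), "scores": np.array([0.8])},
]
ds = ray.data.from_items(detections)
```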

There's also the issue that an np.ndarray batch will be interpreted by Datasets as a tensor dataset, which it will try to represent as a single-column table via our tensor extension; that would fail for many common use cases (e.g. the object detection output), since the tensor extension can't hold arbitrary Python objects such as dictionaries. The only way around that would be using a structured ndarray, which isn't currently supported by the tensor extension (yet).

Your two points still stand, although I think we'll want to tweak (1) anyway, since I think it's too strict for advanced BatchMapper use cases, and things are already getting awkward with model outputs; what's the big benefit of (1), other than Serve relying on it? I feel like we should be able to support going from e.g. a Pandas DataFrame batch to a NumPy ndarray batch if the preprocessor so desires.

As for (2), won't users need to munge their model outputs to have them fit their downstream preprocessors anyway? E.g. if a user is going to take the object detections from SSD300_VGG16 and apply some preprocessor, they're going to need to turn those model outputs into a format that the preprocessor understands, which the np.ndarray escape hatch wouldn't give them automatically (the List[Any] escape hatch would, but that's kind of beside the point). Since the user will probably have to coerce their model outputs to be compatible with downstream preprocessors anyway, it seems like we'd want to provide a hook for them to do that in the PredictorPreprocessor, so whether we have the np.ndarray escape hatch or List[Any] escape hatch doesn't seem to matter as much for (2). Am I missing something there?

jiaodong (Member, Author) commented Sep 9, 2022

Synced with Clark and Amog offline. It looks like, for all the options we considered, adding an opaque object type leads to further implications for the AIR narrow waist and for the ability to chain predictors / preprocessors. Given that the only use case we know of is object detection, where ragged tensor support has already greatly improved the UX by eliminating the need for padding, we decided to punt on this for now unless clear new use cases show up that require a generic "simple" block format -- which would need a proper dedicated design.

amogkam (Contributor) commented Dec 9, 2022

@jiaodong can this be closed out?

@krfricke krfricke added triage Needs triage (eg: priority, bug/not-bug, and owning component) and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 18, 2023