
[AIR] Success criteria for numpy narrow waist #28346

Closed
17 of 21 tasks
jiaodong opened this issue Sep 7, 2022 · 17 comments

jiaodong (Member) commented Sep 7, 2022

E2E Successful User Journey

For a sample computer vision workload for which we already have examples / benchmarks, where training and batch prediction are currently Pandas based:

  • Pure numpy preprocessor UDF for image preprocessing (see the sketch after this list)
  • Pure numpy predictor subclassing for object detection
  • Verify e2e throughput and UX with existing ML platform users (p13n and vision)
    • Minimal time spent in the training loop with higher GPU saturation
  • Updated docs and benchmarks to compare UX + performance
  • Observability and tracking of data conversions / time spent
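
A hedged sketch of the pure-numpy preprocessor path this targets (assuming the BatchMapper numpy batch format from the milestones below; the UDF and column name are illustrative, not a committed API):

```python
import numpy as np
from ray.data.preprocessors import BatchMapper

def preprocess_images(batch: dict) -> dict:
    # With a numpy batch format, `batch` maps column names to np.ndarray,
    # so the UDF never touches pandas.
    batch["image"] = batch["image"].astype(np.float32) / 255.0
    return batch

preprocessor = BatchMapper(preprocess_images, batch_format="numpy")
```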

Milestones and current progress

  • [ETA: EOW 9/23] BatchMapper and Preprocessor change with numpy format
  • [ETA: EOW 9/30] BatchPredictor uses numpy first path
  • [ETA: EOW 10/7] Benchmarking and perf comparison vs. pandas
  • [ETA: EOW 10/21] Cherrypicks - Dogfooding, documentation and updates of existing pandas UDF workload

Relevant docs:

https://docs.google.com/document/d/1Er4df6OK78Mt_wiM80WNZr1SKXUEs_D-WNpKyiFwmd8/edit

Data:

Processor:

Training:

  • iter_tf_batches / iter_torch_batches can yield numpy batches of the right batch size without conversion in user code / the training loop (see the sketch below)
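
A hedged illustration of that contract (toy dataset; assumes `ds` is an already-preprocessed Ray Dataset):

```python
import ray

ds = ray.data.from_items([{"image": [0.1, 0.2, 0.3], "label": 0}] * 256)

# iter_torch_batches should hand the training loop correctly sized
# torch tensors directly, with no per-batch conversion in user code.
for batch in ds.iter_torch_batches(batch_size=64):
    images, labels = batch["image"], batch["label"]
    # ... forward / backward pass ...
```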

Prediction:

@clarkzinzow @amogkam I put down a tentative list of success criteria, please take another look and edit as needed!

@jiaodong jiaodong self-assigned this Sep 7, 2022
@jiaodong jiaodong added the air label Sep 7, 2022
@jiaodong jiaodong changed the title [AIR] Create issue list of numpy narrow waist in 2.1 [AIR] Success criteria for numpy narrow waist Sep 7, 2022
amogkam (Contributor) commented Sep 7, 2022

Nice, overall LGTM!

Just adding a couple more things we discussed in our meeting:

  • Remove _transform_arrow from Preprocessors and Predictors. Arrow should not be user- or developer-facing.
  • For preprocessors to work with single numpy arrays, column names should no longer be a required argument. For multi-column data, omitting column names applies the preprocessor to all columns (see the sketch below).
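
A minimal sketch of that proposed contract (names are illustrative, not a final API):

```python
from typing import Dict, Union
import numpy as np

NumpyBatch = Union[np.ndarray, Dict[str, np.ndarray]]

def scale_batch(batch: NumpyBatch) -> NumpyBatch:
    if isinstance(batch, np.ndarray):
        # Single-array data: no column names required.
        return batch / 255.0
    # Multi-column data with no columns specified: apply to every column.
    return {col: arr / 255.0 for col, arr in batch.items()}
```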

Also might be out of scope, but in general we should make sure that preprocessors are using Datasets correctly (and taking advantage of stage fusion when applicable).

Also is there anything for ragged tensors?

jiaodong (Member, Author) commented Sep 7, 2022

Sounds good, this is basically what we discussed last time :) I'm assuming ragged tensors will be supported for this, given that the PR is likely to be merged sooner than the rest; it's also captured as part of TensorArray, IIUC.

I think making the Preprocessor class a first-class UX is important, but it's a bit orthogonal to this. We should track it, though.

@amogkam we didn't have enough time to finish the discussion on the predictor's Any output and our stance on a Python object fallback yet; we should update it here.

clarkzinzow (Contributor) commented Sep 7, 2022

@jiaodong All of the data items should reflect the current state (each of which has been validated), so I don't know if including them as part of the success criteria makes sense, but I might be missing something. I think all of the success criteria should be at the AIR level, around the preprocessors, training, and prediction?

amogkam (Contributor) commented Sep 7, 2022

What are the risks with falling back to Python object for Predictors? Off the top of my head I can think of:

  1. Users' custom predict functions fall back to Python objects even when it's not necessary and the output type could be represented via Dict[str, np.ndarray]. Basically, it might hand the user a footgun.
  2. Falling back to Python objects means the Predictor cannot be used as a Preprocessor and chained with other preprocessors.

The benefit is that we can support the "long tail" of models with unique output types.

clarkzinzow (Contributor):

+1 to having the fallback, i.e. a call_model output type of Union[TensorType, Dict[str, TensorType], List[Any]], where List[Any] is the fallback. This should work quite nicely for Datasets, which will convert List[Any] into a well-structured (e.g. tabular) dataset if possible.

@amogkam how common is risk (1)? I feel like that wouldn't be very common. As for (2), we could add List[Any] as a supported type to DataBatchType, but I'm not sure how that _transform_* delegation would work... could we require that the user reformat List[Any] to a preprocessor batch type, but only when using a predictor preprocessor?
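
For concreteness, a hedged sketch of that signature (TensorType stands in for the framework tensor type; np.ndarray just keeps the sketch self-contained):

```python
from typing import Any, Dict, List, Union
import numpy as np

TensorType = np.ndarray  # placeholder for torch.Tensor / tf.Tensor / np.ndarray

def call_model(inputs: TensorType) -> Union[TensorType, Dict[str, TensorType], List[Any]]:
    # Well-structured outputs use TensorType / Dict[str, TensorType];
    # List[Any] is the escape hatch for odd model outputs, e.g. one
    # arbitrary Python object (here a dict) per input row.
    return [{"boxes": np.zeros((0, 4)), "labels": np.zeros((0,))} for _ in inputs]
```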

amogkam (Contributor) commented Sep 7, 2022

Actually, would np.ndarray with object dtype work as a fallback instead of List[Any]?

```python
def call_model(...) -> Union[TensorType, Dict[str, TensorType], np.ndarray, Dict[str, np.ndarray]]
```

Then the outputs can still go through the _transform_numpy codepath of preprocessors. And we still keep the current contract that the output type is the same as the input type (which Serve currently depends on).
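
To illustrate (a hedged sketch, not Predictor code): an object-dtype ndarray can hold one arbitrary Python value per row, so variable-shaped detection outputs still fit the numpy contract:

```python
import numpy as np

# Two detection outputs with different shapes, stored row-wise in an
# object-dtype ndarray rather than a List[Any].
outputs = np.empty(2, dtype=object)
outputs[0] = {"boxes": np.zeros((3, 4)), "labels": np.array([1, 2, 3])}
outputs[1] = {"boxes": np.zeros((1, 4)), "labels": np.array([7])}
```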

jiaodong (Member, Author) commented Sep 7, 2022

> I think all of the success criteria should be at the AIR level, around the preprocessors, training, and prediction?

@clarkzinzow so I think there are both perf and API aspects to it. On the API side, most items should be AIR-only; for example, we had a working PR for the numpy narrow waist in the batch predictor before 2.0, using batch_format="numpy" with corresponding preprocessor / predictor changes, which suggests we already have the pieces we need for a numpy path (though I didn't vet all possible cases, so we may discover more corners later on).

I'm putting in a Data section for perf to ensure the numpy path doesn't trigger more unintended perf costs via concat / copy. Do you have other items in mind for 2.1 regarding the numpy narrow waist?

jiaodong (Member, Author) commented Sep 7, 2022

> Falling back to Python objects means the Predictor cannot be used as a Preprocessor and chained with other preprocessors.

@amogkam I think you mentioned chaining predictors / using a predictor as a preprocessor in the meeting earlier this week; wouldn't falling back to Python objects break what you planned, or is that not an issue?

Otherwise I vote for adding the Python object fallback too; it makes supporting arbitrary predictor output easier.

amogkam (Contributor) commented Sep 7, 2022

@jiaodong it will break, but if we have np.ndarray with object dtype as the fallback instead of List[Any], I think that should be fine.

clarkzinzow (Contributor) commented Sep 8, 2022

> I'm putting in a Data section for perf to ensure the numpy path doesn't trigger more unintended perf costs via concat / copy.

@jiaodong The NumPy narrow waist will result in strictly fewer data copies than the Pandas or Arrow narrow waist (that's why it was added in the first place), so I don't think we need that as a call-out, but we can have a single validation item for it if you'd like. Since there are no known Datasets code changes needed, how about we axe the Data items and have an AIR-level validation item ensuring that our NumPy narrow waist isn't resulting in unnecessary copies anywhere?

There's also a P1 for adding a new "zero_copy" batch format that allows preprocessors/predictors to delegate to format-specific optimized implementations of their transformation functions, which would be a Datasets-level item, but I'm guessing that we're considering that out-of-scope for now?

jiaodong (Member, Author) commented Sep 8, 2022

Sounds good! I trust your judgment on this given all the context you have on the data side. How about we keep the Data items as placeholders for now and cross off the items in this list that are already the status quo; Amog and I will start adding the numpy path in AIR / the predictor, and if we discover any corner cases (which I assume we certainly will), we can sync with the data team on the best course of action.

clarkzinzow (Contributor):

> Actually, would np.ndarray with object dtype work as a fallback instead of List[Any]?

@amogkam Hmm interesting... I feel like that Union[TensorType, Dict[str, TensorType], np.ndarray, Dict[str, np.ndarray]] return type is weirder/messier than Union[TensorType, Dict[str, TensorType], List[Any]], since the former makes you ask "why do we only take ML tensors but can return ML tensors or NumPy ndarrays?", and I think we can easily document the latter List[Any] as an escape hatch for odd model outputs.

> And we still keep the current contract that the output type is the same as the input type (which Serve currently depends on).

Would it be possible for Serve to not impose this restriction? It doesn't really make sense for model outputs, which may be in an arbitrary format, and I think we'll also want to support BatchMapper being able to change the batch format (we have Datasets users that rely on this behavior of .map_batches()).
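
A hedged sketch of the format-changing .map_batches() behavior referenced here (toy data):

```python
import pandas as pd
import ray

ds = ray.data.from_pandas(pd.DataFrame({"x": [1.0, 2.0, 3.0]}))

# A pandas batch goes in; a dict-of-ndarrays batch comes out.
def to_numpy(batch: pd.DataFrame):
    return {"x": batch["x"].to_numpy() * 2}

out = ds.map_batches(to_numpy, batch_format="pandas")
```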

richardliaw (Contributor):

cc @simon-mo on the serve comment

amogkam (Contributor) commented Sep 8, 2022

> the former makes you ask "why do we only take ML tensors but can return ML tensors or NumPy ndarrays?"

Isn't this also the case with List[Any]?

> I think we can easily document the latter List[Any] as an escape hatch for odd model outputs.

Can't we also document np.ndarray as an escape hatch?

My reasoning is that np.ndarray handles these 2 requirements while List[Any] does not.

  1. Keeping the contract that the input type is the same as the output type. If we change this behavior, then it's a significant API change we should discuss.
  2. It allows predictors to be used as preprocessors, which unlocks a nice feature that a few users have brought up.

clarkzinzow (Contributor) commented Sep 8, 2022

@amogkam I was thinking that we could make List[Any] our canonical AIR escape hatch since it already serves that purpose within Datasets, via the simple block representation. This has the added benefit that Datasets will try to convert it into a well-formatted dataset (e.g. a tabular dataset if it's a list of dicts, as is the case for SSD300_VGG16 object detection outputs), only falling back to the simple block representation if that fails.
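
A hedged illustration of that conversion (toy detection outputs; assumes ragged columns are handled by the tensor extension):

```python
import numpy as np
import ray

# SSD300_VGG16-style per-image outputs as a list of dicts; Datasets can
# lift such a list into a tabular dataset with one column per key.
detections = [
    {"labels": np.array([1, 5]), "scores": np.array([0.9, 0.7])},
    {"labels": np.array([3]), "scores": np.array([0.8])},
]
ds = ray.data.from_items(detections)
```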

There's also the issue that an np.ndarray batch will be interpreted by Datasets as a tensor dataset, which it will try to represent as a single-column table via our tensor extension; that would fail for many common use cases (e.g. the object detection output), since the tensor extension can't hold arbitrary Python objects such as dictionaries. The only way around that would be using a structured ndarray, which isn't currently supported by the tensor extension (yet).

Your two points still stand, although I think we'll want to tweak (1) anyway, since I think it's too strict for advanced BatchMapper use cases, and things are already getting awkward with model outputs; what's the big benefit of (1), other than Serve relying on it? I feel like we should be able to support going from e.g. a Pandas DataFrame batch to a NumPy ndarray batch if the preprocessor so desires.

As for (2), won't users need to munge their model outputs to have them fit their downstream preprocessors anyway? E.g. if a user is going to take the object detections from SSD300_VGG16 and apply some preprocessor, they're going to need to turn those model outputs into a format that the preprocessor understands, which the np.ndarray escape hatch wouldn't give them automatically (the List[Any] escape hatch would, but that's kind of beside the point). Since the user will probably have to coerce their model outputs to be compatible with downstream preprocessors anyway, it seems like we'd want to provide a hook for them to do that in the PredictorPreprocessor, so whether we have the np.ndarray escape hatch or List[Any] escape hatch doesn't seem to matter as much for (2). Am I missing something there?

jiaodong (Member, Author) commented Sep 9, 2022

Synced with Clark and Amog offline. It looks like, for all the options we considered, adding an opaque object type leads to further implications for the AIR narrow waist and for the ability to chain predictors / preprocessors. Given that the only use case we know of is object detection, where ragged tensor support has already greatly improved the UX by eliminating the need for padding, we decided to punt on this for now unless clear new use cases show up that require a generic "simple" block format -- which would need a proper dedicated design.

amogkam (Contributor) commented Dec 9, 2022

@jiaodong can this be closed out?

@krfricke krfricke added triage Needs triage (eg: priority, bug/not-bug, and owning component) and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 18, 2023