huge data set #972

Closed
bytearchive opened this issue Oct 19, 2019 · 1 comment

Comments

@bytearchive

bytearchive commented Oct 19, 2019

I have a dataset 1-10 TB in size. I want to train and serve a classification model on it.

  1. Can Seldon support this size of data?
  2. Can Seldon support data lineage and snapshotting of data if a k8s pod crashes?
  3. Is there any Seldon-approved data management framework?
  4. Is any company using Seldon in production?

Is there any Seldon blog post or whitepaper explaining non-hello-world projects with large datasets? Can it be done using scikit-learn or PySpark on Seldon?

@ukclivecox
Contributor

Sorry for the late reply.

I have a dataset 1-10 TB in size. I want to train and serve a classification model on it.

1. Can Seldon support this size of data?

We are focused on real-time APIs. If you mean a 1-10 TB prediction dataset: we do allow batch requests, but you would need to decide whether a purely non-API solution is better for this case, since you would have to handle splitting the requests into batches, retrying failures, and so on yourself. Tools such as Spark or Flink are designed for this and handle errors gracefully.
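To give a feel for what "splitting the requests into batches" involves on the client side, here is a minimal sketch rather than an official Seldon client. The helper names (`batched`, `build_payload`, `predict_in_batches`), the default batch size, and the retry count are all hypothetical. The payload shape assumes the `{"data": {"ndarray": ...}}` JSON body used by Seldon's REST prediction protocol, and the actual HTTP call is left to a `send` function that you supply.

```python
import json
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Sequence

Row = Sequence[float]


def batched(rows: Iterable[Row], batch_size: int) -> Iterator[List[Row]]:
    """Lazily yield lists of at most batch_size rows, so the full
    dataset never has to fit in memory."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def build_payload(batch: List[Row]) -> str:
    """Serialize one batch as JSON. The {"data": {"ndarray": ...}} shape
    is an assumption about the deployed model's REST protocol."""
    return json.dumps({"data": {"ndarray": [list(row) for row in batch]}})


def predict_in_batches(
    rows: Iterable[Row],
    send: Callable[[str], object],
    batch_size: int = 1000,   # hypothetical default; tune for your model
    max_retries: int = 3,     # hypothetical default
) -> List[object]:
    """Send each batch through send(), retrying a failed batch up to
    max_retries times and re-raising after the last failed attempt."""
    results: List[object] = []
    for batch in batched(rows, batch_size):
        payload = build_payload(batch)
        for attempt in range(1, max_retries + 1):
            try:
                results.append(send(payload))
                break
            except Exception:
                if attempt == max_retries:
                    raise
    return results
```

Here `send` would typically wrap an HTTP POST to your deployment's prediction endpoint. Even this small sketch has to deal with batching and retries, and it still lacks checkpointing, parallelism, and backpressure, which is the kind of work Spark or Flink take care of for you.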

2. Can Seldon support data lineage and snapshotting of data if a k8s pod crashes?

Not directly. Integrating with tools such as Pachyderm, which focus on this, is on our roadmap.

3. Is there any Seldon-approved data management framework?

I would suggest looking at the Kubeflow ecosystem, of which we are a part.

4. Is any company using Seldon in production?

We know of a range of companies running Seldon in production. We will make some of them public in the future, with their consent, or they can reply here.

Is there any Seldon blog post or whitepaper explaining non-hello-world projects with large datasets? Can it be done using scikit-learn or PySpark on Seldon?

You can combine Seldon and Spark; see this session from last week's Spark AI Summit: https://databricks.com/session_eu19/migrating-apache-spark-ml-jobs-to-spark-tensorflow-on-kubeflow

However, to restate: offline batch is less of a focus for us at present.
