
Revamp docstrings and mkdocs with fixes for new API #427

Merged · merged 1 commit on Oct 20, 2023
7 changes: 4 additions & 3 deletions .pre-commit-config.yaml
@@ -34,12 +34,13 @@ repos:
- id: debug-statements
- id: detect-private-key # check for private keys
- id: end-of-file-fixer
exclude: ^tests/test_data|^.data
exclude: ^tests/test_data|^docs|^examples/notebook/
- id: pretty-format-json
exclude: ^tests/test_data|^.data|^docs|^examples/notebook/
exclude: ^tests/test_data|^docs|^examples/notebook/
- id: trailing-whitespace
exclude: ^tests/test_data|^docs|^examples/notebook/
- id: check-added-large-files
args: ['--maxkb=100']
exclude: ^tests/test_data|^.data
exclude: ^tests/test_data
- id: requirements-txt-fixer
files: requirements.*\.txt$
158 changes: 123 additions & 35 deletions README.md
@@ -1,10 +1,4 @@
<h1 align="center" style="font-size:64px;font-weight: bold;font-color:black;">⚡️ NOS</h1>
<h4 align="center">Nitrous Oxide System for your AI Infrastructure
<p style="font-weight: normal;">
Optimize, serve and auto-scale Pytorch models in production<br>
</p>
</h4>

<center><img src="./docs/assets/nos-header.svg" alt="Nitrous Oxide for your AI Infrastructure"></center>

<p align="center">
<a href="https://nos.run/"><b>Website</b></a> | <a href="https://docs.nos.run/"><b>Docs</b></a> | <a href="https://discord.gg/QAGgvTuvgg"><b>Discord</b></a>
@@ -30,12 +24,11 @@ Optimize, serve and auto-scale Pytorch models in production<br>
</p>


**NOS (`torch-nos`)** is a fast and flexible Pytorch inference server, specifically designed for optimizing and running lightning-fast inference of popular foundational AI models.

*Optimizing and serving models for production AI inference is still difficult, often leading to notoriously expensive cloud bills and often underutilized GPUs. That’s why we’re building **NOS** - a fast inference server for modern AI workloads. With a few lines of code, developers can optimize, serve, and auto-scale Pytorch model inference without having to deal with the complexities of ML compilers, HW-accelerators, or distributed inference. Simply put, NOS allows AI teams to cut inference costs up to **10x**, speeding up development time and time-to-market.*
> *Optimizing and serving models for production AI inference is still difficult, often leading to notoriously expensive cloud bills and often underutilized GPUs. That’s why we’re building **NOS** - a fast and flexible inference server for modern AI workloads. With a few lines of code, developers can optimize, serve, and auto-scale Pytorch model inference without having to deal with the complexities of ML compilers, HW-accelerators, or distributed inference. Simply put, NOS allows AI teams to cut inference costs up to **10x**, speeding up development time and time-to-market.*

## ⚡️ What is NOS?
**NOS (`torch-nos`)** is a fast and flexible Pytorch inference server, specifically designed for optimizing and running lightning-fast inference of popular foundational AI models.

## 👩‍💻 What is NOS?
- 👩‍💻 **Easy-to-use**: Built for [PyTorch](https://pytorch.org/) and designed to optimize, serve and auto-scale Pytorch models in production without compromising on developer experience.
- 🥷 **Flexible**: Run and serve several foundational AI models ([Stable Diffusion](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0), [CLIP](https://huggingface.co/openai/clip-vit-base-patch32), [Whisper](https://huggingface.co/openai/whisper-large-v2)) in a single place.
- 🔌 **Pluggable:** Plug your front-end to NOS with out-of-the-box high-performance gRPC/REST APIs, avoiding all kinds of ML model deployment hassles.
@@ -52,21 +45,113 @@ Optimize, serve and auto-scale Pytorch models in production<br>

Get started with the full NOS server by installing via pip:

```shell
$ conda create -n nos-py38 python=3.8
$ conda activate nos-py38
$ conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
$ pip install torch-nos[server]
```

If you want to simply use a light-weight NOS client and run inference on your local machine, you can install the client-only package:

```shell
$ conda create -n nos-py38 python=3.8
$ conda activate nos-py38
$ pip install torch-nos
```

## 🔥 Quickstart / Show me the code

### Image Generation as-a-Service


<table>
<tr>
<td> REST API </td>
<td> gRPC API ⚡ </td>
</tr>
<tr>
<td>

```bash
curl \
-X POST http://localhost:8000/infer \
-H 'Content-Type: application/json' \
-d '{
"model_id": "stabilityai/stable-diffusion-xl-base-1-0",
"inputs": {
"prompts": ["fox jumped over the moon"],
"width": 1024,
"height": 1024,
"num_images": 1
}
}'
```

</td>
<td>

```python
from nos.client import Client

client = Client("[::]:50051")

sdxl = client.Module("stabilityai/stable-diffusion-xl-base-1-0")
image, = sdxl(prompts=["fox jumped over the moon"],
width=1024, height=1024, num_images=1)
```

</td>
</tr>
</table>
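
For reference, the same REST request can be sent from Python; a minimal sketch using the `requests` library, with the endpoint and payload copied from the curl example above (the response schema is not documented here, so it is left as raw JSON):

```python
import requests

# Mirror the curl example: POST an SDXL inference request to the NOS REST endpoint.
response = requests.post(
    "http://localhost:8000/infer",
    json={
        "model_id": "stabilityai/stable-diffusion-xl-base-1-0",
        "inputs": {
            "prompts": ["fox jumped over the moon"],
            "width": 1024,
            "height": 1024,
            "num_images": 1,
        },
    },
)
response.raise_for_status()
result = response.json()  # inspect the returned payload; the schema depends on the server version
```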

### Text & Image Embedding-as-a-Service (CLIP-as-a-Service)

<table>
<tr>
<td> REST API </td>
<td> gRPC API ⚡ </td>
</tr>
<tr>
<td>

```bash
curl \
-X POST http://localhost:8000/infer \
-H 'Content-Type: application/json' \
-d '{
"model_id": "openai/clip",
"method": "encode_text",
"inputs": {
"texts": ["fox jumped over the moon"]
}
}'
```

</td>
<td>

```python
from nos.client import Client

client = Client("[::]:50051")

clip = client.Module("openai/clip")
txt_vec = clip.encode_text(text=["fox jumped over the moon"])
```
</td>
</tr>
</table>
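
The embeddings returned by `encode_text` can be compared directly. A minimal sketch, assuming the call returns one array-like vector per input string (an assumption, not spelled out in this README):

```python
import numpy as np
from nos.client import Client

client = Client("[::]:50051")
clip = client.Module("openai/clip")

# Assumption: encode_text returns one embedding per input string.
vecs = np.asarray(clip.encode_text(text=["a fox", "fox jumped over the moon"]))

# Cosine similarity between the two text embeddings.
a, b = vecs[0], vecs[1]
similarity = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {similarity:.3f}")
```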


## 📂 Repository Structure

```bash
├── docker # Dockerfile for CPU/GPU servers
├── docs # mkdocs documentation
├── examples # example guides, jupyter notebooks, demos
├── makefiles # makefiles for building/testing
├── nos
│   ├── cli # CLI (hub, system)
│   ├── client # gRPC / REST client
│   ├── common # common utilities
@@ -84,30 +169,33 @@ pip install torch-nos[server]

## 📚 Documentation

- 📚 [NOS Documentation](https://docs.nos.run/)
- 🔥 [Quickstart](https://docs.nos.run/docs/quickstart.html)
- 🧠 [Models](https://docs.nos.run/docs/models/supported-models.html)
- ⚡️ **Concepts**: [NOS Architecture](https://docs.nos.run/docs/concepts/architecture-overview.html)
- 🤖 **Demos**: [Building a Discord Image Generation Bot](https://docs.nos.run/docs/demos/discord-bot.html), [Video Search Demo](https://docs.nos.run/docs/demos/video-search.html)
- [NOS Documentation](https://docs.nos.run/)
- [Quickstart](https://docs.nos.run/docs/quickstart.html)
- [Models](https://docs.nos.run/docs/models/supported-models.html)
- **Concepts**: [NOS Architecture](https://docs.nos.run/docs/concepts/architecture-overview.html)
- **Demos**: [Building a Discord Image Generation Bot](https://docs.nos.run/docs/demos/discord-bot.html), [Video Search Demo](https://docs.nos.run/docs/demos/video-search.html)

## 🛣 Roadmap

### HW / Cloud Support

- [✅] **Commodity GPUs**
- [✅] NVIDIA GPUs (20XX, 30XX, 40XX)
- [ ] AMD GPUs (RX 6000 series)
- [✅] **Cloud GPUs**
- [ ] NVIDIA (T4, A100, H100)
- [-] AMD (MI200, MI250)
- [🟡] **Cloud ASICs**
- [🟡] [AWS Inferentia](https://aws.amazon.com/machine-learning/inferentia/) [Inf1](https://aws.amazon.com/ec2/instance-types/inf1/)/[Inf2](https://aws.amazon.com/ec2/instance-types/inf2/)
- [ ] Google TPU
- [ ] TBD (Graphcore, Habana Gaudi, Tenstorrent)
- [✅] **Cloud Service Providers** (via [SkyPilot](https://github.com/skypilot-org/skypilot))
- [✅] **Big 3:** AWS, GCP, Azure
- [x] **Commodity GPUs**
- [x] NVIDIA GPUs (20XX, 30XX, 40XX)
- [ ] AMD GPUs (RX 7000)

- [x] **Cloud GPUs**
- [x] NVIDIA (T4, A100, H100)
- [ ] AMD (MI200, MI250)

- [x] **Cloud Service Providers** (via [SkyPilot](https://github.com/skypilot-org/skypilot))
- [x] **Big 3:** AWS, GCP, Azure
- [ ] **Opinionated Cloud:** Lambda Labs, RunPod, etc

- [ ] **Cloud ASICs**
- [ ] [AWS Inferentia](https://aws.amazon.com/machine-learning/inferentia/) ([Inf1](https://aws.amazon.com/ec2/instance-types/inf1/)/[Inf2](https://aws.amazon.com/ec2/instance-types/inf2/))
- [ ] Google TPU
- [ ] TBD (Graphcore, Habana Gaudi, Tenstorrent)


## 📄 License

@@ -120,4 +208,4 @@ We welcome contributions! Please see our [contributing guide](CONTRIBUTING.md) f
### 🔗 Quick Links

* 💬 Send us an email at [[email protected]](mailto:[email protected]) or join our [Discord](https://discord.gg/QAGgvTuvgg) for help.
* 📣 Follow us on [Twitter](https://twitter.com/autonomi\_ai), and [LinkedIn](https://www.linkedin.com/company/autonomi-ai) to keep up-to-date on our products.
1 change: 1 addition & 0 deletions docs/api/common/exceptions.md
@@ -0,0 +1 @@
::: nos.common.exceptions
1 change: 1 addition & 0 deletions docs/api/common/metaclass.md
@@ -0,0 +1 @@
::: nos.common.metaclass
1 change: 1 addition & 0 deletions docs/api/common/shm.md
@@ -0,0 +1 @@
::: nos.common.shm
1 change: 1 addition & 0 deletions docs/api/common/system.md
@@ -0,0 +1 @@
::: nos.common.system
2 changes: 1 addition & 1 deletion docs/api/common/tasks.md
@@ -1 +1 @@
## ::: nos.common.tasks
::: nos.common.tasks
2 changes: 1 addition & 1 deletion docs/api/common/types.md
@@ -1 +1 @@
## ::: nos.common.types
::: nos.common.types
4 changes: 4 additions & 0 deletions docs/api/server.md
@@ -2,6 +2,8 @@

## Docker Runtime

The Docker runtime provides a `docker-py` based interface for running containerized inference workloads. It supports starting and stopping containers, querying container information, and running containers programmatically with hardware support for accelerators such as GPUs and ASICs.
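
As a rough illustration of what this runtime wraps (plain `docker-py`, not the NOS API itself), here is a minimal sketch that launches a GPU-enabled container; the image and command are placeholders:

```python
import docker
from docker.types import DeviceRequest

client = docker.from_env()

# Request all available GPUs for the container (equivalent to `docker run --gpus all`).
gpu_request = DeviceRequest(count=-1, capabilities=[["gpu"]])

container = client.containers.run(
    "nvidia/cuda:11.8.0-base-ubuntu22.04",  # placeholder image
    command="nvidia-smi",                   # placeholder command
    device_requests=[gpu_request],
    detach=True,
)
container.wait()
print(container.logs().decode())
container.remove()
```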

::: nos.server._docker.DeviceRequest

::: nos.server._docker.DockerRuntime
@@ -13,6 +15,8 @@

## InferenceService

The `InferenceService`, together with the `InferenceServiceImpl` gRPC service implementation, provides a fully wrapped inference service over gRPC/HTTP2. `InferenceServiceImpl` exposes the relevant APIs such as `ListModels()`, `GetModelInfo()` and, crucially, `Run()`, and executes each inference request via the `InferenceService` class. The `InferenceService` manages models via the `ModelManager` and sets up the necessary execution backend via `RayExecutor`. It is also responsible for managing shared-memory regions (if requested) for high-performance inference running locally on a single machine.
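
For orientation, a minimal client-side sketch of this path, using the `Client` API from the quickstart (the server-side routing through `Run()` is an internal detail and not exposed to the caller):

```python
from nos.client import Client

# Connect to the gRPC server (address taken from the quickstart examples).
client = Client("[::]:50051")

# Each module method call below is serialized into an inference request and
# executed server-side by InferenceServiceImpl / InferenceService.
clip = client.Module("openai/clip")
embedding = clip.encode_text(text=["fox jumped over the moon"])
```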

::: nos.server._service.ModelHandle

::: nos.server._service.InferenceServiceImpl
1 change: 1 addition & 0 deletions docs/assets/nos-header.svg
26 changes: 7 additions & 19 deletions docs/concepts/architecture-overview.md
@@ -1,17 +1,19 @@
!!!note ""
**NOS (`torch-nos`)** is a fast and flexible Pytorch inference server, specifically designed for optimizing and running lightning-fast inference of popular foundational AI models.

*Optimizing and serving models for production AI inference is still difficult, often leading to notoriously expensive cloud bills and often underutilized GPUs. That’s why we’re building **NOS** - a fast inference server for modern AI workloads. With a few lines of code, developers can optimize, serve, and auto-scale Pytorch model inference without having to deal with the complexities of ML compilers, HW-accelerators, or distributed inference. Simply put, NOS allows AI teams to cut inference costs up to **10x**, speeding up development time and time-to-market.*
Optimizing and serving models for production AI inference is still difficult, often leading to notoriously expensive cloud bills and often underutilized GPUs. That’s why we’re building **NOS** - a fast inference server for modern AI workloads. With a few lines of code, developers can optimize, serve, and auto-scale Pytorch model inference without having to deal with the complexities of ML compilers, HW-accelerators, or distributed inference. Simply put, NOS allows AI teams to cut inference costs up to **10x**, speeding up development time and time-to-market.

## ⚡️ Core Features

- 🔋 **Batteries-included:** Server-side inference with all the necessary batteries (model hub, batching/parallelization, fast I/O, model-caching, model resource management via ModelManager, model optimization via ModelSpec)
- 📡 **Client-Server architecture:** Multiple lightweight clients can leverage powerful server-side inference workers running remotely without the bloat of GPU libraries, runtimes or 3rd-party libraries.
- 💪 **High device-utilization:** With better model management, clients won’t have to wait on model inference and can instead take advantage of the full GPU resources available. Model multiplexing and efficient bin-packing of models allow us to leverage the resources optimally (without requiring additional user input).
- 📦 **Custom model support:** NOS allows you to easily add support for custom models with a few lines of code. We provide a simple API to register custom models with NOS, and allow you to optimize and run models on any hardware (NVIDIA, custom ASICs) without any model compilation or runtime management (see [example](../guides/running-custom-models.md)).
- **Concurrency**: NOS is built to efficiently serve AI models, ensuring concurrency, parallelism, optimal memory management, and automatic garbage collection. It is particularly well-suited for multi-modal AI applications.
- **Concurrency**: NOS is built to efficiently serve AI models, ensuring concurrency, parallelism, optimal memory management, and automatic garbage collection. It is particularly well-suited for multi-modal AI applications.
![Unified NOS Inference Server](./assets/arch-how-nos-works.png)
## 🏗️ Architecture

![NOS Architecture](./assets/arch-client-server.png)

## 🛠️ Core Components

@@ -28,20 +30,6 @@ NOS is built to efficiently serve AI models, ensuring concurrency, parallelism,
- Submit tasks to specific methods of the model.
- Garbage collect models when they are evicted.

Model manager for serving and running multiple models with Ray actors.
- [**`InferenceService`**](#inferenceservice): Ray-executor based inference service that executes inference requests.
- [**`InferenceRuntimeService`**](#inferenceruntimeservice): Dockerized runtime environment for server-side remote execution

![NOS Architecture](./assets/arch-client-server.png)


## Overview

NOS is built to efficiently serve AI models, ensuring concurrency, parallelism, optimal memory management, and automatic garbage collection. It is particularly well-suited for multi-modal AI applications.

Key Features:
- Concurrency support for multiple models running simultaneously.
- Parallelism support with multiple replicas of the same model.
- Optimal memory management, dynamically adjusting to model memory consumption.
- Automatic garbage collection to prevent Out-Of-Memory issues.
- [**`InferenceService`**](../api/server.md#inferenceservice): Ray-executor based inference service that executes inference requests.

- [**`InferenceRuntimeService`**](../api/server.md#inferenceserviceruntime): Dockerized runtime environment for server-side remote execution
4 changes: 2 additions & 2 deletions docs/concepts/what-is-nos.md
@@ -1,7 +1,7 @@
!!!note ""
**NOS (`torch-nos`)** is a fast and flexible Pytorch inference server, specifically designed for optimizing and running lightning-fast inference of popular foundational AI models.

*Optimizing and serving models for production AI inference is still difficult, often leading to notoriously expensive cloud bills and often underutilized GPUs. That’s why we’re building **NOS** - a fast inference server for modern AI workloads. With a few lines of code, developers can optimize, serve, and auto-scale Pytorch model inference without having to deal with the complexities of ML compilers, HW-accelerators, or distributed inference. Simply put, NOS allows AI teams to cut inference costs up to **10x**, speeding up development time and time-to-market.*
Optimizing and serving models for production AI inference is still difficult, often leading to notoriously expensive cloud bills and often underutilized GPUs. That’s why we’re building **NOS** - a fast inference server for modern AI workloads. With a few lines of code, developers can optimize, serve, and auto-scale Pytorch model inference without having to deal with the complexities of ML compilers, HW-accelerators, or distributed inference. Simply put, NOS allows AI teams to cut inference costs up to **10x**, speeding up development time and time-to-market.

## ⚡️ Core Features

@@ -31,7 +31,7 @@ to a different execution flow with substantial gaps between dev and prod environ
- They may raise privacy concerns as user data must go outside the wire for inferencing on vendor servers
- Stability issues when using poorly maintained third party APIs

We built **NOS** because we wanted an inference server combining best practices in model serving, distributed inference, and auto-scaling, all in a single, easy-to-use containerized system that you can simply run with a few lines of Python.

## 📦 Model Containers
