
Revamp docstrings and mkdocs with fixes for new API #427

Merged · merged 1 commit on Oct 20, 2023
7 changes: 4 additions & 3 deletions .pre-commit-config.yaml
@@ -34,12 +34,13 @@ repos:
- id: debug-statements
- id: detect-private-key # check for private keys
- id: end-of-file-fixer
exclude: ^tests/test_data|^.data
exclude: ^tests/test_data|^docs|^examples/notebook/
- id: pretty-format-json
exclude: ^tests/test_data|^.data|^docs|^examples/notebook/
exclude: ^tests/test_data|^docs|^examples/notebook/
- id: trailing-whitespace
exclude: ^tests/test_data|^docs|^examples/notebook/
- id: check-added-large-files
args: ['--maxkb=100']
exclude: ^tests/test_data|^.data
exclude: ^tests/test_data
- id: requirements-txt-fixer
files: requirements.*\.txt$
158 changes: 123 additions & 35 deletions README.md
@@ -1,10 +1,4 @@
<h1 align="center" style="font-size:64px;font-weight: bold;font-color:black;">⚡️ NOS</h1>
<h4 align="center">Nitrous Oxide System for your AI Infrastructure
<p style="font-weight: normal;">
Optimize, serve and auto-scale Pytorch models in production<br>
</p>
</h4>

<center><img src="./docs/assets/nos-header.svg" alt="Nitrous Oxide for your AI Infrastructure"></center>

<p align="center">
<a href="https://nos.run/"><b>Website</b></a> | <a href="https://docs.nos.run/"><b>Docs</b></a> | <a href="https://discord.gg/QAGgvTuvgg"><b>Discord</b></a>
@@ -30,12 +24,11 @@ Optimize, serve and auto-scale Pytorch models in production<br>
</p>


**NOS (`torch-nos`)** is a fast and flexible Pytorch inference server, specifically designed for optimizing and running lightning-fast inference of popular foundational AI models.

*Optimizing and serving models for production AI inference is still difficult, often leading to notoriously expensive cloud bills and often underutilized GPUs. That’s why we’re building **NOS** - a fast inference server for modern AI workloads. With a few lines of code, developers can optimize, serve, and auto-scale Pytorch model inference without having to deal with the complexities of ML compilers, HW-accelerators, or distributed inference. Simply put, NOS allows AI teams to cut inference costs up to **10x**, speeding up development time and time-to-market.*
> *Optimizing and serving models for production AI inference is still difficult, often leading to notoriously expensive cloud bills and often underutilized GPUs. That’s why we’re building **NOS** - a fast and flexible inference server for modern AI workloads. With a few lines of code, developers can optimize, serve, and auto-scale Pytorch model inference without having to deal with the complexities of ML compilers, HW-accelerators, or distributed inference. Simply put, NOS allows AI teams to cut inference costs up to **10x**, speeding up development time and time-to-market.*

## ⚡️ What is NOS?
**NOS (`torch-nos`)** is a fast and flexible Pytorch inference server, specifically designed for optimizing and running lightning-fast inference of popular foundational AI models.

## 👩‍💻 What is NOS?
- 👩‍💻 **Easy-to-use**: Built for [PyTorch](https://pytorch.org/) and designed to optimize, serve and auto-scale Pytorch models in production without compromising on developer experience.
- 🥷 **Flexible**: Run and serve several foundational AI models ([Stable Diffusion](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0), [CLIP](https://huggingface.co/openai/clip-vit-base-patch32), [Whisper](https://huggingface.co/openai/whisper-large-v2)) in a single place.
- 🔌 **Pluggable:** Plug your front-end to NOS with out-of-the-box high-performance gRPC/REST APIs, avoiding all kinds of ML model deployment hassles.
@@ -52,21 +45,113 @@ Optimize, serve and auto-scale Pytorch models in production<br>

Get started with the full NOS server by installing via pip:

```shell
$ conda create -n nos-py38 python=3.8
$ conda activate nos-py38
$ conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
$ pip install torch-nos[server]
```

If you want to simply use a light-weight NOS client and run inference on your local machine, you can install the client-only package:

```shell
$ conda create -n nos-py38 python=3.8
$ conda activate nos-py38
$ pip install torch-nos
```

## 🔥 Quickstart / Show me the code

### Image Generation as-a-Service


<table>
<tr>
<td> REST API </td>
<td> gRPC API ⚡ </td>
</tr>
<tr>
<td>

```bash
curl \
-X POST http://localhost:8000/infer \
-H 'Content-Type: application/json' \
-d '{
"model_id": "stabilityai/stable-diffusion-xl-base-1-0",
"inputs": {
"prompts": ["fox jumped over the moon"],
"width": 1024,
"height": 1024,
"num_images": 1
}
}'
```

</td>
<td>

```python
from nos.client import Client

client = Client("[::]:50051")

sdxl = client.Module("stabilityai/stable-diffusion-xl-base-1-0")
image, = sdxl(prompts=["fox jumped over the moon"],
width=1024, height=1024, num_images=1)
```

</td>
</tr>
</table>
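
For reference, the same REST request can be sent from Python; a minimal sketch using the `requests` library, with the endpoint and payload copied from the curl example above (the response schema is not documented here, so it is left as raw JSON):

```python
import requests

# Mirror the curl example: POST an SDXL inference request to the NOS REST endpoint.
response = requests.post(
    "http://localhost:8000/infer",
    json={
        "model_id": "stabilityai/stable-diffusion-xl-base-1-0",
        "inputs": {
            "prompts": ["fox jumped over the moon"],
            "width": 1024,
            "height": 1024,
            "num_images": 1,
        },
    },
)
response.raise_for_status()
result = response.json()  # inspect the returned payload; the schema depends on the server version
```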

### Text & Image Embedding-as-a-Service (CLIP-as-a-Service)

<table>
<tr>
<td> REST API </td>
<td> gRPC API ⚡ </td>
</tr>
<tr>
<td>

```bash
curl \
-X POST http://localhost:8000/infer \
-H 'Content-Type: application/json' \
-d '{
"model_id": "openai/clip",
"method": "encode_text",
"inputs": {
"texts": ["fox jumped over the moon"]
}
}'
```

</td>
<td>

```python
from nos.client import Client

client = Client("[::]:50051")

clip = client.Module("openai/clip")
txt_vec = clip.encode_text(text=["fox jumped over the moon"])
```
</td>
</tr>
</table>
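
The embeddings returned by `encode_text` can be compared directly. A minimal sketch, assuming the call returns one array-like vector per input string (an assumption, not spelled out in this README):

```python
import numpy as np
from nos.client import Client

client = Client("[::]:50051")
clip = client.Module("openai/clip")

# Assumption: encode_text returns one embedding per input string.
vecs = np.asarray(clip.encode_text(text=["a fox", "fox jumped over the moon"]))

# Cosine similarity between the two text embeddings.
a, b = vecs[0], vecs[1]
similarity = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {similarity:.3f}")
```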


## 📂 Repository Structure

```bash
├── docker # Dockerfile for CPU/GPU servers
├── docs # mkdocs documentation
├── examples # example guides, jupyter notebooks, demos
├── makefiles # makefiles for building/testing
├── nos
│   ├── cli # CLI (hub, system)
│   ├── client # gRPC / REST client
│   ├── common # common utilities
@@ -84,30 +169,33 @@ pip install torch-nos[server]

## 📚 Documentation

- 📚 [NOS Documentation](https://docs.nos.run/)
- 🔥 [Quickstart](https://docs.nos.run/docs/quickstart.html)
- 🧠 [Models](https://docs.nos.run/docs/models/supported-models.html)
- ⚡️ **Concepts**: [NOS Architecture](https://docs.nos.run/docs/concepts/architecture-overview.html)
- 🤖 **Demos**: [Building a Discord Image Generation Bot](https://docs.nos.run/docs/demos/discord-bot.html), [Video Search Demo](https://docs.nos.run/docs/demos/video-search.html)
- [NOS Documentation](https://docs.nos.run/)
- [Quickstart](https://docs.nos.run/docs/quickstart.html)
- [Models](https://docs.nos.run/docs/models/supported-models.html)
- **Concepts**: [NOS Architecture](https://docs.nos.run/docs/concepts/architecture-overview.html)
- **Demos**: [Building a Discord Image Generation Bot](https://docs.nos.run/docs/demos/discord-bot.html), [Video Search Demo](https://docs.nos.run/docs/demos/video-search.html)

## 🛣 Roadmap

### HW / Cloud Support

- [✅] **Commodity GPUs**
- [✅] NVIDIA GPUs (20XX, 30XX, 40XX)
- [ ] AMD GPUs (RX 6000 series)
- [✅] **Cloud GPUs**
- [ ] NVIDIA (T4, A100, H100)
- [-] AMD (MI200, MI250)
- [🟡] **Cloud ASICs**
- [🟡] [AWS Inferentia](https://aws.amazon.com/machine-learning/inferentia/) [Inf1](https://aws.amazon.com/ec2/instance-types/inf1/)/[Inf2](https://aws.amazon.com/ec2/instance-types/inf2/)
- [ ] Google TPU
- [ ] TBD (Graphcore, Habana Gaudi, Tenstorrent)
- [✅] **Cloud Service Providers** (via [SkyPilot](https://github.com/skypilot-org/skypilot))
- [✅] **Big 3:** AWS, GCP, Azure
- [x] **Commodity GPUs**
- [x] NVIDIA GPUs (20XX, 30XX, 40XX)
- [ ] AMD GPUs (RX 7000)

- [x] **Cloud GPUs**
- [x] NVIDIA (T4, A100, H100)
- [ ] AMD (MI200, MI250)

- [x] **Cloud Service Providers** (via [SkyPilot](https://github.com/skypilot-org/skypilot))
- [x] **Big 3:** AWS, GCP, Azure
- [ ] **Opinionated Cloud:** Lambda Labs, RunPod, etc

- [ ] **Cloud ASICs**
- [ ] [AWS Inferentia](https://aws.amazon.com/machine-learning/inferentia/) ([Inf1](https://aws.amazon.com/ec2/instance-types/inf1/)/[Inf2](https://aws.amazon.com/ec2/instance-types/inf2/))
- [ ] Google TPU
- [ ] TBD (Graphcore, Habana Gaudi, Tenstorrent)


## 📄 License

@@ -120,4 +208,4 @@ We welcome contributions! Please see our [contributing guide](CONTRIBUTING.md) f
### 🔗 Quick Links

* 💬 Send us an email at [[email protected]](mailto:[email protected]) or join our [Discord](https://discord.gg/QAGgvTuvgg) for help.
* 📣 Follow us on [Twitter](https://twitter.com/autonomi\_ai), and [LinkedIn](https://www.linkedin.com/company/autonomi-ai) to keep up-to-date on our products.
1 change: 1 addition & 0 deletions docs/api/common/exceptions.md
@@ -0,0 +1 @@
::: nos.common.exceptions
1 change: 1 addition & 0 deletions docs/api/common/metaclass.md
@@ -0,0 +1 @@
::: nos.common.metaclass
1 change: 1 addition & 0 deletions docs/api/common/shm.md
@@ -0,0 +1 @@
::: nos.common.shm
1 change: 1 addition & 0 deletions docs/api/common/system.md
@@ -0,0 +1 @@
::: nos.common.system
2 changes: 1 addition & 1 deletion docs/api/common/tasks.md
@@ -1 +1 @@
## ::: nos.common.tasks
::: nos.common.tasks
2 changes: 1 addition & 1 deletion docs/api/common/types.md
@@ -1 +1 @@
## ::: nos.common.types
::: nos.common.types
4 changes: 4 additions & 0 deletions docs/api/server.md
@@ -2,6 +2,8 @@

## Docker Runtime

The Docker runtime provides a `docker-py` based interface for running containerized inference workloads. It supports starting and stopping containers, querying container information, and running containers programmatically with hardware support for accelerators such as GPUs and ASICs.
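
As a rough illustration of what this runtime wraps (plain `docker-py`, not the NOS API itself), here is a minimal sketch that launches a GPU-enabled container; the image and command are placeholders:

```python
import docker
from docker.types import DeviceRequest

client = docker.from_env()

# Request all available GPUs for the container (equivalent to `docker run --gpus all`).
gpu_request = DeviceRequest(count=-1, capabilities=[["gpu"]])

container = client.containers.run(
    "nvidia/cuda:11.8.0-base-ubuntu22.04",  # placeholder image
    command="nvidia-smi",                   # placeholder command
    device_requests=[gpu_request],
    detach=True,
)
container.wait()
print(container.logs().decode())
container.remove()
```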

::: nos.server._docker.DeviceRequest

::: nos.server._docker.DockerRuntime
@@ -13,6 +15,8 @@

## InferenceService

The `InferenceService`, together with the `InferenceServiceImpl` gRPC service implementation, provides a fully wrapped inference service over gRPC/HTTP2. `InferenceServiceImpl` exposes the relevant APIs such as `ListModels()`, `GetModelInfo()` and, crucially, `Run()`, and executes each inference request via the `InferenceService` class. The `InferenceService` manages models via the `ModelManager` and sets up the necessary execution backend via `RayExecutor`. It is also responsible for managing shared-memory regions (if requested) for high-performance inference running locally on a single machine.
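
For orientation, a minimal client-side sketch of this path, using the `Client` API from the quickstart (the server-side routing through `Run()` is an internal detail and not exposed to the caller):

```python
from nos.client import Client

# Connect to the gRPC server (address taken from the quickstart examples).
client = Client("[::]:50051")

# Each module method call below is serialized into an inference request and
# executed server-side by InferenceServiceImpl / InferenceService.
clip = client.Module("openai/clip")
embedding = clip.encode_text(text=["fox jumped over the moon"])
```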

::: nos.server._service.ModelHandle

::: nos.server._service.InferenceServiceImpl
1 change: 1 addition & 0 deletions docs/assets/nos-header.svg
26 changes: 7 additions & 19 deletions docs/concepts/architecture-overview.md
@@ -1,17 +1,19 @@
!!!note ""
**NOS (`torch-nos`)** is a fast and flexible Pytorch inference server, specifically designed for optimizing and running lightning-fast inference of popular foundational AI models.

*Optimizing and serving models for production AI inference is still difficult, often leading to notoriously expensive cloud bills and often underutilized GPUs. That’s why we’re building **NOS** - a fast inference server for modern AI workloads. With a few lines of code, developers can optimize, serve, and auto-scale Pytorch model inference without having to deal with the complexities of ML compilers, HW-accelerators, or distributed inference. Simply put, NOS allows AI teams to cut inference costs up to **10x**, speeding up development time and time-to-market.*
Optimizing and serving models for production AI inference is still difficult, often leading to notoriously expensive cloud bills and often underutilized GPUs. That’s why we’re building **NOS** - a fast inference server for modern AI workloads. With a few lines of code, developers can optimize, serve, and auto-scale Pytorch model inference without having to deal with the complexities of ML compilers, HW-accelerators, or distributed inference. Simply put, NOS allows AI teams to cut inference costs up to **10x**, speeding up development time and time-to-market.

## ⚡️ Core Features

- 🔋 **Batteries-included:** Server-side inference with all the necessary batteries (model hub, batching/parallelization, fast I/O, model-caching, model resource management via ModelManager, model optimization via ModelSpec)
- 📡 **Client-Server architecture:** Multiple lightweight clients can leverage powerful server-side inference workers running remotely without the bloat of GPU libraries, runtimes or 3rd-party libraries.
- 💪 **High device-utilization:** With better model management, clients won’t have to wait on model inference and can instead take advantage of the full GPU resources available. Model multiplexing and efficient bin-packing of models allow us to leverage the resources optimally (without requiring additional user input).
- 📦 **Custom model support:** NOS allows you to easily add support for custom models with a few lines of code. We provide a simple API to register custom models with NOS, and allow you to optimize and run models on any hardware (NVIDIA, custom ASICs) without any model compilation or runtime management (see [example](../guides/running-custom-models.md)).
- **Concurrency**: NOS is built to efficiently serve AI models, ensuring concurrency, parallelism, optimal memory management, and automatic garbage collection. It is particularly well-suited for multi-modal AI applications.
- **Concurrency**: NOS is built to efficiently serve AI models, ensuring concurrency, parallelism, optimal memory management, and automatic garbage collection. It is particularly well-suited for multi-modal AI applications.
![Unified NOS Inference Server](./assets/arch-how-nos-works.png)
## 🏗️ Architecture

![NOS Architecture](./assets/arch-client-server.png)

## 🛠️ Core Components

@@ -28,20 +30,6 @@ NOS is built to efficiently serve AI models, ensuring concurrency, parallelism,
- Submit tasks to specific methods of the model.
- Garbage collect models when they are evicted.

Model manager for serving and running multiple models with Ray actors.
- [**`InferenceService`**](#inferenceservice): Ray-executor based inference service that executes inference requests.
- [**`InferenceRuntimeService`**](#inferenceruntimeservice): Dockerized runtime environment for server-side remote execution

![NOS Architecture](./assets/arch-client-server.png)


## Overview

NOS is built to efficiently serve AI models, ensuring concurrency, parallelism, optimal memory management, and automatic garbage collection. It is particularly well-suited for multi-modal AI applications.

Key Features:
- Concurrency support for multiple models running simultaneously.
- Parallelism support with multiple replicas of the same model.
- Optimal memory management, dynamically adjusting to model memory consumption.
- Automatic garbage collection to prevent Out-Of-Memory issues.
- [**`InferenceService`**](../api/server.md#inferenceservice): Ray-executor based inference service that executes inference requests.

- [**`InferenceRuntimeService`**](../api/server.md#inferenceserviceruntime): Dockerized runtime environment for server-side remote execution
4 changes: 2 additions & 2 deletions docs/concepts/what-is-nos.md
@@ -1,7 +1,7 @@
!!!note ""
**NOS (`torch-nos`)** is a fast and flexible Pytorch inference server, specifically designed for optimizing and running lightning-fast inference of popular foundational AI models.

*Optimizing and serving models for production AI inference is still difficult, often leading to notoriously expensive cloud bills and often underutilized GPUs. That’s why we’re building **NOS** - a fast inference server for modern AI workloads. With a few lines of code, developers can optimize, serve, and auto-scale Pytorch model inference without having to deal with the complexities of ML compilers, HW-accelerators, or distributed inference. Simply put, NOS allows AI teams to cut inference costs up to **10x**, speeding up development time and time-to-market.*
Optimizing and serving models for production AI inference is still difficult, often leading to notoriously expensive cloud bills and often underutilized GPUs. That’s why we’re building **NOS** - a fast inference server for modern AI workloads. With a few lines of code, developers can optimize, serve, and auto-scale Pytorch model inference without having to deal with the complexities of ML compilers, HW-accelerators, or distributed inference. Simply put, NOS allows AI teams to cut inference costs up to **10x**, speeding up development time and time-to-market.

## ⚡️ Core Features

@@ -31,7 +31,7 @@ to a different execution flow with substantial gaps between dev and prod environ
- They may raise privacy concerns as user data must go outside the wire for inferencing on vendor servers
- Stability issues when using poorly maintained third party APIs

We built **NOS** because we wanted an inference server combining best practices in model serving, distributed inference, and auto-scaling, all in a single, easy-to-use containerized system that you can simply run with a few lines of Python.

## 📦 Model Containers
