From 37d8d06b64b3ad50b77b01ca1ec8778bb7fedbe6 Mon Sep 17 00:00:00 2001
From: LiCHOTHU
Date: Tue, 2 Apr 2024 14:07:49 -0400
Subject: [PATCH] Update README.md

---
 README.md | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 1495473..e6e5f02 100644
--- a/README.md
+++ b/README.md
@@ -27,7 +27,7 @@ We also formulate our data collection algorithm here.

 ![data_collection_algo](https://github.com/pairlab/actaim2-eccv24/assets/30140814/3032710a-ac79-400e-99c9-8d23ea881806)

-## Mode Selector
+## Unsupervised Mode Selector Learning

 In this part, we show how we train the mode selector and infer from it to extract the discrete task embedding for action predictor training. Our mode selector is a VAE-style generative model that replaces the simple Gaussian prior with a mixture of Gaussians.

@@ -51,6 +51,11 @@ In the inference phase, the agent discretely samples a cluster from the trained

 This disentanglement visualization illustrates the efficacy of the Conditional Gaussian Mixture Variational Autoencoder (CGMVAE) in disentangling interaction modes for the "single drawer" object (ID: 20411), using a t-SNE plot. Task embeddings $\epsilon_j$, defined by the difference between initial and final object states, are visualized in distinct colors to denote various interaction modes and clusters. The sequence of figures demonstrates the CGMVAE's precision in clustering and aligning data points with their respective interaction modes: (1) Generated clusters from the CGMVAE mode selector reveal distinct groupings. (2) Ground truth task embeddings confirm the model's capacity for accurate interaction mode classification. (3) A combined visualization underscores the alignment between generated clusters and ground truth, showcasing the model's ability to consistently categorize tasks within identical interaction modes.

+## Supervised Action Predictor Learning
+
+[fig3.pdf](https://github.com/pairlab/actaim2-eccv24/files/14842055/fig3.pdf)
+
+The interaction mode $\epsilon$ is sampled from the latent space of the mode selector. Multiview RGBD observations are back-projected and fused into a colored point cloud. Novel views are rendered by projecting the point cloud onto orthogonal image planes. The rendered image tokens and the interaction mode token are concatenated and fed through the multiview transformer. The output consists of a global feature for rotation $\mathbf{R}$ and gripper state $\mathbf{q}$ estimation and a 2D per-view heatmap for position $\mathbf{p}$ prediction.
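
To make the mode-selector paragraph in the patch above concrete, here is a minimal sketch, assuming a PyTorch implementation, of how a mixture-of-Gaussians prior can be sampled discretely at inference time: pick a cluster from the categorical mixture weights, then draw a task embedding $\epsilon$ from that Gaussian component. The class and method names (`GaussianMixturePrior`, `sample_mode`) and the hyperparameters are illustrative assumptions, not the repository's actual API.

```python
import torch
import torch.nn as nn


class GaussianMixturePrior(nn.Module):
    """Learnable mixture-of-Gaussians prior over task embeddings epsilon (illustrative)."""

    def __init__(self, num_modes=10, latent_dim=64):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_modes))                   # mixture weights
        self.means = nn.Parameter(0.1 * torch.randn(num_modes, latent_dim))  # component means
        self.log_vars = nn.Parameter(torch.zeros(num_modes, latent_dim))     # component log-variances

    @torch.no_grad()
    def sample_mode(self):
        """Discretely sample a cluster index, then draw epsilon from that component."""
        k = torch.distributions.Categorical(logits=self.logits).sample().item()
        std = (0.5 * self.log_vars[k]).exp()
        epsilon = self.means[k] + std * torch.randn_like(std)
        return k, epsilon


# Usage: draw a discrete interaction mode and its task embedding.
prior = GaussianMixturePrior()
mode_id, epsilon = prior.sample_mode()
```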
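
The action-predictor section describes back-projecting multiview RGBD observations into a fused colored point cloud and rendering novel views on orthogonal image planes. Below is a hedged sketch of those two steps, assuming pinhole intrinsics `K`, camera-to-world extrinsics, and a workspace bounded by `[-bound, bound]`; the function names and the nearest-pixel splatting are simplifications for illustration, not the repository's implementation.

```python
import torch


def backproject(depth, rgb, K, cam2world):
    """Lift one RGB-D view into a world-frame colored point cloud of shape (N, 6)."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    z = depth.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]              # pinhole back-projection
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts = torch.stack([x, y, z, torch.ones_like(z)], dim=0)  # homogeneous camera coords
    world = (cam2world @ pts)[:3].T                          # (N, 3) world coordinates
    return torch.cat([world, rgb.reshape(-1, 3)], dim=1)     # append RGB (floats in [0, 1])


def render_orthographic(points, res=128, bound=1.0):
    """Splat the fused cloud onto three orthogonal image planes (xy, xz, yz)."""
    xyz, rgb = points[:, :3], points[:, 3:]
    views = []
    for axes in ([0, 1], [0, 2], [1, 2]):                    # pick two world axes per plane
        img = torch.zeros(res, res, 3)
        uv = ((xyz[:, axes] / bound + 1) * 0.5 * (res - 1)).long().clamp(0, res - 1)
        img[uv[:, 0], uv[:, 1]] = rgb                        # nearest-pixel splat; last point wins
        views.append(img)
    return torch.stack(views)                                # (3, res, res, 3)


# Usage (hypothetical `observations` list of (depth, rgb, K, cam2world) tuples):
# cloud = torch.cat([backproject(d, c, K, T) for d, c, K, T in observations])
# views = render_orthographic(cloud)
```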
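
Finally, a sketch of the token path: the rendered views are patchified into image tokens, concatenated with an interaction-mode token, and passed through a transformer encoder whose outputs drive a global head for rotation $\mathbf{R}$ and gripper state $\mathbf{q}$ and per-view heatmaps for position $\mathbf{p}$. The patch size, embedding width, 6D rotation parameterization, and module names are assumptions chosen for illustration, not the actual architecture configuration.

```python
import torch
import torch.nn as nn


class MultiviewActionPredictor(nn.Module):
    """Illustrative multiview transformer head; names and shapes are assumptions."""

    def __init__(self, latent_dim=64, patch=16, dim=256):
        super().__init__()
        self.patch = patch
        self.patch_embed = nn.Linear(3 * patch * patch, dim)  # flatten RGB patches into tokens
        self.mode_embed = nn.Linear(latent_dim, dim)          # interaction mode epsilon token
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.heatmap_head = nn.Linear(dim, patch * patch)     # per-view position heatmap logits
        self.global_head = nn.Linear(dim, 6 + 1)              # 6D rotation R + gripper q (assumed)

    def forward(self, views, epsilon):
        # views: (B, V, res, res, 3) rendered orthographic images; epsilon: (B, latent_dim)
        B, V, R, _, _ = views.shape
        p = self.patch
        # Patchify each view into (R/p)^2 tokens; positional encodings omitted for brevity.
        patches = views.unfold(2, p, p).unfold(3, p, p)         # (B, V, R/p, R/p, 3, p, p)
        tokens = self.patch_embed(patches.reshape(B, V * (R // p) ** 2, -1))
        mode_tok = self.mode_embed(epsilon).unsqueeze(1)        # (B, 1, dim)
        x = self.encoder(torch.cat([mode_tok, tokens], dim=1))  # concatenate mode + image tokens
        rot_and_grip = self.global_head(x[:, 0])                # global feature -> R and q
        heat = self.heatmap_head(x[:, 1:])                      # (B, V*(R/p)^2, p*p)
        heatmaps = heat.reshape(B, V, R // p, R // p, p, p)
        heatmaps = heatmaps.permute(0, 1, 2, 4, 3, 5).reshape(B, V, R, R)
        return rot_and_grip, heatmaps
```

With three 128x128 rendered views, this module returns a 7-dimensional global prediction and a `(B, 3, 128, 128)` heatmap stack, from which a 3D position could be recovered from the per-view maxima.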