
Code Documentation


Author: Thomas Patton ([email protected])

scripts/annotate.py

Overview

The theory behind this algorithm is based on the computer vision concept of three-dimensional reconstruction. The core concept is that basic math can be applied to points specified in a three-dimensional coordinate system to compute how they would be viewed in two dimensions from an arbitrary camera position. In our algorithm, for each video we manually segment the first image in the video. We then use the camera’s depth value d at segmentation coordinate (i, j) to convert the two-dimensional segmentation into a three-dimensional representation of that segmentation. Then for each subsequent frame of the video, we compute the transformation from the initial camera position to the current camera position as [R|t]. For each of our three-dimensional points with coordinates (x, y, z), we can compute their new two-dimensional projection coordinates (u, v) using OpenCV.
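
For intuition, here is a minimal sketch of that math using NumPy and OpenCV. The intrinsics, depth value, and pose below are illustrative placeholders rather than values from the real system:

```python
# Hedged sketch of the back-projection and reprojection described above.
# fx, fy, cx, cy, d, R, t are illustrative, not the script's actual values.
import numpy as np
import cv2

# Back-project a segmented pixel (i, j) with depth d into the initial camera frame.
fx, fy, cx, cy = 615.0, 615.0, 320.0, 240.0   # example camera intrinsics
i, j, d = 240, 320, 0.45                      # row, column, depth in meters
x = (j - cx) * d / fx
y = (i - cy) * d / fy
z = d
points_3d = np.array([[x, y, z]], dtype=np.float64)

# [R | t] is the transform from the initial camera pose to the current one.
R = np.eye(3)
t = np.zeros(3)
K = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]])
rvec, _ = cv2.Rodrigues(R)
uv, _ = cv2.projectPoints(points_3d, rvec, t, K, None)
print(uv.squeeze())   # (u, v) location of the point in the current frame
```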

On a more code-centric level, this module is a time-synchronized ROS node which reads in RGB images (/camera_1/color/image_raw), depth images (/camera_1/aligned_depth_to_color/image_raw), and camera intrinsic information (/camera_1/color/camera_info). For each time-synchronized set of messages, the program invokes the callback() function.
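
A minimal sketch of that subscriber setup, assuming the standard message_filters API (the node name and queue size here are illustrative):

```python
# Hedged sketch of the time-synchronized subscriber setup; the real node
# lives in scripts/annotate.py and its exact parameters may differ.
import rospy
import message_filters
from sensor_msgs.msg import Image, CameraInfo

def callback(color_msg, depth_msg, info_msg):
    # Invoked once per set of time-aligned messages.
    pass

rospy.init_node("annotator")
color_sub = message_filters.Subscriber("/camera_1/color/image_raw", Image)
depth_sub = message_filters.Subscriber("/camera_1/aligned_depth_to_color/image_raw", Image)
info_sub = message_filters.Subscriber("/camera_1/color/camera_info", CameraInfo)

sync = message_filters.TimeSynchronizer([color_sub, depth_sub, info_sub], 10)
sync.registerCallback(callback)
rospy.spin()
```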

If callback() is being invoked for the first time (denoted via the self.world_created variable), we use the initial segmentation provided by the user to create the three-dimensional representation of the segmentation using the math specified here. Importantly, we use a dictionary which maps the integer class id associated with each food item to its three-dimensional points. Also in this initial callback we record the homogeneous transformation from the base of the robot to the current camera position.
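
A sketch of what building that dictionary might look like, assuming an (H, W) integer segmentation image and an aligned depth image; the function and variable names below are illustrative, not the script's exact API:

```python
# Hedged sketch of the {class_id: 3D points} construction on the first callback.
import numpy as np

def create_3d_world(segmentation, depth, fx, fy, cx, cy):
    """segmentation: (H, W) integer class ids; depth: (H, W) depth in meters."""
    world = {}
    for class_id in np.unique(segmentation):
        if class_id == 0:                          # skip background
            continue
        rows, cols = np.nonzero(segmentation == class_id)
        d = depth[rows, cols]
        x = (cols - cx) * d / fx
        y = (rows - cy) * d / fy
        world[int(class_id)] = np.stack([x, y, d], axis=1)   # (N, 3) points
    return world
```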

If callback() has already been invoked, we are no longer in the initial frame and need to compute the transformation from the initial camera frame to the current one. We do this by looking up the transformation from the base link to the current camera position and multiplying its inverse by the transformation recorded in the initial callback. We can then pass this transformation matrix to compute_projection(), which uses OpenCV and the prior information to compute the two-dimensional projection. This information is then saved.
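
A sketch of that composition, assuming both transforms are stored as 4x4 NumPy matrices:

```python
# Hedged sketch of the relative-transform computation for later frames.
# T_base_to_init is the 4x4 matrix recorded in the first callback;
# T_base_to_current is the lookup for the current frame. Names are illustrative.
import numpy as np

def initial_to_current_transform(T_base_to_init, T_base_to_current):
    # Points expressed in the initial camera frame are mapped into the
    # current camera frame via inv(T_base_to_current) @ T_base_to_init.
    return np.linalg.inv(T_base_to_current) @ T_base_to_init

# The resulting 4x4 matrix is split into a rotation and translation and handed
# to compute_projection(), which wraps cv2.projectPoints as shown earlier.
```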

All of the above code runs in a single thread. Inside the program we also launch a subprocess which uses the rosbag play command to play the rosbag, allowing the whole pipeline to run from a single Python invocation. We also launch a separate display thread which shows some intermediate images for debugging purposes.
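
A minimal sketch of this single-entry-point pattern; the bag path, window name, and get_latest_image helper are hypothetical placeholders:

```python
# Hedged sketch: play the rosbag in a subprocess and run a display loop in a
# separate thread. get_latest_image is a hypothetical accessor for the most
# recently produced debug image.
import subprocess
import threading
import cv2

def play_rosbag(bag_path):
    # Publishes the recorded messages to the subscribed topics.
    return subprocess.Popen(["rosbag", "play", bag_path])

def display_loop(get_latest_image, stop_event):
    while not stop_event.is_set():
        image = get_latest_image()
        if image is not None:
            cv2.imshow("projection", image)
        cv2.waitKey(30)

stop_event = threading.Event()
# threading.Thread(target=display_loop, args=(get_latest_image, stop_event)).start()
# proc = play_rosbag("/path/to/experiment.bag")
```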

Function Descriptions

  • __init__ : Initializes the Annotator class by registering a number of important variables, initializing the filesystem tree, and ultimately launching the program
  • run : Calls rospy.spin() thus causing the program to wait until killed.
  • timer_callback : A callback function that gets invoked every few seconds. It checks whether any new messages have come in recently and, if they haven't, kills the program.
  • play_rosbag : This launches a subprocess which "plays" the rosbag thus publishing messages to the subscribed topics.
  • get_segmentation : This loads the user-defined segmentation for the first frame.
  • check_valid_experiment : This is a boolean check to make sure that a segmentation exists for the specified rosbag.
  • init_filesystem : This uses the path of this python file to locate other important files in the system. This helps the program to be robust so that it can be run from multiple places on the user's machine.
  • init_ros_pub_and_sub : This initializes ROS subscribers to the topics mentioned above and puts them inside a TimeSynchronizer to ensure the messages arrive with similar timestamps.
  • init_tf_buffer : This initializes a transform buffer. This is very important for looking up transformations between frames of the robot arm as all transformations are published to this buffer.
  • create_fork_mask : This creates a binary mask indicating where the fork is in the frame. We don't want our projections back to 2D to count pixels that are occluded by the fork.
  • get_base_to_camera_tf : This uses the aforementioned tf_buffer to lookup the transform. It operates using a rospy.Rate since the transformation is not always instantly available.
  • register_camera_info : The camera intrinsics are an important part of segmentation propagation. However they're the same across a given video and as such only need to be processed once. So we save the important camera parameters in relevant variables.
  • transform_to_mat : This parses a TransformStamped ROS message and puts it into a 4x4 homogenous transformation matrix.
  • convert_images : This uses a cv_bridge to convert the color and depth images from ROS Image messages into actual cv2 objects.
  • vec_pixel_to_world : This is a vectorized version of the pixel_to_world function which converts two-dimensional pixel coordinates into three-dimensional world coordinates.
  • create_3d_world : A very important function. This uses a provided segmentation image to determine the i, j coordinates where that segmentation exists. We then use the corresponding depth values at these coordinates to create the previously mentioned three-dimensional world using the function above.
  • compute_projection : This is some wrapper code for the cv2.projectPoints function. This allows us to take the transformation applied to the camera along with the three-dimensional points and compute their two-dimensional projection.
  • create_projected_mask : This iterates over each segmentation class and calls the above function to compute its two-dimensional projection.
  • save_images : Uses knowledge of the filesystem along with other variables to save all output files in the appropriate locations.
  • create_projection_image : If needed for display, overlays the projected mask onto the color image. Uses vectorized RGB lookup logic to add colors to the masks (see the sketch after this list).
  • callback : Driver function which operates as described above.
  • deprecated create_pointcloud : At one point was used for debugging to publish pointclouds for visualization in RVIZ. Does not work in the current system.
  • display : Runs in a separate thread. Reads in the most recently published image and uses cv2.imshow to display it.
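
As referenced in the create_projection_image bullet above, here is a hedged sketch of the vectorized RGB lookup; the palette and blending weight are placeholders:

```python
# Hedged sketch of the vectorized mask overlay; colors and alpha are illustrative.
import numpy as np
import cv2

def create_projection_image(color_image, mask, alpha=0.5):
    """color_image: (H, W, 3) uint8; mask: (H, W) integer class ids."""
    palette = np.array([
        [0, 0, 0],        # background stays black
        [255, 0, 0],      # class 1
        [0, 255, 0],      # class 2
        [0, 0, 255],      # class 3
    ], dtype=np.uint8)
    colored_mask = palette[mask]                  # vectorized RGB lookup
    return cv2.addWeighted(color_image, 1 - alpha, colored_mask, alpha, 0)
```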

scripts/correct_masks.py

Overview

This program is the second major contribution to this project. While annotate.py is an incredible piece of code, unfortunately in our use case it was not able to perfectly project our segmentations across all frames. This was a huge detriment to this task as training a model with even slightly inaccurate masks produced "sloppy" masks. This matters greatly as the downstream task from this one (skewering the food) needs very precise location data. This piece of code uses the segment-anything model (SAM) from Meta (see here) to generate an unsupervised segmentation of the visual scene. We then apply some logic as stated below to confer the labels from our approximate segmentation thus giving us our final input to the model.

From annotate.py we have a large directory filled with images and their corresponding masks which we load in batch using get_files_in_bag(). We also load the pretrained SAM model using load_model(). We then generate our unsupervised segmentations of the images using the batch_predict_examples() function. The crux of the program lies in label_sam_predictions() as here we do the "dirty work" to combine all of the information we have.
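
A sketch of how SAM can be loaded and run for this kind of unsupervised segmentation, assuming the public segment-anything API; the checkpoint path and image filename are placeholders, and the repository's load_model() / batch_predict_examples() wrappers may differ in detail:

```python
# Hedged sketch of loading the smallest SAM model and generating masks.
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator
import cv2

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # placeholder checkpoint path
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("frame_0000.png"), cv2.COLOR_BGR2RGB)  # placeholder frame
predictions = mask_generator.generate(image)   # list of dicts with "segmentation", "area", ...
```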

For each image, we look at all of the subsegmentations generated by SAM. Since SAM segments the entire image, some processing logic is needed to determine which subsegmentations are actually of food items. First, there is some logic inside mask_is_shadow() to determine if a subsegmentation is a shadow using basic LAB thresholding. This is important because shadows have a tendency to throw the algorithm off. If the subsegmentation is not a shadow, is bigger than our minimum area threshold, and is smaller than our maximum area threshold, we use our approximate segmentation to find the closest label. This is done by taking the centroid of the subsegmentation and finding the closest nonzero pixel in the approximate segmentation. If the distance to that pixel does not exceed a distance threshold, we confer its label onto the subsegmentation.
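
A sketch of that label-transfer rule, assuming the approximate mask is an (H, W) array of integer class ids; the threshold value and function signature are illustrative:

```python
# Hedged sketch of conferring a label from the approximate mask onto a SAM mask.
import numpy as np

def get_label_for_mask(sam_mask, approx_mask, distance_threshold=50.0):
    rows, cols = np.nonzero(sam_mask)
    centroid = np.array([rows.mean(), cols.mean()])

    label_rows, label_cols = np.nonzero(approx_mask)
    if len(label_rows) == 0:
        return None
    candidates = np.stack([label_rows, label_cols], axis=1)
    distances = np.linalg.norm(candidates - centroid, axis=1)
    nearest = np.argmin(distances)
    if distances[nearest] > distance_threshold:
        return None   # too far from any approximate label; leave unlabeled
    return int(approx_mask[label_rows[nearest], label_cols[nearest]])
```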

All of this information is then formatted into individual segmentation masks, bounding boxes, and labels as specified here for easy feeding into the MaskRCNN model. This is then saved as a .npz file as the compression massively helps with the size of the files.
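
One plausible shape for such a record in the MaskRCNN target format; the exact keys and filenames used by save_output may differ:

```python
# Toy arrays standing in for the real labeling output.
import numpy as np

masks = np.zeros((2, 480, 640), dtype=np.uint8)    # (N, H, W) binary instance masks
boxes = np.array([[10, 10, 50, 50],
                  [100, 80, 160, 140]], dtype=np.float32)  # (N, 4) [x_min, y_min, x_max, y_max]
labels = np.array([1, 2], dtype=np.int64)          # (N,) integer class ids

np.savez_compressed("frame_0000.npz", masks=masks, boxes=boxes, labels=labels)
```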

Because SAM is a relatively large model, this code takes a number of hours to run depending on hardware. Inside the code we used a downsample factor of 12 (meaning every 12th frame is used) because at 60 Hz sampling not much is lost in terms of "data expressivity" but much is saved in terms of runtime.

Function Descriptions

  • verbose_log : A nice helper function to only log things when a variable is set.
  • load_model : Loads the smallest SAM model into memory and initializes it. You'll need to download a checkpoint in order to run this section of code.
  • batch_load_examples : Uses file directory information to load in image and mask files.
  • batch_predict_examples : Runs SAM on the loaded examples.
  • create_fork_mask : Shamelessly stolen from the previous program, see the documentation above.
  • create_projection_image : Also stolen from above, creates a visualization if needed.
  • mask_is_shadow : Uses thresholding in LAB color space to determine if a section of the image is a shadow or not.
  • get_label_for_mask : For a given unlabeled subsegmentation, takes the centroid and finds the nearest nonzero pixel in the approximate segmentation. This gives us the conferred label for this example.
  • get_bbox_from_mask : Uses min and max nonzero coordinates from the mask to create a bounding box.
  • label_sam_predictions : Wraps the above functions to completely label and process the SAM predictions.
  • save_output : Saves the output from this program as a .npz file in the correct directory.
  • run_batch : Wraps the above: loads examples, predicts on them, and saves the output.
  • get_batch_indices : Used to split a total size into batches.
  • get_files_in_bag : Uses glob to find all files in a directory.
  • run : Wraps all of the above to run the program with tqdm for progress tracking.

src/

While no piece of code inside src is worth too much time to touch on, I will cover the directory as a whole. This directory has everything related to the actual MaskRCNN model itself, training, and evaluation. Briefly:

  • dataset.py creates a custom PyTorch dataset to load examples (see the sketch after this list). Importantly, it only loads their filenames into memory. This is because actually loading one of the .npy files takes a significant amount of memory (a few GB each), so loading them all into memory at once is infeasible while also using the GPU.
  • train.py loads the MaskRCNN model and trains it. This is fairly straightforward.
  • eval.py loads a model checkpoint (or checkpoints) and computes the mean-average precision and mean-average recall so that different model checkpoints can be compared.
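
A hedged sketch of the lazy-loading pattern described for dataset.py; the file extension, record keys, and tensor types below are assumptions:

```python
# Hedged sketch: only filenames are kept in memory and each record is read on
# demand in __getitem__. The "image", "masks", "boxes", "labels" keys are assumed.
import glob
import numpy as np
import torch
from torch.utils.data import Dataset

class SegmentationDataset(Dataset):
    def __init__(self, data_dir):
        self.files = sorted(glob.glob(f"{data_dir}/*.npz"))   # filenames only

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        record = np.load(self.files[idx])                     # loaded lazily
        image = torch.as_tensor(record["image"], dtype=torch.float32)
        target = {
            "masks": torch.as_tensor(record["masks"], dtype=torch.uint8),
            "boxes": torch.as_tensor(record["boxes"], dtype=torch.float32),
            "labels": torch.as_tensor(record["labels"], dtype=torch.int64),
        }
        return image, target
```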