
Module 2: Computer Vision & Object Understanding

Module 2 turns raw images into objects, masks, and poses your robot can reason about. You will revisit classical computer vision, then focus on deep vision models for detection, segmentation, depth estimation, and human pose estimation, culminating in a real-time perception pipeline that runs on RGB-D data.

2.1 Classical Computer Vision

Before deep learning, perception pipelines relied on hand-crafted features:

  • Edge detection (Canny):
    • Finds strong intensity gradients (edges)
    • Useful for contour detection and shape analysis
  • Feature descriptors (SIFT, SURF, ORB):
    • Extract keypoints and descriptors that are largely invariant to scale and rotation
    • Enable matching across images for tracking and SLAM
  • Homography and projective transforms:
    • Map points between images (e.g., planar surfaces)
    • Used for image stitching, stabilization, and basic pose estimation
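To make the homography idea concrete, here is a minimal sketch of mapping 2D points through a 3x3 homography matrix. The helper name and the example matrix are illustrative; in practice you would estimate H from point correspondences (e.g., with RANSAC).

```python
import numpy as np

def apply_homography(H, points):
    """Map 2D points through a 3x3 homography (illustrative helper).

    points: (N, 2) array of pixel coordinates.
    Returns the transformed (N, 2) coordinates.
    """
    pts = np.asarray(points, dtype=float)
    # Lift to homogeneous coordinates: (x, y) -> (x, y, 1)
    ones = np.ones((pts.shape[0], 1))
    homog = np.hstack([pts, ones])   # (N, 3)
    mapped = homog @ H.T             # each row is H @ [x, y, 1]
    # Perspective divide to return to Cartesian coordinates
    return mapped[:, :2] / mapped[:, 2:3]

# A pure translation by (10, 5) expressed as a homography
H = np.array([[1.0, 0.0, 10.0],
              [0.0, 1.0,  5.0],
              [0.0, 0.0,  1.0]])
print(apply_homography(H, [[0, 0], [2, 3]]))  # [[10. 5.] [12. 8.]]
```

The perspective divide is what distinguishes a homography from an affine transform: for non-trivial bottom rows of H, straight lines are preserved but parallelism is not.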

Classical methods are still valuable when:

  • Data is scarce (no large labeled datasets)
  • Real-time performance with limited compute is critical
  • You need interpretable geometric reasoning (e.g., planar homographies)

In practice, you will often mix classical and deep methods:

  • Classical feature tracking + deep detection
  • Geometry-based pose refinement on top of learned features

2.2 Deep Vision Models

Deep learning has dramatically improved perception performance and flexibility:

Architectures

  • Convolutional Neural Networks (CNNs):
    • Exploit spatial locality
    • Still widely used in detectors and segmenters
  • Vision Transformers (ViTs):
    • Use self-attention over image patches
    • Excel at global context and can integrate naturally with language models
  • Hybrid models:
    • Combine CNN backbones with attention modules
    • Balance inductive bias with flexibility

Object Detection

Detection models predict:

  • Bounding boxes
  • Class labels
  • Confidence scores

Common families:

  • YOLO-style models:
    • Real-time performance
    • Good for onboard inference on GPUs
  • RT-DETR and transformer-based detectors:
    • Use attention to reason about object relations
    • Flexible for integration with multimodal systems

Use cases:

  • Detect obstacles (chairs, boxes, humans)
  • Identify task-relevant objects (mugs, laptops, tools)

Semantic and Instance Segmentation

  • Semantic segmentation:
    • Assigns a class label to every pixel
    • Useful for understanding what each region is (floor, wall, table, person)
  • Instance segmentation:
    • Separates individual instances of the same class
    • Important for manipulation (which “mug” to pick up)

Modern segmenters (e.g., Mask2Former-like architectures) use transformers to:

  • Model long-range context
  • Produce coherent masks even in cluttered scenes
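Instance masks become useful for manipulation once you reduce them to per-object quantities such as centroids. A minimal sketch, assuming the common convention of a label image where 0 is background and each positive integer is one instance:

```python
import numpy as np

def instance_centroids(label_img):
    """Compute the (row, col) centroid of each instance in a label image.

    label_img: 2D integer array; 0 = background, k > 0 = instance id k.
    Returns {instance_id: (row, col)} with float centroids.
    """
    centroids = {}
    for inst_id in np.unique(label_img):
        if inst_id == 0:
            continue  # skip background
        rows, cols = np.nonzero(label_img == inst_id)
        centroids[int(inst_id)] = (rows.mean(), cols.mean())
    return centroids

# Two "mugs" in a tiny 4x4 label image
labels = np.array([[1, 1, 0, 0],
                   [1, 1, 0, 2],
                   [0, 0, 0, 2],
                   [0, 0, 0, 0]])
print(instance_centroids(labels))  # {1: (0.5, 0.5), 2: (1.5, 3.0)}
```

Combined with depth at the centroid pixel, this is often enough to decide which instance is closest and therefore which mug to pick up.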

Depth Estimation & Monocular 3D

Beyond hardware depth sensors, deep models can:

  • Predict depth from a single RGB image (monocular depth)
  • Estimate surface normals and 3D layout

Benefits:

  • Provide approximate geometry where depth sensors fail
  • Support 3D reasoning even with simple cameras

Limitations:

  • Less accurate than dedicated depth sensors
  • Often require scene priors or training on similar environments
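Whether depth comes from a sensor or a monocular model, turning a pixel plus a metric depth value into a 3D point uses the standard pinhole camera model. A sketch, with made-up intrinsics (in practice they come from calibration, e.g., a CameraInfo message):

```python
import numpy as np

def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into camera coordinates.

    Pinhole model:
        X = (u - cx) * Z / fx,   Y = (v - cy) * Z / fy,   Z = depth
    """
    z = float(depth_m)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

# Hypothetical intrinsics for a 640x480 camera
fx, fy, cx, cy = 600.0, 600.0, 320.0, 240.0
point = backproject(u=380, v=240, depth_m=1.5, fx=fx, fy=fy, cx=cx, cy=cy)
print(point)  # x = (380 - 320) * 1.5 / 600 = 0.15, y = 0.0, z = 1.5
```

Note the result is in the camera frame; a further transform (e.g., via TF in ROS 2) is needed to express it in the robot base frame.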

Human Pose Estimation

For humanoid interaction, it is critical to:

  • Detect humans in the scene
  • Estimate their body pose (joint positions)

Uses:

  • Safety zones around humans
  • Gesture recognition and interactive behaviors
  • Demonstration-based learning and imitation

Pose estimators typically output:

  • 2D joint keypoints on the image
  • Optionally, 3D pose estimates relative to the camera or world
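Once you have 3D joint estimates in the robot frame, a safety-zone check reduces to a distance computation. A minimal sketch, assuming an illustrative safety radius and joint layout:

```python
import numpy as np

def min_joint_distance(joints_xyz, robot_point):
    """Smallest Euclidean distance from any detected joint to a robot point.

    joints_xyz: (N, 3) array of 3D joint positions in the robot frame.
    robot_point: (3,) position of, e.g., the end effector.
    """
    joints = np.asarray(joints_xyz, dtype=float)
    return float(np.linalg.norm(joints - robot_point, axis=1).min())

def in_safety_zone(joints_xyz, robot_point, radius_m=0.5):
    """True if any human joint is within radius_m of the robot point."""
    return min_joint_distance(joints_xyz, robot_point) < radius_m

# Two joints: one 0.3 m from the origin, one farther away
joints = [[0.3, 0.0, 0.0], [1.0, 1.0, 0.0]]
print(in_safety_zone(joints, np.array([0.0, 0.0, 0.0])))  # True: 0.3 m < 0.5 m
```

A real system would check all robot links (not a single point) and hysterese the flag to avoid chattering, but the per-frame computation is exactly this.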

2.3 Hands-On Vision Pipeline

In this module, you will build a real-time vision pipeline that:

  • Subscribes to RGB-D camera topics in ROS 2
  • Runs detection and segmentation models
  • Publishes a structured world state for downstream modules

Pipeline Stages

  1. Input acquisition:

    • Subscribe to /camera/color/image_raw and /camera/depth/image_raw
    • Optionally, synchronize frames and camera info
  2. Preprocessing:

    • Resize and normalize images
    • Convert depth to metric units and apply simple filtering (e.g., median blur)
  3. Inference:

    • Run object detection model on RGB
    • Optionally, run segmentation for fine-grained masks
  4. Postprocessing:

    • Apply non-maximum suppression to detections
    • Associate detections with depth to estimate 3D positions
    • Filter results by confidence and relevance
  5. Publishing structured outputs:

    • Publish a custom message (e.g., WorldObjects) containing:
      • Object IDs and classes
      • 2D bounding boxes and masks
      • Estimated 3D positions (relative to robot/base frame)
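The non-maximum suppression step in postprocessing can be sketched in plain Python. This is the classic greedy algorithm over IoU (intersection-over-union); detector libraries usually provide an optimized version, but the logic is worth seeing once:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring remaining box
        keep.append(best)
        # Drop any remaining box that overlaps it too much
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

# Two overlapping detections of the same object and one separate detection
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2] -- the duplicate at index 1 is suppressed
```

After NMS, each surviving box can be associated with depth (e.g., the median depth inside the box) and back-projected to a 3D position for the world-state message.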

Outputs for Downstream Use

The output of this pipeline should be simple and stable enough that:

  • Planners can query “nearest obstacle” or “target object pose”
  • Controllers can aim hands or feet at 3D targets
  • Multimodal modules (Module 4) can connect language to actual, spatially grounded objects
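The exact message layout is up to you; one illustrative sketch of a custom interface (field names here are not a standard ROS 2 message, and the 2D mask is omitted for brevity) might look like:

```
# WorldObject.msg (illustrative custom interface)
int32 id
string class_name
float32 confidence
# 2D bounding box in pixel coordinates: x_min, y_min, x_max, y_max
float32[4] bbox
# Estimated 3D position in the robot base frame (meters)
geometry_msgs/Point position

# WorldObjects.msg
std_msgs/Header header
WorldObject[] objects
```

Keeping the interface small and stable like this is what lets planners and controllers consume it without depending on which detector or segmenter produced it.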

By the end of Module 2, you will have:

  • A working perception pipeline that runs on live or simulated RGB-D data
  • Clear ROS 2 message interfaces for object-level world state
  • A foundation for mapping (Module 3) and multimodal reasoning (Module 4)