
Module 2: Computer Vision & Object Understanding

Module 2 turns raw images into objects, masks, and poses your robot can reason about. You will revisit classical computer vision, then focus on deep vision models for detection, segmentation, depth estimation, and human pose estimation, culminating in a real-time perception pipeline that runs on RGB-D data.

2.1 Classical Computer Vision

Before deep learning, perception pipelines relied on hand-crafted features:

  • Edge detection (Canny):
    • Finds strong intensity gradients (edges)
    • Useful for contour detection and shape analysis
  • Feature descriptors (SIFT, SURF, ORB):
    • Extract keypoints and descriptors that are largely invariant to scale and rotation
    • Enable matching across images for tracking and SLAM
  • Homography and projective transforms:
    • Map points between images (e.g., planar surfaces)
    • Used for image stitching, stabilization, and basic pose estimation
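To make the homography idea concrete, here is a minimal sketch of mapping 2D points through a 3x3 homography matrix. The helper name and the example matrix are illustrative; in practice you would estimate H from point correspondences (e.g., with RANSAC).

```python
import numpy as np

def apply_homography(H, points):
    """Map 2D points through a 3x3 homography (illustrative helper).

    points: (N, 2) array of pixel coordinates.
    Returns the transformed (N, 2) coordinates.
    """
    pts = np.asarray(points, dtype=float)
    # Lift to homogeneous coordinates: (x, y) -> (x, y, 1)
    ones = np.ones((pts.shape[0], 1))
    homog = np.hstack([pts, ones])   # (N, 3)
    mapped = homog @ H.T             # each row is H @ [x, y, 1]
    # Perspective divide to return to Cartesian coordinates
    return mapped[:, :2] / mapped[:, 2:3]

# A pure translation by (10, 5) expressed as a homography
H = np.array([[1.0, 0.0, 10.0],
              [0.0, 1.0,  5.0],
              [0.0, 0.0,  1.0]])
print(apply_homography(H, [[0, 0], [2, 3]]))  # [[10. 5.] [12. 8.]]
```

The perspective divide is what distinguishes a homography from an affine transform: for non-trivial bottom rows of H, straight lines are preserved but parallelism is not.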

Classical methods are still valuable when:

  • Data is scarce (no large labeled datasets)
  • Real-time performance with limited compute is critical
  • You need interpretable geometric reasoning (e.g., planar homographies)

In practice, you will often mix classical and deep methods:

  • Classical feature tracking + deep detection
  • Geometry-based pose refinement on top of learned features

2.2 Deep Vision Models

Deep learning has dramatically improved perception performance and flexibility:

Architectures

  • Convolutional Neural Networks (CNNs):
    • Exploit spatial locality
    • Still widely used in detectors and segmenters
  • Vision Transformers (ViTs):
    • Use self-attention over image patches
    • Excel at global context and can integrate naturally with language models
  • Hybrid models:
    • Combine CNN backbones with attention modules
    • Balance inductive bias with flexibility

Object Detection

Detection models predict:

  • Bounding boxes
  • Class labels
  • Confidence scores

Common families:

  • YOLO-style models:
    • Real-time performance
    • Good for onboard inference on GPUs
  • RT-DETR and transformer-based detectors:
    • Use attention to reason about object relations
    • Flexible for integration with multimodal systems

Use cases:

  • Detect obstacles (chairs, boxes, humans)
  • Identify task-relevant objects (mugs, laptops, tools)

Semantic and Instance Segmentation

  • Semantic segmentation:
    • Assigns a class label to every pixel
    • Useful for understanding what each region is (floor, wall, table, person)
  • Instance segmentation:
    • Separates individual instances of the same class
    • Important for manipulation (which “mug” to pick up)

Modern segmenters (e.g., Mask2Former-like architectures) use transformers to:

  • Model long-range context
  • Produce coherent masks even in cluttered scenes
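Instance masks become useful for manipulation once you reduce them to per-object quantities such as centroids. A minimal sketch, assuming the common convention of a label image where 0 is background and each positive integer is one instance:

```python
import numpy as np

def instance_centroids(label_img):
    """Compute the (row, col) centroid of each instance in a label image.

    label_img: 2D integer array; 0 = background, k > 0 = instance id k.
    Returns {instance_id: (row, col)} with float centroids.
    """
    centroids = {}
    for inst_id in np.unique(label_img):
        if inst_id == 0:
            continue  # skip background
        rows, cols = np.nonzero(label_img == inst_id)
        centroids[int(inst_id)] = (rows.mean(), cols.mean())
    return centroids

# Two "mugs" in a tiny 4x4 label image
labels = np.array([[1, 1, 0, 0],
                   [1, 1, 0, 2],
                   [0, 0, 0, 2],
                   [0, 0, 0, 0]])
print(instance_centroids(labels))  # {1: (0.5, 0.5), 2: (1.5, 3.0)}
```

Combined with depth at the centroid pixel, this is often enough to decide which instance is closest and therefore which mug to pick up.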

Depth Estimation & Monocular 3D

Beyond hardware depth sensors, deep models can:

  • Predict depth from a single RGB image (monocular depth)
  • Estimate surface normals and 3D layout

Benefits:

  • Provide approximate geometry where depth sensors fail
  • Support 3D reasoning even with simple cameras

Limitations:

  • Less accurate than dedicated depth sensors
  • Often require scene priors or training on similar environments
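Whether depth comes from a sensor or a monocular model, turning a pixel plus a metric depth value into a 3D point uses the standard pinhole camera model. A sketch, with made-up intrinsics (in practice they come from calibration, e.g., a CameraInfo message):

```python
import numpy as np

def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into camera coordinates.

    Pinhole model:
        X = (u - cx) * Z / fx,   Y = (v - cy) * Z / fy,   Z = depth
    """
    z = float(depth_m)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

# Hypothetical intrinsics for a 640x480 camera
fx, fy, cx, cy = 600.0, 600.0, 320.0, 240.0
point = backproject(u=380, v=240, depth_m=1.5, fx=fx, fy=fy, cx=cx, cy=cy)
print(point)  # x = (380 - 320) * 1.5 / 600 = 0.15, y = 0.0, z = 1.5
```

Note the result is in the camera frame; a further transform (e.g., via TF in ROS 2) is needed to express it in the robot base frame.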

Human Pose Estimation

For humanoid interaction, it is critical to:

  • Detect humans in the scene
  • Estimate their body pose (joint positions)

Uses:

  • Safety zones around humans
  • Gesture recognition and interactive behaviors
  • Demonstration-based learning and imitation

Pose estimators typically output:

  • 2D joint keypoints on the image
  • Optionally, 3D pose estimates relative to the camera or world
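Once you have 3D joint estimates in the robot frame, a safety-zone check reduces to a distance computation. A minimal sketch, assuming an illustrative safety radius and joint layout:

```python
import numpy as np

def min_joint_distance(joints_xyz, robot_point):
    """Smallest Euclidean distance from any detected joint to a robot point.

    joints_xyz: (N, 3) array of 3D joint positions in the robot frame.
    robot_point: (3,) position of, e.g., the end effector.
    """
    joints = np.asarray(joints_xyz, dtype=float)
    return float(np.linalg.norm(joints - robot_point, axis=1).min())

def in_safety_zone(joints_xyz, robot_point, radius_m=0.5):
    """True if any human joint is within radius_m of the robot point."""
    return min_joint_distance(joints_xyz, robot_point) < radius_m

# Two joints: one 0.3 m from the origin, one farther away
joints = [[0.3, 0.0, 0.0], [1.0, 1.0, 0.0]]
print(in_safety_zone(joints, np.array([0.0, 0.0, 0.0])))  # True: 0.3 m < 0.5 m
```

A real system would check all robot links (not a single point) and hysterese the flag to avoid chattering, but the per-frame computation is exactly this.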

2.3 Hands-On Vision Pipeline

In this module, you will build a real-time vision pipeline that:

  • Subscribes to RGB-D camera topics in ROS 2
  • Runs detection and segmentation models
  • Publishes a structured world state for downstream modules

Pipeline Stages

  1. Input acquisition:

    • Subscribe to /camera/color/image_raw and /camera/depth/image_raw
    • Optionally, synchronize frames and camera info
  2. Preprocessing:

    • Resize and normalize images
    • Convert depth to metric units and apply simple filtering (e.g., median blur)
  3. Inference:

    • Run object detection model on RGB
    • Optionally, run segmentation for fine-grained masks
  4. Postprocessing:

    • Apply non-maximum suppression to detections
    • Associate detections with depth to estimate 3D positions
    • Filter results by confidence and relevance
  5. Publishing structured outputs:

    • Publish a custom message (e.g., WorldObjects) containing:
      • Object IDs and classes
      • 2D bounding boxes and masks
      • Estimated 3D positions (relative to robot/base frame)
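The non-maximum suppression step in postprocessing can be sketched in plain Python. This is the classic greedy algorithm over IoU (intersection-over-union); detector libraries usually provide an optimized version, but the logic is worth seeing once:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring remaining box
        keep.append(best)
        # Drop any remaining box that overlaps it too much
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

# Two overlapping detections of the same object and one separate detection
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2] -- the duplicate at index 1 is suppressed
```

After NMS, each surviving box can be associated with depth (e.g., the median depth inside the box) and back-projected to a 3D position for the world-state message.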

Outputs for Downstream Use

The output of this pipeline should be simple and stable enough that:

  • Planners can query “nearest obstacle” or “target object pose”
  • Controllers can aim hands or feet at 3D targets
  • Multimodal modules (Module 4) can connect language to actual, spatially grounded objects
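The exact message layout is up to you; one illustrative sketch of a custom interface (field names here are not a standard ROS 2 message, and the 2D mask is omitted for brevity) might look like:

```
# WorldObject.msg (illustrative custom interface)
int32 id
string class_name
float32 confidence
# 2D bounding box in pixel coordinates: x_min, y_min, x_max, y_max
float32[4] bbox
# Estimated 3D position in the robot base frame (meters)
geometry_msgs/Point position

# WorldObjects.msg
std_msgs/Header header
WorldObject[] objects
```

Keeping the interface small and stable like this is what lets planners and controllers consume it without depending on which detector or segmenter produced it.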

By the end of Module 2, you will have:

  • A working perception pipeline that runs on live or simulated RGB-D data
  • Clear ROS 2 message interfaces for object-level world state
  • A foundation for mapping (Module 3) and multimodal reasoning (Module 4)