IONA — Autonomous Perception for Nursing Robot

Motivation

Modern hospital environments place enormous cognitive and physical demands on nursing staff. Routine tasks — organizing supplies, delivering food, adjusting medical devices, moving between patient locations — are repetitive yet critical. Errors in these tasks directly affect patient care.

Existing robotic systems either demand continuous teleoperation, which creates operator burden and limits scalability, or rely on data-intensive learning approaches that are slow to retrain and impractical for real-world deployment. There is a clear gap for a system that provides genuine task-level autonomy while remaining modular and easy to extend to new objects and tasks.

Key Results

90% Pick-and-place success rate

< 0.5 cm Planar position error (PoseMate)

< 5° Rotational error across all axes

80% Navigation success (indoor)

4 Hz LazyPose real-time rate

18 Object classes in custom dataset

System Overview

IONA is a bi-manual mobile humanoid robot built at WPI's Human-Inspired Robotics (HiRO) Lab. It combines a Fetch Freight 100 mobile base with two Kinova Gen3 7-DoF arms, Robotiq parallel grippers, and a four-camera RGB-D sensing suite. The software stack bridges a ROS Noetic robot-side system (low-level control, navigation) and an external ROS2 Humble compute node (perception, planning) over a custom TCP/UDP interface.

This thesis integrates a full autonomous perception subsystem into IONA's existing architecture, transitioning it from a predominantly teleoperated platform to one capable of executing structured nursing tasks — pick and place, shelf organization, and device manipulation — with minimal operator input.

📷 Figure — IONA Platform Fig. 3.4: IONA platform with front, side, and sensor views.

01 Scene Observation → 02 Object Detection → 03 Pose Estimation → 04 Motion Planning → 05 Task Execution → 06 Human Monitoring

🗂 Figure — System Architecture Fig. 3.1: High-level system architecture showing perception, planning, control, and hardware.

ROS2 Humble, ROS Noetic, PyTorch, Intel RealSense D435, Kinova Gen3, MoveIt + OMPL, RTAB-Map, FoundationPose, MMPose, OpenCV

Perception Module

The perception pipeline begins with a YOLO-based object detector trained on a custom dataset of 18 medical and food object classes (~1,200 annotated images collected via Roboflow). Detection outputs bounding boxes and segmentation masks that feed into pose estimation.

3D object meshes were acquired using a handheld Creality scanner and post-processed in Blender. These meshes underpin the high-precision pose estimation pathway.

🔍 Figure — Training Object Classes Fig. 4.2: 18 object classes spanning medical supplies and food items.

Hybrid Pose Estimation

A central contribution of this thesis is a dual-mode framework that dynamically selects between two complementary estimators based on object characteristics and task requirements.

Method	Speed	Accuracy	Planar Error	Depth Error	Use Case
PoseMate	10–20 s	High	< 0.5 cm	~1.5 cm	Complex / uncertain objects
LazyPose	4 Hz	Moderate	< 1 cm	< 3 cm	Simple / predictable objects
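
The selection between the two estimators can be pictured as a small rule over per-object metadata. The sketch below is only an illustration: the text says the choice depends on object characteristics and task requirements but does not give the rule, so the `ObjectProfile` fields, the `needs_full_6d` flag, and the decision order are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Estimator(Enum):
    POSEMATE = "PoseMate"
    LAZYPOSE = "LazyPose"


@dataclass(frozen=True)
class ObjectProfile:
    """Hypothetical per-object metadata; field names are illustrative."""
    name: str
    has_mesh: bool          # a pre-scanned mesh is available
    simple_geometry: bool   # position + planar yaw describes the pose well enough


def select_estimator(obj: ObjectProfile, needs_full_6d: bool) -> Estimator:
    """Pick an estimator for one object and task (assumed rule, not the thesis logic)."""
    # PoseMate is mesh-based, so objects without a mesh can only use LazyPose.
    if not obj.has_mesh:
        return Estimator.LAZYPOSE
    # Tasks that need full 6D pose, or objects with complex geometry, take the slow path.
    if needs_full_6d or not obj.simple_geometry:
        return Estimator.POSEMATE
    return Estimator.LAZYPOSE
```

Keeping the rule in one function means a new object class only needs a new profile entry, which fits the modular design described elsewhere on this page.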

PoseMate builds on FoundationPose, using RGB-D data, segmentation masks, and pre-scanned meshes to compute full 6D pose. A validation pipeline (depth reprojection error, mask IoU) filters and aggregates estimates across five captured frames before accepting a result.
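
A rough sketch of how such a validation step might look is given below. It is not the thesis implementation: the thresholds, the dictionary keys, the minimum number of surviving frames, and the use of a median are assumptions, and only the translation is aggregated here because averaging rotations needs more care.

```python
import numpy as np


def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over union of two boolean masks of equal shape."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0


def depth_error(rendered: np.ndarray, observed: np.ndarray, mask: np.ndarray) -> float:
    """Mean absolute depth difference (metres) over masked pixels with valid depth."""
    valid = mask.astype(bool) & (observed > 0) & (rendered > 0)
    if not valid.any():
        return float("inf")
    return float(np.abs(rendered[valid] - observed[valid]).mean())


def aggregate_translation(frames, min_iou=0.8, max_depth_err=0.02, min_valid=3):
    """Filter per-frame estimates and return the median translation, or None.

    `frames` is a list of dicts with keys 'translation' (3-vector, metres),
    'rendered_mask', 'observed_mask', 'rendered_depth', 'observed_depth'.
    All thresholds are assumed values chosen for illustration.
    """
    kept = []
    for f in frames:
        iou = mask_iou(f["rendered_mask"], f["observed_mask"])
        err = depth_error(f["rendered_depth"], f["observed_depth"], f["rendered_mask"])
        if iou >= min_iou and err <= max_depth_err:
            kept.append(np.asarray(f["translation"], dtype=float))
    if len(kept) < min_valid:
        return None  # too few consistent frames: reject the estimate
    return np.median(np.stack(kept), axis=0)
```

With five captured frames, a call such as `aggregate_translation(frames)` would return one translation when at least three frames pass both checks and `None` otherwise, which the caller can treat as a signal to re-capture.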

LazyPose is a lightweight 4D estimator (3D position + planar yaw) that applies PCA on the segmentation mask and retrieves depth at the median pixel. It runs at 4 Hz, enabling continuous streaming for objects with predictable geometry.
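
The idea can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the LazyPose code: it assumes a depth image aligned to the mask, a pinhole camera with intrinsics `fx, fy, cx, cy`, that "median pixel" means the median of the mask's pixel coordinates, and that yaw is measured in the image plane.

```python
import numpy as np


def lazy_pose(mask, depth, fx, fy, cx, cy):
    """Return (x, y, z, yaw) in the camera frame from a mask and aligned depth image.

    mask:  HxW boolean array, True on the object.
    depth: HxW array of depth in metres, aligned to the mask.
    fx, fy, cx, cy: pinhole intrinsics in pixels (assumed known).
    """
    vs, us = np.nonzero(mask)  # row (v) and column (u) indices of object pixels
    if us.size < 2:
        raise ValueError("mask has too few pixels")

    # PCA on pixel coordinates: the eigenvector with the largest eigenvalue
    # of the 2x2 covariance matrix is the object's long axis in the image.
    pts = np.stack([us, vs], axis=1).astype(float)
    cov = np.cov(pts, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)  # eigenvalues come back in ascending order
    major = eigvecs[:, -1]
    yaw = float(np.arctan2(major[1], major[0]))
    # An axis has no direction, so fold the angle into (-pi/2, pi/2].
    if yaw > np.pi / 2:
        yaw -= np.pi
    elif yaw <= -np.pi / 2:
        yaw += np.pi

    # Median pixel of the mask, then the depth reading at that pixel.
    u = int(np.median(us))
    v = int(np.median(vs))
    z = float(depth[v, u])
    if z <= 0:
        raise ValueError("no valid depth at the median pixel")

    # Back-project the pixel through the pinhole model.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return x, y, z, yaw
```

The work per frame is one small eigendecomposition and one depth lookup, which is why this style of estimator can stream continuously. For concave shapes the median pixel can fall outside the mask, so a real system would likely need a fallback.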

📐 Figure — PoseMate & LazyPose comparison Fig. 4.4: Accuracy vs. compute trade-off.

📐 Figure — Example outputs Fig. 4.5: PoseMate (left) and LazyPose (right) outputs.

Navigation & SLAM

Navigation runs as an independent ROS Noetic module using RTAB-Map for RGB-D visual SLAM and the move_base framework with the DWA local planner for goal-directed motion. A single RealSense D435i (RGB + depth + IMU) handles all navigation sensing.

Maps are built offline via teleoperation and reloaded for autonomous deployment. Loop closure corrects accumulated localization drift. Navigation and manipulation are sequenced — not concurrent — reducing inter-module interference.

🗺️ Figure — Lab Environment & RTAB-Map Reconstruction Fig. 5.2: Lab environment and its visual SLAM map.

Human-Aware Execution

Human detection uses MMPose (keypoint estimation at ~15 Hz), tracking shoulder and hip joints to estimate body velocity. If estimated velocity exceeds 0.5 m/s, the robot pauses and retracts its arms. Execution resumes after 5 seconds of confirmed stability — either no human detected in the workspace, or negligible motion observed.
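
The pause-and-resume behaviour described above can be sketched as a small state machine. The 0.5 m/s and 5 s values come from the text; everything else is an assumption for illustration, including the `HumanMonitor` class, the 0.05 m/s threshold standing in for "negligible motion", and the availability of shoulder and hip keypoints as 3D points in metres.

```python
import math
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def torso_center(keypoints: Sequence[Point]) -> Point:
    """Mean of the shoulder and hip keypoints (metres)."""
    n = len(keypoints)
    return tuple(sum(p[i] for p in keypoints) / n for i in range(3))


class HumanMonitor:
    PAUSE_SPEED = 0.5    # m/s, from the text
    RESUME_AFTER = 5.0   # s, from the text
    STILL_SPEED = 0.05   # m/s, assumed meaning of "negligible motion"

    def __init__(self):
        self.paused = False
        self._last = None          # (time, torso centre) of the previous detection
        self._stable_since = None  # time at which the current stable period began

    def update(self, t: float, torso: Optional[Point]) -> bool:
        """Feed one frame; returns True while the robot should stay paused."""
        speed = 0.0
        if torso is not None and self._last is not None:
            dt = t - self._last[0]
            if dt > 0:
                speed = math.dist(torso, self._last[1]) / dt
        self._last = (t, torso) if torso is not None else None

        if speed > self.PAUSE_SPEED:
            # Fast motion: pause (arm retraction would be triggered here).
            self.paused = True
            self._stable_since = None
        elif self.paused:
            stable = torso is None or speed <= self.STILL_SPEED
            if not stable:
                self._stable_since = None      # any motion restarts the timer
            elif self._stable_since is None:
                self._stable_since = t
            elif t - self._stable_since >= self.RESUME_AFTER:
                self.paused = False
                self._stable_since = None
        return self.paused
```

In use, each pose-estimation frame would call `monitor.update(timestamp, torso_center(keypoints))`, or pass `None` when no person is detected, and the task executor would hold motion while the return value is `True`.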

This mechanism requires no full scene understanding and adds minimal computational overhead, enabling safe co-existence in shared spaces without dedicated safety hardware.

Hardware Platform

🤖 Figure — Mobile Base Fig. 3.2: Fetch Freight 100.

🦾 Figure — Kinova Gen3 Arms Fig. 3.3: Dual 7-DoF arms with grippers.

Mobile base: Fetch Freight 100 with onboard RealSense D435i for SLAM.
Manipulators: 2× Kinova Gen3 7-DoF arms, Robotiq 2-finger parallel grippers (85 mm stroke).
Cameras: Chest D435 (primary detection), Neck D435 (scene/human tracking), Left Wrist Kinova camera (close-range), Base D435i with IMU (navigation).
Compute: Onboard Fetch computer (ROS Noetic) + External workstation Intel i7 / RTX 3080 (ROS2 Humble, perception and planning).

Contributions

Integrated perception module into IONA — transitioning from a teleoperated platform to structured autonomy through object detection, pose estimation, active camera selection, and human monitoring.
Hybrid pose estimation framework — dual-mode system combining LazyPose (fast, 4 Hz) and PoseMate (high-precision mesh-based), selected dynamically per object.
Custom dataset and mesh pipeline — ~1,200 annotated images across 18 classes via Roboflow; 3D meshes via Creality scanner and Blender post-processing.
Active viewpoint selection — task-driven single-camera strategy as a practical alternative to expensive multi-camera sensor fusion.
End-to-end task pipelines — functional pick-and-place, shelf organization, and device manipulation demonstrating perception-driven autonomous execution on a physical robot.
Extensible modular architecture — new object classes, meshes, and task routines can be added with minimal changes to the pipeline.

Autonomous Perception for Mobile Manipulator Nursing Robot