
TRACE: 5D Temporal Regression of Avatars
with Dynamic Cameras in 3D Environments


Abstract

Although the estimation of 3D human pose and shape (HPS) is rapidly progressing, current methods still cannot reliably estimate moving humans in global coordinates, which is critical for many applications. This is particularly challenging when the camera is also moving, entangling human and camera motion. To address these issues, we adopt a novel 5D representation (space, time, and identity) that enables end-to-end reasoning about people in scenes. Our method, called TRACE, introduces several novel architectural components. Most importantly, it uses two new "maps" to reason about the 3D trajectory of people over time in camera and world coordinates. An additional memory unit enables persistent tracking of people even during long occlusions. TRACE is the first one-stage method to jointly recover and track 3D humans in global coordinates from dynamic cameras. Trained end-to-end and using full image information, TRACE achieves state-of-the-art performance on tracking and HPS benchmarks. The code and dataset are released for research purposes.

Overview


Given a video sequence {fi, i = 1, ..., N} of N frames captured with a dynamic camera, the user specifies the K subjects to track in the first frame.


Our goal is to simultaneously recover the 3D pose, shape, identity, and trajectory of each subject in global coordinates. To achieve this, TRACE first extracts temporal features and then decodes each sub-task with a separate head network.
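As a concrete picture of these per-subject quantities, here is a minimal sketch (hypothetical names, not the released API) of what a TRACE-style pipeline accumulates for each tracked person over a video:

from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class SubjectTrack:
    """Per-subject outputs accumulated over the N frames (illustrative)."""
    subject_id: int
    poses: List[np.ndarray] = field(default_factory=list)            # SMPL pose (72,) per frame
    betas: List[np.ndarray] = field(default_factory=list)            # SMPL shape (10,) per frame
    positions_cam: List[np.ndarray] = field(default_factory=list)    # 3D position in camera coordinates
    positions_world: List[np.ndarray] = field(default_factory=list)  # 3D position in global coordinates

# one entry per frame would be appended by the detection/tracking/mesh/world heads
track = SubjectTrack(subject_id=0)
track.poses.append(np.zeros(72)); track.betas.append(np.zeros(10))
track.positions_cam.append(np.zeros(3)); track.positions_world.append(np.zeros(3))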

First, via two parallel backbones, TRACE encodes the video and its motion into temporal image feature maps F̄i and motion feature maps Oi.
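A minimal sketch of such a two-stream encoder, assuming generic convolutional backbones and using a frame difference as a stand-in for the learned motion representation (illustrative only, not the released architecture):

import torch
import torch.nn as nn

class DualStreamEncoder(nn.Module):
    """Two parallel backbones: one for appearance, one for motion (illustrative)."""
    def __init__(self, feat_dim=64):
        super().__init__()
        def backbone():
            return nn.Sequential(
                nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU())
        self.image_backbone = backbone()   # produces the temporal image features (F̄i analogue)
        self.motion_backbone = backbone()  # produces the motion features (Oi analogue)

    def forward(self, frame_t, frame_prev):
        image_feat = self.image_backbone(frame_t)
        motion_feat = self.motion_backbone(frame_t - frame_prev)  # frame difference as the motion cue
        return image_feat, motion_feat

encoder = DualStreamEncoder()
f_t, f_prev = torch.randn(1, 3, 128, 128), torch.randn(1, 3, 128, 128)
image_feat, motion_feat = encoder(f_t, f_prev)  # each (1, 64, 32, 32)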

The Detection and Tracking branches take these features and perform multi-subject tracking to recover 3D human trajectories in camera coordinates.

Details of the Detection branch.

Unlike BEV, our Detection branch takes the temporal image features F̄i as input. It uses these features to detect the 3D human positions ti and their confidences ci for all people in frame fi.
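Such a detection head can be pictured as in the following sketch, loosely in the spirit of BEV-style center maps (layer sizes and thresholding are assumptions for illustration): a per-pixel confidence map plus 3D position channels, with detections read out wherever the confidence is high.

import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Predicts a confidence map and per-pixel 3D positions (illustrative)."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.conf = nn.Conv2d(feat_dim, 1, 1)   # ci: detection confidence
        self.pos3d = nn.Conv2d(feat_dim, 3, 1)  # ti: 3D position channels

    def forward(self, image_feat, thresh=0.5):
        conf = torch.sigmoid(self.conf(image_feat))[0, 0]  # (H, W)
        pos = self.pos3d(image_feat)[0]                    # (3, H, W)
        ys, xs = torch.where(conf > thresh)
        # one (confidence, 3D position) pair per responding pixel
        return [(conf[y, x].item(), pos[:, y, x]) for y, x in zip(ys, xs)]

detections = DetectionHead()(torch.randn(1, 64, 32, 32))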


The Tracking branch estimates a 3D Motion Offset map, which indicates each subject's 3D position change from the previous frame to the current one. From the estimated 3D detections and 3D motion offsets, a novel memory unit determines the subject identities and builds human trajectories in camera coordinates.
Details of the Tracking branch.

The combined features (F̄i, Oi) are fed to our novel Tracking branch to estimate the 3D Motion Offset map, which indicates the 3D position change of each subject across frames. The new Memory Unit takes the 3D detections and their 3D motion offsets as input, determines the subject identities, and builds the trajectories ti of the K subjects in camera coordinates. Note that, like BEV, our Detection branch finds all the people in the video frames, but our goal is to track only the K specified subjects; the memory unit therefore filters out detected people who do not match the subject trajectories.
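A minimal sketch of the association step such a memory unit performs, assuming greedy nearest-neighbour matching in 3D (the released memory unit may differ in detail):

import numpy as np

def associate(memory, detections, offsets, max_dist=0.5):
    """memory: {subject_id: last 3D position}; detections: list of 3D positions in the
    current frame; offsets: {subject_id: estimated 3D motion offset for this frame}."""
    unmatched = list(range(len(detections)))
    matches = {}
    for sid, last_pos in memory.items():
        if not unmatched:
            break  # no detections left; subject stays in memory (e.g. during occlusion)
        predicted = last_pos + offsets.get(sid, np.zeros(3))
        dists = [np.linalg.norm(detections[j] - predicted) for j in unmatched]
        best = int(np.argmin(dists))
        if dists[best] < max_dist:
            j = unmatched.pop(best)
            matches[sid] = j
            memory[sid] = detections[j]  # update the subject's stored position
    return matches  # detections left in `unmatched` belong to non-subjects and are dropped

memory = {0: np.array([0.0, 0.0, 3.0])}
offsets = {0: np.array([0.1, 0.0, 0.0])}
detections = [np.array([0.12, 0.0, 3.0]), np.array([2.0, 0.0, 5.0])]
print(associate(memory, detections, offsets))  # {0: 0}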

The Mesh branch regresses all the human mesh parameters, in SMPL format, from the feature maps. Unlike BEV, it takes both the temporal image features and the motion features as input.
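A sketch of what such a mesh head could look like, sampling the feature maps at a detected location and regressing SMPL parameters with an MLP (the parameter split and layer sizes below are assumptions for illustration, not the released head):

import torch
import torch.nn as nn

class MeshHead(nn.Module):
    """Regresses SMPL parameters from the feature vector at a detection (illustrative)."""
    def __init__(self, feat_dim=128, n_pose=72, n_betas=10, n_cam=3):
        super().__init__()
        self.n_pose, self.n_betas = n_pose, n_betas
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, n_pose + n_betas + n_cam))

    def forward(self, feat_map, y, x):
        f = feat_map[0, :, y, x]                               # sample features at the detected location
        out = self.mlp(f)
        pose = out[:self.n_pose]                               # SMPL pose (axis-angle)
        betas = out[self.n_pose:self.n_pose + self.n_betas]    # SMPL shape
        cam = out[self.n_pose + self.n_betas:]                 # camera / translation parameters
        return pose, betas, cam

pose, betas, cam = MeshHead()(torch.randn(1, 128, 32, 32), y=10, x=20)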

Finally, to estimate subject trajectories in world coordinates, the novel World branch estimates a world motion map, representing the 3D orientation τi and 3D translation offset ΔTi of the K subjects in global coordinates. Starting from the subjects' 3D positions in the first frame in camera coordinates, we accumulate the 3D translation offsets ΔTi to obtain the global 3D trajectory T.
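The accumulation itself is a simple running sum; a sketch (illustrative helper, not the released code):

import numpy as np

def accumulate_trajectory(t0_world, delta_Ts):
    """t0_world: (3,) subject position in the first frame, lifted to world coordinates;
    delta_Ts: iterable of (3,) per-frame world translation offsets ΔTi."""
    trajectory = [np.asarray(t0_world, dtype=float)]
    for dT in delta_Ts:
        trajectory.append(trajectory[-1] + dT)   # T_i = T_{i-1} + ΔT_i
    return np.stack(trajectory)                  # (N, 3) global 3D trajectory T

T = accumulate_trajectory([0.0, 0.0, 5.0], [np.array([0.1, 0.0, 0.0])] * 4)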

DynaCam Dataset


Even with a powerful 5D representation, we still lack in-the-wild data for training and evaluating global human trajectory estimation, because collecting global human trajectories and camera poses for natural videos captured by dynamic cameras (DC-videos) is difficult. Therefore, we create a new dataset, DynaCam, by simulating camera motion to convert in-the-wild videos captured by static cameras into DC-videos.

How do we simulate a moving camera from videos captured by static cameras?


To simulate a rotating camera, we project panoramic video frames into regular perspective views.
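A minimal sketch of such an equirectangular-to-perspective projection for a chosen virtual yaw/pitch and field of view (nearest-neighbour sampling under a standard pinhole model; illustrative, not the exact DynaCam tooling):

import numpy as np

def pano_to_perspective(pano, fov_deg=60.0, yaw_deg=0.0, pitch_deg=0.0, out_hw=(256, 256)):
    """Render a pinhole view from an equirectangular panorama (nearest-neighbour)."""
    H, W = out_hw
    ph, pw = pano.shape[:2]
    f = 0.5 * W / np.tan(np.radians(fov_deg) / 2.0)                  # focal length from the FOV
    xs, ys = np.meshgrid(np.arange(W) - W / 2.0, np.arange(H) - H / 2.0)
    dirs = np.stack([xs, ys, np.full_like(xs, f)], axis=-1)          # one ray per output pixel
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
    Ry = np.array([[np.cos(yaw), 0, np.sin(yaw)], [0, 1, 0], [-np.sin(yaw), 0, np.cos(yaw)]])
    Rx = np.array([[1, 0, 0], [0, np.cos(pitch), -np.sin(pitch)], [0, np.sin(pitch), np.cos(pitch)]])
    dirs = dirs @ (Ry @ Rx).T                                        # rotate rays by the virtual camera pose
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])                     # longitude in [-pi, pi]
    lat = np.arcsin(np.clip(dirs[..., 1], -1.0, 1.0))                # latitude in [-pi/2, pi/2]
    u = ((lon / (2 * np.pi) + 0.5) * (pw - 1)).astype(int)
    v = ((lat / np.pi + 0.5) * (ph - 1)).astype(int)
    return pano[v, u]

view = pano_to_perspective(np.zeros((512, 1024, 3)), yaw_deg=30.0)   # (256, 256, 3)

Sweeping the yaw over the frames of a panoramic video yields a sequence that looks as if it were filmed by a rotating perspective camera with a known pose.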


To simulate the 3D translation of a dynamic camera, we crop regular videos captured by static cameras with a sliding window. We then generate pseudo-ground-truth 3D human annotations by fitting SMPL to detected 2D pose sequences. Given these pseudo-ground-truth human poses and the camera pose annotations, we obtain global human trajectories using the PnP algorithm.
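As an illustration of the last step, the following sketch uses OpenCV's solvePnP to place a person's root in camera coordinates from fitted root-relative 3D joints and detected 2D joints, and then maps it into world coordinates with the known simulated camera pose (function name and interface are hypothetical; the DynaCam annotation pipeline may differ in detail):

import numpy as np
import cv2  # OpenCV

def root_world_position(joints3d_local, joints2d, K, R_w2c, t_w2c):
    """joints3d_local: (J, 3) root-relative 3D joints from the fitted SMPL body;
    joints2d: (J, 2) detected 2D joints; K: (3, 3) camera intrinsics;
    R_w2c, t_w2c: the known world-to-camera rotation and translation."""
    ok, rvec, tvec = cv2.solvePnP(joints3d_local.astype(np.float32),
                                  joints2d.astype(np.float32),
                                  K.astype(np.float32), None)
    if not ok:
        return None
    t_cam = tvec.reshape(3)                       # person root in camera coordinates
    return R_w2c.T @ (t_cam - t_w2c.reshape(3))   # map camera coordinates into world coordinates

Collecting this position over all frames of a DC-video gives the subject's global trajectory annotation.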

Demos


Qualitative comparison to previous SOTA methods.


Citation

@InProceedings{TRACE,
    author = {Sun, Yu and Bao, Qian and Liu, Wu and Mei, Tao and Black, Michael J.},
    title = {{TRACE: 5D Temporal Regression of Avatars with Dynamic Cameras in 3D Environments}},
    booktitle = {IEEE/CVF Conf.~on Computer Vision and Pattern Recognition (CVPR)},
    month = {June},
    year = {2023}}

Acknowledgement

This work was supported by the National Key R&D Program of China under Grant No. 2020AAA0103800.
MJB Disclosure: https://files.is.tue.mpg.de/black/CoI_CVPR_2023.txt
This project page is modified from this amazing page.

Contact

For questions or discussion, please contact Yu Sun: yusunhit@gmail.com