Show HN: Building a self-collected multiview mocap studio for robot training
The basic pipeline as of right now looks like this:
1. Capture – Four iPhones and an Insta360 Go. The iPhones provide the exocentric views and are captured via Final Cut Pro Multicam for easy sync; the Insta360 Go provides the egocentric view.
2. Sync – A custom Gradio app using two @rerundotio viewers and callbacks for aligning frame timestamps, so the ego and exo views share a common timeline.
3. Calibrate – Use VGGT from Meta to get intrinsics/extrinsics for the sparse camera setup (a rough usage sketch follows this list).
4. Estimate 3D – Run the RTMLib whole-body keypoint estimator on each frame in each view, then triangulate the 2D detections into 3D (a triangulation sketch also follows the list).
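For step 3, the calibration is essentially stock VGGT inference over one synced frame per camera, then reading off the predicted poses. Below is a minimal sketch; the module paths, the `pose_enc` output key, and the `facebook/VGGT-1B` checkpoint name are taken from my reading of the VGGT repo and may not match the current release, so treat them as assumptions.

```python
import torch
# Entry points below follow the VGGT repo's README as I recall it; treat as assumptions.
from vggt.models.vggt import VGGT
from vggt.utils.load_fn import load_and_preprocess_images
from vggt.utils.pose_enc import pose_encoding_to_extri_intri

device = "cuda" if torch.cuda.is_available() else "cpu"
model = VGGT.from_pretrained("facebook/VGGT-1B").to(device).eval()

# One image per camera, taken at (roughly) the same synced timestamp.
image_paths = ["cam0.png", "cam1.png", "cam2.png", "cam3.png"]
images = load_and_preprocess_images(image_paths).to(device)

with torch.no_grad():
    predictions = model(images)

# Convert the predicted pose encoding into per-view extrinsics and intrinsics.
extrinsics, intrinsics = pose_encoding_to_extri_intri(
    predictions["pose_enc"], images.shape[-2:]
)
```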
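Step 4 boils down to classic linear (DLT) triangulation once you have 2D keypoints and camera matrices per view. Here is a minimal numpy sketch; the array shapes, the `score_thresh` cutoff, and the helper names are illustrative assumptions, not the actual interface in the pi0-lerobot branch.

```python
import numpy as np

def triangulate_point(projections, points_2d):
    """Linear (DLT) triangulation of one keypoint seen in several views.

    projections: list of 3x4 camera projection matrices P = K @ [R | t]
    points_2d:   list of (x, y) pixel coordinates, one per view
    Returns a 3D point in world coordinates.
    """
    rows = []
    for P, (x, y) in zip(projections, points_2d):
        # Each view contributes two linear constraints on the homogeneous point X.
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def triangulate_skeleton(projections, keypoints_2d, scores, min_views=2, score_thresh=0.5):
    """Triangulate all keypoints for one frame.

    keypoints_2d: (num_cams, num_joints, 2) pixel coordinates from the 2D estimator
    scores:       (num_cams, num_joints) per-joint confidences
    Low-confidence detections are dropped before triangulation.
    """
    num_joints = keypoints_2d.shape[1]
    points_3d = np.full((num_joints, 3), np.nan)
    for j in range(num_joints):
        views = [c for c in range(len(projections)) if scores[c, j] > score_thresh]
        if len(views) < min_views:
            continue  # not enough confident observations for this joint
        points_3d[j] = triangulate_point(
            [projections[c] for c in views],
            [keypoints_2d[c, j] for c in views],
        )
    return points_3d
```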
What's missing?
1. No temporal coherence: keypoints are estimated one frame at a time and one camera at a time, which causes a lot of jitter. For now, I plan on adding a One Euro Filter to smooth it (a minimal sketch follows this list); long term, I'd want to train a multiview keypoint estimator.
2. Kinematic fitting is still missing; this is my next goal. The output will be joint angles, as explored in my previous posts.
3. Missing dense point cloud: VGGT seems to fail for me here. I'm looking to explore MP-SFM for generating dense multiview depth maps + normals (plus it has a friendlier license compared to VGGT).
4. Eventually, 4D Gaussian splat creation using something akin to DN-splatter; my long-term goal is a data engine that provides poses/depths/splats/keypoints/etc.
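The One Euro Filter mentioned in item 1 is simple enough to sketch. Below is a minimal fixed-frame-rate version (the original formulation from Casiez et al. uses per-sample timestamps), applied one instance per joint coordinate per track; the parameter values are illustrative defaults, not tuned.

```python
import math

class OneEuroFilter:
    """One Euro Filter (Casiez et al. 2012) for a single scalar signal.

    min_cutoff and beta are the usual tuning knobs: lower min_cutoff removes
    more jitter at rest, higher beta reduces lag during fast motion.
    """

    def __init__(self, freq, min_cutoff=1.0, beta=0.01, d_cutoff=1.0):
        self.freq = freq            # sampling rate in Hz (e.g. video fps)
        self.min_cutoff = min_cutoff
        self.beta = beta
        self.d_cutoff = d_cutoff
        self.x_prev = None
        self.dx_prev = 0.0

    def _alpha(self, cutoff):
        # Smoothing factor for a first-order low-pass filter at this cutoff.
        tau = 1.0 / (2.0 * math.pi * cutoff)
        te = 1.0 / self.freq
        return 1.0 / (1.0 + tau / te)

    def __call__(self, x):
        if self.x_prev is None:
            self.x_prev = x
            return x
        # Filter the derivative, then use its magnitude to adapt the cutoff.
        dx = (x - self.x_prev) * self.freq
        a_d = self._alpha(self.d_cutoff)
        dx_hat = a_d * dx + (1.0 - a_d) * self.dx_prev
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = self._alpha(cutoff)
        x_hat = a * x + (1.0 - a) * self.x_prev
        self.x_prev, self.dx_prev = x_hat, dx_hat
        return x_hat

# Usage: one filter per joint coordinate, fed frame by frame.
# filters = [OneEuroFilter(freq=30.0) for _ in range(num_joints * 3)]
```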
Links (still a work in progress, but wanted to share for folks who want to dig in):
1. Saved RRD visualization – <https://app.rerun.io/version/0.23.2/index.html?url=https://h...>
2. Multicam ego/exo sync app – <https://github.com/pablovela5620/multicam-ego-sync>
3. 3D person detection + triangulation – <https://github.com/rerun-io/pi0-lerobot/tree/hand-kinematic-...>