Markerless 3D motion capture based on MediaPipe

Human motion tracking is the acquisition and reconstruction of human movement. The current gold standard is marker-based motion capture, which uses reflective markers placed on key anatomical landmarks. Such systems are highly accurate, but they are expensive, require long preparation times, and can be used only inside a laboratory. An efficient alternative is a markerless system, which uses RGB cameras or depth sensors to estimate the human pose. MediaPipe is an open-source framework that provides 2D and approximate 3D coordinates of human body landmarks from a single RGB camera. It offers two main running modes: image and video. Image mode produces outputs without delay, but the coordinates may be noisy and unstable. Video mode produces more consistent and stable results, but they are affected by a temporal delay introduced by a filtering process. In the Magic Climbing project, a triangulation algorithm was already available for 3D reconstruction. My contribution was to implement a Kalman Filter (KF) with four different models, followed by an Interacting Multiple Model (IMM) algorithm, to improve the accuracy and robustness of the 3D reconstruction. To address the specific issues of the image and video running modes, I tested two strategies. In the first, I implemented an interquartile range (IQR) filter and applied it to the 2D image-mode coordinates to reduce the impact of outlier measurements. In the second, I used video mode and, to eliminate the delay, developed two single-input single-output (SISO) models to learn the relationship between delayed and non-delayed 2D coordinates. To evaluate the proposed solutions, I adopted some metrics from the literature and proposed others tailored to this work.
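The IQR filtering step of the first strategy can be illustrated with a minimal sketch. It assumes a sliding window centred on each frame, Tukey fences at 1.5 times the interquartile range, and replacement of flagged samples with the window median. The function name `iqr_filter`, the window length, and the replacement rule are illustrative assumptions, not the implementation used in this work.

```python
from statistics import median, quantiles


def iqr_filter(track, window=9, k=1.5):
    """Replace outliers in one 2D coordinate track (e.g. the x values of a landmark).

    Hypothetical helper: each sample is compared with Tukey fences computed
    from the samples in a window centred on it; samples outside the fences
    are replaced by the window median.
    """
    half = window // 2
    cleaned = []
    for i, x in enumerate(track):
        neighbours = track[max(0, i - half): i + half + 1]
        if len(neighbours) < 4:
            # Too few samples to estimate quartiles: keep the value as it is.
            cleaned.append(x)
            continue
        q1, _, q3 = quantiles(neighbours, n=4, method="inclusive")
        spread = q3 - q1
        low, high = q1 - k * spread, q3 + k * spread
        cleaned.append(x if low <= x <= high else median(neighbours))
    return cleaned


# Normalised x coordinate of one landmark over nine frames, with one spike.
track = [0.50, 0.51, 0.52, 0.53, 0.95, 0.55, 0.56, 0.57, 0.58]
cleaned = iqr_filter(track)
# The 0.95 spike at index 4 is replaced by the window median, 0.55.
```

A centred window needs future frames, so this variant suits offline processing; a causal variant would build the fences from past frames only.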
The results show that KF and IMM significantly improve accuracy over triangulation alone, and that the KF with the higher-order model offers the best overall trade-off between accuracy and complexity. The IQR filter and the SISO models do not improve performance, so the best available 2D data are those obtained in video mode, despite the delay.
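As an illustration of the Kalman filtering idea, the sketch below tracks a single coordinate with a constant-velocity motion model. The class name `ConstantVelocityKF`, the white-noise-acceleration process model, and the values of `accel_var` and `meas_var` are assumptions chosen for the example; the abstract does not specify the four models used in this work, and the actual filter operates on the triangulated 3D landmarks rather than on one axis.

```python
class ConstantVelocityKF:
    """Hypothetical 1D Kalman filter with state (position, velocity)."""

    def __init__(self, dt, accel_var, meas_var, p0=0.0):
        self.dt = dt          # sampling period in seconds
        self.q = accel_var    # variance of the assumed random acceleration
        self.r = meas_var     # variance of the position measurement
        self.pos, self.vel = p0, 0.0
        # Symmetric 2x2 covariance stored as [[a, b], [b, c]].
        self.a, self.b, self.c = 1.0, 0.0, 1.0

    def step(self, z):
        dt = self.dt
        # Predict: x = F x and P = F P F^T + Q, with F = [[1, dt], [0, 1]].
        pos = self.pos + dt * self.vel
        vel = self.vel
        a = self.a + 2 * dt * self.b + dt * dt * self.c + self.q * dt**4 / 4
        b = self.b + dt * self.c + self.q * dt**3 / 2
        c = self.c + self.q * dt**2
        # Update with a position-only measurement, H = [1, 0].
        s = a + self.r              # innovation variance
        k0, k1 = a / s, b / s       # Kalman gain
        y = z - pos                 # innovation
        self.pos = pos + k0 * y
        self.vel = vel + k1 * y
        self.a = (1 - k0) * a
        self.b = (1 - k0) * b
        self.c = c - k1 * b
        return self.pos


# Feed a constant measurement at 30 frames per second.
kf = ConstantVelocityKF(dt=1 / 30, accel_var=1.0, meas_var=0.01)
for _ in range(200):
    est = kf.step(2.0)
# After the transient, the estimate settles on the measured value.
```

An IMM estimator would run several such filters with different motion models in parallel and blend their estimates according to model probabilities; that step is not shown here.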

Human motion tracking is the process of acquiring and reconstructing human movement. To date, the reference standard is represented by marker-based systems, which apply reflective markers to anatomical landmarks. Such a system is highly accurate, but it is very expensive, requires long preparation times, and can be used only in laboratories. An efficient alternative is offered by markerless systems, which use RGB cameras or depth sensors to estimate the human pose. MediaPipe is an open-source framework that provides 2D and approximate 3D coordinates of anatomical landmarks using a single RGB camera. In addition, MediaPipe offers two main running modes: image and video. Image mode produces coordinates that are free of delay but unstable, whereas video mode produces more stable results that are affected by a temporal delay introduced by a filtering process. In the Magic Climbing project, a triangulation algorithm was already available for 3D reconstruction. My contribution was to implement a Kalman Filter (KF) and an Interacting Multiple Model (IMM) algorithm to improve the accuracy of the 3D reconstruction. To address the specific problems introduced by the image and video running modes, I tested two different strategies. In the first, I implemented an IQR filter and applied it to the 2D coordinates obtained in image mode, in order to reduce the impact of outlier measurements. In the second, I used video mode and, to eliminate the delay, developed two single-input single-output (SISO) models to learn a relationship between the delayed and non-delayed 2D coordinates. To analyze the proposed solutions, I adopted some metrics from the literature and proposed others customized for the context of this work.
The results obtained show that KF and IMM significantly improve accuracy with respect to triangulation, and that the best choice in terms of balance between accuracy and complexity is the KF with the higher-order model. The introduction of the IQR filter and of the SISO models does not lead to an improvement, so the best 2D data are those obtained in video mode, despite the presence of delay.