Real-time multi-person human activity recognition for social robots

Human Activity Recognition is a fast-growing research field within the wider context of sensing systems for artificial intelligence. The availability of a wide variety of sensors, such as cameras and wearables, and the interest of different industries in its applications have led to the spread of large data sets. Among the many approaches developed, the most recent advances in deep learning have proven effective at extracting accurate skeletal poses from images without the need to design complex handcrafted features. Such poses provide a simple yet effective representation for activity classification. However, many studies are performed in laboratory settings or make strong assumptions, so they are not easily transferable to real-world scenarios. In this thesis we investigated the possibility of applying the most recent deep learning techniques for human pose estimation and activity recognition in the wild: an uncontrolled environment characterised by the presence of many people. In such a context time complexity is critical, but recent techniques have focused on pose precision at the expense of inference time, relying on powerful processing units to perform complex computations. In contrast, we designed and implemented a system that performs real-time multi-person activity recognition for robots in the wild, running purely on a low-consumption CPU and taking a camera stream as input. This was done by integrating a 2D skeletal pose estimator with a heuristic-based tracker, which together feed an LSTM network for human activity recognition. We trained this network to obtain different activity classification models using a data set of poses In The Wild, which we built ad hoc to resemble a running environment with noisy data and no ground truth. Our experimental results show that high pose precision is not necessary to obtain a well-performing activity classifier, whereas inference time is crucial for working online on a camera stream. We achieved results comparable with state-of-the-art solutions while guaranteeing an inference time that is invariant with respect to the number of people in the images.
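
The pipeline described above (pose estimator, heuristic tracker, sequence classifier) can be illustrated with a minimal sketch of the middle stage. The abstract does not say which heuristic the tracker uses, so everything below is an assumption made for illustration: the class name `GreedyTracker`, matching by nearest pose centroid, the `max_dist` threshold and the `window` length are all hypothetical. The sketch only shows how per-frame 2D poses might be grouped into per-person sequences of fixed length, which is the kind of input a recurrent classifier such as an LSTM would consume.

```python
from collections import deque
from math import hypot

# A pose is a list of (x, y) joint coordinates in image space (assumed format).
Pose = list[tuple[float, float]]


def centroid(pose: Pose) -> tuple[float, float]:
    """Mean position of all joints of one pose."""
    xs = [x for x, _ in pose]
    ys = [y for _, y in pose]
    return sum(xs) / len(xs), sum(ys) / len(ys)


class GreedyTracker:
    """Hypothetical heuristic tracker: greedy nearest-centroid matching."""

    def __init__(self, max_dist: float = 50.0, window: int = 8):
        self.max_dist = max_dist  # assumed matching threshold, in pixels
        self.window = window      # assumed number of frames per sequence
        self.tracks: dict[int, deque] = {}
        self._next_id = 0

    def update(self, poses: list[Pose]) -> list[int]:
        """Assign each pose of the current frame to a track.

        Returns the track ids in the same order as the input poses.
        """
        free = set(self.tracks)  # tracks not yet matched in this frame
        ids = []
        for pose in poses:
            cx, cy = centroid(pose)
            best, best_d = None, self.max_dist
            for tid in sorted(free):
                lx, ly = centroid(self.tracks[tid][-1])
                d = hypot(cx - lx, cy - ly)
                if d < best_d:
                    best, best_d = tid, d
            if best is None:
                # No existing track is close enough: start a new one.
                best = self._next_id
                self._next_id += 1
                self.tracks[best] = deque(maxlen=self.window)
            else:
                free.discard(best)
            self.tracks[best].append(pose)
            ids.append(best)
        return ids

    def ready_windows(self) -> dict[int, list[Pose]]:
        """Tracks that have accumulated a full window of poses."""
        return {
            tid: list(buf)
            for tid, buf in self.tracks.items()
            if len(buf) == self.window
        }
```

In this sketch, `update` would be called once per frame with the poses returned by the estimator, and each sequence returned by `ready_windows` would then be passed to the activity classifier. This simplified version never removes stale tracks and resolves conflicts greedily, so it is a starting point under the stated assumptions, not the system evaluated in the thesis.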
