ARTEMIS2 & ARTEMIS2-Mini: From Full-Scale to Distilled Real-Time Animal Behavior Recognition in a Robotic Dog
Fazzari, Edoardo; Romano, Donato; Stefanini, Cesare
2025-01-01
Abstract
Recent progress in pre-trained vision-language models has greatly advanced multimodal learning by enabling robust and generalizable feature extraction. However, animal action recognition remains a challenging task due to high intraspecies variability, subtle motion cues, and the need for fine-grained spatio-temporal reasoning. In this work, we present ARTEMIS2, an improved version of the ARTEMIS framework, featuring enhanced frame selection, more powerful visual and textual encoders, and a refined spatio-temporal captioning module for stronger alignment between visual content and textual descriptions. Evaluated on the Animal Kingdom benchmark, ARTEMIS2 achieves 82.4 mAP, setting a new state of the art. To support real-time deployment on robotic platforms, we propose ARTEMIS2-Mini, a distilled unimodal variant based on the TimeSformer architecture. Despite relying solely on video input, it achieves 77.91 mAP and enables real-time inference onboard a Unitree Go2 quadruped robot. Field experiments demonstrate its effectiveness in recognizing feline behaviors across indoor and outdoor environments, showcasing its potential for embodied AI in animal monitoring and interaction tasks. The code for ARTEMIS2 is available at https://github.com/edofazza/artemis2.
File: Fazzari et al_IEEE International Conference on Robotics, Automation and Artificial Intelligence.pdf
Embargo until: 30/04/2027
Type: Post-print/Accepted manuscript
License: Publisher copyright
Size: 6.02 MB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

