
ARTEMIS2 & ARTEMIS2-Mini: From Full-Scale to Distilled Real-Time Animal Behavior Recognition in a Robotic Dog

Fazzari, Edoardo; Romano, Donato; Stefanini, Cesare
2025-01-01

Abstract

Recent progress in pre-trained vision-language models has greatly advanced multimodal learning by enabling robust and generalizable feature extraction. However, animal action recognition remains a challenging task due to high intraspecies variability, subtle motion cues, and the need for fine-grained spatio-temporal reasoning. In this work, we present ARTEMIS2, an improved version of the ARTEMIS framework, featuring enhanced frame selection, more powerful visual and textual encoders, and a refined spatio-temporal captioning module for stronger alignment between visual content and textual descriptions. Evaluated on the Animal Kingdom benchmark, ARTEMIS2 achieves 82.4 mAP, setting a new state of the art. To support real-time deployment on robotic platforms, we propose ARTEMIS2-Mini, a distilled unimodal variant based on the TimeSformer architecture. Despite relying solely on video input, it achieves 77.91 mAP and enables real-time inference onboard a Unitree Go2 quadruped robot. Field experiments demonstrate its effectiveness in recognizing feline behaviors across indoor and outdoor environments, showcasing its potential for embodied AI in animal monitoring and interaction tasks. The code for ARTEMIS2 is available at https://github.com/edofazza/artemis2.
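This record does not detail the distillation recipe used to obtain ARTEMIS2-Mini from the full multimodal model. Purely as an illustrative sketch of the general technique, a standard soft-target knowledge-distillation objective (temperature-scaled teacher probabilities blended with hard-label cross-entropy) can be written as follows; the function names, the temperature `T`, and the blending weight `alpha` are assumptions for illustration, not values taken from the paper.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax over the last axis (numerically stable).
    z = np.asarray(z, dtype=np.float64) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of KL(teacher || student) on softened logits and hard-label CE.

    student_logits, teacher_logits: (batch, num_classes) arrays
    labels: (batch,) integer class indices
    """
    p_t = softmax(teacher_logits, T)                 # teacher soft targets
    log_p_s = np.log(softmax(student_logits, T))     # student log-probs at temp T
    # KL divergence per sample, averaged over the batch; the T**2 factor
    # keeps gradient magnitudes comparable across temperatures.
    kl = np.sum(p_t * (np.log(p_t) - log_p_s), axis=-1).mean() * (T ** 2)
    # Standard cross-entropy against the ground-truth labels (T = 1).
    ce = -np.log(softmax(student_logits)[np.arange(len(labels)), labels]).mean()
    return alpha * kl + (1 - alpha) * ce
```

When student and teacher logits coincide, the KL term vanishes and the loss reduces to the weighted cross-entropy alone, which is a quick sanity check for an implementation like this.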
Files in this record:
Fazzari et al_IEEE International Conference on Robotics, Automation and Artificial Intelligence.pdf

Embargo until 30/04/2027

Type: Post-print/Accepted manuscript
License: Publisher's copyright
Size: 6.02 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11382/587012
