
ARTEMIS2 & ARTEMIS2-Mini: From Full-Scale to Distilled Real-Time Animal Behavior Recognition in a Robotic Dog

Fazzari, Edoardo; Romano, Donato; Stefanini, Cesare
2025-01-01

Abstract

Recent progress in pre-trained vision-language models has greatly advanced multimodal learning by enabling robust and generalizable feature extraction. However, animal action recognition remains a challenging task due to high intraspecies variability, subtle motion cues, and the need for fine-grained spatio-temporal reasoning. In this work, we present ARTEMIS2, an improved version of the ARTEMIS framework, featuring enhanced frame selection, more powerful visual and textual encoders, and a refined spatio-temporal captioning module for stronger alignment between visual content and textual descriptions. Evaluated on the Animal Kingdom benchmark, ARTEMIS2 achieves 82.4 mAP, setting a new state of the art. To support real-time deployment on robotic platforms, we propose ARTEMIS2-Mini, a distilled unimodal variant based on the TimeSformer architecture. Despite relying solely on video input, it achieves 77.91 mAP and enables real-time inference onboard a Unitree Go2 quadruped robot. Field experiments demonstrate its effectiveness in recognizing feline behaviors across indoor and outdoor environments, showcasing its potential for embodied AI in animal monitoring and interaction tasks. The code for ARTEMIS2 is available at https://github.com/edofazza/artemis2.
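This record does not detail the distillation recipe used to obtain ARTEMIS2-Mini from the full multimodal model. Purely as an illustrative sketch of the general technique, a standard soft-target knowledge-distillation objective (temperature-scaled teacher probabilities blended with hard-label cross-entropy) can be written as follows; the function names, the temperature `T`, and the blending weight `alpha` are assumptions for illustration, not values taken from the paper.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax over the last axis (numerically stable).
    z = np.asarray(z, dtype=np.float64) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of KL(teacher || student) on softened logits and hard-label CE.

    student_logits, teacher_logits: (batch, num_classes) arrays
    labels: (batch,) integer class indices
    """
    p_t = softmax(teacher_logits, T)                 # teacher soft targets
    log_p_s = np.log(softmax(student_logits, T))     # student log-probs at temp T
    # KL divergence per sample, averaged over the batch; the T**2 factor
    # keeps gradient magnitudes comparable across temperatures.
    kl = np.sum(p_t * (np.log(p_t) - log_p_s), axis=-1).mean() * (T ** 2)
    # Standard cross-entropy against the ground-truth labels (T = 1).
    ce = -np.log(softmax(student_logits)[np.arange(len(labels)), labels]).mean()
    return alpha * kl + (1 - alpha) * ce
```

When student and teacher logits coincide, the KL term vanishes and the loss reduces to the weighted cross-entropy alone, which is a quick sanity check for an implementation like this.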
Files in this record:
Fazzari et al_IEEE International Conference on Robotics, Automation and Artificial Intelligence.pdf

Embargo until 30/04/2027

Type: Post-print/Accepted manuscript
License: Publisher's copyright
Size: 6.02 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11382/587012
