RTilience: Fault-Tolerant Time-Critical Kubernetes
Abeni, Luca; Andreoli, Remo; Cucinotta, Tommaso
2025-01-01
Abstract
This paper tackles the problem of optimal configuration and deployment of fault-tolerant, time-critical service chains with arbitrary DAG-like topologies. We propose RTilience, designed according to a scalable cloud microservice paradigm and prototyped on top of the well-known Kubernetes cloud orchestrator. It features real-time reservation scheduling of containers to guarantee temporal isolation of time-critical tasks, giving fine-grained control of compute latencies while still allowing physical CPUs to be shared among containers. A distributed routing library, ReqRoute, is configured with a timeout and with primary and secondary routes, enabling autonomous and decentralized handling of failing requests. The routes are configured by a centralized controller that performs admission control, resource management of microservice instances, task placement, and fault detection and recovery, extending the features available in Kubernetes. Admission control is based on a theoretical framework comprising a worst-case performance model of the end-to-end response time under various fault-handling options, and an optimization framework that computes the optimum resource allocation for admitted services. The proposed solution has been evaluated extensively with synthetic examples and an autonomous transport robot use case, verifying that end-to-end deadlines are respected, in line with theoretical expectations, even in the presence of high fault rates of individual microservice instances. RTilience is available as open-source software, released under an MIT license.

| File | Access | Type | License | Size | Format |
|---|---|---|---|---|---|
| IEEE-TSC-2025-RTilience.pdf | Not available | Pre-print/Submitted manuscript | Publisher's copyright | 2.18 MB | Adobe PDF |

