In cloud computing, auto-healing is a feature used to monitor a cluster and detect faulty application instances or nodes.
If a faulty instance is detected, it is terminated and a new one is started. If a physical node fails, all of its hosted instances are rescheduled onto other healthy nodes in the cluster.
In this practical assignment, we implemented a distributed auto-healer which monitors a group of worker instances across multiple physical nodes using a Leader/Worker architecture.
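The text above does not say how a failed node is detected. One common approach is a heartbeat timeout, sketched below under the assumption that each Node Agent periodically reports to the Scheduler. The class `HeartbeatMonitor`, its method names, and the timeout value are hypothetical and not taken from the project.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical failure detector: a node counts as failed when its last heartbeat is too old. */
class HeartbeatMonitor {
    private final long timeoutMillis;
    // nodeId -> timestamp (in ms) of the most recent heartbeat
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Called whenever a node reports in; the caller supplies the current time. */
    void recordHeartbeat(String nodeId, long nowMillis) {
        lastSeen.put(nodeId, nowMillis);
    }

    /** Returns the sorted IDs of nodes whose last heartbeat is older than the timeout. */
    List<String> failedNodes(long nowMillis) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                failed.add(entry.getKey());
            }
        }
        Collections.sort(failed);
        return failed;
    }
}
```

Passing the current time in as a parameter, rather than reading the system clock inside the class, keeps the detector easy to test. A real scheduler would likely call `failedNodes(System.currentTimeMillis())` on a fixed schedule.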
### The Components:
1. **The Scheduler (Leader):** Monitors the cluster state and maintains the target number of workers by distributing them across active nodes.
2. **The Node Agent:** Runs on each physical machine, registers the node, and manages local worker processes.
3. **The TransientWorker:** The application instance that performs computations and may crash randomly due to unhandled edge cases.
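As an illustration of how a Node Agent might manage a crash-prone local worker, here is a minimal sketch of a restart loop. It assumes a crash shows up as a non-zero exit code. `WorkerLauncher`, `WorkerSupervisor`, and the `maxRestarts` cap are hypothetical names introduced for this example, not the project's actual classes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

/** Starts one worker and blocks until it exits, returning its exit code. */
interface WorkerLauncher {
    int runAndWait();

    /** Launcher backed by a real OS process, e.g. ["java", "-jar", "<path to worker jar>"]. */
    static WorkerLauncher forCommand(List<String> command) {
        return () -> {
            try {
                return new ProcessBuilder(command).inheritIO().start().waitFor();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for worker", e);
            }
        };
    }
}

/** Hypothetical supervisor that keeps one local worker alive. */
class WorkerSupervisor {
    private final WorkerLauncher launcher;
    private final int maxRestarts;

    WorkerSupervisor(WorkerLauncher launcher, int maxRestarts) {
        this.launcher = launcher;
        this.maxRestarts = maxRestarts;
    }

    /**
     * Runs the worker and restarts it after each crash (non-zero exit code).
     * Stops on a clean exit or once maxRestarts is reached; returns the restart count.
     */
    int supervise() {
        int restarts = 0;
        while (true) {
            int exitCode = launcher.runAndWait();
            if (exitCode == 0 || restarts == maxRestarts) {
                return restarts;
            }
            restarts++; // crash detected: loop around and start a replacement on this node
        }
    }
}
```

Hiding the process launch behind an interface is a design choice made for this sketch: the restart logic can be exercised with a fake launcher, while `WorkerLauncher.forCommand(...)` covers the real case of spawning the worker jar.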
### Our Mission:
Our mission is to maintain at least **N** worker instances in the cluster at any given moment.
- If a **Worker** crashes, the system restarts it on the same node.
- If a **Node** fails, the system redistributes the lost workers to the remaining healthy nodes.
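The redistribution rule for node failures can be sketched as a small placement function. This is one possible policy (fill the least-loaded healthy node first), not necessarily the one the Scheduler uses. The class `Rebalancer` and the node IDs in the usage note are hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical placement helper for the Scheduler. */
final class Rebalancer {
    private Rebalancer() {}

    /**
     * Returns a new nodeId -> workerCount plan that drops every node not in
     * healthyNodes and places the missing workers, one at a time, on the
     * least-loaded healthy node until the plan holds targetWorkers in total.
     * Existing workers on healthy nodes are never moved or removed.
     */
    static Map<String, Integer> rebalance(Map<String, Integer> current,
                                          Set<String> healthyNodes,
                                          int targetWorkers) {
        if (healthyNodes.isEmpty()) {
            throw new IllegalStateException("no healthy nodes left to host workers");
        }
        // TreeMap keeps iteration deterministic: ties go to the smallest node ID.
        Map<String, Integer> plan = new TreeMap<>();
        int running = 0;
        for (String node : healthyNodes) {
            int count = current.getOrDefault(node, 0);
            plan.put(node, count);
            running += count;
        }
        for (int missing = targetWorkers - running; missing > 0; missing--) {
            String leastLoaded = null;
            for (Map.Entry<String, Integer> entry : plan.entrySet()) {
                if (leastLoaded == null || entry.getValue() < plan.get(leastLoaded)) {
                    leastLoaded = entry.getKey();
                }
            }
            plan.merge(leastLoaded, 1, Integer::sum);
        }
        return plan;
    }
}
```

For example, with a target of 6 workers spread evenly over `node-1`, `node-2`, and `node-3`, losing `node-2` leaves 4 running workers. The sketch then assigns the 2 missing workers so that `node-1` and `node-3` each host 3.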
---
### Build the Project
Use Maven to build all modules (common, scheduler, node, worker):
```bash
mvn clean install
```
### Run the Scheduler (Leader)
The scheduler monitors the cluster and maintains the desired state. Launch it by providing the target number of workers.
```bash
java -jar scheduler/target/scheduler-1.0-SNAPSHOT.jar <number of workers>
```
### Run the Node Agent
The Node Agent must be running on each machine (or in a separate terminal per node, when simulating a cluster locally) to host the workers. You need to provide a unique Node ID and the path to the worker jar.
```bash
java -jar node/target/node-1.0-SNAPSHOT.jar <node id> <path to worker jar>