# Cluster Auto-healer using ZooKeeper

In cloud computing, auto-healing is a feature used to monitor a cluster and detect faulty application instances or nodes.

If a faulty instance is detected, it is terminated and a new one is started. If a physical node fails, all of its hosted instances are rescheduled onto other healthy nodes in the cluster.

In this practical assignment, we implemented a distributed auto-healer that monitors a group of worker instances across multiple physical nodes using a Leader/Worker architecture.

### The Components:
1. **The Scheduler (Leader):** Monitors the cluster state and maintains the target number of workers by distributing them across active nodes.
2. **The Node Agent:** Runs on each physical machine, registers the node, and manages local worker processes.
3. **The TransientWorker:** The application instance that performs computations and may crash randomly due to unhandled edge cases.

### Our Mission:
Our mission is to maintain at least **N** worker instances in the cluster at any given moment.
- If a **Worker** crashes, the system restarts it on the same node.
- If a **Node** fails, the system redistributes the lost workers to the remaining healthy nodes.
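
The node-failure rule above can be illustrated with a small, self-contained sketch. It is not the project's actual scheduler code: the class `HealingSketch`, the method `redistribute`, and the per-node worker counts are hypothetical, and the sketch assumes lost workers are handed to the surviving nodes one at a time, in turn.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration only; not taken from the project. */
final class HealingSketch {

    /**
     * Returns the worker count per node after a node failure. The failed
     * node's workers are handed out one at a time, in turn, to the survivors,
     * so the cluster-wide total stays the same.
     */
    static Map<String, Integer> redistribute(Map<String, Integer> workersByNode, String failedNode) {
        Map<String, Integer> result = new LinkedHashMap<>(workersByNode);
        Integer lost = result.remove(failedNode);
        if (lost == null || lost == 0) {
            return result; // unknown node, or nothing to move
        }
        if (result.isEmpty()) {
            throw new IllegalStateException("no healthy node left to host " + lost + " workers");
        }
        List<String> healthy = new ArrayList<>(result.keySet());
        for (int i = 0; i < lost; i++) {
            // Cycle through the surviving nodes, adding one worker each time.
            result.merge(healthy.get(i % healthy.size()), 1, Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> cluster = new LinkedHashMap<>();
        cluster.put("node-1", 4);
        cluster.put("node-2", 3);
        cluster.put("node-3", 3);
        // node-3 fails: its 3 workers move to node-1 and node-2.
        System.out.println(redistribute(cluster, "node-3")); // prints {node-1=6, node-2=4}
    }
}
```

The total of 10 workers is preserved after the failure, which is the invariant the mission statement asks for.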

---

### Build the Project
Use Maven to build all modules (common, scheduler, node, worker):
```bash
mvn clean install
```

### Run the Scheduler (Leader)
The scheduler monitors the cluster and maintains the desired state. Launch it by providing the target number of workers.
```bash
java -jar scheduler/target/scheduler-1.0-SNAPSHOT.jar <number of workers>
```

#### Example:
```bash
java -jar scheduler/target/scheduler-1.0-SNAPSHOT.jar 10
```

---

### Run the Node Agent (Physical Node)
The Node Agent must run on each machine (or in a separate terminal per node when simulating a cluster locally) to host the workers. Provide a unique node ID and the path to the worker JAR.
```bash
java -jar node/target/node-1.0-SNAPSHOT.jar <node id> <path to worker jar>
```

#### Example:
```bash
java -jar node/target/node-1.0-SNAPSHOT.jar node-1 "../worker/target/worker-1.0-SNAPSHOT.jar"
```
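
Since the Node Agent receives the path to the worker JAR, one plausible way for it to start a worker is as a child JVM. The sketch below assumes exactly that; the class `WorkerLauncherSketch` and the method `workerCommand` are hypothetical names, and the real agent may launch and supervise workers differently.

```java
import java.io.File;

/** Hypothetical launcher for illustration only; not taken from the project. */
final class WorkerLauncherSketch {

    /** Builds a "java -jar <worker jar>" command using the JVM that runs this agent. */
    static ProcessBuilder workerCommand(String workerJarPath) {
        String javaBin = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        // inheritIO() forwards the worker's stdout/stderr to the agent's console.
        return new ProcessBuilder(javaBin, "-jar", workerJarPath).inheritIO();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: WorkerLauncherSketch <path to worker jar>");
            return;
        }
        Process worker = workerCommand(args[0]).start();
        // Blocks until the worker exits; a real agent would react to the exit
        // (for example by restarting the worker) instead of just reporting it.
        int exitCode = worker.waitFor();
        System.out.println("worker exited with code " + exitCode);
    }
}
```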

---

### Features Implemented:
*   **Service Discovery:** Automatic registration of nodes using ZooKeeper ephemeral znodes.
*   **Round-Robin Scheduling:** Distribution of workers across active nodes in round-robin order.
*   **Fault Detection:** Real-time monitoring of worker and node health via ZooKeeper Watchers.
*   **Asynchronous Logging:** All events (assignments, failures, healing) are logged chronologically in `cluster_events.log`.
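
As an illustration of the Round-Robin distribution listed above, here is a minimal sketch of how a target number of workers could be spread across the active nodes. The class `RoundRobinSketch`, the method `assign`, and the `worker-<i>` naming are assumptions made for this example, not the scheduler's real identifiers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical planner for illustration only; not taken from the project. */
final class RoundRobinSketch {

    /** Assigns worker IDs to nodes by cycling through the node list in order. */
    static Map<String, List<String>> assign(List<String> activeNodes, int targetWorkers) {
        if (activeNodes.isEmpty()) {
            throw new IllegalArgumentException("at least one active node is required");
        }
        Map<String, List<String>> plan = new LinkedHashMap<>();
        for (String node : activeNodes) {
            plan.put(node, new ArrayList<>());
        }
        for (int i = 0; i < targetWorkers; i++) {
            // Worker i goes to node (i mod number of nodes).
            String node = activeNodes.get(i % activeNodes.size());
            plan.get(node).add("worker-" + i);
        }
        return plan;
    }

    public static void main(String[] args) {
        // 10 workers over 3 nodes: node-1 hosts 4, node-2 and node-3 host 3 each.
        assign(List.of("node-1", "node-2", "node-3"), 10)
                .forEach((node, workers) -> System.out.println(node + " -> " + workers));
    }
}
```

With this scheme the per-node counts never differ by more than one, provided the set of active nodes does not change between assignments.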