Commit 2dbfe894 authored by tammam.alsoleman's avatar tammam.alsoleman

add README

parent bbfde650
# Cluster Auto-healer using Zookeeper
In cloud computing, auto-healing is a feature used to monitor a cluster and detect faulty application instances or nodes.
If a faulty instance is detected, the instance is neutralized and a new one is started. If a physical node fails, all its hosted instances are rescheduled to other healthy nodes in the cluster.
In this practical assignment, we implemented a distributed auto-healer which monitors a group of worker instances across multiple physical nodes using a Leader/Worker architecture.
### The Components:
1. **The Scheduler (Leader):** Monitors the cluster state and maintains the target number of workers by distributing them across active nodes.
2. **The Node Agent:** Runs on each physical machine, registers the node, and manages local worker processes.
3. **The TransientWorker:** The application instance that performs computations and may crash randomly due to unhandled edge cases.
### Our Mission:
Our mission is to maintain at least **N** worker instances in the cluster at any given moment.
- If a **Worker** crashes, the system restarts it on the same node.
- If a **Node** fails, the system redistributes the lost workers to the remaining healthy nodes.
---
### Build the Project
Use Maven to build all modules (common, scheduler, node, worker):
```bash
mvn clean install
```
### Run the Scheduler (Master)
The scheduler monitors the cluster and maintains the desired state. Launch it by providing the target number of workers.
```bash
java -jar scheduler/target/scheduler-1.0-SNAPSHOT.jar <number of workers>
```
#### Example:
```bash
java -jar scheduler/target/scheduler-1.0-SNAPSHOT.jar 10
```
---
### Run the Node Agent (Physical Node)
The Node Agent must be running on each machine (or terminal for simulation) to host the workers. You need to provide a unique Node ID and the path to the worker jar.
```bash
java -jar node/target/node-1.0-SNAPSHOT.jar <node id> <path to worker jar>
```
#### Example:
```bash
java -jar node/target/node-1.0-SNAPSHOT.jar node-1 "../worker/target/worker-1.0-SNAPSHOT.jar"
```
---
### Features Implemented:
* **Service Discovery:** Automatic registration of nodes using Zookeeper Ephemeral nodes.
* **Least Load Scheduling:** Distributed workers across nodes using a Round-Robin algorithm.
* **Fault Detection:** Real-time monitoring of worker and node health via Zookeeper Watchers.
* **Asynchronous Logging:** All events (assignments, failures, healing) are logged chronologically in `cluster_events.log`.
\ No newline at end of file
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment