add README

Commit 2dbfe894 authored Dec 23, 2025 by tammam.alsoleman

Showing with 57 additions and 0 deletions

--- a/README.md
+++ b/README.md
+# Cluster Auto-healer using ZooKeeper
+
+In cloud computing, auto-healing is a feature used to monitor a cluster and detect faulty application instances or nodes.
+
+If a faulty instance is detected, the instance is neutralized and a new one is started. If a physical node fails, all its hosted instances are rescheduled to other healthy nodes in the cluster.
+
+In this practical assignment, we implemented a distributed auto-healer that monitors a group of worker instances across multiple physical nodes using a Leader/Worker architecture.
+
+### The Components:
+1. **The Scheduler (Leader):** Monitors the cluster state and maintains the target number of workers by distributing them across active nodes.
+2. **The Node Agent:** Runs on each physical machine, registers the node, and manages local worker processes.
+3. **The TransientWorker:** The application instance that performs computations and may crash randomly due to unhandled edge cases.
+
+### Our Mission:
+Our mission is to maintain at least **N** worker instances in the cluster at any given moment.
+- If a **Worker** crashes, the system restarts it on the same node.
+- If a **Node** fails, the system redistributes the lost workers to the remaining healthy nodes.
+
+---
+
+### Build the Project
+Use Maven to build all modules (common, scheduler, node, worker):
+```bash
+mvn clean install
+```
+
+### Run the Scheduler (Leader)
+The scheduler monitors the cluster and maintains the desired state. Launch it by providing the target number of workers.
+```bash
+java -jar scheduler/target/scheduler-1.0-SNAPSHOT.jar <number of workers>
+```
+
+#### Example:
+```bash
+java -jar scheduler/target/scheduler-1.0-SNAPSHOT.jar 10
+```
+
+---
+
+### Run the Node Agent (Physical Node)
+Run one Node Agent on each machine (or in a separate terminal when simulating a cluster locally) to host the workers. Provide a unique node ID and the path to the worker JAR.
+```bash
+java -jar node/target/node-1.0-SNAPSHOT.jar <node id> <path to worker jar>
+```
+
+#### Example:
+```bash
+java -jar node/target/node-1.0-SNAPSHOT.jar node-1 "../worker/target/worker-1.0-SNAPSHOT.jar"
+```
+
+---
+
+### Features Implemented:
+*   **Service Discovery:** Nodes register automatically using ZooKeeper ephemeral znodes.
+*   **Least Load Scheduling:** Distributes workers across nodes using a Round-Robin algorithm.
+*   **Fault Detection:** Real-time monitoring of worker and node health via ZooKeeper Watchers.
+*   **Asynchronous Logging:** All events (assignments, failures, healing) are logged chronologically in `cluster_events.log`.
\ No newline at end of file
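
The Round-Robin distribution and the node-failure redistribution described in the README can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not the project's scheduler: the class `RoundRobinPlanner`, its methods `assign` and `heal`, and the `worker-<i>` ID format are hypothetical, and the sketch works on in-memory maps instead of ZooKeeper znodes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the scheduler module. */
final class RoundRobinPlanner {

    private RoundRobinPlanner() {}

    /** Deals worker-0 .. worker-(n-1) across the nodes in order, wrapping around. */
    static Map<String, List<String>> assign(List<String> nodes, int targetWorkers) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("no active nodes");
        }
        Map<String, List<String>> plan = new LinkedHashMap<>();
        for (String node : nodes) {
            plan.put(node, new ArrayList<>());
        }
        for (int i = 0; i < targetWorkers; i++) {
            // The i-th worker goes to node (i mod number of nodes).
            plan.get(nodes.get(i % nodes.size())).add("worker-" + i);
        }
        return plan;
    }

    /** Returns a new plan in which the failed node's workers are dealt to the survivors. */
    static Map<String, List<String>> heal(Map<String, List<String>> plan, String failedNode) {
        Map<String, List<String>> healed = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : plan.entrySet()) {
            if (!entry.getKey().equals(failedNode)) {
                healed.put(entry.getKey(), new ArrayList<>(entry.getValue()));
            }
        }
        List<String> orphans = plan.getOrDefault(failedNode, List.of());
        if (orphans.isEmpty()) {
            return healed; // nothing to reschedule
        }
        if (healed.isEmpty()) {
            throw new IllegalStateException("no healthy nodes left");
        }
        List<String> survivors = new ArrayList<>(healed.keySet());
        for (int i = 0; i < orphans.size(); i++) {
            // Round-robin over the survivors, starting from the first one.
            healed.get(survivors.get(i % survivors.size())).add(orphans.get(i));
        }
        return healed;
    }

    public static void main(String[] args) {
        Map<String, List<String>> plan = assign(List.of("node-1", "node-2", "node-3"), 10);
        System.out.println(plan);
        System.out.println(heal(plan, "node-2"));
    }
}
```

With three nodes and a target of 10 workers, `assign` places 4, 3, and 3 workers. After `node-2` fails, `heal` leaves `node-1` with 6 workers and `node-3` with 4, because a plain round-robin pass ignores existing load. A least-load policy would instead pick the node with the fewest workers for each orphan, which is worth keeping in mind when reading the "Least Load Scheduling" feature.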