# ClusterSearch: Intelligent Distributed Text Search Engine
ClusterSearch is a high-performance, fault-tolerant distributed system designed to perform text retrieval across massive datasets using the **TF-IDF (Term Frequency-Inverse Document Frequency)** algorithm.
This project demonstrates core distributed systems concepts including **Leader Election**, **Service Discovery**, **Load Distribution (Sharding)**, and **Multi-tier Orchestration**.
## 🚀 Key Features
* **Distributed TF-IDF Engine:** Implements a two-phase search algorithm to keep scoring mathematically consistent across the cluster.
* **Leader Election:** Powered by **Apache ZooKeeper** for high availability and automatic failover.
* **Scalable Architecture:** Dynamically partitions thousands of documents across an arbitrary number of Worker nodes.
* **Multi-Protocol Communication:** Uses **gRPC** for efficient internal cluster communication and **HTTP/JSON** for external interaction.
* **Modern UI:** A web dashboard with real-time performance metrics, search history, and smooth pagination.
* **Fault Tolerance:** Automatically detects node failures and redistributes search tasks.
---
## 🏗 System Architecture
The system follows a **Three-Tier Distributed Architecture**:
1. **Frontend Node:** A standalone web server that serves the UI and acts as a gateway, discovering the current Leader via ZooKeeper.
2. **Coordinator (Leader):** The brain of the cluster: it manages workers, partitions data, computes global IDF statistics, and exposes an internal HTTP API.
3. **Workers:** Computational nodes that perform local text processing and TF-IDF scoring on their assigned document shards.
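The leader election behind the Coordinator tier follows ZooKeeper's standard recipe: each node creates an ephemeral sequential znode under an election path, and the node whose znode has the smallest sequence number becomes leader; every other node watches its immediate predecessor. A minimal, ZooKeeper-free sketch of just that decision rule (the `c_` znode names and the class/method names are illustrative, not this project's API):

```java
import java.util.*;

public class LeaderElectionRule {
    // The leader is the child znode with the smallest sequence suffix.
    // Zero-padded names sort correctly with plain lexicographic order.
    static String electLeader(List<String> children) {
        return children.stream().min(Comparator.naturalOrder()).orElseThrow();
    }

    // A non-leader watches its immediate predecessor, so a failure only
    // wakes one node instead of the whole cluster (no herd effect).
    // Returns null when the caller is itself the leader.
    static String predecessorToWatch(List<String> children, String self) {
        List<String> sorted = new ArrayList<>(children);
        Collections.sort(sorted);
        int i = sorted.indexOf(self);
        return i > 0 ? sorted.get(i - 1) : null;
    }
}
```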
## 🛠 Technology Stack
* **Serialization:** Google Gson (JSON) & Protocol Buffers (proto3)
* **Frontend:** HTML5, CSS3 (Flexbox/Grid), JavaScript (async/await)
---
## 🧬 Distributed TF-IDF Algorithm
To maintain global accuracy, the search is performed in two synchronized phases:
1. **Phase 1 (Global Stats):** The Coordinator requests local term counts from all Workers and aggregates them to compute the **Global IDF** (Inverse Document Frequency).
2. **Phase 2 (Scoring & Ranking):** The Coordinator sends the Global IDF back to the Workers. Each Worker calculates final scores ($TF \times IDF$) for its local documents. The Coordinator then gathers, sorts, and returns the top results.
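The math behind the two phases can be sketched in plain Java (no gRPC). All class and method names here are illustrative rather than this project's API, and the base-10 logarithm is an assumption, since the README does not pin down the IDF formula:

```java
import java.util.*;

public class TfIdfSketch {
    // Phase 1, worker side: how many documents in this shard contain the term?
    static int documentsContaining(String term, List<String> shard) {
        int n = 0;
        for (String doc : shard)
            if (Arrays.asList(doc.toLowerCase().split("\\W+")).contains(term)) n++;
        return n;
    }

    // Phase 1, coordinator side: aggregate per-shard counts into a global IDF.
    static double globalIdf(String term, List<List<String>> shards) {
        int totalDocs = 0, docsWithTerm = 0;
        for (List<String> shard : shards) {
            totalDocs += shard.size();
            docsWithTerm += documentsContaining(term, shard);
        }
        if (docsWithTerm == 0) return 0.0; // term absent everywhere
        return Math.log10((double) totalDocs / docsWithTerm);
    }

    // Phase 2, worker side: score one local document with the global IDF.
    static double score(String term, String doc, double idf) {
        String[] words = doc.toLowerCase().split("\\W+");
        long occurrences = Arrays.stream(words).filter(w -> w.equals(term)).count();
        double tf = (double) occurrences / words.length;
        return tf * idf;
    }
}
```

The key point the two phases enforce: IDF must be computed over the *whole* corpus, so no worker can finish scoring until the Coordinator has broadcast the aggregated value.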
---
## 📂 Dynamic Data Management
The system is designed with **Hot-Swappable Data** support. You can scale the dataset horizontally without any code modifications or system restarts:
* **Auto-Detection:** The Coordinator and Workers re-scan the `storage` directory on demand for every new search query.
* **Easy Expansion:** Simply drop new `.txt` files into the `storage` folder while the system is running.
* **Dynamic Partitioning:** The Coordinator automatically recalculates the document count and redistributes the new workload across the active workers instantly.
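A minimal sketch of what such an on-demand scan and repartition could look like (plain Java NIO; the class names and the round-robin assignment policy are assumptions, not this project's actual code):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

public class StoragePartitioner {
    // On-demand scan: list every .txt document currently in the storage folder,
    // so files dropped in at runtime are picked up by the very next query.
    static List<String> scanStorage(Path storageDir) throws IOException {
        try (Stream<Path> files = Files.list(storageDir)) {
            return files.map(p -> p.getFileName().toString())
                        .filter(name -> name.endsWith(".txt"))
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    // Split the current document list into near-equal shards,
    // one per active worker, assigning documents round-robin.
    static List<List<String>> partition(List<String> docs, int workers) {
        List<List<String>> shards = new ArrayList<>();
        for (int w = 0; w < workers; w++) shards.add(new ArrayList<>());
        for (int i = 0; i < docs.size(); i++) shards.get(i % workers).add(docs.get(i));
        return shards;
    }
}
```

Because the scan runs per query rather than at startup, adding or removing workers or documents only changes the shard assignment computed for the next search.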
---
## 🚦 Getting Started
### Prerequisites
* Java 17 JDK or higher.
* Maven 3.x.
* Apache ZooKeeper running on `localhost:2181`.
### Installation
1. Clone the repository:
```bash
git clone <your-gitlab-repo-link>
cd DistributedSearchEngine
```
2. Build the project (Fat JAR):
```bash
mvn clean package
```
### Running the Cluster
1. **Start ZooKeeper** (ensure it is active on `localhost:2181`).
2. **Start the Cluster Nodes** (run each in a separate terminal):
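The exact commands depend on the artifact name and entry points your build produces; the JAR name, class path, and ports below are placeholders in the same spirit as `<your-gitlab-repo-link>`, not this project's actual CLI:

```bash
# Placeholders: substitute the fat-JAR name from target/ and the real main class.
# Terminal 1 - first worker node
java -jar target/<your-fat-jar>.jar 8081

# Terminal 2 - second worker node (ZooKeeper elects one node as Coordinator)
java -jar target/<your-fat-jar>.jar 8082

# Terminal 3 - frontend gateway serving the web dashboard
java -jar target/<your-fat-jar>.jar 9000
```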