# ClusterSearch: Intelligent Distributed Text Search Engine

ClusterSearch is a high-performance, fault-tolerant distributed system designed to perform text retrieval across massive datasets using the **TF-IDF (Term Frequency-Inverse Document Frequency)** algorithm.

This project demonstrates core distributed systems concepts including **Leader Election**, **Service Discovery**, **Load Distribution (Sharding)**, and **Multi-tier Orchestration**.

## 🚀 Key Features

*   **Distributed TF-IDF Engine:** Implements a two-phase search algorithm so that IDF is computed from cluster-wide statistics rather than from each Worker's shard alone.
*   **Leader Election:** Powered by **Apache ZooKeeper** to ensure high availability and automatic failover.
*   **Scalable Architecture:** Dynamically partitions thousands of documents across an arbitrary number of Worker nodes.
*   **Multi-Protocol Communication:** Uses **gRPC** for efficient internal cluster communication and **HTTP/JSON** for external interaction.
*   **Professional UI:** A vibrant, modern web dashboard with real-time performance metrics, search history, and smooth pagination.
*   **Fault Tolerance:** Automatically detects node failures and redistributes search tasks.

---

## 🏗 System Architecture

The system follows a **Three-Tier Distributed Architecture**:

1.  **Frontend Node:** A standalone web server that serves the UI and acts as a gateway, discovering the current Leader via ZooKeeper.
2.  **Coordinator (Leader):** The brain of the cluster. It manages workers, partitions data, calculates global IDF statistics, and provides an internal HTTP API.
3.  **Workers:** Computational nodes that perform local text processing and TF-IDF scoring on assigned document shards.
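
The README does not show how the election itself is implemented. A common approach with ZooKeeper is the ephemeral-sequential recipe: each node creates an ephemeral sequential znode under a shared parent, the node with the lowest sequence number is the Leader, and every other node watches its immediate predecessor. The sketch below covers only that selection logic in plain Java, with no ZooKeeper client calls; the class name, method names, and the `c_` znode prefix are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

/** Pure-logic sketch of the ephemeral-sequential election recipe (assumed, not taken from the project). */
class ElectionSketch {

    /** Returns the znode that holds leadership: the one with the smallest sequence suffix. */
    static Optional<String> leader(List<String> children) {
        // Sequence suffixes are zero-padded, so plain string order matches numeric order
        return children.stream().min(String::compareTo);
    }

    /**
     * Returns the znode a non-leader should watch: its immediate predecessor in sorted order.
     * Empty when self is the leader or is not registered.
     */
    static Optional<String> predecessor(List<String> children, String self) {
        List<String> sorted = new ArrayList<>(children);
        Collections.sort(sorted);
        int index = sorted.indexOf(self);
        return index > 0 ? Optional.of(sorted.get(index - 1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical child names under the election parent znode
        List<String> children = List.of("c_0000000003", "c_0000000001", "c_0000000002");
        System.out.println("leader = " + leader(children).orElseThrow());
        System.out.println("c_0000000003 watches "
                + predecessor(children, "c_0000000003").orElseThrow());
    }
}
```

Under this assumption, the Frontend Node could locate the Coordinator by listing the same children and reading the address stored on the lowest-numbered znode. Watching only the predecessor, rather than the Leader, avoids waking every node when the Leader fails.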

---

## 🛠 Tech Stack

*   **Language:** Java 17
*   **Coordination:** Apache ZooKeeper 3.9.1
*   **Communication:** gRPC (via Netty Shaded) & Java HttpClient
*   **Build Tool:** Maven
*   **Serialization:** Google Gson (JSON) & Protocol Buffers (Proto3)
*   **Frontend:** Modern HTML5, CSS3 (Flexbox/Grid), JavaScript (Async/Await)

---

## 🧬 Distributed TF-IDF Algorithm

To maintain global accuracy, the search is performed in two synchronized phases:

1.  **Phase 1 (Global Stats):** The Coordinator requests local term counts from all Workers. It aggregates these counts to compute the **Global IDF** (Inverse Document Frequency).
2.  **Phase 2 (Scoring & Ranking):** The Coordinator sends the Global IDF back to the Workers. Each Worker calculates final scores ($TF \times IDF$) for its local documents. The Coordinator then gathers, sorts, and returns the top results.
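
The two phases above can be sketched in a single process as follows. This is a minimal illustration, not the project's code: it assumes the "local term counts" are per-term document frequencies, that $IDF(t) = \log(N / df(t))$ with $N$ the total document count, and that $TF$ is the term's share of a document's tokens. All class and method names are hypothetical, and documents are represented as pre-tokenized lists.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TwoPhaseTfIdfSketch {

    /** Phase 1 (Worker): for each query term, count the local documents that contain it. */
    static Map<String, Integer> localDocumentFrequency(List<List<String>> shard, List<String> queryTerms) {
        Map<String, Integer> df = new HashMap<>();
        for (String term : queryTerms) {
            int count = 0;
            for (List<String> doc : shard) {
                if (doc.contains(term)) count++;
            }
            df.put(term, count);
        }
        return df;
    }

    /** Phase 1 (Coordinator): merge per-Worker counts, then derive idf(t) = log(N / df(t)). */
    static Map<String, Double> globalIdf(List<Map<String, Integer>> perWorkerDf, int totalDocs) {
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> df : perWorkerDf) {
            df.forEach((term, count) -> merged.merge(term, count, Integer::sum));
        }
        Map<String, Double> idf = new HashMap<>();
        // A term found in no document gets idf 0 here to avoid dividing by zero
        merged.forEach((term, count) ->
                idf.put(term, count == 0 ? 0.0 : Math.log((double) totalDocs / count)));
        return idf;
    }

    /** Phase 2 (Worker): score one document as the sum of tf(t, d) * idf(t) over the query terms. */
    static double score(List<String> doc, Map<String, Double> idf) {
        double total = 0.0;
        for (Map.Entry<String, Double> entry : idf.entrySet()) {
            long occurrences = doc.stream().filter(entry.getKey()::equals).count();
            double tf = doc.isEmpty() ? 0.0 : (double) occurrences / doc.size();
            total += tf * entry.getValue();
        }
        return total;
    }

    public static void main(String[] args) {
        // Two hypothetical shards, each holding two tokenized documents
        List<List<String>> shardA = List.of(List.of("fast", "search"), List.of("slow", "index"));
        List<List<String>> shardB = List.of(List.of("search", "search", "engine"), List.of("fast", "engine"));
        List<String> query = List.of("search");

        Map<String, Double> idf = globalIdf(
                List.of(localDocumentFrequency(shardA, query), localDocumentFrequency(shardB, query)), 4);

        System.out.println("idf = " + idf);
        System.out.println("score = " + score(shardB.get(0), idf));
    }
}
```

The first phase matters because a Worker that computed IDF from its own shard would weight the same term differently from its peers, which would make scores from different shards incomparable when the Coordinator merges them.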

---

## 📂 Dynamic Data Management

The system is designed with **Hot-Swappable Data** support. You can scale the dataset horizontally without any code modifications or system restarts:

*   **Auto-Detection:** The Coordinator and Workers perform an on-demand scan of the `storage` directory for every new search query.
*   **Easy Expansion:** Simply drop new `.txt` files into the `storage` folder while the system is running.
*   **Dynamic Partitioning:** The Coordinator automatically recalculates the document count and redistributes the workload across the active Workers on the next query.
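
A per-query scan followed by repartitioning might look like the sketch below. It assumes a round-robin split over a sorted file list, which is one simple way to keep every node's view of the shards consistent; the project's actual partitioning strategy is not described here. The class and method names are hypothetical, and `main` uses a temporary directory as a stand-in for `storage` so the example runs anywhere.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ShardPlannerSketch {

    /** Lists the .txt files currently in the directory, sorted so every node sees the same order. */
    static List<String> scan(Path storageDir) throws IOException {
        try (Stream<Path> files = Files.list(storageDir)) {
            return files.filter(Files::isRegularFile)
                    .map(path -> path.getFileName().toString())
                    .filter(name -> name.endsWith(".txt"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    /** Round-robin split: document i goes to worker i % workerCount. */
    static List<List<String>> partition(List<String> documents, int workerCount) {
        if (workerCount <= 0) {
            throw new IllegalArgumentException("workerCount must be positive");
        }
        List<List<String>> shards = new ArrayList<>();
        for (int w = 0; w < workerCount; w++) {
            shards.add(new ArrayList<>());
        }
        for (int i = 0; i < documents.size(); i++) {
            shards.get(i % workerCount).add(documents.get(i));
        }
        return shards;
    }

    public static void main(String[] args) throws IOException {
        // Temporary stand-in for the storage directory; non-.txt files are ignored by scan()
        Path storage = Files.createTempDirectory("storage");
        for (String name : List.of("a.txt", "b.txt", "c.txt", "notes.md")) {
            Files.writeString(storage.resolve(name), "sample text");
        }
        System.out.println(partition(scan(storage), 2)); // [[a.txt, c.txt], [b.txt]]
    }
}
```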

---

## 🚦 Getting Started

### Prerequisites
*   JDK 17 or higher.
*   Maven 3.x.
*   Apache ZooKeeper running on `localhost:2181`.

### Installation
1.  Clone the repository:
    ```bash
    git clone <your-gitlab-repo-link>
    cd DistributedSearchEngine
    ```
2.  Build the project (Fat JAR):
    ```bash
    mvn clean package
    ```

### Running the Cluster
1.  **Start ZooKeeper** (ensure it is reachable at `localhost:2181`).
2.  **Start the Cluster Nodes** (Run in multiple terminals):
    ```bash
    # Node 1 (Will likely become Leader)
    java -jar target/search-engine-1.0-jar-with-dependencies.jar 8081
    
    # Node 2 (Worker)
    java -jar target/search-engine-1.0-jar-with-dependencies.jar 8082
    ```
3.  **Start the Frontend Application:**
    Run the `FrontendApplication` class from your IDE or as a separate JAR.
4.  **Access the Search Engine:**
    Open `http://localhost:8080` in your browser.

---

## 📈 Performance Benchmarking
The system includes a built-in benchmark tool for comparing search latency as the number of Workers increases. Approximate results:
*   **1 Worker:** ~1000 files processed in ~1200 ms.
*   **4 Workers:** ~1000 files processed in ~400 ms.
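
Taking the two approximate figures above at face value, the implied speedup $S$ and parallel efficiency $E$ work out as follows. These are back-of-envelope values derived from the rounded latencies, not additional measurements.

$$
S = \frac{T_1}{T_4} \approx \frac{1200\ \text{ms}}{400\ \text{ms}} = 3, \qquad E = \frac{S}{4} \approx 0.75
$$

A speedup below the ideal factor of 4 would be consistent with the parts of a query that do not parallelize, such as the Coordinator's aggregation step and the network round trips of the two search phases.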

---

## 📁 Project Structure
```text
src/main/java/com/distributed/search/
├── cluster/      # ZooKeeper Election & Registry Logic
├── grpc/         # gRPC Service Implementation & Clients
├── logic/        # TF-IDF Mathematics & File Management
├── web/          # Internal & External HTTP Servers
└── Application/  # Entry Points
```