# ClusterSearch: Intelligent Distributed Text Search Engine
ClusterSearch is a high-performance, fault-tolerant distributed system designed to perform text retrieval across massive datasets using the **TF-IDF (Term Frequency-Inverse Document Frequency)** algorithm.
This project demonstrates core distributed systems concepts including **Leader Election**, **Service Discovery**, **Load Distribution (Sharding)**, and **Multi-tier Orchestration**.
## 🚀 Key Features
* **Distributed TF-IDF Engine:** Implements a two-phase search algorithm to keep scoring mathematically consistent across the cluster.
* **Leader Election:** Powered by **Apache ZooKeeper** for high availability and automatic failover.
* **Scalable Architecture:** Dynamically partitions thousands of documents across an arbitrary number of Worker nodes.
* **Multi-Protocol Communication:** Uses **gRPC** for efficient internal cluster communication and **HTTP/JSON** for external interaction.
* **Modern UI:** A web dashboard with real-time performance metrics, search history, and smooth pagination.
* **Fault Tolerance:** Automatically detects node failures and redistributes search tasks.
---
## 🏗 System Architecture
The system follows a **Three-Tier Distributed Architecture**:
1. **Frontend Node:** A standalone web server that serves the UI and acts as a gateway, discovering the current Leader via ZooKeeper.
2. **Coordinator (Leader):** The brain of the cluster: it manages workers, partitions data, computes global IDF statistics, and exposes an internal HTTP API.
3. **Workers:** Computational nodes that perform local text processing and TF-IDF scoring on their assigned document shards.
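The leader election behind the Coordinator tier follows ZooKeeper's standard recipe: each node creates an ephemeral sequential znode under an election path, and the node whose znode has the smallest sequence number becomes leader; every other node watches its immediate predecessor. A minimal, ZooKeeper-free sketch of just that decision rule (the `c_` znode names and the class/method names are illustrative, not this project's API):

```java
import java.util.*;

public class LeaderElectionRule {
    // The leader is the child znode with the smallest sequence suffix.
    // Zero-padded names sort correctly with plain lexicographic order.
    static String electLeader(List<String> children) {
        return children.stream().min(Comparator.naturalOrder()).orElseThrow();
    }

    // A non-leader watches its immediate predecessor, so a failure only
    // wakes one node instead of the whole cluster (no herd effect).
    // Returns null when the caller is itself the leader.
    static String predecessorToWatch(List<String> children, String self) {
        List<String> sorted = new ArrayList<>(children);
        Collections.sort(sorted);
        int i = sorted.indexOf(self);
        return i > 0 ? sorted.get(i - 1) : null;
    }
}
```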
## 🛠 Technology Stack
* **Serialization:** Google Gson (JSON) & Protocol Buffers (proto3)
* **Frontend:** HTML5, CSS3 (Flexbox/Grid), JavaScript (async/await)
---
## 🧬 Distributed TF-IDF Algorithm
To maintain global accuracy, the search is performed in two synchronized phases:
1. **Phase 1 (Global Stats):** The Coordinator requests local term counts from all Workers and aggregates them to compute the **Global IDF** (Inverse Document Frequency).
2. **Phase 2 (Scoring & Ranking):** The Coordinator sends the Global IDF back to the Workers. Each Worker calculates final scores ($TF \times IDF$) for its local documents. The Coordinator then gathers, sorts, and returns the top results.
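The math behind the two phases can be sketched in plain Java (no gRPC). All class and method names here are illustrative rather than this project's API, and the base-10 logarithm is an assumption, since the README does not pin down the IDF formula:

```java
import java.util.*;

public class TfIdfSketch {
    // Phase 1, worker side: how many documents in this shard contain the term?
    static int documentsContaining(String term, List<String> shard) {
        int n = 0;
        for (String doc : shard)
            if (Arrays.asList(doc.toLowerCase().split("\\W+")).contains(term)) n++;
        return n;
    }

    // Phase 1, coordinator side: aggregate per-shard counts into a global IDF.
    static double globalIdf(String term, List<List<String>> shards) {
        int totalDocs = 0, docsWithTerm = 0;
        for (List<String> shard : shards) {
            totalDocs += shard.size();
            docsWithTerm += documentsContaining(term, shard);
        }
        if (docsWithTerm == 0) return 0.0; // term absent everywhere
        return Math.log10((double) totalDocs / docsWithTerm);
    }

    // Phase 2, worker side: score one local document with the global IDF.
    static double score(String term, String doc, double idf) {
        String[] words = doc.toLowerCase().split("\\W+");
        long occurrences = Arrays.stream(words).filter(w -> w.equals(term)).count();
        double tf = (double) occurrences / words.length;
        return tf * idf;
    }
}
```

The key point the two phases enforce: IDF must be computed over the *whole* corpus, so no worker can finish scoring until the Coordinator has broadcast the aggregated value.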
---
## 📂 Dynamic Data Management
The system is designed with **Hot-Swappable Data** support. You can scale the dataset horizontally without any code modifications or system restarts:
* **Auto-Detection:** The Coordinator and Workers re-scan the `storage` directory on demand for every new search query.
* **Easy Expansion:** Simply drop new `.txt` files into the `storage` folder while the system is running.
* **Dynamic Partitioning:** The Coordinator automatically recalculates the document count and redistributes the new workload across the active workers instantly.
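A minimal sketch of what such an on-demand scan and repartition could look like (plain Java NIO; the class names and the round-robin assignment policy are assumptions, not this project's actual code):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

public class StoragePartitioner {
    // On-demand scan: list every .txt document currently in the storage folder,
    // so files dropped in at runtime are picked up by the very next query.
    static List<String> scanStorage(Path storageDir) throws IOException {
        try (Stream<Path> files = Files.list(storageDir)) {
            return files.map(p -> p.getFileName().toString())
                        .filter(name -> name.endsWith(".txt"))
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    // Split the current document list into near-equal shards,
    // one per active worker, assigning documents round-robin.
    static List<List<String>> partition(List<String> docs, int workers) {
        List<List<String>> shards = new ArrayList<>();
        for (int w = 0; w < workers; w++) shards.add(new ArrayList<>());
        for (int i = 0; i < docs.size(); i++) shards.get(i % workers).add(docs.get(i));
        return shards;
    }
}
```

Because the scan runs per query rather than at startup, adding or removing workers or documents only changes the shard assignment computed for the next search.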
---
## 🚦 Getting Started
### Prerequisites
* Java 17 JDK or higher.
* Maven 3.x.
* Apache ZooKeeper running on `localhost:2181`.
### Installation
1. Clone the repository:
```bash
git clone <your-gitlab-repo-link>
cd DistributedSearchEngine
```
2. Build the project (Fat JAR):
```bash
mvn clean package
```
### Running the Cluster
1. **Start ZooKeeper** (ensure it is active on `localhost:2181`).
2. **Start the Cluster Nodes** (run each in a separate terminal):
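The exact commands depend on the artifact name and entry points your build produces; the JAR name, class path, and ports below are placeholders in the same spirit as `<your-gitlab-repo-link>`, not this project's actual CLI:

```bash
# Placeholders: substitute the fat-JAR name from target/ and the real main class.
# Terminal 1 - first worker node
java -jar target/<your-fat-jar>.jar 8081

# Terminal 2 - second worker node (ZooKeeper elects one node as Coordinator)
java -jar target/<your-fat-jar>.jar 8082

# Terminal 3 - frontend gateway serving the web dashboard
java -jar target/<your-fat-jar>.jar 9000
```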