# ***NLP course project***
***Almouhannad Hafez + Mariam Khierbek***

**[Description](#description)**  
**[How to run](#how-to-run)**  
**[Part1. Data preprocessing](#part1-data-preprocessing)**  
**[Part2. Basic Morphological analyzer](#part2-basic-morpholgical-analyzer)**  
**[Part3. Lemmatization, POS Tagging, and N-Gram](#part3-lemmatization-pos-tagging-and-n-gram)**  
**[Results](#results)**

## ***Description***  
**Classifying free-text symptom descriptions into diseases**  
> [Dataset link](https://www.kaggle.com/datasets/niyarrbarman/symptom2disease)

### ***Main contents***  
- **Data folder**: Contains the dataset and the train/test sets
- **Constants.py**: Fixed values shared by the other files, exposed as the `CONSTANTS` class
- **.ipynb files**: Jupyter notebooks containing the actual work
- **Results.xlsx**: Excel worksheet containing the results
- **conda_nlp_environment.yml**: Conda environment file listing the required Python packages

## ***How to run***
> **Using [Anaconda](https://www.anaconda.com/)**  
1. **Clone this repository**  
    ```bash
    git clone git@git.hiast.edu.sy:almohanad.hafez/nlp-project.git
    ```
1. **Open Anaconda Prompt**  
1. **Navigate to the repository directory on your machine: `cd <path_to_repository_directory>/nlp-project`**  
    **On Windows, to switch to a drive other than `C:` (for example `D:`), run `D:` first**
1. **Create the Conda Environment from the .yml File**  
    ```bash
    conda env create -f conda_nlp_environment.yml
    ```
1. **Activate the New Environment**
    ```bash
    conda activate NLP
    ```
1. **Launch Jupyter Notebook**  
    ```bash
    jupyter notebook
    ```

## ***Part1. Data preprocessing***
**Files:**
> **`1.Data_Preprocessing.ipynb`**  
- **Applies the following preprocessing steps to the dataset:**
    1. Refactoring dataset schema
    1. Handling nulls/duplicates
    1. Shuffling
    1. Converting text to lowercase
    1. Expanding contractions
    1. Splitting into train/test sets
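
The steps above can be sketched in plain Python as follows. This is a minimal, dependency-free illustration and not the notebook's code: the `(label, text)` row layout, the tiny `CONTRACTIONS` table, and the `test_ratio` and `seed` defaults are all assumptions made for the example.

```python
import random
import re

# Hypothetical, deliberately tiny contraction table; a real run needs a fuller list.
CONTRACTIONS = {
    "i've": "i have",
    "i'm": "i am",
    "can't": "cannot",
    "don't": "do not",
    "it's": "it is",
}
_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b"
)


def clean_text(text):
    """Lowercase the text, then expand the contractions listed above."""
    text = text.lower()
    return _CONTRACTION_RE.sub(lambda match: CONTRACTIONS[match.group(1)], text)


def preprocess(rows, test_ratio=0.25, seed=42):
    """rows: iterable of (label, text) pairs. Returns (train, test) lists."""
    seen = set()
    kept = []
    for label, text in rows:
        if not label or not text:       # drop rows with missing values
            continue
        if (label, text) in seen:       # drop exact duplicates
            continue
        seen.add((label, text))
        kept.append((label, text))
    random.Random(seed).shuffle(kept)   # reproducible shuffle
    kept = [(label, clean_text(text)) for label, text in kept]
    n_test = int(len(kept) * test_ratio)
    return kept[n_test:], kept[:n_test]
```

Fixing the seed keeps the train/test split identical across runs, which matters when several notebooks are compared on the same split.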

## ***Part2. Basic Morpholgical analyzer***
**Files:**
> **`2.Stemmer.ipynb`**  
- **Runs the classification task on stemmed tokens, which includes:**
    1. Using `nltk` modules
    1. Tokenizing text
    1. Stemming tokens
    1. Removing stopwords
    1. Vectorizing using `TF-IDF`
    1. Training and evaluating a `Naive Bayes` classifier
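
To make the vectorization step concrete, here is a minimal sketch of tokenizing, removing stopwords, and computing TF-IDF weights with only the standard library. It is an illustration under stated assumptions, not the notebook's implementation: the `STOPWORDS` set is a hypothetical mini list, the IDF uses one common smoothed variant, and stemming and the classifier are left out.

```python
import math
import re
from collections import Counter

# Hypothetical mini stopword list, for illustration only.
STOPWORDS = {"i", "a", "an", "the", "have", "and", "on", "my", "is"}


def tokenize(text):
    """Split lowercased text into alphabetic tokens and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF: a term that occurs in every document still gets weight 1.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for doc in docs:
        weights = {term: tf * idf[term] for term, tf in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        # L2-normalize so that document length does not dominate the weights.
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


texts = ["I have a skin rash", "I have a fever and a cough", "The rash is itchy"]
vectors = tfidf([tokenize(text) for text in texts])
```

In the first document, `skin` receives a higher weight than `rash` because `rash` also appears in the third document and is therefore less distinctive.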

## ***Part3. Lemmatization, POS Tagging, and N-Gram***
**Files:**
> **`3.1.Lemmatizer.ipynb`**  
- **Runs the classification task on lemmatized tokens, comparing lemmatizers from different libraries:**
    1. `nltk`
    1. `SpaCy`
    1. `Stanza`  

> **`3.2.POS_Tagging_Filter.ipynb`**  
- **Runs the classification task on lemmas filtered by POS tag, keeping a single tag per experiment:**
    1. Testing ***Verbs*** only
    1. Testing ***Adjectives*** only
    1. Testing ***Nouns*** only
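
The single-tag filter can be illustrated as below. The sketch assumes the tagger has already produced `(lemma, tag)` pairs and that tags follow the Universal POS tag set (`VERB`, `ADJ`, `NOUN`); the example sentence is hand-tagged and the helper name `keep_tag` is hypothetical.

```python
def keep_tag(tagged_tokens, tag):
    """Return only the tokens whose POS tag equals `tag`."""
    return [token for token, token_tag in tagged_tokens if token_tag == tag]


# Hand-tagged example, standing in for real tagger output.
tagged = [
    ("i", "PRON"), ("feel", "VERB"), ("a", "DET"), ("sharp", "ADJ"),
    ("pain", "NOUN"), ("in", "ADP"), ("my", "PRON"), ("chest", "NOUN"),
]

# One filtered text per experiment, ready to be vectorized.
variants = {tag: " ".join(keep_tag(tagged, tag)) for tag in ("VERB", "ADJ", "NOUN")}
```

Each experiment then trains on only one of these filtered texts, which shows how much of the signal each part of speech carries on its own.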

> **`3.3.N-Grams.ipynb`**  
- **Runs the classification task using TF-IDF over n-grams of different sizes:**
    1. `1-Gram`
    1. `2-Gram`
    1. `3-Gram`
    1. `4-Gram`
    1. `5-Gram`
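
For reference, an n-gram is a run of `n` consecutive tokens. The helper below is a minimal sketch of how such features can be extracted before weighting; the function name is hypothetical, and whether each experiment uses n-grams of exactly size `n` or of all sizes up to `n` is not stated here, so the sketch shows the exact-size case.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, each joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "i have a dry cough".split()
bigrams = ngrams(tokens, 2)
five_grams = ngrams(tokens, 5)
```

A five-token sentence yields four 2-grams but only a single 5-gram, so longer n-grams produce fewer and rarer features per sentence.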

## ***Results***
| Case\\Criterion         | Accuracy(Train) | Accuracy(Test) | Precision(Test-Average) | Recall(Test-Average) | F1-Score(Test-Average) |
| ----------------------- | --------------- | -------------- | ----------------------- | -------------------- | ---------------------- |
| nltk stemmer            | 0.994211288     | 0.91991342     | 0.925513814             | 0.923767509          | 0.919308               |
| nltk lemmatizer         | 0.994211288     | 0.924242424    | 0.929407177             | 0.927885156          | 0.923453               |
| Stanza lemmatizer       | 0.994211288     | 0.928571429    | 0.932850383             | 0.931115994          | 0.927117               |
| SpaCy lemmatizer        | 0.995658466     | 0.928571429    | 0.934227363             | 0.931373992          | 0.928329               |
| Lemma + Verbs only      | 0.781476122     | 0.608225108    | 0.656473336             | 0.62431736           | 0.6131                 |
| Lemma + Adjectives only | 0.868306802     | 0.606060606    | 0.681515062             | 0.620177307          | 0.614097               |
| Lemma + Nouns only      | 0.97829233      | 0.876623377    | 0.886798865             | 0.880636574          | 0.873959               |
| Text + 1Gram            | 0.997105644     | 0.898268398    | 0.912889052             | 0.90487335           | 0.89945                |
| Text + 2Gram            | 0.998552822     | 0.885281385    | 0.894742015             | 0.891828538          | 0.883421               |
| Text + 3Gram            | 0.997105644     | 0.867965368    | 0.881810904             | 0.874727644          | 0.866752               |
| Text + 4Gram            | 1               | 0.800865801    | 0.848577524             | 0.814521589          | 0.809801               |
| Text + 5Gram            | 1               | 0.707792208    | 0.839340945             | 0.72337248           | 0.739326               |    
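
In every row the averaged test recall differs from the test accuracy. Micro-averaged or support-weighted recall would equal accuracy in a single-label task, so the "Average" columns are consistent with an unweighted (macro) average over classes. The sketch below shows how such scores can be computed under that assumption; it is not taken from the notebooks and the function name is hypothetical.

```python
def macro_scores(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall, macro F1)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    k = len(labels)
    # Macro average: every class counts equally, regardless of its size.
    return accuracy, sum(precisions) / k, sum(recalls) / k, sum(f1s) / k


y_true = ["flu", "flu", "flu", "acne"]
y_pred = ["flu", "flu", "acne", "acne"]
accuracy, precision, recall, f1 = macro_scores(y_true, y_pred)
```

Because macro F1 is the mean of the per-class F1 scores, it need not equal the harmonic mean of the macro precision and macro recall shown in the same row.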