# ***NLP course project***
***Almouhannad Hafez + Mariam Khierbek***

## ***Contents***  
**[Description](#description)**  
**[How to run](#how-to-run)**  
**[Part1. Data preprocessing](#part1-data-preprocessing)**  
**[Part2. Basic Morphological analyzer](#part2-basic-morpholgical-analyzer)**  
**[Part3. Lemmatization, POS Tagging, and N-Gram](#part3-lemmatization-pos-tagging-and-n-gram)**  
**[Part4. Data augmentation](#part4-data-augmentation)**  
**[Part5. Dependency tree](#part5-dependency-tree)**  
**[Results](#results)**

## ***Description***  
**Classifying a symptom description (free text) into a disease**  
> [Dataset link](https://www.kaggle.com/datasets/niyarrbarman/symptom2disease)

### ***Main contents***  
- **Data folder**: Contains the dataset and the train and test sets
- **Constants.py**: Fixed values shared by the other files, exposed as the `CONSTANTS` class
- **Other .ipynb files**: Jupyter notebooks containing the actual work
- **Results.xlsx**: Excel workbook containing the results
- **conda_nlp_environment.yml**: Conda environment file listing the required Python packages

## ***How to run***
> **Using [Anaconda](https://www.anaconda.com/)**  
1. **Clone this repository**  
    ```bash
        git clone git@git.hiast.edu.sy:almohanad.hafez/nlp-project.git
    ```
1. **Open Anaconda Prompt**  
1. **Navigate to the repository directory on your machine: `cd <path_to_repository_directory>/nlp-project`**  
    **On Windows, to switch to a drive other than `C:` (for example `D:`), run `D:` first**
1. **Create the Conda Environment from the .yml File**  
    ```bash
        conda env create -f conda_nlp_environment.yml
    ```
1. **Activate the New Environment**
    ```bash
        conda activate NLP
    ```
1. **Launch Jupyter Notebook**    
    ```bash
        jupyter notebook
    ```

## ***Part1. Data preprocessing***
**Files:**
> **`1.1.Dataset_Overview.ipynb`**  
- **Some interesting details about the dataset used**

> **`1.2.Data_Preprocessing.ipynb`**  
- **Applying preprocessing steps to the dataset, including:**
    1. Refactoring dataset schema
    1. Handling nulls/duplicates
    1. Shuffling
    1. Converting text to lowercase
    1. Expanding contractions
    1. Splitting into train/test sets
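
The steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual code: the `CONTRACTIONS` map, the `preprocess` helper, the `(label, text)` row layout, and the 80/20 split are all assumptions made for the example.

```python
import random
import re

# Tiny illustrative contraction map (assumption: the notebook may use a fuller list).
CONTRACTIONS = {
    "i've": "i have",
    "i'm": "i am",
    "can't": "cannot",
    "don't": "do not",
    "it's": "it is",
}
_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b"
)


def expand_contractions(text):
    """Replace each known contraction in lowercase text with its expansion."""
    return _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0)], text)


def preprocess(rows, test_ratio=0.2, seed=42):
    """rows: list of (label, text) pairs. Returns (train, test) lists."""
    # Drop rows with a missing label or text, then exact duplicates
    # (the first occurrence is kept).
    complete = [(label, text) for label, text in rows if label and text]
    unique = list(dict.fromkeys(complete))
    # Shuffle reproducibly.
    random.Random(seed).shuffle(unique)
    # Lowercase, then expand contractions.
    cleaned = [(label, expand_contractions(text.lower())) for label, text in unique]
    # Hold out the first `test_ratio` share of rows as the test set.
    n_test = int(len(cleaned) * test_ratio)
    return cleaned[n_test:], cleaned[:n_test]
```

Lowercasing before expanding keeps the contraction map small, since only lowercase keys are needed.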

## ***Part2. Basic Morpholgical analyzer***
**Files:**
> **`2.Stemmer.ipynb`**  
- **Running the classification task, which includes:**
    1. Using `nltk` modules
    1. Tokenizing text
    1. Stemming tokens
    1. Removing stopwords
    1. Vectorizing using `TF-IDF`
    1. Training and evaluating a `Naive Bayes` classifier
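
A minimal sketch of such a pipeline is shown below. Several choices are assumptions made to keep the example self-contained: a regex tokenizer and scikit-learn's built-in English stopword list stand in for the `nltk` tokenizer and stopwords, `MultinomialNB` is assumed as the Naive Bayes variant, and the four training sentences are invented toy data. The value `alpha=0.1` is taken from the Notes column of the results tables.

```python
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

stemmer = PorterStemmer()


def stem_tokens(text):
    """Tokenize on letter runs, drop stopwords, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(tok) for tok in tokens if tok not in ENGLISH_STOP_WORDS]


# Toy data for illustration only; not taken from the project dataset.
train_texts = [
    "itchy red rash on my skin",
    "skin rash with itching and red patches",
    "runny nose and constant sneezing",
    "sneezing with a sore throat and runny nose",
]
train_labels = ["Psoriasis", "Psoriasis", "Common Cold", "Common Cold"]

# Passing a callable as `analyzer` lets TF-IDF work directly on the stemmed tokens.
model = make_pipeline(
    TfidfVectorizer(analyzer=stem_tokens),
    MultinomialNB(alpha=0.1),
)
model.fit(train_texts, train_labels)
prediction = model.predict(["my skin is itching and red"])[0]
```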

## ***Part3. Lemmatization, POS Tagging, and N-Gram***
**Files:**
> **`3.1.Lemmatizer.ipynb`**  
- **Running the classification task on lemmatized tokens, comparing lemmatizers from different libraries:**
    1. `nltk`
    1. `SpaCy`
    1. `Stanza`

> **`3.2.POS_Tagging_Filter.ipynb`**  
- **Running the classification task with a POS tagger, keeping only tokens of a single tag. This includes:**
    1. Testing ***Verbs*** only
    1. Testing ***Adjectives*** only
    1. Testing ***Nouns*** only
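
The filtering idea can be sketched without a tagger by starting from tokens that are already tagged. The `keep_tag` helper and the hand-tagged sentence below are hypothetical; in the notebook a POS tagger would produce the `(lemma, tag)` pairs.

```python
def keep_tag(tagged_tokens, upos):
    """Join the lemmas whose universal POS tag equals `upos`."""
    return " ".join(lemma for lemma, tag in tagged_tokens if tag == upos)


# Hand-tagged example of (lemma, UPOS) pairs for "I have a red itchy rash on my arm".
tagged = [
    ("I", "PRON"), ("have", "VERB"), ("a", "DET"), ("red", "ADJ"),
    ("itchy", "ADJ"), ("rash", "NOUN"), ("on", "ADP"), ("my", "PRON"),
    ("arm", "NOUN"),
]

verbs_only = keep_tag(tagged, "VERB")
adjectives_only = keep_tag(tagged, "ADJ")
nouns_only = keep_tag(tagged, "NOUN")
```

Each filtered string can then be vectorized and classified in the same way as the full text.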

> **`3.3.N-Grams.ipynb`**  
- **Running the classification task on TF-IDF features built from different n-gram ranges**
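
As a rough illustration of what the n-gram range changes, the sketch below fits scikit-learn's `TfidfVectorizer` with two different `ngram_range` values and lists the extracted features. The two documents and the `ngram_vocabulary` helper are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents for illustration only.
docs = ["red itchy rash", "runny nose"]


def ngram_vocabulary(ngram_range):
    """Fit a TF-IDF vectorizer and return the n-gram features it extracted."""
    vectorizer = TfidfVectorizer(ngram_range=ngram_range)
    vectorizer.fit(docs)
    return sorted(vectorizer.vocabulary_)


unigrams_and_bigrams = ngram_vocabulary((1, 2))
bigrams_and_trigrams = ngram_vocabulary((2, 3))
```

A range such as `(1, 2)` keeps single words alongside word pairs, while `(2, 3)` drops the single words entirely.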

## ***Part4. Data augmentation***
**Files:**
> **`4.data_augmentation.ipynb`**  
- **Augmenting the original dataset: for each original row, 5 rephrased rows were generated with the LLM `Llama 3`**

![Augmentation_effect](./images/Augmentation_effect.png)

## ***Part5. Dependency tree***
**Files:**
> **`5.0.Process_texts_stanza.ipynb`**  
- **Running the Stanza pipeline with the `tokenize,mwt,pos,lemma,depparse` processors and storing the results for later use**

> **`5.1.Dep_parsing_classifier.ipynb`**  
- **Running the classification task using dependency-tree features expressed as tuples, which includes using:**
    1. `(head_word, dependent_word, dependency_relation)`
        - Syntactic relations between the words of a sentence
    1. `(head_pos, dependent_pos, dependency_relation)`
        - pos: Part Of Speech
    1. `(root_word, "ROOT")`
        - i.e. the head (root) word of each sentence
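
The three tuple families can be sketched from a single parsed sentence. The `dependency_features` helper and the hand-written parse are hypothetical; the dict keys (`id`, `text`, `upos`, `head`, `deprel`) are chosen to resemble the fields Stanza attaches to each word, with `head` assumed to be the 1-based id of the governing word and `0` marking the root.

```python
def dependency_features(words):
    """Build word, POS, and root tuples from one dependency-parsed sentence."""
    by_id = {w["id"]: w for w in words}
    word_tuples, pos_tuples, root_tuples = [], [], []
    for w in words:
        if w["head"] == 0:
            # The root has no governing word, so it only yields a ROOT tuple.
            root_tuples.append((w["text"], "ROOT"))
            continue
        head = by_id[w["head"]]
        word_tuples.append((head["text"], w["text"], w["deprel"]))
        pos_tuples.append((head["upos"], w["upos"], w["deprel"]))
    return word_tuples, pos_tuples, root_tuples


# Hand-written parse of "my skin itches" (illustrative, not parser output).
sentence = [
    {"id": 1, "text": "my", "upos": "PRON", "head": 2, "deprel": "nmod:poss"},
    {"id": 2, "text": "skin", "upos": "NOUN", "head": 3, "deprel": "nsubj"},
    {"id": 3, "text": "itches", "upos": "VERB", "head": 0, "deprel": "root"},
]

word_tuples, pos_tuples, root_tuples = dependency_features(sentence)
```

Each tuple can then be treated as a single token when building the feature vectors.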

## ***Results***

> ***Using augmented dataset*** 

| Case\\Criterion                                    | Accuracy(Train) | Accuracy(Test) | Precision(Test-Average) | Recall(Test-Average) | F1-Score(Test-Average) | Notes                      |
| -------------------------------------------------- | --------------- | -------------- | ----------------------- | -------------------- | ---------------------- | -------------------------- |
| nltk stemmer                                       | 0.9629          | 0.9524         | 0.9513                  | 0.9522               | 0.9509                 | alpha=0.1, 300 features    |
| nltk lemmatizer                                    | 0.9832          | 0.9699         | 0.9703                  | 0.9699               | 0.9696                 | alpha=0.1, 700 features    |
| Stanza lemmatizer                                  | 0.9783          | 0.9671         | 0.9673                  | 0.9672               | 0.9668                 | alpha=0.1, 550 features    |
| SpaCy lemmatizer                                   | 0.9776          | 0.9657         | 0.9655                  | 0.9656               | 0.9652                 | alpha=0.1, 550 features    |
| Lemma + Verbs only                                 | 0.7106          | 0.6321         | 0.6293                  | 0.6278               | 0.6214                 | alpha=0.1, 400 features    |
| Lemma + Adjectives only                            | 0.7990          | 0.7357         | 0.7383                  | 0.7351               | 0.7299                 | alpha=0.1, 450 features    |
| Lemma + Nouns only                                 | 0.9678          | 0.9419         | 0.9406                  | 0.9419               | 0.9406                 | alpha=0.1, 600 features    |
| Text + (1,2)Gram                                   | 0.9965          | 0.9800         | 0.9801                  | 0.9799               | 0.9798                 | alpha=0.01, 3100 features  |
| Text + (1,3)Gram                                   | 0.9960          | 0.9807         | 0.9806                  | 0.9805               | 0.9803                 | alpha=0.01, 6600 features  |
| Text + (1,4)Gram                                   | 0.9967          | 0.9807         | 0.9802                  | 0.9805               | 0.9802                 | alpha=0.01, 12100 features |
| Text + (2,3)Gram                                   | 0.9951          | 0.9695         | 0.9688                  | 0.9694               | 0.9688                 | alpha=0.01, 9100 features  |
| Text + (2,4)Gram                                   | 0.9951          | 0.9646         | 0.9634                  | 0.9645               | 0.9635                 | alpha=0.01, 14100 features |
| Stanza Dep. Relation tuples                        | 0.9984          | 0.9781         | 0.9783                  | 0.9784               | 0.9781                 | alpha=0.01, 7000 features  |
| Stanza Dep.Relation+POS Relations+Headwords tuples | 0.9981          | 0.9747         | 0.9747                  | 0.9749               | 0.9745                 | alpha=0.01, 8000 features  |

---

> ***With feature selection and model hyperparameter tuning*** 

| Case\\Criterion         | Accuracy(Train) | Accuracy(Test) | Precision(Test-Average) | Recall(Test-Average) | F1-Score(Test-Average) | Notes                      |
| ----------------------- | --------------- | -------------- | ----------------------- | -------------------- | ---------------------- | -------------------------- |
| nltk stemmer            | 0.9783          | 0.9416         | 0.9406                  | 0.9392               | 0.9377                 | alpha=0.1, 300 features    |
| nltk lemmatizer         | 0.9957          | 0.9567         | 0.9578                  | 0.9561               | 0.9558                 | alpha=0.1, 800 features    |
| Stanza lemmatizer       |                 |                |                         |                      |                        |                            |
| SpaCy lemmatizer        | 0.9957          | 0.9545         | 0.9557                  | 0.9537               | 0.9536                 | alpha=0.1, 750 features    |
| Lemma + Verbs only      | 0.7438          | 0.6082         | 0.6461                  | 0.6166               | 0.6098                 | alpha=0.1, 150 features    |
| Lemma + Adjectives only | 0.7496          | 0.5974         | 0.6807                  | 0.6065               | 0.6143                 | alpha=0.1, 150 features    |
| Lemma + Nouns only      | 0.9826          | 0.9026         | 0.9056                  | 0.9031               | 0.9001                 | alpha=0.1, 400 features    |
| Text + (1,2)Gram        | 1.0000          | 0.9719         | 0.9728                  | 0.9726               | 0.9716                 | alpha=0.01, 7100 features  |
| Text + (1,3)Gram        | 1.0000          | 0.9675         | 0.9692                  | 0.9678               | 0.9675                 | alpha=0.01, 18600 features |
| Text + (1,4)Gram        | 1.0000          | 0.9675         | 0.9693                  | 0.9681               | 0.9675                 | alpha=0.01, 32100 features |
| Text + (2,3)Gram        | 1.0000          | 0.9502         | 0.9530                  | 0.9508               | 0.9501                 | alpha=0.01, 17100 features |
| Text + (2,4)Gram        | 1.0000          | 0.9502         | 0.9523                  | 0.9504               | 0.9492                 | alpha=0.01, 30600 features |

---
> ***Without feature selection***  

| Case\\Criterion         | Accuracy(Train) | Accuracy(Test) | Precision(Test-Average) | Recall(Test-Average) | F1-Score(Test-Average) |
| ----------------------- | --------------- | -------------- | ----------------------- | -------------------- | ---------------------- |
| nltk stemmer            | 0.9942          | 0.9199         | 0.9255                  | 0.9238               | 0.9193                 |
| nltk lemmatizer         | 0.9942          | 0.9242         | 0.9294                  | 0.9279               | 0.9235                 |
| Stanza lemmatizer       | 0.9942          | 0.9286         | 0.9329                  | 0.9311               | 0.9271                 |
| SpaCy lemmatizer        | 0.9957          | 0.9286         | 0.9342                  | 0.9314               | 0.9283                 |
| Lemma + Verbs only      | 0.7815          | 0.6082         | 0.6565                  | 0.6243               | 0.6131                 |
| Lemma + Adjectives only | 0.8683          | 0.6061         | 0.6815                  | 0.6202               | 0.6141                 |
| Lemma + Nouns only      | 0.9783          | 0.8766         | 0.8868                  | 0.8806               | 0.8740                 |
| Text + 1Gram            | 0.9971          | 0.8983         | 0.9129                  | 0.9049               | 0.8995                 |
| Text + 2Gram            | 0.9986          | 0.8853         | 0.8947                  | 0.8918               | 0.8834                 |
| Text + 3Gram            | 0.9971          | 0.8680         | 0.8818                  | 0.8747               | 0.8668                 |
| Text + 4Gram            | 1.0000          | 0.8009         | 0.8486                  | 0.8145               | 0.8098                 |
| Text + 5Gram            | 1.0000          | 0.7078         | 0.8393                  | 0.7234               | 0.7393                 |