almohanad.hafez / NLP-Project
Commit b2228e32, authored Nov 22, 2024 by Almouhannad Hafez
(5) Use N-Grams with dep. parsing features
Parent: 28e67496
Showing 3 changed files with 937 additions and 768 deletions:
5/5.1.Dep_parsing_classifier.ipynb (+936, -768)
README.md (+1, -0)
Results.xlsx (+0, -0)
5/5.1.Dep_parsing_classifier.ipynb
...
...
@@ -185,13 +185,53 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"#### ***Usage example***"
"### ***Features extraction***\n",
"**Dependency Relation Tuples:** \n",
"- `(head_word, dependent_word, dependency_relation)`\n",
"- `n1 -> n2 grams`"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"def extract_dependency_features_with_n_grams(row_id, n1=1, n2=2):\n",
" doc = get_doc_by_id(row_id)\n",
" feature_tuples = set() # Use a set to avoid duplicates\n",
" \n",
" # Extract dependency relations\n",
" for sentence in doc.sentences:\n",
" for word in sentence.words:\n",
" # Dependency relation tuples\n",
" if word.head > 0: # If not root\n",
" head = sentence.words[word.head - 1] # Adjust head index\n",
" feature_tuples.add((head.lemma, word.lemma, word.deprel))\n",
" \n",
" # Extract n-grams from n1 to n2\n",
" for sentence in doc.sentences:\n",
" words = [word.lemma for word in sentence.words]\n",
" \n",
" for n in range(n1, n2 + 1): # Loop from n1 to n2 inclusive\n",
" for i in range(len(words) - n + 1):\n",
" n_gram = tuple(words[i:i+n]) # Create a tuple for the n-gram\n",
" feature_tuples.add(n_gram)\n",
" \n",
" return feature_tuples"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### ***Usage example***"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
...
...
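The extractor added in this hunk puts two kinds of tuples into one set: `(head lemma, dependent lemma, relation)` triples and every n-gram with `n1 <= n <= n2`. A minimal sketch of that logic on a hand-written toy parse is given here, so that it runs without Stanza or the notebook's `get_doc_by_id`. The `Token` type, the `extract_features` name, and the toy sentence are assumptions made for illustration; only the 1-based head index with 0 for the root mirrors what the notebook code relies on.

```python
from typing import List, NamedTuple, Set, Tuple


class Token(NamedTuple):
    """Hypothetical stand-in for a parsed word (not a Stanza class)."""
    lemma: str
    head: int    # 1-based index of the head token; 0 marks the root
    deprel: str


def extract_features(sentence: List[Token], n1: int = 1, n2: int = 2) -> Set[Tuple[str, ...]]:
    features: Set[Tuple[str, ...]] = set()

    # Dependency triples: skip the root, which has no head word.
    for token in sentence:
        if token.head > 0:
            head = sentence[token.head - 1]  # convert 1-based head to 0-based index
            features.add((head.lemma, token.lemma, token.deprel))

    # Contiguous lemma n-grams for every n in [n1, n2].
    lemmas = [token.lemma for token in sentence]
    for n in range(n1, n2 + 1):
        for i in range(len(lemmas) - n + 1):
            features.add(tuple(lemmas[i:i + n]))

    return features


# Toy parse of "skin itch badly" (hand-written, not parser output).
toy = [
    Token("skin", 2, "nsubj"),
    Token("itch", 0, "root"),
    Token("badly", 2, "advmod"),
]
features = extract_features(toy, n1=1, n2=2)
print(sorted(features))
```

Because both feature kinds share one set of plain string tuples, a trigram and a dependency triple made of the same three strings would collapse into a single feature. That is unlikely with relation labels such as `nsubj`, but prefixing one kind with a tag would rule it out if it ever matters.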
@@ -207,7 +247,7 @@
},
{
"cell_type": "code",
"execution_count": 9
,
"execution_count": 10
,
"metadata": {},
"outputs": [
{
...
...
@@ -235,7 +275,7 @@
" ('painful', '.', 'punct')]"
]
},
"execution_count": 9
,
"execution_count": 10
,
"metadata": {},
"output_type": "execute_result"
}
...
...
@@ -253,7 +293,7 @@
},
{
"cell_type": "code",
"execution_count": 10
,
"execution_count": 11
,
"metadata": {},
"outputs": [],
"source": [
...
...
@@ -272,7 +312,7 @@
},
{
"cell_type": "code",
"execution_count": 11
,
"execution_count": 12
,
"metadata": {},
"outputs": [],
"source": [
...
...
@@ -294,7 +334,7 @@
},
{
"cell_type": "code",
"execution_count": 12
,
"execution_count": 13
,
"metadata": {},
"outputs": [],
"source": [
...
...
@@ -361,7 +401,7 @@
},
{
"cell_type": "code",
"execution_count": 13
,
"execution_count": 14
,
"metadata": {},
"outputs": [],
"source": [
...
...
@@ -382,7 +422,7 @@
},
{
"cell_type": "code",
"execution_count": 14
,
"execution_count": 15
,
"metadata": {},
"outputs": [],
"source": [
...
...
@@ -422,7 +462,7 @@
},
{
"cell_type": "code",
"execution_count": 15
,
"execution_count": 16
,
"metadata": {},
"outputs": [],
"source": [
...
...
@@ -439,7 +479,7 @@
},
{
"cell_type": "code",
"execution_count": 16
,
"execution_count": 17
,
"metadata": {},
"outputs": [],
"source": [
...
...
@@ -460,7 +500,7 @@
},
{
"cell_type": "code",
"execution_count": 17
,
"execution_count": 18
,
"metadata": {},
"outputs": [
{
...
...
@@ -482,7 +522,7 @@
},
{
"cell_type": "code",
"execution_count": 18
,
"execution_count": 19
,
"metadata": {},
"outputs": [],
"source": [
...
...
@@ -499,7 +539,7 @@
},
{
"cell_type": "code",
"execution_count": 19
,
"execution_count": 20
,
"metadata": {},
"outputs": [
{
...
...
@@ -536,7 +576,7 @@
},
{
"cell_type": "code",
"execution_count": 20
,
"execution_count": 21
,
"metadata": {},
"outputs": [
{
...
...
@@ -573,7 +613,7 @@
},
{
"cell_type": "code",
"execution_count": 21
,
"execution_count": 22
,
"metadata": {},
"outputs": [
{
...
...
@@ -634,7 +674,7 @@
},
{
"cell_type": "code",
"execution_count": 22
,
"execution_count": 23
,
"metadata": {},
"outputs": [
{
...
...
@@ -667,7 +707,7 @@
},
{
"cell_type": "code",
"execution_count": 23
,
"execution_count": 24
,
"metadata": {},
"outputs": [
{
...
...
@@ -697,7 +737,7 @@
},
{
"cell_type": "code",
"execution_count": 24
,
"execution_count": 25
,
"metadata": {},
"outputs": [
{
...
...
@@ -745,6 +785,134 @@
"model = MultinomialNB(alpha=0.01)\n",
"evaluate_model(X_train, X_test, y_train, y_test, chi2, 7500, model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# ***Use Dependecy features with N-Grams***\n"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train set shape after features extraction: (4320, 84052)\n",
"Test set shape after features extraction: (480, 84052)\n"
]
}
],
"source": [
"n1 = 1\n",
"n2 = 3\n",
"def extractor(row_id):\n",
" return extract_dependency_features_with_n_grams(row_id, n1, n2)\n",
"\n",
"\n",
"train_df[\"features\"] = train_df[\"Id\"].apply(extractor)\n",
"test_df[\"features\"] = test_df[\"Id\"].apply(extractor)\n",
"\n",
"all_features_train = train_df['features'].apply(flatten_tuples)\n",
"all_features_flat_train = [' '.join(features) for features in all_features_train]\n",
"all_features_test = test_df['features'].apply(flatten_tuples)\n",
"all_features_flat_test = [' '.join(features) for features in all_features_test]\n",
"\n",
"vectorizer = TfidfVectorizer()\n",
"X_train = vectorizer.fit_transform(all_features_flat_train)\n",
"X_test = vectorizer.transform(all_features_flat_test)\n",
"print(f\"Train set shape after features extraction: {X_train.shape}\")\n",
"print(f\"Test set shape after features extraction: {X_test.shape}\")\n",
"\n",
"y_train = train_df[\"label\"]\n",
"y_test = test_df[\"label\"]"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Closest Point 1: Number of Features = 66000, Train Accuracy = 1.0, Test Accuracy = 0.975\n",
"Closest Point 2: Number of Features = 68000, Train Accuracy = 1.0, Test Accuracy = 0.975\n",
"Closest Point 3: Number of Features = 67000, Train Accuracy = 1.0, Test Accuracy = 0.975\n"
]
},
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 1000x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Make sure to update step in plotting before running this cell\n",
"model = MultinomialNB(alpha = 0.01)\n",
"plot_accuracies(X_train, X_test, y_train, y_test, model)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train Accuracy: 1.0\n",
"Test Accuracy: 0.975\n",
"Difference: 0.025000000000000022\n",
"+---------------------------------+--------------------+--------------------+------------+-----------+\n",
"| Class | Precision | Recall | F1-score | Support |\n",
"|---------------------------------+--------------------+--------------------+------------+-----------|\n",
"| Acne | 1.0 | 1.0 | 1 | 21.0 |\n",
"| Arthritis | 1.0 | 1.0 | 1 | 20.0 |\n",
"| Bronchial Asthma | 1.0 | 0.9473684210526315 | 0.972973 | 19.0 |\n",
"| Cervical spondylosis | 0.9130434782608695 | 1.0 | 0.954545 | 21.0 |\n",
"| Chicken pox | 0.75 | 1.0 | 0.857143 | 15.0 |\n",
"| Common Cold | 1.0 | 1.0 | 1 | 21.0 |\n",
"| Dengue | 1.0 | 0.7727272727272727 | 0.871795 | 22.0 |\n",
"| Dimorphic Hemorrhoids | 1.0 | 1.0 | 1 | 19.0 |\n",
"| Fungal infection | 1.0 | 1.0 | 1 | 26.0 |\n",
"| Hypertension | 1.0 | 0.9444444444444444 | 0.971429 | 18.0 |\n",
"| Impetigo | 0.9583333333333334 | 1.0 | 0.978723 | 23.0 |\n",
"| Jaundice | 1.0 | 1.0 | 1 | 22.0 |\n",
"| Malaria | 1.0 | 1.0 | 1 | 17.0 |\n",
"| Migraine | 1.0 | 1.0 | 1 | 24.0 |\n",
"| Pneumonia | 1.0 | 1.0 | 1 | 22.0 |\n",
"| Psoriasis | 1.0 | 0.8823529411764706 | 0.9375 | 17.0 |\n",
"| Typhoid | 0.9473684210526315 | 1.0 | 0.972973 | 18.0 |\n",
"| Varicose Veins | 1.0 | 0.96 | 0.979592 | 25.0 |\n",
"| allergy | 0.9375 | 1.0 | 0.967742 | 15.0 |\n",
"| diabetes | 1.0 | 0.9411764705882353 | 0.969697 | 17.0 |\n",
"| drug reaction | 1.0 | 1.0 | 1 | 16.0 |\n",
"| gastroesophageal reflux disease | 0.9545454545454546 | 1.0 | 0.976744 | 21.0 |\n",
"| peptic ulcer disease | 1.0 | 0.9444444444444444 | 0.971429 | 18.0 |\n",
"| urinary tract infection | 0.9583333333333334 | 1.0 | 0.978723 | 23.0 |\n",
"| accuracy | | | 0.975 | |\n",
"| macro avg | 0.9757968341885676 | 0.9746880831013959 | 0.973375 | |\n",
"| weighted avg | 0.9784746510441948 | 0.975 | 0.975031 | |\n",
"+---------------------------------+--------------------+--------------------+------------+-----------+\n"
]
}
],
"source": [
"model = MultinomialNB(alpha=0.01)\n",
"evaluate_model(X_train, X_test, y_train, y_test, chi2, 66000, model)"
]
}
],
"metadata": {
...
...
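The `macro avg` and `weighted avg` rows of the classification report in this hunk can be reproduced by hand from the per-class precision and support columns. The sketch below does so with the values copied from that report. The variable names are illustrative, and the report itself is presumably produced inside the notebook's `evaluate_model`, whose body is not part of this diff.

```python
# (precision, support) per class, copied from the report in this commit.
per_class = [
    (1.0, 21),                 # Acne
    (1.0, 20),                 # Arthritis
    (1.0, 19),                 # Bronchial Asthma
    (0.9130434782608695, 21),  # Cervical spondylosis
    (0.75, 15),                # Chicken pox
    (1.0, 21),                 # Common Cold
    (1.0, 22),                 # Dengue
    (1.0, 19),                 # Dimorphic Hemorrhoids
    (1.0, 26),                 # Fungal infection
    (1.0, 18),                 # Hypertension
    (0.9583333333333334, 23),  # Impetigo
    (1.0, 22),                 # Jaundice
    (1.0, 17),                 # Malaria
    (1.0, 24),                 # Migraine
    (1.0, 22),                 # Pneumonia
    (1.0, 17),                 # Psoriasis
    (0.9473684210526315, 18),  # Typhoid
    (1.0, 25),                 # Varicose Veins
    (0.9375, 15),              # allergy
    (1.0, 17),                 # diabetes
    (1.0, 16),                 # drug reaction
    (0.9545454545454546, 21),  # gastroesophageal reflux disease
    (1.0, 18),                 # peptic ulcer disease
    (0.9583333333333334, 23),  # urinary tract infection
]

precisions = [p for p, _ in per_class]
supports = [s for _, s in per_class]

# Macro average: every class counts equally.
macro_precision = sum(precisions) / len(precisions)

# Weighted average: each class counts in proportion to its support.
weighted_precision = sum(p * s for p, s in per_class) / sum(supports)

print(f"test samples:           {sum(supports)}")
print(f"macro avg precision:    {macro_precision:.4f}")
print(f"weighted avg precision: {weighted_precision:.4f}")
```

The supports sum to 480, the test-set size printed earlier in the notebook. The weighted average sits slightly above the macro average here because the weakest class, Chicken pox at 0.75 precision, is also one of the smallest.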
README.md
...
...
@@ -124,6 +124,7 @@
| Text + (2,4)Gram | 0.9975 | 0.9375 | 6.0 | 0.9366 | 0.9334 | 0.9311 | alpha=0.01, 16600 features |
| Stanza Dep. Relation tuples | 0.9995 | 0.9521 | 4.7 | 0.9513 | 0.9503 | 0.9484 | alpha=0.01, 8000 features |
| Stanza Dep. Relation + POS Relations + Headwords tuples | 0.9986 | 0.9479 | 5.1 | 0.9481 | 0.9471 | 0.9440 | alpha=0.01, 7500 features |
| Stanza Dep. Relation tuples + (1,3) Grams | 1.0000 | 0.9750 | 2.5 | 0.9758 | 0.9747 | 0.9734 | alpha=0.01, 66000 features |
---
> ***Applied features selection and model's hyperparameters tuning***
...
...
Results.xlsx
No preview for this file type