Commit 3f1886ba authored by Almouhannad Hafez

(1) (2) Add expand contractions to preprocessing

parent 2f6f6c6b
......@@ -17,6 +17,7 @@
"from sklearn.utils import shuffle\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"import contractions\n",
"\n",
"from constants import CONSTANTS"
]
......@@ -521,13 +522,100 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### ***7- Split into Train-Test and save in .csv files***"
"### ***7- Expand contractions***\n",
"**i.e. I'm => I am**"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>text</th>\n",
" <th>label</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>i have had back pain, a cough that will not go...</td>\n",
" <td>cervical spondylosis</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>recently, i have been having problems using th...</td>\n",
" <td>dimorphic hemorrhoids</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>i have been exhausted and experiencing nausea ...</td>\n",
" <td>jaundice</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>i have had a nagging cough that will not go aw...</td>\n",
" <td>bronchial asthma</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>i have been quite exhausted and ill. my throat...</td>\n",
" <td>common cold</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" text label\n",
"0 i have had back pain, a cough that will not go... cervical spondylosis\n",
"1 recently, i have been having problems using th... dimorphic hemorrhoids\n",
"2 i have been exhausted and experiencing nausea ... jaundice\n",
"3 i have had a nagging cough that will not go aw... bronchial asthma\n",
"4 i have been quite exhausted and ill. my throat... common cold"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['text'] = df['text'].apply(contractions.fix)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### ***8- Split into Train-Test and save in .csv files***"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
......@@ -546,7 +634,7 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
......
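Review note: the new step 7 calls `contractions.fix` on every row of `df['text']`. As a rough illustration of what expanding contractions means, here is a minimal standard-library sketch. The `CONTRACTION_MAP` table and the `expand_contractions` helper are hypothetical names introduced only for this note. They are not the `contractions` package's implementation, which covers far more forms. The sketch also assumes the text is already lowercased, as it appears to be in the `df.head()` output.

```python
import re

# Hypothetical, deliberately tiny mapping for illustration only;
# the real `contractions` package handles many more forms.
CONTRACTION_MAP = {
    "i'm": "i am",
    "i've": "i have",
    "it's": "it is",
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
}

# One alternation over all known contractions, matched on word boundaries.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text: str) -> str:
    """Replace each known contraction with its (lowercase) expanded form."""
    return _PATTERN.sub(lambda m: CONTRACTION_MAP[m.group(0).lower()], text)


sample = "i've had a cough that won't go away and i can't sleep"
expanded = expand_contractions(sample)
print(expanded)
```

Words that are not in the mapping pass through unchanged, so the step is safe to apply to text that contains no contractions.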
......@@ -33,27 +33,23 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Now, we have to download `Punkt Tokenizer Model`, try running following cell, if it didn't work successfully then try to download model manually from following links: [Manual installation](https://www.nltk.org/data.html), and [Model link](https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt.zip).\n"
"##### Now, we have to download `Punkt Tokenizer Model`, try running following cell, if it didn't work successfully then try to download model manually from following links: [Manual installation](https://www.nltk.org/data.html), and [Model link](https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt.zip)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"cell_type": "markdown",
"metadata": {},
"outputs": [],
"source": [
"nltk.download('punkt')"
"**Uncomment if you haven't already**"
]
},
{
"cell_type": "markdown",
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"##### You must see an output similar to the following output:\n",
"> `[nltk_data] Downloading package punkt to` \n",
"> `[nltk_data] ...\\AppData\\Roaming\\nltk_data...` \n",
"> `[nltk_data] Package punkt is already up-to-date!` \n",
"> `True`"
"# nltk.download('punkt')"
]
},
{
......@@ -62,7 +58,7 @@
"metadata": {},
"outputs": [],
"source": [
"nltk.download('stopwords')"
"# nltk.download('stopwords')"
]
},
{
......@@ -71,7 +67,18 @@
"metadata": {},
"outputs": [],
"source": [
"nltk.download('punkt_tab')"
"# nltk.download('punkt_tab')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### You must see an output similar to the following output:\n",
"> `[nltk_data] Downloading package punkt to` \n",
"> `[nltk_data] ...\\AppData\\Roaming\\nltk_data...` \n",
"> `[nltk_data] Package punkt is already up-to-date!` \n",
"> `True`"
]
},
{
......@@ -705,11 +712,11 @@
"| acne | 0.9047619047619048 | 1.0 | 0.95 | 19.0 |\n",
"| allergy | 0.8947368421052632 | 0.8947368421052632 | 0.894737 | 19.0 |\n",
"| arthritis | 0.9333333333333333 | 1.0 | 0.965517 | 14.0 |\n",
"| bronchial asthma | 0.7391304347826086 | 1.0 | 0.85 | 17.0 |\n",
"| bronchial asthma | 0.8095238095238095 | 1.0 | 0.894737 | 17.0 |\n",
"| cervical spondylosis | 1.0 | 1.0 | 1 | 21.0 |\n",
"| chicken pox | 0.8823529411764706 | 0.7894736842105263 | 0.833333 | 19.0 |\n",
"| chicken pox | 0.8235294117647058 | 0.7368421052631579 | 0.777778 | 19.0 |\n",
"| common cold | 0.8 | 0.8888888888888888 | 0.842105 | 18.0 |\n",
"| dengue | 0.6086956521739131 | 0.875 | 0.717949 | 16.0 |\n",
"| dengue | 0.5833333333333334 | 0.875 | 0.7 | 16.0 |\n",
"| diabetes | 1.0 | 0.6842105263157895 | 0.8125 | 19.0 |\n",
"| dimorphic hemorrhoids | 1.0 | 1.0 | 1 | 17.0 |\n",
"| drug reaction | 0.8333333333333334 | 0.9375 | 0.882353 | 16.0 |\n",
......@@ -721,14 +728,14 @@
"| malaria | 1.0 | 1.0 | 1 | 23.0 |\n",
"| migraine | 1.0 | 0.9473684210526315 | 0.972973 | 19.0 |\n",
"| peptic ulcer disease | 1.0 | 0.7727272727272727 | 0.871795 | 22.0 |\n",
"| pneumonia | 1.0 | 0.875 | 0.933333 | 24.0 |\n",
"| pneumonia | 1.0 | 0.9166666666666666 | 0.956522 | 24.0 |\n",
"| psoriasis | 1.0 | 0.8636363636363636 | 0.926829 | 22.0 |\n",
"| typhoid | 1.0 | 0.75 | 0.857143 | 24.0 |\n",
"| typhoid | 1.0 | 0.7916666666666666 | 0.883721 | 24.0 |\n",
"| urinary tract infection | 0.8888888888888888 | 1.0 | 0.941176 | 16.0 |\n",
"| varicose veins | 1.0 | 1.0 | 1 | 17.0 |\n",
"| varicose veins | 1.0 | 0.9411764705882353 | 0.969697 | 17.0 |\n",
"| accuracy | | | 0.919913 | |\n",
"| macro avg | 0.9260885007839509 | 0.9249392499556973 | 0.919695 | |\n",
"| weighted avg | 0.9329056274770945 | 0.9199134199134199 | 0.92054 | |\n",
"| macro avg | 0.9255138143876533 | 0.9237675093296224 | 0.919308 | |\n",
"| weighted avg | 0.9321983616985828 | 0.9199134199134199 | 0.92075 | |\n",
"+---------------------------------+--------------------+--------------------+------------+-----------+\n"
]
}
......
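Review note: the updated rows of the classification report can be sanity-checked by recomputing each F1 score as the harmonic mean of precision and recall. The sketch below assumes the three numeric columns shown are precision, recall, and F1 in that order (the table header is outside this hunk). The `f1_score` helper is a hypothetical name written for this note, not a library function, and the values are copied from the changed rows above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# label -> (precision, recall, F1 as reported in the updated report)
updated_rows = {
    "bronchial asthma": (0.8095238095238095, 1.0, 0.894737),
    "chicken pox": (0.8235294117647058, 0.7368421052631579, 0.777778),
    "dengue": (0.5833333333333334, 0.875, 0.7),
    "pneumonia": (1.0, 0.9166666666666666, 0.956522),
    "typhoid": (1.0, 0.7916666666666666, 0.883721),
    "varicose veins": (1.0, 0.9411764705882353, 0.969697),
}

# The report rounds F1 to six decimals, so compare with a small tolerance.
mismatches = [
    label
    for label, (p, r, reported) in updated_rows.items()
    if abs(f1_score(p, r) - reported) > 1e-6
]
print("mismatched rows:", mismatches)
```

An empty list means the reported F1 values agree with the precision and recall columns.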
This source diff could not be displayed because it is too large. You can view the blob instead.