almohanad.hafez / DM-Project / Commits
Commit fb8fcf89, authored Oct 29, 2024 by Almouhannad
Add preprocessing
parent 82cd5af1
Showing 4 changed files with 429 additions and 6 deletions (+429 −6)

.gitignore +1 −0
README.md +3 −1
constants.py +4 −1
hw1.ipynb +421 −4
.gitignore 0 → 100644
__pycache__/constants.cpython-311.pyc
README.md
...
...
@@ -4,10 +4,12 @@
> ***Dataset link: [The Bread Basket](https://www.kaggle.com/datasets/mittalvasu95/the-bread-basket)***
> ***This project contains a jupyter notebook `hw1.ipynb` containing the following steps:***
> 1. **Setup requirements**
> 1. **Data preprocessing**
> 1. **Extracting rules using**
>    - **Apriori**
>    - **FP Growth**
> 1. **Performance comparison between the two algorithms**
> ***This project contains a python file `constants.py` containing some fixed values used in `hw1.ipynb`, referred to as the `CONSTANTS` class***
constants.py
class CONSTANTS:
    DATASET_PATH = 'data/bread_basket.csv'
    DATASET_SHAPE = (20507, 5)
    PREPROCESSED_DATASET_PATH = 'data/bread_basket_preprocessed.csv'
    PREPROCESSED_DATASET_SHAPE = (9465, 94)
hw1.ipynb
...
...
@@ -7,24 +7,35 @@
"# ***0. Setup***"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Please note that the following cell may require working VPN to work**"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%pip install pandas\n",
"%pip install mlxtend"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"from mlxtend.frequent_patterns import apriori, association_rules\n",
"from mlxtend.frequent_patterns import fpgrowth\n",
"\n",
"from constants import CONSTANTS"
]
},
{
...
...
@@ -34,6 +45,404 @@
"# ***1. Data preprocessing***"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ***1.1. Load dataset***"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset loaded successfully with shape: (20507, 5)\n"
]
}
],
"source": [
"df = pd.read_csv(CONSTANTS.DATASET_PATH)\n",
"df.shape\n",
"assert df.shape == CONSTANTS.DATASET_SHAPE, f\"Expected shape {CONSTANTS.DATASET_SHAPE}, but got {df.shape}\" \n",
"print(\"Dataset loaded successfully with shape:\", df.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ***1.2. Check null values***"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Null values in each column:\n",
"Transaction 0\n",
"Item 0\n",
"date_time 0\n",
"period_day 0\n",
"weekday_weekend 0\n",
"dtype: int64\n"
]
}
],
"source": [
"print(\"Null values in each column:\")\n",
"print(df.isnull().sum())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Nothing to do since there are no null values**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ***1.3. Check duplicates***"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of duplicates in dataset: 1620\n"
]
}
],
"source": [
"print(f\"Number of duplicates in dataset: {df.duplicated().sum()}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**We have 1620 duplicated rows; let's remove them**"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of duplicates in dataset: 0\n",
"New dataset shape: (18887, 5)\n"
]
}
],
"source": [
"df = df.drop_duplicates()\n",
"print(f\"Number of duplicates in dataset: {df.duplicated().sum()}\")\n",
"print(f\"New dataset shape: {df.shape}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Now, let's count the number of unique items and the total number of transactions in the dataset**"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of transactions in the dataset: 9465\n",
"Number of unique items in the dataset: 94\n"
]
}
],
"source": [
"print(f\"Number of transactions in the dataset: {df['Transaction'].nunique()}\")\n",
"print(f\"Number of unique items in the dataset: {df['Item'].nunique()}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ***1.4. Process dataset columns***"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset columns:\n",
"Transaction int64\n",
"Item object\n",
"date_time object\n",
"period_day object\n",
"weekday_weekend object\n",
"dtype: object\n"
]
}
],
"source": [
"print(f\"Dataset columns:\")\n",
"print(df.dtypes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**We have 5 columns**\n",
"1. **`Transaction`**: Transaction id\n",
"1. **`Item`**: Item name\n",
"1. **`date_time`**: Date and time of the transaction\n",
"1. **`period_day`**: The period of the day (morning, afternoon, ...) in which the transaction occurred\n",
"1. **`weekday_weekend`**: Whether the transaction occurred on a weekday or a weekend\n",
"\n",
"***Please note:*** **If a transaction contains multiple items, each item is represented in a separate row with the same transaction id**\n",
"\n",
"**We are interested only in `Transaction` and `Item`, so we'll drop the other columns and rename these two**"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset new columns:\n",
"transaction_id int64\n",
"item_name object\n",
"dtype: object\n"
]
}
],
"source": [
"df = df.loc[:, ['Transaction', 'Item']].rename(columns={\n",
" 'Transaction': 'transaction_id',\n",
" 'Item': 'item_name'\n",
"})\n",
"\n",
"print(f\"Dataset new columns:\")\n",
"print(df.dtypes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ***1.5. Convert to one-hot-encoding***"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**We'll convert the dataset into one-hot-encoding as follows**: \n",
"> - Each row contains 94 item features (columns) + 1 feature for `transaction_id`\n",
"> - 94 is the number of unique items\n",
"> - Each feature value is boolean (`True` means the item is in the transaction, `False` means it is not)\n",
"> - So, the new shape of the dataset will be (9465, 95)\n",
"> - We're doing so to be able to use library implementations of Apriori and FP Growth"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset before one-hot-encoding:\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>transaction_id</th>\n",
" <th>item_name</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>Bread</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>Scandinavian</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>Hot chocolate</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>3</td>\n",
" <td>Jam</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>3</td>\n",
" <td>Cookies</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" transaction_id item_name\n",
"0 1 Bread\n",
"1 2 Scandinavian\n",
"3 3 Hot chocolate\n",
"4 3 Jam\n",
"5 3 Cookies"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(f\"Dataset before one-hot-encoding:\")\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"New dataset shape: (9465, 95)\n"
]
}
],
"source": [
"one_hot_encoded = pd.get_dummies(df['item_name'])\n",
"df = df[['transaction_id']].join(one_hot_encoded).groupby('transaction_id').sum()\n",
"df.reset_index(inplace=True)\n",
"print(f\"New dataset shape: {df.shape}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Let's delete the `transaction_id` column since it's not required, and convert the other columns to boolean to save space**: "
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Final dataset shape: (9465, 94)\n",
"Dataset columns types after one-hot-encoding:\n",
"Adjustment bool\n",
"Afternoon with the baker bool\n",
"Alfajores bool\n",
"Argentina Night bool\n",
"Art Tray bool\n",
" ... \n",
"Tshirt bool\n",
"Valentine's card bool\n",
"Vegan Feast bool\n",
"Vegan mincepie bool\n",
"Victorian Sponge bool\n",
"Length: 94, dtype: object\n"
]
}
],
"source": [
"df = df.drop(columns=['transaction_id'])\n",
"df = df.astype(bool)\n",
"assert df.shape == CONSTANTS.PREPROCESSED_DATASET_SHAPE, f\"Expected shape {CONSTANTS.PREPROCESSED_DATASET_SHAPE}, but got {df.shape}\" \n",
"print(f\"Final dataset shape: {df.shape}\")\n",
"print(f\"Dataset columns types after one-hot-encoding:\")\n",
"print(df.dtypes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Let's save the preprocessed dataset to a .csv file**"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"df.to_csv(CONSTANTS.PREPROCESSED_DATASET_PATH)"
]
},
{
"cell_type": "markdown",
"metadata": {},
...
...
@@ -63,7 +472,15 @@
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
}
},
...
...
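The one-hot conversion in section 1.5 of the notebook above can be checked standalone on toy data; this sketch mirrors the notebook's `get_dummies` + `groupby` approach, using the five sample rows shown in its `df.head()` output:

```python
import pandas as pd

# Sample rows in the notebook's transaction_id / item_name layout
df = pd.DataFrame({
    'transaction_id': [1, 2, 3, 3, 3],
    'item_name': ['Bread', 'Scandinavian', 'Hot chocolate', 'Jam', 'Cookies'],
})

# One row per transaction, one boolean column per item:
# dummy-encode the items, then collapse rows sharing a transaction_id
one_hot = (pd.get_dummies(df['item_name'])
             .groupby(df['transaction_id'])
             .sum()
             .astype(bool))

print(one_hot.shape)  # 3 transactions x 5 unique items -> (3, 5)
```

On the full dataset the same pipeline yields the (9465, 94) boolean matrix asserted against `CONSTANTS.PREPROCESSED_DATASET_SHAPE`.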