Commit fd1ccd04 authored by Almouhannad Hafez's avatar Almouhannad Hafez

(Organize folders) Add data preprocessing

parent 8158976c
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# ***Contents***\n",
"- **[Setup](#0-Setup)**\n",
"- **[Data Preprocessing](#1-Data-preprocessing)**\n",
" - **[1. Load Dataset](#11-Load-dataset)**\n",
" - **[2. Check Null Values](#12-Check-null-values)**\n",
" - **[3. Check Duplicates](#13-Check-duplicates)**\n",
" - **[4. Process Dataset Columns](#14-Process-dataset-columns)**\n",
" - **[5. Convert to One-Hot Encoding](#15-Convert-to-one-hot-encoding)**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# ***0. Setup***\n",
"[Back to contents](#Contents)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Please note that the following cell may require a working VPN connection**"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# %pip install pandas"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from helpers import HELPERS\n",
"from constants import CONSTANTS\n",
"# Some more magic so that the notebook will reload external python modules;\n",
"# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n",
"%load_ext autoreload\n",
"%autoreload 2\n",
"%reload_ext autoreload"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# ***1. Data preprocessing***\n",
"[Back to contents](#Contents)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ***1.1. Load dataset***\n",
"[Back to contents](#Contents)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset loaded successfully with shape: (20507, 5)\n"
]
}
],
"source": [
"df = HELPERS.read_dataset_from_csv(CONSTANTS.DATASET_PATH)\n",
"assert df.shape == CONSTANTS.DATASET_SHAPE, f\"Expected shape {CONSTANTS.DATASET_SHAPE}, but got {df.shape}\" \n",
"print(\"Dataset loaded successfully with shape:\", df.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ***1.2. Check null values***\n",
"[Back to contents](#Contents)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Null values in each column:\n",
"Transaction 0\n",
"Item 0\n",
"date_time 0\n",
"period_day 0\n",
"weekday_weekend 0\n",
"dtype: int64\n"
]
}
],
"source": [
"print(\"Null values in each column:\")\n",
"print(df.isnull().sum())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Nothing to do since there are no null values**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ***1.3. Check duplicates***\n",
"[Back to contents](#Contents)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of duplicates in dataset: 1620\n"
]
}
],
"source": [
"print(f\"Number of duplicates in dataset: {df.duplicated().sum()}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**We have 1620 duplicated rows; let's remove them**"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of duplicates in dataset: 0\n",
"New dataset shape: (18887, 5)\n"
]
}
],
"source": [
"df = df.drop_duplicates()\n",
"print(f\"Number of duplicates in dataset: {df.duplicated().sum()}\")\n",
"print(f\"New dataset shape: {df.shape}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Now, let's count the number of unique items and the total number of transactions in the dataset**"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of transactions in the dataset: 9465\n",
"Number of unique items in the dataset: 94\n"
]
}
],
"source": [
"print(f\"Number of transactions in the dataset: {df['Transaction'].nunique()}\")\n",
"print(f\"Number of unique items in the dataset: {df['Item'].nunique()}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ***1.4. Process dataset columns***\n",
"[Back to contents](#Contents)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset columns:\n",
"Transaction int64\n",
"Item object\n",
"date_time object\n",
"period_day object\n",
"weekday_weekend object\n",
"dtype: object\n"
]
}
],
"source": [
"print(f\"Dataset columns:\")\n",
"print(df.dtypes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**We have 5 columns**\n",
"1. **`Transaction`**: Transaction id\n",
"1. **`Item`**: Item name\n",
"1. **`date_time`**: Date and time of the transaction\n",
"1. **`period_day`**: The period of the day (morning, afternoon, ...) in which the transaction occurred\n",
"1. **`weekday_weekend`**: Whether the transaction occurred on a weekday or a weekend\n",
"\n",
"***Please note:*** **If a transaction contains multiple items, each one is represented in a separate row with the same id**\n",
"\n",
"**We are interested only in `Transaction` and `Item`, so we'll drop the other columns and rename the remaining ones**"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset new columns:\n",
"transaction_id int64\n",
"item_name object\n",
"dtype: object\n"
]
}
],
"source": [
"df = df.loc[:, ['Transaction', 'Item']].rename(columns={\n",
" 'Transaction': 'transaction_id',\n",
" 'Item': 'item_name'\n",
"})\n",
"\n",
"print(f\"Dataset new columns:\")\n",
"print(df.dtypes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ***1.5. Convert to one-hot-encoding***\n",
"[Back to contents](#Contents)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**We'll convert the dataset into one-hot encoding as follows**: \n",
"> - Each row contains 94 item features (columns) + 1 feature for transaction_id\n",
"> - 94 is the number of unique items\n",
"> - Each feature value is boolean (true means the item is in the transaction, false means it is not)\n",
"> - So, the new shape of the dataset will be (9465, 95)\n",
"> - We do this so we can use libraries for applying Apriori and FP-Growth"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset before one-hot-encoding:\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>transaction_id</th>\n",
" <th>item_name</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>Bread</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>Scandinavian</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>Hot chocolate</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>3</td>\n",
" <td>Jam</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>3</td>\n",
" <td>Cookies</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" transaction_id item_name\n",
"0 1 Bread\n",
"1 2 Scandinavian\n",
"3 3 Hot chocolate\n",
"4 3 Jam\n",
"5 3 Cookies"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(f\"Dataset before one-hot-encoding:\")\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"New dataset shape: (9465, 95)\n"
]
}
],
"source": [
"one_hot_encoded = pd.get_dummies(df['item_name'])\n",
"df = df[['transaction_id']].join(one_hot_encoded).groupby('transaction_id').sum()\n",
"df.reset_index(inplace=True)\n",
"print(f\"New dataset shape: {df.shape}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Let's drop the `transaction_id` column since it's not required, and convert the other columns into boolean to save space**: "
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Final dataset shape: (9465, 94)\n",
"Dataset columns types after one-hot-encoding:\n",
"Adjustment bool\n",
"Afternoon with the baker bool\n",
"Alfajores bool\n",
"Argentina Night bool\n",
"Art Tray bool\n",
" ... \n",
"Tshirt bool\n",
"Valentine's card bool\n",
"Vegan Feast bool\n",
"Vegan mincepie bool\n",
"Victorian Sponge bool\n",
"Length: 94, dtype: object\n"
]
}
],
"source": [
"df = df.drop(columns=['transaction_id'])\n",
"df = df.astype(bool)\n",
"assert df.shape == CONSTANTS.PREPROCESSED_DATASET_SHAPE, f\"Expected shape {CONSTANTS.PREPROCESSED_DATASET_SHAPE}, but got {df.shape}\" \n",
"print(f\"Final dataset shape: {df.shape}\")\n",
"print(f\"Dataset columns types after one-hot-encoding:\")\n",
"print(df.dtypes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Let's save the preprocessed dataset to a .csv file**"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"df.to_csv(CONSTANTS.PREPROCESSED_DATASET_PATH, index=False)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "ML",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.20"
}
},
"nbformat": 4,
"nbformat_minor": 2
}