Commit fd1ccd04 authored by Almouhannad Hafez's avatar Almouhannad Hafez

(Organize folders) Add data preprocessing

parent 8158976c
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# ***Contents***\n",
"- **[Setup](#0-Setup)**\n",
"- **[Data Preprocessing](#1-Data-preprocessing)**\n",
" - **[1. Load Dataset](#11-Load-dataset)**\n",
" - **[2. Check Null Values](#12-Check-null-values)**\n",
" - **[3. Check Duplicates](#13-Check-duplicates)**\n",
" - **[4. Process Dataset Columns](#14-Process-dataset-columns)**\n",
" - **[5. Convert to One-Hot Encoding](#15-Convert-to-one-hot-encoding)**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# ***0. Setup***\n",
"[Back to contents](#Contents)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Please note that the following cell may require a working VPN connection**"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# %pip install pandas"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from helpers import HELPERS\n",
"from constants import CONSTANTS\n",
"# Some more magic so that the notebook will reload external python modules;\n",
"# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n",
"%load_ext autoreload\n",
"%autoreload 2\n",
"%reload_ext autoreload"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# ***1. Data preprocessing***\n",
"[Back to contents](#Contents)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ***1.1. Load dataset***\n",
"[Back to contents](#Contents)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset loaded successfully with shape: (20507, 5)\n"
]
}
],
"source": [
"df = HELPERS.read_dataset_from_csv(CONSTANTS.DATASET_PATH)\n",
"assert df.shape == CONSTANTS.DATASET_SHAPE, f\"Expected shape {CONSTANTS.DATASET_SHAPE}, but got {df.shape}\" \n",
"print(\"Dataset loaded successfully with shape:\", df.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ***1.2. Check null values***\n",
"[Back to contents](#Contents)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Null values in each column:\n",
"Transaction 0\n",
"Item 0\n",
"date_time 0\n",
"period_day 0\n",
"weekday_weekend 0\n",
"dtype: int64\n"
]
}
],
"source": [
"print(\"Null values in each column:\")\n",
"print(df.isnull().sum())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Nothing to do since there are no null values**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ***1.3. Check duplicates***\n",
"[Back to contents](#Contents)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of duplicates in dataset: 1620\n"
]
}
],
"source": [
"print(f\"Number of duplicates in dataset: {df.duplicated().sum()}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**We have 1620 duplicated rows; let's remove them**"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of duplicates in dataset: 0\n",
"New dataset shape: (18887, 5)\n"
]
}
],
"source": [
"df = df.drop_duplicates()\n",
"print(f\"Number of duplicates in dataset: {df.duplicated().sum()}\")\n",
"print(f\"New dataset shape: {df.shape}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Now, let's count the number of unique items and the total number of transactions in the dataset**"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of transactions in the dataset: 9465\n",
"Number of unique items in the dataset: 94\n"
]
}
],
"source": [
"print(f\"Number of transactions in the dataset: {df['Transaction'].nunique()}\")\n",
"print(f\"Number of unique items in the dataset: {df['Item'].nunique()}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ***1.4. Process dataset columns***\n",
"[Back to contents](#Contents)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset columns:\n",
"Transaction int64\n",
"Item object\n",
"date_time object\n",
"period_day object\n",
"weekday_weekend object\n",
"dtype: object\n"
]
}
],
"source": [
"print(f\"Dataset columns:\")\n",
"print(df.dtypes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**We have 5 columns**\n",
"1. **`Transaction`**: Transaction id\n",
"1. **`Item`**: Item name\n",
"1. **`date_time`**: Date and time of the transaction\n",
"1. **`period_day`**: The period of the day (morning, afternoon, ...) in which the transaction occurred\n",
"1. **`weekday_weekend`**: Whether the transaction occurred on a weekday or a weekend\n",
"\n",
"***Please note:*** **If a transaction contains multiple items, each one is represented in a separate row with the same id**\n",
"\n",
"**We are interested only in `Transaction` and `Item`, so we'll drop the other columns and rename the remaining ones**"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset new columns:\n",
"transaction_id int64\n",
"item_name object\n",
"dtype: object\n"
]
}
],
"source": [
"df = df.loc[:, ['Transaction', 'Item']].rename(columns={\n",
" 'Transaction': 'transaction_id',\n",
" 'Item': 'item_name'\n",
"})\n",
"\n",
"print(f\"Dataset new columns:\")\n",
"print(df.dtypes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ***1.5. Convert to one-hot-encoding***\n",
"[Back to contents](#Contents)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**We'll convert the dataset into one-hot encoding as follows**: \n",
"> - Each row contains 94 item features (columns) + 1 feature for transaction_id\n",
"> - 94 is the number of unique items\n",
"> - Each feature value is boolean (true means the item is in the transaction, false means it is not)\n",
"> - So, the new shape of the dataset will be (9465, 95)\n",
"> - We do this so we can use libraries for applying Apriori and FP-Growth"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset before one-hot-encoding:\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>transaction_id</th>\n",
" <th>item_name</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>Bread</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>Scandinavian</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>Hot chocolate</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>3</td>\n",
" <td>Jam</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>3</td>\n",
" <td>Cookies</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" transaction_id item_name\n",
"0 1 Bread\n",
"1 2 Scandinavian\n",
"3 3 Hot chocolate\n",
"4 3 Jam\n",
"5 3 Cookies"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(f\"Dataset before one-hot-encoding:\")\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"New dataset shape: (9465, 95)\n"
]
}
],
"source": [
"one_hot_encoded = pd.get_dummies(df['item_name'])\n",
"df = df[['transaction_id']].join(one_hot_encoded).groupby('transaction_id').sum()\n",
"df.reset_index(inplace=True)\n",
"print(f\"New dataset shape: {df.shape}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Let's drop the `transaction_id` column since it's not required, and convert the other columns into boolean to save space**: "
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Final dataset shape: (9465, 94)\n",
"Dataset columns types after one-hot-encoding:\n",
"Adjustment bool\n",
"Afternoon with the baker bool\n",
"Alfajores bool\n",
"Argentina Night bool\n",
"Art Tray bool\n",
" ... \n",
"Tshirt bool\n",
"Valentine's card bool\n",
"Vegan Feast bool\n",
"Vegan mincepie bool\n",
"Victorian Sponge bool\n",
"Length: 94, dtype: object\n"
]
}
],
"source": [
"df = df.drop(columns=['transaction_id'])\n",
"df = df.astype(bool)\n",
"assert df.shape == CONSTANTS.PREPROCESSED_DATASET_SHAPE, f\"Expected shape {CONSTANTS.PREPROCESSED_DATASET_SHAPE}, but got {df.shape}\" \n",
"print(f\"Final dataset shape: {df.shape}\")\n",
"print(f\"Dataset columns types after one-hot-encoding:\")\n",
"print(df.dtypes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Let's save the preprocessed dataset to a .csv file**"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"df.to_csv(CONSTANTS.PREPROCESSED_DATASET_PATH, index=False)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "ML",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.20"
}
},
"nbformat": 4,
"nbformat_minor": 2
}