almohanad.hafez / DM-Project / Commits
Commit fb8fcf89, authored Oct 29, 2024 by Almouhannad
Add preprocessing
parent 82cd5af1
Showing 4 changed files with 429 additions and 6 deletions (+429 −6)

.gitignore +1 −0
README.md +3 −1
constants.py +4 −1
hw1.ipynb +421 −4
.gitignore 0 → 100644
__pycache__/constants.cpython-311.pyc
README.md
...
...
@@ -4,10 +4,12 @@
> ***Dataset link: [The Bread Basket](https://www.kaggle.com/datasets/mittalvasu95/the-bread-basket)***
> ***This project contains a jupyter notebook `hw1.ipynb` containing the following steps:***
> 1. **Setup requirements**
> 1. **Data preprocessing**
> 1. **Extracting rules using**
>    - **Apriori**
>    - **FP Growth**
> 1. **Performance comparison between the two algorithms**
> ***This project contains a python file `constants.py` containing some fixed values used in `hw1.ipynb`, referred to as the `CONSTANTS` class***
constants.py
class CONSTANTS:
    DATASET_PATH = 'data/bread_basket.csv'
    DATASET_SHAPE = (20507, 5)
    PREPROCESSED_DATASET_PATH = 'data/bread_basket_preprocessed.csv'
    PREPROCESSED_DATASET_SHAPE = (9465, 94)
hw1.ipynb
...
...
@@ -7,24 +7,35 @@
"# ***0. Setup***"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Please note that the following cell may require working VPN to work**"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%pip install pandas\n",
"%pip install mlxtend"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"from mlxtend.frequent_patterns import apriori, association_rules\n",
"from mlxtend.frequent_patterns import fpgrowth\n",
"\n",
"from constants import CONSTANTS"
]
},
{
...
...
@@ -34,6 +45,404 @@
"# ***1. Data preprocessing***"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ***1.1. Load dataset***"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset loaded successfully with shape: (20507, 5)\n"
]
}
],
"source": [
"df = pd.read_csv(CONSTANTS.DATASET_PATH)\n",
"df.shape\n",
"assert df.shape == CONSTANTS.DATASET_SHAPE, f\"Expected shape {CONSTANTS.DATASET_SHAPE}, but got {df.shape}\" \n",
"print(\"Dataset loaded successfully with shape:\", df.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ***1.2. Check null values***"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Null values in each column:\n",
"Transaction 0\n",
"Item 0\n",
"date_time 0\n",
"period_day 0\n",
"weekday_weekend 0\n",
"dtype: int64\n"
]
}
],
"source": [
"print(\"Null values in each column:\")\n",
"print(df.isnull().sum())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Nothing to do since there are no null values**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ***1.3. Check duplicates***"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of duplicates in dataset: 1620\n"
]
}
],
"source": [
"print(f\"Number of duplicates in dataset: {df.duplicated().sum()}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**We have 1620 duplicated rows; let's remove them**"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of duplicates in dataset: 0\n",
"New dataset shape: (18887, 5)\n"
]
}
],
"source": [
"df = df.drop_duplicates()\n",
"print(f\"Number of duplicates in dataset: {df.duplicated().sum()}\")\n",
"print(f\"New dataset shape: {df.shape}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Now, let's count the number of unique items and the total number of transactions in the dataset**"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of transactions in the dataset: 9465\n",
"Number of unique items in the dataset: 94\n"
]
}
],
"source": [
"print(f\"Number of transactions in the dataset: {df['Transaction'].nunique()}\")\n",
"print(f\"Number of unique items in the dataset: {df['Item'].nunique()}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ***1.4. Process dataset columns***"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset columns:\n",
"Transaction int64\n",
"Item object\n",
"date_time object\n",
"period_day object\n",
"weekday_weekend object\n",
"dtype: object\n"
]
}
],
"source": [
"print(f\"Dataset columns:\")\n",
"print(df.dtypes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**We have 5 columns**\n",
"1. **`Transaction`**: Transaction id\n",
"1. **`Item`**: Item name\n",
"1. **`date_time`**: Date and time of the transaction\n",
"1. **`period_day`**: The period of the day (morning, afternoon, ...) in which the transaction occurred\n",
"1. **`weekday_weekend`**: Whether the transaction occurred on a weekday or a weekend\n",
"\n",
"***Please note:*** **If a transaction contains multiple items, each item is represented in a separate row with the same transaction id**\n",
"\n",
"**We are interested only in `Transaction` and `Item`, so we'll drop the other columns and rename these two**"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset new columns:\n",
"transaction_id int64\n",
"item_name object\n",
"dtype: object\n"
]
}
],
"source": [
"df = df.loc[:, ['Transaction', 'Item']].rename(columns={\n",
" 'Transaction': 'transaction_id',\n",
" 'Item': 'item_name'\n",
"})\n",
"\n",
"print(f\"Dataset new columns:\")\n",
"print(df.dtypes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ***1.5. Convert to one-hot-encoding***"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**We'll convert the dataset into one-hot-encoding as follows**: \n",
"> - Each row contains 94 item features (columns) + 1 feature for `transaction_id`\n",
"> - 94 is the number of unique items\n",
"> - Each feature value is boolean (`True` means the item is in the transaction, `False` means it is not)\n",
"> - So, the new shape of the dataset will be (9465, 95)\n",
"> - We're doing so to be able to use library implementations of Apriori and FP Growth"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset before one-hot-encoding:\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>transaction_id</th>\n",
" <th>item_name</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>Bread</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>Scandinavian</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>Hot chocolate</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>3</td>\n",
" <td>Jam</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>3</td>\n",
" <td>Cookies</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" transaction_id item_name\n",
"0 1 Bread\n",
"1 2 Scandinavian\n",
"3 3 Hot chocolate\n",
"4 3 Jam\n",
"5 3 Cookies"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(f\"Dataset before one-hot-encoding:\")\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"New dataset shape: (9465, 95)\n"
]
}
],
"source": [
"one_hot_encoded = pd.get_dummies(df['item_name'])\n",
"df = df[['transaction_id']].join(one_hot_encoded).groupby('transaction_id').sum()\n",
"df.reset_index(inplace=True)\n",
"print(f\"New dataset shape: {df.shape}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Let's delete the `transaction_id` column since it's not required, and convert the other columns to boolean to save space**: "
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Final dataset shape: (9465, 94)\n",
"Dataset columns types after one-hot-encoding:\n",
"Adjustment bool\n",
"Afternoon with the baker bool\n",
"Alfajores bool\n",
"Argentina Night bool\n",
"Art Tray bool\n",
" ... \n",
"Tshirt bool\n",
"Valentine's card bool\n",
"Vegan Feast bool\n",
"Vegan mincepie bool\n",
"Victorian Sponge bool\n",
"Length: 94, dtype: object\n"
]
}
],
"source": [
"df = df.drop(columns=['transaction_id'])\n",
"df = df.astype(bool)\n",
"assert df.shape == CONSTANTS.PREPROCESSED_DATASET_SHAPE, f\"Expected shape {CONSTANTS.PREPROCESSED_DATASET_SHAPE}, but got {df.shape}\" \n",
"print(f\"Final dataset shape: {df.shape}\")\n",
"print(f\"Dataset columns types after one-hot-encoding:\")\n",
"print(df.dtypes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Let's save the preprocessed dataset to a .csv file**"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"df.to_csv(CONSTANTS.PREPROCESSED_DATASET_PATH)"
]
},
{
"cell_type": "markdown",
"metadata": {},
...
...
@@ -63,7 +472,15 @@
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
}
},
...
...
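The one-hot conversion in section 1.5 of the notebook above can be checked standalone on toy data; this sketch mirrors the notebook's `get_dummies` + `groupby` approach, using the five sample rows shown in its `df.head()` output:

```python
import pandas as pd

# Sample rows in the notebook's transaction_id / item_name layout
df = pd.DataFrame({
    'transaction_id': [1, 2, 3, 3, 3],
    'item_name': ['Bread', 'Scandinavian', 'Hot chocolate', 'Jam', 'Cookies'],
})

# One row per transaction, one boolean column per item:
# dummy-encode the items, then collapse rows sharing a transaction_id
one_hot = (pd.get_dummies(df['item_name'])
             .groupby(df['transaction_id'])
             .sum()
             .astype(bool))

print(one_hot.shape)  # 3 transactions x 5 unique items -> (3, 5)
```

On the full dataset the same pipeline yields the (9465, 94) boolean matrix asserted against `CONSTANTS.PREPROCESSED_DATASET_SHAPE`.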