Gadgets Sales Analysis

Business Request & User Stories

The business request for this data analyst project was an executive sales report for sales managers. Based on the request that was made from the business we following user stories were defined to fulfill delivery and ensure that acceptance criteria’s were maintained throughout the project.

GSalesAnalysis

Gadget Sales Analysis

In [1]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

Import necessary libraries

In [5]:
import os
import pandas as pd
In [3]:
path = "/content/drive/MyDrive/Colab Notebooks/SalesAnalysis/Sales_Data"
files = [file for file in os.listdir(path) if not file.startswith('.')] # Ignore hidden files

all_months_data = pd.DataFrame()

for file in files:
    current_data = pd.read_csv(path+"/"+file)
    all_months_data = pd.concat([all_months_data, current_data])
    
all_months_data.to_csv("all_data_copy.csv", index=False)
all_months_data.to_csv("all_data.csv", index=False)
In [6]:
all_data = pd.read_csv("all_data.csv")
all_data.head()
Out[6]:
Order ID Product Quantity Ordered Price Each Order Date Purchase Address
0 278797 Wired Headphones 1 11.99 11/21/19 09:54 46 Park St, New York City, NY 10001
1 278798 USB-C Charging Cable 2 11.95 11/17/19 10:03 962 Hickory St, Austin, TX 73301
2 278799 Apple Airpods Headphones 1 150.0 11/19/19 14:56 464 Cherry St, Los Angeles, CA 90001
3 278800 27in FHD Monitor 1 149.99 11/25/19 22:24 649 10th St, Seattle, WA 98101
4 278801 Bose SoundSport Headphones 1 99.99 11/09/19 13:56 522 Hill St, Boston, MA 02215

Clean up the data!

Figuring out what we need to clean and debugging errors.

Drop rows of NAN
In [7]:
# Find NAN
nan_df = all_data[all_data.isna().any(axis=1)]
display(nan_df.head())

all_data = all_data.dropna(how='all')
all_data.head()
Order ID Product Quantity Ordered Price Each Order Date Purchase Address
69 NaN NaN NaN NaN NaN NaN
138 NaN NaN NaN NaN NaN NaN
544 NaN NaN NaN NaN NaN NaN
546 NaN NaN NaN NaN NaN NaN
781 NaN NaN NaN NaN NaN NaN
Out[7]:
Order ID Product Quantity Ordered Price Each Order Date Purchase Address
0 278797 Wired Headphones 1 11.99 11/21/19 09:54 46 Park St, New York City, NY 10001
1 278798 USB-C Charging Cable 2 11.95 11/17/19 10:03 962 Hickory St, Austin, TX 73301
2 278799 Apple Airpods Headphones 1 150.0 11/19/19 14:56 464 Cherry St, Los Angeles, CA 90001
3 278800 27in FHD Monitor 1 149.99 11/25/19 22:24 649 10th St, Seattle, WA 98101
4 278801 Bose SoundSport Headphones 1 99.99 11/09/19 13:56 522 Hill St, Boston, MA 02215
Get rid of text in order date column
In [8]:
all_data = all_data[all_data['Order Date'].str[0:2]!='Or']

Make columns correct type

In [9]:
all_data['Quantity Ordered'] = pd.to_numeric(all_data['Quantity Ordered'])
all_data['Price Each'] = pd.to_numeric(all_data['Price Each'])

Augment data with additional columns

Add month column

In [10]:
all_data['Month'] = all_data['Order Date'].str[0:2]
all_data['Month'] = all_data['Month'].astype('int32')
all_data.head()
Out[10]:
Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month
0 278797 Wired Headphones 1 11.99 11/21/19 09:54 46 Park St, New York City, NY 10001 11
1 278798 USB-C Charging Cable 2 11.95 11/17/19 10:03 962 Hickory St, Austin, TX 73301 11
2 278799 Apple Airpods Headphones 1 150.00 11/19/19 14:56 464 Cherry St, Los Angeles, CA 90001 11
3 278800 27in FHD Monitor 1 149.99 11/25/19 22:24 649 10th St, Seattle, WA 98101 11
4 278801 Bose SoundSport Headphones 1 99.99 11/09/19 13:56 522 Hill St, Boston, MA 02215 11

Add month column (alternative method)

In [11]:
# all_data['Month 2'] = pd.to_datetime(all_data['Order Date']).dt.month
# all_data.head()

Add city column

In [12]:
def get_city(address):
    return address.split(",")[1].strip(" ")

def get_state(address):
    return address.split(",")[2].split(" ")[1]

all_data['City'] = all_data['Purchase Address'].apply(lambda x: f"{get_city(x)}  ({get_state(x)})")
all_data.head()
Out[12]:
Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month City
0 278797 Wired Headphones 1 11.99 11/21/19 09:54 46 Park St, New York City, NY 10001 11 New York City (NY)
1 278798 USB-C Charging Cable 2 11.95 11/17/19 10:03 962 Hickory St, Austin, TX 73301 11 Austin (TX)
2 278799 Apple Airpods Headphones 1 150.00 11/19/19 14:56 464 Cherry St, Los Angeles, CA 90001 11 Los Angeles (CA)
3 278800 27in FHD Monitor 1 149.99 11/25/19 22:24 649 10th St, Seattle, WA 98101 11 Seattle (WA)
4 278801 Bose SoundSport Headphones 1 99.99 11/09/19 13:56 522 Hill St, Boston, MA 02215 11 Boston (MA)

Data Exploration!

Question 1: What was the best month for sales? How much was earned that month?

In [13]:
all_data['Sales'] = all_data['Quantity Ordered'].astype('int') * all_data['Price Each'].astype('float')
In [14]:
all_data.groupby(['Month']).sum()
Out[14]:
Quantity Ordered Price Each Sales
Month
1 10903 1.811768e+06 1.822257e+06
2 13449 2.188885e+06 2.202022e+06
3 17005 2.791208e+06 2.807100e+06
4 20558 3.367671e+06 3.390670e+06
5 18667 3.135125e+06 3.152607e+06
6 15253 2.562026e+06 2.577802e+06
7 16072 2.632540e+06 2.647776e+06
8 13448 2.230345e+06 2.244468e+06
9 13109 2.084992e+06 2.097560e+06
10 22703 3.715555e+06 3.736727e+06
11 19798 3.180601e+06 3.199603e+06
12 28114 4.588415e+06 4.613443e+06
In [15]:
import matplotlib.pyplot as plt

months = range(1,13)
print(months)

plt.bar(months,all_data.groupby(['Month']).sum()['Sales'])
plt.xticks(months)
plt.ylabel('Sales in USD ($)')
plt.xlabel('Month number')
plt.show()
range(1, 13)

Question 2: What city sold the most product?

In [16]:
all_data.groupby(['City']).sum()
Out[16]:
Quantity Ordered Price Each Month Sales
City
Atlanta (GA) 16602 2.779908e+06 104794 2.795499e+06
Austin (TX) 11153 1.809874e+06 69829 1.819582e+06
Boston (MA) 22528 3.637410e+06 141112 3.661642e+06
Dallas (TX) 16730 2.752628e+06 104620 2.767975e+06
Los Angeles (CA) 33289 5.421435e+06 208325 5.452571e+06
New York City (NY) 27932 4.635371e+06 175741 4.664317e+06
Portland (ME) 2750 4.471893e+05 17144 4.497583e+05
Portland (OR) 11303 1.860558e+06 70621 1.870732e+06
San Francisco (CA) 50239 8.211462e+06 315520 8.262204e+06
Seattle (WA) 16553 2.733296e+06 104941 2.747755e+06
In [17]:
import matplotlib.pyplot as plt

keys = [city for city, df in all_data.groupby(['City'])]

plt.bar(keys,all_data.groupby(['City']).sum()['Sales'])
plt.ylabel('Sales in USD ($)')
plt.xlabel('Month number')
plt.xticks(keys, rotation='vertical', size=8)
plt.show()

Question 3: What time should we display advertisements to maximize likelihood of customer's buying product?

In [18]:
# Add hour column
all_data['Hour'] = pd.to_datetime(all_data['Order Date']).dt.hour
# all_data['Minute'] = pd.to_datetime(all_data['Order Date']).dt.minute
all_data['Count'] = 1
all_data.head()
Out[18]:
Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month City Sales Hour Count
0 278797 Wired Headphones 1 11.99 11/21/19 09:54 46 Park St, New York City, NY 10001 11 New York City (NY) 11.99 9 1
1 278798 USB-C Charging Cable 2 11.95 11/17/19 10:03 962 Hickory St, Austin, TX 73301 11 Austin (TX) 23.90 10 1
2 278799 Apple Airpods Headphones 1 150.00 11/19/19 14:56 464 Cherry St, Los Angeles, CA 90001 11 Los Angeles (CA) 150.00 14 1
3 278800 27in FHD Monitor 1 149.99 11/25/19 22:24 649 10th St, Seattle, WA 98101 11 Seattle (WA) 149.99 22 1
4 278801 Bose SoundSport Headphones 1 99.99 11/09/19 13:56 522 Hill St, Boston, MA 02215 11 Boston (MA) 99.99 13 1
In [19]:
keys = [pair for pair, df in all_data.groupby(['Hour'])]

plt.plot(keys, all_data.groupby(['Hour']).count()['Count'])
plt.xticks(keys)
plt.grid()
plt.show()

# My recommendation is slightly before 11am or 7pm

Question 4: What products are most often sold together?

In [20]:
# https://stackoverflow.com/questions/43348194/pandas-select-rows-if-id-appear-several-time
df = all_data[all_data['Order ID'].duplicated(keep=False)]

# Referenced: https://stackoverflow.com/questions/27298178/concatenate-strings-from-several-rows-using-pandas-groupby
df['Grouped'] = df.groupby('Order ID')['Product'].transform(lambda x: ','.join(x))
df2 = df[['Order ID', 'Grouped']].drop_duplicates()
In [21]:
# Referenced: https://stackoverflow.com/questions/52195887/counting-unique-pairs-of-numbers-into-a-python-dictionary
from itertools import combinations
from collections import Counter

count = Counter()

for row in df2['Grouped']:
    row_list = row.split(',')
    count.update(Counter(combinations(row_list, 2)))

for key,value in count.most_common(10):
    print(key, value)
('iPhone', 'Lightning Charging Cable') 1005
('Google Phone', 'USB-C Charging Cable') 987
('iPhone', 'Wired Headphones') 447
('Google Phone', 'Wired Headphones') 414
('Vareebadd Phone', 'USB-C Charging Cable') 361
('iPhone', 'Apple Airpods Headphones') 360
('Google Phone', 'Bose SoundSport Headphones') 220
('USB-C Charging Cable', 'Wired Headphones') 160
('Vareebadd Phone', 'Wired Headphones') 143
('Lightning Charging Cable', 'Wired Headphones') 92

Question 5: What product sold the most? Why do you think it sold the most?

In [22]:
product_group = all_data.groupby('Product')
quantity_ordered = product_group.sum()['Quantity Ordered']

keys = [pair for pair, df in product_group]
plt.bar(keys, quantity_ordered)
plt.xticks(keys, rotation='vertical', size=8)
plt.show()
In [23]:
# Referenced: https://stackoverflow.com/questions/14762181/adding-a-y-axis-label-to-secondary-y-axis-in-matplotlib

prices = all_data.groupby('Product').mean()['Price Each']

fig, ax1 = plt.subplots()

ax2 = ax1.twinx()
ax1.bar(keys, quantity_ordered, color='g')
ax2.plot(keys, prices, color='b')

ax1.set_xlabel('Product Name')
ax1.set_ylabel('Quantity Ordered', color='g')
ax2.set_ylabel('Price ($)', color='b')
ax1.set_xticklabels(keys, rotation='vertical', size=8)

fig.show()