date : all data are aggregated by date
shown : the number of ads shown on a given day all over the web. Impressions are free; companies pay only when a user clicks on an ad, not to show it
clicked : the number of clicks on the ads. This is what companies pay for. Clicking on the ad brings the user to the site
converted : the number of conversions on the site coming from ads. To be counted, a conversion has to happen on the same day as the ad click.
avg_cost_per_click : on average, how much each of those clicks cost
total_revenue : how much revenue came from the conversions
ad : we have several different ad groups. This shows which ad group we are considering
# silence library warnings and logging output
import warnings
warnings.simplefilter('ignore')
import logging, sys
logging.disable(sys.maxsize)
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv('ad_table.csv', parse_dates=['date'])
data.head(5)
data.info()
data.describe()
We need a metric to measure how good an ad group is.
Here, I'd like to go with profit per unit:
profit = total_revenue - (# of clicks * avg_cost_per_click)
profit per unit = profit / # of ads shown
Pros: it's a direct measurement of profitability, and it's intuitive and easy to compute.
Cons: profit per unit doesn't reflect a group's overall scale; the group with the highest profit per impression is not necessarily the group that makes the largest total profit.
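To make the metric concrete, here is a tiny worked example (the numbers below are hypothetical, not taken from the dataset):
# hypothetical single day for one ad group -- illustrative numbers only
shown, clicked, avg_cost_per_click, total_revenue = 1000, 100, 0.50, 80.0
profit = total_revenue - clicked * avg_cost_per_click   # 80 - 100*0.50 = 30
profit_per_unit = profit / shown                        # 30 / 1000 = 0.03
print(profit_per_unit)   # each impression earned about 3 cents that day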
data['profit'] = data['total_revenue'] - data['avg_cost_per_click'] * data['clicked']
data.head()
def per_unit_profit(data):
    """Total profit per impression for one ad group."""
    profit = data['profit'].sum()
    shown = data['shown'].sum()
    return profit / shown
grouped = data.groupby('ad').apply(per_unit_profit).reset_index()
grouped = grouped.rename(columns={0: 'unit_profit'})
grouped = grouped.sort_values(by='unit_profit', ascending=False)
grouped.head(10)
According to this metric, the top 5 ad groups are ad_group_16, ad_group_2, ad_group_14, ad_group_31, and ad_group_27.
Here I'm using a GAM (generalized additive model), via Prophet, since our dataset is a time series. Note that ARIMA (Autoregressive Integrated Moving Average) and Long Short-Term Memory (LSTM) networks both do well on time series. However, ARIMA imposes rigid assumptions (trends should have regular periods, and the series should have constant mean and variance), and neural networks lack interpretability.
A GAM is like a regression, except that it is a sum of smooth functions rather than linear terms. It can therefore isolate the component functions and help us interpret patterns such as weekly, monthly, and yearly seasonality, holidays, and special events. What's more, it is easy to implement.
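Concretely, Prophet fits exactly this kind of additive decomposition (this is Prophet's standard model specification, not something specific to our data):
y(t) = g(t) + s(t) + h(t) + ε(t)
where g(t) is the trend, s(t) the periodic seasonality (weekly/yearly), h(t) the holiday effects, and ε(t) the error term.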
from fbprophet import Prophet
# extract date and shown for ad_group_1
df_1 = data.loc[data['ad'] == 'ad_group_1', ['date', 'shown']]
# Prophet expects the input columns to be named 'ds' and 'y'
df_1_gam = df_1.rename(columns={'date': 'ds', 'shown': 'y'})
df_1_gam.head()
# fit the model
m = Prophet()
m.fit(df_1_gam)
# build a dataframe extending 30 days past the training data
future = m.make_future_dataframe(periods=30)
future
# make predictions
forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].head()
prediction_grp_1 = forecast[forecast['ds'] == '2015-12-15']['yhat'].to_string(index=False)
prediction_grp_1
# ' 77558.452285'
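Note: to_string(index=False) returns the forecast as text. If a numeric value is needed downstream, an equivalent extraction would be forecast.loc[forecast['ds'] == '2015-12-15', 'yhat'].values[0], which yields a float.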
# interactive forecasting plot of grp_1_ad shown
from fbprophet.plot import plot_plotly
import plotly.offline as py
py.init_notebook_mode()
fig = plot_plotly(m, forecast)
py.iplot(fig)
# component functions
fig2 = m.plot_components(forecast)
# now we predict all other groups
prediction = pd.DataFrame({'ad': [], 'Dec_15_shown': []})
for i in range(1, 41):
    ad_group = 'ad_group_' + str(i)
    df_x = data.loc[data['ad'] == ad_group, ['date', 'shown']]
    df_x_gam = df_x.rename(columns={'date': 'ds', 'shown': 'y'})
    m = Prophet()
    m.fit(df_x_gam)
    future = m.make_future_dataframe(periods=30)
    forecast = m.predict(future)
    prediction_grp_x = forecast[forecast['ds'] == '2015-12-15']['yhat'].to_string(index=False)
    prediction = prediction.append({'ad': ad_group, 'Dec_15_shown': prediction_grp_x},
                                   ignore_index=True)
prediction
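Because to_string() stored each forecast as text, it helps to cast the column back to numbers before comparing groups. A minimal follow-up sketch (the sort step is my addition, not part of the original analysis):
# cast the text forecasts back to floats so we can sort and compare groups
prediction['Dec_15_shown'] = pd.to_numeric(prediction['Dec_15_shown'])
prediction.sort_values(by='Dec_15_shown', ascending=False).head()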
# visualization: avg_cost_per_click over time for every ad group
fig, ax = plt.subplots(figsize=(15, 8))
for i in range(1, 41):
    ad_group = 'ad_group_' + str(i)
    vals = data[data['ad'] == ad_group].sort_values(by='date')['avg_cost_per_click'].values
    ax.plot(vals, label=ad_group)
ax.legend()
plt.tight_layout()
plt.show()
def cost_stats(df):
    """Summarize the day-over-day avg_cost_per_click trend as ratio statistics."""
    tmp = df.sort_values(by='date')['avg_cost_per_click'].values
    ratio = tmp[1:] / tmp[:-1]   # each day's cost divided by the previous day's
    ratio_mean = np.mean(ratio)
    ratio_min = np.min(ratio)
    ratio_25 = np.percentile(ratio, 25)
    ratio_50 = np.percentile(ratio, 50)
    ratio_75 = np.percentile(ratio, 75)
    ratio_max = np.max(ratio)
    return pd.Series([ratio_mean, ratio_min, ratio_25, ratio_50, ratio_75, ratio_max],
                     index=['mean', 'min', '25%', '50%', '75%', 'max'])
stats = data.groupby('ad').apply(cost_stats)
stats.head()
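A median ratio above 1 means a group's cost per click is drifting up day over day; below 1 means it is drifting down. As a rough follow-up (the 1.0 cutoff is an assumed threshold, not part of the original analysis), we could bucket the groups by trend direction:
# bucket groups by the direction of their median day-over-day cost ratio
going_up = stats[stats['50%'] > 1.0].index.tolist()
going_down = stats[stats['50%'] < 1.0].index.tolist()
print(len(going_up), 'groups trending up;', len(going_down), 'groups trending down')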
# visualization
hist_kws = {'histtype': 'bar', 'edgecolor': 'black', 'alpha': 0.2}
fig, ax = plt.subplots(nrows=3, ncols=1, figsize=(12, 10), sharex=True)
sns.distplot(stats['25%'], bins=40, ax=ax[0], label='25%', hist_kws=hist_kws)
ax[0].legend(fontsize=12)
sns.distplot(stats['50%'], bins=40, ax=ax[1], label='50%', hist_kws=hist_kws)
ax[1].legend(fontsize=12)
sns.distplot(stats['75%'], bins=40, ax=ax[2], label='75%', hist_kws=hist_kws)
ax[2].legend(fontsize=12)
plt.tight_layout()
plt.show()