Ad Click Data Analysis and Prediction

Goal

  1. Choose a metric to describe the performance of an ad group.
  2. Discuss the pros and cons of this metric, use it to identify the top 5 ad groups, and for each group predict how many ads will be shown on Dec 15th.
  3. Cluster the ads into 3 groups by the trend of avg_cost_per_click: uptrend, flat, and downtrend.

Columns

date : all data are aggregated by date

shown : the number of ads shown on a given day across the web. Impressions are free; that is, companies pay only when a user clicks on the ad, not to show it

clicked : the number of clicks on the ads. This is what companies pay for. By clicking on the ad, the user is brought to the site

converted : the number of conversions on the site coming from ads. To be counted, a conversion has to happen on the same day as the ad click.

avg_cost_per_click : on average, how much each of those clicks cost

total_revenue : how much revenue came from the conversions

ad : we have several different ad groups. This shows which ad group we are considering

Packages

import warnings
warnings.simplefilter('ignore')
import logging, sys
logging.disable(sys.maxsize)
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

Load Data

data = pd.read_csv('ad_table.csv',parse_dates=['date'])
data.head(5)

[Figure 1: output of data.head()]

data.info()

[Figure 2: output of data.info()]

data.describe()

[Figure 3: output of data.describe()]

Metric

We need a metric to measure how good a given ad group is. Here, I'd like to go with profit per unit:

profit = total_revenue - (# of clicks * avg_cost_per_click)
profit per unit = profit / # of ads shown

Pros: it's a direct measurement of profitability, and it is intuitive and easy to compute.

Cons: profit per unit doesn't reflect a group's overall scale. The most profitable group per impression is not necessarily the group that makes the largest total profit.
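As a quick sanity check of the formula on made-up numbers (all values below are hypothetical, not from the dataset):

```python
# hypothetical single-day numbers for one ad group
shown = 10000
clicked = 500
avg_cost_per_click = 0.50
total_revenue = 400.0

# profit = revenue minus what was paid for the clicks
profit = total_revenue - clicked * avg_cost_per_click   # 400 - 250 = 150
# normalize by impressions to get profit per unit
profit_per_unit = profit / shown                        # 150 / 10000 = 0.015
```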

data['profit'] = data['total_revenue']-data['avg_cost_per_click']*data['clicked']
data.head()

[Figure 4: data with the new profit column]

def per_unit_profit(data):
    profit = data['profit'].sum()
    shown = data['shown'].sum()
    pup = profit/shown
    
    return pup
grouped = data.groupby('ad').apply(per_unit_profit).reset_index()
grouped = grouped.rename(columns={0: 'unit_profit'})
grouped = grouped.sort_values(by='unit_profit', ascending=False)
grouped.head(10)

[Figure 5: ad groups ranked by unit_profit]
According to this metric, the top 5 ad groups are group_16, group_2, group_14, group_31, and group_27.

Forecasting impressions on Dec 15th

Here I'm using a GAM (generalized additive model), since our dataset is a time series. Note that ARIMA (autoregressive integrated moving average) and long short-term memory (LSTM) networks both do well in time series analysis. However, ARIMA requires rigid assumptions (trends should have regular periods, as well as constant mean and variance), and neural networks lack interpretability.
A GAM is like regression, except that it is a sum of component functions rather than a linear combination of variables. It can therefore isolate those components and help us interpret patterns such as weekly, monthly, and annual seasonality, holidays, and special events. What's more, it is easy to implement.

from fbprophet import Prophet
# extract date and shown
df_1 = data.loc[data['ad']=='ad_group_1',['date','shown']]
df_1_gam=df_1.rename(columns={'date':'ds','shown':'y'})
df_1_gam.head()

[Figure 6: output of df_1_gam.head()]

# fit in model
m=Prophet()
m.fit(df_1_gam)
# make predictions
future = m.make_future_dataframe(periods=30)
future

[Figure 7: the future dataframe]

# make prediction
forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].head()

[Figure 8: forecast head with ds, yhat, yhat_lower, yhat_upper]

prediction_grp_1 = forecast[forecast['ds']=='2015-12-15']['yhat'].to_string(index=False)
prediction_grp_1
# ' 77558.452285'
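to_string(index=False) returns the prediction as a string. If a numeric value is needed downstream, a small variant would be the following (using a hypothetical mini-forecast frame standing in for Prophet's output):

```python
import pandas as pd

# hypothetical stand-in for Prophet's forecast dataframe
forecast = pd.DataFrame({
    'ds': pd.to_datetime(['2015-12-14', '2015-12-15']),
    'yhat': [76000.0, 77558.452285],
})

# select the row by date and pull the value out as a float
pred = forecast.loc[forecast['ds'] == '2015-12-15', 'yhat'].iloc[0]
```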
# interactive forecasting plot of grp_1_ad shown
from fbprophet.plot import plot_plotly
import plotly.offline as py
py.init_notebook_mode()
fig = plot_plotly(m, forecast)
py.iplot(fig)

[Figure 9: interactive forecast plot for ad_group_1]

# component functions
fig2 = m.plot_components(forecast)

[Figure 10: forecast component plots (trend and seasonality)]

# now we predict all other groups
# (collect rows in a list; DataFrame.append was removed in pandas 2.0)
rows = []

for i in range(1, 41):
    ad_group = 'ad_group_' + str(i)
    df_x = data.loc[data['ad'] == ad_group, ['date', 'shown']]
    df_x_gam = df_x.rename(columns={'date': 'ds', 'shown': 'y'})
    m = Prophet()
    m.fit(df_x_gam)
    future = m.make_future_dataframe(periods=30)
    forecast = m.predict(future)
    prediction_grp_x = forecast.loc[forecast['ds'] == '2015-12-15', 'yhat'].iloc[0]
    rows.append({'ad': ad_group, 'Dec_15_shown': prediction_grp_x})

prediction = pd.DataFrame(rows)
prediction

[Figures 11 and 12: predicted Dec 15th impressions for all 40 ad groups]

Cluster

# visualization
fig, ax = plt.subplots(figsize=(15, 8))
for i in range(1, 41):
    ad_group = 'ad_group_' + str(i)
    vals = data[data['ad'] == ad_group].sort_values(by='date')['avg_cost_per_click'].values
    ax.plot(vals, label=ad_group)

ax.legend()
plt.tight_layout()
plt.show()

def cost_stats(df):
    """ function to calculate the avg_cost_per_click trend """
    tmp = df.sort_values(by='date')['avg_cost_per_click'].values
    ratio = tmp[1:] / tmp[:-1]
    
    ratio_mean = np.mean(ratio)
    ratio_min = np.min(ratio)
    ratio_25 = np.percentile(ratio, 25)
    ratio_50 = np.percentile(ratio, 50)
    ratio_75 = np.percentile(ratio, 75)
    ratio_max = np.max(ratio)
    
    return pd.Series([ratio_mean, ratio_min, ratio_25, ratio_50, ratio_75, ratio_max], 
                     index=['mean', 'min', '25%', '50%', '75%', 'max'])

stats = data.groupby('ad').apply(cost_stats)
stats.head()

[Figure 13: output of stats.head()]

# visualization
hist_kws={'histtype': 'bar', 'edgecolor':'black', 'alpha': 0.2}

fig, ax = plt.subplots(nrows=3, ncols=1, figsize=(12, 10), sharex=True)
sns.distplot(stats['25%'], bins=40, ax=ax[0], label='25%', hist_kws=hist_kws)
ax[0].legend(fontsize=12)
sns.distplot(stats['50%'], bins=40, ax=ax[1], label='50%', hist_kws=hist_kws)
ax[1].legend(fontsize=12)
sns.distplot(stats['75%'], bins=40, ax=ax[2], label='75%', hist_kws=hist_kws)
ax[2].legend(fontsize=12)
plt.tight_layout()
plt.show()

[Figures 14 and 15: distributions of the 25%, 50%, and 75% ratio percentiles]
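The stats above can feed the three-way split from goal 3. One simple sketch: label each group by its mean day-over-day ratio, using a tolerance band around 1.0 (the tol value below is a hypothetical threshold, not from the original analysis; it could be read off the percentile distributions plotted above):

```python
def label_trend(ratio_mean, tol=0.002):
    """Classify a group's avg_cost_per_click trend from its mean day-over-day ratio."""
    if ratio_mean > 1 + tol:
        return 'uptrend'      # cost per click tends to rise day over day
    if ratio_mean < 1 - tol:
        return 'downtrend'    # cost per click tends to fall
    return 'flat'             # ratios hover around 1.0

# applied to the cost_stats output from above:
# stats['trend'] = stats['mean'].apply(label_trend)
```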
