Ad fraud is one of the major challenges in digital marketing: click fraud wastes advertisers' money and distorts click data. This competition provides roughly 500,000 click records. Note: the data were synthetically generated, the meaning of some features is hidden, and the data have been anonymized.
The task is to predict whether a user's click is a normal click or a fraudulent one. Click-fraud prediction applies to all kinds of feed ads, banner ads, and the Baidu ad network, helping advertisers identify click fraud and lock onto genuine users.
The competition provides 500,000 training records and 150,000 test records. The goal is to predict whether each record is fraudulent.
Field | Type | Description |
---|---|---|
sid | string | Sample ID / request session sid |
package | string | Media info: package name (encrypted) |
version | string | Media info: app version |
android_id | string | Media info: external ad slot ID (encrypted) |
media_id | string | Media info: external media ID (encrypted) |
apptype | int | Media info: app category |
timestamp | bigint | Time the request reached the server, in ms |
location | int | User geographic location code (city level) |
fea_hash | int | User feature code (semantics withheld) |
fea1_hash | int | User feature code (semantics withheld) |
cus_type | int | User feature code (semantics withheld) |
ntt | int | Network type: 0-unknown, 1-wired, 2-WiFi, 3-cellular (unknown generation), 4-2G, 5-3G, 6-4G |
carrier | string | Carrier: 0-unknown, 46000-China Mobile, 46001-China Unicom, 46003-China Telecom |
os | string | Operating system, android by default |
osv | string | Operating system version |
dev_height | int | Device screen height |
dev_width | int | Device screen width |
dev_ppi | int | Screen pixel density (ppi) |
label | int | Whether the click is fraudulent (the prediction target) |
The label column shows that this is a binary classification task, which can be tackled with classic machine-learning algorithms or an MLP.
The solution is split into two parts: a machine-learning approach and a deep-learning approach.
The overall modeling steps are outlined below; for full details see the source code in the Gitee repository.
Machine learning here largely comes down to feature engineering plus hand-me-down hyperparameters. To get a first baseline out quickly, we start with LightGBM (LGB), whose biggest selling point is that it is fast while remaining accurate.
Missing-value handling
Inspection shows that missing values appear only in the lan and osv columns.
# String columns will later need label encoding into numeric values
object_cols = train.select_dtypes(include='object').columns
# Count missing values per column
temp = train.isnull().sum()
# Columns with missing values: lan, osv
temp[temp > 0]
# Collect the candidate feature columns
features = train.columns.tolist()
features.remove('label')
print(features)
Continuous vs. categorical features
Next we distinguish continuous from categorical features. The conclusion: osv needs a custom transformation, and for fea_hash and fea1_hash we start by using the string length as a feature.
# Number of distinct values per feature
for feature in features:
    print(feature, train[feature].nunique())
Transforming osv
# Clean osv and map it to an integer version code
def trans_osv(osv):
    osv = str(osv).replace(' ', '').replace('.', '').replace('Android_', '') \
                  .replace('十核20G_HD', '').replace('Android', '').replace('W', '')
    if osv == 'nan' or osv == 'GIONEE_YNGA':
        # Missing or unparseable values default to 810 (i.e. Android 8.1.0)
        result = 810
    elif osv.count('-') > 0:
        result = int(osv.split('-')[0])
    elif osv == 'f073b_changxiang_v01_b1b8_20180915':
        result = 810
    elif osv == '%E6%B1%9F%E7%81%B5OS+50':
        result = 500
    else:
        result = int(osv)
    # Scale short codes so that every result has three digits (e.g. 9 -> 900)
    if result < 10:
        result = result * 100
    elif result < 100:
        result = result * 10
    return int(result)
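A few spot checks of the mapping (input values chosen for illustration, not taken from the dataset):
# Illustrative checks of trans_osv
print(trans_osv('8.1.0'))       # dots stripped -> '810' -> 810
print(trans_osv('9'))           # 9 < 10, scaled by 100 -> 900
print(trans_osv(float('nan')))  # missing value -> default 810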
Finally, apply the transformations to both the training and test sets.
# Feature selection: col is the list of retained feature columns
# (non-modeling columns such as os and sid removed)
features = train[col]
# Construct length features from fea_hash / fea1_hash
features['fea_hash_len'] = features['fea_hash'].map(lambda x: len(str(x)))
features['fea1_hash_len'] = features['fea1_hash'].map(lambda x: len(str(x)))
# Why map very long fea_hash values to 0?
# Over-long hashes are treated as outliers and bucketed to 0;
# their information is preserved by the length feature above
features['fea_hash'] = features['fea_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
features['fea1_hash'] = features['fea1_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
features['osv'] = features['osv'].apply(trans_osv)
# Apply the same transformations to the test set
test_features = test[col]
test_features['fea_hash_len'] = test_features['fea_hash'].map(lambda x: len(str(x)))
test_features['fea1_hash_len'] = test_features['fea1_hash'].map(lambda x: len(str(x)))
test_features['fea_hash'] = test_features['fea_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
test_features['fea1_hash'] = test_features['fea1_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
test_features['osv'] = test_features['osv'].apply(trans_osv)
Training LightGBM with default parameters yields a final score of 88.094.
# Train with LightGBM at default parameters
import pandas as pd
import lightgbm as lgb
model = lgb.LGBMClassifier()
# Fit on all features except timestamp and version
model.fit(features.drop(['timestamp', 'version'], axis=1), train['label'])
result = model.predict(test_features.drop(['timestamp', 'version'], axis=1))
# Build the submission file: sid plus predicted label
res = pd.DataFrame(test['sid'])
res['label'] = result
res.to_csv('./baseline.csv', index=False)
res
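The 88.094 above is the leaderboard score; the baseline submits directly without local validation. A minimal hold-out check might look like the following sketch (assuming accuracy is the competition metric):
# Hold-out validation sketch (an assumption, not part of the original pipeline)
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X = features.drop(['timestamp', 'version'], axis=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, train['label'], test_size=0.2, random_state=42)
clf = lgb.LGBMClassifier()
clf.fit(X_tr, y_tr)
print('validation accuracy:', accuracy_score(y_val, clf.predict(X_val)))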
Below is the list of approaches tried; see the results table at the end of this post for version-by-version comparisons, and the Gitee repository for the source code.
The deep-learning approaches are built mainly on Baidu's PaddlePaddle framework.
The data-processing module is largely the same as in the machine-learning approach, but because a neural network is used, the features additionally need to be standardized after processing.
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
# Load the data and drop the leading index column
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
test = test.iloc[:, 1:]
train = train.iloc[:, 1:]
train
# Object-type columns: lan, os, osv, version, fea_hash
# Columns with missing values: lan, osv
# ['os', 'osv', 'lan', 'sid']
features = train.columns.tolist()
features.remove('label')
print(features)
# Number of distinct values per feature
for feature in features:
    print(feature, train[feature].nunique())
# Clean osv: keep the major version and fold the rest into the decimal part
def osv_trans(x):
    x = str(x).replace('Android_', '').replace('Android ', '').replace('W', '')
    if str(x).find('.') > 0:
        temp_index1 = x.find('.')
        if x.find(' ') > 0:
            temp_index2 = x.find(' ')
        else:
            temp_index2 = len(x)
        if x.find('-') > 0:
            temp_index2 = x.find('-')
        result = x[0:temp_index1] + '.' + x[temp_index1 + 1:temp_index2].replace('.', '')
        try:
            return float(result)
        except:
            print(x + '#########')  # flag unparseable values
            return 0
    try:
        return float(x)
    except:
        print(x + '#########')  # flag unparseable values
        return 0
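Spot checks with illustrative inputs: the function keeps the major version and folds the remaining digits into the decimal part.
# Illustrative checks of osv_trans (example inputs, not from the dataset)
print(osv_trans('8.1.0'))        # '8' + '.' + '10' -> 8.1
print(osv_trans('Android 6.0'))  # prefix stripped -> 6.0
print(osv_trans('unknown'))      # unparseable -> prints a marker, returns 0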
# train['osv'] => LabelEncoder?
# Fill missing osv values with the mode, then clean
train['osv'].fillna('8.1.0', inplace=True)
train['osv'] = train['osv'].apply(osv_trans)
test['osv'].fillna('8.1.0', inplace=True)
test['osv'] = test['osv'].apply(osv_trans)
# Inspect the language column and build an explicit encoding map
train['lan'].value_counts()
train['lan'].value_counts().index
lan_map = {'zh-CN': 1, 'zh_CN': 2, 'Zh-CN': 3, 'zh-cn': 4, 'zh_CN_#Hans': 5, 'zh': 6, 'ZH': 7, 'cn': 8, 'CN': 9,
           'zh-HK': 10, 'tw': 11, 'TW': 12, 'zh-TW': 13, 'zh-MO': 14, 'en': 15, 'en-GB': 16, 'en-US': 17, 'ko': 18,
           'ja': 19, 'it': 20, 'mi': 21}
train['lan'] = train['lan'].map(lan_map)
test['lan'] = test['lan'].map(lan_map)
test['lan'].value_counts()
# Encode missing lan values as 22
train['lan'].fillna(22, inplace=True)
test['lan'].fillna(22, inplace=True)
# Drop columns that will not be used for modeling
remove_list = ['os', 'sid']
col = features.copy()  # copy so that removals do not mutate the original list
for i in remove_list:
    col.remove(i)
col
# Raw timestamps are in milliseconds (e.g. 1559892728241.7212),
# while datetime.fromtimestamp expects seconds, hence the division by 1000
from datetime import datetime
train['timestamp'] = train['timestamp'].apply(lambda x: datetime.fromtimestamp(x / 1000))
test['timestamp'] = test['timestamp'].apply(lambda x: datetime.fromtimestamp(x / 1000))
test['timestamp']
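A quick sanity check of the unit conversion, using one of the raw values noted above (the exact output depends on the local timezone):
# 1559892728241 ms corresponds to roughly 2019-06-07 in UTC
from datetime import datetime
print(datetime.fromtimestamp(1559892728241.7212 / 1000))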
# Map irregular version strings to integers
def version_trans(x):
    if x == 'V3':
        return 3
    if x == 'v1':
        return 1
    if x == 'P_Final_6':
        return 6
    if x == 'V6':
        return 6
    if x == 'GA3':
        return 3
    if x == 'GA2':
        return 2
    if x == 'V2':
        return 2
    if x == '50':
        return 5
    return int(x)
train['version'] = train['version'].apply(version_trans)
test['version'] = test['version'].apply(version_trans)
train['version'] = train['version'].astype('int')
test['version'] = test['version'].astype('int')
# Feature selection
features = train[col]
# Construct length features from fea_hash / fea1_hash
features['fea_hash_len'] = features['fea_hash'].map(lambda x: len(str(x)))
features['fea1_hash_len'] = features['fea1_hash'].map(lambda x: len(str(x)))
# Why map very long fea_hash values to 0?
# Over-long hashes are treated as outliers and bucketed to 0;
# their information is preserved by the length feature above
features['fea_hash'] = features['fea_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
features['fea1_hash'] = features['fea1_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
features
# Same transformations for the test set
test_features = test[col]
test_features['fea_hash_len'] = test_features['fea_hash'].map(lambda x: len(str(x)))
test_features['fea1_hash_len'] = test_features['fea1_hash'].map(lambda x: len(str(x)))
test_features['fea_hash'] = test_features['fea_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
test_features['fea1_hash'] = test_features['fea1_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
test_features
# Extract multi-scale time features from the training timestamps
temp = pd.DatetimeIndex(features['timestamp'])
features['year'] = temp.year
features['month'] = temp.month
features['day'] = temp.day
features['week_day'] = temp.weekday  # day of week
features['hour'] = temp.hour
features['minute'] = temp.minute
# Elapsed time since the first record, in fractional days
start_time = features['timestamp'].min()
features['time_diff'] = features['timestamp'] - start_time
features['time_diff'] = features['time_diff'].dt.days + features['time_diff'].dt.seconds / 3600 / 24
features[['timestamp', 'year', 'month', 'day', 'week_day', 'hour', 'minute', 'time_diff']]
# Same time features for the test set, measured from the training start_time
temp = pd.DatetimeIndex(test_features['timestamp'])
test_features['year'] = temp.year
test_features['month'] = temp.month
test_features['day'] = temp.day
test_features['week_day'] = temp.weekday  # day of week
test_features['hour'] = temp.hour
test_features['minute'] = temp.minute
test_features['time_diff'] = test_features['timestamp'] - start_time
test_features['time_diff'] = test_features['time_diff'].dt.days + test_features['time_diff'].dt.seconds / 3600 / 24
test_features['time_diff']
# Construct a screen-area feature from device height and width
features['dev_area'] = features['dev_height'] * features['dev_width']
test_features['dev_area'] = test_features['dev_height'] * test_features['dev_width']
"""
Thinking: could dev_ppi and dev_area be combined into new features?
features['dev_ppi'].value_counts()
features['dev_area'].astype('float') / features['dev_ppi'].astype('float')
"""
# Inspect a few categorical distributions
features['carrier'].value_counts()
features['package'].value_counts()
# version_osv: gap between the OS version and the app version
features['version_osv'] = features['osv'] - features['version']
test_features['version_osv'] = test_features['osv'] - test_features['version']
# Drop the raw timestamp column now that time features are extracted
features = features.drop(['timestamp'], axis=1)
test_features = test_features.drop(['timestamp'], axis=1)
# Standardize the features: fit on train, apply the same scaling to test
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
features1 = scaler.fit_transform(features)
test_features1 = scaler.transform(test_features)
import paddle
from paddle import nn
from paddle.io import Dataset, DataLoader
import numpy as np
paddle.device.set_device('gpu:0')

# Custom dataset wrapping feature / label DataFrames
class MineDataset(Dataset):
    def __init__(self, X, y):
        super(MineDataset, self).__init__()
        self.num_samples = len(X)
        self.X = X
        self.y = y

    def __getitem__(self, idx):
        return self.X.iloc[idx].values.astype('float32'), np.array(self.y.iloc[idx]).astype('int64')

    def __len__(self):
        return self.num_samples
from sklearn.model_selection import train_test_split
# Hold out 20% for validation; wrap the scaled arrays back into DataFrames
train_x, val_x, train_y, val_y = train_test_split(features1, train['label'], test_size=0.2, random_state=42)
train_x = pd.DataFrame(train_x, columns=features.columns)
val_x = pd.DataFrame(val_x, columns=features.columns)
train_y = pd.DataFrame(train_y, columns=['label'])
val_y = pd.DataFrame(val_y, columns=['label'])
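A quick way to verify the dataset wiring before building the loaders (illustrative; the printed sizes are examples):
# Sanity check: fetch one sample and confirm shapes and dtypes
ds = MineDataset(train_x, train_y)
x0, y0 = ds[0]
print(x0.shape, x0.dtype, y0.dtype, len(ds))  # e.g. (n_features,) float32 int64 400000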
train_dataloader = DataLoader(MineDataset(train_x, train_y),
                              batch_size=1024,
                              shuffle=True,
                              drop_last=True,
                              num_workers=2)
val_dataloader = DataLoader(MineDataset(val_x, val_y),
                            batch_size=1024,
                            shuffle=True,
                            drop_last=True,
                            num_workers=2)
# For the test set: wrap the scaled array back into a DataFrame (MineDataset
# uses .iloc), use dummy labels, and keep the original order so predictions
# line up with sids (no shuffling, no dropped batches)
test_x = pd.DataFrame(test_features1, columns=features.columns)
test_dataloader = DataLoader(MineDataset(test_x, pd.DataFrame({'label': [0] * len(test_x)})),
                             batch_size=1024,
                             shuffle=False,
                             drop_last=False,
                             num_workers=2)
The first version of the network uses only simple fully connected layers: a tower structure narrowing from 250 units down to 2, with ReLU and dropout between the linear layers.
class ClassifyModel(nn.Layer):
def __init__(self, features_len):
super(ClassifyModel, self).__init__()
self.fc1 = nn.layer.Linear(in_features=features_len, out_features=250)
self.ac1 = nn.layer.ReLU()
self.drop1 = nn.layer.Dropout(p=0.02)
self.fc2 = nn.layer.Linear(in_features=250, out_features=100)
self.ac2 = nn.layer.ReLU()
self.drop2 = nn.layer.Dropout(p=0.02)
self.fc3 = nn.layer.Linear(in_features=100, out_features=50)
self.ac3 = nn.layer.ReLU()
self.drop3 = nn.layer.Dropout(p=0.02)
self.fc4 = nn.layer.Linear(in_features=50, out_features=25)
self.ac4 = nn.layer.ReLU()
self.drop4 = nn.layer.Dropout(p=0.02)
        self.fc5 = nn.layer.Linear(in_features=25, out_features=2)
        # Note: CrossEntropyLoss below applies softmax internally, so this
        # final Sigmoid is redundant; it is kept as in the original design
        self.out = nn.layer.Sigmoid()
def forward(self, input):
x = self.fc1(input)
x = self.ac1(x)
x = self.drop1(x)
x = self.fc2(x)
x = self.ac2(x)
x = self.drop2(x)
x = self.fc3(x)
x = self.ac3(x)
x = self.drop3(x)
x = self.fc4(x)
x = self.ac4(x)
x = self.drop4(x)
x = self.fc5(x)
output = self.out(x)
return output
# Instantiate the model
model = ClassifyModel(int(len(features.columns)))
# Switch to training mode
model.train()
# Optimizer and loss
opt = paddle.optimizer.AdamW(learning_rate=0.001, parameters=model.parameters())
loss_fn = nn.CrossEntropyLoss()
EPOCHS = 10  # number of passes over the training set
for epoch in range(EPOCHS):
    for iter_id, mini_batch in enumerate(train_dataloader):
        x_train = mini_batch[0]
        y_train = mini_batch[1]
        # Forward pass
        y_pred = model(x_train)
        # Compute the loss
        loss = loss_fn(y_pred, y_train)
        avg_loss = paddle.mean(loss)
        # Log loss and accuracy every 20 iterations
        if iter_id % 20 == 0:
            acc = paddle.metric.accuracy(y_pred, y_train)
            print("epoch: {}, iter: {}, loss is: {}, acc is: {}".format(epoch, iter_id, avg_loss.numpy(), acc.numpy()))
        # Backward pass, parameter update, gradient reset
        avg_loss.backward()
        opt.step()
        opt.clear_grad()
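The notebook stops at training. A minimal inference sketch for producing a submission in the same sid/label format as the ML baseline might look like this (an assumption, not from the source; it relies on the order-preserving test_dataloader defined above):
# Inference sketch (assumed): run the test set through the trained network
model.eval()
preds = []
with paddle.no_grad():
    for x_batch, _ in test_dataloader:
        out = model(x_batch)
        preds.append(paddle.argmax(out, axis=1).numpy())
res = pd.DataFrame(test['sid'])
res['label'] = np.concatenate(preds)
res.to_csv('./paddle_baseline.csv', index=False)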
As before, for reasons of space, the remaining two approaches are left to the source code: Gitee repository.
Note: before using the Embedding models, run Embedding分析.ipynb first to generate the required dictionary files.
Category | Model | Details | Score |
---|---|---|---|
ML | ML v1 | 1. Initial model 2. Features excluded from modeling: ['os', 'version', 'lan', 'sid'] 3. LGB with default parameters | 88.094 |
ML | ML v2 | 1. Based on v1 2. Added version; simple transformation of timestamp 3. Compared default-parameter LGB and XGB | 88.2133 |
ML | ML v3 | 1. Based on v2 2. Added lan 3. Difference between osv and version 4. Hand-me-down LGB parameters | 88.9487 |
ML | ML v4 | 1. Based on v3 2. 5-fold LGB 3. 5-fold XGB 4. Ensemble | 89.0293 / 89.0253 / 89.054 |
ML | ML v5 | 1. Based on v3 2. Added pixel ratio, pixel size, and pixel-resolution ratio 3. 5-fold LGB 4. 5-fold XGB 5. Ensemble | 89.1873 / 89.108 / 89.1713 |
Paddle | Paddle v1 | 1. Feature engineering from ML v3 2. Simple network built with Paddle | no result submitted |
Paddle | Paddle v2 | 1. Based on v1 2. Added embedding dictionary creation (in Embedding分析.ipynb) 3. Hybrid base model with embeddings | 88.71 |
Paddle | Paddle v3 | 1. Based on v2 2. Added a DeepFM component model, then merged | 87.816 |
TensorFlow | TF v1 | 1. Feature engineering from ML v3 2. Simple network built with TensorFlow | no result submitted |
FM | FM v1 | 1. First simple model based on FM | 57.2147 |
Final leaderboard score
Source code:
https://gitee.com/turkeymz/coggle/tree/master/coggle_202112/mlp