[December Top 2] MarTech Challenge: Click Anti-Fraud Prediction

Background

Ad fraud is one of the major challenges in digital marketing: click fraud wastes large amounts of advertisers' money and distorts click data. This competition provides roughly 500,000 click records. Note that the data has been synthetically generated, the meaning of some features has been hidden, and the data has been anonymized.

The task is to predict whether each user click is a normal click or a fraudulent one. Click-fraud prediction applies to all kinds of feed ads, banner ads, and the Baidu Union platform, helping advertisers identify click fraud and lock onto genuine users.

  • Competition page: https://aistudio.baidu.com/aistudio/competition/detail/52/0/introduction
  • Competition dataset: https://download.csdn.net/download/turkeym4/72338032#

Data and Task

The competition provides 500,000 training records and 150,000 test records. The goal is to predict whether each record is a fraudulent click.

| Field | Type | Description |
| --- | --- | --- |
| sid | string | sample id / request session sid |
| package | string | media info: package name (encrypted) |
| version | string | media info: app version |
| android_id | string | media info: external ad slot ID (encrypted) |
| media_id | string | media info: external media ID (encrypted) |
| apptype | int | media info: app category |
| timestamp | bigint | time the request reached the server, in ms |
| location | int | encoded user geolocation (city level) |
| fea_hash | int | encoded user feature (physical meaning withheld) |
| fea1_hash | int | encoded user feature (physical meaning withheld) |
| cus_type | int | encoded user feature (physical meaning withheld) |
| ntt | int | network type: 0-unknown, 1-wired, 2-WiFi, 3-unknown cellular, 4-2G, 5-3G, 6-4G |
| carrier | string | carrier: 0-unknown, 46000-China Mobile, 46001-China Unicom, 46003-China Telecom |
| os | string | operating system, default android |
| osv | string | operating system version |
| lan | string | device language, default Chinese |
| dev_height | int | device height |
| dev_width | int | device width |
| dev_ppi | int | screen pixel density |
| label | int | whether the click is fraudulent |

The label column tells us this is a binary classification task, which can be tackled with gradient-boosted machine learning algorithms or with an MLP.
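Before modeling, it is worth a quick look at the class balance. A minimal sketch (assuming train.csv is the competition training file):

import pandas as pd

# Load the training data and inspect the label distribution
train = pd.read_csv('train.csv')
print(train.shape)
print(train['label'].value_counts(normalize=True))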

Approach

The solution falls into two parts:

  • Binary classification with machine learning: LGB / XGB / CatBoost
  • Binary classification with deep learning: MLP / Wide & Deep / DeepFM

The modeling steps are outlined below; for full details, see the source code in the gitee repo (linked at the end).

Machine Learning

Machine learning here boils down to feature engineering plus well-worn ("inherited") hyperparameters. To get a first baseline out quickly, we usually start with LightGBM (LGB): its biggest selling point is that it stays fast without giving up accuracy.

Feature Processing

Missing values
An inspection of the data shows that missing values appear only in lan and osv.

# String (object) columns need to be converted to numeric values (label encoding)
object_cols = train.select_dtypes(include='object').columns

# Count missing values per column
temp = train.isnull().sum()
# Columns with missing values: lan, osv
temp[temp>0]
# Collect the candidate feature columns
features = train.columns.tolist()
features.remove('label')
print(features)
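The comment above mentions label encoding, but the snippet never applies it. Here is a minimal sketch (my addition, not the pipeline actually used below, which hand-crafts transforms for osv, version, and lan instead), assuming train and test are the loaded DataFrames:

from sklearn.preprocessing import LabelEncoder

# Fit on train and test together so unseen test categories don't break transform;
# casting to str turns NaN into the literal 'nan', which gets its own code
for c in object_cols:
    le = LabelEncoder()
    le.fit(pd.concat([train[c], test[c]]).astype(str))
    train[c] = le.transform(train[c].astype(str))
    test[c] = le.transform(test[c].astype(str))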

Continuous vs. categorical values
Next, we examine which features are continuous and which are categorical. The conclusion: osv needs a dedicated transform, and as a first step fea_hash and fea1_hash are represented by their string lengths.

for feature in features:
    print(feature, train[feature].nunique())

Processing osv

# Clean and normalize osv into a three-digit version code
def trans_osv(osv):
    osv = str(osv).replace(' ','').replace('.','').replace('Android_','').replace('十核20G_HD','').replace('Android','').replace('W','')
    if osv == 'nan' or osv == 'GIONEE_YNGA':
        result = 810          # impute the mode (8.1.0)
    elif osv.count('-') > 0:
        result = int(osv.split('-')[0])
    elif osv == 'f073b_changxiang_v01_b1b8_20180915':
        result = 810
    elif osv == '%E6%B1%9F%E7%81%B5OS+50':
        result = 500
    else:
        result = int(osv)

    # Pad short version codes to three digits (e.g. 8 -> 800, 81 -> 810)
    if result < 10:
        result = result * 100
    elif result < 100:
        result = result * 10

    return int(result)
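A quick check of the mapping on a few raw values (hypothetical inputs, for illustration only):

print(trans_osv('8.1.0'))       # -> 810
print(trans_osv('Android_9'))   # -> 900
print(trans_osv(float('nan')))  # -> 810 (mode imputation)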

Finally, apply the transforms to both the training and test sets (col is the list of selected feature columns; see how it is built in the deep learning section below).

# Select feature columns
features = train[col]
# Build string-length features from fea_hash / fea1_hash
features['fea_hash_len'] = features['fea_hash'].map(lambda x: len(str(x)))
features['fea1_hash_len'] = features['fea1_hash'].map(lambda x: len(str(x)))
# Thinking: why map very long fea_hash values to 0?
# Values longer than 16 digits are treated as opaque hashes and zeroed; the rest are kept as integers
features['fea_hash'] = features['fea_hash'].map(lambda x: 0 if len(str(x))>16 else int(x))
features['fea1_hash'] = features['fea1_hash'].map(lambda x: 0 if len(str(x))>16 else int(x))
features['osv'] = features['osv'].apply(trans_osv)


# Same transforms for the test set
test_features = test[col]
test_features['fea_hash_len'] = test_features['fea_hash'].map(lambda x: len(str(x)))
test_features['fea1_hash_len'] = test_features['fea1_hash'].map(lambda x: len(str(x)))
test_features['fea_hash'] = test_features['fea_hash'].map(lambda x: 0 if len(str(x))>16 else int(x))
test_features['fea1_hash'] = test_features['fea1_hash'].map(lambda x: 0 if len(str(x))>16 else int(x))
test_features['osv'] = test_features['osv'].apply(trans_osv)

Modeling

Training LightGBM with default parameters gives a final score of 88.094.

# Train a LightGBM baseline with default parameters
import lightgbm as lgb
import pandas as pd

model = lgb.LGBMClassifier()
# Fit on all features except timestamp and version (not yet usable as-is)
model.fit(features.drop(['timestamp', 'version'], axis=1), train['label'])
result = model.predict(test_features.drop(['timestamp', 'version'], axis=1))

# Build the submission file
res = pd.DataFrame(test['sid'])
res['label'] = result
res.to_csv('./baseline.csv', index=False)
res

Directions for Improvement

Below is the list of schemes tried; version-by-version comparisons are in the score table at the end, and the full source is in the gitee repo.

  1. Convert and use version
  2. Use timestamp in detail: add weekend and time-diff features
  3. Add the difference between osv and version
  4. Convert and use lan
  5. Add screen aspect-ratio, screen-area, and pixel-density-ratio features
  6. Use hand-tuned ("inherited") LGB and XGB parameter sets
  7. Train with 5-fold cross-validation
  8. Blend multiple models trained with 5-fold CV (a sketch follows this list)
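As an illustration of items 7 and 8, here is a minimal 5-fold sketch. The hyperparameters are placeholders, not the exact "inherited" ones from the repo, and version and lan are assumed to have already been converted to numeric as described in items 1 and 4:

from sklearn.model_selection import StratifiedKFold
import lightgbm as lgb
import numpy as np

X = features.drop(['timestamp'], axis=1)
y = train['label']
X_test = test_features.drop(['timestamp'], axis=1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
test_pred = np.zeros(len(X_test))
for tr_idx, va_idx in skf.split(X, y):
    model = lgb.LGBMClassifier(n_estimators=1000, learning_rate=0.05)
    model.fit(X.iloc[tr_idx], y.iloc[tr_idx],
              eval_set=[(X.iloc[va_idx], y.iloc[va_idx])])
    # Average the fold probabilities; blending XGB means averaging in another such array
    test_pred += model.predict_proba(X_test)[:, 1] / 5

submission_label = (test_pred > 0.5).astype(int)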

Deep Learning

The deep learning solution is built on Baidu's PaddlePaddle framework.

Feature Processing

The data processing is largely the same as in the machine learning part, but since the features feed a neural network, they need to be standardized after processing.

import pandas as pd
import warnings

warnings.filterwarnings('ignore')

# Load the data and drop the first (index) column
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
test = test.iloc[:, 1:]
train = train.iloc[:, 1:]
train

# Object-typed columns: lan, os, osv, version, fea_hash
# Columns with missing values: lan, osv

features = train.columns.tolist()
features.remove('label')
print(features)

# Cardinality of each feature
for feature in features:
    print(feature, train[feature].nunique())


# Clean the osv column: keep "major.minor" as a float
def osv_trans(x):
    x = str(x).replace('Android_', '').replace('Android ', '').replace('W', '')
    if str(x).find('.') > 0:
        temp_index1 = x.find('.')
        if x.find(' ') > 0:
            temp_index2 = x.find(' ')
        else:
            temp_index2 = len(x)

        if x.find('-') > 0:
            temp_index2 = x.find('-')

        # Keep the first dot and strip the remaining ones: '8.1.0' -> 8.10
        result = x[0:temp_index1] + '.' + x[temp_index1 + 1:temp_index2].replace('.', '')
        try:
            return float(result)
        except:
            print(x + '#########')   # surface values that still fail to parse
            return 0
    try:
        return float(x)
    except:
        print(x + '#########')
        return 0


# Impute missing osv with the mode (8.1.0), then clean
train['osv'].fillna('8.1.0', inplace=True)
train['osv'] = train['osv'].apply(osv_trans)

test['osv'].fillna('8.1.0', inplace=True)
test['osv'] = test['osv'].apply(osv_trans)

# Map the many spellings of the language codes to integers
lan_map = {'zh-CN': 1, 'zh_CN': 2, 'Zh-CN': 3, 'zh-cn': 4, 'zh_CN_#Hans': 5, 'zh': 6, 'ZH': 7, 'cn': 8, 'CN': 9,
           'zh-HK': 10, 'tw': 11, 'TW': 12, 'zh-TW': 13, 'zh-MO': 14, 'en': 15, 'en-GB': 16, 'en-US': 17, 'ko': 18,
           'ja': 19, 'it': 20, 'mi': 21}
train['lan'] = train['lan'].map(lan_map)
test['lan'] = test['lan'].map(lan_map)
test['lan'].value_counts()

# Any lan value not in the map (including missing) becomes category 22
train['lan'].fillna(22, inplace=True)
test['lan'].fillna(22, inplace=True)

# Drop os (effectively constant: android) and sid (sample id) from the feature list
remove_list = ['os', 'sid']
col = features
for i in remove_list:
    col.remove(i)
col

from datetime import datetime

# timestamp is in milliseconds since the epoch; convert to datetime
train['timestamp'] = train['timestamp'].apply(lambda x: datetime.fromtimestamp(x / 1000))
test['timestamp'] = test['timestamp'].apply(lambda x: datetime.fromtimestamp(x / 1000))
test['timestamp']


# Normalize the messy version strings to integers
def version_trans(x):
    if x == 'V3':
        return 3
    if x == 'v1':
        return 1
    if x == 'P_Final_6':
        return 6
    if x == 'V6':
        return 6
    if x == 'GA3':
        return 3
    if x == 'GA2':
        return 2
    if x == 'V2':
        return 2
    if x == '50':
        return 5
    return int(x)


train['version'] = train['version'].apply(version_trans)
test['version'] = test['version'].apply(version_trans)
train['version'] = train['version'].astype('int')
test['version'] = test['version'].astype('int')

# Select feature columns
features = train[col]
# Build string-length features from fea_hash / fea1_hash
features['fea_hash_len'] = features['fea_hash'].map(lambda x: len(str(x)))
features['fea1_hash_len'] = features['fea1_hash'].map(lambda x: len(str(x)))
# Thinking: why map very long fea_hash values to 0?
# Values longer than 16 digits are treated as opaque hashes and zeroed; the rest are kept as integers
features['fea_hash'] = features['fea_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
features['fea1_hash'] = features['fea1_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
features

# Same transforms for the test set
test_features = test[col]
test_features['fea_hash_len'] = test_features['fea_hash'].map(lambda x: len(str(x)))
test_features['fea1_hash_len'] = test_features['fea1_hash'].map(lambda x: len(str(x)))
test_features['fea_hash'] = test_features['fea_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
test_features['fea1_hash'] = test_features['fea1_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
test_features



# Extract multi-scale time features from the training timestamps
temp = pd.DatetimeIndex(features['timestamp'])
features['year'] = temp.year
features['month'] = temp.month
features['day'] = temp.day
features['week_day'] = temp.weekday  # day of the week
features['hour'] = temp.hour
features['minute'] = temp.minute

# Time elapsed since the earliest training timestamp, in days
start_time = features['timestamp'].min()
features['time_diff'] = features['timestamp'] - start_time
features['time_diff'] = features['time_diff'].dt.days + features['time_diff'].dt.seconds / 3600 / 24
features[['timestamp', 'year', 'month', 'day', 'week_day', 'hour', 'minute', 'time_diff']]

# Same time features for the test set
temp = pd.DatetimeIndex(test_features['timestamp'])
test_features['year'] = temp.year
test_features['month'] = temp.month
test_features['day'] = temp.day
test_features['week_day'] = temp.weekday  # day of the week
test_features['hour'] = temp.hour
test_features['minute'] = temp.minute

# Reuse the training start_time so train and test share the same time origin
test_features['time_diff'] = test_features['timestamp'] - start_time
test_features['time_diff'] = test_features['time_diff'].dt.days + test_features['time_diff'].dt.seconds / 3600 / 24
test_features['time_diff']

# Build a screen-area feature from height and width
features['dev_area'] = features['dev_height'] * features['dev_width']
test_features['dev_area'] = test_features['dev_height'] * test_features['dev_width']

# Thinking: dev_ppi and dev_area could be combined into a new feature,
# e.g. features['dev_area'].astype('float') / features['dev_ppi'].astype('float')
# (a sketch follows below)

# Difference between the OS version and the app version
features['version_osv'] = features['osv'] - features['version']
test_features['version_osv'] = test_features['osv'] - test_features['version']
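Answering the "Thinking" note above, one way to combine dev_ppi and dev_area (my sketch, not part of the original pipeline; the guard against zero values is an assumption about the data):

import numpy as np

# Hypothetical extra features: aspect ratio and area per ppi
for df in (features, test_features):
    df['dev_ratio'] = df['dev_height'] / df['dev_width'].replace(0, np.nan)
    df['area_per_ppi'] = df['dev_area'] / df['dev_ppi'].replace(0, np.nan)
    df[['dev_ratio', 'area_per_ppi']] = df[['dev_ratio', 'area_per_ppi']].fillna(0)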

# Drop the raw timestamp now that the time features are extracted
features = features.drop(['timestamp'], axis=1)
test_features = test_features.drop(['timestamp'], axis=1)

# Standardize the features: fit on train only, then apply the same scaling to test
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
features1 = scaler.fit_transform(features)
test_features1 = scaler.transform(test_features)

Building the Dataset and DataLoader

import paddle
from paddle import nn
from paddle.io import Dataset, DataLoader
import numpy as np
paddle.device.set_device('gpu:0')

# Custom dataset wrapping a feature DataFrame and a label Series/DataFrame
class MineDataset(Dataset):
    def __init__(self, X, y):
        super(MineDataset, self).__init__()
        self.num_samples = len(X)
        self.X = X
        self.y = y

    def __getitem__(self, idx):
        return self.X.iloc[idx].values.astype('float32'), np.array(self.y.iloc[idx]).astype('int64')

    def __len__(self):
        return self.num_samples

from sklearn.model_selection import train_test_split


train_x, val_x, train_y, val_y = train_test_split(features1, train['label'], test_size=0.2, random_state=42)

# StandardScaler returns numpy arrays; wrap them back into DataFrames so .iloc works
train_x = pd.DataFrame(train_x, columns=features.columns)
val_x = pd.DataFrame(val_x, columns=features.columns)
train_y = pd.DataFrame(train_y, columns=['label'])
val_y = pd.DataFrame(val_y, columns=['label'])


train_dataloader = DataLoader(MineDataset(train_x, train_y),
                            batch_size=1024,
                            shuffle=True,
                            drop_last=True,
                            num_workers=2)

val_dataloader = DataLoader(MineDataset(val_x, val_y),
                            batch_size=1024,
                            shuffle=True,
                            drop_last=True,
                            num_workers=2)

# For inference, keep the original order and all samples: no shuffling, no dropping;
# the dummy all-zero labels only satisfy the Dataset interface
test_dataloader = DataLoader(MineDataset(pd.DataFrame(test_features1, columns=features.columns),
                                         pd.Series([0 for i in range(len(test_features1))])),
                            batch_size=1024,
                            shuffle=False,
                            drop_last=False,
                            num_workers=2)
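A quick sanity check on the pipeline (a sketch): each batch should yield features of shape (1024, n_features) and labels of shape (1024, 1).

# Peek at one training batch to verify shapes
xb, yb = next(iter(train_dataloader))
print(xb.shape, yb.shape)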

Network Architecture

The first version uses only simple fully connected layers: a tower structure narrowing from 250 units down to 2, with a ReLU and a dropout layer between consecutive linear layers. Note that paddle's cross_entropy applies softmax internally and expects raw logits, so the output layer carries no activation.

class ClassifyModel(nn.Layer):
    """Plain MLP tower: features -> 250 -> 100 -> 50 -> 25 -> 2."""

    def __init__(self, features_len):
        super(ClassifyModel, self).__init__()

        self.fc1 = nn.Linear(in_features=features_len, out_features=250)
        self.ac1 = nn.ReLU()
        self.drop1 = nn.Dropout(p=0.02)

        self.fc2 = nn.Linear(in_features=250, out_features=100)
        self.ac2 = nn.ReLU()
        self.drop2 = nn.Dropout(p=0.02)

        self.fc3 = nn.Linear(in_features=100, out_features=50)
        self.ac3 = nn.ReLU()
        self.drop3 = nn.Dropout(p=0.02)

        self.fc4 = nn.Linear(in_features=50, out_features=25)
        self.ac4 = nn.ReLU()
        self.drop4 = nn.Dropout(p=0.02)

        # Output raw logits: a Sigmoid here would distort the cross-entropy loss
        self.fc5 = nn.Linear(in_features=25, out_features=2)

    def forward(self, input):
        x = self.fc1(input)
        x = self.ac1(x)
        x = self.drop1(x)

        x = self.fc2(x)
        x = self.ac2(x)
        x = self.drop2(x)

        x = self.fc3(x)
        x = self.ac3(x)
        x = self.drop3(x)

        x = self.fc4(x)
        x = self.ac4(x)
        x = self.drop4(x)

        # Return logits; paddle's cross_entropy applies softmax internally
        return self.fc5(x)
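To double-check the tower, paddle.summary prints per-layer output shapes and parameter counts (a quick check; the input width is assumed to be len(features.columns)):

# Quick architecture check on a throwaway instance
net = ClassifyModel(len(features.columns))
paddle.summary(net, input_size=(1024, len(features.columns)))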

Training

# Instantiate the model
model = ClassifyModel(int(len(features.columns)))
# Switch to training mode
model.train()
# Define the optimizer and loss
opt = paddle.optimizer.AdamW(learning_rate=0.001, parameters=model.parameters())
loss_fn = nn.CrossEntropyLoss()

EPOCHS = 10   # number of passes over the training set
for epoch in range(EPOCHS):
    for iter_id, mini_batch in enumerate(train_dataloader):
        x_train = mini_batch[0]
        y_train = mini_batch[1]
        # Forward pass
        y_pred = model(x_train)
        # Compute the loss
        loss = loss_fn(y_pred, y_train)
        avg_loss = paddle.mean(loss)
        # Log every 20 iterations
        if iter_id % 20 == 0:
            acc = paddle.metric.accuracy(y_pred, y_train)
            print("epoch: {}, iter: {}, loss is: {}, acc is: {}".format(epoch, iter_id, avg_loss.numpy(), acc.numpy()))

        # Backward pass
        avg_loss.backward()
        # Update parameters to minimize the loss
        opt.step()
        # Reset gradients
        opt.clear_grad()
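The write-up stops at training; here is a minimal inference sketch that scores the test set and writes a submission. It relies on the non-shuffling test_dataloader defined above, and the output file name is mine:

# Inference: disable dropout, take the argmax over the two logits
model.eval()
preds = []
with paddle.no_grad():
    for x_test, _ in test_dataloader:
        preds.append(paddle.argmax(model(x_test), axis=1).numpy())

res = pd.DataFrame(test['sid'])
res['label'] = np.concatenate(preds)
res.to_csv('./paddle_baseline.csv', index=False)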

Directions for Improvement

Again, for reasons of space, the two schemes below are left to the source code in the gitee repo.
Note: before using the embedding-based models, run Embedding分析.ipynb first to generate the corresponding dictionary files.

  1. Wide & Deep with embeddings
  2. DeepFM built on FM

Scores by Model Version

| Category | Model | Details | Score |
| --- | --- | --- | --- |
| ML | ML v1 | 1. Initial model; 2. features excluded from modeling: ['os', 'version', 'lan', 'sid']; 3. default-parameter LGB | 88.094 |
| ML | ML v2 | 1. Based on v1; 2. add version, simple timestamp usage; 3. compare default-parameter LGB and XGB | 88.2133 |
| ML | ML v3 | 1. Based on v2; 2. add lan; 3. difference between osv and version; 4. hand-tuned LGB parameters | 88.9487 |
| ML | ML v4 | 1. Based on v3; 2. 5-fold LGB; 3. 5-fold XGB; 4. blend | 89.0293 / 89.0253 / 89.054 |
| ML | ML v5 | 1. Based on v3; 2. add pixel-ratio, pixel-area, and resolution-ratio features; 3. 5-fold LGB; 4. 5-fold XGB; 5. blend | 89.1873 / 89.108 / 89.1713 |
| Paddle | Paddle v1 | 1. Feature engineering from ML v3; 2. simple network built on paddle | not submitted |
| Paddle | Paddle v2 | 1. Based on v1; 2. build embedding dictionaries (in Embedding分析.ipynb); 3. embedding-based hybrid base model | 88.71 |
| Paddle | Paddle v3 | 1. Based on v2; 2. add a DeepFM component, then merge | 87.816 |
| TensorFlow | TF v1 | 1. Feature engineering from ML v3; 2. simple network built on TensorFlow | not submitted |
| FM | FM v1 | 1. First simple FM model | 57.2147 |

Final Ranking and Score
(leaderboard screenshot omitted)
Source code
https://gitee.com/turkeymz/coggle/tree/master/coggle_202112/mlp
