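As the title says, this post uses Hive to explore and preprocess airline customer data and build the LRFMC model, then runs K-means on a Hadoop cluster to segment the customers into groups such as important customers to retain, important customers to develop, important customers to win back, general customers, and low-value customers. Different preferential policies can then be formulated for each group to maximize profit.
The dataset contains 62,988 records with 44 fields in total, covering basic customer information (membership card number, enrollment date, etc.), flight information (fare revenue within the observation window, average discount rate, etc.), and points information (number of points redemptions, etc.).
The data format is as follows: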
54993,2006-11-2,2008-12-24,男,6,.,北京,CN,31,2014-3-31,210,505308,0,74460,239560,234188,580717,558440.14,2014-3-31,26.25,63163.5,2,1,3.483253589,18,3352,36640,34,0.961639043,103,107,246197,259111,74460,39992,114452,111100,619760,370211,0.50952381,0.49047619,0.487220691,0.51277733,50
28065,2007-2-19,2007-8-3,男,6,,北京,CN,42,2014-3-31,140,362480,0,41288,171483,167434,293678,367777.2,2014-3-25,17.5,45310,2,7,5.194244604,17,0,12000,29,1.25231444,68,72,177358,185122,41288,12000,53288,53288,415768,238410,0.514285714,0.485714286,0.489289094,0.510708147,33
55106,2007-2-1,2007-8-30,男,6,.,北京,CN,40,2014-3-31,135,351159,0,39711,163618,164982,283712,355966.5,2014-3-21,16.875,43894.875,10,11,5.298507463,18,3491,12000,20,1.254675516,65,70,169072,182087,39711,15491,55202,51711,406361,233798,0.518518519,0.481481481,0.481467137,0.518530015,26
First, create a Hive database and, inside it, a new table to hold the data:
create database air;
use air;
create table air_data_base(
member_no string,
ffp_date string,
first_flight_date string,
gender string,
ffp_tier int,
work_city string,
work_province string,
work_country string,
age int,
load_time string,
flight_count int,
bp_sum bigint,
ep_sum_yr_1 int,
ep_sum_yr_2 bigint,
sum_yr_1 bigint,
sum_yr_2 bigint,
seg_km_sum bigint,
weighted_seg_km double,
last_flight_date string,
avg_flight_count double,
avg_bp_sum double,
begin_to_first int,
last_to_end int,
avg_interval float,
max_interval int,
add_points_sum_yr_1 bigint,
add_points_sum_yr_2 bigint,
exchange_count int,
avg_discount float,
p1y_flight_count int,
l1y_flight_count int,
p1y_bp_sum bigint,
l1y_bp_sum bigint,
ep_sum bigint,
add_point_sum bigint,
eli_add_point_sum bigint,
l1y_eli_add_points bigint,
points_sum bigint,
l1y_points_sum float,
ration_l1y_flight_count float,
ration_p1y_flight_count float,
ration_p1y_bps float,
ration_l1y_bps float,
point_notflight int
)
row format delimited fields terminated by ',';
Load the data into the newly created table:
load data local inpath '/opt/air_data_base.txt' overwrite into table air_data_base;
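After loading, a quick sanity check can confirm that the file was parsed as expected. The queries below are only a minimal sketch: they count the loaded rows (the dataset is described as having 62,988 records) and preview a few columns, including the date columns whose text format matters later when L is computed.

```sql
-- Count the loaded rows to compare against the expected size of the dataset.
select count(*) as record_count from air_data_base;

-- Preview a few rows to check that the comma-delimited fields landed in the right columns
-- and to see how the date strings are formatted.
select member_no, ffp_date, load_time, last_flight_date, last_to_end
from air_data_base
limit 5;
```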
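Based on the business logic, only the following of the 44 fields are actually needed: FFP_DATE (enrollment date), LOAD_TIME (end of the observation window), FLIGHT_COUNT (number of flights), SUM_YR_1 (fare revenue 1), SEG_KM_SUM (flight mileage), LAST_FLIGHT_DATE (date of the last flight), and AVG_DISCOUNT (average discount rate).
1. Data exploration:
Count the null records for the three fields SUM_YR_1, SEG_KM_SUM, and AVG_DISCOUNT, and save the result to the null_count table.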
create table null_count as select * from
(select count(*) as sum_yr_1_null_count from air_data_base where sum_yr_1 is null)
sum_yr_1,
(select count(*) as seg_km_sum_null from air_data_base where seg_km_sum is null)
seg_km_sum,
(select count(*) as avg_discount_null from air_data_base where avg_discount is null)
avg_discount;
Compute the minimum values of the three fields SUM_YR_1, SEG_KM_SUM, and AVG_DISCOUNT, and save them to the min_count table.
create table min_count as select
min(sum_yr_1) sum_yr_1,
min(seg_km_sum) seg_km_sum,
min(avg_discount) avg_discount
from air_data_base;
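The contents of the null_count and min_count tables are as follows:
Attribute | SUM_YR_1 | SEG_KM_SUM | AVG_DISCOUNT |
Null records | 591 | 0 | 0 |
Minimum | 0 | 368 | 0.0 |
2. Data cleaning:
The data exploration shows that the data contains missing values (591 records with a null SUM_YR_1). Such records make up under 1% of the 62,988 records, so they are simply discarded.
The cleaning rules are as follows:
1) Discard records with a null fare: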
create table sum_yr_1_not_null as
select * from air_data_base where
sum_yr_1 is not null;
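2) Building on the previous step, discard records whose average discount rate is 0:
create table avg_discount_not_0 as select *
from sum_yr_1_not_null where
avg_discount <> 0;
3) Building on the previous step, discard records where the fare is 0 while the average discount rate is nonzero and the total flight distance is greater than 0: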
create table sum_0_seg_avg_not_0 as
select * from avg_discount_not_0
where !(sum_yr_1=0 and avg_discount <> 0
and seg_km_sum > 0);
The data cleaning results are as follows:
Table | Records |
sum_yr_1_not_null | 62397 |
avg_discount_not_0 | 62386 |
sum_0_seg_avg_not_0 | 61587 |
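The record counts in the table above could be collected with a query along the following lines. This is a sketch rather than the query actually used for the post; the column aliases table_name and record_count are illustrative.

```sql
-- One row per intermediate table, so the effect of each cleaning rule can be read off directly.
select 'sum_yr_1_not_null' as table_name, count(*) as record_count from sum_yr_1_not_null
union all
select 'avg_discount_not_0' as table_name, count(*) as record_count from avg_discount_not_0
union all
select 'sum_0_seg_avg_not_0' as table_name, count(*) as record_count from sum_0_seg_avg_not_0;
```

Comparing consecutive rows shows how many records each rule removed.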
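3. Attribute selection:
To build the LRFMC model, select the 6 attributes related to its indicators from the cleaned dataset: LOAD_TIME, FFP_DATE, LAST_TO_END, FLIGHT_COUNT, SEG_KM_SUM, and AVG_DISCOUNT.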
create table flfasl as select
ffp_date,load_time,flight_count,avg_discount,seg_km_sum,last_to_end
from sum_0_seg_avg_not_0;
The first five rows of the flfasl table are as follows:
2006-11-2 2014-3-31 210 0.96163905 580717 1
2007-2-19 2014-3-31 140 1.2523144 293678 7
2007-2-1 2014-3-31 135 1.2546755 283712 11
2008-8-22 2014-3-31 23 1.0908695 281336 97
2009-4-10 2014-3-31 152 0.9706579 309928 5
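4. Data transformation:
1) Attribute construction:
Construct the five LRFMC indicators:
4.1 Constructing L:
Months from enrollment to the end of the observation window = end of the observation window - enrollment date [unit: months]
L = LOAD_TIME - FFP_DATE
4.2 Constructing R:
Months from the customer's most recent flight with the airline to the end of the observation window = time from the last flight to the end of the observation window [unit: months]
R = LAST_TO_END (stored in days; the query below divides it by 30 to get months)
4.3 Constructing F:
Number of flights the customer took with the airline within the observation window = flight count in the observation window [unit: flights]
F = FLIGHT_COUNT
4.4 Constructing M:
Mileage the customer accumulated with the airline within the observation window = total flight kilometers in the observation window [unit: km]
M = SEG_KM_SUM
4.5 Constructing C:
Average of the discount coefficients of the cabin classes the customer flew within the observation window = average discount rate [unit: none]
C = AVG_DISCOUNT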
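-- The date strings look like 2006-11-2, so they are parsed with the pattern 'yyyy-M-d'.
-- A month is approximated as 30 days.
create table lrfmc as select
round((unix_timestamp(LOAD_TIME,'yyyy-M-d')-unix_timestamp(FFP_DATE,'yyyy-M-d'))/(30*24*60*60),2) as l,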
round(last_to_end/30,2) as r,
FLIGHT_COUNT as f,
SEG_KM_SUM as m,
round(AVG_DISCOUNT,2) as c
from flfasl;
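2) Data standardization
An analysis of the distributions of the five indicators shows that their value ranges differ greatly. To remove the effect of these differing orders of magnitude, the data needs to be standardized.
Attribute | L | R | F | M | C |
Minimum | 9.13 | 0.03 | 2 | 368 | 0.11 |
Maximum | 118.67 | 24.37 | 213 | 580717 | 1.5 |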
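The post does not say which standardization method was used. As one possibility, the sketch below first computes the minimum and maximum of each indicator (the figures shown in the table above) and then applies z-score standardization, (x - mean) / standard deviation, in HiveQL. The choice of z-score, the table name lrfmc_std, and the column names zl, zr, zf, zm, and zc are assumptions made for illustration.

```sql
-- Value ranges of the five indicators.
select
  min(l) as min_l, max(l) as max_l,
  min(r) as min_r, max(r) as max_r,
  min(f) as min_f, max(f) as max_f,
  min(m) as min_m, max(m) as max_m,
  min(c) as min_c, max(c) as max_c
from lrfmc;

-- Assumed method: z-score standardization using the population standard deviation.
-- The one-row subquery s holds the mean and standard deviation of every indicator
-- and is cross-joined onto each customer row.
create table lrfmc_std as
select
  round((t.l - s.avg_l) / s.std_l, 3) as zl,
  round((t.r - s.avg_r) / s.std_r, 3) as zr,
  round((t.f - s.avg_f) / s.std_f, 3) as zf,
  round((t.m - s.avg_m) / s.std_m, 3) as zm,
  round((t.c - s.avg_c) / s.std_c, 3) as zc
from lrfmc t
cross join (
  select
    avg(l) as avg_l, stddev_pop(l) as std_l,
    avg(r) as avg_r, stddev_pop(r) as std_r,
    avg(f) as avg_f, stddev_pop(f) as std_f,
    avg(m) as avg_m, stddev_pop(m) as std_m,
    avg(c) as avg_c, stddev_pop(c) as std_c
  from lrfmc
) s;
```

After this step every indicator has mean 0 and standard deviation 1, so no single indicator such as M dominates the distance computation in K-means. If the cluster rejects Cartesian products in strict mode, the cross join may need to be allowed explicitly in the session settings.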
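5. K-means clustering to segment customers by value
Based on the business logic, customers are divided into roughly five classes. Setting k=5 and feeding the standardized data to the previously built K-means model gives the cluster centers of the five customer groups. Combining these cluster centers with the airline's business logic yields the following result:
Customer group | Rank | Meaning |
Customer group 2 (highest F and M values, second-highest C value) | 1 | Important customers to retain |
Customer group 3 (highest C value) | 2 | Important customers to develop |
Customer group 1 | 3 | Important customers to win back |
Customer group 4 | 4 | General customers |
Customer group 5 | 5 | Low-value customers |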