Using JD.com as an example, find each user's top 5 searched products per day on the search platform. The hottest!
When a user logs in to the JD.com site and searches from the search bar, list the top 5 products that user searches for each day.
- The test data file SparkSQLUserlogsHottest.log contains 10,000 records with the fields date, user ID, item, city, and device.
- Create the JavaSparkContext and the HiveContext.
- sc.textFile reads the test data file and produces a JavaRDD<String>.
- sc.broadcast defines a broadcast variable used to query and filter the test data.
- lines0.filter uses the call method of the anonymous Function interface to keep only the records that contain the broadcast variable's value, producing a JavaRDD<String>.
- Verification point:
Did the filter on the broadcast variable succeed? Print the contents of lines to check.
- lines.mapToPair uses the call method of the anonymous PairFunction interface to split each line of lines on "\t"; the three fields (date#Item#userID) are then concatenated into the key and the value is set to 1, forming K,V pairs. The result pairs is of type JavaPairRDD<String, Integer>.
- Verification point:
Were the three fields (date#Item#userID) combined into the key successfully? Do the K,V pairs look as expected?
- pairs.reduceByKey uses the call method of the anonymous Function2 interface to aggregate pairs, summing the values of entries that share the same key. The result reduceedPairs is of type JavaPairRDD<String, Integer>.
- Verification point:
Did the reduce count over the (date#Item#userID) K,V pairs succeed?
Each row of reduceedRow is then taken apart: its key is split back into the three fields date, user ID, and item, and the accumulated click count in its value is recorded as count. These four fields (date, user ID, item, count) are assembled into a JSON string and appended to the peopleInformations list, which is printed. peopleInformations is of type List<String>.
- sc.parallelize(peopleInformations) converts the driver-side list peopleInformations into the RDD peopleInformationsRDD, of type JavaRDD<String>.
- sqlContext.read().json(peopleInformationsRDD) builds peopleInformationsDF, of type DataFrame, from the RDD whose elements are JSON strings.
- Register the DataFrame as a temporary table:
peopleInformationsDF.registerTempTable("peopleInformations")
- Window function: a subquery extracts the target data, and inside it the window function row_number ranks the rows within each group. PARTITION BY specifies the grouping key of the window function; ORDER BY sorts within each group.
String sqlText = "SELECT UserID,Item, count "
+ "FROM ("
+ "SELECT "
+ "UserID,Item, count,"
+ "row_number() OVER (PARTITION BY UserID ORDER BY count DESC) rank"
+" FROM peopleInformations "
+ ") sub_peopleInformations "
+ "WHERE rank <= 3 " ;
What the SQL statement does:
  - Using the window function, query the peopleInformations table, partition by user, and rank the items each user clicked per day by their counts, producing the subquery table sub_peopleInformations.
  - Then, from the ranked subquery table sub_peopleInformations, select each user's top 3 clicked items together with their counts (a small illustration of the ranking follows after this list).
- Verification point:
Did the window-function SQL statement execute successfully?
sqlContext.sql(sqlText) runs the query, and execellentNameAgeDF.show() displays the result.
- Save the result in execellentNameAgeDF as JSON:
execellentNameAgeDF.write().format("json").save();
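To make the ranking step concrete, here is a small, purely illustrative example (placeholder user IDs and made-up counts, not actual program output) of what the inner query's row_number() OVER (PARTITION BY UserID ORDER BY count DESC) assigns before the outer WHERE rank <= 3 filter is applied:
UserID   Item     count   rank
userA    小米      42      1
userA    洗衣机    17      2
userA    显卡       9      3
userA    显示器     4      4   (removed by WHERE rank <= 3)
userB    休闲鞋    23      1
Within each UserID partition the rows are numbered 1, 2, 3, ... in descending count order, so keeping rank <= 3 yields each user's three hottest items.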
1、SparkSQLUserlogsHottestDataManually.java
package com.dt.imf.zuoye001;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Random;
import java.util.UUID;
public class SparkSQLUserlogsHottestDataManually {
public static void main(String[] args) {
long numberItems = 10000;
ganerateUserLogs(numberItems,"G:\\IMFBigDataSpark2016\\tesdata\\SparkSQLUserlogsHottest\\");
}
/**
* Generate the user log file.
* @param numberItems number of log records to generate
* @param path output directory
*/
private static void ganerateUserLogs(long numberItems, String path) {
StringBuffer userLogBuffer = new StringBuffer();
// String filename = getCountDate(null, "yyyyMMddHHmmss", -1) + ".log";
String filename = "SparkSQLUserlogsHottest.log";
// Schema: Date, UserID, Item, City, Device;
for (int i = 0; i < numberItems; i++) {
String date = getCountDate(null, "yyyy-MM-dd", -1);
// String timestamp = getCountDate(null, "yyyy-MM-dd HH:mm:ss", -1);
String userID = ganerateUserID();
String ItemID = ganerateItemID();
String CityID = ganerateCityIDs();
String Device = ganerateDevice();
/* userLogBuffer.append(date + "\t" + timestamp + "\t" + userID
+ "\t" + pageID + "\t" + channelID + "\t" + action + "\n");*/
userLogBuffer.append(date + "\t" + userID
+ "\t" + ItemID + "\t" + CityID + "\t" + Device + "\n");
}
// Write the accumulated log once, after the loop; calling WriteLog inside the loop
// would delete and rewrite the whole file on every iteration.
System.out.println(userLogBuffer);
WriteLog(path, filename, userLogBuffer + "");
}
public static void WriteLog(String path, String filename, String strUserLog)
{
FileWriter fw = null;
PrintWriter out = null;
try {
File writeFile = new File(path + filename);
if (!writeFile.exists())
writeFile.createNewFile();
else {
writeFile.delete();
}
fw = new FileWriter(writeFile, true);
out = new PrintWriter(fw);
out.print(strUserLog);
} catch (Exception e) {
e.printStackTrace();
try {
if (out != null)
out.close();
if (fw != null)
fw.close();
} catch (IOException ex) {
ex.printStackTrace();
}
} finally {
try {
if (out != null)
out.close();
if (fw != null)
fw.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
/**
* Get a date string offset from the given date (or from today when date is null);
* e.g. getCountDate(null, "yyyy-MM-dd", -1) returns yesterday's date.
* @param date base date string, or null for the current date
* @param patton date format pattern
* @param step offset in days
* @return the formatted date
*/
public static String getCountDate(String date, String patton, int step) {
SimpleDateFormat sdf = new SimpleDateFormat(patton);
Calendar cal = Calendar.getInstance();
if (date != null) {
try {
cal.setTime(sdf.parse(date));
} catch (ParseException e) {
e.printStackTrace();
}
}
cal.add(Calendar.DAY_OF_MONTH, step);
return sdf.format(cal.getTime());
}
/**
* Randomly pick a user ID from a fixed pool of 20 GUIDs.
* @return
*/
private static String ganerateUserID() {
Random random = new Random();
String[] userID = { "98415b9c-f3d4-45c3-bc7f-dce3126c6c0b", "7371b4bd-8535-461f-a5e2-c4814b2151e1",
"49852bfa-a662-4060-bf68-0dddde5feea1", "8768f089-f736-4346-a83d-e23fe05b0ecd",
"a76ff021-049c-4a1a-8372-02f9c51261d5", "8d5dc011-cbe2-4332-99cd-a1848ddfd65d",
"a2bccbdf-f0e9-489c-8513-011644cb5cf7", "89c79413-a7d1-462c-ab07-01f0835696f7",
"8d525daa-3697-455e-8f02-ab086cda7851", "c6f57c89-9871-4a92-9cbe-a2d76cd79cd0",
"19951134-97e1-4f62-8d5c-134077d1f955", "3202a063-4ebf-4f3f-a4b7-5e542307d726",
"40a0d872-45cc-46bc-b257-64ad898df281", "b891a528-4b5e-4ba7-949c-2a32cb5a75ec",
"0d46d52b-75a2-4df2-b363-43874c9503a2", "c1e4b8cf-0116-46bf-8dc9-55eb074ad315",
"6fd24ac6-1bb0-4ea6-a084-52cc22e9be42", "5f8780af-93e8-4907-9794-f8c960e87d34",
"692b1947-8b2e-45e4-8051-0319b7f0e438", "dde46f46-ff48-4763-9c50-377834ce7137" };
return userID[random.nextInt(20)];
}
/**
* Randomly pick an item name.
* @return
*/
private static String ganerateItemID() {
Random random = new Random();
//String[] ItemIDs = { "xiyiji", "binxiang", "kaiguan", "reshuiqi", "ranqizao", "dianshiji", "kongtiao" };
String[] ItemIDs = { "小米", "休闲鞋", "洗衣机", "显示器", "显卡", "洗衣液", "行车记录仪" };
return ItemIDs[random.nextInt(7)];
}
/**
* Randomly pick a city name.
* @return
*/
private static String ganerateCityIDs() {
Random random = new Random();
/*String[] CityNames = { "shanghai", "beijing", "ShenZhen", "HangZhou", "Tianjin", "Guangzhou", "Nanjing", "Changsha", "WuHan",
"jinan" }*/;
String[] CityNames = { "上海", "北京", "深圳", "广州", "纽约", "伦敦", "东京", "首尔", "莫斯科",
"巴黎" };
return CityNames[random.nextInt(10)];
}
/**
* Randomly pick a device type.
* @return
*/
private static String ganerateDevice() {
Random random = new Random();
String[] Devices = { "android", "iphone", "ipad" };
return Devices[random.nextInt(3)];
}
/**
* Generate num random user GUIDs (used once to produce the hard-coded userID pool above).
* @param num
* @return
*/
private static String ganerateUserID(int num) {
StringBuffer userid = new StringBuffer();
for (int i = 0; i < num; i++) {
UUID uuid = UUID.randomUUID();
userid.append("\"" + uuid + "\",");
}
System.out.println(userid);
return userid + "";
}
}
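For reference, each record written by ganerateUserLogs is one tab-separated line in the field order Date, UserID, ItemID, CityID, Device. An illustrative line (the date is arbitrary; the other values are drawn from the pools hard-coded above) looks like:
2016-04-19	98415b9c-f3d4-45c3-bc7f-dce3126c6c0b	小米	上海	iphone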
2、SparkSQLUserlogsHottest.java
package com.dt.imf.zuoye001;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.hive.HiveContext;
import scala.Tuple2;
public class SparkSQLUserlogsHottest {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setMaster("local").setAppName("SparkSQLUserlogsHottest");
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new HiveContext(sc);
JavaRDD<String> lines0 = sc.textFile("G:\\IMFBigDataSpark2016\\tesdata\\SparkSQLUserlogsHottest\\SparkSQLUserlogsHottest.log");
/* Schema: Date, UserID, Item, City, Device;
key: (date#Item#userID)
*/
// Define the broadcast variable
String devicebd = "iphone";
final Broadcast<String> broadcastdevice = sc.broadcast(devicebd);
// Filter the lines that contain the broadcast value
// lines.filter();
JavaRDD<String> lines = lines0.filter(new Function<String, Boolean>() {
@Override
public Boolean call(String s) throws Exception {
return s.contains(broadcastdevice.value());
}
});
// Verify the filter result
List<String> listRow000 = lines.collect();
for( String row : listRow000){
System.out.println(row);
}
// Build the key string (date#Item#userID) and the KV pair (date#Item#userID, 1)
JavaPairRDD<String, Integer> pairs = lines.mapToPair(new PairFunction<String, String, Integer>() {
private static final long serialVersionUID =1L ;
@Override
public Tuple2<String, Integer> call(String line) throws Exception {
String[] splitedLine =line.split("\t");
int one = 1;
String dataanditemanduserid = splitedLine[0] +"#"+ splitedLine[2]+"#"+String.valueOf(splitedLine[1]);
return new Tuple2<String, Integer>(dataanditemanduserid, Integer.valueOf(one));
}
});
// Verify the K,V pairs
List<Tuple2<String, Integer>> listRow = pairs.collect();
for (Tuple2<String, Integer> row : listRow) {
System.out.println(row._1);
System.out.println(row._2);
}
// reduceByKey: sum the counts for each key
JavaPairRDD<String, Integer> reduceedPairs = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
@Override
public Integer call(Integer v1, Integer v2) throws Exception {
return v1 + v2 ;
}
});
List<Tuple2<String, Integer>> reduceedRow = reduceedPairs.collect();
// Assemble the JSON records dynamically
List<String> peopleInformations = new ArrayList<String>();
for (Tuple2<String, Integer> row : reduceedRow) {
// Split the key back into its three fields
String[] rowSplitedLine =row._1.split("#");
String rowuserID = rowSplitedLine[2];
String rowitemID = rowSplitedLine[1];
String rowdateID = rowSplitedLine[0];
// Build the JSON record: Date, UserID, Username, Item, count
String jsonzip= "{\"Date\":\""+ rowdateID
+"\", \"UserID\":\""+ rowuserID
+"\", \"Username\":\""+ rowuserID
+"\", \"Item\":\""+ rowitemID
+"\", \"count\":"+ row._2 +" }";
peopleInformations.add(jsonzip);
}
// Print peopleInformations to verify
for(String row : peopleInformations){
System.out.println(row.toString());
//System.out.println(row._2);
}
// Build the DataFrame from the RDD of JSON strings
JavaRDD<String> peopleInformationsRDD = sc.parallelize(peopleInformations);
DataFrame peopleInformationsDF = sqlContext.read().json(peopleInformationsRDD);
// Register it as a temporary table
peopleInformationsDF.registerTempTable("peopleInformations");
/* Use a subquery to extract the target data; inside it, the window function row_number ranks the rows within each group:
* PARTITION BY: the grouping key of the window function;
* ORDER BY: sort within each group.
*/
String sqlText = "SELECT UserID,Item, count "
+ "FROM ("
+ "SELECT "
+ "UserID,Item, count,"
+ "row_number() OVER (PARTITION BY UserID ORDER BY count DESC) rank"
+" FROM peopleInformations "
+ ") sub_peopleInformations "
+ "WHERE rank <= 3 " ;
DataFrame execellentNameAgeDF = sqlContext.sql(sqlText);
execellentNameAgeDF.show();
execellentNameAgeDF.write().format("json").save("G:\\IMFBigDataSpark2016\\tesdata\\SparkSQLUserlogsHottest\\Result20140419_2");
}
}
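As a side note, the same top-N-per-user ranking can be expressed without the SQL string by using the DataFrame window-function API. The following sketch is not part of the original program; it assumes Spark 1.6.x (the DataFrame/HiveContext API this example targets, where window functions still require the HiveContext) and reuses the peopleInformationsDF built above:
// additional imports at the top of the file
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.row_number;
// rank the items of each UserID partition by count, descending
WindowSpec byUserDesc = Window.partitionBy("UserID").orderBy(col("count").desc());
DataFrame ranked = peopleInformationsDF.withColumn("rank", row_number().over(byUserDesc));
// keep the top 3 items per user, mirroring WHERE rank <= 3 in the SQL version
DataFrame top3 = ranked.filter(col("rank").leq(3)).select("UserID", "Item", "count");
top3.show();
Whether to use the SQL string or the Column-based API is mainly a readability choice; both express the same window aggregation.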