Spark SQL Website Search: A Comprehensive Hands-On Case Study

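Using JD.com as an example, find the five most-searched products per day on the search platform (the hottest items).

  When users log in to JD.com and search from the search bar, list the five products with the highest daily search counts.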

 

 

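Part 1: Generate test data that simulates JD.com user searches.

• The test data file SparkSQLUserlogsHottest.log holds 10,000 records, each with a date, user ID, item, city, and device.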

 

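Part 2: Use the test data to find the five most-searched products per day on the search platform.

• Create the JavaSparkContext and the HiveContext.

• Read the test data file with sc.textFile, producing the JavaRDD<String> dataset lines0.

• Define a broadcast variable with sc.broadcast; it holds the value used to query and filter the test data.

• Call lines0.filter with an anonymous Function<String, Boolean> whose call method keeps only the records that contain the broadcast value, producing the JavaRDD<String> dataset lines.

Test checkpoint:

Did filtering by the broadcast variable succeed? Print the contents of lines to verify.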
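The filter step above can be tried without a Spark installation. The sketch below is a plain-Java stand-in built on java.util.stream, not the Spark API; the class name DeviceFilterSketch, the method filterByDevice, and the sample records are hypothetical and serve only to illustrate the predicate.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java stand-in for lines0.filter(...); no Spark involved.
final class DeviceFilterSketch {

    // Keeps only the records that contain the device keyword, mirroring
    // s.contains(broadcastdevice.value()) in the Spark job.
    static List<String> filterByDevice(List<String> records, String device) {
        return records.stream()
                .filter(record -> record.contains(device))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical records: date, user ID, item, city, device (tab-separated).
        List<String> records = Arrays.asList(
                "2016-04-18\tu1\tmonitor\tBeijing\tiphone",
                "2016-04-18\tu2\tmonitor\tShanghai\tandroid",
                "2016-04-18\tu1\twasher\tBeijing\tipad");
        filterByDevice(records, "iphone").forEach(System.out::println);
    }
}
```

Note that contains matches the keyword anywhere in the record. Comparing only the device column after splitting on the tab character would be stricter, at the cost of one extra split per record.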

 

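• Call lines.mapToPair with an anonymous PairFunction<String, String, Integer> whose call method splits each record in lines on "\t", joins three fields into the key date#Item#userID, and sets the value to 1, forming a key-value pair. The result, pairs, is a JavaPairRDD<String, Integer>. This corresponds to the map phase of Hadoop MapReduce: each click by a given user on a given item on a given day counts as 1.

Test checkpoint:

Were the three fields combined into the date#Item#userID key correctly? Do the key-value pairs look as expected?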
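To check the key layout by hand, the following plain-Java sketch builds the same date#Item#userID key from one record. It uses java.util.Map.Entry in place of scala.Tuple2; the class KeyCompositionSketch, its methods, and the sample record are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Plain-Java stand-in for the mapToPair step; no Spark involved.
final class KeyCompositionSketch {

    // Record layout: date, userID, item, city, device (tab-separated).
    // The key swaps the order of item and userID: date#Item#userID.
    static String toKey(String line) {
        String[] fields = line.split("\t");
        return fields[0] + "#" + fields[2] + "#" + fields[1];
    }

    // Pairs the key with an initial count of 1.
    static Map.Entry<String, Integer> toPair(String line) {
        return new SimpleEntry<>(toKey(line), 1);
    }

    public static void main(String[] args) {
        // Hypothetical sample record.
        String line = "2016-04-18\tu1\tmonitor\tBeijing\tiphone";
        System.out.println(toPair(line)); // prints 2016-04-18#monitor#u1=1
    }
}
```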

 

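• Call pairs.reduceByKey with an anonymous Function2<Integer, Integer, Integer> whose call method sums the values of all entries in pairs that share a key. The result, reduceedPairs, is a JavaPairRDD<String, Integer>. This corresponds to the reduce phase of Hadoop MapReduce: it totals all clicks by a given user on a given item on a given day.

Test checkpoint:

Did the reduce over the date#Item#userID key-value pairs produce the expected counts?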
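The effect of the reduce step can be reproduced on a small in-memory list. This is a plain-Java sketch that assumes the keys have already been built; it uses Map.merge with Integer::sum as a local analog of reduceByKey, and the class ReduceByKeySketch and its sample keys are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Local analog of pairs.reduceByKey((v1, v2) -> v1 + v2); no Spark involved.
final class ReduceByKeySketch {

    // Each key contributes 1; equal keys are summed, as in the reduce phase.
    static Map<String, Integer> countByKey(List<String> keys) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String key : keys) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical keys in date#Item#userID form.
        List<String> keys = Arrays.asList(
                "2016-04-18#monitor#u1",
                "2016-04-18#washer#u1",
                "2016-04-18#monitor#u1");
        countByKey(keys).forEach((key, count) -> System.out.println(key + " -> " + count));
    }
}
```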

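• Collect the result into reduceedRow and split each row: the key splits on "#" into three fields (date, item, user ID), and the value, the accumulated click count, becomes count. Assemble the four fields (date, user ID, item, count) into a JSON string, add it to the list peopleInformations, and print it. peopleInformations is a List<String> whose elements are JSON strings.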
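The key-to-JSON conversion described above can be sketched in isolation. The class JsonRowSketch and the method toJson are hypothetical, and the sketch omits the Username field that the full source also writes. It builds the string by hand, as the source does, so it assumes the field values contain no characters that need JSON escaping.

```java
// Sketch of turning one reduced (key, count) row into a JSON record.
final class JsonRowSketch {

    // The key has the form date#Item#userID; the JSON lists the fields
    // in the order Date, UserID, Item, count.
    static String toJson(String key, int count) {
        String[] parts = key.split("#");
        String date = parts[0];
        String item = parts[1];
        String userId = parts[2];
        return String.format(
                "{\"Date\":\"%s\", \"UserID\":\"%s\", \"Item\":\"%s\", \"count\":%d}",
                date, userId, item, count);
    }

    public static void main(String[] args) {
        // Hypothetical reduced row.
        System.out.println(toJson("2016-04-18#monitor#u1", 3));
        // prints {"Date":"2016-04-18", "UserID":"u1", "Item":"monitor", "count":3}
    }
}
```

For values that may contain quotes or backslashes, a JSON library would be the safer choice than string concatenation.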

 

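• sc.parallelize(peopleInformations) converts the driver-side list peopleInformations into the Spark RDD peopleInformationsRDD, of type JavaRDD<String>.

• sqlContext.read().json(peopleInformationsRDD) builds peopleInformationsDF, of type DataFrame, from the RDD of JSON strings.

• Register the DataFrame as a temporary table:

peopleInformationsDF.registerTempTable("peopleInformations")

Window function: a subquery extracts the target data, and inside it the window function row_number ranks the rows within each group. PARTITION BY specifies the key the window function groups by; ORDER BY sorts the rows within each group.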

        String sqlText = "SELECT UserID,Item, count "
                + "FROM ("
                + "SELECT "
                + "UserID,Item, count,"
                + "row_number() OVER (PARTITION BY UserID ORDER BY count DESC) rank"
                + " FROM peopleInformations "
                + ") sub_peopleInformations "
                + "WHERE rank <= 3 ";

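   SQL query:

• Using the window function, query the peopleInformations table, partition the rows by user, and rank each user's items by click count, forming the subquery sub_peopleInformations. The generated test data covers a single date, so partitioning by UserID alone also ranks per day.

• Then select, from the ranked subquery sub_peopleInformations, each user's top three items and their counts. The query filters on rank <= 3; use rank <= 5 to get the top five named in the goal.

Test checkpoint:

Did the window-function SQL statement run successfully?

sqlContext.sql(sqlText) runs the query, and execellentNameAgeDF.show() displays the result.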
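To make the semantics of row_number() OVER (PARTITION BY UserID ORDER BY count DESC) concrete, the sketch below emulates it in plain Java: group by user, sort each group by count in descending order, and keep the first n rows. It is not how Spark executes the query; the class TopNPerUserSketch, the Hit type, and the sample rows are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java emulation of "top n rows per partition"; no Spark involved.
final class TopNPerUserSketch {

    // One row of the peopleInformations table, reduced to three columns.
    static final class Hit {
        final String userId;
        final String item;
        final int count;

        Hit(String userId, String item, int count) {
            this.userId = userId;
            this.item = item;
            this.count = count;
        }
    }

    // PARTITION BY userId, ORDER BY count DESC, keep rows with rank <= n.
    static Map<String, List<Hit>> topNPerUser(List<Hit> hits, int n) {
        Map<String, List<Hit>> byUser = hits.stream()
                .collect(Collectors.groupingBy(h -> h.userId));
        Map<String, List<Hit>> result = new TreeMap<>();
        for (Map.Entry<String, List<Hit>> entry : byUser.entrySet()) {
            List<Hit> ranked = entry.getValue().stream()
                    .sorted(Comparator.comparingInt((Hit h) -> h.count).reversed())
                    .limit(n)
                    .collect(Collectors.toList());
            result.put(entry.getKey(), ranked);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical rows.
        List<Hit> hits = Arrays.asList(
                new Hit("u1", "monitor", 5),
                new Hit("u1", "washer", 9),
                new Hit("u1", "shoes", 1),
                new Hit("u2", "monitor", 2));
        topNPerUser(hits, 2).forEach((user, rows) -> {
            for (Hit h : rows) {
                System.out.println(user + "\t" + h.item + "\t" + h.count);
            }
        });
    }
}
```

As with row_number in SQL, rows with equal counts receive distinct ranks; here ties keep their input order because the stream sort is stable, whereas SQL leaves the tie order unspecified.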

 

Save the contents of execellentNameAgeDF in JSON format:

execellentNameAgeDF.write().format("json").save(;

Part 3: Source code: SparkSQLUserlogsHottestDataManually.java and SparkSQLUserlogsHottest.java

 

1. SparkSQLUserlogsHottestDataManually.java

 

package com.dt.imf.zuoye001;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Random;
import java.util.UUID;

 

public class SparkSQLUserlogsHottestDataManually {

    public static void main(String[] args) {
        long numberItems = 10000;
        ganerateUserLogs(numberItems, "G:\\IMFBigDataSpark2016\\tesdata\\SparkSQLUserlogsHottest\\");
    }

 

    /**
     * Generates the user log file.
     *
     * @param numberItems number of records to generate
     * @param path        output directory, including the trailing separator
     */
    private static void ganerateUserLogs(long numberItems, String path) {
        StringBuffer userLogBuffer = new StringBuffer();
        // String filename = getCountDate(null, "yyyyMMddHHmmss", -1) + ".log";
        String filename = "SparkSQLUserlogsHottest.log";

        // Record layout (tab-separated): Date, UserID, Item, City, Device
        for (int i = 0; i < numberItems; i++) {
            String date = getCountDate(null, "yyyy-MM-dd", -1);
            // String timestamp = getCountDate(null, "yyyy-MM-dd HH:mm:ss", -1);
            String userID = ganerateUserID();
            String ItemID = ganerateItemID();
            String CityID = ganerateCityIDs();
            String Device = ganerateDevice();
            /* userLogBuffer.append(date + "\t" + timestamp + "\t" + userID
                    + "\t" + pageID + "\t" + channelID + "\t" + action + "\n"); */
            userLogBuffer.append(date + "\t" + userID
                    + "\t" + ItemID + "\t" + CityID + "\t" + Device + "\n");
        }

        // Print and write once, after the loop. Doing both on every iteration
        // re-emits the whole buffer each time and yields the same final file.
        System.out.println(userLogBuffer);
        WriteLog(path, filename, userLogBuffer.toString());
    }

 

    public static void WriteLog(String path, String filename, String strUserLog) {
        File writeFile = new File(path + filename);
        // A FileWriter opened in non-append mode creates the file if it is missing
        // and truncates it otherwise, so every call starts from an empty file.
        // try-with-resources closes the writer even when an exception is thrown.
        try (PrintWriter out = new PrintWriter(new FileWriter(writeFile, false))) {
            out.print(strUserLog);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

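    /**
     * Returns a date string shifted by a number of days.
     *
     * @param date    base date formatted with the pattern, or null to use the current date
     * @param pattern SimpleDateFormat pattern used for parsing and formatting
     * @param step    number of days to add (negative values go back in time)
     * @return the shifted date, formatted with the pattern
     */
    public static String getCountDate(String date, String pattern, int step) {
        SimpleDateFormat sdf = new SimpleDateFormat(pattern);
        Calendar cal = Calendar.getInstance();
        if (date != null) {
            try {
                cal.setTime(sdf.parse(date));
            } catch (ParseException e) {
                e.printStackTrace();
            }
        }
        cal.add(Calendar.DAY_OF_MONTH, step);
        return sdf.format(cal.getTime());
    }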

    /**
     * Picks a random user ID from a fixed pool.
     *
     * @return one of the predefined user IDs
     */
    private static String ganerateUserID() {
        Random random = new Random();

        String[] userID = { "98415b9c-f3d4-45c3-bc7f-dce3126c6c0b", "7371b4bd-8535-461f-a5e2-c4814b2151e1",
                "49852bfa-a662-4060-bf68-0dddde5feea1", "8768f089-f736-4346-a83d-e23fe05b0ecd",
                "a76ff021-049c-4a1a-8372-02f9c51261d5", "8d5dc011-cbe2-4332-99cd-a1848ddfd65d",
                "a2bccbdf-f0e9-489c-8513-011644cb5cf7", "89c79413-a7d1-462c-ab07-01f0835696f7",
                "8d525daa-3697-455e-8f02-ab086cda7851", "c6f57c89-9871-4a92-9cbe-a2d76cd79cd0",
                "19951134-97e1-4f62-8d5c-134077d1f955", "3202a063-4ebf-4f3f-a4b7-5e542307d726",
                "40a0d872-45cc-46bc-b257-64ad898df281", "b891a528-4b5e-4ba7-949c-2a32cb5a75ec",
                "0d46d52b-75a2-4df2-b363-43874c9503a2", "c1e4b8cf-0116-46bf-8dc9-55eb074ad315",
                "6fd24ac6-1bb0-4ea6-a084-52cc22e9be42", "5f8780af-93e8-4907-9794-f8c960e87d34",
                "692b1947-8b2e-45e4-8051-0319b7f0e438", "dde46f46-ff48-4763-9c50-377834ce7137" };
        // Index by the array length rather than a hard-coded 20.
        return userID[random.nextInt(userID.length)];
    }

 

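    /**
     * Picks a random item from a fixed pool.
     *
     * @return one of the predefined item names
     */
    private static String ganerateItemID() {
        Random random = new Random();

        // String[] ItemIDs = { "xiyiji", "binxiang", "kaiguan", "reshuiqi", "ranqizao", "dianshiji", "kongtiao" };
        // In English: Xiaomi, casual shoes, washing machine, monitor, graphics card,
        // laundry detergent, dash cam
        String[] ItemIDs = { "小米", "休闲鞋", "洗衣机", "显示器", "显卡", "洗衣液", "行车记录仪" };
        return ItemIDs[random.nextInt(ItemIDs.length)];
    }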

    /**
     * Picks a random city from a fixed pool.
     *
     * @return one of the predefined city names
     */
    private static String ganerateCityIDs() {
        Random random = new Random();

        // String[] CityNames = { "shanghai", "beijing", "ShenZhen", "HangZhou", "Tianjin", "Guangzhou",
        //         "Nanjing", "Changsha", "WuHan", "jinan" };
        // In English: Shanghai, Beijing, Shenzhen, Guangzhou, New York, London, Tokyo,
        // Seoul, Moscow, Paris
        String[] CityNames = { "上海", "北京", "深圳", "广州", "纽约", "伦敦", "东京", "首尔", "莫斯科",
                "巴黎" };
        return CityNames[random.nextInt(CityNames.length)];
    }

 

    /**
     * Picks a random device from a fixed pool.
     *
     * @return one of the predefined device names
     */
    private static String ganerateDevice() {
        Random random = new Random();

        String[] Devices = { "android", "iphone", "ipad" };
        return Devices[random.nextInt(Devices.length)];
    }

    /**
     * Generates random user GUIDs.
     *
     * @param num number of GUIDs to generate
     * @return the GUIDs, each wrapped in double quotes and followed by a comma
     */
    private static String ganerateUserID(int num) {
        StringBuffer userid = new StringBuffer();
        for (int i = 0; i < num; i++) {
            UUID uuid = UUID.randomUUID();
            userid.append("\"" + uuid + "\",");
        }
        System.out.println(userid);
        return userid.toString();
    }

 

}

 

 

2. SparkSQLUserlogsHottest.java

 

 

package com.dt.imf.zuoye001;

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.hive.HiveContext;

import scala.Tuple2;

 

public class SparkSQLUserlogsHottest {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("SparkSQLUserlogsHottest");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new HiveContext(sc);

        // Note: this reads SparkSQLUserlogsHottest.test.log, whereas the generator
        // above writes SparkSQLUserlogsHottest.log.
        JavaRDD<String> lines0 = sc.textFile("G:\\IMFBigDataSpark2016\\tesdata\\SparkSQLUserlogsHottest\\SparkSQLUserlogsHottest.test.log");

 

        /* Record layout: Date, UserID, Item, City, Device
           Key layout:    date#Item#userID
         */
        // Define the broadcast variable
        String devicebd = "iphone";
        final Broadcast<String> broadcastdevice = sc.broadcast(devicebd);

        // Filter: keep only the records that contain the broadcast value
        JavaRDD<String> lines = lines0.filter(new Function<String, Boolean>() {
            @Override
            public Boolean call(String s) throws Exception {
                return s.contains(broadcastdevice.value());
            }
        });

        // Verify
        List<String> listRow000 = lines.collect();
        for (String row : listRow000) {
            System.out.println(row);
        }

      

  

        // Build the key date#Item#userID and emit the pair (date#Item#userID, 1)
        JavaPairRDD<String, Integer> pairs = lines.mapToPair(new PairFunction<String, String, Integer>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Tuple2<String, Integer> call(String line) throws Exception {
                String[] splitedLine = line.split("\t");
                int one = 1;
                String dataanditemanduserid = splitedLine[0] + "#" + splitedLine[2] + "#" + splitedLine[1];
                return new Tuple2<String, Integer>(dataanditemanduserid, one);
            }
        });

        // Verify
        List<Tuple2<String, Integer>> listRow = pairs.collect();
        for (Tuple2<String, Integer> row : listRow) {
            System.out.println(row._1);
            System.out.println(row._2);
        }

        

    

        // reduceByKey: sum the counts per key
        JavaPairRDD<String, Integer> reduceedPairs = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        });

        List<Tuple2<String, Integer>> reduceedRow = reduceedPairs.collect();

       

        // Assemble the JSON strings dynamically
        List<String> peopleInformations = new ArrayList<String>();
        for (Tuple2<String, Integer> row : reduceedRow) {
            // Split the key into its three fields
            String[] rowSplitedLine = row._1.split("#");
            String rowuserID = rowSplitedLine[2];
            String rowitemID = rowSplitedLine[1];
            String rowdateID = rowSplitedLine[0];
            // Build the JSON record (source record layout: Date, UserID, Item, City, Device)
            String jsonzip = "{\"Date\":\"" + rowdateID
                    + "\", \"UserID\":\"" + rowuserID
                    + "\", \"Username\":\"" + rowuserID
                    + "\", \"Item\":\"" + rowitemID
                    + "\", \"count\":" + row._2 + " }";
            peopleInformations.add(jsonzip);
        }

        // Print peopleInformations to verify
        for (String row : peopleInformations) {
            System.out.println(row);
        }

        // Build the DataFrame from an RDD of JSON strings
        JavaRDD<String> peopleInformationsRDD = sc.parallelize(peopleInformations);
        DataFrame peopleInformationsDF = sqlContext.read().json(peopleInformationsRDD);
        // Register it as a temporary table
        peopleInformationsDF.registerTempTable("peopleInformations");

        /* A subquery extracts the target data; inside it, the window function
         * row_number ranks the rows within each group:
         * PARTITION BY: the key the window function groups by
         * ORDER BY: the sort order within each group
         */
        String sqlText = "SELECT UserID,Item, count "
                + "FROM ("
                + "SELECT "
                + "UserID,Item, count,"
                + "row_number() OVER (PARTITION BY UserID ORDER BY count DESC) rank"
                + " FROM peopleInformations "
                + ") sub_peopleInformations "
                + "WHERE rank <= 3 ";

       

       

            

        DataFrame execellentNameAgeDF = sqlContext.sql(sqlText);
        execellentNameAgeDF.show();
        execellentNameAgeDF.write().format("json").save("G:\\IMFBigDataSpark2016\\tesdata\\SparkSQLUserlogsHottest\\Result20140419_2");
    }

 

}

  

 

 
