This article documents the Spark SQL data-skew solutions used in the per-region hot-product statistics module.
Because Spark SQL ultimately executes on RDDs, anything originally written as pure SQL could also be implemented with pure RDD operations and produce exactly the same result.
Consequently, every data-skew solution previously applied in Spark Core can be carried over directly to Spark SQL.
The rest of this article walks through how each of those solutions is implemented when the job is written in pure Spark SQL.
1. Aggregate the source data (i.e. pre-aggregate upstream in the ETL): no difference between Spark Core and Spark SQL here. A sketch follows below.
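As a rough illustration (not code from the original project), the pre-aggregation can be done once at the source so later stages never have to shuffle the skewed key again. The table names click_log_raw and click_log_agg are hypothetical, and INSERT OVERWRITE assumes Hive support is enabled:

// Collapse the raw click log to one row per product_id at the ETL stage.
// Table names are hypothetical; requires a HiveContext / Hive-enabled session.
sqlContext.sql(
        "INSERT OVERWRITE TABLE click_log_agg "
      + "SELECT product_id, count(*) click_count, "
      + "concat_ws(',', collect_set(cast(city_id as string))) city_ids "
      + "FROM click_log_raw "
      + "GROUP BY product_id");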
2. Filter out the keys that cause the skew: use a WHERE clause in the SQL.
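A minimal illustration of the WHERE-clause filter, reusing the tmp_click_product_basic table that appears in the query further below; the null check and the two hot product ids are made-up examples of keys known to cause the skew:

// Filter out the skew-causing keys before any shuffle happens.
// The literal ids 9999 and 10000 are hypothetical.
Dataset<Row> filteredDF = sqlContext.sql(
        "SELECT * FROM tmp_click_product_basic "
      + "WHERE product_id IS NOT NULL "
      + "AND product_id NOT IN (9999, 10000)");
filteredDF.registerTempTable("tmp_click_product_filtered");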
3. Increase shuffle parallelism: groupByKey(1000) in Spark Core; in Spark SQL, raise spark.sql.shuffle.partitions (default 200), e.g.:
sqlContext.setConf("spark.sql.shuffle.partitions", "1000");
4. Two-stage (double) GROUP BY: rewrite the SQL so it groups twice, first on random-prefixed keys and then on the keys with the prefix removed.
/**
 * Two-stage (double) GROUP BY
 */
String _sql = "SELECT "                                   // stage 2: aggregate again on the de-prefixed key
            + "product_id_area,"
            + "sum(click_count) click_count,"             // sum the partial counts produced per prefixed key
            + "group_concat_distinct(city_infos) city_infos "
        + "FROM ( "
            + "SELECT "
                + "remove_random_prefix(product_id_area) product_id_area,"
                + "click_count,"
                + "city_infos "
            + "FROM ( "
                + "SELECT "                               // stage 1: aggregate on the random-prefixed key
                    + "product_id_area,"
                    + "count(*) click_count,"
                    + "group_concat_distinct(concat_long_string(city_id,city_name,':')) city_infos "
                + "FROM ( "
                    + "SELECT "
                        + "random_prefix(concat_long_string(product_id,area,':'), 10) product_id_area,"
                        + "city_id,"
                        + "city_name "
                    + "FROM tmp_click_product_basic "
                + ") t1 "
                + "GROUP BY product_id_area "
            + ") t2 "
        + ") t3 "
        + "GROUP BY product_id_area ";
RandomPrefixUDF.java
package graduation.java.spark.product;

import org.apache.spark.sql.api.java.UDF2;

import java.util.Random;

/**
 * FileName: RandomPrefixUDF
 * Author: hadoop
 * Email: [email protected]
 * Date: 2019-04-02 10:52
 * Description:
 *   random_prefix()
 *   prepends a random prefix in [0, num) to the given value, e.g. "7_productid:area"
 */
public class RandomPrefixUDF implements UDF2<String, Integer, String> {

    private static final long serialVersionUID = 1L;

    @Override
    public String call(String val, Integer num) throws Exception {
        Random random = new Random();
        int randNum = random.nextInt(num);
        return randNum + "_" + val;
    }
}
RemoveRandomPrefixUDF.java
package graduation.java.spark.product;

import org.apache.spark.sql.api.java.UDF1;

/**
 * FileName: RemoveRandomPrefixUDF
 * Author: hadoop
 * Email: [email protected]
 * Date: 2019-04-02 10:49
 * Description:
 *   remove_random_prefix()
 *   strips the "N_" prefix that random_prefix() added
 */
public class RemoveRandomPrefixUDF implements UDF1<String, String> {

    private static final long serialVersionUID = 1L;

    @Override
    public String call(String val) throws Exception {
        String[] valSplited = val.split("_");
        return valSplited[1];
    }
}
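For the SQL above to resolve these functions, the UDFs have to be registered with the SQLContext first. A minimal sketch of the registration: RandomPrefixUDF and RemoveRandomPrefixUDF are the classes shown above, while ConcatLongStringUDF and GroupConcatDistinctUDAF are assumed to exist elsewhere in the project (their names are inferred from the SQL and are not shown here):

// Register the UDFs referenced by the two-stage GROUP BY query.
// ConcatLongStringUDF and GroupConcatDistinctUDAF are assumed project classes.
sqlContext.udf().register("random_prefix", new RandomPrefixUDF(), DataTypes.StringType);
sqlContext.udf().register("remove_random_prefix", new RemoveRandomPrefixUDF(), DataTypes.StringType);
sqlContext.udf().register("concat_long_string", new ConcatLongStringUDF(), DataTypes.StringType);
sqlContext.udf().register("group_concat_distinct", new GroupConcatDistinctUDAF());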
5. Convert the reduce join into a map join: spark.sql.autoBroadcastJoinThreshold (default 10485760, i.e. 10 MB).
You can also load the small table as an RDD yourself and implement the map join by hand (a sketch follows after the configuration line below).
Spark SQL's built-in map join works like this: if one side of the join is a small table under 10 MB, that table is broadcast by default and the join runs map-side. The threshold can be raised, for example to 20 MB (20971520 bytes), 50 MB, or even 1 GB:
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "20971520");
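A hand-rolled version of the map join mentioned above, kept as a rough sketch: collect the small product_info table to the driver, broadcast it, and look it up inside a map operator so no shuffle (and therefore no skew) occurs. Here sc is assumed to be the JavaSparkContext and clickRDD a large RDD of click rows whose first column is product_id; both variable names are illustrative.

// Build a lookup map from the small table and broadcast it to every executor.
Map<Long, Row> productInfoMap = new HashMap<Long, Row>();
for (Row r : sqlContext.sql("select * from product_info").javaRDD().collect()) {
    productInfoMap.put(r.getLong(0), r);
}
final Broadcast<Map<Long, Row>> productInfoBroadcast = sc.broadcast(productInfoMap);

// Join map-side: each click row looks up its product locally, no shuffle involved.
JavaRDD<Row> joinedRDD = clickRDD.map(new Function<Row, Row>() {
    private static final long serialVersionUID = 1L;
    @Override
    public Row call(Row row) throws Exception {
        Row product = productInfoBroadcast.value().get(row.getLong(0));
        return RowFactory.create(row.get(0), row.get(1),
                product == null ? null : product.get(1));
    }
});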
6. Sample the skewed keys and join them separately: this is a pure Spark Core technique built from operators such as sample and filter; a rough sketch follows.
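A rough Spark Core sketch of the idea, not code from the original project: sample the left pair RDD to find the hottest key, split off the skewed rows from the remaining rows, join each part separately, and union the results. leftRDD and rightRDD are assumed JavaPairRDD<Long, String> variables keyed by product_id; the names and the 10% sampling fraction are illustrative.

// 1. Sample the left side and find the key with the highest frequency.
JavaPairRDD<Long, Long> sampledCounts = leftRDD.sample(false, 0.1)
        .mapToPair(t -> new Tuple2<Long, Long>(t._1(), 1L))
        .reduceByKey((a, b) -> a + b);
final Long skewedKey = sampledCounts
        .mapToPair(t -> new Tuple2<Long, Long>(t._2(), t._1()))
        .sortByKey(false)
        .first()._2();

// 2. Split the left RDD into the skewed key and everything else.
JavaPairRDD<Long, String> skewedLeft = leftRDD.filter(t -> t._1().equals(skewedKey));
JavaPairRDD<Long, String> commonLeft = leftRDD.filter(t -> !t._1().equals(skewedKey));

// 3. Join each part with the right side separately, then union the results.
//    The skewed part can additionally be random-prefixed/expanded before its join.
JavaPairRDD<Long, Tuple2<String, String>> joined =
        skewedLeft.join(rightRDD).union(commonLeft.join(rightRDD));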
7. Random-prefix the keys and expand the table: a combination of Spark SQL and Spark Core.
// Load the product_info table and expand it 10x by prefixing each row with 0~9.
JavaRDD<Row> rdd = sqlContext.sql("select * from product_info").javaRDD();

JavaRDD<Row> flattedRDD = rdd.flatMap(new FlatMapFunction<Row, Row>() {

    private static final long serialVersionUID = 1L;

    @Override
    public Iterator<Row> call(Row row) throws Exception {
        List<Row> list = new ArrayList<Row>();
        for (int i = 0; i < 10; i++) {
            long productid = row.getLong(0);
            String _productid = i + "_" + productid;
            Row _row = RowFactory.create(_productid, row.get(1), row.get(2));
            list.add(_row);
        }
        return list.iterator();
    }
});

StructType _schema = DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("product_id", DataTypes.StringType, true),
        DataTypes.createStructField("product_name", DataTypes.StringType, true),
        DataTypes.createStructField("product_status", DataTypes.StringType, true)));

Dataset<Row> _df = sqlContext.createDataFrame(flattedRDD, _schema);
_df.registerTempTable("tmp_product_info");

// Random-prefix the aggregated click table on the fly, join it against the
// expanded product table, then strip the prefix again in the outer SELECT.
String _sql = "SELECT "
            + "tapcc.area,"
            + "remove_random_prefix(tapcc.product_id) product_id,"
            + "tapcc.click_count,"
            + "tapcc.city_infos,"
            + "pi.product_name,"
            + "if(get_json_object(pi.extend_info,'product_status')=0,'自营商品','第三方商品') product_status "
        + "FROM ("
            + "SELECT "
                + "area,"
                + "random_prefix(product_id, 10) product_id,"
                + "click_count,"
                + "city_infos "
            + "FROM tmp_area_product_click_count "
        + ") tapcc "
        + "JOIN tmp_product_info pi ON tapcc.product_id=pi.product_id ";

Dataset<Row> ds = sqlContext.sql(_sql);
ds.registerTempTable("tmp_area_fullprod_click_count");