110. Spark Large E-Commerce Project - Top Products by Region - Spark SQL Data Skew Solutions

This article records the Spark SQL data-skew solutions used for the top-products-by-region statistics.

The seven approaches:

  1. Aggregate the source data
  2. Filter out the keys causing the skew
  3. Increase shuffle parallelism: spark.sql.shuffle.partitions
  4. Double group by
  5. Convert reduce join to map join: spark.sql.autoBroadcastJoinThreshold
  6. Sample skewed keys and join them separately
  7. Random keys with table expansion

Because Spark is built on RDDs throughout, anything Spark SQL implements in pure SQL can also be implemented with pure RDDs, with exactly the same functionality.

All of the data-skew solutions previously used in Spark Core can therefore be applied directly to Spark SQL.

What follows is how each of those solutions is implemented when working in pure Spark SQL.

1. Aggregate the source data: no difference between Spark Core and Spark SQL.
2. Filter out the keys causing the skew: use a WHERE clause in the SQL (a sketch follows the configuration line below).
3. Increase shuffle parallelism: groupByKey(1000) in Spark Core; in Spark SQL, set spark.sql.shuffle.partitions (default 200):

  sqlContext.setConf("spark.sql.shuffle.partitions", "1000"); 
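
For solution 2, the filter is simply a WHERE clause. A minimal sketch against the tmp_click_product_basic table used later in this article; the skewed key value -1 is a made-up placeholder, not from the project:

  String filteredSql = "SELECT product_id, count(*) click_count "
          + "FROM tmp_click_product_basic "
          + "WHERE product_id != -1 "   // -1: assumed skewed/invalid key
          + "GROUP BY product_id";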


4. Double group by: rewrite the SQL to aggregate in two passes.

    /**
     * Double group by: add a random prefix to the grouping key, do a first,
     * partial group by on the prefixed key, strip the prefix, then do a
     * second group by to produce the final result.
     */

String _sql =
        "SELECT "
            + "product_id_area,"
            // sum, not count: t2 already holds partial counts per prefixed key
            + "sum(click_count) click_count,"
            + "group_concat_distinct(city_infos) city_infos "
        + "FROM ( "
            + "SELECT "
                // strip the random prefix so rows for the same key regroup
                + "remove_random_prefix(product_id_area) product_id_area,"
                + "click_count,"
                + "city_infos "
            + "FROM ( "
                + "SELECT "
                    + "product_id_area,"
                    + "count(*) click_count,"
                    + "group_concat_distinct(concat_long_string(city_id,city_name,':')) city_infos "
                + "FROM ( "
                    + "SELECT "
                        // spread each key across 10 random prefixes
                        + "random_prefix(concat_long_string(product_id,area,':'), 10) product_id_area,"
                        + "city_id,"
                        + "city_name "
                    + "FROM tmp_click_product_basic "
                + ") t1 "
                + "GROUP BY product_id_area "
            + ") t2 "
        + ") t3 "
        + "GROUP BY product_id_area ";

RandomPrefixUDF.java

package graduation.java.spark.product;

import org.apache.spark.sql.api.java.UDF2;

import java.util.Random;

/**
 * FileName: RandomPrefixUDF
 * Author:   hadoop
 * Email:    [email protected]
 * Date:     2019-04-02 10:52 AM
 * Description:
 * random_prefix(): prepends a random prefix in [0, bound) to a value,
 * e.g. "foo" -> "7_foo".
 */
public class RandomPrefixUDF implements UDF2<String, Integer, String> {

    private static final long serialVersionUID = 1L;

    private final Random random = new Random();

    @Override
    public String call(String val, Integer bound) throws Exception {
        int randNum = random.nextInt(bound);
        return randNum + "_" + val;
    }
}

RemoveRandomPrefixUDF.java

package graduation.java.spark.product;

import org.apache.spark.sql.api.java.UDF1;

/**
 * FileName: RemoveRandomPrefixUDF
 * Author:   hadoop
 * Email:    [email protected]
 * Date:     2019-04-02 10:49 AM
 * Description:
 * remove_random_prefix(): strips the "N_" prefix added by random_prefix().
 */
public class RemoveRandomPrefixUDF implements UDF1<String, String> {

    private static final long serialVersionUID = 1L;

    @Override
    public String call(String val) throws Exception {
        // split on the first "_" only, in case the value itself contains "_"
        return val.split("_", 2)[1];
    }
}
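
Before the SQL above can run, these UDFs have to be registered on the SQLContext (the project's group_concat_distinct and concat_long_string functions are registered the same way). A minimal sketch, assuming the function names used in the SQL:

  sqlContext.udf().register("random_prefix", new RandomPrefixUDF(), DataTypes.StringType);
  sqlContext.udf().register("remove_random_prefix", new RemoveRandomPrefixUDF(), DataTypes.StringType);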


5. Convert reduce join to map join: spark.sql.autoBroadcastJoinThreshold (default 10485760 bytes, i.e. 10 MB).
  You can also turn the small table into an RDD yourself and implement the map join by hand (see the sketch below).
  With Spark SQL's built-in map join, if one side of the join is a table under 10 MB, that table is broadcast by default and the join runs map-side; the threshold can be raised, e.g. to 20 MB (20971520 bytes), 50 MB, or even 1 GB:

sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "20971520");
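
For the manual variant, a minimal Spark Core sketch; smallTableRDD, largeRDD and the JavaSparkContext sc are assumed names, and the imports (scala.Tuple2, org.apache.spark.broadcast.Broadcast, org.apache.spark.api.java.function.PairFunction) are omitted:

// collect the small table to the driver and broadcast it; the join then
// happens map-side on the large RDD, with no shuffle at all
Map<Long, String> smallTable =
        new HashMap<Long, String>(smallTableRDD.collectAsMap());
final Broadcast<Map<Long, String>> smallTableBroadcast = sc.broadcast(smallTable);

JavaPairRDD<Long, Tuple2<String, String>> joined = largeRDD.mapToPair(
        new PairFunction<Tuple2<Long, String>, Long, Tuple2<String, String>>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Tuple2<Long, Tuple2<String, String>> call(Tuple2<Long, String> row) {
                String smallValue = smallTableBroadcast.value().get(row._1());
                return new Tuple2<Long, Tuple2<String, String>>(
                        row._1(), new Tuple2<String, String>(row._2(), smallValue));
            }
        });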


6. Sample the skewed keys and join them separately: a pure Spark Core technique built on operators such as sample and filter, as sketched below.
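
A minimal Spark Core sketch of this approach (Java 8 lambdas for brevity); clickRDD and productRDD, both JavaPairRDD<Long, Row> keyed by product_id, are assumed names rather than project code:

// 1. estimate the hottest key from a 10% sample
Map<Long, Long> sampleCounts = clickRDD.sample(false, 0.1).countByKey();
final long skewedKey = Collections.max(sampleCounts.entrySet(),
        Map.Entry.comparingByValue()).getKey();

// 2. split both sides on that key, join the two parts separately
//    (the skewed part can additionally use the random-prefix trick),
//    then union the results
JavaPairRDD<Long, Tuple2<Row, Row>> joined =
        clickRDD.filter(t -> t._1() == skewedKey)
                .join(productRDD.filter(t -> t._1() == skewedKey))
                .union(clickRDD.filter(t -> t._1() != skewedKey)
                        .join(productRDD));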
7. Random keys with table expansion: a combination of Spark SQL and Spark Core, implemented below.

JavaRDD<Row> rdd = sqlContext.sql("select * from product_info").javaRDD();
JavaRDD<Row> flattedRDD = rdd.flatMap(new FlatMapFunction<Row, Row>() {

    private static final long serialVersionUID = 1L;

    @Override
    public Iterator<Row> call(Row row) throws Exception {
        List<Row> list = new ArrayList<Row>();

        // expand the product table 10x: one copy of each row per prefix
        // 0..9, so every randomly prefixed key on the fact side can match
        for (int i = 0; i < 10; i++) {
            long productid = row.getLong(0);
            String _productid = i + "_" + productid;

            Row _row = RowFactory.create(_productid, row.get(1), row.get(2));
            list.add(_row);
        }

        return list.iterator();
    }

});

StructType _schema = DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("product_id", DataTypes.StringType, true),
        DataTypes.createStructField("product_name", DataTypes.StringType, true),
        DataTypes.createStructField("product_status", DataTypes.StringType, true)));

Dataset<Row> _df = sqlContext.createDataFrame(flattedRDD, _schema);
_df.registerTempTable("tmp_product_info");

// product_status labels: 自营商品 = self-operated product, 第三方商品 = third-party product
String _sql =
        "SELECT "
            + "tapcc.area,"
            + "remove_random_prefix(tapcc.product_id) product_id,"
            + "tapcc.click_count,"
            + "tapcc.city_infos,"
            + "pi.product_name,"
            + "if(get_json_object(pi.extend_info,'product_status')=0,'自营商品','第三方商品') product_status "
        + "FROM ("
            + "SELECT "
                + "area,"
                + "random_prefix(product_id, 10) product_id,"
                + "click_count,"
                + "city_infos "
            + "FROM tmp_area_product_click_count "
        + ") tapcc "
        + "JOIN tmp_product_info pi ON tapcc.product_id=pi.product_id ";

    Dataset<Row> ds = sqlContext.sql(_sql);
    ds.registerTempTable("tmp_area_fullprod_click_count");
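
Note that the expansion factor in the flatMap above (10) must equal the bound passed to random_prefix(product_id, 10): the expanded dimension table has to contain every prefix the fact side can generate, otherwise prefixed keys would silently find no match in the join.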

 
