3.0 Hands-on Development of an Alibaba Cloud Big Data Project

Task objectives:

Read table 1 (hdp6_result) and table 2 (hdp6_locationresult) from the Alibaba Cloud database,
apply some simple processing to the extracted data, create a new table 3 (hdp5_resultsave) in the database,
and store the processed data into table 3 as a partitioned table.

1. Database read workflow

Step 1: Connect to the Spark master

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
        .setAppName("ResultData")
        .setMaster("local");
JavaSparkContext ctx = new JavaSparkContext(conf);

Step 2: Initialize OdpsContext and run a SELECT against the database

OdpsContext sqlContext = new OdpsContext(ctx);
// Query the first 100 rows of hdp6_result and inspect them.
String sql = "select * from hdp6_result limit 100";
DataFrame df = sqlContext.sql(sql);
df.show(100);
df.printSchema();

2. Multi-table query returning a single DataFrame

Method 1: Multi-table query in a single SQL statement, returning one DataFrame

String sql = "select result_value,lower_tolerance,upper_tolerance,"
        + "time_stamp,type_number from hdp6_result "
        + "INNER JOIN hdp6_locationresult ON "
        + "hdp6_result.location_result_uid = hdp6_locationresult.location_result_uid";
DataFrame df = sqlContext.sql(sql);
df.show();

Important note!!!

Testing showed that multi-table queries written as a single SQL statement do not work on Alibaba Cloud ODPS, possibly because multi-table queries make it harder for the platform to meter billing.

Method 2: Return a separate DataFrame for each table, then combine the DataFrames into a new one

String sql1 = "select * from hdp6_result limit 100";
DataFrame df1 = sqlContext.sql(sql1);
df1.show(100);
df1.printSchema();

String sql2 = "select * from hdp6_locationresult limit 100";
DataFrame df2 = sqlContext.sql(sql2);
df2.show(100);
df2.printSchema();

DataFrame df0 = df2.join(df1, df1.col("location_result_uid").equalTo(df2.col("location_result_uid")));
System.out.println("----3----");
df0.show(200);
df0.printSchema();
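Note that the equalTo() join above keeps two copies of location_result_uid in the result, one from each side. A minimal alternative sketch (assuming Spark 1.4 or later, where the using-column join overload is available) that keeps a single copy of the join key:

DataFrame joined = df2.join(df1, "location_result_uid");//join on the shared column name, deduplicating it
joined.show(200);
joined.printSchema();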

3. Simple data processing and generation of a new DataFrame

        //---------------------------------------extract the location_result column of the table as strings
        List<String> resultlist = df1.select("location_result").distinct().javaRDD().map(new Function<Row, String>() {
            public String call(Row row) {
                return row.getString(0);
            }
        }).collect();
        //---------------------------------------loop over the data, one location at a time
        for (String locationresult : resultlist) {
            df2 = df1.filter("location_result = '" + locationresult + "'");//rows for this specific location
            df2.cache();//persist df2, since it is reused below
            // avg/count/stddev require: import static org.apache.spark.sql.functions.*;
            DataFrame valueByType0 = df2.groupBy("tolerance", "type_number")
                    .agg(avg("result_value"), count("result_value"), stddev("result_value"))
                    .filter("count(result_value) >= " + PART_COUNT_LOW_LIMIT)
                    .orderBy("tolerance", "type_number");//per tolerance and type: average, count, standard deviation
            //-----------------------------------columns: String tolerance 0, String type 1, Double avg 2, Long count 3, Double std 4
            List<Row> valueByType = valueByType0.collectAsList();
            for (Row row : valueByType) {
                if (processed.contains(row.getString(0))) {
                    msg = "processed tolerance " + row.getString(0);
                    System.out.println(msg);
                } else if (valueByType0.filter("tolerance = '" + row.getString(0) + "'").count() == 1) {
                    msg = "only one line " + row.getString(0);
                    System.out.println(msg);
                } else {
                    msg = "---------DATA Handling--------";
                    System.out.println(msg);
                    processed.add(row.getString(0));//mark this tolerance as processed
                    List<Row> valueByTypeOneTol = valueByType0.filter("tolerance = '" + row.getString(0) + "'").collectAsList();
                    double sum = 0;
                    long count = 0;
                    double stdsum = 0;
                    for (Row row1 : valueByTypeOneTol) //compute the overall weighted average
                    {
                        sum += row1.getDouble(2) * row1.getLong(3);
                        count += row1.getLong(3);
                        stdsum += Math.pow(row1.getDouble(4), 2);
                    }
                    msg = "---------Print result of avg&std--------";
                    System.out.println(msg);
                    double avg = sum / count;
                    double std = Math.sqrt(stdsum / valueByTypeOneTol.size());
                    System.out.println("avg = " + avg + ", std = " + std);

                    msg = "---------Result Judgement--------";
                    System.out.println(msg);
                    for (Row row2 : valueByTypeOneTol) {
                        System.out.println("delta avg  = " + (avg - row2.getDouble(2)));
                        String tol[] = row2.getString(0).split("/");
                        double tolSpan = Math.abs(Double.valueOf(tol[1]) - Double.valueOf(tol[0]));
                        System.out.println("tolSpan = " + tolSpan);
                        if (tolSpan <= 0) //判断公差是否为0
                        {
                            msg = "no tolerance or tolerance equals to 0";
                            System.out.println(msg);
                        } else {
                            if ((avg - row2.getDouble(2)) >= tolSpan * AVG_TOL)//均值判断 平均值 - 单个值 大为坏
                            {
                                msg = "typeno = " + row2.getString(1) + "avg need to be handled";
                                System.out.println(msg);
                                String typeno = row2.getString(1);
                                errorMessage = "type = " + typeno + "avg is less than the mean, please check if there is any problems in this station\n";
                            } else {
                                msg = "typeno = " + row.getString(1) + "do not need to be handled";
                                System.out.println(msg);
                            }
                            System.out.println("delta std  = " + (row.getDouble(4) - std));
                            if ((row.getDouble(4) - std) >= tolSpan * STD_TOL)//方差判断 单个值 - 平均值  大为坏
                            {
                                msg = "typeno = " + row.getString(1) + "std need to be handled";
                                System.out.println(msg);
                                String typeno = row2.getString(1);
                                errorMessage = "type = " + typeno + "std is greater than the mean, please check if there is any problems in this station\n";
                            } else {
                                msg = "typeno = " + row.getString(1) + "do not need to be handled";
                                System.out.println(msg);
                            }
                        }
                    }
                }
            }
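For reference, the aggregate statistics the loop above computes can be written out explicitly (my reading of the code, not part of the original post). The overall average is the count-weighted mean of the per-type averages, and the overall std is the root mean square of the per-type stds:

\bar{x} = \frac{\sum_i n_i \bar{x}_i}{\sum_i n_i}, \qquad s \approx \sqrt{\frac{1}{k} \sum_{i=1}^{k} s_i^2}

where n_i, \bar{x}_i and s_i are the count, average and std of type i, and k is the number of types. Note that pooling standard deviations this way is only a rough approximation: it ignores the per-type counts and the spread between the type averages.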

4. Create a new table to store the processed data

//----------------------------------------------create a new partitioned table
sqlContext.sql("CREATE TABLE IF NOT EXISTS scxtest001 (uid VARCHAR(500), time_stamp STRING) PARTITIONED BY (pt STRING)");

5. Persist the processed DataFrame

Method 1: Write into the new table via a temporary view

Note: at the time of writing, only Spark 2.3.0 on Alibaba Cloud supports creating temporary views.
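A minimal sketch of this method (my own illustration, assuming Spark 2.3 where createOrReplaceTempView is available; the view name tmp_result and the partition value '20190401' are placeholders):

// Register the processed DataFrame as a temporary view
// (on Spark 1.x the equivalent call is df0.registerTempTable).
df0.createOrReplaceTempView("tmp_result");

// Copy the view's rows into a static partition of the target table.
sqlContext.sql("INSERT INTO TABLE scxtest001 PARTITION (pt = '20190401') "
        + "SELECT uid, time_stamp FROM tmp_result");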

Method 2: Write into the new table via insertInto or saveAsTable
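A minimal sketch of this method (again my own illustration; the column names match the table created in section 4, the partition value is a placeholder, and the table name scxtest001_copy is hypothetical):

import static org.apache.spark.sql.functions.lit;
import org.apache.spark.sql.SaveMode;

// insertInto() appends into an existing table; the DataFrame's columns must
// line up with the target schema, including the partition column pt.
df0.select("uid", "time_stamp")
        .withColumn("pt", lit("20190401"))//placeholder partition value
        .write()
        .mode(SaveMode.Append)
        .insertInto("scxtest001");

// saveAsTable() derives the table schema from the DataFrame instead.
df0.write().mode(SaveMode.Append).saveAsTable("scxtest001_copy");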

For code covering both of the methods above, see my personal GitLab project:
https://gitlab.com/uaes-tef3/hdev6online
