Requirement: read data out of Hive and write it into Elasticsearch (ES).
Environment: Spark 2.0.2
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

SparkConf conf = new SparkConf().setAppName("appName").setMaster("local[*]");
SparkSession spark = SparkSession
    .builder()
    .appName("Java Spark SQL basic example hive")
    .config(conf)
    .enableHiveSupport() // enable Hive support
    .getOrCreate();
The spark-hive dependency is needed in pom.xml:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.10</artifactId>
    <version>1.2.1</version>
</dependency>

or

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>2.3.0</version>
</dependency>
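Since the results will later be written to ES with EsSparkSQL, the elasticsearch-spark connector is needed as well. A sketch; the artifact and version below are assumptions for a Spark 2.x / Scala 2.11 build and should be matched to your Elasticsearch version:

<dependency>
    <groupId>org.elasticsearch</groupId>
    <!-- assumed artifact for Spark 2.x; pick the variant matching your Spark/Scala build -->
    <artifactId>elasticsearch-spark-20_2.11</artifactId>
    <!-- assumed; align with your ES cluster version -->
    <version>6.2.4</version>
</dependency>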
For Hive configuration, refer to the official documentation:
Configuration of Hive is done by placing your hive-site.xml, core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) file in conf/.
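For example, a minimal hive-site.xml that points Spark at a remote metastore might look like the following sketch; the thrift URI is an assumption and must be replaced with your metastore host:

<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <!-- assumed metastore address -->
    <value>thrift://localhost:9083</value>
  </property>
</configuration>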
SparkSession spark = ESMysqlSpark.getSession();
String querySql = "SELECT * FROM test.table";
spark.sql(querySql);
Requirement: concatenate two fields into a new string.
You can first register a function with a UDF:
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.DataTypes;

spark.udf().register("mode", new UDF2<String, Long, String>() {
    public String call(String types, Long time) throws Exception {
        // strip dots from the first field and append the timestamp
        return types.replace(".", "") + String.valueOf(time);
    }
}, DataTypes.StringType);
Requirements such as the average of a field (output as an integer), the max/min of a field, or formatting a date field can all be implemented directly in the Hive query:
String querySql = String.format("SELECT mode(ip, unix_timestamp()) id," +
" ip, " +
"cast(avg(t1) as bigint) f1, " +
"cast(avg(t2) as bigint) f2, " +
"min(t3) minSpeed, " +
"max(t4) maxSpeed, " +
"from_unixtime(unix_timestamp(),'yyyy-MM-dd HH:mm:ss') time " +
"FROM test.table " +
"where time > %s " +
"group by ip ", timeLimit);
Run the query and use ds.show() to check that the data is correct before writing it out:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.elasticsearch.spark.sql.EsSparkSQL;

Dataset<Row> ds = spark.sql(querySql);
ds.show();
EsSparkSQL.saveToEs(ds, "sha_parking/t_speedInformation");
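Note that saveToEs reads the ES connection settings from the Spark configuration. A minimal sketch; the host and port values here are assumptions to replace with your cluster's:

SparkConf conf = new SparkConf()
    .setAppName("appName")
    .setMaster("local[*]")
    .set("es.nodes", "127.0.0.1")          // assumed ES host
    .set("es.port", "9200")                // default ES REST port
    .set("es.index.auto.create", "true"); // create the index if it does not exist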
If no data can be read, check the following configuration first:
In /etc/hosts, make sure the line "127.0.0.1 hostname" is present.
Make sure $SPARK_HOME/conf/spark-env.sh has been set up and that the IP address in it is correct.
Then check the schema version recorded in the Hive metastore (stored here in MySQL), and update it if it does not match:

mysql -uroot -p
use hive;
select * from VERSION;
update VERSION set SCHEMA_VERSION='2.1.1' where VER_ID=1;
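As an alternative to editing the metastore table directly, Spark can be pointed at the right metastore version. A sketch, assuming version 1.2.1 (note that Spark 2.0.x only accepts metastore versions up to 1.2.1 for this setting):

SparkSession spark = SparkSession
    .builder()
    .enableHiveSupport()
    // assumed value; must match the actual metastore schema version
    .config("spark.sql.hive.metastore.version", "1.2.1")
    .getOrCreate();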