Spark on Hive / Hive-on-Spark Installation, with a Scala Example

Hive installation and Hive-on-Spark:

1. By default Hive uses the embedded Derby database and creates its metastore files in the current working directory (under hive/bin).

2. Derby only supports a single user at a time; MySQL supports multiple concurrent users.

Installing Hive:

1. Download apache-hive-1.2.2-bin.tar.gz and unpack it.

2. Edit $HIVE_HOME/conf/hive-env.sh and point HADOOP_HOME at the Hadoop installation:


# Licensed to the Apache Software Foundation (ASF) under one

# or more contributor license agreements. See the NOTICE file

# distributed with this work for additional information

# regarding copyright ownership. The ASF licenses this file

# to you under the Apache License, Version 2.0 (the

# "License"); you may not use this file except in compliance

# with the License. You may obtain a copy of the License at

#

# http://www.apache.org/licenses/LICENSE-2.0

#

# Unless required by applicable law or agreed to in writing, software

# distributed under the License is distributed on an "AS IS" BASIS,

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

# See the License for the specific language governing permissions and

# limitations under the License.

# Set Hive and Hadoop environment variables here. These variables can be used

# to control the execution of Hive. It should be used by admins to configure

# the Hive installation (so that users do not have to set environment variables

# or set command line parameters to get correct behavior).

#

# The hive service being invoked (CLI etc.) is available via the environment

# variable SERVICE

# Hive Client memory usage can be an issue if a large number of clients

# are running at the same time. The flags below have been useful in

# reducing memory usage:

#

# if [ "$SERVICE" = "cli" ]; then

# if [ -z "$DEBUG" ]; then

# export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xms10m -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:+UseParNewGC -XX:-UseGCOverheadLimit"

# else

# export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xms10m -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:-UseGCOverheadLimit"

# fi

# fi

# The heap size of the jvm started by hive shell script can be controlled via:

#

# export HADOOP_HEAPSIZE=1024

#

# Larger heap size may be required when running queries over large number of files or partitions.

# By default hive shell scripts use a heap size of 256 (MB). Larger heap size would also be

# appropriate for hive server.

HADOOP_HOME=/home/hadoop/hadoop-2.3.0

# Set HADOOP_HOME to point to a specific hadoop install directory

# HADOOP_HOME=${bin}/../../hadoop

# Hive Configuration Directory can be controlled by:

export HIVE_CONF_DIR=/home/hadoop/hive/conf

# Folder containing extra libraries required for hive compilation/execution can be controlled by:

# export HIVE_AUX_JARS_PATH=


3. Configure the metastore connection: add hive-site.xml under $HIVE_HOME/conf with the database URL, username, password, and JDBC driver class:


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description>username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>x</value>
    <description>password to use against metastore database</description>
  </property>
</configuration>

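For Spark SQL to pick up this MySQL-backed metastore, hive-site.xml is usually copied into $SPARK_HOME/conf as well. Alternatively, if a standalone metastore service has been started (for example with hive --service metastore), a session can point at it explicitly. A minimal Scala sketch, assuming a metastore listening on the default thrift port 9083 on localhost, which is not part of the setup described above:

import org.apache.spark.sql.SparkSession

// Point Spark SQL at an already-running Hive metastore service.
// The thrift URI and warehouse path below are assumptions, not values from this article.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("metastore-check")
  .config("hive.metastore.uris", "thrift://localhost:9083")
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("show databases").show()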

4. Set hive.log.dir in hive-log4j.properties (this controls where Hive's own log is written; for job debugging you still mostly need the MapReduce logs):


# Licensed to the Apache Software Foundation (ASF) under one

# or more contributor license agreements. See the NOTICE file

# distributed with this work for additional information

# regarding copyright ownership. The ASF licenses this file

# to you under the Apache License, Version 2.0 (the

# "License"); you may not use this file except in compliance

# with the License. You may obtain a copy of the License at

#

# http://www.apache.org/licenses/LICENSE-2.0

#

# Unless required by applicable law or agreed to in writing, software

# distributed under the License is distributed on an "AS IS" BASIS,

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

# See the License for the specific language governing permissions and

# limitations under the License.

# Define some default values that can be overridden by system properties

hive.log.threshold=ALL

hive.root.logger=INFO,DRFA

hive.log.dir=/home/hadoop/hive

hive.log.file=hive.log

# Define the root logger to the system property "hadoop.root.logger".

log4j.rootLogger=${hive.root.logger}, EventCounter

# Logging Threshold

log4j.threshold=${hive.log.threshold}

#

# Daily Rolling File Appender

#

# Use the PidDailyRollingFileAppender class instead if you want to use separate log files

# for different CLI sessions.

#

# log4j.appender.DRFA=org.apache.hadoop.hive.ql.log.PidDailyRollingFileAppender

log4j.appender.DRFA=org.apache.log4j.DailyRollingFileAppender

log4j.appender.DRFA.File=${hive.log.dir}/${hive.log.file}

# Rollover at midnight

log4j.appender.DRFA.DatePattern=.yyyy-MM-dd

# 30-day backup

#log4j.appender.DRFA.MaxBackupIndex=30

log4j.appender.DRFA.layout=org.apache.log4j.PatternLayout

# Pattern format: Date LogLevel LoggerName LogMessage

#log4j.appender.DRFA.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n

# Debugging Pattern format

log4j.appender.DRFA.layout.ConversionPattern=%d{ISO8601} %-5p [%t]: %c{2} (%F:%M(%L)) - %m%n

#

# console

# Add "console" to rootlogger above if you want to use this

#

log4j.appender.console=org.apache.log4j.ConsoleAppender

log4j.appender.console.target=System.err

log4j.appender.console.layout=org.apache.log4j.PatternLayout

log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} [%t]: %p %c{2}: %m%n

log4j.appender.console.encoding=UTF-8

#custom logging levels

#log4j.logger.xxx=DEBUG

#

# Event Counter Appender

# Sends counts of logging messages at different severity levels to Hadoop Metrics.

#

log4j.appender.EventCounter=org.apache.hadoop.hive.shims.HiveEventCounter

log4j.category.DataNucleus=ERROR,DRFA

log4j.category.Datastore=ERROR,DRFA

log4j.category.Datastore.Schema=ERROR,DRFA

log4j.category.JPOX.Datastore=ERROR,DRFA

log4j.category.JPOX.Plugin=ERROR,DRFA

log4j.category.JPOX.MetaData=ERROR,DRFA

log4j.category.JPOX.Query=ERROR,DRFA

log4j.category.JPOX.General=ERROR,DRFA

log4j.category.JPOX.Enhancer=ERROR,DRFA

# Silence useless ZK logs

log4j.logger.org.apache.zookeeper.server.NIOServerCnxn=WARN,DRFA

log4j.logger.org.apache.zookeeper.ClientCnxnSocketNIO=WARN,DRFA


MySQL configuration:

1. Allow remote connections so that every host in the cluster can reach the metastore database:

# *.*: all tables in all databases; %: connections are accepted from any IP address or host

GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'root' WITH GRANT OPTION;

FLUSH PRIVILEGES;

2. Copy mysql-connector-java-5.1.28.jar into $HIVE_HOME/lib.

Notes on Hive: a Hive query is essentially the process of reading files from the HDFS path a table is mapped to, and the other operations below work the same way.

1. Hive stores the mapping between the metadata (kept in Derby or MySQL) and the files in HDFS; as long as both exist, the data can be queried.

2. When Hive starts it automatically creates a data warehouse directory; creating a database (other than default) creates a folder under that default warehouse location.

# Create a database from bin/hive; Hive identifiers are not case-sensitive,
# so the folder sga.db is created at the default warehouse location /user/hive/warehouse in HDFS
create database Sga;

# Creates the folder /user/hive/warehouse/sga.db/stu in HDFS
create table stu(id int, name string) row format delimited fields terminated by "\t";

# Loads a local file into the table; inspect it afterwards with: ./hdfs dfs -cat /user/hive/warehouse/sga.db/stu/students
load data local inpath "/home/hadoop/datas" into table stu;
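Because the stu table is just a folder of tab-separated text files under the warehouse path, the same data can also be read straight from HDFS, which is a quick way to check the metadata-to-file mapping described above. A minimal Scala sketch, assuming a namenode at localhost:9000 (the actual address depends on your Hadoop configuration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]")
  .appName("read-warehouse-files")
  .getOrCreate()

// Read the raw files behind the stu table; the namenode host and port are assumptions.
val raw = spark.read
  .option("sep", "\t")
  .csv("hdfs://localhost:9000/user/hive/warehouse/sga.db/stu")
  .toDF("id", "name")

raw.show()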

Commonly used Hive commands:

# -e: run a SQL statement directly from the command line
./hive -e "select * from stu;"

# -f: run a SQL file and redirect the output into a txt file
./hive -f fhive.sql > fhiveresult.txt

# Use the dfs command inside bin/hive to inspect the HDFS file system
dfs -du -h /;

# Use ! to run a local Linux command
!ls -a;

# Set a parameter inside bin/hive
set hive.cli.print.header=true;

# Load a specific configuration setting when launching from the command line
./hive -hiveconf hive.cli.print.current.db=true

# Hive command history is saved at
$HOME/.hivehistory

Maven pom.xml dependencies:

<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.38</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>
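If the project is built with sbt instead of Maven, the same dependencies can be declared in build.sbt; the Scala and Spark versions below are assumptions standing in for ${spark.version}:

// build.sbt -- sketch of the equivalent dependencies for an sbt build
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "mysql"            %  "mysql-connector-java" % "5.1.38",
  "org.apache.spark" %% "spark-hive"           % "2.4.4"  // stand-in for ${spark.version}
)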

Scala code example:

package Day3

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object hivesparksql {

  def main(args: Array[String]): Unit = {
    // Set before the session is created so HDFS writes run as root
    System.setProperty("HADOOP_USER_NAME", "root")

    val spark = SparkSession.builder().master("local[*]")
      .appName(this.getClass.getSimpleName)
      .enableHiveSupport() // enable Hive support in Spark SQL
      .getOrCreate()

    val results = spark.sql("show tables")
    results.show()
    /*
    +--------+--------------+-----------+
    |database|     tableName|isTemporary|
    +--------+--------------+-----------+
    | default|          dept|      false|
    | default|           emp|      false|
    | default|           stu|      false|
    | default| stu_partition|      false|
    | default|       student|      false|
    | default| student_infos|      false|
    | default|student_scores|      false|
    +--------+--------------+-----------+
    */

    // Creates the directory /user/hive/warehouse/sxu in HDFS
    spark.sql("create table if not exists t_access(username string, mount string)")

    import spark.implicits._
    val access: Dataset[String] = spark.createDataset(
      List("jie,2019-11-01", "cal,2011-11-01")
    )

    val accessdf: DataFrame = access.map { t =>
      val lines: Array[String] = t.split(",")
      (lines(0), lines(1))
    }.toDF("username", "mount")

    // Option 1: write the custom data through a temporary view
    // accessdf.createTempView(viewName = "v_tmp")
    // spark.sql("insert into t_access select * from v_tmp")

    // Option 2: write the DataFrame into the table directly;
    // the table name can also be given in database.table form
    accessdf.write.insertInto("t_access")

    spark.stop()
  }
}
// Scala example
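A quick way to confirm that insertInto worked is to read the table back before spark.stop() is called (or from a new session with Hive support enabled); a short sketch reusing the spark session above:

// Expect the two rows written above
spark.sql("select username, mount from t_access").show()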
