Setting up Hadoop [HDFS + Hive + Spark] on a Mac and Running WordCount

Hadoop is currently distributed as the Apache community edition, CDH, HDP, and others. The Apache community edition can run into problems with inconsistent package dependencies, while CDH is used by roughly 70%–80% of companies in China, so these notes install the CDH builds throughout. The specific versions are CDH 5.7.0, Hadoop 2.6.0, and Hive 1.1.0.

Side note: if PATH ever gets misconfigured, it can be reset with export PATH=/usr/bin:/usr/sbin:/bin:/sbin:/usr/X11R6/bin

Setting up Hadoop

1. Passwordless SSH Login

Following a Mac setup guide, enable passwordless ssh localhost so that Hadoop's scripts can log in without a password.

In System Preferences -> Sharing, check Remote Login on the left and select All Users on the right.

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
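
A quick check that key-based login works (it should not prompt for a password):

ssh localhost 'echo ssh ok'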

2. Installing Hadoop

The Hadoop installation steps follow the official documentation.

Download the tar.gz, unpack it, and add the HADOOP_HOME variable:

vi ~/.bash_profile
export HADOOP_HOME=/Users/xiafengsheng/app/hadoop-2.6.0-cdh5.7.0
export PATH=$HADOOP_HOME/bin:$PATH
source ~/.bash_profile
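
A quick sanity check that the new PATH is picked up:

hadoop version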

3. Editing the Configuration Files

Following the official docs, edit core-site.xml, hadoop-env.sh, and hdfs-site.xml under $HADOOP_HOME/etc/hadoop.

mapred-site.xml and yarn-site.xml are edited later, when YARN is configured.

hadoop-env.sh

Add the JAVA_HOME variable:

export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home

core-site.xml

Add the fs.defaultFS and hadoop.tmp.dir properties. fs.defaultFS specifies the default file system and hadoop.tmp.dir specifies the temporary-file directory; the default temp directory is cleared on reboot, which is why it must be set explicitly. The port is 8020 because that is the default NameNode port in newer Hadoop releases.

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:8020</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/Users/xiafengsheng/app/tmp</value>
    </property>
</configuration>

hdfs-site.xml

Add the dfs.replication property, which controls how many copies of each block Hadoop keeps; since this is a single-node setup, set it to 1.

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

4. Starting HDFS

On first startup, format HDFS first:

bin/hdfs namenode -format

Start HDFS:

sbin/start-dfs.sh

There are two ways to check whether the startup succeeded:

  • The jps command

    xiafengshengdeMacBook-Pro:hadoop xiafengsheng$ jps
    3329 Jps
    627 RemoteMavenServer
    1525 NameNode
    1909 RunJar
    1864 NodeManager
    1690 SecondaryNameNode
    634 Launcher
    1789 ResourceManager
    1597 DataNode
  • Visit localhost:50070 in a browser
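
With the NameNode and DataNode up, a few hdfs dfs commands confirm the file system works and also stage the wc.txt file that the Hive WordCount below loads from /cihon/wc.txt. The local path to wc.txt is an assumption; point it at any small text file.

hdfs dfs -mkdir -p /cihon
hdfs dfs -put ~/Desktop/wc.txt /cihon/   # local path assumed
hdfs dfs -ls /cihon
hdfs dfs -cat /cihon/wc.txt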

5. Other Notes

How to fix the warning "Unable to load native-hadoop library for your platform… using builtin-java classes where applicable"

Since Hadoop 2, YARN has taken over resource scheduling from MapReduce.

Configuring YARN

For how YARN scheduling works, see video 1-19.

$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar contains sample jobs; a smoke test with it follows the YARN startup checks below.

Editing the Configuration Files

Following the official docs, configure mapred-site.xml and yarn-site.xml under $HADOOP_HOME/etc/hadoop/. If a file does not exist but a .template version does, copy the template and edit the copy.
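
For example, assuming the template file shipped with the tarball is present:

cd $HADOOP_HOME/etc/hadoop
cp mapred-site.xml.template mapred-site.xml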

mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

yarn-site.xml

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

Starting YARN

Start YARN:

sbin/start-yarn.sh

There are two ways to check whether the startup succeeded:

  • The jps command

    xiafengshengdeMacBook-Pro:hadoop xiafengsheng$ jps
    3329 Jps
    627 RemoteMavenServer
    1525 NameNode
    1909 RunJar
    1864 NodeManager
    1690 SecondaryNameNode
    634 Launcher
    1789 ResourceManager
    1597 DataNode
  • Visit localhost:8088 in a browser
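
With YARN running, the examples jar mentioned above gives a quick end-to-end smoke test. A minimal sketch, assuming /cihon/wc.txt was uploaded to HDFS earlier and the output directory /output/mr_wc does not exist yet:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar \
  wordcount /cihon/wc.txt /output/mr_wc
hdfs dfs -cat /output/mr_wc/part-r-*   # view the word counts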

Setting up Hive

1. Installing Hive

The Hive installation steps follow the official documentation.

After downloading and unpacking the pre-built tar.gz, add the HIVE_HOME environment variable and put it on PATH. Once that is done, go straight to step 2; there is no need to follow the official docs further.

vi ~/.bash_profile
export HIVE_HOME=/Users/xiafengsheng/app/hive-1.1.0-cdh5.7.0
export PATH=$HIVE_HOME/bin:$PATH
source ~/.bash_profile

2. Setting up the Metastore

Why bother? By default Hive stores its metadata in an embedded Derby database; to make the metadata shareable, MySQL is used to store it instead. Once the metadata is shared, Spark and other tools can also work with the Hive data. The steps are as follows.

  1. Download the MySQL JDBC driver, choosing the Platform Independent package.

  2. Copy the MySQL driver into $HIVE_HOME/lib/ (a symlink also works):

    ln -s /usr/share/java/mysql-connector-java.jar  $HIVE_HOME/lib/mysql-connector-java.jar
  3. Create a database for the metastore (this step can be skipped):

    $ mysql -u root -p
    Enter password:
    mysql> CREATE DATABASE metastore;
    mysql> USE metastore;
    mysql> SOURCE $HIVE_HOME/scripts/metastore/upgrade/mysql/hive-schema-0.14.0.mysql.sql;
  4. Create a user for the metastore (also optional; you can simply use root):

    mysql> CREATE USER 'hiveuser'@'%' IDENTIFIED BY 'hivepassword'; 
    mysql> GRANT all on *.* to 'hiveuser'@localhost identified by 'hivepassword';
    mysql>  flush privileges;
  5. Create hive-site.xml under $HIVE_HOME/conf and add the metastore configuration:

<configuration>
   <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true&amp;useSSL=false</value>
      <description>metadata is stored in a MySQL server</description>
   </property>
   <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
      <description>MySQL JDBC driver class</description>
   </property>
   <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hiveuser</value>
      <description>user name for connecting to mysql server</description>
   </property>
   <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>hivepassword</value>
      <description>password for connecting to mysql server</description>
   </property>
</configuration>
If you skipped creating the database in step 3, createDatabaseIfNotExist must be set to true. MySQL 8 enforces SSL, so append &useSSL=false (written as &amp;useSSL=false inside the XML, since & has to be escaped); this tip comes from a referenced blog post.
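
A quick check that Hive can actually reach the MySQL metastore (any trivial statement will do), plus an optional look at the metastore tables on the MySQL side:

hive -e 'show databases;'
mysql -u hiveuser -p -e 'use metastore; show tables;'   # optional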

3. WordCount

create table hive_wc(context string);

load data inpath '/cihon/wc.txt' into table hive_wc;

select word,count(1) from hive_wc lateral view explode(split(context,' ')) wc as word group by word;

4. Hive, MapReduce, and YARN

Hive SQL is compiled into MapReduce jobs, and YARN schedules those jobs.

YARN is the resource-scheduling framework; MapReduce is the computation framework.

Building Spark

Why Spark? "Run workloads 100x faster."

0. Using Spark Without Building

Spark does not have to be built from source: for single-machine testing you can use the official package that bundles Hadoop ("Pre-built for Apache Hadoop 2.7 and later") and run it directly.

./bin/spark-shell

When using IntelliJ IDEA as the Spark IDE and creating the project as a Maven project, IDEA downloads all the dependencies automatically once pom.xml is configured.

If Scala is not installed, add the Scala plugin under IDEA's Plugins settings.

1. Building Spark

1.1 Prerequisites

  1. Building Spark using Maven requires Maven 3.3.9 or newer and Java 8+

  2. export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"

1.2 Building with Maven

1. pom.xml

<properties>
    <java.version>1.8</java.version>
    <maven.compiler.source>${java.version}</maven.compiler.source>
    <maven.compiler.target>${java.version}</maven.compiler.target>
    <maven.version>3.3.9</maven.version>
    <sbt.project.name>spark</sbt.project.name>
    <hadoop.version>2.6.5</hadoop.version>
</properties>

Properties from the <properties> block are overridden on the command line with -D, e.g. -Dhadoop.version=2.4.0.

<profile>
    <id>hadoop-2.7</id>
    <properties>
        <hadoop.version>2.7.3</hadoop.version>
        <curator.version>2.7.1</curator.version>
    </properties>
</profile>

Profiles are selected with -P, e.g. -Phadoop-2.4.

Final command (plain build):

mvn -Psparkr -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver -Dhadoop.version=2.6.0-cdh5.7.0  -DskipTests clean package 

The first run failed with:

Could not find artifact org.apache.hadoop:hadoop-client:jar:2.6.0-cdh5.7.0 in central (https://repo.maven.apache.org/maven2)

This is because the default Maven repository does not have 2.6.0-cdh5.7.0; add the Cloudera repository to pom.xml:

<repository>
  <id>cloudera</id>
  <name>cloudera Repository</name>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>

Final command (building a tgz distribution):

./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes -Dhadoop.version=2.6.0-cdh5.7.0

The build hit this error:

Error in texi2dvi(file = file, pdf = TRUE, clean = clean, quiet = quiet,  : 
  pdflatex is not available

Running Sys.which("pdflatex") in R confirmed it was missing, so LaTeX needs to be installed.

When the build finishes, the artifact is spark-$VERSION-bin-$NAME.tgz.

Add the SPARK_HOME environment variable and update PATH:

export SPARK_HOME=/Users/xiafengsheng/app/spark-2.3.1-bin-custom-spark
export PATH=$SPARK_HOME/bin:$PATH

2. Spark Local Mode

Local mode needs no extra setup.

Enter local mode:

spark-shell --master local[2]

3. Spark Standalone

The standalone deployment mode is likewise 1 master + n workers.

It requires configuring the spark-env.sh file.

3.1 spark-env.sh

You can optionally configure the cluster further by setting environment variables in conf/spark-env.sh. Create this file by starting with the conf/spark-env.sh.template

# Options for the daemons used in the standalone deploy mode
SPARK_MASTER_HOST=localhost
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=2g

The relevant variables (from the Spark docs):

  • SPARK_MASTER_HOST: Bind the master to a specific hostname or IP address, for example a public one.
  • SPARK_WORKER_CORES: Total number of cores to allow Spark applications to use on the machine (default: all available cores).
  • SPARK_WORKER_MEMORY: Total amount of memory to allow Spark applications to use on the machine, e.g. 1000m, 2g (default: total memory minus 1 GB); note that each application's individual memory is configured using its spark.executor.memory property.

Once configured, start it with:

./sbin/start-all.sh

After it starts, the Spark master URL can be found in the logs.
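
For example, a sketch of pulling the master URL out of the standalone master log; the log file name is assumed to follow the default spark-<user>-org.apache.spark.deploy.master.Master-1-<host>.out pattern, and the master web UI is normally at http://localhost:8080:

grep -i "spark://" $SPARK_HOME/logs/spark-*-org.apache.spark.deploy.master.Master-*.out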

3.2 How to make it truly distributed?

Edit conf/slaves.

To launch a Spark standalone cluster with the launch scripts, first create a file called conf/slaves in the Spark directory containing the hostnames of all machines on which you want to start Spark workers, one hostname per line. If conf/slaves does not exist, the launch scripts default to a single machine (localhost), which is useful for testing. Note that the master machine accesses each worker over ssh.
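
A minimal sketch of creating conf/slaves, with hypothetical worker hostnames:

cd $SPARK_HOME
cp conf/slaves.template conf/slaves
cat >> conf/slaves <<'EOF'
worker1.example.com
worker2.example.com
EOF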

4. Connecting to the Spark Cluster

See the official docs for the full set of parameters.

spark-shell --master spark://localhost:7077

A simple WordCount:

val file = spark.sparkContext.textFile("file:///Users/xiafengsheng/Desktop/wc.txt")
// without the file:// prefix, the path resolves against Spark's default file system rather than the local one
val wordCounts = file.flatMap(line => line.split(" ").map((word => (word,1)))).reduceByKey(_ + _)
wordCounts.collect
res2: Array[(String, Int)] = Array((0,9), (entries,3), (to,3), (of,3), (Showing,3))

5. Spark + Hive

Connect to the Hive database from Spark in either cluster or local mode; SparkR is used here.

sparkR --master local[*]
sparkR.session(enableHiveSupport = TRUE)
sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")

Running the SQL statement produced an error:

spark Error creating transactional connection factory

The cause: Hive's metadata is stored in MySQL, and Spark does not ship with a MySQL connector by default. Add the MySQL driver with the --driver-class-path parameter. Is there a way to set this once instead of on every launch? See the note after the command below.

--driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath
sparkR --driver-class-path /Users/xiafengsheng/app/hive-1.1.0-cdh5.7.0/lib/mysql-connector-java-5.1.46.jar
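
As for avoiding the flag on every launch: one option (an assumption worth checking against the Spark configuration docs) is the equivalent spark.driver.extraClassPath property in conf/spark-defaults.conf:

cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf
echo "spark.driver.extraClassPath /Users/xiafengsheng/app/hive-1.1.0-cdh5.7.0/lib/mysql-connector-java-5.1.46.jar" >> $SPARK_HOME/conf/spark-defaults.conf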

WordCount against Hive from SparkR:

collect(sql("select * from hive_wc"))
collect(sql("select word,count(1) from hive_wc lateral view explode(split(context,' ')) wc as word group by word"))

     word count(1)                                                              
1       0        9
2 entries        3
3      of        3
4      to        3
5 Showing        3

6. Spark on YARN

sparkR --master yarn --deploy-mode client --driver-class-path /Users/xiafengsheng/app/hive-1.1.0-cdh5.7.0/lib/mysql-connector-java-5.1.46.jar
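
Running against YARN requires Spark to find the Hadoop/YARN configuration; if it is not already exported, set it first:

export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop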

Miscellaneous

Creating a Maven parent/child project:

https://blog.csdn.net/aiu_iu/article/details/71844568?utm_source=itdadao&utm_medium=referral
