Building a Distributed Big Data Cluster

1. Big Data Components and Concepts

Flume: a highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive volumes of log data.

Kafka: message queue

Redis: in-memory database

ZooKeeper: coordination service for managing the big data cluster

Hadoop: HDFS (distributed storage), MapReduce (distributed offline computation), YARN (resource scheduling and management)

Storage model:
    HDFS has a master/slaves architecture
	made up of one NameNode and a number of DataNodes
	A file consists of file data (data) and file metadata (metadata)
	The NameNode stores and manages the file metadata and maintains a hierarchical file directory tree
	DataNodes store the file data (blocks) and serve block reads and writes
	DataNodes keep a heartbeat with the NameNode and report the blocks they hold
	Clients exchange file metadata with the NameNode and block data with the DataNodes
Role responsibilities:
	NameNode:
		Keeps the file metadata, directory structure, and file-to-block mapping entirely in memory
		Needs a persistence mechanism to keep the metadata reliable
		Provides the replica placement policy
	DataNode:
		Stores blocks on local disk (as files)
		Keeps checksums for blocks to guarantee block integrity
		Maintains a heartbeat with the NameNode and reports its block list
Basic commands (uploaded files are stored under the DataNode storage paths on the worker nodes; the master node does not store file data):
    hadoop fs -mkdir /input # create a directory
    hadoop fs -put 1.txt /input # upload a file to the filesystem
    hadoop fs -put /root/bak/hadoopbak/profile.db/user_action/ /usr/hive/warehouse/profile.db/
    hadoop fs -rm -f /test # delete a file
    hadoop fs -rm -r /testdir # delete a directory
    hadoop fs -ls / # list a directory
    hadoop fs -get test /usr/local/hadoop # download the HDFS path test to the local directory /usr/local/hadoop
    hadoop fs -cat /input/word.txt # print a file's contents
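
    The same operations can also be scripted from Python. The sketch below is illustrative only; it assumes the third-party hdfs package (pip install hdfs) and that WebHDFS is enabled (dfs.webhdfs.enabled is set to true in hdfs-site.xml later in this guide):

# pip install hdfs  -- WebHDFS client, not part of Hadoop itself
from hdfs import InsecureClient

client = InsecureClient('http://node01:50070', user='root')  # NameNode web UI port from section 3.2

client.makedirs('/input')                       # hadoop fs -mkdir /input
client.upload('/input/1.txt', '1.txt')          # hadoop fs -put 1.txt /input
print(client.list('/input'))                    # hadoop fs -ls /input
with client.read('/input/word.txt') as reader:  # hadoop fs -cat /input/word.txt
    print(reader.read().decode('utf-8'))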

Hive: run MapReduce jobs through SQL (Hive stores its data on HDFS and uses MapReduce as its compute engine); a data warehouse (no real-time reads or writes)

MySQL: the business database; also used here to store the Hive metastore

HBase: a real-time, distributed, high-dimensional, column-oriented data store with real-time reads; a "big data" database

Sqoop: data transfer/synchronization tool

Spark: big data compute engine (Spark Core; Spark Streaming; Spark SQL; Spark MLlib)

Spark's runtime environment consists of four kinds of roles:
    Resource management layer:
        . Manager: Master (manages the resources of the whole cluster; analogous to YARN's ResourceManager)
        . Worker: Worker (manages the resources of a single server; analogous to YARN's NodeManager)
    Task execution layer:
        . Per-application manager: Driver (manages a single Spark application at runtime; analogous to YARN's ApplicationMaster)
        . Per-application workers: Executors (the set of workers that actually run a single application's tasks; analogous to the tasks running inside YARN containers)
        Note: normally the Executors do the actual work, but in special cases (local mode) the Driver both manages and does the work
    
    Running Spark:
        Executables in the bin directory:
            Interactive interpreter environments:
                ./pyspark --master local[*] # start local mode with the Python shell
                ./spark-shell # start with the Scala shell
            Submitting code to run:
                ./spark-submit --master local[*] /usr/softwaretmp/bigdata/spark/spark-2.4.0-bin-hadoop2.7/examples/src/main/python/pi.py 10 # submit an already written code file to run
            A Spark application is split into multiple jobs; each job is split into multiple stages, and each stage spawns multiple tasks (threads) that do the actual work, as the sketch below illustrates
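
        To make the job/stage/task hierarchy concrete, here is a small illustrative PySpark sketch (names and data are made up): each action triggers a job, and the shuffle introduced by reduceByKey splits that job into two stages; the running application's breakdown is visible on port 4040.

from pyspark import SparkContext, SparkConf

sc = SparkContext(conf=SparkConf().setMaster("local[*]").setAppName("jobStageDemo"))

rdd = sc.parallelize(range(100), 4)              # 4 partitions -> up to 4 parallel tasks per stage
pairs = rdd.map(lambda x: (x % 3, 1))            # narrow transformation, stays in the same stage
counts = pairs.reduceByKey(lambda a, b: a + b)   # shuffle -> stage boundary

print(counts.collect())  # action 1 -> one job with two stages (because of the shuffle)
print(rdd.count())       # action 2 -> a second job with a single stage
sc.stop()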
    
    Spark operators:
        RDD operators (a short combined PySpark sketch follows this list):
            Transformation operators:
                map: rdd.map(func) -- processes the RDD element by element and returns a new RDD
                flatMap: rdd.flatMap(func) -- applies map to the RDD and then flattens (un-nests) the results
                reduceByKey: rdd.reduceByKey(func) -- for a KV RDD, automatically groups by key and then aggregates each group's values with the supplied aggregation logic
                mapValues: rdd.mapValues(func) -- for an RDD of two-element tuples, applies map to the values only
                groupBy: rdd.groupBy(func) -- groups the RDD's data
                filter: rdd.filter(func) -- keeps only the data you want
                distinct: rdd.distinct() -- de-duplicates the RDD's data and returns a new RDD
                union: rdd.union(other_rdd) -- merges two RDDs into one and returns it
                join/leftOuterJoin/rightOuterJoin: rdd.join(other_rdd) -- performs a join/leftOuterJoin/rightOuterJoin on two KV RDDs
                intersection: rdd.intersection(other_rdd) -- returns the intersection of two RDDs as a new RDD
                glom: rdd.glom() -- nests each partition's data in its own list
                groupByKey: rdd.groupByKey() -- for a KV RDD, automatically groups by key
                sortBy: rdd.sortBy(func, ascending=False, numPartitions=1) -- sorts the RDD within each partition by the given key function (set numPartitions to 1 for a globally sorted result)
                sortByKey: rdd.sortByKey(ascending=False, numPartitions=1, keyfunc) -- for a KV RDD, sorts by key (set numPartitions to 1 for a globally sorted result)
                Partition-level operators:
                    mapPartitions: rdd.mapPartitions(func) -- like map, but func receives an entire partition's data at once
                    partitionBy: rdd.partitionBy(numPartitions, partitionFunc) -- repartitions the RDD with a custom partitioning rule (arg 1: number of partitions after repartitioning; arg 2: custom partition function)
                    repartition: rdd.repartition(N) -- repartitions the RDD (changes only the number of partitions)
            Action operators:
                countByKey: rdd.countByKey() -- counts how many times each key appears (typically for KV RDDs)
                collect: rdd.collect() -- gathers the data from all partitions into the Driver as a single list
                fold: rdd.fold(10, func) -- like reduce, aggregates with the supplied logic, but with an initial value; the initial value is applied both within each partition and across partitions
                first: rdd.first() -- returns the first element of the RDD
                takeSample: rdd.takeSample(withReplacement, num, seed) -- randomly samples the RDD's data (with or without replacement)
                takeOrdered: rdd.takeOrdered(num, key) -- sorts the RDD and returns the first N elements (arg 1: how many elements; arg 2: key function applied when ordering)
                foreach: rdd.foreach(func) -- applies the supplied logic to every element of the RDD (like map) but returns nothing
                saveAsTextFile: rdd.saveAsTextFile("/./.") -- writes the RDD's data out as text files
                Partition-level operators:
                    foreachPartition: rdd.foreachPartition(func) -- same as foreach, but processes an entire partition's data at once
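
        A short combined sketch of the operators above (local mode, made-up data; output order may vary):

from pyspark import SparkContext, SparkConf

sc = SparkContext(conf=SparkConf().setMaster("local[*]").setAppName("rddOperatorDemo"))

lines = sc.parallelize(["spark hadoop", "spark flink"], 2)

words = lines.flatMap(lambda line: line.split(" "))   # transformation: one word per element
pairs = words.map(lambda w: (w, 1))                   # transformation: KV RDD
counts = pairs.reduceByKey(lambda a, b: a + b)        # transformation: aggregate values per key

print(counts.collect())                               # action, e.g. [('spark', 2), ('hadoop', 1), ('flink', 1)]
print(counts.mapValues(lambda v: v * 10).collect())   # transform only the values
print(words.distinct().sortBy(lambda w: w, ascending=True, numPartitions=1).collect())
print(words.glom().collect())                         # elements grouped by partition
print(pairs.countByKey())                             # action: key -> count
sc.stop()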

Flink: real-time (stream) compute engine (stateful stream processing; DataStream/DataSet API; Table API; Flink SQL)

Flink cluster roles:
    JobManager (JVM process): master
    TaskManager (JVM process): slave

Anaconda: Python programming environment

Supervisor: a process management tool written in Python

2. Download Links for the Components

apache-hive-2.1.1-bin.tar.gz : 
    http://archive.apache.org/dist/hive/
hadoop-2.7.3.tar.gz : 
    http://archive.apache.org/dist/hadoop/common/
hbase-1.2.4-bin.tar.gz :
    http://archive.apache.org/dist/hbase/
jdk-8u171-linux-x64.tar.gz : 
    https://www.oracle.com/java/technologies/downloads/
mysql-connector-java-5.1.47-bin.jar : 
    https://dev.mysql.com/downloads/
scala-2.11.12.tgz : 
    https://www.scala-lang.org/download/2.11.12.html
spark-2.4.0-bin-hadoop2.7.tgz :
    http://archive.apache.org/dist/spark/
sqoop-1.4.7.bin.tar.gz : 
    http://archive.apache.org/dist/sqoop/
zookeeper-3.4.10.tar.gz : 
    https://archive.apache.org/dist/zookeeper/
apache-flume-1.8.0-bin.tar.gz :
    http://archive.apache.org/dist/flume/
redis : 
    https://redis.io/download/
Anaconda3-2021.05-Linux-x86_64.sh :
    https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/?C=M&O=D

3. Miscellaneous Notes

3.1 Accounts & Passwords

Virtual machines:
    node01 (master node):
        username: root  password: root
        username: itcast  password: !QAZ@WSX3edc
    node02 (worker node):
        username: root  password: root
        username: itcast  password: !QAZ@WSX3edc
    node03 (worker node):
        username: root  password: root
        username: itcast  password: !QAZ@WSX3edc
MySQL database:
    username: root   password: 123456

3.2 Web UI and Port Reference

HDFS NameNode RPC port (remote connections): 9000, e.g. hdfs://node01:9000/input/word.txt
NameNode web UI port: 50070
YARN web UI: http://192.168.52.66:18088
YARN per-application (job) UI: http://192.168.52.66:4040
Spark standalone cluster web UI port: 8080
Spark job monitoring UI port: 4040

4. Distributed Cluster Installation

4.1 Virtual Machine Installation

        Virtual machine location:

                D:\bigdata\Virtual Machines\node01

        Create the virtual machine:

                Create New Virtual Machine -> Custom (advanced) -> Next -> Install the operating system later -> Linux, CentOS 64-bit -> set the name and location -> processor configuration -> memory configuration -> network connection (NAT) -> Next -> Next -> Create a new virtual disk -> maximum disk size -> Next -> Finish

        Install the operating system:

                CentOS 7 installation steps:

                        CD/DVD (IDE) -> attach the ISO image -> power on this virtual machine -> Install CentOS 7 -> choose the language (Chinese) -> Installation Destination (just confirm) / Software Selection (Server with GUI) / Network & Host Name (set the hostname and enable the network) -> Begin Installation -> set the root password (root) -> reboot -> accept the license -> choose the language (Forward) -> choose the time zone (Shanghai) -> skip -> create a user and password -> start using the system

                Configure the virtual machine's network:

cd /etc/sysconfig/network-scripts/

vim ifcfg-ens33
'''
    DEVICE=ens33
    TYPE=Ethernet
    ONBOOT=yes
    NM_CONTROLLED=yes
    BOOTPROTO=static
    IPADDR=192.168.52.66
    NETMASK=255.255.255.0
    GATEWAY=192.168.52.2
    DNS1=144.144.144.144
    DNS2=192.168.52.2
'''

service network restart # restart the network service
ping www.baidu.com # test external connectivity

                Clone additional virtual machines from a snapshot (clone the two worker nodes node02 and node03):

                        Right-click the node (node01) -> Snapshot -> Snapshot Manager -> Take Snapshot -> name it (base) -> Take Snapshot

                        Right-click the node (node01) -> Manage -> Clone -> Next -> Existing snapshot (base) -> Create a linked clone -> change the virtual machine name (node02) and path -> Finish -> Close

                        Boot the clone and update its configuration:

cd /etc/sysconfig/network-scripts/

vim ifcfg-ens33
'''
    DEVICE=ens33
    TYPE=Ethernet
    ONBOOT=yes
    NM_CONTROLLED=yes
    BOOTPROTO=static
    IPADDR=192.168.52.67
    NETMASK=255.255.255.0
    GATEWAY=192.168.52.2
    DNS1=144.144.144.144
    DNS2=192.168.52.2
'''

vim /etc/hostname # change the hostname
'''
    node02
'''

cat /etc/udev/rules.d/70-persistent-ipoib.rules

rm -f /etc/udev/rules.d/70-persistent-net.rules # delete the generated persistent network (MAC address) rules file

reboot # reboot

ifconfig # check the IP address

ping www.baidu.com # test external connectivity

                Switch yum to a domestic (China) mirror:

                        Edit the repo configuration:

cd /etc/yum.repos.d/

ls

mkdir back

mv CentOS-Base.repo back/

wget -O /etc/yum.repos.d/CentOS-Base.repo https://mirrors.aliyun.com/repo/Centos-7.repo # Aliyun mirror

ls

yum clean all # clear the local package cache

yum makecache # download the repo metadata cache locally

4.2 Basic Server Configuration

        Servers: three machines, one master node (node01) and two worker nodes (node02, node03).

        Disable the firewall (all three nodes):

systemctl stop firewalld.service # stop the firewall for now

systemctl disable firewalld.service # disable the firewall permanently

systemctl status firewalld # check the firewall status

        Host name mapping (all three nodes):

vim /etc/hosts # edit /etc/hosts and add the following
'''
    192.168.52.66 node01
    192.168.52.67 node02
    192.168.52.68 node03
'''

        Passwordless SSH login (all three nodes):

                Each node to itself (all three nodes):

ssh-keygen # generate a key pair

cd /root/.ssh/

ls

cat id_rsa.pub >> authorized_keys # append the public key to the authorized_keys file

chmod 600 authorized_keys # set the permissions to 600

ssh localhost # verify the setup

exit # log out

                Master -> worker passwordless login:

                        On the worker nodes node02 and node03:

scp node01:/root/.ssh/id_rsa.pub /root # copy the master node's public key to the worker node

ls /root/ # check

cat /root/id_rsa.pub >> /root/.ssh/authorized_keys # append it to the authorized_keys file

                        Test from the master node: on the master node run

ssh node02 # passwordless login to node02

ssh node03

exit # log out

                 Worker -> master passwordless login:

                        On the master node node01:

scp node02:/root/.ssh/id_rsa.pub /root

cat /root/id_rsa.pub >> /root/.ssh/authorized_keys

scp node03:/root/.ssh/id_rsa.pub /root

cat /root/id_rsa.pub >> /root/.ssh/authorized_keys

                        Test from the worker nodes: on each worker node run

ssh node01 # passwordless login to the master node

exit # log out

        Time synchronization and a scheduled sync job:

                Set the time zone: all three nodes

tzselect

5 # Asia

9 # China

1 # Beijing

1 # confirm

                Install the NTP service: all three nodes

yum install ntp

rpm -qa | grep ntp # verify the installation

service ntpd status # check the status; do not let it start automatically yet

service ntpd stop # stop the service

systemctl enable ntpd.service # enable it at boot

                Configure the master node to synchronize with itself: master node

vim /etc/ntp.conf # edit the configuration file and add the following
'''
    server 127.127.1.0 # local clock
    fudge 127.127.1.0 stratum 10

    # also comment out the existing lines that start with server
'''

/bin/systemctl restart ntpd.service # restart the ntp service

service ntpd status # check the status

date & ssh node02 "date" & ssh node03 "date" # check the time on every node

                Manually synchronize the worker nodes' clocks with the master node: run on both worker nodes

ntpdate node01

# set up a cron job:
crontab -e
'''
    */1 * * * * /usr/sbin/ntpdate node01
'''

4.3 JDK Setup:

        # yum install java  # installs Java directly and configures the environment variables, but this installation method is not recommended

        Upload the downloaded archive to the master node ahead of time and put it in the target directory

cd /usr/softwaretmp/bigdata/

mkdir java # create the installation directory

mv jdk-8u171-linux-x64.tar.gz java/ # move the archive into the installation directory

cd java/

tar -zxvf jdk-8u171-linux-x64.tar.gz # extract the archive

scp -r jdk1.8.0_171/ node02:/usr/softwaretmp/bigdata/java  # copy the extracted Java directory from the master node to the same path on the worker nodes

scp -r jdk1.8.0_171/ node03:/usr/softwaretmp/bigdata/java

# tar -zcvf jdk1.8.0_171.tar.gz jdk1.8.0_171 # compress the jdk1.8.0_171 directory into jdk1.8.0_171.tar.gz

        Edit the environment variables on the master and worker nodes

vim /etc/profile # edit the environment variables
'''
    # set java environment
    export JAVA_HOME=/usr/softwaretmp/bigdata/java/jdk1.8.0_171
    export CLASSPATH=$JAVA_HOME/lib/
    export PATH=$PATH:$JAVA_HOME/bin
    export PATH JAVA_HOME CLASSPATH
'''

source /etc/profile # apply the environment variables

        Verify on the master and worker nodes

java -version

4.4 ZooKeeper Cluster Setup:

        Install and configure ZooKeeper on the master node: master node

cd /usr/softwaretmp/bigdata/

mkdir zookeeper # create the installation directory

mv zookeeper-3.4.10.tar.gz zookeeper/

cd zookeeper/

tar -zxvf zookeeper-3.4.10.tar.gz

                Create the data and log directories

cd /usr/softwaretmp/bigdata/zookeeper/zookeeper-3.4.10

mkdir zkdata

mkdir zkdatalog

                Configure zoo.cfg: master node

cd /usr/softwaretmp/bigdata/zookeeper/zookeeper-3.4.10/conf/ # enter the configuration directory

mv zoo_sample.cfg zoo.cfg # rename the sample configuration; ZooKeeper looks for this file as its default configuration at startup

vim zoo.cfg
'''
    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/usr/softwaretmp/bigdata/zookeeper/zookeeper-3.4.10/zkdata
    clientPort=2181
    dataLogDir=/usr/softwaretmp/bigdata/zookeeper/zookeeper-3.4.10/zkdatalog
    server.1=node01:2888:3888
    server.2=node02:2888:3888
    server.3=node03:2888:3888
'''

                Create and configure the myid file: all nodes (node01 is server 1, node02 is server 2, node03 is server 3)

cd /usr/softwaretmp/bigdata/zookeeper/zookeeper-3.4.10/zkdata

vim myid
'''
    1 # master node; matches the x in server.x in zoo.cfg
'''

                Copy the installation directory from the master node to the worker nodes:

scp -r /usr/softwaretmp/bigdata/zookeeper/zookeeper-3.4.10 root@node02:/usr/softwaretmp/bigdata/zookeeper/

scp -r /usr/softwaretmp/bigdata/zookeeper/zookeeper-3.4.10 root@node03:/usr/softwaretmp/bigdata/zookeeper/

        Adjust the configuration on the worker nodes: worker nodes

cd /usr/softwaretmp/bigdata/zookeeper/zookeeper-3.4.10/zkdata

vim myid
'''
    2 # myid for node02
    # 3 # myid for node03
'''

        Edit the profile to add the ZooKeeper environment variables: all three nodes

vim /etc/profile
'''
    # set zookeeper environment
    export ZOOKEEPER_HOME=/usr/softwaretmp/bigdata/zookeeper/zookeeper-3.4.10
    PATH=$PATH:$ZOOKEEPER_HOME/bin
'''

source /etc/profile # apply the environment variables

        Start the ZooKeeper cluster from the ZooKeeper directory: all three nodes

cd ..

bin/zkServer.sh start

bin/zkServer.sh status # check the status
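
        Besides zkServer.sh status, a quick way to confirm that the ensemble answers on port 2181 is a short Python check (illustrative only; assumes the kazoo package is installed with pip install kazoo):

from kazoo.client import KazooClient

zk = KazooClient(hosts='node01:2181,node02:2181,node03:2181')
zk.start()                   # connects to any reachable member of the ensemble
print(zk.get_children('/'))  # e.g. ['zookeeper'] on a fresh cluster
zk.stop()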

4.5 Hadoop Cluster Setup

        Install and configure Hadoop on the master node: master node

                Create the installation directory and extract the archive

cd /usr/softwaretmp/bigdata/

mkdir hadoop

mv hadoop-2.7.3.tar.gz ../hadoop/

cd ../hadoop/

tar -zxvf hadoop-2.7.3.tar.gz

                Configure the Hadoop components

                        1. Enter the Hadoop configuration directory and edit hadoop-env.sh

cd $HADOOP_HOME/etc/hadoop

echo $JAVA_HOME # shows the Java installation directory

vim hadoop-env.sh 
'''
    export JAVA_HOME=/usr/softwaretmp/bigdata/java/jdk1.8.0_171 # set the Java path
'''

                        2. Edit core-site.xml

vim core-site.xml
'''
	<configuration>
		<property>
			<name>fs.default.name</name>
			<value>hdfs://node01:9000</value>
		</property>
		<property>
			<name>hadoop.tmp.dir</name>
			<value>/usr/softwaretmp/bigdata/hadoop/hadoop-2.7.3/hdfs/tmp</value>
			<description>A base for other temporary directories.</description>
		</property>
		<property>
			<name>io.file.buffer.size</name>
			<value>131072</value>
		</property>
		<property>
			<name>fs.checkpoint.period</name>
			<value>60</value>
		</property>
		<property>
			<name>fs.checkpoint.size</name>
			<value>67108864</value>
		</property>
	</configuration>
'''

                        3. Create and edit mapred-site.xml

cp mapred-site.xml.template mapred-site.xml

vim mapred-site.xml
'''
	<configuration>
		<property>
			<name>mapreduce.framework.name</name>
			<value>yarn</value>
		</property>
	</configuration>
'''

                        4. Edit yarn-site.xml

vim yarn-site.xml
'''
	<configuration>
		<!-- ResourceManager addresses -->
		<property>
			<name>yarn.resourcemanager.address</name>
			<value>node01:18040</value>
		</property>
		<property>
			<name>yarn.resourcemanager.scheduler.address</name>
			<value>node01:18030</value>
		</property>
		<property>
			<name>yarn.resourcemanager.webapp.address</name>
			<value>node01:18088</value>
		</property>
		<property>
			<name>yarn.resourcemanager.resource-tracker.address</name>
			<value>node01:18025</value>
		</property>
		<property>
			<name>yarn.resourcemanager.admin.address</name>
			<value>node01:18141</value>
		</property>
		<!-- NodeManager shuffle service -->
		<property>
			<name>yarn.nodemanager.aux-services</name>
			<value>mapreduce_shuffle</value>
		</property>
		<property>
			<name>yarn.nodemanager.auxservices.mapreduce.shuffle.class</name>
			<value>org.apache.hadoop.mapred.ShuffleHandler</value>
		</property>
		<!-- Disable virtual-memory checking for containers -->
		<property>
			<name>yarn.nodemanager.vmem-check-enabled</name>
			<value>false</value>
			<description>Whether virtual memory limits will be enforced for containers.</description>
		</property>
	</configuration>
'''

                        5. Edit hdfs-site.xml:

vim hdfs-site.xml
'''
	<configuration>
		<!-- number of block replicas -->
		<property>
			<name>dfs.replication</name>
			<value>2</value>
		</property>
		<!-- NameNode metadata directory -->
		<property>
			<name>dfs.namenode.name.dir</name>
			<value>file:/usr/softwaretmp/bigdata/hadoop/hadoop-2.7.3/hdfs/name</value>
			<final>true</final>
		</property>
		<!-- DataNode block storage directory -->
		<property>
			<name>dfs.datanode.data.dir</name>
			<value>file:/usr/softwaretmp/bigdata/hadoop/hadoop-2.7.3/hdfs/data</value>
			<final>true</final>
		</property>
		<property>
			<name>dfs.namenode.http-address</name>
			<value>node01:50070</value>
		</property>
		<property>
			<name>dfs.namenode.secondary.http-address</name>
			<value>node01:9001</value>
		</property>
		<property>
			<name>dfs.webhdfs.enabled</name>
			<value>true</value>
		</property>
		<property>
			<name>dfs.permissions</name>
			<value>false</value>
		</property>
	</configuration>
'''

                        6. Edit the slaves file to add the worker nodes node02 and node03, and the master file to add the master node node01

vim slaves
'''
    node02
    node03
'''

vim master
'''
    node01
'''

                Distribute the Hadoop directory from the master node to the two worker nodes node02 and node03:

scp -r /usr/softwaretmp/bigdata/hadoop/hadoop-2.7.3 root@node02:/usr/softwaretmp/bigdata/hadoop/

scp -r /usr/softwaretmp/bigdata/hadoop/hadoop-2.7.3 root@node03:/usr/softwaretmp/bigdata/hadoop/

        Add the environment variables: all three nodes

vim /etc/profile
'''
    # set HADOOP environment
    export HADOOP_HOME=/usr/softwaretmp/bigdata/hadoop/hadoop-2.7.3
    export CLASSPATH=$CLASSPATH:$HADOOP_HOME/lib
    export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
'''

source /etc/profile

        Format HDFS on the master and start Hadoop: master node (node01 only)

hadoop namenode -format # format the NameNode

        Start the Hadoop cluster from the master node: run the command only on node01; it starts the daemons on the worker nodes as well

cd /usr/softwaretmp/bigdata/hadoop/hadoop-2.7.3 # go back to the Hadoop directory

sbin/start-all.sh # start all services from the master node

jps # check the running processes

Open http://192.168.52.66:50070 in a browser to verify

4.6 HBase Cluster Setup:

        Install and configure HBase on the master node: master node

                Create the installation directory and extract the archive

cd /usr/softwaretmp/bigdata/

mkdir hbase/

tar -zxvf hbase-1.2.4-bin.tar.gz -C hbase/ # extract into the hbase directory

rm -rf hbase-1.2.4-bin.tar.gz # delete the archive

                 Enter the HBase conf directory and edit hbase-env.sh to add the following:

cd /usr/softwaretmp/bigdata/hbase/hbase-1.2.4/conf

vim hbase-env.sh
'''
	export HBASE_MANAGES_ZK=false # do not use HBase's bundled ZooKeeper; use the external cluster
	export JAVA_HOME=/usr/softwaretmp/bigdata/java/jdk1.8.0_171
	export HBASE_CLASSPATH=/usr/softwaretmp/bigdata/hadoop/hadoop-2.7.3/etc/hadoop
'''

                Configure conf/hbase-site.xml

vim hbase-site.xml
'''
	<configuration>
		<property>
			<name>hbase.rootdir</name>
			<value>hdfs://node01:9000/hbase</value>
		</property>
		<property>
			<name>hbase.cluster.distributed</name>
			<value>true</value>
		</property>
		<property>
			<name>hbase.master</name>
			<value>hdfs://node01:6000</value>
		</property>
		<property>
			<name>hbase.zookeeper.quorum</name>
			<value>node01,node02,node03</value>
		</property>
		<property>
			<name>hbase.zookeeper.property.dataDir</name>
			<value>/usr/softwaretmp/bigdata/zookeeper/zookeeper-3.4.10</value>
		</property>
	</configuration>
'''

                Configure conf/regionservers

vim regionservers
'''
    node02
    node03
'''

                Copy the Hadoop configuration files into HBase's conf directory

cd /usr/softwaretmp/bigdata/hbase/hbase-1.2.4/conf

cp /usr/softwaretmp/bigdata/hadoop/hadoop-2.7.3/etc/hadoop/hdfs-site.xml .

cp /usr/softwaretmp/bigdata/hadoop/hadoop-2.7.3/etc/hadoop/core-site.xml .

                Distribute HBase from the master node to the worker nodes

scp -r /usr/softwaretmp/bigdata/hbase/hbase-1.2.4 root@node02:/usr/softwaretmp/bigdata/hbase/

scp -r /usr/softwaretmp/bigdata/hbase/hbase-1.2.4 root@node03:/usr/softwaretmp/bigdata/hbase/

        Configure the environment variables: all three nodes

vim /etc/profile
'''
    # set hbase environment
    export HBASE_HOME=/usr/softwaretmp/bigdata/hbase/hbase-1.2.4
    export PATH=$PATH:$HBASE_HOME/bin
'''

source /etc/profile

        Run and test: execute on the master node node01 (make sure Hadoop and ZooKeeper are already running)

bin/start-hbase.sh

jps

Open ip:16010 in a browser
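
        Optionally, the cluster can also be checked from Python; the sketch below is illustrative only and assumes the HBase Thrift service has been started (hbase thrift start on node01) and the happybase package is installed (pip install happybase):

import happybase

conn = happybase.Connection('node01')  # Thrift server, default port 9090
print(conn.tables())                   # lists existing HBase tables (empty on a fresh cluster)
conn.close()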

4.7 Hive Data Warehouse Setup

        Install MySQL on the worker node node03:

                1. Configure the local repo and install the MySQL server

cd /usr/local/src/

wget http://repo.mysql.com/mysql57-community-release-el7-8.noarch.rpm

yum -y localinstall mysql57-community-release-el7-8.noarch.rpm

yum -y install mysql-community-server

yum -y install mysql-server

If you get a GPG key error, fix it like this:
	vim /etc/yum.repos.d/mysql-community.repo
	# set gpgcheck=0 for the version being installed (the default is 1)
	'''
		[mysql57-community]
		name=MySQL 5.7 Community Server
		baseurl=http://repo.mysql.com/yum/mysql-5.7-community/el/7/$basearch/
		enabled=1
		gpgcheck=0
		gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-mysql
	'''

                2. Start the service

systemctl daemon-reload # reload all modified configuration files

systemctl start mysqld # start the service

systemctl enable mysqld # enable it at boot

                3. Retrieve the randomly generated initial password and use it to log in to MySQL

grep "temporary password" /var/log/mysqld.log # get the initial password

mysql -u root -p # log in to MySQL

                4. Relax the MySQL password policy

set global validate_password_policy=0; # set the password strength requirement to LOW

set global validate_password_length=4; # set the minimum password length

alter user 'root'@'localhost' identified by '123456'; # change the local root password

\q # quit

                5. Enable remote login

mysql -u root -p123456 # log in with the new password

create user 'root'@'%' identified by '123456'; # create a remote user

grant all privileges on *.* to 'root'@'%' with grant option; # allow remote connections

flush privileges; # reload the privilege tables

chkconfig mysqld on # register the MySQL service to start at boot

                6. Create the test database

create database test;

show databases;
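
                Once a Python environment is available on another node (e.g. node01), remote access can be verified with a short sketch (illustrative only; assumes pymysql is installed with pip install pymysql):

import pymysql

conn = pymysql.connect(host='node03', port=3306, user='root', password='123456', database='test')
with conn.cursor() as cur:
    cur.execute('SHOW DATABASES')
    print(cur.fetchall())
conn.close()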

        Create the working directory on the master node (node01) and extract the archive. node01 acts as the Hive client and node02 as the Hive server, so both node01 and node02 need Hive

cd /usr/softwaretmp/bigdata/

mkdir hive/

tar -zxvf apache-hive-2.1.1-bin.tar.gz -C hive/

scp -r /usr/softwaretmp/bigdata/hive/apache-hive-2.1.1-bin root@node02:/usr/softwaretmp/bigdata/hive/ # copy Hive from node01 to node02

        Edit the profile to add the Hive environment variables: node01 and node02

vim /etc/profile
'''
    # set hive environment
    export HIVE_HOME=/usr/softwaretmp/bigdata/hive/apache-hive-2.1.1-bin
    export PATH=$PATH:$HIVE_HOME/bin
'''

source /etc/profile

        Resolve version conflicts and jar dependencies

                The client talks to Hadoop, so copy the newer jline jar from Hive's lib directory into Hadoop's yarn lib directory: run on node01

cp /usr/softwaretmp/bigdata/hive/apache-hive-2.1.1-bin/lib/jline-2.12.jar /usr/softwaretmp/bigdata/hadoop/hadoop-2.7.3/share/hadoop/yarn/lib/

                The server talks to MySQL, so copy the MySQL JDBC driver into Hive's lib directory: run on node02

cd /usr/softwaretmp/bigdata/hive/apache-hive-2.1.1-bin/lib
# already downloaded; copy it in directly over scp
# wget http://10.10.88.2:8000/bigdata/bigdata_tar/mysql-connector-java-5.1.47-bin.jar

        Configure Hive on node02 as the server side: node02

cd $HIVE_HOME/conf

cp hive-env.sh.template hive-env.sh

vim hive-env.sh
'''
	HADOOP_HOME=/usr/softwaretmp/bigdata/hadoop/hadoop-2.7.3
	export HIVE_CONF_DIR=/usr/softwaretmp/bigdata/hive/apache-hive-2.1.1-bin/conf
'''

vim hive-site.xml
'''
	<configuration>
		<!-- warehouse directory on HDFS -->
		<property>
			<name>hive.metastore.warehouse.dir</name>
			<value>/usr/softwaretmp/bigdata/hive_remote/warehouse</value>
		</property>
		<!-- metastore database connection URL -->
		<property>
			<name>javax.jdo.option.ConnectionURL</name>
			<value>jdbc:mysql://node03:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
		</property>
		<!-- JDBC driver -->
		<property>
			<name>javax.jdo.option.ConnectionDriverName</name>
			<value>com.mysql.jdbc.Driver</value>
		</property>
		<!-- metastore database username -->
		<property>
			<name>javax.jdo.option.ConnectionUserName</name>
			<value>root</value>
		</property>
		<!-- metastore database password -->
		<property>
			<name>javax.jdo.option.ConnectionPassword</name>
			<value>123456</value>
		</property>
		<property>
			<name>hive.metastore.schema.verification</name>
			<value>false</value>
		</property>
		<property>
			<name>datanucleus.schema.autoCreateAll</name>
			<value>true</value>
		</property>
	</configuration>
'''

        Configure Hive on node01 as the client side: master node node01

cd /usr/softwaretmp/bigdata/hive/apache-hive-2.1.1-bin/conf/
vim hive-site.xml
'''
	<configuration>
		<!-- warehouse directory on HDFS -->
		<property>
			<name>hive.metastore.warehouse.dir</name>
			<value>/usr/softwaretmp/bigdata/hive_remote/warehouse</value>
		</property>
		<!-- use a remote metastore -->
		<property>
			<name>hive.metastore.local</name>
			<value>false</value>
		</property>
		<!-- metastore service address -->
		<property>
			<name>hive.metastore.uris</name>
			<value>thrift://node02:9083</value>
		</property>
	</configuration>
'''
cp hive-env.sh.template hive-env.sh
vim hive-env.sh
'''
	HADOOP_HOME=/usr/softwaretmp/bigdata/hadoop/hadoop-2.7.3
	export HIVE_CONF_DIR=/usr/softwaretmp/bigdata/hive/apache-hive-2.1.1-bin/conf
'''

        Start Hive

cd /usr/softwaretmp/bigdata/hive/apache-hive-2.1.1-bin

                1. Start the Hive metastore service: node02

bin/hive --service metastore

                2. Start the Hive client: node01

bin/hive

                3. Verify that Hive started successfully

show databases;

create database hive_db;

exit; # quit

                4. Check the processes on the master

jps

4.8 Sqoop Installation: only needs to be installed on the master node

cd /usr/softwaretmp/bigdata/sqoop

tar -zxvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz

        Configure the environment variables:

vim /etc/profile
'''
    # set sqoop environment
    export SQOOP_HOME=/usr/softwaretmp/bigdata/sqoop/sqoop-1.4.7.bin__hadoop-2.6.0/
    export PATH=$SQOOP_HOME/bin:$PATH
'''

source /etc/profile

        Edit the configuration file:

cd /usr/softwaretmp/bigdata/sqoop/sqoop-1.4.7.bin__hadoop-2.6.0/conf

cp sqoop-env-template.sh sqoop-env.sh

vim sqoop-env.sh
'''
	export HADOOP_COMMON_HOME=/usr/softwaretmp/bigdata/hadoop/hadoop-2.7.3/
	export HADOOP_MAPRED_HOME=/usr/softwaretmp/bigdata/hadoop/hadoop-2.7.3/
	export HBASE_HOME=/usr/softwaretmp/bigdata/hbase/hbase-1.2.4/
	export HIVE_HOME=/usr/softwaretmp/bigdata/hive/apache-hive-2.1.1-bin/
	export ZOOCFGDIR=/usr/softwaretmp/bigdata/zookeeper/zookeeper-3.4.10/conf
'''

        Put the MySQL JDBC driver jar into Sqoop's lib directory (mysql-connector-java-5.1.47-bin.jar)

Copy mysql-connector-java-5.1.47-bin.jar into /usr/softwaretmp/bigdata/sqoop/sqoop-1.4.7.bin__hadoop-2.6.0/lib/

        Verify the configuration:

bin/sqoop help

bin/sqoop list-databases --connect jdbc:mysql://node03:3306/ --username root --password 123456

                Example data-sync commands

# MySQL -> Hive full import
bin/sqoop import --connect jdbc:mysql://node03:3306/toutiao --username root --password 123456 \
					--table user_profile --m 5 --hive-home /root/bigdata/hive --hive-import \
					--create-hive-table --hive-drop-import-delims --warehouse-dir /usr/hive/warehouse/toutiao.db \
					--hive-table toutiao.user_profile

# MySQL -> Hive incremental import
bin/sqoop import --connect jdbc:mysql://node03:3306/toutiao --username root --password 123456 \
					--table user_profile --m 5 --target-dir /usr/hive/warehouse/toutiao.db/user_profile \
					--incremental lastmodified --check-column update_time \
					--merge-key user_id --last-value `date +"%Y-%m-%d" -d "-1day"`

bin/sqoop import --connect jdbc:mysql://node03:3306/toutiao --username root --password 123456 \
					--m 5 \
					--query 'select article_id, user_id, channel_id, REPLACE(REPLACE(REPLACE(title, CHAR(13), ""), CHAR(10), ""), ",", " ") title, status, update_time from news_article_basic where $CONDITIONS' \
					--split-by user_id \
					--target-dir /usr/hive/warehouse/toutiao.db/user_profile \
					--incremental lastmodified --check-column update_time \
					--merge-key user_id --last-value `date +"%Y-%m-%d" -d "-1day"`

4.9 Flume Installation: node01

cd /usr/softwaretmp/bigdata/flume

tar -zxvf apache-flume-1.8.0-bin.tar.gz

        Configure the environment variables:

vim /etc/profile
'''
	# set flume environment
	export FLUME_HOME=/usr/softwaretmp/bigdata/flume/apache-flume-1.8.0-bin
	export FLUME_CONF_DIR=$FLUME_HOME/conf
	export PATH=$FLUME_HOME/bin:$PATH
'''

source /etc/profile

        Edit flume-env.sh:

cd flume/apache-flume-1.8.0-bin/conf

cp flume-env.sh.template flume-env.sh

vim flume-env.sh
'''
	export JAVA_HOME=/usr/softwaretmp/bigdata/java/jdk1.8.0_171
'''

        Create the slave.conf configuration file:

touch slave.conf

vim slave.conf
'''
	a1.sources = r1
	a1.sinks = k1
	a1.channels = c1

	# source definition
	a1.sources.r1.type = spooldir
	# create this directory and make sure it starts out empty
	a1.sources.r1.spoolDir = /usr/softwaretmp/bigdata/flume/logs

	# sink definition: use avro to forward the consumed data (output to another agent)
	a1.sinks.k1.type = avro
	# hostname is the master node the events are ultimately sent to
	a1.sinks.k1.hostname = node01
	# port number
	a1.sinks.k1.port = 44444

	# channel definition: use a file channel as the temporary buffer, with a checkpoint directory for better reliability
	a1.channels.c1.type = file
	a1.channels.c1.checkpointDir = /usr/softwaretmp/bigdata/flume/checkpoint
	a1.channels.c1.dataDirs = /usr/softwaretmp/bigdata/flume/data

	# wire source r1 and sink k1 together through channel c1
	a1.sources.r1.channels = c1
	a1.sinks.k1.channel = c1
'''

        Create the working directories:

cd /usr/softwaretmp/bigdata/flume

mkdir logs # create the log directory that the spooling source watches

mkdir checkpoint # create the checkpoint cache directory

mkdir data # create the data cache directory

        Configure the user-level environment variables:

vi ~/.bash_profile
'''
	#flume
	export FLUME_HOME=/usr/softwaretmp/bigdata/flume/apache-flume-1.8.0-bin
	export PATH=$PATH:$FLUME_HOME/bin
'''

source ~/.bash_profile

        Check that the installation works:

flume-ng version

# If you see: Error: Could not find or load main class org.apache.flume.tools.GetJavaProperty
Fix: find the block below in bin/flume-ng and append 2>/dev/null | grep hbase to the line shown
	vim bin/flume-ng
	'''
		local HBASE_CLASSPATH=""
		......
		java.library.path 2>/dev/null | grep hbase)
	'''

        Distribute the configured Flume to the worker nodes (node02, node03):

scp -r /usr/softwaretmp/bigdata/flume/apache-flume-1.8.0-bin/ root@node02:/usr/softwaretmp/bigdata/flume/

scp -r /usr/softwaretmp/bigdata/flume/apache-flume-1.8.0-bin/ root@node03:/usr/softwaretmp/bigdata/flume/

                Create the working directories: node02, node03

cd /usr/softwaretmp/bigdata/flume

mkdir logs # create the log directory that the spooling source watches

mkdir checkpoint # create the checkpoint cache directory

mkdir data # create the data cache directory

        Create master.conf on the master node: node01

cd /usr/softwaretmp/bigdata/flume/apache-flume-1.8.0-bin/conf

touch master.conf

vim master.conf # aggregate the data from the worker nodes and write it to HDFS
'''
	a1.sources = r1
	a1.sinks = k1
	a1.channels = c1

	# source definition: listen with avro
	a1.sources.r1.type = avro
	# hostname and port to listen on for incoming events
	a1.sources.r1.bind = node01
	a1.sources.r1.port = 44444

	# interceptor: add a timestamp to every event
	a1.sources.r1.interceptors = i1
	a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder

	# sink definition: write to HDFS
	a1.sinks.k1.type = hdfs
	# HDFS path on the master
	a1.sinks.k1.hdfs.path = hdfs://node01:9000/flume/%Y%m%d
	a1.sinks.k1.hdfs.filePrefix = events-
	a1.sinks.k1.hdfs.fileType = DataStream
	# do not roll files based on the number of events
	a1.sinks.k1.hdfs.rollCount = 0
	# roll a new file once it reaches 128 MB on HDFS
	a1.sinks.k1.hdfs.rollSize = 134217728
	# roll a new file every 60 seconds
	a1.sinks.k1.hdfs.rollInterval = 60

	# channel definition: use a memory channel as the temporary buffer
	a1.channels.c1.type = memory
	a1.channels.c1.capacity = 1000
	a1.channels.c1.transactionCapacity = 100
	# wire source r1 and sink k1 together through channel c1
	a1.sources.r1.channels = c1
	a1.sinks.k1.channel = c1
'''

               In the configuration above, a1.sinks.k1.hdfs.path = hdfs://node01:9000/flume/..., so the watched files are automatically uploaded under /flume on HDFS; create that HDFS directory by hand first

hdfs dfs -mkdir /flume

                Check the /flume directory on HDFS first; at this point it contains nothing

hdfs dfs -ls -R /flume

        Run and test:

                Start the service: start it on the master node (node01)

flume-ng agent -n a1 -c conf -f /usr/softwaretmp/bigdata/flume/apache-flume-1.8.0-bin/conf/master.conf -Dflume.root.logger=INFO,console

                Start the agents on the worker nodes: node02, node03

bin/flume-ng agent -n a1 -c conf -f /usr/softwaretmp/bigdata/flume/apache-flume-1.8.0-bin/conf/slave.conf -Dflume.root.logger=INFO,console

                Create a log data file on the worker nodes: node02, node03

cd /usr/softwaretmp/bigdata/flume/logs

vim flume_test.txt
'''
    {"actionTime":"2019-04-10 18:15:35","readTime":"","channelId":0,"param":{"action":"exposure","userId":"2","articleId":"[18577,14299]","algorithmCombine":"C2"}}
    {"actionTime":"2019-04-10 18:12:11","readTime":"2886","channelId":18,"param":{"action":"read","userId":"2","articleId":"18005","algorithmCombine":"C2"}}
    {"actionTime":"2019-04-10 18:15:32","readTime":"","channelId":18,"param":{"action":"click","userId":"2","articleId":"18005","algorithmCombine":"C2"}}
'''

                Append data to flume_test.txt:

echo {"actionTime":"2019-04-10 18:15:32","readTime":"","channelId":18,"param":{"action":"click","userId":"2","articleId":"18005","algorithmCombine":"C2"}} >> flume_test.txt

tail -f collect.log # watch a log file as it is being written

                You should then see that the file just created has been uploaded automatically under /flume on HDFS

hdfs dfs -ls -R /flume

hdfs dfs -cat /flume/20220418/events-.1650292569824

                View it in the web UI

http://node01:50070/explorer.html#

                Check the running Flume processes

ps aux | grep flume

4.10 Spark Cluster Setup:

        Install the Scala environment:

                Install and configure Scala on the master node: node01

cd /usr/softwaretmp/bigdata/

mkdir scala

tar -zxvf scala-2.11.12.tgz -C scala/

rm -rf scala-2.11.12.tgz

                Add the Scala environment variables and apply them: all nodes (node01, node02, node03)

vim /etc/profile
'''
    # set scala environment
    export SCALA_HOME=/usr/softwaretmp/bigdata/scala/scala-2.11.12
    export PATH=$SCALA_HOME/bin:$PATH
'''

source /etc/profile

                Check the installation: master node (node01)

scala -version

                Copy Scala to the worker nodes: master node (node01)

scp -r /usr/softwaretmp/bigdata/scala/scala-2.11.12 root@node02:/usr/softwaretmp/bigdata/scala/

scp -r /usr/softwaretmp/bigdata/scala/scala-2.11.12 root@node03:/usr/softwaretmp/bigdata/scala/

        Install Spark

                Install and configure Spark on the master node: master node (node01)

cd /usr/softwaretmp/bigdata/

mkdir spark

tar -zxvf spark-2.4.0-bin-hadoop2.7.tgz -C spark/

rm -rf spark-2.4.0-bin-hadoop2.7.tgz

                Edit spark-env.sh: master node (node01)

cd /usr/softwaretmp/bigdata/spark/spark-2.4.0-bin-hadoop2.7/conf/

cp spark-env.sh.template spark-env.sh # copy the spark-env.sh template in conf

vim spark-env.sh
'''
	# export SPARK_MASTER_IP=node01 # tells Spark which machine the master runs on; needed for standalone mode, not for standalone HA or YARN mode
	export SCALA_HOME=/usr/softwaretmp/bigdata/scala/scala-2.11.12
	export SPARK_WORKER_MEMORY=1g
	export JAVA_HOME=/usr/softwaretmp/bigdata/java/jdk1.8.0_171 # Java installation directory
	export HADOOP_HOME=/usr/softwaretmp/bigdata/hadoop/hadoop-2.7.3
	export HADOOP_CONF_DIR=/usr/softwaretmp/bigdata/hadoop/hadoop-2.7.3/etc/hadoop # lets Spark read files on HDFS
	export YARN_CONF_DIR=/usr/softwaretmp/bigdata/hadoop/hadoop-2.7.3/etc/hadoop # lets Spark run on the YARN cluster
	# in YARN mode only HADOOP_CONF_DIR and YARN_CONF_DIR are required
'''

                Configure the Spark worker nodes by editing the slaves file: master node (node01; the slaves file should contain only the node names, no extra comments)

cp slaves.template slaves
vim slaves
'''
    node02
    node03
'''

                Send the configured Spark directory to all worker nodes: master node (node01)

scp -r /usr/softwaretmp/bigdata/spark/spark-2.4.0-bin-hadoop2.7 root@node02:/usr/softwaretmp/bigdata/spark/

scp -r /usr/softwaretmp/bigdata/spark/spark-2.4.0-bin-hadoop2.7 root@node03:/usr/softwaretmp/bigdata/spark/

                Configure the Spark environment variables: all nodes (node01, node02, node03)

vim /etc/profile
'''
    # set spark environment
    export SPARK_HOME=/usr/softwaretmp/bigdata/spark/spark-2.4.0-bin-hadoop2.7
    export PATH=$SPARK_HOME/bin:$PATH
'''

source /etc/profile

                Start the Spark environment: master node (node01); note whether you are running standalone mode or YARN mode

                        Standalone mode test

/usr/softwaretmp/bigdata/spark/spark-2.4.0-bin-hadoop2.7/sbin/start-all.sh
# verify in a browser: ip:8080

                        YARN mode test

bin/pyspark --master local[*] # local mode, using all local cores

bin/pyspark --master spark://node01:7077 # standalone cluster mode

bin/pyspark --master yarn # YARN mode

bin/pyspark --master yarn --deploy-mode client|cluster

# --deploy-mode selects the deploy mode; the default is client mode. client = client mode, cluster = cluster mode; --deploy-mode only applies in YARN mode
# Cluster mode: the Driver runs inside a YARN container, in the same container as the ApplicationMaster
# Client mode: the Driver runs in the client process, e.g. inside the spark-submit process itself
# Examples:
	Client mode:
		bin/spark-submit --master yarn --deploy-mode client --driver-memory 512m --executor-memory 512m --num-executors 2 --total-executor-cores 3 /usr/softwaretmp/bigdata/spark/spark-2.4.0-bin-hadoop2.7/examples/src/main/python/pi.py 100
	Cluster mode:
		bin/spark-submit --master yarn --deploy-mode cluster --driver-memory 512m --executor-memory 512m --num-executors 2 --total-executor-cores 3 /usr/softwaretmp/bigdata/spark/spark-2.4.0-bin-hadoop2.7/examples/src/main/python/pi.py 100

                Spark on Hive configuration: in essence, Spark just needs to be able to reach Hive's metastore; configure it as follows:

                        1. The metastore must exist and be running

cd /usr/softwaretmp/bigdata/spark/spark-2.4.0-bin-hadoop2.7/conf

vim hive-site.xml
'''
	<configuration>
		<!-- warehouse directory on HDFS -->
		<property>
			<name>hive.metastore.warehouse.dir</name>
			<value>/usr/softwaretmp/bigdata/hive_remote/warehouse</value>
		</property>
		<!-- use a remote metastore -->
		<property>
			<name>hive.metastore.local</name>
			<value>false</value>
		</property>
		<!-- metastore service address -->
		<property>
			<name>hive.metastore.uris</name>
			<value>thrift://node02:9083</value>
		</property>
	</configuration>
'''

                        2. Spark needs to know where the metastore is (IP and port); that is what the hive-site.xml above provides

                                Step 2: put the MySQL driver jar into Spark's jars directory

Upload the previously downloaded mysql-connector-java-5.1.47-bin.jar to /usr/softwaretmp/bigdata/spark/spark-2.4.0-bin-hadoop2.7/jars

                                Step 3: make sure Hive itself has the metastore service configured; check hive-site.xml in Hive's configuration directory
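
                Once the pieces above are in place, a minimal PySpark sketch can confirm that Spark reaches the Hive metastore (illustrative only; assumes the metastore service on node02 is running):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("sparkOnHiveCheck")
         .config("hive.metastore.uris", "thrift://node02:9083")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("show databases").show()  # should list the databases created earlier through the Hive client
spark.stop()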

4.11 Distributed Anaconda Installation

        Set up Anaconda3: all three nodes (node01, node02, node03)

cd /usr/softwaretmp/bigdata
scp -r /usr/softwaretmp/bigdata/anaconda/Anaconda3-2019.03-Linux-x86_64.sh root@node02:/usr/softwaretmp/bigdata/anaconda/
scp -r /usr/softwaretmp/bigdata/anaconda/Anaconda3-2019.03-Linux-x86_64.sh root@node03:/usr/softwaretmp/bigdata/anaconda/

sh ./Anaconda3-2020.07-Linux-x86_64.sh
'''
	Press Enter -> enter -> enter -> yes -> /usr/softwaretmp/bigdata/anaconda/anaconda3 -> yes -> exit
	then log in again
'''

vim /root/.condarc  # switch to domestic (China) mirrors
'''
	channels:
		- defaults
	show_channel_urls: true
	default_channels:
		- https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
		- https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
		- https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
	custom_channels:
		conda-forge: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
		msys2: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
		bioconda: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
		menpo: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
		pytorch: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
		simpleitk: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
'''

conda create -n pyspark python=3.6 # create a virtual environment
conda activate pyspark # activate the virtual environment

vim /etc/profile # add environment variables so that pyspark uses the Python interpreter from the Anaconda virtual environment
'''
	export JAVA_HOME=/usr/softwaretmp/bigdata/java/jdk1.8.0_171
	export HADOOP_HOME=/usr/softwaretmp/bigdata/hadoop/hadoop-2.7.3
	export SPARK_HOME=/usr/softwaretmp/bigdata/spark/spark-2.4.0-bin-hadoop2.7
	export PYSPARK_PYTHON=/usr/softwaretmp/bigdata/anaconda/anaconda3/envs/pyspark/bin/python3.6 # newly added
	export HADOOP_CONF_DIR=/usr/softwaretmp/bigdata/hadoop/hadoop-2.7.3/etc/hadoop  # newly added
'''
source /etc/profile

vim /root/.bashrc # add the environment variables to the user's shell profile as well
'''
	export JAVA_HOME=/usr/softwaretmp/bigdata/java/jdk1.8.0_171
	export PYSPARK_PYTHON=/usr/softwaretmp/bigdata/anaconda/anaconda3/envs/pyspark/bin/python3.6
'''

        Install the pyspark package in the virtual environment:

conda activate pyspark # activate the virtual environment
# pyspark is the official Python library for Spark; it contains the full Spark API and can be used to write Spark applications and submit them to a Spark cluster to run

pip install pyspark==2.4 -i https://pypi.tuna.tsinghua.edu.cn/simple # Tsinghua mirror
pip install jieba -i https://pypi.tuna.tsinghua.edu.cn/simple

                Test:

python
'''
    from pyspark import SparkContext,SparkConf

    conf = SparkConf().setMaster("local[*]").setAppName("wordCountHelloWorld")
    sc = SparkContext(conf=conf)
    print(sc.parallelize([1,2,3,4,5]).map(lambda x: x + 1).collect())
'''

exit()

        Install Jupyter Notebook in the virtual environment: node01

pip install Jupyter # install Jupyter Notebook

jupyter notebook --generate-config # generate the Jupyter Notebook configuration file

jupyter notebook password # set the Jupyter Notebook password (root)

vim ~/.jupyter/jupyter_notebook_config.py # edit the configuration file
'''
	c.NotebookApp.allow_remote_access = True
	c.NotebookApp.open_browser = False # do not open a browser on the server itself
	c.NotebookApp.ip = '*' # allow access on all of the server's IPs; to restrict access, put a specific IP here
	c.NotebookApp.allow_root = True # by default Jupyter refuses to start as root, for safety
	c.NotebookApp.notebook_dir = '/root/works' # Jupyter's root directory
	c.NotebookApp.port = 8888 # the port can be changed
'''

                Switch the Jupyter Notebook kernel:
    

conda activate <env name>

conda install nb_conda_kernels

python -m ipykernel install --user --name <env name> --display-name "<display name>"

                Test the remote connection:

jupyter notebook / jupyter notebook --ip 0.0.0.0 -> open the printed link in a local browser -> create a new notebook file

If you see: 500 : Internal Server Error -> AttributeError: module 'nbconvert.exporters' has no attribute 'WebPDFExporter'

Fix: conda install nbconvert notebook

4.12 Configure Local PyCharm Professional: on the local Windows machine

        Create a project:

                Open -> Create New Project -> Existing interpreter -> ... -> add a remote interpreter (SSH Interpreter) ->
        enter the connection details (host, user, password) -> enter the path to Python on the remote server

        Create a file and test it:

                Create test.py -> right-click -> Run

# coding:utf8
from pyspark import SparkContext,SparkConf

if __name__ == '__main__':
	# conf = SparkConf().setMaster("local[*]").setAppName("wordCountHelloWorld") # local mode
	conf = SparkConf().setAppName("wordCountHelloWorld") # cluster mode
	# When submitting to a cluster, if the main script depends on other code files, set spark.submit.pyFiles; the value can be a single .py file or a .zip archive (zip up multiple dependency files before uploading)
	conf.set("spark.submit.pyFiles","other_py.py")
	sc = SparkContext(conf=conf)

	# file_rdd = sc.textFile("data/word")
	file_rdd = sc.textFile("hdfs://node01:9000/input/word.txt")

	word_rdd = file_rdd.flatMap(lambda line: line.split(" "))
	word_with_one_rdd = word_rdd.map(lambda x: (x, 1))
	result_rdd = word_with_one_rdd.reduceByKey(lambda a, b: a + b)
	# result_rdd = file_rdd.flatMap(lambda line: line.split(" ")).map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)

	print(result_rdd.collect())

                Submit and run on the server: node01

/usr/softwaretmp/bigdata/spark/spark-2.4.0-bin-hadoop2.7/bin/spark-submit --master local[*] /root/work/halloworld.py

/usr/softwaretmp/bigdata/spark/spark-2.4.0-bin-hadoop2.7/bin/spark-submit --master yarn /root/work/halloworld.py

/usr/softwaretmp/bigdata/spark/spark-2.4.0-bin-hadoop2.7/bin/spark-submit --master yarn --py-files ./defs.py /root/work/halloworld.py

                        # Submitting while squeezing the most out of the cluster

cat /proc/cpuinfo | grep processor | wc -l # how many CPU cores the machine has

free -wh # how much memory the machine has

# Simple plan: use 6 CPU cores and 12 GB of memory, i.e. 6 executors, each using 1 CPU core and 2 GB of memory
bin/spark-submit --master yarn --py-files /root/work/defs.py --executor-memory 2g --executor-cores 1 --num-executors 6 /root/work/halloworld.py
