Pseudo-distributed Hadoop configuration on CentOS, with a Python test

Install and configure Java

$ yum -y install java-1.7.0-openjdk*
$ ls -lrt /usr/bin/java
# lrwxrwxrwx 1 root root 22 Apr 29 13:47 /usr/bin/java -> /etc/alternatives/java
$ ls -lrt /etc/alternatives/java
#lrwxrwxrwx 1 root root 76 Apr 29 13:47 /etc/alternatives/java -> /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.221-2.6.18.0.el7_6.x86_64/jre/bin/java
$ echo 'export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.221-2.6.18.0.el7_6.x86_64/' >> /etc/bashrc
$ echo 'export JRE_HOME=$JAVA_HOME/jre' >> /etc/bashrc
$ echo 'export CLASSPATH=$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH' >> /etc/bashrc
$ echo 'export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH' >> /etc/bashrc
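To verify the installation, reload the shell configuration and print the version; the exact version string depends on the build yum installed:

$ source /etc/bashrc
$ java -version
# java version "1.7.0_221" (or similar)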

Set up SSH key-based login

$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys
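Verify that password-less login now works; the first connection will only ask you to confirm the host key:

$ ssh localhost
$ exit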

Install Hadoop

$ mkdir ~/download
$ wget -P ~/download/ http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-2.6.5/hadoop-2.6.5.tar.gz
$ tar zxf ~/download/hadoop-2.6.5.tar.gz -C /opt/
$ echo 'export HADOOP_HOME=/opt/hadoop-2.6.5' >> /etc/bashrc
$ echo 'export STREAM=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar' >> /etc/bashrc
$ source /etc/bashrc
$ cd $HADOOP_HOME
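A quick sanity check that the unpacked distribution runs:

$ bin/hadoop version
# Hadoop 2.6.5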

Configure Hadoop

  1. Edit core-site.xml
$ vi $HADOOP_HOME/etc/hadoop/core-site.xml

Change the contents to:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
  2. Edit hdfs-site.xml
$ vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Change the contents to:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
  3. Start the HDFS service
$ $HADOOP_HOME/bin/hdfs namenode -format
$ $HADOOP_HOME/sbin/start-dfs.sh
# check that the port is listening
$ netstat -ntpl|grep 9000
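The jps tool (shipped with the JDK) is another way to confirm the HDFS daemons are up; the PIDs below are placeholders:

$ jps
# 2481 NameNode
# 2601 DataNode
# 2770 SecondaryNameNode
# 2893 Jps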
  4. Edit mapred-site.xml (create it from the template first, as shown below)
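In Hadoop 2.6.5 this file does not exist out of the box; only mapred-site.xml.template ships, so copy it first:

$ cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml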
$ vi $HADOOP_HOME/etc/hadoop/mapred-site.xml

Change the contents to:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
  5. Edit yarn-site.xml
$ vi $HADOOP_HOME/etc/hadoop/yarn-site.xml

Change the contents to:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
  6. Start the YARN service
$ $HADOOP_HOME/sbin/start-yarn.sh
# check that the port is listening
$ netstat -ntpl|grep 8088
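jps should now also list the YARN daemons, and the ResourceManager web UI should respond at http://localhost:8088 (PIDs are placeholders):

$ jps | grep -E 'ResourceManager|NodeManager'
# 3125 ResourceManager
# 3230 NodeManager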
  7. Other configuration
$ echo 'alias hadoop=$HADOOP_HOME/bin/hadoop' >> /etc/bashrc
$ echo 'alias hdfs=$HADOOP_HOME/bin/hdfs' >> /etc/bashrc
$ echo 'export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar' >> /etc/bashrc
$ source /etc/bashrc

Install Python 3

$ yum install epel-release
$ yum install python36
$ echo 'alias python=python3' >> /etc/bashrc
$ source /etc/bashrc
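Confirm that the alias resolves; the exact 3.6.x patch level depends on what EPEL currently ships:

$ python --version
# Python 3.6.8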

Write the Python programs

  1. Write mapper.py
#!/usr/bin/env python3
import sys
import re

# Emit "<word> 1" for every alphabetic token read from stdin.
for line in sys.stdin:
	words = line.strip().split()
	for word in words:
		word = word.lower()
		# keep only runs of letters, dropping digits and punctuation
		ws = re.findall("[a-z]+", word)
		for w in ws:
			print(w, 1)
  2. Write reducer.py
#!/usr/bin/env python3
import sys

# Sum counts per word; assumes the input is sorted by word, which
# Hadoop's shuffle phase (or a local `sort`) guarantees.
curr_w, curr_c = None, 0

for line in sys.stdin:
	word, cnt = line.strip().split()
	if curr_w == word:
		curr_c += int(cnt)
	else:
		if curr_w is not None:
			print(curr_w, curr_c)
		curr_c = int(cnt)
		curr_w = word

# flush the final word (curr_w is None only when there was no input at all)
if curr_w is not None:
	print(curr_w, curr_c)
  3. Make the scripts executable
$ chmod +x mapper.py
$ chmod +x reducer.py
  4. Test the programs locally
    First download three English articles. Here, three pieces were copied from ChinaDaily and saved as p1.txt, p2.txt, and p3.txt; make sure the files are UTF-8 encoded.
$ cat p1.txt | ./mapper.py | sort | ./reducer.py | more
# a 11
# absorb 1
# according 1
# activated 1
# active 1
# activities 1
# added 1
# adopted 1
# after 5
# airport 3
# --More--

Test Hadoop

  1. Put the input files in HDFS
$ hdfs dfs -mkdir -p /user/`whoami`/input
$ hdfs dfs -put ~/p*.txt /user/`whoami`/input
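Confirm that the files landed in HDFS:

$ hdfs dfs -ls input
# Found 3 items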
  2. Write run.sh
#!/bin/bash
$HADOOP_HOME/bin/hadoop jar $STREAM \
-files ./mapper.py,./reducer.py \
-mapper ./mapper.py \
-reducer ./reducer.py \
-input /user/`whoami`/input/p*.txt \
-output /user/`whoami`/output
  3. Run run.sh
$ chmod +x run.sh
$ ./run.sh
# progress is printed as the job runs
  4. View the results
$ hdfs dfs -cat output/part-00000
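Note that Hadoop refuses to overwrite an existing output directory, so remove it before re-running the job:

$ hdfs dfs -rm -r /user/`whoami`/output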


Troubleshooting

  1. [python] Implementing a Hadoop MapReduce program in Python: computing the mean and variance of a dataset
  2. How to fix the "could only be replicated to 0 nodes, instead of 1" error
