- Pitfall 1
The JAVA_HOME that newer Raspberry Pi images ship with points to the arm32 build of the JDK, so the path used in the article,
export JAVA_HOME=/usr/lib/jvm/jdk-8-oracle-arm-vfp-hflt/jre/
becomes
export JAVA_HOME=/usr/lib/jvm/jdk-8-oracle-arm32-vfp-hflt/jre/
- Pitfall 2
In Hadoop 2.x the core-site.xml property that specifies the NameNode address is fs.defaultFS (the old 1.x releases used fs.default.name):
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://192.168.0.52:54310</value>
</property>
The original article follows.
Building a Raspberry Pi Hadoop cluster (Part 1)
Posted on 28 September 2015
Original article: https://web.archive.org/web/20170221231927/http://www.becausewecangeek.com/building-a-raspberry-pi-hadoop-cluster-part-1/
Since I love everything related to Raspberry Pi's and am active as a data scientist I've decided that I needed my own Pi cluster. Luckily for me version 2 was already available when I started this project, so this cluster will have a little bit more oomph compared to clusters using the original Pi. It takes quite a bit of setup to get a Hadoop cluster going, so in this first part we will limit ourselves to a single node "cluster".
The cluster
If you have ever visited my house you will have seen a lot of LEGO lying around. LEGO is by far my most expensive hobby, so when I needed to design a Raspberry Pi cluster case I didn't have to look far. My cluster case would be built in bricks.
I started by "designing" the case in OneNote to get the dimensions right. If you need any help converting millimeters to studs I suggest checking out this amazing page.
OneNote sketch for creating the LEGO Raspberry Pi cluster
The case has room for 6 Raspberry Pi's, my Lacie Cloudbox NAS, a TP-Link TL-SG1008D 8-port gigabit switch and the power supply. I took a bit of a gamble and chose the Orico 9-port USB 2.0 hub, hoping that it would provide enough juice for the Pi's. I have successfully run 5 Pi's using this single hub.
I designed the case in Lego Digital Designer and ordered the missing bricks via BrickLink. About an hour of building later this beauty was ready to be deployed in my storeroom.
In case you're wondering, that's a Cubesensor base station at the top. We have the hardware, let's do some software.
Getting started
I'm assuming you already have a working Raspberry Pi 2 running Raspbian and that you know how to log in via SSH. We will install the latest version of Hadoop, which at the time of writing is 2.7.1. Great, let's get started and log in via SSH.
Creating a new group and user
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
sudo adduser hduser sudo
Everything Hadoop will be happening via the hduser. Let's change to this user.
su hduser
Generating SSH keys
Although we are using a single node setup in this part, I decided to already create SSH keys. These will be the keys that the nodes use to talk to each other.
cd ~
mkdir .ssh
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys
To verify that everything is working, you can simply open an SSH connection to localhost.
ssh localhost
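Looking ahead to part 2: the same public key will eventually have to end up on the other nodes so they can talk to each other without passwords. As a sketch, assuming a second node reachable under the hypothetical hostname node2, that would look like this:
ssh-copy-id hduser@node2
# or, if ssh-copy-id is not available, append the key by hand:
cat ~/.ssh/id_rsa.pub | ssh hduser@node2 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'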
Installing the elephant in the room called Hadoop
wget ftp://apache.belnet.be/mirrors/ftp.apache.org/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
sudo mkdir /opt
cd ~
sudo tar -xvzf hadoop-2.7.1.tar.gz -C /opt/
cd /opt
sudo chown -R hduser:hadoop hadoop-2.7.1/
Depending on what you already did with your Pi, the /opt directory may already exist.
Hadoop is now installed, but we still need quite a bit of tinkering to get it configured right.
Setting a few environment variables
First we need to set a few environment variables. There are a few ways to do this, but I always do it by editing the .bashrc file.
nano ~/.bashrc
Add the following lines at the end of the file:
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:jre/bin/java::")
export HADOOP_HOME=/opt/hadoop-2.7.1
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Changes to .bashrc are not applied when you save the file. You can either log out and log in again to pick up the new environment variables, or you can:
source ~/.bashrc
If everything is configured right, you should be able to print the installed version of Hadoop.
$ hadoop version
Hadoop 2.7.1
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 15ecc87ccf4a0228f35af08fc56de536e6ce657a
Compiled by jenkins on 2015-06-29T06:04Z
Compiled with protoc 2.5.0
From source with checksum fc0a1a23fc1868e4d5ee7fa2b28a58a
This command was run using /opt/hadoop-2.7.1/share/hadoop/common/hadoop-common-2.7.1.jar
Configuring Hadoop 2.7.1
Let's go to the directory that contains all of Hadoop's configuration files. We want to edit the hadoop-env.sh file. For some reason we need to set JAVA_HOME manually in this file; Hadoop seems to ignore the $JAVA_HOME we just exported.
cd $HADOOP_CONF_DIR
nano hadoop-env.sh
Yes, I use nano. Look for the line saying JAVA_HOME and change it to your Java install directory. This was how the line looked after I changed it:
export JAVA_HOME=/usr/lib/jvm/jdk-8-oracle-arm-vfp-hflt/jre/
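As noted at the top of this post, newer Raspbian images ship an arm32 variant of the Oracle JDK, so your path may differ. If you are unsure, you can resolve the actual Java directory from the java binary; this check is an addition, not part of the original walkthrough:
readlink -f /usr/bin/java
# prints something like /usr/lib/jvm/jdk-8-oracle-arm32-vfp-hflt/jre/bin/java;
# drop the trailing /bin/java and use the rest as JAVA_HOME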
There are quite a few files that need to be edited now. These are XML files; you just have to paste the snippets below between the <configuration> tags.
nano core-site.xml
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/hdfs/tmp</value>
</property>
nano hdfs-site.xml
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
cp mapred-site.xml.template mapred-site.xml
nano mapred-site.xml
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>256</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx210m</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>256</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx210m</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>256</value>
</property>
The first property tells Hadoop to use YARN as the MapReduce framework. The other properties are settings specific to our Raspberry Pi. For example, we specify that the YARN MapReduce Application Master gets 256 megabytes of RAM, and so do the map and reduce containers. These values allow us to actually run something: the default container size is 1.5 GB, which our Pi can't deliver with its 1 GB of RAM.
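Note that the -Xmx values are a bit smaller than the container sizes: the JVM heap has to fit inside the container together with the JVM's own overhead. A common rule of thumb (an assumption on my part, not something mandated by the settings above) is to give the heap roughly 80% of the container:
# roughly 80% of a 256 MB container, leaving headroom for JVM overhead
echo $(( 256 * 80 / 100 ))   # prints 204; the settings above use the comparable value 210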
nano yarn-site.xml
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>4</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>768</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>128</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>768</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-vcores</name>
  <value>1</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>4</value>
</property>
This file tells Hadoop some information about this node, like the maximum amount of memory and the number of cores that can be used. We limit the usable RAM to 768 megabytes, which leaves a bit of memory for the OS and the Hadoop daemons themselves. A container always receives an amount of memory that is a multiple of the minimum allocation of 128 megabytes. For example, a container that asks for 450 megabytes will get 512 megabytes assigned.
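That rounding behaviour is easy to reproduce with a bit of shell arithmetic (purely illustrative):
# round a 450 MB request up to the next multiple of the 128 MB minimum allocation
request=450; min=128; echo $(( (request + min - 1) / min * min ))   # prints 512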
Preparing HDFS
sudo mkdir -p /hdfs/tmp
sudo chown hduser:hadoop /hdfs/tmp
chmod 750 /hdfs/tmp
hdfs namenode -format
Booting Hadoop
cd $HADOOP_HOME/sbin
start-dfs.sh
start-yarn.sh
If you want to verify that everything is working you can use the jps command. In its output you can see that Hadoop components like the NameNode are running. The numbers can be ignored; they are just the process IDs.
7696 ResourceManager
7331 DataNode
7464 SecondaryNameNode
8107 Jps
7244 NameNode
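If you want a bit more detail than jps gives, HDFS can report on its DataNodes as well; this extra check is not part of the original post:
hdfs dfsadmin -report
# on this single-node setup you should see exactly one live DataNode and its capacity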
Running a first MapReduce job
For our first job we need some data. I selected a handful of books from Project Gutenberg and concatenated them into one large file. There's a bit to like for everyone: Shakespeare, Homer, Edgar Allan Poe... The resulting file is about 16 MB in size.
wget http://www.gutenberg.org/cache/epub/11/pg11.txt
wget http://www.gutenberg.org/cache/epub/74/pg74.txt
wget http://www.gutenberg.org/cache/epub/1661/pg1661.txt
wget http://www.gutenberg.org/cache/epub/2701/pg2701.txt
wget http://www.gutenberg.org/cache/epub/5200/pg5200.txt
wget http://www.gutenberg.org/cache/epub/2591/pg2591.txt
wget http://www.gutenberg.org/cache/epub/6130/pg6130.txt
wget http://www.gutenberg.org/cache/epub/4300/pg4300.txt
wget http://www.gutenberg.org/cache/epub/8800/pg8800.txt
wget http://www.gutenberg.org/cache/epub/345/pg345.txt
wget http://www.gutenberg.org/cache/epub/1497/pg1497.txt
wget http://www.gutenberg.org/cache/epub/135/pg135.txt
wget http://www.gutenberg.org/cache/epub/41/pg41.txt
wget http://www.gutenberg.org/cache/epub/120/pg120.txt
wget http://www.gutenberg.org/cache/epub/22381/pg22381.txt
wget http://www.gutenberg.org/cache/epub/2600/pg2600.txt
wget http://www.gutenberg.org/cache/epub/236/pg236.txt
cat pg*.txt > books.txt
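A quick size check should come out around the 16 MB mentioned above:
ls -lh books.txt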
Hadoop cannot read the books.txt file from our traditional Linux file system; it needs to be stored on HDFS. We can easily copy it there.
hdfs dfs -copyFromLocal books.txt /books.txt
You can make sure that the copy operation went properly by listing the contents of the HDFS root directory.
hdfs dfs -ls /
Now let's count the occurrence of each word in this giant book. We are in luck: the kind developers of Hadoop provide an example that does exactly that.
hadoop jar /opt/hadoop-2.7.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount /books.txt /books-result
You can view the progress by surfing to http://<IP of your Pi>:8088/cluster. After the job is done you can find the output in the /books-result directory. We can view the results of this MapReduce (V2) job using the hdfs command:
hdfs dfs -cat /books-result/part-r* | head -n 20
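The output of wordcount is sorted by word, not by frequency. If you would rather see the most common words first, a plain Unix sort on the count column does the trick (just a convenience on top of the job output):
hdfs dfs -cat /books-result/part-r* | sort -k2,2nr | head -n 20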
Since we are talking about multiple books, printing the entire list might take a while. If you look at the output you will see that the wordcount example has room for improvement: uppercase and lowercase words are counted separately, and the symbols and punctuation surrounding words make things messy. But it's time for a first benchmark: how long did our single-node Raspberry Pi 2 work on this wordcount? The average execution time over 5 jobs was 3 minutes and 25 seconds.
Some more benchmark jobs
In the next part we will extend our Hadoop cluster to use all 4 Pi's and make it a real cluster. The same wordcount job will be run again and the execution time will be compared to our 3 minutes and 25 seconds. Just to be safe I include some larger test files as well.
I found some public datasets based on the IMDB here, so I used the plot.gz file. This results in an uncompressed text file of about 350 megabytes. Using the same command I set the Pi to work. This time the job was split into 3 parts, so a second core could start counting (up until now we had only used 1 wordcount container and thus 1 core). The result? A Raspberry Pi that was getting a bit warm.
CPU Temp: 73.4ºC
GPU Temp: 73.4ºC
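For reference, these readings can be taken on a Raspberry Pi with the standard tools (the post doesn't say how the figures above were obtained, but these commands give the same kind of numbers):
vcgencmd measure_temp                        # SoC/GPU temperature
cat /sys/class/thermal/thermal_zone0/temp    # CPU temperature in millidegrees Celsius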
Seems like a LEGO cluster case is not the best in terms of cooling, so I'll have to look into that. But the brave Pi gets the job done (although I can't say without a sweat, pun intended). The average execution time of counting the word occurrences in this text file was 25 minutes and 40 seconds.
My hunger for word occurrence counts was still gnawing, so I decided to seek out an even larger corpus. I found the Blog Authorship Corpus, which I downloaded, stripped of XML tags and concatenated into a 788 MB file. If you want to reproduce this file you can find the steps I took below.
wget http://www.cs.biu.ac.il/~koppel/blogs/blogs.zip
unzip blogs.zip
cd blogs/
cat *.xml | sed -e 's/<[^>]*>//g' >> blogs.txt
Crunching this file took 1 hour, 8 minutes and 12 seconds. I didn't have the patience to average a few timings. The aggregate resource allocation tells us that this job required 3068130 MB-seconds and 11980 vcore-seconds.
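Those aggregate figures come from the YARN ResourceManager. They can also be pulled up from the command line once the job has finished, for example:
yarn application -list -appStates FINISHED
# lists finished applications with their IDs; the web UI on port 8088 shows the
# aggregate MB-seconds and vcore-seconds per application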
Dataset                   Execution time
Gutenberg books           3 minutes 25 seconds
IMDB plots                25 minutes 40 seconds
Blog Authorship Corpus    1 hour 8 minutes 12 seconds
Conclusion
So the marvellous LEGO cluster has one working Hadoop node. In the next post I will add the other Pi's to the cluster.
Update: Some sources claim a performance increase after switching to Oracle Java. After installing Oracle Java 8 the blogs dataset took 1h, 47m and 47s. This time does not indicate any significant performance increase. I didn't look any further into this claim.
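For completeness: on the Raspbian images of that era Oracle Java 8 was available through apt, so the switch would have looked roughly like this (package name as shipped back then, so treat it as an assumption):
sudo apt-get install oracle-java8-jdk
sudo update-alternatives --config java   # select the Oracle JVM if more than one is installed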
The first Raspberry Pi node of the Hadoop cluster