Environment:
Mac OS
Preconditions:
1. Download HOP from http://code.google.com/p/hop/
I downloaded version 0.2.
2. Enable SSH: System Preferences -> Sharing -> Remote Login (just check the box).
Check whether SSH asks for a password to connect by running this in Terminal: $ ssh localhost
If it prompts for a password, run the following two commands:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
After that, ssh localhost should connect without a password.
3. Set the JDK to 1.6.
Utilities -> Java -> Java Preferences: drag 1.6 above 1.5.
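You can verify which JDK is now active from Terminal; it should report a 1.6.x version:

$ java -version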
Steps:
1. Edit conf/hadoop-env.sh.
Comment out the original JAVA_HOME line and uncomment the Mac JAVA_HOME line.
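For reference, a sketch of what the edited section of conf/hadoop-env.sh ends up looking like; the exact commented-out default varies by release, and /Library/Java/Home is the conventional (but not guaranteed) JAVA_HOME on Mac OS of this era:

# The java implementation to use.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
export JAVA_HOME=/Library/Java/Home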
2. Open Terminal, cd into the hop directory, and run the following commands in order.
Format a new distributed-filesystem:
$ bin/hadoop namenode -format
Start the hadoop daemons:
$ bin/start-all.sh
Copy the input files into the distributed filesystem:
$ bin/hadoop fs -put conf input
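Note: these commands assume a pseudo-distributed setup. If the fs commands seem to operate on your local filesystem instead of HDFS, conf/hadoop-site.xml probably lacks the basic settings; a minimal sketch, conventional for the Hadoop 0.19-era codebase that HOP 0.2 builds on (ports 9000/9001 are the usual choices, not requirements):

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>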
3. Run one of the bundled demos.
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+' (here * is hop-0.2)
The result looks like this:
10/08/10 16:32:14 INFO mapred.FileInputFormat: Total input paths to process : 11
10/08/10 16:32:14 INFO mapred.JobClient: Running job: job_201008101616_0008
10/08/10 16:32:14 INFO mapred.JobClient: Job configuration (HOP):
map pipeline = false
reduce pipeline = false
snapshot input = false
snapshot freq = 0.1
10/08/10 16:32:15 INFO mapred.JobClient: map 0% reduce 0%
10/08/10 16:32:21 INFO mapred.JobClient: map 8% reduce 0%
10/08/10 16:32:22 INFO mapred.JobClient: map 16% reduce 0%
10/08/10 16:32:23 INFO mapred.JobClient: map 25% reduce 0%
10/08/10 16:32:24 INFO mapred.JobClient: map 33% reduce 0%
10/08/10 16:32:25 INFO mapred.JobClient: map 50% reduce 0%
.......
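Once the job finishes, you can list and print what the grep job produced. These are stock Hadoop fs commands, not HOP-specific; output files are conventionally named part-00000, part-00001, and so on:

$ bin/hadoop fs -ls output
$ bin/hadoop fs -cat output/part-00000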
------------------Up to this point it is still just plain Hadoop; now let's turn on the HOP features-------
4. Open the file conf/hadoop-site.xml.
Insert the following properties just before the last line (the closing </configuration> tag):
<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>40</value>
  <description>The default number of parallel transfers run by reduce
  during the copy (shuffle) phase.
  </description>
</property>
<property>
  <name>mapred.map.pipeline</name>
  <value>true</value>
  <description>Pipeline map task output.
  </description>
</property>
<property>
  <name>mapred.reduce.pipeline</name>
  <value>true</value>
  <description>Pipeline reduce task output. A pipelined reduce task will send its
  output directly to the map task in the subsequent job, if any.
  </description>
</property>
<property>
  <name>mapred.snapshot.frequency</name>
  <value>0.1</value>
  <description>Snapshot frequency. Determines how often snapshots are taken,
  based on job progress. A frequency of 0.1 takes snapshots at 10% progress
  intervals (i.e., 10%, 20%, ..., 100%).
  </description>
</property>
<property>
  <name>mapred.job.input.snapshots</name>
  <value>false</value>
  <description>Job input is a snapshot. If a job takes snapshot results as
  input, then this parameter must be set to true.
  </description>
</property>
Note the last value: if a job takes snapshot results as input, then this parameter must be set to true. Our demo reads ordinary files, so it stays false.
5. Run step 3 again.
If it fails with an error saying output already exists, remove it first: $ bin/hadoop fs -rmr output
You can browse the files and directories on HDFS at http://localhost:50070/.
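(Job status, including the progress lines shown below, is also visible on the JobTracker web UI at http://localhost:50030/; both ports are the stock Hadoop defaults.)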
After removing output, run step 3 again; the result now shows:
10/08/10 16:35:51 INFO mapred.FileInputFormat: Total input paths to process : 11
10/08/10 16:35:52 INFO mapred.JobClient: Running job: job_201008101616_0010
10/08/10 16:35:52 INFO mapred.JobClient: Job configuration (HOP):
map pipeline = true
reduce pipeline = true
snapshot input = false
snapshot freq = 0.1
10/08/10 16:35:53 INFO mapred.JobClient: map 0% reduce 0%
10/08/10 16:35:55 INFO mapred.JobClient: map 8% reduce 0%
10/08/10 16:35:56 INFO mapred.JobClient: map 16% reduce 0%
10/08/10 16:35:57 INFO mapred.JobClient: map 25% reduce 0%
10/08/10 16:35:58 INFO mapred.JobClient: map 33% reduce 0%
10/08/10 16:35:59 INFO mapred.JobClient: map 41% reduce 0%
10/08/10 16:36:00 INFO mapred.JobClient: map 50% reduce 0%
10/08/10 16:36:01 INFO mapred.JobClient: map 58% reduce 0%
10/08/10 16:36:02 INFO mapred.JobClient: map 66% reduce 0%
10/08/10 16:36:03 INFO mapred.JobClient: map 74% reduce 0%
10/08/10 16:36:04 INFO mapred.JobClient: map 83% reduce 0%
10/08/10 16:36:05 INFO mapred.JobClient: map 100% reduce 0%
10/08/10 16:36:10 INFO mapred.JobClient: map 100% reduce 100%
........
And that's it: you are now playing with HOP. If you want to go back to plain Hadoop, changing the map and reduce pipeline values in hadoop-site.xml to false should be enough.
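That is, it should be enough to flip just these two values (leaving the other HOP properties in place):

<property>
  <name>mapred.map.pipeline</name>
  <value>false</value>
</property>
<property>
  <name>mapred.reduce.pipeline</name>
  <value>false</value>
</property>

When you are done experimenting, stop the daemons with:

$ bin/stop-all.sh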