Hadoop Notes

available forums:
http://bbs.hadoopor.com
http://www.hadoopor.com
http://forum.hadoopor.com
available blogs:
http://blog.chinaunix.net/u3/105041/ source-code analysis
http://caibinbupt.iteye.com/       source-code analysis
http://jimey.com/?cat=2226
http://blog.5188la.net/category/my-research/cloud-computing/hadoop-cloud-computing-my-research/
available books:
Hadoop: The Definitive Guide
Pro Hadoop

1. Hadoop 0.20.0 + Eclipse environment setup: http://bbs.hadoopor.com/thread-43-1-1.html
Written by someone in Taiwan, very good: Hadoop 0.20.0 + Eclipse environment setup, http://trac.nchc.org.tw/cloud/wiki/waue/2009/0617 ; it also shows how to package the code into a jar.
Note the Makefile in that guide: command lines such as "jar -cvf ${JarFile} -C bin/ ." and
"hadoop jar ${JarFile} ${MainFunc} input output" must begin with a tab, not spaces. As for the lines under the help target, I commented them all out with a leading "#", because I don't know how to use them yet.


1. In a Hadoop cluster setup, how does a client transfer data to Hadoop: http://bbs.hadoopor.com/thread-362-1-1.html
Use the DFSClient tool; the client machine does not need a Hadoop deployment to upload data, it only needs the DFSClient tool installed.
bin/hadoop fs -put accesses HDFS "remotely" in exactly this DFSClient way (and of course it also works locally).
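A minimal Java sketch of the same upload done through the FileSystem client API instead of the shell; the namenode address and the file paths below are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical namenode address; normally this comes from core-site.xml on the client.
        conf.set("fs.default.name", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);
        // Equivalent to: bin/hadoop fs -put /local/data.txt /user/hadoop/data.txt
        fs.copyFromLocalFile(new Path("/local/data.txt"), new Path("/user/hadoop/data.txt"));
        fs.close();
    }
}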

2. Hadoop support for MySQL: http://bbs.hadoopor.com/thread-132-1-2.html
lance(274105045) 09:48:43
It seems 0.20 provides DB input and output.
hadoopor(784027584) 09:48:50
But to run Jobs in parallel you cannot use the default scheduler; the FairScheduler contributed by Facebook supports parallel scheduling of Jobs. ?????
Spork(47986766) 09:49:16
It's not "seems", it is definitely there; it's just that the databases with good support right now are the open-source ones, such as MySQL.
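For reference, a rough sketch of the 0.20 DB input support mentioned above (the old-API classes in org.apache.hadoop.mapred.lib.db); the "orders" table, its fields, and the OrderRecord class are all hypothetical:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;
import org.apache.hadoop.mapred.lib.db.DBWritable;

// One row of a hypothetical "orders" table.
public class OrderRecord implements Writable, DBWritable {
    long id;
    double amount;

    public void readFields(ResultSet rs) throws SQLException {
        id = rs.getLong(1);
        amount = rs.getDouble(2);
    }
    public void write(PreparedStatement st) throws SQLException {
        st.setLong(1, id);
        st.setDouble(2, amount);
    }
    public void readFields(DataInput in) throws IOException {
        id = in.readLong();
        amount = in.readDouble();
    }
    public void write(DataOutput out) throws IOException {
        out.writeLong(id);
        out.writeDouble(amount);
    }

    // Job setup: read rows of "orders" as map input.
    public static void configure(JobConf job) {
        job.setInputFormat(DBInputFormat.class);
        // JDBC driver, URL, user and password are invented for this sketch.
        DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver",
                "jdbc:mysql://dbhost/mydb", "user", "password");
        DBInputFormat.setInput(job, OrderRecord.class, "orders",
                null /* conditions */, "id" /* orderBy */, "id", "amount");
    }
}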

3. Introduction to SequenceFile: http://bbs.hadoopor.com/thread-144-1-1.html
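A small write/read sketch with the 0.20 API; the file path and the key/value contents are invented for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("demo.seq");

        // Write a few key/value pairs.
        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, path, IntWritable.class, Text.class);
        for (int i = 0; i < 3; i++) {
            writer.append(new IntWritable(i), new Text("value-" + i));
        }
        writer.close();

        // Read them back.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        IntWritable key = new IntWritable();
        Text value = new Text();
        while (reader.next(key, value)) {
            System.out.println(key + "\t" + value);
        }
        reader.close();
    }
}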

4. JobTracker.JobInProgress (http://bbs.hadoopor.com/thread-212-1-1.html) is used to monitor the scheduling of a single Job. A Job is split into N Tasks; these Tasks are assigned to TaskTracker nodes in the cluster, and the TaskTracker nodes execute them.

========== Searched from Nabble Hadoop ===============
1. Hadoop 0.17 schedules jobs FIFO. If it doesn't,
that is a bug. http://old.nabble.com/Hadoop-job-scheduling-issue-td19659938.html#a19659938

2. Can jobs be configured to be sequential? That is, jobs in Group1 execute first and jobs in Group2 execute later; Group2 jobs depend on Group1 jobs, while the jobs within Group1 or within Group2 are independent of each other.
http://old.nabble.com/Can-jobs-be-configured-to-be-sequential-td20043257.html#a20043257
I recommend that you look at http://cascading.org as
an abstraction layer for managing these kinds of workflows. We've
found it quite useful.

3. Sequence of streaming jobs: if you are using sh or bash, the variable $? holds the exit status of the last command executed.

hadoop jar streaming.jar ...
if [ $? -ne 0 ]; then
    # write the error message to stderr and abort the job sequence
    echo "My job failed" >&2
    exit 1
fi

Caution: $? is the exit status of the very last command executed. It is easy to run another command before the test and end up testing the wrong command's exit status.
http://old.nabble.com/Sequence-of-Streaming-Jobs-td23336043.html#a23351848

4.mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum
mapred.map.multithreadedrunner.threads
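For context: the first two properties are per-TaskTracker slot counts, set in conf/mapred-site.xml on each node and only picked up when the TaskTracker starts (see item 9 below); the third is read by MultithreadedMapRunner. A hedged job-side sketch of the latter, assuming a hypothetical job whose JobConf is built elsewhere:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

public class MultithreadedSetup {
    // Run the (presumably I/O-bound) map function with several threads per map task JVM.
    public static void configure(JobConf conf) {
        conf.setMapRunnerClass(MultithreadedMapRunner.class);
        conf.setInt("mapred.map.multithreadedrunner.threads", 8);
    }
}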

5.http://old.nabble.com/Linking-2-MarReduce-jobs-together--td18756178.html#a18756178
Is it possible for the output from the reduce phase of job 1
to be the input to job number 2?
Well, your data has to live somewhere between the two jobs... so I'd say yes, put it in HBase or HDFS and reuse it from there.
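A hedged sketch of the HDFS variant with the old API: job 1 writes to an intermediate directory, job 2 reads from it, and JobClient.runJob blocks until each job completes. The driver classes and path names are invented for illustration:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TwoStepDriver {
    public static void main(String[] args) throws Exception {
        Path input = new Path("input");
        Path intermediate = new Path("intermediate");   // hand-off directory on HDFS
        Path output = new Path("output");

        JobConf job1 = new JobConf(Step1.class);        // Step1: hypothetical first job
        FileInputFormat.setInputPaths(job1, input);
        FileOutputFormat.setOutputPath(job1, intermediate);
        JobClient.runJob(job1);                         // blocks until job 1 finishes

        JobConf job2 = new JobConf(Step2.class);        // Step2: hypothetical second job
        FileInputFormat.setInputPaths(job2, intermediate);
        FileOutputFormat.setOutputPath(job2, output);
        JobClient.runJob(job2);
    }
}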

6. Chapter 8 of Pro Hadoop covers this topic.

7.http://old.nabble.com/Customizing-machines-to-use-for-different-jobs-td23864519.html#a23864519
Customizing machines to use for different jobs:
Unfortunately there is no built-in way of doing this.  You'd have to
instantiate two entirely separate Hadoop clusters to accomplish what you're
trying to do, which isn't an uncommon thing to do.

I'm not sure why you're hoping to have this behavior, but the fair share
scheduler might be helpful to you.  It lets you essentially divvy up your
cluster into queues, where each queue has its own "chunk" of the cluster.
When resources are available outside of the "chunk," then jobs can span into
other queues' space.

Cloudera's Distribution for Hadoop (<http://www.cloudera.com/hadoop>)
includes the fair share scheduler.  I recommend using our distribution,
otherwise here is the fair share JIRA:

<http://issues.apache.org/jira/browse/HADOOP-3746>

8.http://old.nabble.com/How-to-run-many-jobs-at-the-same-time--td23151917.html#a23151917
How to run many jobs at the same time? A JobControl example (see the sketch below).
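A rough sketch of that JobControl pattern (org.apache.hadoop.mapred.jobcontrol): independent jobs added to the same controller run in parallel, and addDependingJob expresses the Group1 -> Group2 ordering from item 2 above. The three JobConf objects are assumed to be configured elsewhere:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class GroupedJobs {
    public static void runGroups(JobConf conf1a, JobConf conf1b, JobConf conf2) throws Exception {
        Job group1a = new Job(conf1a);       // Group1 jobs: independent of each other
        Job group1b = new Job(conf1b);
        Job group2 = new Job(conf2);         // Group2 job: must wait for all of Group1
        group2.addDependingJob(group1a);
        group2.addDependingJob(group1b);

        JobControl control = new JobControl("grouped-jobs");
        control.addJob(group1a);
        control.addJob(group1b);
        control.addJob(group2);

        Thread runner = new Thread(control); // JobControl is a Runnable
        runner.start();
        while (!control.allFinished()) {
            Thread.sleep(1000);
        }
        control.stop();
    }
}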

9.http://issues.apache.org/jira/browse/HADOOP-5170
Set max map/reduce tasks on a per-job basis, either per-node or cluster-wide

Once the tasktracker starts, the maximum number of tasks per node cannot be changed. In my case, I solved this by stopping and starting MapReduce (stop-mapred.sh, start-mapred.sh) between jobs.
There is a JIRA, so this may change in the future: HADOOP-5170 (
http://issues.apache.org/jira/browse/HADOOP-5170)
This may already have been fixed.

10.Oozie, Hadoop Workflow System
https://issues.apache.org/jira/browse/HADOOP-5303

11.http://kdpeterson.net/blog/2009/11/hadoop-workflow-tools-survey.html
Hadoop Workflow Tools Survey
Very clear about job scheduling.
A video: http://developer.yahoo.net/blogs/theater/archives/2009/08/hadoop_summit_workflow_oozie.html

12.http://wiki.dspace.org/index.php/Creating_and_Applying_Patches_in_Eclipse
Creating and Applying Patches in Eclipse
http://www.ibm.com/developerworks/cn/opensource/os-eclipse-galileopatch/

13.JobControl:http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/mapred/jobcontrol/JobControl.html

14. http://kdpeterson.net/blog/2009/11/hadoop-workflow-tools-survey.html
By default, Hadoop uses a FIFO scheduler, but there are two more advanced schedulers that are widely used. The Capacity Scheduler focuses on guaranteeing that the various users of a cluster get access to their guaranteed numbers of slots, while the Fair Scheduler focuses on providing good latency for small jobs while long-running large jobs share the same cluster. These schedulers closely parallel processor scheduling, with Hadoop jobs corresponding to processes and the map and reduce tasks corresponding to time slices.
