HowManyMapsAndReduces
Partitioning your job into maps and reduces
Picking the appropriate size for the tasks for your job can radically change the performance of Hadoop. Increasing the number of tasks increases the framework overhead, but improves load balancing and lowers the cost of failures. At one extreme is the 1 map/1 reduce case where nothing is distributed. The other extreme is to have 1,000,000 maps / 1,000,000 reduces where the framework runs out of resources for the overhead.
Number of Maps
The number of maps is usually driven by the number of DFS blocks in the input files, although that causes people to adjust their DFS block size to adjust the number of maps. The right level of parallelism for maps seems to be around 10-100 maps/node, although we have taken it up to 300 or so for very CPU-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.
Actually controlling the number of maps is subtle. The mapred.map.tasks parameter is just a hint to the InputFormat for the number of maps. The default InputFormat behavior is to split the total number of bytes into the right number of fragments. However, in the default case the DFS block size of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size. Thus, if you expect 10TB of input data and have 128MB DFS blocks, you'll end up with 82k maps, unless your mapred.map.tasks is even larger. Ultimately the InputFormat determines the number of maps.
The number of map tasks can also be increased manually using JobConf's conf.setNumMapTasks(int num). This can be used to increase the number of map tasks, but will not set the number below that which Hadoop determines via splitting the input data.
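For instance, here is a minimal sketch (old mapred API; the class name and the 64MB/2,000 values are made up for illustration) of nudging both knobs from a job driver:

import org.apache.hadoop.mapred.JobConf;

public class MapCountHint {
    public static void main(String[] args) {
        JobConf conf = new JobConf(MapCountHint.class);

        // Lower bound on the split size in bytes (mapred.min.split.size).
        conf.set("mapred.min.split.size", String.valueOf(64L * 1024 * 1024));

        // A hint to the InputFormat: it can raise the number of maps, but it
        // cannot push the count below what splitting the input data produces.
        conf.setNumMapTasks(2000);
    }
}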
Number of Reduces
The right number of reduces seems to be 0.95 or 1.75 * (nodes * mapred.tasktracker.tasks.maximum). At 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. At 1.75 the faster nodes will finish their first round of reduces and launch a second round of reduces, doing a much better job of load balancing.
Currently the number of reduces is limited to roughly 1000 by the buffer size for the output files (io.buffer.size * 2 * numReduces << heapSize). This will be fixed at some point, but until it is, it provides a pretty firm upper bound.
The number of reduces also controls the number of output files in the output directory, but usually that is not important because the next map/reduce step will split them into even smaller splits for the maps.
The number of reduce tasks can also be increased in the same way as the map tasks, via JobConf's conf.setNumReduceTasks(int num).
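As a sketch of the 0.95/1.75 heuristic (old mapred API; the 10-node cluster and 2 reduce slots per node are assumptions for illustration):

import org.apache.hadoop.mapred.JobConf;

public class ReduceCountHeuristic {
    public static void main(String[] args) {
        // Assumed cluster shape: 10 nodes, each TaskTracker configured with
        // mapred.tasktracker.reduce.tasks.maximum = 2.
        int nodes = 10;
        int reduceSlotsPerNode = 2;

        // 0.95 lets all reduces launch in one wave as the maps finish;
        // 1.75 trades that for better load balancing across two waves.
        int reduces = (int) (0.95 * nodes * reduceSlotsPerNode);

        JobConf conf = new JobConf(ReduceCountHeuristic.class);
        conf.setNumReduceTasks(reduces);
    }
}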
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>2</value>
<description>The maximum number of reduce tasks that will be run
simultaneously by a task tracker.
</description>
</property>
# host:path where hadoop code should be rsync'd from. Unset by default.
# export HADOOP_MASTER=master:/home/$USER/src/hadoop
Introduced recovery of jobs when JobTracker restarts. This facility is off by default.
Introduced config parameters "mapred.jobtracker.restart.recover", "mapred.jobtracker.job.history.block.size", and "mapred.jobtracker.job.history.buffer.size".
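Since the facility is off by default, enabling it presumably means setting the first of those parameters on the JobTracker; a sketch in the same property style as the other snippets here (not verified against any particular release):

<property>
<name>mapred.jobtracker.restart.recover</name>
<value>true</value>
<description>Enable recovery of running jobs when the JobTracker restarts
(off by default).
</description>
</property>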
0-1246359584298, infoPort=50075, ipcPort=50020): Got exception while serving blk_-5911099437886836280_1292 to /172.16.100.165:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/
172.16.100.165:50010 remote=/172.16.100.165:50930]
at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
at java.lang.Thread.run(Thread.java:619)
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx800M -server</value>
</property>
When I use Nutch 1.0, I get this error:
Hadoop java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232) while indexing.
% hadoop jar job.jar MaxTemperatureByStationNameUsingDistributedCacheFile \
  -files input/ncdc/metadata/stations-fixed-width.txt input/ncdc/all output
public void configure(JobConf conf) {
  // Load the station metadata from the file that the distributed cache
  // placed in the task's working directory.
  metadata = new NcdcStationMetadata();
  try {
    metadata.initialize(new File("stations-fixed-width.txt"));
  } catch (IOException e) {
    throw new RuntimeException(e);
  }
}
There are web-based interfaces to both the JobTracker (MapReduce master) and the NameNode (HDFS master) which display status pages about the state of the entire system. By default, these are located at http://job.tracker.addr:50030/ and http://name.node.addr:50070/.
java.io.IOException: Task process exit with nonzero status of 255.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:424)
Set mapred.jobtracker.retirejob.interval and mapred.userlog.retain.hours to a higher value. By default, their values are 24 hours. These might be the reason for the failure, though I'm not sure.
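For example, a sketch that raises both to 72 hours (the values are illustrative, not recommendations; the retirejob interval is specified in milliseconds):

<property>
<name>mapred.jobtracker.retirejob.interval</name>
<value>259200000</value>
<description>Retire completed jobs after 72 hours instead of the default 24
(value in milliseconds; illustrative only).
</description>
</property>
<property>
<name>mapred.userlog.retain.hours</name>
<value>72</value>
<description>Retain task user logs for 72 hours instead of the default 24
(illustrative only).
</description>
</property>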
FileInputFormat input splits: (see The Definitive Guide, p. 190)
mapred.min.split.size: default=1, the smallest valid size in bytes for a file split.
mapred.max.split.size: default=Long.MAX_VALUE, the largest valid size in bytes for a file split.
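These two properties combine with the HDFS block size through FileInputFormat's split-size formula, max(minimumSize, min(maximumSize, blockSize)), as described in the same chapter; a small sketch with made-up values:

public class SplitSizeFormula {
    public static void main(String[] args) {
        // Illustrative values: a 128MB HDFS block and the property defaults.
        long blockSize = 128L * 1024 * 1024;  // dfs.block.size
        long minSize = 1L;                    // mapred.min.split.size
        long maxSize = Long.MAX_VALUE;        // mapred.max.split.size

        // With the defaults, the split size works out to the block size.
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
        System.out.println("split size = " + splitSize + " bytes");
    }
}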
JobConf conf = new JobConf(ProductMR.class);
conf.setJobName("ProductMR");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Product.class);
conf.setMapperClass(Map.class);
conf.setReducerClass(Reduce.class);
conf.setMapOutputCompressorClass(DefaultCodec.class);
conf.setInputFormat(SequenceFileInputFormat.class);
conf.setOutputFormat(SequenceFileOutputFormat.class);
String objpath = "abc1";
SequenceFileInputFormat.addInputPath(conf, new Path(objpath));
SkipBadRecords.setMapperMaxSkipRecords(conf, Long.MAX_VALUE);
SkipBadRecords.setAttemptsToStartSkipping(conf, 0);
SkipBadRecords.setSkipOutputPath(conf, new Path("data/product/skip/"));
String output = "abc";
SequenceFileOutputFormat.setOutputPath(conf, new Path(output));
JobClient.runJob(conf);
bin/hadoop-daemon.sh start datanode
bin/hadoop-daemon.sh start jobtracker
sudo /etc/init.d/iptables stop
When you create a table, Hive actually stores the location of the table (e.g. hdfs://ip:port/user/root/...) in the SDS and DBS tables in the metastore. So when I bring up a new cluster the master has a new IP, but Hive's metastore is still pointing to the locations within the old cluster. I could modify the metastore to update the IP every time I bring up a cluster, but the easier and simpler solution was to just use an elastic IP for the master.
Your Hadoop namespaceID became corrupted. Unfortunately, the easiest thing to do is to reformat the HDFS.
bin/stop-all.sh
rm -Rf /tmp/hadoop-your-username/*
bin/hadoop namenode -format
bin/hadoop jar contrib/hadoop-0.15.2-streaming.jar \
  -mapper $HOME/proj/hadoop/multifetch.py \
  -reducer $HOME/proj/hadoop/reducer.py \
  -input urls/* \
  -output titles
> 09/08/31 18:25:45 INFO hdfs.DFSClient: Abandoning block blk_-8575812198227241296_1001
> 09/08/31 18:25:51 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException:
Bad connect ack with firstBadLink 192.168.1.16:50010
> 09/08/31 18:25:51 INFO hdfs.DFSClient: Abandoning block blk_-2932256218448902464_1001
> 09/08/31 18:25:57 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException:
Bad connect ack with firstBadLink 192.168.1.11:50010
> 09/08/31 18:25:57 INFO hdfs.DFSClient: Abandoning block blk_-1014449966480421244_1001
> 09/08/31 18:26:03 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException:
Bad connect ack with firstBadLink 192.168.1.16:50010
> 09/08/31 18:26:03 INFO hdfs.DFSClient: Abandoning block blk_7193173823538206978_1001
> 09/08/31 18:26:09 WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable
to create new block.
> at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2731)
> at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
> at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2182)
>
> 09/08/31 18:26:09 WARN hdfs.DFSClient: Error Recovery for block blk_7193173823538206978_1001
bad datanode[2] nodes == null
> 09/08/31 18:26:09 WARN hdfs.DFSClient: Could not get block locations. Source file "/user/umer/8GB_input"
- Aborting...
> put: Bad connect ack with firstBadLink 192.168.1.16:50010
Exception in thread "main" java.lang.NullPointerException
at sun.jvmstat.perfdata.monitor.protocol.local.LocalVmManager.activeVms(LocalVmManager.java:127)
at sun.jvmstat.perfdata.monitor.protocol.local.MonitoredHostProvider.activeVms(MonitoredHostProvider.java:133)
at sun.tools.jps.Jps.main(Jps.java:45)