TotalOrderPartitioner Cannot Find _partition.lst File

Question:

I'm using the Cloudera VM (cloudera-demo-vm-cdh3u3-vmware) to run the
TotalOrderPartitioner class for a specific problem.

When I run the code, it cannot find the _partition.lst file. The file is
saved to a location other than the one I specify.

The path is passed to the DistributedCache as
/ldcloud/results/test-12345/partition-strips/_partition.lst, but it ends up
in /user/root and the job dies.

I've researched the problem online, and people point to the job running
locally, but I'm not using LocalJobRunner to execute it. Others suggest
that Hadoop might be running in standalone mode, but I checked the VM and
all of the daemons are running, so I assume the Cloudera VM is in
pseudo-distributed mode. The Java processes look like the following:

2610 JobTracker
2738 FlumeMaster
2858 DataNode
3393 RunJar
2798 FlumeNode
3539 Sqoop
6556 Jps
3071 NameNode
2692 FlumeWatchdog
3310 TaskTracker
3173 SecondaryNameNode
3518 Bootstrap

Here is how I set up and run the job, which mirrors the example in the
latest O'Reilly Hadoop book:

Configuration conf = new Configuration();
conf.set("mapred.reduce.tasks", maxPartitionerReduceTasks);

Job job = new Job(conf);
job.setJobName("STR Centroid Partitioner");

job.setJarByClass(STRStripPartitioner.class);

job.setInputFormatClass(SequenceFileInputFormat.class);

job.setOutputKeyClass(DoubleWritable.class);
job.setOutputValueClass(Text.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);

SequenceFileOutputFormat.setCompressOutput(job, true);
SequenceFileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);

FileInputFormat.addInputPath(job, inputPath);
FileOutputFormat.setOutputPath(job, outputPath);

inputPath = inputPath.makeQualified(inputPath.getFileSystem(conf));

Path partitionFile = new Path(inputPath, "_partitions");
System.out.println("Partition file path: " + partitionFile);

TotalOrderPartitioner.setPartitionFile(conf, partitionFile);

job.setPartitionerClass(TotalOrderPartitioner.class);

System.out.println("Partition file: " + partitionFile.toString());
URI partitionUri = new URI(partitionFile.toString() + "#_partitions");
DistributedCache.addCacheFile(partitionUri, conf);
DistributedCache.createSymlink(conf);

InputSampler.Sampler sampler =
    new InputSampler.RandomSampler(frequency, numberSamples, maxSplitsSampled);

InputSampler.writePartitionFile(job, sampler);

job.waitForCompletion(true);

=======================

Here is the output. Another job runs prior to the TotalOrderPartitioner
job and stores this job's input data in HDFS. Any help would be greatly
appreciated. TIA.

[root@localhost cloudera]# hadoop jar packing-1.0-jar-with-dependencies.jar
TestTotalOrderPartitioner test_ttop.properties
Test name: test-ttop-1351866931010
Number of data files: 10
Source data path: hdfs://localhost/testdata
12/11/02 10:35:31 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/11/02 10:35:31 INFO input.FileInputFormat: Total input paths to process : 10
12/11/02 10:35:31 WARN snappy.LoadSnappy: Snappy native library is available
12/11/02 10:35:31 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/11/02 10:35:31 INFO snappy.LoadSnappy: Snappy native library loaded
12/11/02 10:35:31 INFO mapred.JobClient: Running job: job_201211020952_0003
12/11/02 10:35:32 INFO mapred.JobClient: map 0% reduce 0%
12/11/02 10:35:39 INFO mapred.JobClient: map 20% reduce 0%
12/11/02 10:35:44 INFO mapred.JobClient: map 30% reduce 0%
12/11/02 10:35:49 INFO mapred.JobClient: map 40% reduce 0%
12/11/02 10:35:52 INFO mapred.JobClient: map 50% reduce 0%
12/11/02 10:35:53 INFO mapred.JobClient: map 60% reduce 0%
12/11/02 10:35:55 INFO mapred.JobClient: map 60% reduce 16%
12/11/02 10:35:57 INFO mapred.JobClient: map 80% reduce 16%
12/11/02 10:35:58 INFO mapred.JobClient: map 80% reduce 20%
12/11/02 10:36:01 INFO mapred.JobClient: map 90% reduce 26%
12/11/02 10:36:02 INFO mapred.JobClient: map 100% reduce 26%
12/11/02 10:36:07 INFO mapred.JobClient: map 100% reduce 100%
12/11/02 10:36:07 INFO mapred.JobClient: Job complete: job_201211020952_0003
12/11/02 10:36:07 INFO mapred.JobClient: Counters: 26
12/11/02 10:36:07 INFO mapred.JobClient: Job Counters
12/11/02 10:36:07 INFO mapred.JobClient: Launched reduce tasks=1
12/11/02 10:36:07 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=55678
12/11/02 10:36:07 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/11/02 10:36:07 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/11/02 10:36:07 INFO mapred.JobClient: Launched map tasks=10
12/11/02 10:36:07 INFO mapred.JobClient: Data-local map tasks=10
12/11/02 10:36:07 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=27452
12/11/02 10:36:07 INFO mapred.JobClient: FileSystemCounters
12/11/02 10:36:07 INFO mapred.JobClient: FILE_BYTES_READ=3731780
12/11/02 10:36:07 INFO mapred.JobClient: HDFS_BYTES_READ=3553934
12/11/02 10:36:07 INFO mapred.JobClient: FILE_BYTES_WRITTEN=8079026
12/11/02 10:36:07 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=3810562
12/11/02 10:36:07 INFO mapred.JobClient: Map-Reduce Framework
12/11/02 10:36:07 INFO mapred.JobClient: Map input records=10000
12/11/02 10:36:07 INFO mapred.JobClient: Reduce shuffle bytes=3731834
12/11/02 10:36:07 INFO mapred.JobClient: Spilled Records=20000
12/11/02 10:36:07 INFO mapred.JobClient: Map output bytes=3698154
12/11/02 10:36:07 INFO mapred.JobClient: CPU time spent (ms)=8030
12/11/02 10:36:07 INFO mapred.JobClient: Total committed heap usage (bytes)=1348575232
12/11/02 10:36:07 INFO mapred.JobClient: Combine input records=0
12/11/02 10:36:07 INFO mapred.JobClient: SPLIT_RAW_BYTES=1200
12/11/02 10:36:07 INFO mapred.JobClient: Reduce input records=10000
12/11/02 10:36:07 INFO mapred.JobClient: Reduce input groups=6405
12/11/02 10:36:07 INFO mapred.JobClient: Combine output records=0
12/11/02 10:36:07 INFO mapred.JobClient: Physical memory (bytes) snapshot=1775706112
12/11/02 10:36:07 INFO mapred.JobClient: Reduce output records=10000
12/11/02 10:36:07 INFO mapred.JobClient: Virtual memory (bytes) snapshot=5733257216
12/11/02 10:36:07 INFO mapred.JobClient: Map output records=10000
Frequency: 0.1
Max Splits Sampled: 100
Number Samples: 1000
Max Partitioner Reduce Tasks: 1
Partition file path: hdfs://localhost/results/test-ttop-1351866931010/centroid/_partitions
Partition file: hdfs://localhost/results/test-ttop-1351866931010/centroid/_partitions
12/11/02 10:36:07 INFO input.FileInputFormat: Total input paths to process : 1
12/11/02 10:36:08 INFO partition.InputSampler: Using 994 samples
12/11/02 10:36:08 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
12/11/02 10:36:08 INFO compress.CodecPool: Got brand-new compressor
12/11/02 10:36:08 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/11/02 10:36:08 INFO input.FileInputFormat: Total input paths to process : 1
12/11/02 10:36:08 INFO mapred.JobClient: Running job: job_201211020952_0004
12/11/02 10:36:09 INFO mapred.JobClient: map 0% reduce 0%
12/11/02 10:36:16 INFO mapred.JobClient: Task Id : attempt_201211020952_0004_m_000000_0, Status : FAILED
java.lang.IllegalArgumentException: Can't read partitions file
    at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:108)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:639)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: java.io.FileNotFoundException: File _partition.lst does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:408)
    ...
12/11/02 10:36:22 INFO mapred.JobClient: Task Id : attempt_201211020952_0004_m_000000_1, Status : FAILED
java.lang.IllegalArgumentException: Can't read partitions file
    at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:108)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:639)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: java.io.FileNotFoundException: File _partition.lst does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:408)
    ...
12/11/02 10:36:27 INFO mapred.JobClient: Task Id : attempt_201211020952_0004_m_000000_2, Status : FAILED
java.lang.IllegalArgumentException: Can't read partitions file
    at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:108)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:639)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: java.io.FileNotFoundException: File _partition.lst does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:408)
    ...
12/11/02 10:36:33 INFO mapred.JobClient: Job complete: job_201211020952_0004
12/11/02 10:36:33 INFO mapred.JobClient: Counters: 7
12/11/02 10:36:33 INFO mapred.JobClient: Job Counters
12/11/02 10:36:33 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=22427
12/11/02 10:36:33 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/11/02 10:36:33 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/11/02 10:36:33 INFO mapred.JobClient: Launched map tasks=4
12/11/02 10:36:33 INFO mapred.JobClient: Data-local map tasks=4
12/11/02 10:36:33 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
12/11/02 10:36:33 INFO mapred.JobClient: Failed map tasks=1

Complete.


Answer:

I finally got it to work. Here is the final code in case anyone else has
the same issue:

Configuration conf = new Configuration();
Job job = new Job(conf, "Centroid Partitioner");

FileInputFormat.addInputPath(job, inputPath);
FileOutputFormat.setOutputPath(job, outputPath);

job.setJarByClass(StripPartitioner.class);
job.setInputFormatClass(SequenceFileInputFormat.class);
job.setOutputKeyClass(DoubleWritable.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);

SequenceFileOutputFormat.setCompressOutput(job, true);
SequenceFileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);

job.setPartitionerClass(TotalOrderPartitioner.class);
job.setNumReduceTasks(5); // number of partitions to create

InputSampler.Sampler sampler =
    new InputSampler.RandomSampler(frequency, numberSamples, maxSplitsSampled);

inputPath = inputPath.makeQualified(FileSystem.get(job.getConfiguration()));
Path partitionFile = new Path(inputPath, "_partitions");
System.out.println("Partition file path: " + partitionFile);

TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile);
InputSampler.writePartitionFile(job, sampler);

URI partitionUri = new URI(partitionFile.toString() + "#_partitions");
DistributedCache.addCacheFile(partitionUri, job.getConfiguration());
DistributedCache.createSymlink(job.getConfiguration());

job.waitForCompletion(true);
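
One detail worth noting: TotalOrderPartitioner.setPartitionFile is called
on job.getConfiguration() before InputSampler.writePartitionFile(job,
sampler), since the sampler reads the target path out of the job's
configuration when it writes the partition file.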

Note: when running TotalOrderPartitioner with the classes from the
mapreduce package (rather than the old mapred package), the conf must be
obtained via job.getConfiguration(); otherwise the job fails with the
"cannot find _partition.lst file" error.
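
The pitfall is easy to reproduce in isolation. Below is a minimal sketch
(assuming the CDH3-era org.apache.hadoop.mapreduce API; the class name and
partition file path are illustrative, not from the original post):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class ConfPitfall {
    public static void main(String[] args) throws Exception {
        // Illustrative partition file location in HDFS.
        Path partitionFile = new Path("hdfs://localhost/results/_partitions");

        // Broken: new Job(conf) copies the Configuration, so anything set
        // through the original conf object afterwards never reaches the job
        // that is actually submitted. The partition file path is silently
        // dropped and the tasks fall back to the default _partition.lst.
        Configuration conf = new Configuration();
        Job job = new Job(conf, "total-order-example");
        TotalOrderPartitioner.setPartitionFile(conf, partitionFile); // lost

        // Working: set the property on the Configuration held by the Job.
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile);
    }
}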
