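Mahout provides two ways to compute frequent itemsets: the in-memory FPG (FP-Growth) and the distributed PFP (Parallel FP-Growth). PFP also works by dividing the features into groups and then running the FPG algorithm independently on each node. PFP uses 50 groups by default; if the number of items is very large, you may need to consider changing this value.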
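To make the grouping idea concrete, here is a minimal sketch in plain Java with no Mahout dependency. It assumes every item already has a rank in the frequency-sorted item list and assigns contiguous blocks of ranks to groups. The class GroupingSketch and its two methods are hypothetical names used only for illustration, and Mahout's own assignment scheme may differ.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of splitting ranked items into groups; not Mahout code.
class GroupingSketch {

  // Items per group, rounded up so that every item lands in some group.
  static int maxPerGroup(int numItems, int numGroups) {
    return (numItems + numGroups - 1) / numGroups;
  }

  // Group id for the item at the given rank (rank 0 = most frequent item).
  static int groupOf(int rank, int maxPerGroup) {
    return rank / maxPerGroup;
  }

  public static void main(String[] args) {
    // Assumed frequency order, most frequent first (ties ordered arbitrarily).
    List<String> fList = Arrays.asList("B", "D", "A", "E", "C");
    int perGroup = maxPerGroup(fList.size(), 2);
    for (int rank = 0; rank < fList.size(); rank++) {
      System.out.println(fList.get(rank) + " -> group " + groupOf(rank, perGroup));
    }
  }
}
```

With a very large item list, the per-group size grows unless the number of groups is raised, which is one reason the default of 50 may be worth revisiting.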
Let's first look at the FPG test code from Mahout 0.5:
public void testMaxHeapFPGrowth() throws Exception {
  FPGrowth<String> fp = new FPGrowth<String>();

  // Eight transactions, each with a weight of 1.
  Collection<Pair<List<String>,Long>> transactions = new ArrayList<Pair<List<String>,Long>>();
  transactions.add(new Pair<List<String>,Long>(Arrays.asList("E", "A", "D", "B"), 1L));
  transactions.add(new Pair<List<String>,Long>(Arrays.asList("D", "A", "C", "E", "B"), 1L));
  transactions.add(new Pair<List<String>,Long>(Arrays.asList("C", "A", "B", "E"), 1L));
  transactions.add(new Pair<List<String>,Long>(Arrays.asList("B", "A", "D"), 1L));
  transactions.add(new Pair<List<String>,Long>(Arrays.asList("D"), 1L));
  transactions.add(new Pair<List<String>,Long>(Arrays.asList("D", "B"), 1L));
  transactions.add(new Pair<List<String>,Long>(Arrays.asList("A", "D", "E"), 1L));
  transactions.add(new Pair<List<String>,Long>(Arrays.asList("B", "C"), 1L));

  Path path = getTestTempFilePath("fpgrowthTest.dat");
  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(conf);

  SequenceFile.Writer writer =
      new SequenceFile.Writer(fs, conf, path, Text.class, TopKStringPatterns.class);
  fp.generateTopKFrequentPatterns(
      transactions.iterator(),
      fp.generateFList(transactions.iterator(), 3),  // frequency list with minSupport = 3
      3,                                             // minSupport
      100,                                           // k, the number of top patterns to keep
      new HashSet<String>(),
      new StringOutputConverter(new SequenceFileOutputCollector<Text,TopKStringPatterns>(writer)),
      new ContextStatusUpdater(null));
  writer.close();

  List<Pair<String, TopKStringPatterns>> frequentPatterns = FPGrowth.readFrequentPattern(conf, path);
  assertEquals(
      "[(C,([B, C],3)), "
          + "(E,([A, E],4), ([A, B, E],3), ([A, D, E],3)), "
          + "(A,([A],5), ([A, D],4), ([A, E],4), ([A, B],4), ([A, B, E],3), ([A, D, E],3), ([A, B, D],3)), "
          + "(D,([D],6), ([B, D],4), ([A, D],4), ([A, D, E],3), ([A, B, D],3)), "
          + "(B,([B],6), ([A, B],4), ([B, D],4), ([A, B, D],3), ([A, B, E],3), ([B, C],3))]",
      frequentPatterns.toString());
}
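The support values in the expected string can be checked by hand. The following is a brute-force sketch in plain Java, not part of Mahout, that counts how many of the eight test transactions contain a given pattern. The class name SupportCheck is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Brute-force support counting over the eight test transactions; illustration only.
class SupportCheck {

  static final List<List<String>> TRANSACTIONS = Arrays.asList(
      Arrays.asList("E", "A", "D", "B"),
      Arrays.asList("D", "A", "C", "E", "B"),
      Arrays.asList("C", "A", "B", "E"),
      Arrays.asList("B", "A", "D"),
      Arrays.asList("D"),
      Arrays.asList("D", "B"),
      Arrays.asList("A", "D", "E"),
      Arrays.asList("B", "C"));

  // Counts the transactions that contain every item of the pattern.
  static long support(List<String> pattern) {
    long count = 0;
    for (List<String> transaction : TRANSACTIONS) {
      if (transaction.containsAll(pattern)) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    System.out.println(support(Arrays.asList("A")));            // 5
    System.out.println(support(Arrays.asList("B", "C")));       // 3
    System.out.println(support(Arrays.asList("A", "B", "E")));  // 3
    System.out.println(support(Arrays.asList("C", "D")));       // 1, below minSupport
  }
}
```

A pattern such as [C, D] occurs in only one transaction, so with a minimum support of 3 it does not appear in the expected output.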
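The test above writes its results to a SequenceFile through the StringOutputConverter wrapper. If you only want to see the output on the console, you can adapt that wrapper; the class below prints the frequent itemsets to the console: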
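import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.mahout.common.Pair;

import java.io.IOException;
import java.io.PrintStream;
import java.util.List;

public final class PrintStreamConverter implements OutputCollector<String, List<Pair<List<String>, Long>>> {

  private final PrintStream collector;

  public PrintStreamConverter(PrintStream collector) {
    this.collector = collector;
  }

  @Override
  public void collect(String key, List<Pair<List<String>, Long>> values) throws IOException {
    // One line per pattern: "<feature>: <item,item,...>\t<support>"
    for (Pair<List<String>, Long> pair : values) {
      collector.print(key + ": " + StringUtils.join(pair.getFirst(), ",") + "\t" + pair.getSecond() + "\n");
    }
  }
}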
The FPG call can then be changed to:
fp.generateTopKFrequentPatterns(
    transactions.iterator(),
    fp.generateFList(transactions.iterator(), 3),
    3,
    100,
    new HashSet<String>(),
    new PrintStreamConverter(System.out),
    new ContextStatusUpdater(null));
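Sometimes you want to run FPG in memory on a cluster node, which needs some extra wrapping. The class below provides that wrapper and emits the frequent itemsets as <Text, Text> pairs: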
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.mahout.common.Pair;

import java.io.IOException;
import java.util.List;

public final class TextOutputConverter implements OutputCollector<String, List<Pair<List<String>, Long>>> {

  private final OutputCollector<Text, Text> collector;

  public TextOutputConverter(OutputCollector<Text, Text> collector) {
    this.collector = collector;
  }

  @Override
  public void collect(String key, List<Pair<List<String>, Long>> values) throws IOException {
    // Key: "<feature>,<item;item;...>"; value: the support count as text.
    for (Pair<List<String>, Long> pair : values) {
      collector.collect(new Text(key + "," + StringUtils.join(pair.getFirst(), ";")),
                        new Text(pair.getSecond().toString()));
    }
  }
}
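A downstream consumer has to undo this encoding. The sketch below, in plain Java with no Hadoop dependency, parses a key of the form "feature,item;item;..." and a value holding the count, as written by TextOutputConverter. The class PatternRecord is a hypothetical name, and the parser assumes that neither the feature nor any item contains "," or ";".

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical holder for one record emitted by TextOutputConverter.
class PatternRecord {
  final String feature;
  final List<String> items;
  final long support;

  PatternRecord(String feature, List<String> items, long support) {
    this.feature = feature;
    this.items = items;
    this.support = support;
  }

  // Parses a key such as "A,A;B;E" together with a value such as "3".
  static PatternRecord parse(String key, String value) {
    int comma = key.indexOf(',');
    if (comma < 0) {
      throw new IllegalArgumentException("missing ',' in key: " + key);
    }
    String feature = key.substring(0, comma);
    List<String> items = Arrays.asList(key.substring(comma + 1).split(";"));
    return new PatternRecord(feature, items, Long.parseLong(value.trim()));
  }

  public static void main(String[] args) {
    PatternRecord record = parse("A,A;B;E", "3");
    System.out.println(record.feature + " " + record.items + " " + record.support);
  }
}
```

If items may contain these separator characters, a different delimiter or an escaping scheme would be needed on both sides.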
The code in reduce is as follows:
public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter)
    throws IOException {
  FPGrowth<String> fp = new FPGrowth<String>();

  // Each value is one transaction with space-separated items; all of them are held in memory.
  Collection<Pair<List<String>, Long>> transactions = new ArrayList<Pair<List<String>, Long>>();
  while (values.hasNext()) {
    List<String> list = new ArrayList<String>();
    String[] parts = values.next().toString().split(" ");
    Collections.addAll(list, parts);
    transactions.add(new Pair<List<String>, Long>(list, 1L));
  }

  fp.generateTopKFrequentPatterns(
      transactions.iterator(),
      fp.generateFList(transactions.iterator(), 5),
      5,
      1000,
      new HashSet<String>(),
      new TextOutputConverter(output),
      new ContextStatusUpdater(null));
}
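The reducer splits each value on a single space, so repeated spaces or tabs in the input would produce empty-string items. If the input is not guaranteed to be clean, a slightly more defensive tokenizer can be used. This is a sketch in plain Java; the class TransactionParser and the method toTransaction are hypothetical names, not part of Mahout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that turns one input line into a list of items.
class TransactionParser {

  // Splits on any run of whitespace and returns an empty list for blank lines.
  static List<String> toTransaction(String line) {
    String trimmed = line.trim();
    if (trimmed.isEmpty()) {
      return new ArrayList<String>();
    }
    return new ArrayList<String>(Arrays.asList(trimmed.split("\\s+")));
  }

  public static void main(String[] args) {
    System.out.println(toTransaction("a  b\tc"));                 // [a, b, c]
    System.out.println(Arrays.asList("a  b".split(" ")).size());  // 3, includes an empty item
  }
}
```

Blank lines yield an empty list, which the caller can skip before adding a transaction.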
The FP algorithm also has some tunable parameters. They are wrapped in the Parameters class, which is a collection of <key, value> pairs.
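numGroups: the number of feature groups, 50 by default. For large item sets, a larger value may work better.
input: the input path
output: the output path
minSupport: the minimum support, 3 by default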
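Because the parameters are a set of string key/value pairs, reading them with defaults can be sketched with a plain Map. The Map below only stands in for Mahout's Parameters class and does not reproduce its API; the class ParamsSketch and the helper intParam are hypothetical, and the defaults 50 and 3 are taken from the list above.

```java
import java.util.HashMap;
import java.util.Map;

// A plain Map standing in for a string key/value parameter collection; not Mahout's Parameters.
class ParamsSketch {

  // Returns the integer stored under key, or defaultValue when the key is absent.
  static int intParam(Map<String, String> params, String key, int defaultValue) {
    String raw = params.get(key);
    return raw == null ? defaultValue : Integer.parseInt(raw.trim());
  }

  public static void main(String[] args) {
    Map<String, String> params = new HashMap<String, String>();
    params.put("input", "/path/to/input");    // placeholder path
    params.put("output", "/path/to/output");  // placeholder path
    params.put("minSupport", "5");

    System.out.println(intParam(params, "numGroups", 50));  // 50, falls back to the default
    System.out.println(intParam(params, "minSupport", 3));  // 5, explicitly set
  }
}
```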