Spark Java Programming Exercises

Today I practiced some common Spark Java operations.
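For the snippets below to compile, roughly the following imports are assumed (standard Spark Java API plus scala.Tuple2):

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;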


// Configure and start a local Spark context
SparkConf conf = new SparkConf();
conf.setAppName("xxxxxx");
conf.setMaster("local");

JavaSparkContext sc = new JavaSparkContext(conf);
sc.setLogLevel("ERROR");   // only show error-level logs in the console
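The input file is not listed in the original notes; judging from the outputs below, test.txt is assumed to contain six space-separated lines:

a 1
b 2
a 3
b 4
c 3
b 9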

JavaRDD<String> rdd1 = sc.textFile("test.txt");
System.out.println("rdd1:"+rdd1.collect());

Output: rdd1:[a 1, b 2, a 3, b 4, c 3, b 9]

map() usage
JavaRDD<List<String>> rdd2 = rdd1.map(w -> Arrays.asList(w.split(" ")));
System.out.println("rdd2:"+rdd2.collect());

Output: rdd2:[[a, 1], [b, 2], [a, 3], [b, 4], [c, 3], [b, 9]]

mapToPair() usage 1
JavaPairRDD<String, String> pairRdd1 = rdd1.mapToPair(new PairFunction<String, String, String>() {
    // split each line: first field becomes the key, second the value
    public Tuple2<String, String> call(String t) throws Exception {
        String[] st = t.split(" ");
        return new Tuple2<>(st[0], st[1]);
    }
});
System.out.println("pairRdd1:"+pairRdd1.collect());

Output: pairRdd1:[(a,1), (b,2), (a,3), (b,4), (c,3), (b,9)]

mapToPair() usage 2
JavaPairRDD<String, String> pairRdd2 = rdd1.mapToPair(new PairFunction<String, String, String>() {
    // same as above, but key and value are swapped
    public Tuple2<String, String> call(String t) throws Exception {
        String[] st = t.split(" ");
        return new Tuple2<>(st[1], st[0]);
    }
});
System.out.println("pairRdd2:"+pairRdd2.collect());
Output: pairRdd2:[(1,a), (2,b), (3,a), (4,b), (3,c), (9,b)]

groupByKey() usage 1
JavaPairRDD<String, Iterable<String>> pairRdd3 = pairRdd1.groupByKey();
System.out.println("pairRdd3:"+pairRdd3.collect());

Output: pairRdd3:[(a,[1, 3]), (b,[2, 4, 9]), (c,[3])]

groupByKey() usage 2
JavaPairRDD<String, Iterable<String>> pairRdd4 = pairRdd2.groupByKey();
System.out.println("pairRdd4:"+pairRdd4.collect());

Output: pairRdd4:[(4,[b]), (2,[b]), (9,[b]), (3,[a, c]), (1,[a])]

keyBy() usage
JavaPairRDD<String, List<String>> pairRdd5 = rdd2.keyBy(new Function<List<String>, String>() {
    // use the second element of each list (the number) as the key
    public String call(List<String> s1) throws Exception {
        return s1.get(1);
    }
});
System.out.println("pairRdd5:"+pairRdd5.collect());

Output: pairRdd5:[(1,[a, 1]), (2,[b, 2]), (3,[a, 3]), (4,[b, 4]), (3,[c, 3]), (9,[b, 9])]



These are the common Spark Java operations covered in this exercise.
Note: Java 8 introduced lambda expressions, which make this kind of code much more concise.
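As a minimal sketch of that point, the anonymous PairFunction and Function classes above can be rewritten with Java 8 lambdas; the equivalents of pairRdd1, pairRdd2, and pairRdd5 would look roughly like this (same logic, same output):

JavaPairRDD<String, String> pairRdd1 = rdd1.mapToPair(t -> {
    String[] st = t.split(" ");
    return new Tuple2<>(st[0], st[1]);   // (word, number)
});

JavaPairRDD<String, String> pairRdd2 = rdd1.mapToPair(t -> {
    String[] st = t.split(" ");
    return new Tuple2<>(st[1], st[0]);   // (number, word)
});

JavaPairRDD<String, List<String>> pairRdd5 = rdd2.keyBy(s1 -> s1.get(1));   // key by the number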




