PageRank on the MapReduce Framework in Practice (Part 1)

1. The data for this exercise was collected with a web crawler; contact me if you need a copy.

A sample of the data:

Export the database data to a text file named userrelation.txt and upload it to HDFS.

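2. Convert the data into links.txt, in a format like the one shown in the figure below. The first column is the ID of the Weibo account owner, and what follows are the IDs of all the accounts that user follows.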
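To make the target format concrete, here is a minimal Hadoop-free sketch of the same grouping step. The class name LinksSketch, the method toLinks, and the sample IDs are all made up for illustration; it also assumes each input line holds a follower ID and a followed ID separated by whitespace, and that the key and the ID list are separated by a tab (the default of Hadoop's text output).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, in-memory illustration of the userrelation.txt -> links.txt step.
class LinksSketch {

    // Groups "follower followed" pairs into "follower<TAB>followed1,followed2,..." lines.
    static List<String> toLinks(List<String> relationLines) {
        Map<String, List<String>> followed = new LinkedHashMap<>();
        for (String line : relationLines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length < 2) {
                continue; // skip blank or malformed lines
            }
            followed.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(parts[1]);
        }
        List<String> links = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : followed.entrySet()) {
            links.add(e.getKey() + "\t" + String.join(",", e.getValue()));
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up IDs, not taken from the crawled data set.
        List<String> sample = Arrays.asList("1001 2001", "1001 2002", "1002 2001");
        toLinks(sample).forEach(System.out::println);
    }
}
```

With the made-up sample above, user 1001 ends up on one line with both followed IDs, and user 1002 on another with a single ID.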

3. Code implementation

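import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
* Processes the Weibo user relations into lines of the form "A B,C,D"
* (a user followed by the users that user follows).
* @author ZD
*/
public class UserRelation {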

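   private static class UserRelationMapper extends Mapper<LongWritable, Text, Text, Text> {

       @Override
       protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
           // Each line is expected to hold a follower ID and a followed ID separated by whitespace
           String[] strs = value.toString().trim().split("\\s+");
           if (strs.length < 2) {
               return; // skip blank or malformed lines
           }
           context.write(new Text(strs[0]), new Text(strs[1])); // pass the follower ID and the followed ID to the reducer
       }
   }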

   private static class UserRelationReducer extends Reducer<Text, Text, Text, Text> {

       @Override
       protected void reduce(Text key, Iterable<Text> datas, Context context) throws IOException, InterruptedException {
           // Join all followed IDs into the form ID1,ID2,ID3...
           StringBuilder sb = new StringBuilder();
           Iterator<Text> it = datas.iterator();
           if (it.hasNext()) {
               sb.append(it.next().toString());
           }
           while (it.hasNext()) {
               sb.append(",").append(it.next().toString());
           }
           context.write(key, new Text(sb.toString()));
       }
   }

   public static void main(String[] args) {
       try {
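           // HadoopCfg is a helper class that is not shown in this post; it supplies the cluster Configuration
           Configuration cfg = HadoopCfg.getConfigration();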
           Job job = Job.getInstance(cfg);
           job.setJobName("UserRelation");
           job.setJarByClass(UserRelation.class);
           job.setMapperClass(UserRelationMapper.class);
           job.setMapOutputKeyClass(Text.class);
           job.setMapOutputValueClass(Text.class);
           job.setReducerClass(UserRelationReducer.class);
           job.setOutputKeyClass(Text.class);
           job.setOutputValueClass(Text.class);
           FileInputFormat.addInputPath(job, new Path("/input/userrelation.txt"));
           FileOutputFormat.setOutputPath(job, new Path("/second/sec2/"));
           System.exit(job.waitForCompletion(true) ? 0 : 1);
       } catch (Exception e) {
           e.printStackTrace();
       }
   }
}

4. A sample of the resulting data:

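5. Next, we need the initial probability distribution. The idea is to count the total number of followers (sum) and then divide 1 by sum to get the initial probability of each follower. Because the resulting values are very small, they are all scaled up by a factor of 10000. The program is fairly simple, so it is not shown here; part of the result is shown in the figure below: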
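Since the original program is not shown, the following is only a rough in-memory sketch of the computation described above, not the author's code. It assumes that "followers" means the distinct IDs in the first column of links.txt; the class name InitialRankSketch, the method initialRanks, and the sample IDs are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: every user starts with the same scaled probability 10000 / sum.
class InitialRankSketch {

    static final double SCALE = 10000.0; // scaling factor mentioned in the text

    // Maps each first-column ID of links.txt to its initial (scaled) probability.
    static Map<String, Double> initialRanks(List<String> linksLines) {
        Set<String> users = new LinkedHashSet<>();
        for (String line : linksLines) {
            String id = line.trim().split("\\s+")[0];
            if (!id.isEmpty()) {
                users.add(id); // assumption: only first-column IDs are counted
            }
        }
        Map<String, Double> ranks = new LinkedHashMap<>();
        for (String id : users) {
            ranks.put(id, SCALE / users.size()); // (1 / sum) * 10000
        }
        return ranks;
    }

    public static void main(String[] args) {
        // Made-up links.txt lines, not taken from the crawled data set.
        List<String> links = Arrays.asList(
                "1001\t2001,2002", "1002\t2001", "1003\t1001", "1004\t1002");
        initialRanks(links).forEach((id, rank) -> System.out.println(id + "\t" + rank));
    }
}
```

In a real job the count would have to be obtained first (for example with a counting pass over the data) before the per-user values can be written out.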

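6. With the data ready, the next step is to compute the support of each user. Put simply, this measures who gets more attention and has more followers. The implementation will be shared in the next post.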

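Final note: I am still a beginner, so corrections are welcome if anything is wrong. Persistence pays off; let's keep at it together.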
