原始数据是json数据,大约有100万条数据,样例数据如下:
{"movie":"608","rate":"4","timeStamp":"978301398","uid":"1"}
{"movie":"1246","rate":"4","timeStamp":"978302091","uid":"1"}
{"movie":"1357","rate":"5","timeStamp":"978298709","uid":"2"}
{"movie":"3068","rate":"4","timeStamp":"978299000","uid":"2"}
{"movie":"1537","rate":"4","timeStamp":"978299620","uid":"2"}
{"movie":"647","rate":"3","timeStamp":"978299351","uid":"2"}
json是一种常用的数据格式,广泛的适用于数据的存储和数据的传输。数据是使用大括号,冒号,双引号,中括号组成,这些元素是可以嵌套的。
电影评分数据包含了电影id,电影评分,评论时间,用户id。
样例结果:
uid=1的前十条数据
{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
{"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}
{"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"}
{"movie":"2804","rate":"5","timeStamp":"978300719","uid":"1"}
。。。此处省略六条。。。
uid=2的前十条数据
{"movie":"1357","rate":"5","timeStamp":"978298709","uid":"2"}
{"movie":"2268","rate":"5","timeStamp":"978299297","uid":"2"}
{"movie":"648","rate":"4","timeStamp":"978299913","uid":"2"}
。。。此处省略n条数据。。。
样例结果:
uid=1 平均分=4.98
uid=2 平均分=4.39
uid=3 平均分=4.87
uid=4 平均分=4.98
uid=5 平均分=5.00
。。。此处省略n条数据。。。
其实就是在问题2的基础上找出平均数比较高的前n条数据。
样例输出:
uid=5 平均分=5.00
uid=329 平均分=4.98
uid=23 平均分=4.95
uid=435 平均分=4.89
uid=324 平均分=4.89
热门的定义: 评论次数多的就是热门
样例输出数据:
movie=217 评论次数:737284
movie=2345 评论次数:733213
movie=748 评论次数:684372
。。。此处省略七条数据。。。
样例数据输出:
movie=5 平均分=5.00
movie=329 平均分=4.98
movie=23 平均分=4.95
movie=435 平均分=4.89
movie=324 平均分=4.89
。。。此处省略5条数据。。。
json解析是把数据解析成对象,所以需要先创建json数据相对应的javabean
json对应的javaBean
package cn.pengpeng.day01.bean;
/**
* json数据对应的javabean
* @author pengpeng
*/
public class RateBean {
//{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
private String movie;
private int rate;
private String timeStamp;
private String uid;
public String getMovie() {
return movie;
}
public void setMovie(String movie) {
this.movie = movie;
}
public int getRate() {
return rate;
}
public void setRate(int rate) {
this.rate = rate;
}
public String getTimeStamp() {
return timeStamp;
}
public void setTimeStamp(String timeStamp) {
this.timeStamp = timeStamp;
}
public String getUid() {
return uid;
}
public void setUid(String uid) {
this.uid = uid;
}
@Override
public String toString() {
return "RateBean [movie=" + movie + ", rate=" + rate + ", timeStamp=" + timeStamp + ", uid=" + uid + "]";
}
}
测试fastjson使用
package cn.pengpeng.day01.test;
import com.alibaba.fastjson.JSON;
import cn.pengpeng.day01.bean.RateBean;
/**
* 测试json数据,json和javabean相互转换
*/
public class TestJson {
public static void main(String[] args) {
String json = "{\"movie\":\"1193\",\"rate\":\"5\",\"timeStamp\":\"978300760\",\"uid\":\"1\"}";
RateBean bean = JSON.parseObject(json, RateBean.class);
System.out.println(bean);
Object object = JSON.toJSON(bean);
String string = object.toString();
System.out.println(object);
}
}
首先需要使用到创建的json数据对应的javabean,上面已经实现(同json测试中的RateBean)
处理问题的主程序(核心代码如下):
package cn.pengpeng.day01;
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;
import com.alibaba.fastjson.JSON;
import cn.pengpeng.day01.bean.RateBean;
import cn.pengpeng.day01.utils.SortUtil;
/**
* 实现步骤:
* 1:读取数据,---> javabean
* 2:根据uid分组 Map>
* 3:局部排序
* 4:取topn
* @author pengpeng
*
* map的遍历有哪些方式(两种)
* map.entrySet();
*
* Set keySet = map.keySet();
for (String key : keySet) {
List list = map.get(key);
}
*
* 排序(list排序)(两种)
* 增加比较器的方式 new Comparator
* 类本身能比较 implements Comparable
*
*
*/
public class TestMain1 {
public static void main(String[] args) {
//获取map数据
Map<String, List<RateBean>> map = getMap();
//遍历map
Set<Entry<String,List<RateBean>>> entrySet = map.entrySet();
for (Entry<String, List<RateBean>> entry : entrySet) {
String uid = entry.getKey();
//获取每一个list,然后对每个list进行排序
List<RateBean> list = entry.getValue();
//排序
SortUtil.sortByRate(list);
System.out.println("uid="+uid);
for(int i =0 ; i<Math.min(10, list.size()) ; i++){
//存储到文件或者数据库
RateBean rateBean = list.get(i);
System.out.println(rateBean);
}
}
}
/**
* 读取数据封装到bean,存储到MAP中
* 封装方法快捷键 alt+shift+m
* 返回提示 ctrl+1
*/
public static Map<String, List<RateBean>> getMap() {
//创建存放数据的集合
Map<String, List<RateBean>> map = new HashMap<>();
//1:读取数据 实现Closeable都能自动关流
String path = "d:\\data\\rating.json";
try (BufferedReader br = new BufferedReader(new FileReader(path));){
String line = null;
while((line = br.readLine())!=null){
//json转换
RateBean bean = JSON.parseObject(line, RateBean.class);
/*// jdk1.8的新特性
List list = map.getOrDefault(bean.getUid(), new ArrayList<>());
list.add(bean);
map.put(bean.getUid(), list);*/
if(map.containsKey(bean.getUid())){//有值的情况
List<RateBean> list = map.get(bean.getUid());
list.add(bean);
// map.put(bean.getUid(), list);
}else{//key还没有存放过
List<RateBean> list = new ArrayList<>();
list.add(bean);
map.put(bean.getUid(), list);
}
}
} catch (Exception e) {
e.printStackTrace();
}
return map;
}
}
里面用到了排序的方法,将其抽取成为相对应的方法(在sortUtile类里面的)代码如下:
package cn.pengpeng.day01.utils;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import cn.pengpeng.day01.bean.RateBean;
import cn.pengpeng.day01.bean.UidAvg;
/**
* Collections Collection 区别
*
* @author pengpeng
*
*/
public class SortUtil {
public static void sortByRate(List<RateBean> list) {
//Collections.sort(list, (o1,o2) -> o2.getRate() - o1.getRate()); // scala
Collections.sort(list, new Comparator<RateBean>() {
//排序规则主要看返回值的正负,零
//对于非int型到的最好能把等于零的时候单独的摘出来 (零特殊对待)
@Override
public int compare(RateBean o1, RateBean o2) {
return o2.getRate() - o1.getRate();
}
});
}
public static void sortByAvg(List<UidAvg> list) {
Collections.sort(list, new Comparator<UidAvg>() {
@Override
public int compare(UidAvg o1, UidAvg o2) {
if(o2.getAvg()-o1.getAvg()==0){
return 0;
}else{
return (o2.getAvg()-o1.getAvg())>0?1:-1;
}
//return Float.compare(o1.getAvg(), o2.getAvg());
}
});
}
}
第二个是在第一个基础上做的,具体代码如下:
package cn.pengpeng.day01;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import cn.pengpeng.day01.bean.RateBean;
/**
* 2.每个用户的uid和评分的平均值
* 1:读取数据---bean
* 2:封装到map集合中
* 3:局部求和求平局值
* 4:输出
* @author pengpeng
*
*/
public class TestMain2 {
public static void main(String[] args) {
Map<String, List<RateBean>> map = TestMain1.getMap();
for (Entry<String, List<RateBean>> entry : map.entrySet()) {
String uid = entry.getKey();
List<RateBean> list = entry.getValue();
float avg = getAvg(list);
System.out.println("uid="+uid+"\tavg="+String.format("%.2f", avg));
}
}
/**
* 求平均值
* @param list
* @return 平均值
*/
private static float getAvg(List<RateBean> list) {
int sum = 0;
int count = 0;
for (RateBean rateBean : list) {
sum+=rateBean.getRate();
count++;
}
return sum*1.0f/count;
}
}
第三个问题实现是在第二个基础上实现的,里面的uid和avg进行封装,封装到UidAvg类里面了,具体代码如下:
package cn.pengpeng.day01.bean;
/**
* 封装数据
* @author pengpeng
*
*/
public class UidAvg {
private String uid;
private float avg;
public String getUid() {
return uid;
}
public void setUid(String uid) {
this.uid = uid;
}
public float getAvg() {
return avg;
}
public void setAvg(float avg) {
this.avg = avg;
}
@Override
public String toString() {
return "uid=" + uid + ", avg=" + String.format("%.2f", avg) ;
}
}
第三个问题的执行入口,排序方法是在上面的排序工具类里面的(核心代码如下):
package cn.pengpeng.day01;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import cn.pengpeng.day01.bean.RateBean;
import cn.pengpeng.day01.bean.UidAvg;
import cn.pengpeng.day01.utils.SortUtil;
/**
* 3.最大方(评分平均值高)的n个用户的uid和评分平均值
* 1:读取数据---bean
* 2:封装到map集合中
* 3:局部求和求平均值
* 4:根据平均值排序 (list)
* 4:输出
* @author pengpeng
*
*/
public class TestMain3 {
public static void main(String[] args) {
Map<String, List<RateBean>> map = TestMain1.getMap();
List<UidAvg> uidAvglist = new ArrayList<>();
//把数据放到list里面
for (Entry<String, List<RateBean>> entry : map.entrySet()) {
String uid = entry.getKey();
List<RateBean> list = entry.getValue();
float avg = getAvg(list);
UidAvg uidAvg = new UidAvg();
uidAvg.setUid(uid);
uidAvg.setAvg(avg);
uidAvglist.add(uidAvg);
}
//对list进行排序
SortUtil.sortByAvg(uidAvglist);
for(int i = 0; i< Math.min(10, uidAvglist.size());i++){
System.out.println(uidAvglist.get(i).toString());
}
}
/**
* 求平均值
* @param list
* @return 平均值
*/
private static float getAvg(List<RateBean> list) {
int sum = 0;
int count = 0;
for (RateBean rateBean : list) {
sum+=rateBean.getRate();
count++;
}
return sum*1.0f/count;
}
}
华丽丽的分割线
这个文件跟rating.json文件内容相似,只是格式不一样,格式为:
UserID::MovieID::Rating::Timestamp
1::3114::4::978302174
1::608::4::978301398
1::1246::4::978302091
2::1357::5::978298709
2::3068::4::978299000
2::1537::4::978299620
2::647::3::978299351
这个是用户数据,数据说明为:
UserID::Gender::Age::Occupation::Zip-code
Gender is denoted by a “M” for male and “F” for female
Age is chosen from the following ranges:
Occupation is chosen from the following choices:
1::F::1::10::48067
2::M::56::16::70072
3::M::25::15::55117
4::M::45::7::02460
5::M::25::20::55455
这个是电影数据,具体格式为:
MovieID::Title::Genres
Titles are identical to titles provided by the IMDB (including
year of release)
Genres are pipe-separated and are selected from the following genres:
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
将ratings.dat文件中的id替换为用户的详细信息,将movie(电影id)也替换为电影的详细信息
也就是将原始数据为:
1::608::4::978301398
1::1246::4::978302091
2::1357::5::978298709
2::3068::4::978299000
转换为:
1::F::1::10::48067::608::Fargo (1996)::Crime|Drama|Thriller::4::978301398
1::F::1::10::48067::1246::Dead Poets Society (1989)::Drama::4::978302091
2::M::56::16::70072::1357::Shine (1996)::Drama|Romance::5::978298709
2::M::56::16::70072::3068::Verdict, The (1982)::Drama::4::978299000
问答题