Principles, architecture, and uses of data warehouse layering
Details of the RF model and how to build it
How to deploy a model without the existing big data platform (猛犸)
Why does data skew occur in big data, and how to mitigate it?
Details of doing ETL with Python
----------------------------
Principles and workflow of Hadoop HA
How the fsimage and edits files work and how they are used
Startup flow of Spark on YARN
Data skew
Tez-related topics
How to find a single large file that takes up a lot of disk space with the Linux shell
Spark tuning: a typical problem actually solved in practice
-----------------------------
The end-to-end data processing flow in the business
-----------------------------
What resources, modules, and concrete steps are needed to build a data warehouse on your own
Principles of data warehouse layering
Hive UDAF implementation details
Differences between Spark Streaming and Flink
How Spark Streaming joins two streams under the hood
https://blog.csdn.net/wangpei1949/article/details/83892162
Typical highlight cases from my own work and business
How to fix join data skew caused by an oversized dimension table in the warehouse (see the sketch after this list)
SMB (sort-merge-bucket) join: how to handle too many buckets
https://www.jianshu.com/p/004462037557
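For the oversized-dimension-table skew question above, a hedged HiveQL sketch (fact_tbl, dim_tbl, and the column names are placeholders, not from the source): if the pruned dimension table fits in memory, converting the join to a map join avoids the skewed shuffle entirely; otherwise Hive's skew-join optimization can split the hot keys.
set hive.auto.convert.join=true;   -- let Hive turn small-table joins into map joins
set hive.optimize.skewjoin=true;   -- handle remaining hot keys with a skew join
select /*+ MAPJOIN(d) */
    f.order_id,
    d.category_name
from fact_tbl f
join
(
    -- prune the dimension table to only the rows/columns actually needed
    select dim_id, category_name from dim_tbl
) d
on f.dim_id = d.dim_id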
--------------------------------------------
How to extract a user's most recent record with Hive SQL
https://blog.csdn.net/u014571011/article/details/51907822
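One common approach, sketched here with assumed names (table user_log, columns user_id and event_time are placeholders; the linked blog may differ): rank each user's records by time with a window function and keep the newest one.
select
    user_id,
    event_time
from
(
    select
        user_id,
        event_time,
        row_number() over(partition by user_id order by event_time desc) as rn
    from user_log
) t
where t.rn = 1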
SQL 1. Given three tables A, B, C, each with a user_id (int) column, write SQL to find all user_ids that are in table A and in table C but not in table B.
select
    a.user_id
from A a
left join B b on a.user_id = b.user_id
join C c on a.user_id = c.user_id
where b.user_id is null
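Note that if C contains duplicate user_ids, the inner join above can return the same user_id several times. A hedged variant that only tests existence in C via LEFT SEMI JOIN (adding distinct to the query above works as well):
select
    a.user_id
from A a
left join B b on a.user_id = b.user_id
left semi join C c on a.user_id = c.user_id
where b.user_id is null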
2. Given table A with two columns: user_id (user id, int) and stay_time (duration of a single stay, double). Note: a user may have multiple rows. Write SQL to compute the number of users and the average stay time per user.
select
    count(user_id) as total_user,
    sum(t) / count(user_id) as avg_stay_time
from
(
    select user_id, sum(stay_time) as t
    from A
    where user_id is not null
    group by user_id
) t1
3. Given table A with three columns: subject_id (subject id), user_id (student id), score. Note: the same student has scores in multiple subjects. Write SQL to find, for each subject, the top 10 students by score, along with their score and the subject's average score.
select
    subject_id,
    user_id,
    score,
    avg_score
from
(
    select
        subject_id,
        user_id,
        score,
        avg(score) over(partition by subject_id) as avg_score,
        row_number() over(partition by subject_id order by score desc) as rn
    from A
) a
where a.rn <= 10
4. Given table A with two columns: content_id (content id) and pool_id_list (the comma-separated list of pool ids the content belongs to). Note: each content may be in one or more pools, and each content_id has exactly one row in A. Write SQL to compute the number of contents under each pool_id. Some pools contain a very large number of content ids, so watch out for data skew.
set hive.map.aggr=true;
set hive.groupby.skewindata=true;
select
    a.pool_id,
    count(a.content_id) as num
from
(
    select
        content_id,
        pool_id
    from A
    lateral view explode(split(pool_id_list, ',')) t as pool_id
) a
group by
    a.pool_id
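Besides hive.groupby.skewindata, the classic manual fix is two-stage aggregation with a random salt; a sketch under the same schema assumptions (the salt factor 10 is an arbitrary choice):
select
    pool_id,
    sum(partial_num) as num
from
(
    select
        pool_id,
        count(content_id) as partial_num
    from
    (
        select
            content_id,
            pool_id,
            cast(rand() * 10 as int) as salt   -- spread each hot pool_id over 10 groups
        from A
        lateral view explode(split(pool_id_list, ',')) t as pool_id
    ) e
    group by pool_id, salt    -- first stage: partial counts per (pool_id, salt)
) p
group by pool_id              -- second stage: merge the partial counts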
MapReduce 1. Implement count distinct with native MapReduce.
package hadoop.count_distinct;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class CountDistinct {
    public static final String DELIMITER = ",";

    // Emit the field to be de-duplicated as the map output key; the shuffle then
    // groups identical values together, so each reduce key is one distinct value.
    public static class Map extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString().trim();
            // skip empty lines
            if (line.isEmpty()) {
                return;
            }
            // split on the delimiter and take the first field as the distinct field
            String[] values = line.split(DELIMITER);
            String distinctField = values[0];
            context.write(new Text(distinctField), NullWritable.get());
        }
    }

    // With a single reducer, reduce() is called once per distinct key,
    // so counting the calls gives the global distinct count.
    public static class Reduce extends Reducer<Text, NullWritable, NullWritable, Text> {
        private long count = 0;

        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context) {
            count++;
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            context.write(NullWritable.get(), new Text(Long.toString(count)));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: input_path, output_path");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "count_distinct");
        job.setJarByClass(CountDistinct.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        // a single reducer is required so the count covers all keys
        job.setNumReduceTasks(1);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
2. Inner join two files with native MapReduce. File A: user_id, name, phone. File B: phone, location. Output file C: user_id, name, phone, location.
package hadoop.innerjoin;

import java.io.IOException;
import java.util.LinkedList;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class InnerJoin {
    public static final String DELIMITER = ",";

    public static class Map extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // the input path tells us whether the record comes from file A or file B
            FileSplit split = (FileSplit) context.getInputSplit();
            String fileName = split.getPath().toString();

            String line = value.toString();
            // skip empty lines
            if (line == null || line.equals("")) {
                return;
            }
            String[] values = line.split(DELIMITER);

            // records from A: user_id, name, phone -> key by phone, tag with "a#"
            if (fileName.contains("A")) {
                if (values.length < 3) {
                    return;
                }
                String user_id = values[0];
                String name = values[1];
                String phone = values[2];
                context.write(new Text(phone), new Text("a#" + user_id + DELIMITER + name));
            }
            // records from B: phone, location -> key by phone, tag with "b#"
            else if (fileName.contains("B")) {
                if (values.length < 2) {
                    return;
                }
                String phone = values[0];
                String location = values[1];
                context.write(new Text(phone), new Text("b#" + location));
            }
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // collect the A-side and B-side records that share the same phone
            LinkedList<String> link_A = new LinkedList<String>();
            LinkedList<String> link_B = new LinkedList<String>();
            for (Text tval : values) {
                String val = tval.toString();
                if (val.startsWith("a#")) {
                    link_A.add(val.substring(2));
                } else if (val.startsWith("b#")) {
                    link_B.add(val.substring(2));
                }
            }
            // inner join: cross product of the two sides for this phone
            for (String u : link_A) {
                for (String k : link_B) {
                    String[] us = u.split(DELIMITER);
                    // output: user_id, name, phone, location
                    context.write(new Text(us[0]), new Text(us[1] + DELIMITER + word + DELIMITER + k));
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: input_path, output_path");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "InnerJoin");
        job.setJarByClass(InnerJoin.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
See also this repo, where the join is implemented as a left join and count distinct is combined with a group by:
https://github.com/chenpengcong/mapreduce-query