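I. Hive architecture
1. Interface layer
Web UI, Thrift server (HiveServer2), Beeline
2. Core layer
Driver and metastore.
The driver contains: parser (lexical and syntax analysis; produces the AST from which the logical plan is built) -> optimizer (optimizes the logical plan) -> compiler (generates the physical execution plan) -> executor (runs the plan on the execution engine)
3. Processing layer
Computation on MapReduce, Spark, or Tez; storage on HDFS.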
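II. Managed tables and external tables
Managed (internal) table: dropping the table also deletes its data, because Hive manages the data under a path it controls.
External table (EXTERNAL TABLE): dropping the table does not delete the data. Only the metadata, that is, the link to the files, is removed. (In Hive versions that support it, setting the table property 'external.table.purge'='true' makes it behave like a managed table: dropping the table also deletes the data.)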
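III. The four BY clauses
1. SORT BY: sorts rows within each reducer, so the output is ordered per reducer but not globally.
2. ORDER BY: global sort; all rows pass through a single reducer.
3. DISTRIBUTE BY: decides which reducer each row is sent to, similar to the partitioner in MapReduce.
4. CLUSTER BY: equivalent to DISTRIBUTE BY plus SORT BY on the same columns (ascending order only).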
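To make the difference between the four clauses concrete, here is a minimal Java sketch that simulates the routing and sorting on an in-memory list. It is an illustration only, not Hive code: the class name `FourBySketch` is hypothetical, and `String.hashCode()` stands in for whatever hash function Hive actually applies to the DISTRIBUTE BY key.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only: simulates how rows reach reducers; not Hive code.
final class FourBySketch {

    // DISTRIBUTE BY key: route each row to a reducer by hashing the key.
    // SORT BY key: then sort the rows inside each reducer only.
    // Doing both on the same key is what CLUSTER BY does.
    static List<List<String>> distributeAndSort(List<String> keys, int numReducers) {
        List<List<String>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            reducers.add(new ArrayList<>());
        }
        for (String key : keys) {
            // Assumption: Java's hashCode stands in for Hive's own hash function.
            int target = Math.floorMod(key.hashCode(), numReducers);
            reducers.get(target).add(key);
        }
        for (List<String> rows : reducers) {
            rows.sort(Comparator.naturalOrder());
        }
        return reducers;
    }

    // ORDER BY key: one global sort, as if every row went to a single reducer.
    static List<String> orderBy(List<String> keys) {
        List<String> all = new ArrayList<>(keys);
        all.sort(Comparator.naturalOrder());
        return all;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("d", "a", "c", "b", "a");
        System.out.println(distributeAndSort(keys, 2));
        System.out.println(orderBy(keys));
    }
}
```

With two reducers, equal keys always land in the same reducer and each reducer's list is sorted, but concatenating the lists does not give a globally sorted result. Only `orderBy` guarantees that, at the cost of funnelling everything through one place.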
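IV. UDF
1. UDF: extend the UDF class (it is a class, not an interface) and define an evaluate method; Hive locates evaluate by reflection.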
package com.example; // matches the class name registered with CREATE TEMPORARY FUNCTION

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;

@Description(name = "LatestAdjustmentDateUDF",
    value = "Returns the latest adjustment date based on the initial date and adjustment frequency",
    extended = "Example:\n" +
        " SELECT LatestAdjustmentDateUDF('2022-01-01', 'yearly') AS latest_adjustment_date")
public class LatestAdjustmentDateUDF extends UDF {
    public String evaluate(String initialDate, String adjustmentFrequency) {
        // SQL NULL in, SQL NULL out.
        if (initialDate == null || adjustmentFrequency == null) {
            return null;
        }
        // SimpleDateFormat is not thread-safe, so create one per call.
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
        try {
            Date date = sdf.parse(initialDate);
            Calendar calendar = Calendar.getInstance();
            calendar.setTime(date);
            // Length of one adjustment period, in months.
            // Calendar.YEAR and Calendar.MONTH are field identifiers, not durations,
            // so they must not be used as the amount to add.
            int periodInMonths = 0; // unknown frequency: the date is returned unchanged
            if (adjustmentFrequency.equalsIgnoreCase("yearly")) {
                periodInMonths = 12;
            } else if (adjustmentFrequency.equalsIgnoreCase("quarterly")) {
                periodInMonths = 3;
            } else if (adjustmentFrequency.equalsIgnoreCase("half-yearly")) {
                periodInMonths = 6;
            }
            calendar.add(Calendar.MONTH, periodInMonths);
            return sdf.format(calendar.getTime());
        } catch (Exception e) {
            // Unparseable input yields NULL instead of failing the query.
            e.printStackTrace();
            return null;
        }
    }
}
ADD JAR ;
CREATE TEMPORARY FUNCTION latest_adjustment_date AS 'com.example.LatestAdjustmentDateUDF';
SELECT latest_adjustment_date(initial_date, adjustment_frequency) FROM your_table;
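2. UDTF: extend the GenericUDTF class and implement three methods: initialize (declares the output column names and types), process (handles one input row and emits output rows by calling forward), and close (releases resources).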
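3. UDAF: extend the UDAF class and provide a static nested class that implements UDAFEvaluator, with the methods init, iterate, terminatePartial, merge, and terminate.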
import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import java.time.LocalDate;

// Old-style (reflection-based) UDAF. Hive finds the nested UDAFEvaluator by
// reflection, so no custom resolver class is needed. This API is deprecated
// in favor of GenericUDAFResolver2 / GenericUDAFEvaluator.
public class LatestAdjustmentDateUDAF extends UDAF {
    public static class Evaluator implements UDAFEvaluator {
        // Running result for the current group, as a yyyy-MM-dd string.
        private String latestAdjustmentDate;

        public Evaluator() {
            super();
            init();
        }

        // Resets the state before a new group is aggregated.
        public void init() {
            latestAdjustmentDate = null;
        }

        // Called once per input row.
        // Assumption: changePeriod is a number of months, and the aggregate keeps
        // the latest value of (initialDate + changePeriod months) seen so far.
        public boolean iterate(String initialDate, int changePeriod) {
            if (initialDate != null) {
                keepLatest(LocalDate.parse(initialDate).plusMonths(changePeriod).toString());
            }
            return true;
        }

        // Partial result shipped from the map side to the reduce side.
        public String terminatePartial() {
            return latestAdjustmentDate;
        }

        // Combines a partial result from another task into this one.
        public boolean merge(String otherLatestAdjustmentDate) {
            if (otherLatestAdjustmentDate != null) {
                keepLatest(otherLatestAdjustmentDate);
            }
            return true;
        }

        // Final result for the group.
        public String terminate() {
            return latestAdjustmentDate;
        }

        // ISO yyyy-MM-dd strings sort chronologically, so string comparison is enough.
        private void keepLatest(String candidate) {
            if (latestAdjustmentDate == null || candidate.compareTo(latestAdjustmentDate) > 0) {
                latestAdjustmentDate = candidate;
            }
        }
    }
}
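Is a static nested class loaded when its outer class is initialized?
No. Initializing the outer class does not initialize its static nested classes. A static nested class is compiled to its own class file (Outer$Nested.class), and the JVM treats it as a separate class.
A static nested class is initialized on its first active use: creating an instance, calling one of its static methods, or accessing one of its static fields that is not a compile-time constant. Only then do its static field initializers and static blocks run. Static methods run only when they are called. In practice (for example on HotSpot) the class is also loaded at that point, not earlier.
The two initializations are independent. Initializing the outer class does not initialize the nested class, and using the nested class does not by itself initialize the outer class. A static nested class also needs no instance of the outer class.
If the code never uses the static nested class, it is never initialized.
In short: a static nested class is not initialized together with its outer class; it is initialized lazily, the first time it is actually used. The initialization-on-demand holder idiom for lazy singletons relies on exactly this behavior.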
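The lazy behavior can be observed directly. The following is a small sketch; the names `NestedInitDemo`, `Outer`, and `Nested` are made up for the illustration. It records the order in which static initializers run.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo: logs when each class's static initializer actually runs.
final class NestedInitDemo {
    // Shared event log, appended to by the static blocks below.
    static final List<String> EVENTS = new ArrayList<>();

    static class Outer {
        static { EVENTS.add("Outer initialized"); }

        static void touch() { EVENTS.add("Outer.touch called"); }

        static class Nested {
            static { EVENTS.add("Nested initialized"); }

            static String value() { return "nested value"; }
        }
    }

    static List<String> run() {
        Outer.touch();                              // initializes Outer, but not Nested
        EVENTS.add("before first use of Nested");
        Outer.Nested.value();                       // first active use initializes Nested
        return new ArrayList<>(EVENTS);
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

In the recorded events, "Nested initialized" appears only after "before first use of Nested", even though `Outer` was initialized and used earlier.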
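Hive optimization
1) Map join
Unless told otherwise, a join runs as a common (reduce-side, shuffle) join. When one table is large and the others are small, a map join is faster: the join is done on the map side, so the shuffle and reduce phases are skipped. Roughly: the small table is read first and built into an in-memory hash table, which is shipped to the nodes through the distributed cache; each map task over the large table then probes that hash table and writes the joined rows directly. With hive.auto.convert.join=true, Hive converts eligible joins to map joins automatically, based on the small table's size (hive.mapjoin.smalltable.filesize).
-- Map join via hint; the hint names the alias of the small table.
-- The hint is honored only when hive.ignore.mapjoin.hint=false.
-- In a LEFT JOIN only the right-hand table (b) can be the hashed small table.
-- table_a and table_b are placeholder names.
select /*+ MAPJOIN(b) */ a.id from table_a a left join table_b b on a.id = b.id;
-- Query using a map join
SELECT /*+ MAPJOIN(c) */ o.order_id, c.customer_name FROM orders o JOIN customers c ON o.customer_id = c.customer_id;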
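The build-then-probe mechanism described above can be sketched in plain Java. This is an illustration of the idea only, not Hive's implementation: the class `MapJoinSketch` and its methods are hypothetical, the data lives in arrays, and each customer_id is assumed to appear once in the small table.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the map-join idea; this is not Hive code.
final class MapJoinSketch {

    // Step 1: read the small table once and build an in-memory hash table
    // keyed by the join column (customer_id -> customer_name).
    static Map<Integer, String> buildHashTable(int[] customerIds, String[] customerNames) {
        Map<Integer, String> table = new HashMap<>();
        for (int i = 0; i < customerIds.length; i++) {
            table.put(customerIds[i], customerNames[i]);
        }
        return table;
    }

    // Step 2: each "map task" streams rows of the large table and probes the
    // hash table; matches are emitted directly, with no shuffle or reduce.
    // Each order row is {order_id, customer_id}.
    static List<String> mapSideJoin(int[][] orders, Map<Integer, String> smallTable) {
        List<String> output = new ArrayList<>();
        for (int[] order : orders) {
            String customerName = smallTable.get(order[1]);
            if (customerName != null) { // inner join: unmatched rows are dropped
                output.add(order[0] + "," + customerName);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        Map<Integer, String> customers =
                buildHashTable(new int[] {10, 20}, new String[] {"Ann", "Bob"});
        int[][] orders = {{1, 10}, {2, 20}, {3, 30}};
        System.out.println(mapSideJoin(orders, customers));
    }
}
```

The sketch also shows why the small table must fit in memory: the whole hash table is held by every task that scans the large table.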
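2) Row and column pruning
Row pruning: in an outer join, apply the filter on the secondary (non-preserved) table before the WHERE clause, that is, in the ON clause or in a subquery, so that rows are filtered out early and less data is loaded.
Column pruning: select only the columns you need; do not use SELECT *.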
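3) Set a reasonable number of map tasks
Too many is wasteful: every map task consumes resources, and when the input consists of small files each task does very little work, so efficiency is poor.
4) Set a reasonable number of reduce tasks
Too many reducers also produce too many output files.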
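When the reducer count is not set explicitly, Hive estimates it from the input size using hive.exec.reducers.bytes.per.reducer and hive.exec.reducers.max. The sketch below shows that kind of estimate as a ceiling division capped at a maximum. The class `ReducerEstimateSketch` is hypothetical, and the 256 MB and 1009 values in `main` are assumptions standing in for your cluster's actual settings.

```java
// Hypothetical helper illustrating the size-based estimate; not Hive's actual code.
final class ReducerEstimateSketch {

    // reducers = min(maxReducers, ceil(totalInputBytes / bytesPerReducer)), at least 1
    static int estimateReducers(long totalInputBytes, long bytesPerReducer, int maxReducers) {
        if (totalInputBytes < 0 || bytesPerReducer <= 0 || maxReducers <= 0) {
            throw new IllegalArgumentException("sizes and limits must be positive");
        }
        // Ceiling division without floating point.
        long needed = (totalInputBytes + bytesPerReducer - 1) / bytesPerReducer;
        long bounded = Math.max(1L, Math.min(needed, (long) maxReducers));
        return (int) bounded;
    }

    public static void main(String[] args) {
        long bytesPerReducer = 256_000_000L; // assumed setting
        int maxReducers = 1009;              // assumed setting
        System.out.println(estimateReducers(1_000_000_000L, bytesPerReducer, maxReducers));
    }
}
```

Raising the bytes-per-reducer value lowers the reducer count and therefore the number of output files; lowering it increases parallelism.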
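5) Merge small files
Combine small files before the map phase (for example with CombineHiveInputFormat) to reduce the number of map tasks.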