今天接到一个需求，一个人有多个任务，每个任务都有一个开始时间和结束时间，统计属于这个人的任务开始时间和结束时间有重合的任务。支持分页和筛选。

现在将需求抽象为如下场景:

任务表task

id	operator_id	start_date	end_date
1	1	2021-10-01 00:00:00	2021-10-02 00:00:00
2	1	2021-10-03 00:00:00	2021-10-08 00:00:00
3	1	2021-10-04 00:00:00	2021-10-05 00:00:00
4	1	2021-10-07 00:00:00	2021-10-10 00:00:00
5	2	2021-10-01 00:00:00	2021-10-02 00:00:00
6	2	2021-10-03 00:00:00	2021-10-08 00:00:00
7	2	2021-10-04 00:00:00	2021-10-05 00:00:00
8	2	2021-10-07 00:00:00	2021-10-10 00:00:00

按照上述需求，应该返回如下数据:

id	operator_id	start_date	end_date
2	1	2021-10-03 00:00:00	2021-10-08 00:00:00
3	1	2021-10-04 00:00:00	2021-10-05 00:00:00
4	1	2021-10-07 00:00:00	2021-10-10 00:00:00
6	2	2021-10-03 00:00:00	2021-10-08 00:00:00
7	2	2021-10-04 00:00:00	2021-10-05 00:00:00
8	2	2021-10-07 00:00:00	2021-10-10 00:00:00

思路

统计时间是否重叠，我们需要查询出所有开始时间和结束时间有值的任务，然后根据人去分组，每个人下的任务互相比较，获取到日期区间有重叠的任务，然后再根据筛选条件和分页条件对任务进行分页和筛选。

代码实现

在这样的想法下，最直接的实现是用代码在内存中去实现。

首先查询出所有的任务，然后根据operatorId分组，然后for循环获得所有时间区间重叠的task id，再把所有的task id代入到task表中去做分页和字段筛选(此处不做代码演示)。

这样的实现在功能上没问题的，但是如果任务数量比较大的时候，就需要我们复盘下结果返回的过程，看看有没有优化空间。

首先要做一次sql查询，将所有的任务放到内存中，然后在内存中循环对比，获取时间区间冲突的任务id集合，然后拿到任务id集合作为参数传入新的sql查询中，来做分页和筛选操作。在这个过程中，多次sql查询有一定开销，查询结果映射到内存中有一些开销，而这一连串操作都是数据筛选操作，能不能将这些操作交给专门做数据存储和筛选的数据库做呢？

sql实现

根据以上面的思路，很容易的到如下的查询sql

select 
t1.*
from task t1, task t2
where t1.operator_id = t2.operator_id
and t1.id <> t2.id
and t1.start_date >= t2.start_date 
and t1.start_date < t2.end_date
union
select 
t2.*
from task t1, task t2
where t1.operator_id = t2.operator_id
and t1.id <> t2.id
and t1.start_date >= t2.start_date 
and t1.start_date < t2.end_date

上述sql虽然很简单，但在实际实践当中，查询却很慢。这里假设一个人有1w个任务，查询过程中每个任务都要和其他9999个任务对比时间区间，笛卡尔积为10000 * 9999 = 99990000，显然这样的查询遍历的次数太多了，导致查询极其缓慢，这样的慢sql需要优化一下。

思路优化

查询日期区间是否重复，假设有n条任务数据，则需要n*(n-1)次比较，如果能把对比次数降低，查询速度应该能得到提升。因此我们的目标是降低日期对比的次数。

我们可以根据operatorId和开始时间对数据进行排序，下图是以一个用户为例。

1.jpg

排序后的数据，如果下一行数据的开始时间在上一行数据的时间区间内，则这两条记录时间区间有重叠。这样就只需要和上一条记录做一次时间对比，而不是n-1次，大大节省了对比次数。

按上述规则，我们可以得到task2和task3两个时间重叠的任务，由于task4和task3没有时间重叠，无法返回，但实际上task4和task2是有时间重叠的，这种情况怎么处理呢？

我们注意到task3的时间区间是完全被包含在task2的时间区间内的，如果是这种情况，就需要和时间范围较大的那条记录做对比了，因此task4的时间区间应该和task2的时间区间对比，这样就可以返回task2，task3和task4三条数据了。

有了这样的思路后，接下来就是用sql实现了。

mysql用户定义变量

使用sql实现上述查询的难点是如何记录上一条任务的开始时间和结束时间。此时就需要使用mysql的用户定义变量了。

用户变量是一个生命周期为session的变量，声明后在当次会话中的所有sql都可以获取到该变量的值。我们可以用set或者select来初始化:

set @start = 1, @finish = 10;  
select @start = 1, @finish = 10;  
select * from places where place between @start and @finish;

根据这个特性我们可以声明一个用户定义变量，来存储查询结果每条记录的前一行记录的开始时间和结束时间，用来判断时间区间是否有重叠。

select 
temp.id, 
temp.operator_id,
temp.start_date,
temp.end_date
from (
    select 
    t1.id,
    t1.operator_id,
    t1.start_date,
    t1.end_date,
    if(
        @lastOperatorId = t1.operator_id,
        t1.start_date >= @lastStartDate and t1.start_date < @lastEndDate,
        0
    ) as overlap,
    @lastOperatorId := t1.operator_id as last_operator_id,
    case
        when t1.end_date < @lastEndDate
        then @lastStartDate := @lastStartDate
        else @lastStartDate := t1.start_date
    end as last_start,
    case
        when t1.end_date < @lastEndDate
        then @lastEndDate := @lastEndDate
        else @lastEndDate := t1.end_date
    end as last_end
    from task t1,
    (
        select @lastOperatorId := 0, 
        @lastStarDate := '1000-01-01 00:00:00',
        @lastEndDate := '1000-01-01 00:00:00'
    )  sqlVars
    order by t1.operator_id asc, t1.start_date asc, t1.id asc
) temp
where temp.overlap = 1

首先我们先声明一个sqlVars的临时表以及@lastOperatorId，@lastStarDate和@lastEndDate三个用户定义变量，分别用于记录上一条task的操作人，开始时间以及结束时间。

@lastOperatorId相当于对操作人进行分组，只对比同一个人的开始时间和结束时间。如果当前记录的开始时间在上一条记录的时间区间内，则overlap标记为1，表示重叠。

接下来是将当前行的值重新赋值给变量，如果当前任务的时间区间在上一条记录的时间区间内，则不赋值。

经过测试后，发现每个人的第一条数据没有返回。原因是第一条数据是和后一条数据重叠的，与前一条数据不重叠，那该如何解决丢数据的问题呢？解决办法也很简单，倒序再查一遍即可。

select 
distinct
temp.id, 
temp.operator_id,
temp.start_date,
temp.end_date
from (
    select temp1.* from
    (
        select 
        t1.id,
        t1.operator_id,
        t1.start_date,
        t1.end_date,
        if(
            @lastOperatorId = t1.operator_id,
            t1.start_date >= @lastStartDate and t1.start_date < @lastEndDate,
            0
        ) as overlap,
        @lastOperatorId := t1.operator_id as last_operator_id,
        case
            when t1.end_date < @lastEndDate
            then @lastStartDate := @lastStartDate
            else @lastStartDate := t1.start_date
        end as last_start,
        case
            when t1.end_date < @lastEndDate
            then @lastEndDate := @lastEndDate
            else @lastEndDate := t1.end_date
        end as last_end
        from task t1,
        (
            select @lastOperatorId := 0, 
            @lastStarDate := '1000-01-01 00:00:00',
            @lastEndDate := '1000-01-01 00:00:00'
        )  sqlVars
        order by t1.operator_id asc, t1.start_date asc, t1.id asc
    ) temp1
    union
    select temp2.* from
    (
        select 
        t1.id,
        t1.operator_id,
        t1.start_date,
        t1.end_date,
        if(
            @descLastOperatorId = t1.operator_id,
            @descLastStartDate >= t1.start_date and @descLastStartDate < t1.end_date,
            0
        ) as overlap,
        @descLastOperatorId := t1.operator_id as last_operator_id,
        @descLastStartDate := t1.start_date as last_start,
        @descLastEndDate := t1.end_date as last_end
        from task t1,
        (
            select @descLastOperatorId := 0, 
            @descLastStarDate := '1000-01-01 00:00:00',
            @descLastEndDate := '1000-01-01 00:00:00'
        )  sqlVars
        order by t1.operator_id asc, t1.start_date desc, t1.id desc
    ) temp2
) temp
where temp.overlap = 1

这里需要注意的是，倒序查询的sql中要声明新的用户定义变量，以防止和上一个sql的变量重复，导致查询结果不正确。

到目前为止，我们就解决了使用sql查询表中重叠时间区间的问题。本文提供一种sql解决的方案，如有更好的解决方案欢迎交流沟通。

使用mysql用户变量查询日期范围重叠数据

思路

代码实现

sql实现

思路优化

mysql用户定义变量

你可能感兴趣的:(使用mysql用户变量查询日期范围重叠数据)