1 问题
各大网站录入电影院,地址没有统一的规范,造成电影票无法比价。
2 解决思路
2.1 经纬度范围查找
拿到数据中包含经度维度信息,根据经纬度范围查找锁定这些名字不同的电影院为同一家电影院。
2.1.1 各大网站使用的地图坐标协议不同
(google、高德、腾讯、图吧地图、图吧导航)使用的是gcj02,百度、搜狗使用的是另外一种坐标协议bd09。所以网上找个java写的统一转换各大地图协议至百度地图的代码,然后改写为mysql的自定义函数,转换后误差在万分之五(距离大概是5-5.5米)
一、经纬度距离换算
a)在纬度相等的情况下:
经度每隔0.00001度,距离相差约1米;
每隔0.0001度,距离相差约10米;
每隔0.001度,距离相差约100米;
每隔0.01度,距离相差约1000米;
每隔0.1度,距离相差约10000米。
b)在经度相等的情况下:
纬度每隔0.00001度,距离相差约1.1米;
每隔0.0001度,距离相差约11米;
每隔0.001度,距离相差约111米;
每隔0.01度,距离相差约1113米;
每隔0.1度,距离相差约11132米。
高德 convert to 百度经纬度函数
(网上java有现成代码,这是根据java改写mysql代码)。
各个地图经纬度转换
转换维度
DELIMITER |
CREATE FUNCTION convert_gcj02_to_bd09_lat(longitude DOUBLE(9,6),latitude DOUBLE(9,6))
RETURNS DOUBLE(9,6)
BEGIN
DECLARE x_pi DOUBLE(9,8);
DECLARE x DOUBLE(9,6);
DECLARE y DOUBLE(9,6);
DECLARE z DOUBLE(9,6);
DECLARE theta DOUBLE(10,9);
SET x_pi = 3.14159265358979324 * 3000.0 / 180.0;
SET x=longitude;
SET y=latitude;
SET z=sqrt(x*x+y*y)+ 0.00002 * sin(y*x_pi);
SET theta=atan2(y,x)+ 0.000003 * cos(x*x_pi);
SET longitude=z*cos(theta)+0.0065;
SET latitude=z*sin(theta)+0.006;
RETURN latitude;
END |
DELIMITER ;
测试
SELECT convert_gcj02_to_bd09_lat(120.098703,29.324483);
转换经度
DELIMITER |
CREATE FUNCTION convert_gcj02_to_bd09_lng(longitude DOUBLE(9,6),latitude DOUBLE(9,6))
RETURNS DOUBLE(9,6)
BEGIN
DECLARE x_pi DOUBLE(9,8);
DECLARE x DOUBLE(9,6);
DECLARE y DOUBLE(9,6);
DECLARE z DOUBLE(9,6);
DECLARE theta DOUBLE(10,9);
SET x_pi = 3.14159265358979324 * 3000.0 / 180.0;
SET x=longitude;
SET y=latitude;
SET z=sqrt(x * x + y * y) + 0.00002 * sin(y * x_pi);
SET theta = atan2(y, x) + 0.000003 * cos(x * x_pi);
SET longitude = z * cos(theta) + 0.0065;
RETURN longitude;
END |
DELIMITER ;
测试
SELECT convert_gcj02_to_bd09_lng(120.098703,29.324483);
根据经纬度计算距离函数
DELIMITER |
CREATE FUNCTION `juli`(lat1 DOUBLE(10,7),lat2 DOUBLE(10,7),lng1 DOUBLE(10,7),lng2 DOUBLE(10,7)) RETURNS double
BEGIN
SET @distance=round(6378.138*2*asin(sqrt(pow(sin( (lat1*pi()/180-lat2*pi()/180)/2),2)+cos(lat1*pi()/180)*cos(lat2*pi()/180)* pow(sin( (lng1*pi()/180-lng2*pi()/180)/2),2)))*1000);
RETURN @distance;
END |
DELIMITER ;
弃用经纬度算法
很多影院的经纬度信息为null,而且有些经纬度信息不太准确,所以后面弃用了根据经纬度去定位是否为同一家影院。
根据电影院名字,电话,地址的相识度匹配。
公式如下
count(相识单词之间A和B)/count(A)+count(B)-count(交集))
代码如下:
电影院名称相识度匹配
对比两个字符串
DELIMITER ;;
CREATE FUNCTION `levenshtein`( s1 TEXT, s2 TEXT) RETURNS INT(11)
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR;
DECLARE cv0, cv1 TEXT;
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF s1 = s2 THEN
RETURN 0;
ELSEIF s1_len = 0 THEN
RETURN s2_len;
ELSEIF s2_len = 0 THEN
RETURN s1_len;
ELSE
WHILE j <= s2_len DO
SET cv1 = CONCAT(cv1, UNHEX(HEX(j)));
SET j = j + 1;
END WHILE;
WHILE i <= s1_len DO
SET s1_char = SUBSTRING(s1, i, 1);
SET c = i;
SET cv0 = UNHEX(HEX(i));
SET j = 1;
WHILE j <= s2_len DO
SET c = c + 1;
IF s1_char = SUBSTRING(s2, j, 1) THEN
SET cost = 0;
ELSE SET cost = 1;
END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN
SET c = c_temp;
END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c)));
SET j = j + 1;
END WHILE;
SET cv1 = cv0;
SET i = i + 1;
END WHILE;
END IF;
RETURN c;
END ;;
DELIMITER ;
两个字符串相识度占比
DELIMITER ;;
CREATE FUNCTION `levenshtein_ratio`( s1 TEXT, s2 TEXT ) RETURNS INT(11)
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, max_len INT;
SET s1_len = LENGTH(s1), s2_len = LENGTH(s2);
IF s1_len > s2_len THEN
SET max_len = s1_len;
ELSE
SET max_len = s2_len;
END IF;
RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100);
END |
DELIMITER ;;
通过几次测试,相识度大于等于90的大致为同一影院。个别电影院名字极度相仿的,可以对相识度值做一些调整。
SELECT *,levenshtein_ratio('龙海金逸影城(美一店)',cinema_name) xiangshi FROM `bidding_cinema_data`
WHERE levenshtein_ratio('龙海金逸影城(美一店)',cinema_name)>=90;
重回经纬度
字符串匹配的精确度很难达到80以上(因为有的电影院名字很短,只有两个字或4个字)
所以这些电影院相识度匹配的时候,很难区分…
问题
采集到的数据,有的经纬度信息为null
所以根据百度地图接口传入地址来补全经纬度信息.
根据经纬度范围打标签
SQL脚本如下:
打标签第一版本
DELIMITER ;;
CREATE PROCEDURE `set_lable`(lng DOUBLE,lat DOUBLE,rounds DOUBLE,city_meta_id int,lables int)
-- lng:维度
-- lat:经度
-- rounds:前后范围
-- city_meta_id:城市编号
-- labels:标签
BEGIN
set @lng=lng;
set @lat=lat;
set @rounds=rounds;
set @lable=lables;
set @city_meta_id=city_meta_id;
update clean_cinema_data_copy as a
inner join bidding_city_data as b
on a.city_id=b.city_id
and a.site_id=b.site_id
SET lable=@lable,
brand=replace(replace(replace(replace(replace(replace(replace(replace(cinema_name,'电影院',' '),'电影城',' '),'影视城',' '),'国际',''),'影院',' '),'影城',' '),'影视',' '),city_name,' ')
where longitude<>0.0
and city_meta_id=@city_meta_id
and latitude>= @lat-@rounds and latitude<@lat+@rounds
and longitude>= @lng-@rounds and longitude<@lng+@rounds
and lable is NULL;
END;;
DELIMITER ;
调用上面过程的脚本
批量打标签第一版本
set @rownum=0;
select
concat('call set_lable(',longitude,',',latitude,',',0.006,',',3120,',',@rownum:=@rownum+1,');')
from clean_cinema_data_copy
where city_meta_id=3120
and longitude<>0.0;
-- 把此语句执行的结果复制到连接数据库的IDE里执行
根据经纬度范围打标签结果
最终的准确度在65%-75% 之间, 距离最终90%还有一定距离.
所以后面会加上一些brand的词库. 根据经纬度范围打过标签之后再根据brand这个维度再打一次.
上海市
千分之六:大于等于4家的是128个, 等于4家的是91个;
千分之三:大于等于4家的是139个, 等于4家的是113个;
千分之二:大于等于4家的是142个, 等于4家的是125个; SELECT 125*1.0/165 准确度:0.75758;
北京市
千分之六:大于等于4家的是110个, 等于4家的是83个; select 83*1.0/136 0.61029;
千分之三:大于等于4家的是107个, 等于4家的是92个; select 92*1.0/136 准确度: 0.67647;
千分之二:大于等于4家的是99个, 等于4家的是89个; SELECT 89*1.0/136 0.65441;
广州市
千分之六:大于等于4家的是78个, 等于4家的是58个; select 58*1.0/105 0.55238;
千分之三:大于等于4家的是79个, 等于4家的是68个; select 68*1.0/105 准确度:0.64762;
千分之二:大于等于4家的是76个, 等于4家的是66个; SELECT 66*1.0/105 0.62857;
根据经纬度范围和词库brand两个维度打标签的准确率
思路
根据两个经纬度打标签,打完标签, 本来5个的6个的7个的可能会分出来1,2,3条,
再加上一个维度打标签,完全为4个电影院的准确率为
上海市两个维度打标签 0.7879 上海由原来的0.75758 提升为0.75758
北京市两个维度打标签 0.7353 北京由原来的0.67647提升为0.7353
广州市两个维度打标签 0.6762 广州由原来的0.64762提升为0.6762
准确率还是不太高……
加维度
逻辑如下:
首先根据经纬度的范围打一次标签(把范围在200 米内的并且有4家[4个网站] 算为一个电影院) {结果集1}
再把范围200米内不为4家(有的比较密集,8家,12家)加上简单的品牌分词, 按 机器标签, 品牌分组 等于4个的集合 {结果集2}
再把城市所有数据 跟上面两个结果集的数据的交集求并集{结果集3}
贪心算法 应用到结果集合3, 从500米开始步长循环处理每次递减50米…把完全等于4的集合放入临时表…(这里会产生9个临时表)
创建过程把 集合1 U distinct {临时表1 U 临时表2 U 临时表 U 临时表3 U 临时表4 U 临时表5 U 临时表6 U 临时表7 U 临时表8 U 临时表9}
取出来就是该城市最终数据, 上面公式中文解释 9个临时表取并集 去除重复后 跟集合1 取交集…
打标签代码如下
打标签第二版本
DELIMITER ;;
CREATE PROCEDURE `set_lable1`(tablename VARCHAR(48),lng DOUBLE,lat DOUBLE,rounds DOUBLE,city_meta_id INT)
BEGIN
DECLARE a INT DEFAULT 1;
SET @tablename=tablename;
SET @lng=lng;
SET @lat=lat;
SET @rounds=rounds;
SET @city_meta_id=city_meta_id;
SET @v_sql=CONCAT('SELECT ifnull(max(lable)+1,1) INTO @nums FROM ',@tablename);
PREPARE stmt FROM @v_sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
SET @v_sql=CONCAT('UPDATE ',@tablename,' AS a
INNER JOIN bidding_city_data AS b
ON a.city_id=b.city_id
AND a.site_id=b.site_id
SET lable=',@nums,',',
' brand=LEFT(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(cinema_name,"电影院",""),"电影城",""),"影视城",""),"国际",""),"影院",""),"影城",""),"影视",""),city_name,""),"(",""),"(",""),")",""),")",""),"影剧院",""),"-","")," ",""),3)
WHERE longitude<>0.0
AND city_meta_id=',@city_meta_id,'
AND latitude>=', @lat-@rounds, ' AND latitude< ',@lat+@rounds,
' AND longitude>=', @lng-@rounds,' AND longitude< ',@lng+@rounds,
' AND lable IS NULL;');
PREPARE stmt FROM @v_sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
END;;
DELIMITER ;
批量打标签脚本
批量打标签第二版本
delimiter |
CREATE PROCEDURE batch_set_lable1(city INT,rounds DOUBLE,groups INT)
BEGIN
DECLARE done INT DEFAULT -1;
DECLARE lng DOUBLE;
DECLARE lat DOUBLE;
DECLARE cur CURSOR FOR SELECT longitude,latitude FROM clean_cinema_data_copy WHERE city_meta_id=city AND longitude<>0.0;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done=1;
OPEN cur;
read_loop:LOOP
FETCH cur INTO lng,lat;
IF done=1 THEN
LEAVE read_loop;
END IF;
CALL set_lable1('clean_cinema_data_copy',lng,lat,rounds,city);
END LOOP;
CLOSE cur;
-- 根据经纬度范围0.002打标签和
-- 每组不等于4个的数据再根据品牌分组等于4的集合,
-- 此集合与上海市的全部数据的差集,
-- 是我们后续需要缩小经纬度分析的集合
DROP TABLE IF EXISTS step_5_0;
CREATE TABLE step_5_0 AS
SELECT * FROM
(
SELECT a.* FROM clean_cinema_data_copy AS a
INNER JOIN
(
SELECT * FROM clean_cinema_data_copy
WHERE city_meta_id=city
GROUP BY `lable`
HAVING count(1)=groups
) AS b
ON a.`lable`=b.`lable`
AND a.city_meta_id=city
UNION
SELECT c.* FROM clean_cinema_data_copy AS c INNER JOIN
(
SELECT * FROM
(
SELECT a.* FROM clean_cinema_data_copy a
INNER JOIN
(
SELECT
lable,
id,
longitude,
latitude,
cinema_name,
cinema_meta_id,
brand,
COUNT(DISTINCT cinema_meta_id) cinemas,
COUNT(1) counts
FROM clean_cinema_data_copy
WHERE city_meta_id=city
AND longitude<>0
GROUP BY lable
HAVING count(1)<>groups
)b
ON a.lable=b.lable
AND a.brand=b.`brand`
GROUP BY lable,brand
HAVING count(1)=groups
)tb
GROUP BY lable,brand
)d
ON c.lable=d.`lable`
AND c.brand=d.brand
AND c.city_meta_id
)h;
处理500米内的数据
SET @rownum=0;
DROP TABLE IF EXISTS tmp_5_0;
CREATE TABLE tmp_5_0
SELECT *,@rownum:=@rownum+1 orders FROM clean_cinema_data_copy
WHERE `city_meta_id`=city
AND (longitude<>0.0 OR latitude<>0.0)
AND id NOT IN
(
SELECT id FROM step_5_0
);
UPDATE tmp_5_0 SET lable=NULL;
SET @loopstart=1;
SELECT @loopend:=max(orders) FROM tmp_5_0;
WHILE @loopstart<=@loopend DO
BEGIN
SELECT longitude,latitude INTO lng,lat FROM tmp_5_0 WHERE orders=@loopstart;
CALL set_lable1('tmp_5_0',lng,lat,0.005,city);
SET @loopstart=@loopstart+1;
END;
END WHILE;
处理450米内的数据
DROP TABLE IF EXISTS tmp_4_5;
CREATE TABLE tmp_4_5
SELECT *,@rownum:=@rownum+1 orders FROM clean_cinema_data_copy
WHERE `city_meta_id`=city
AND (longitude<>0.0 OR latitude<>0.0)
AND id NOT IN
(
SELECT id FROM step_5_0
);
UPDATE tmp_4_5 SET lable=NULL;
SET @loopstart=1;
SELECT @loopend:=max(orders) FROM tmp_4_5;
WHILE @loopstart<=@loopend DO
BEGIN
SELECT longitude,latitude INTO lng,lat FROM tmp_4_5 WHERE orders=@loopstart;
CALL set_lable1('tmp_4_5',lng,lat,0.0045,city);
SET @loopstart=@loopstart+1;
END;
END WHILE;
处理400米内的数据
DROP TABLE IF EXISTS tmp_4_0;
CREATE TABLE tmp_4_0
SELECT *,@rownum:=@rownum+1 orders FROM clean_cinema_data_copy
WHERE `city_meta_id`=city
AND (longitude<>0.0 OR latitude<>0.0)
AND id NOT IN
(
SELECT id FROM step_5_0
);
UPDATE tmp_4_0 SET lable=NULL;
SET @loopstart=1;
SELECT @loopend:=max(orders) FROM tmp_4_0;
WHILE @loopstart<=@loopend DO
BEGIN
SELECT longitude,latitude INTO lng,lat FROM tmp_4_0 WHERE orders=@loopstart;
CALL set_lable1('tmp_4_0',lng,lat,0.004,city);
SET @loopstart=@loopstart+1;
END;
END WHILE;
处理350米内的数据
DROP TABLE IF EXISTS tmp_3_5;
CREATE TABLE tmp_3_5
SELECT *,@rownum:=@rownum+1 orders FROM clean_cinema_data_copy
WHERE `city_meta_id`=city
AND (longitude<>0.0 OR latitude<>0.0)
AND id NOT IN
(
SELECT id FROM step_5_0
);
UPDATE tmp_3_5 SET lable=NULL;
SET @loopstart=1;
SELECT @loopend:=max(orders) FROM tmp_3_5;
WHILE @loopstart<=@loopend DO
BEGIN
SELECT longitude,latitude INTO lng,lat FROM tmp_3_5 WHERE orders=@loopstart;
CALL set_lable1('tmp_3_5',lng,lat,0.0035,city);
SET @loopstart=@loopstart+1;
END;
END WHILE;
处理300米内的数据
DROP TABLE IF EXISTS tmp_3_0;
CREATE TABLE tmp_3_0
SELECT *,@rownum:=@rownum+1 orders FROM clean_cinema_data_copy
WHERE `city_meta_id`=city
AND (longitude<>0.0 OR latitude<>0.0)
AND id NOT IN
(
SELECT id FROM step_5_0
);
UPDATE tmp_3_0 SET lable=NULL;
SET @loopstart=1;
SELECT @loopend:=max(orders) FROM tmp_3_0;
WHILE @loopstart<=@loopend DO
BEGIN
SELECT longitude,latitude INTO lng,lat FROM tmp_3_0 WHERE orders=@loopstart;
CALL set_lable1('tmp_3_0',lng,lat,0.003,city);
SET @loopstart=@loopstart+1;
END;
END WHILE;
处理250米内的数据
DROP TABLE IF EXISTS tmp_2_5;
CREATE TABLE tmp_2_5
SELECT *,@rownum:=@rownum+1 orders FROM clean_cinema_data_copy
WHERE `city_meta_id`=city
AND (longitude<>0.0 OR latitude<>0.0)
AND id NOT IN
(
SELECT id FROM step_5_0
);
UPDATE tmp_2_5 SET lable=NULL;
SET @loopstart=1;
SELECT @loopend:=max(orders) FROM tmp_2_5;
WHILE @loopstart<=@loopend DO
BEGIN
SELECT longitude,latitude INTO lng,lat FROM tmp_2_5 WHERE orders=@loopstart;
CALL set_lable1('tmp_2_5',lng,lat,0.005,city);
SET @loopstart=@loopstart+1;
END;
END WHILE;
处理200米内的数据
DROP TABLE IF EXISTS tmp_2_0;
CREATE TABLE tmp_2_0
SELECT *,@rownum:=@rownum+1 orders FROM clean_cinema_data_copy
WHERE `city_meta_id`=city
AND (longitude<>0.0 OR latitude<>0.0)
AND id NOT IN
(
SELECT id FROM step_5_0
);
UPDATE tmp_2_0 SET lable=NULL;
SET @loopstart=1;
SELECT @loopend:=max(orders) FROM tmp_2_0;
WHILE @loopstart<=@loopend DO
BEGIN
SELECT longitude,latitude INTO lng,lat FROM tmp_2_0 WHERE orders=@loopstart;
CALL set_lable1('tmp_2_0',lng,lat,0.002,city);
SET @loopstart=@loopstart+1;
END;
END WHILE;
处理150米内的数据
DROP TABLE IF EXISTS tmp_1_5;
CREATE TABLE tmp_1_5
SELECT *,@rownum:=@rownum+1 orders FROM clean_cinema_data_copy
WHERE `city_meta_id`=city
AND (longitude<>0.0 OR latitude<>0.0)
AND id NOT IN
(
SELECT id FROM step_5_0
);
UPDATE tmp_1_5 SET lable=NULL;
SET @loopstart=1;
SELECT @loopend:=max(orders) FROM tmp_1_5;
WHILE @loopstart<=@loopend DO
BEGIN
SELECT longitude,latitude INTO lng,lat FROM tmp_1_5 WHERE orders=@loopstart;
CALL set_lable1('tmp_1_5',lng,lat,0.0015,city);
SET @loopstart=@loopstart+1;
END;
END WHILE;
处理100米内的数据
DROP TABLE IF EXISTS tmp_1_0;
CREATE TABLE tmp_1_0
SELECT *,@rownum:=@rownum+1 orders FROM clean_cinema_data_copy
WHERE `city_meta_id`=city
AND (longitude<>0.0 OR latitude<>0.0)
AND id NOT IN
(
SELECT id FROM step_5_0
);
UPDATE tmp_1_0 SET lable=NULL;
SET @loopstart=1;
SELECT @loopend:=max(orders) FROM tmp_1_0;
WHILE @loopstart<=@loopend DO
BEGIN
SELECT longitude,latitude INTO lng,lat FROM tmp_1_0 WHERE orders=@loopstart;
CALL set_lable1('tmp_1_0',lng,lat,0.001,city);
SET @loopstart=@loopstart+1;
END;
END WHILE;
处理50米内的数据
DROP TABLE IF EXISTS tmp_0_5;
CREATE TABLE tmp_0_5
SELECT *,@rownum:=@rownum+1 orders FROM clean_cinema_data_copy
WHERE `city_meta_id`=city
AND (longitude<>0.0 OR latitude<>0.0)
AND id NOT IN
(
SELECT id FROM step_5_0
);
UPDATE tmp_0_5 SET lable=NULL;
SET @loopstart=1;
SELECT @loopend:=max(orders) FROM tmp_0_5;
WHILE @loopstart<=@loopend DO
BEGIN
SELECT longitude,latitude INTO lng,lat FROM tmp_0_5 WHERE orders=@loopstart;
CALL set_lable1('tmp_0_5',lng,lat,0.001,city);
SET @loopstart=@loopstart+1;
END;
END WHILE;
END |
DELIMITER ;
获取最终结果的过程
获取结果的过程
delimiter |
CREATE PROCEDURE cinema_result(groups INT)
begin
SELECT id,cinema_id,agent_id,cinema_name,area,addr,`area_name`,
tele,longitude,latitude,cinema_brand,url,score,service,city_id,site_id,
STATUS,`cinema_meta_id`,`unique_name`,concat('step_',lable) lable FROM step_5_0
UNION
SELECT * FROM
(
SELECT DISTINCT id,cinema_id,agent_id,cinema_name,area,addr,`area_name`,
tele,longitude,latitude,cinema_brand,url,score,service,city_id,site_id,
STATUS,`cinema_meta_id`,`unique_name`,lable FROM
(
SELECT * FROM tmp_5_0 WHERE lable IN
(
SELECT lable FROM tmp_5_0
GROUP BY lable
HAVING count(1)=groups
)
UNION
SELECT * FROM tmp_4_5 WHERE lable IN
(
SELECT lable FROM tmp_4_5
GROUP BY lable
HAVING count(1)=groups
)
UNION
SELECT * FROM tmp_4_0 WHERE lable IN
(
SELECT lable FROM tmp_4_0
GROUP BY lable
HAVING count(1)=groups
)
UNION
SELECT * FROM tmp_3_5 WHERE lable IN
(
SELECT lable FROM tmp_3_5
GROUP BY lable
HAVING count(1)=groups
)
UNION
SELECT * FROM tmp_3_0
WHERE lable IN
(
SELECT lable FROM tmp_3_0
GROUP BY lable
HAVING count(1)=groups
)
UNION
SELECT * FROM tmp_2_5
WHERE lable IN
(
SELECT lable FROM tmp_2_5
GROUP BY lable
HAVING count(1)=groups
)
UNION
SELECT * FROM tmp_2_0
WHERE lable IN
(
SELECT lable FROM tmp_2_0
GROUP BY lable
HAVING count(1)=groups
)
UNION
SELECT * FROM tmp_1_5
WHERE lable IN
(
SELECT lable FROM tmp_1_5
GROUP BY lable
HAVING count(1)=groups
)
UNION
SELECT * FROM tmp_1_0
WHERE lable IN
(
SELECT lable FROM tmp_1_0
GROUP BY lable
HAVING count(1)=groups
)
UNION
SELECT * FROM tmp_0_5
WHERE lable IN
(
SELECT lable FROM tmp_0_5
GROUP BY lable
HAVING count(1)=groups
)
)tb
)tb1
GROUP BY id,cinema_id,agent_id,cinema_name,area,addr,`area_name`,
tele,longitude,latitude,cinema_brand,url,score,service,city_id,site_id,
STATUS,`cinema_meta_id`,`unique_name`
ORDER BY lable;
END |
delimiter ;
过程调用方法
两个过程的使用方法
/
第一个参数是城市编号,
第二个参数是第一次打标签使用的范围值(此处0.003或0.002能筛选出的数据最多).
第三个参数:4 跟爬去的站点数对应 /CALL batch_set_lable1(3120,0.002,4);
/
参数的含义是分几个站,
跟爬去的站点数量对应.
从表中拿出最终结果. /CALL cinema_result(4);
用贪心算法得出北上广三个城市的正确率:
上海:86.67%
北京:85.29%
广州:78.09%
剩余的一些数据:需要人工比对…
根据贪心算法得出数据的准确率
(
此准确率是跟人工分组每组4个对比得出, 人工分组不等于4 小于4数据不完整,大于4此影院多拿一条数据
我暂时认为数据非法, 哪怕机器分组 3 个,5个的跟人工的完全一致 , 也视为非法
)