Background: there is a scheduled job that runs every day in the small hours and backs up one table's data into another; the volume is roughly 2,000,000 rows. The operation itself is simple: select the data from the source table, then insert it into the target table. Given the data volume, the road to optimization was inevitable from the start.
Step 1
With a single thread reading and writing row by row, there is no telling how long the job would run; nobody would seriously write it that way. So the first idea was to fetch the data in batches and hand each batch to a thread for writing. Fetching in batches works like data pagination: each page of rows fetched is submitted to a worker thread. One thread fetches pages; multiple threads write them. The fetch SQL is as follows:
SELECT ID, AccountID, AccountType, AccountSubject, Credit, Debit, STATUS, IsBuffer,
       AccountDirection, Balance, FrozenBalance, ADDTIME, UpdateTime
FROM Account
WHERE AccountType = 2
ORDER BY ID
LIMIT #offSet#, #batchSize#;
Writing the data is a single insert statement, omitted here.
At first glance the select statement looks fine; it is the standard pattern found all over the web. In practice, though, once offSet reached 1800000 the statement took about 1600 ms to execute, which is extremely slow. (The reason, in one sentence: with LIMIT offset, size MySQL still has to walk past and discard the first offset rows before it can return the page.) So on to the first round of SQL tuning.
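A rough back-of-the-envelope sketch makes the cost concrete. Assuming the server touches offset + batchSize rows per page, the total work over the whole table grows quadratically with table size, while keyset paging (used in Step 2) touches each row only once. A minimal Java illustration of that arithmetic:

```java
public class OffsetCost {
    // Rows the server must touch to serve one offset-based page:
    // it scans and discards `offset` rows, then returns `batch` rows.
    static long rowsTouched(long offset, long batch) {
        return offset + batch;
    }

    public static void main(String[] args) {
        long total = 2_000_000L, batch = 200;
        long touchedByOffsetPaging = 0;
        for (long offset = 0; offset < total; offset += batch) {
            touchedByOffsetPaging += rowsTouched(offset, batch);
        }
        // Keyset paging (WHERE id > lastId ... LIMIT batch) touches each row once.
        long touchedByKeysetPaging = total;
        System.out.println("offset paging touches: " + touchedByOffsetPaging);
        System.out.println("keyset paging touches: " + touchedByKeysetPaging);
    }
}
```

For 2,000,000 rows in pages of 200, offset paging touches about 10 billion rows in total versus 2 million for keyset paging, which is why the deep pages above were so slow.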
Step 2
Suppose the 1,799,999th record returned has id 1825000. Then the next page can be fetched with the following SQL instead:
SELECT ID, AccountID, AccountType, AccountSubject, Credit, Debit, STATUS, IsBuffer,
       AccountDirection, Balance, FrozenBalance, ADDTIME, UpdateTime
FROM TS_Account
WHERE AccountType = 2 AND ID > 1825000
ORDER BY ID
LIMIT 200;
This version executes in about 5 ms, a dramatic improvement. The next question is how to write the data-reading code.
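The keyset idea (carry the biggest id of the previous page into the next query) can be exercised on its own before wiring it into the DAO. A minimal in-memory simulation, with a hypothetical fetchPage standing in for the real query:

```java
import java.util.ArrayList;
import java.util.List;

public class KeysetPagingDemo {
    // Simulates: SELECT id FROM t WHERE id > lastId ORDER BY id LIMIT size
    static List<Integer> fetchPage(List<Integer> sortedIds, int lastId, int size) {
        List<Integer> page = new ArrayList<>();
        for (int id : sortedIds) {
            if (id > lastId) {
                page.add(id);
                if (page.size() == size) break;
            }
        }
        return page;
    }

    public static void main(String[] args) {
        // IDs with gaps, as in a real table after deletes
        List<Integer> ids = new ArrayList<>();
        for (int i = 1; i <= 1000; i++) if (i % 3 != 0) ids.add(i);

        int lastBiggestId = 0, pages = 0, seen = 0;
        List<Integer> page;
        while (!(page = fetchPage(ids, lastBiggestId, 200)).isEmpty()) {
            // Carry the max id of this page into the next fetch
            lastBiggestId = page.get(page.size() - 1);
            seen += page.size();
            pages++;
        }
        System.out.println(pages + " pages, " + seen + " rows");
    }
}
```

Note that the loop works even though the ids have gaps: the cursor is the last id actually seen, not an arithmetic offset.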
<select id="findAllMerchantAccount" parameterClass="map" resultMap="accountData">
    SELECT ID, AccountID, AccountType, AccountSubject, Credit, Debit, Status, IsBuffer,
           AccountDirection, Balance, FrozenBalance, AddTime, UpdateTime
    FROM TS_Account
    WHERE AccountType = 2 AND id > #lastBiggestId#
    ORDER BY ID
    LIMIT #size#;
</select>
public void execute() {
    // Thread count, default 30
    int threadNum = 30;
    String strThreadNum = LionConfigUtils.getProperty(
            "ts-monitor-job.dailyJob.accountBalanceDailyCheckerThreadNum", "");
    if (isNumeric(strThreadNum)) {
        threadNum = Integer.parseInt(strThreadNum);
    }
    monitorLogger.info(String.format("BATCH_SIZE:%s, Thread number:%s", BATCH_SIZE, threadNum));
    ExecutorService service = Executors.newFixedThreadPool(threadNum);

    // Total number of accounts
    int accountCount = accountDao.findAllMerchantAccountCount();
    int latchCount = accountCount % BATCH_SIZE == 0
            ? accountCount / BATCH_SIZE
            : accountCount / BATCH_SIZE + 1;
    CountDownLatch latch = new CountDownLatch(latchCount);

    Date bizDate = DateUtils.addDate(DateUtils.removeTime(new Date()), -1);
    Date lastBizDate = DateUtils.addDate(bizDate, -1);

    // Starting offset
    int offset = 0;
    // Biggest ID seen in the previous page
    int lastBiggestId = 0;
    List<AccountData> accountDataList = null;
    while (offset < accountCount) {
        accountDataList = accountDao.findAllMerchantAccount(offset, BATCH_SIZE, lastBiggestId);
        // Guard against an empty page as well as null, so get(size - 1) below cannot fail
        if (accountDataList == null || accountDataList.isEmpty()) {
            break;
        }
        // Remember the biggest id of this batch for the next fetch (keyset pagination)
        lastBiggestId = accountDataList.get(accountDataList.size() - 1).getId();
        // Advance to the next page
        offset += BATCH_SIZE;
        service.submit(new DoAccountMonitorThread(accountDataList, bizDate, lastBizDate,
                monitorAccountBalanceDao, accountEntryDao, latch));
    }
    try {
        latch.await();
    } catch (InterruptedException e) {
        monitorLogger.error("CountDownLatch.await() error:" + e);
    }
}
The writes run on worker threads. The run method of DoAccountMonitorThread:
public void run() {
    try {
        List<MonitorAccountBalanceData> monitorAccountBalanceDataList =
                new ArrayList<MonitorAccountBalanceData>();
        int i = 0;
        for (AccountData accountData : accountDataList) {
            i++;
            monitorAccountBalanceDataList.add(
                    buildMonitorAccountBalanceData(bizDate, lastBizDate, accountData));
            // Insert at most 200 rows per statement
            if (i % 200 == 0) {
                try {
                    // Back off briefly to ease pressure on the database
                    Thread.sleep(20);
                } catch (Exception e) {
                    statisticsLogger.error("Thread sleep error:", e);
                }
                // Batch-insert the rows into the backup table
                monitorAccountBalanceDao.batchInsertDailyAccountBalance(monitorAccountBalanceDataList);
                // Clear the list for the next 200 rows
                monitorAccountBalanceDataList.clear();
            }
        }
        // If accountDataList is not a multiple of 200 (e.g. 450),
        // the remaining rows (the last 50) are inserted separately
        if (monitorAccountBalanceDataList.size() > 0) {
            monitorAccountBalanceDao.batchInsertDailyAccountBalance(monitorAccountBalanceDataList);
        }
    } finally {
        // Always count down, even on failure, so execute() never hangs in latch.await()
        latch.countDown();
    }
}
When writing, executing one insert per record would be painfully slow, so each insert statement here carries multiple rows.
<insert id="batchInsertDailyAccountBalance" parameterClass="java.util.Map">
    <![CDATA[
        INSERT INTO TS_DailyAccountBalance
            (BizDate, AccountID, AccountSubject, CreditAmount, DebitAmount, AccountAmount,
             YesterdayBalance, CurrentBalance, VarianceAmount, Memo, Status, AddTime, UpdateTime)
        VALUES
    ]]>
    <iterate property="monitorAccountBalanceDataList" conjunction=",">
        <![CDATA[(
            #monitorAccountBalanceDataList[].bizDate#,
            #monitorAccountBalanceDataList[].accountId#,
            #monitorAccountBalanceDataList[].accountSubject#,
            #monitorAccountBalanceDataList[].creditAmount#,
            #monitorAccountBalanceDataList[].debitAmount#,
            #monitorAccountBalanceDataList[].accountAmount#,
            #monitorAccountBalanceDataList[].yesterdayBalance#,
            #monitorAccountBalanceDataList[].currentBalance#,
            #monitorAccountBalanceDataList[].varianceAmount#,
            #monitorAccountBalanceDataList[].memo#,
            #monitorAccountBalanceDataList[].status#,
            NOW(),
            NOW()
        )]]>
    </iterate>
</insert>
The generated SQL looks like:
INSERT INTO table (field1, field2, field3) VALUES ('a', 'b', 'c'), ('a', 'b', 'c'), ('a', 'b', 'c');
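Outside iBATIS, the same multi-row statement can be assembled for a plain JDBC PreparedStatement. A minimal sketch (table and column names here are just placeholders) that builds the parameterized SQL:

```java
import java.util.Collections;

public class MultiRowInsertSql {
    // Builds: INSERT INTO t (c1, c2, ...) VALUES (?, ?, ...), (?, ?, ...), ...
    // so that one PreparedStatement carries `rows` rows.
    static String build(String table, String[] columns, int rows) {
        String placeholders = "(" + String.join(", ",
                Collections.nCopies(columns.length, "?")) + ")";
        return "INSERT INTO " + table + " (" + String.join(", ", columns) + ") VALUES "
                + String.join(", ", Collections.nCopies(rows, placeholders));
    }

    public static void main(String[] args) {
        System.out.println(build("TS_DailyAccountBalance",
                new String[]{"BizDate", "AccountID", "Status"}, 3));
    }
}
```

With MySQL Connector/J, a similar effect can also be had by calling addBatch/executeBatch with rewriteBatchedStatements=true on the connection URL, which rewrites batched single-row inserts into one multi-row insert on the wire.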
That wraps up this round of tuning; the effect will be verified once it ships. Before optimization, a single run took three to four hours.
Summary:
1. Optimize deep pagination queries on large data sets (keyset pagination)
2. Fetch the data in batches and insert it with multiple threads
3. Insert multiple rows per insert statement
PS: tested in the staging environment: with 10 writer threads and 1,500,000 rows, the job took about 12 minutes.