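This post takes a closer look at a database homework assignment.
First, the assignment requirements:
Create a table named test with four columns.
The first column is id: primary key, auto-increment.
The second column is col1: a random one of Mike, Bob, Jack, Alice, Cathy, Ann, Betty, Cindy, Mary, Jane.
The third column is col2: a random 5-letter string, with letters restricted to a-e.
The fourth column is col3: a random integer between 1 and 20.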
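As a rough illustration of the table described above, the DDL could be kept in a Java string and run once before inserting. The column types (INT, VARCHAR(10), CHAR(5)) and the names TestTableDdl and createTable are assumptions made for this sketch; the assignment only names the columns.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch only: the column types are assumed, the assignment does not state them.
final class TestTableDdl {
    static final String CREATE_TEST_TABLE =
            "CREATE TABLE IF NOT EXISTS test ("
            + "id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, "
            + "col1 VARCHAR(10) NOT NULL, "
            + "col2 CHAR(5) NOT NULL, "
            + "col3 INT NOT NULL)";

    // Runs the DDL on an already-open connection and closes the statement.
    static void createTable(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute(CREATE_TEST_TABLE);
        }
    }
}
```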
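Insert 1 million records into the table as specified in step 1, and record the execution time.
Preprocessing the value ranges of the data to insert
(1) For col1, create an array holding the allowed values: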
private static String[] col1Values={"Mike","Bob","Jack","Alice","Cathy","Ann","Betty","Cindy","Mary","Jane"};
To pick a random value, just call col1Values[(int)(Math.random()*10)].
(2) For col2, build the array of allowed values recursively:
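private static String[] col2Values = new String[3125]; // 5^5 possible strings
private static int point; // next free slot in col2Values

static {
    point = 0;
    initCol2Value(5, new StringBuffer(""));
}

// Appends each of 'a'..'e' to str until n more letters have been added,
// then stores the finished 5-letter string in col2Values.
private static void initCol2Value(int n, StringBuffer str) {
    if (n == 0) {
        col2Values[point++] = new String(str);
        return;
    }
    for (int i = 0; i < 5; i++) {
        StringBuffer strTemp = new StringBuffer(str);
        initCol2Value(n - 1, strTemp.append((char) ('a' + i)));
    }
}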
To pick a random value, just call col2Values[(int)(Math.random()*3125)].
(3) For col3, a random value is simply (int)(Math.random()*20)+1.
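The three rules above can also be sketched without a precomputed table. The version below is an alternative, not the code used in this post: it reads an index in [0, 3125) as a 5-digit base-5 number, which gives the same ordering as the recursive array. The class and method names (RandomRow, col2FromIndex, and so on) are hypothetical.

```java
import java.util.Random;

// Sketch only: RandomRow and its method names are hypothetical helpers.
final class RandomRow {
    static final String[] COL1_VALUES = {"Mike", "Bob", "Jack", "Alice", "Cathy",
            "Ann", "Betty", "Cindy", "Mary", "Jane"};

    // Maps an index in [0, 3125) to a 5-letter string over a-e by treating
    // the index as a base-5 number, most significant digit first.
    static String col2FromIndex(int index) {
        char[] letters = new char[5];
        for (int pos = 4; pos >= 0; pos--) {
            letters[pos] = (char) ('a' + index % 5);
            index /= 5;
        }
        return new String(letters);
    }

    static String randomCol1(Random rnd) {
        return COL1_VALUES[rnd.nextInt(COL1_VALUES.length)];
    }

    static String randomCol2(Random rnd) {
        return col2FromIndex(rnd.nextInt(3125));
    }

    // nextInt(20) yields 0..19, so adding 1 gives 1..20 inclusive.
    static int randomCol3(Random rnd) {
        return rnd.nextInt(20) + 1;
    }
}
```

Passing a seeded Random (for example new Random(42)) makes a run repeatable, which can help when comparing timings.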
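Inserting a large amount of data
(1) The first approach that comes to mind is the traditional row-by-row insert: obtain a Statement from the Connection, call its execute method to run an SQL statement that inserts one row, and loop 1 million times. This is far too slow, though: every row is a separate statement that the server has to parse and execute, and I estimated it would take on the order of an hour.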
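Approach (1) was not written out in this post; a minimal sketch of what it might look like follows, assuming an already-open Connection. The names RowByRowInsert, singleRowSql and insertRows are hypothetical, and no timing is claimed for it.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch only: one SQL statement per row, the slow baseline.
final class RowByRowInsert {
    static final String[] COL1_VALUES = {"Mike", "Bob", "Jack", "Alice", "Cathy",
            "Ann", "Betty", "Cindy", "Mary", "Jane"};

    // Builds the SQL text for a single row. Concatenating values is only
    // tolerable here because they come from fixed, trusted lists.
    static String singleRowSql(String col1, String col2, int col3) {
        return "INSERT INTO test(col1,col2,col3) VALUES('" + col1 + "','" + col2 + "'," + col3 + ")";
    }

    // Executes one INSERT per row on the given connection.
    static void insertRows(Connection conn, int rows) throws SQLException {
        try (Statement st = conn.createStatement()) {
            for (int i = 0; i < rows; i++) {
                String col1 = COL1_VALUES[(int) (Math.random() * COL1_VALUES.length)];
                StringBuilder col2 = new StringBuilder();
                for (int k = 0; k < 5; k++) {
                    col2.append((char) ('a' + (int) (Math.random() * 5)));
                }
                int col3 = (int) (Math.random() * 20) + 1;
                st.execute(singleRowSql(col1, col2.toString(), col3));
            }
        }
    }
}
```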
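(2) Next I thought of precompiling the SQL statement with a PreparedStatement, which improved efficiency considerably. The code also turns off auto-commit so that all rows are committed in a single transaction. The core of that code is below.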
public static void insertData() {
    try {
        System.out.println("start insert data");
        long beginTime = System.currentTimeMillis();
        // Commit once at the end instead of after every row.
        conn.setAutoCommit(false);
        PreparedStatement pst = conn
                .prepareStatement("INSERT INTO test(col1,col2,col3) values(?,?,?)");
        for (int i = 1; i <= 1000000; i++) {
            pst.setString(1, col1Values[(int) (Math.random() * 10)]);
            pst.setString(2, col2Values[(int) (Math.random() * 3125)]);
            pst.setInt(3, (int) (Math.random() * 20) + 1);
            pst.execute(); // still one round trip per row
        }
        conn.commit();
        pst.close();
        long endTime = System.currentTimeMillis();
        System.out.println("end insert data");
        System.out.println("insert time: " + (double) (endTime - beginTime) / 1000 + " s");
        System.out.println();
    } catch (SQLException ce) {
        System.out.println(ce);
    }
}
Test results:
start insert data
end insert data
insert time: 110.215 s
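(3) I was still not quite satisfied with this result, so I started exploring.
(a) I found a method online: with the addBatch() and executeBatch() methods of PreparedStatement, 1,000 or even 10,000 SQL inserts can be submitted together as one batch within a transaction, and the author reported times under 10 s on an Oracle database. So I tried it as well, but it still took about 107 s, which left me puzzled.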
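The batched variant from (a) is not shown in this post; a minimal sketch of it might look like the following. The class name BatchInsertSketch, the method insertInBatches and its parameters are hypothetical. It takes an already-prepared statement for INSERT INTO test(col1,col2,col3) values(?,?,?) and flushes every batchSize rows.

```java
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch only: addBatch()/executeBatch() with a flush every batchSize rows.
final class BatchInsertSketch {
    static final String[] COL1_VALUES = {"Mike", "Bob", "Jack", "Alice", "Cathy",
            "Ann", "Betty", "Cindy", "Mary", "Jane"};

    // Random 5-letter string over a-e.
    static String randomLetters() {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < 5; k++) {
            sb.append((char) ('a' + (int) (Math.random() * 5)));
        }
        return sb.toString();
    }

    // Queues totalRows rows on pst and calls executeBatch() every batchSize
    // rows, plus once more for any remainder. Returns the number of flushes.
    static int insertInBatches(PreparedStatement pst, int totalRows, int batchSize)
            throws SQLException {
        int flushes = 0;
        for (int i = 1; i <= totalRows; i++) {
            pst.setString(1, COL1_VALUES[(int) (Math.random() * COL1_VALUES.length)]);
            pst.setString(2, randomLetters());
            pst.setInt(3, (int) (Math.random() * 20) + 1);
            pst.addBatch();
            if (i % batchSize == 0) {
                pst.executeBatch();
                flushes++;
            }
        }
        if (totalRows % batchSize != 0) {
            pst.executeBatch(); // flush the rows left over after the last full batch
            flushes++;
        }
        return flushes;
    }
}
```

With auto-commit turned off on the connection, the caller would still call commit() once after insertInBatches returns.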
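(b) Around then I came across another article explaining why batching does not help with MySQL's JDBC driver: according to it, the driver does not optimize addBatch()/executeBatch() (the calls are accepted, but by default the statements are still sent to the server one by one unless the rewriteBatchedStatements connection property is enabled), whereas Oracle does optimize them, and the article reports inserting 1 million records in about 360 ms there.
URL: http://elf8848.iteye.com/blog/770032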
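For reference, MySQL Connector/J exposes a connection property named rewriteBatchedStatements that asks the driver to rewrite batched INSERTs into multi-row statements. It was not tried in this experiment, so no timing is claimed for it. The helper below only shows how the property could be appended to a JDBC URL; JdbcUrlSketch and withBatchRewrite are hypothetical names.

```java
// Sketch only: appends the Connector/J property to an existing JDBC URL.
final class JdbcUrlSketch {
    // Uses '&' when the URL already has a query string, '?' otherwise.
    static String withBatchRewrite(String jdbcUrl) {
        String separator = jdbcUrl.contains("?") ? "&" : "?";
        return jdbcUrl + separator + "rewriteBatchedStatements=true";
    }
}
```

For example, withBatchRewrite("jdbc:mysql://localhost:3306/db_test") would be passed to DriverManager.getConnection in place of the plain URL.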
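(c) Later I read a blog post by our class monitor Ge (葛班长), who inserted 1 million rows into SQLite with Python in only 4 seconds. The reason is that all 1 million inserts were placed in a single transaction, which greatly reduces the time spent beginning and committing transactions, and that is exactly the part that consumes most of the time.
URL: http://aegiryy.net/?p=380
(d) This gave me an idea, and I also learned that in MySQL a single INSERT statement can insert multiple rows. So I tried building a large SQL statement with a StringBuffer, each one inserting 10,000 rows (with 100,000 or 1,000,000 rows per statement the heap memory limit is exceeded), so that 100 iterations complete the insert. The core code of this approach: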
public static void insertData() {
    try {
        System.out.println("start insert data");
        long beginTime = System.currentTimeMillis();
        Statement st = conn.createStatement();
        // 100 statements x 10,000 rows each = 1,000,000 rows
        for (int i = 0; i < 100; i++) {
            StringBuffer sqlBuffer = new StringBuffer(
                    "insert into test (col1,col2,col3) values");
            // The first row has no leading comma.
            sqlBuffer.append(" (\"" + col1Values[(int) (Math.random() * 10)]
                    + "\",\"" + col2Values[(int) (Math.random() * 3125)]
                    + "\"," + ((int) (Math.random() * 20) + 1) + ")");
            for (int j = 2; j <= 10000; j++) {
                sqlBuffer.append(" ,(\"" + col1Values[(int) (Math.random() * 10)]
                        + "\",\"" + col2Values[(int) (Math.random() * 3125)]
                        + "\"," + ((int) (Math.random() * 20) + 1) + ")");
            }
            sqlBuffer.append(";");
            String sql = new String(sqlBuffer);
            st.execute(sql);
        }
        st.close();
        long endTime = System.currentTimeMillis();
        System.out.println("end insert data");
        System.out.println("insert time: " + (double) (endTime - beginTime) / 1000 + " s");
        System.out.println();
    } catch (SQLException ce) {
        System.out.println(ce);
    }
}
Test results:
start insert data
end insert data
insert time: 15.083 s
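To put the two timings side by side: 110.215 s for the single-row PreparedStatement versus 15.083 s for the multi-row statements is a speedup of roughly 7.3 times. The small helper below just reproduces that arithmetic from the numbers reported in this post; InsertThroughput is a hypothetical name.

```java
// Sketch only: turns the measured times into a speedup and rows per second.
final class InsertThroughput {
    static double rowsPerSecond(long rows, double seconds) {
        return rows / seconds;
    }

    static double speedup(double baselineSeconds, double improvedSeconds) {
        return baselineSeconds / improvedSeconds;
    }

    public static void main(String[] args) {
        long rows = 1_000_000L;
        System.out.printf("single-row: %.0f rows/s%n", rowsPerSecond(rows, 110.215));
        System.out.printf("multi-row:  %.0f rows/s%n", rowsPerSecond(rows, 15.083));
        System.out.printf("speedup:    %.2fx%n", speedup(110.215, 15.083));
    }
}
```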
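(e) Finally, I thought of refining this approach by using a prepared statement, which improves both readability and efficiency, although the efficiency gain is small. The core code of this approach: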
public static void insertData() {
    try {
        conn.setAutoCommit(false);
        // Build one INSERT with 10,000 "(?,?,?)" groups and prepare it once.
        StringBuffer sqlBuffer = new StringBuffer(
                "insert into test (col1,col2,col3) values");
        sqlBuffer.append("(?,?,?)");
        for (int j = 2; j <= 10000; j++) {
            sqlBuffer.append(",(?,?,?)");
        }
        sqlBuffer.append(";");
        String sql = new String(sqlBuffer);
        PreparedStatement pst = conn.prepareStatement(sql);
        System.out.println("start insert data");
        long beginTime = System.currentTimeMillis();
        for (int i = 0; i < 100; i++) {
            // Row j uses the 1-based parameter indexes 3j+1, 3j+2, 3j+3.
            for (int j = 0; j < 10000; j++) {
                pst.setString(3 * j + 1, col1Values[(int) (Math.random() * 10)]);
                pst.setString(3 * j + 2, col2Values[(int) (Math.random() * 3125)]);
                pst.setInt(3 * j + 3, (int) (Math.random() * 20) + 1);
            }
            pst.execute();
        }
        conn.commit();
        pst.close();
        long endTime = System.currentTimeMillis();
        System.out.println("end insert data");
        System.out.println("insert time: " + (double) (endTime - beginTime) / 1000 + " s");
        System.out.println();
    } catch (SQLException ce) {
        System.out.println(ce);
    }
}
Test results:
start insert data
end insert data
insert time: 14.47 s
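The placeholder string and the 3 * j + k index arithmetic used in (e) can be pulled into small helpers, which also makes it easy to prepare a second, shorter statement when the total row count is not a multiple of the chunk size. This is a sketch under that assumption; MultiRowInsertSql, build and paramIndex are hypothetical names.

```java
// Sketch only: builds the multi-row INSERT text and computes parameter indexes.
final class MultiRowInsertSql {
    // Returns "insert into test (col1,col2,col3) values(?,?,?),(?,?,?)..."
    // with exactly `rows` placeholder groups.
    static String build(int rows) {
        if (rows < 1) {
            throw new IllegalArgumentException("rows must be >= 1");
        }
        StringBuilder sql = new StringBuilder("insert into test (col1,col2,col3) values");
        for (int j = 0; j < rows; j++) {
            sql.append(j == 0 ? "(?,?,?)" : ",(?,?,?)");
        }
        return sql.toString();
    }

    // 1-based JDBC parameter index for column `col` (0..2) of row `row` (0-based).
    static int paramIndex(int row, int col) {
        return 3 * row + col + 1;
    }
}
```

With 10,000 rows per statement this gives 30,000 parameters, and the last one, paramIndex(9999, 2), is 30000.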
Finally, here is the complete code of the final solution:
package godfrey.nju;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class TestDB2 {
    private static String dbClassName = "com.mysql.jdbc.Driver";
    private static String dbUrl = "jdbc:mysql://localhost:3306/db_test";
    private static String dbUser = "root";
    private static String dbPwd = "123";
    private static Connection conn = null;
    private static String[] col1Values = { "Mike", "Bob", "Jack", "Alice",
            "Cathy", "Ann", "Betty", "Cindy", "Mary", "Jane" };
    private static String[] col2Values = new String[3125];
    private static int point;

    public static void main(String args[]) {
        insertData();
        // query1();
        // clearData();
    }

    public static void insertData() {
        try {
            conn.setAutoCommit(false);
            StringBuffer sqlBuffer = new StringBuffer(
                    "insert into test (col1,col2,col3) values");
            sqlBuffer.append("(?,?,?)");
            for (int j = 2; j <= 10000; j++) {
                sqlBuffer.append(",(?,?,?)");
            }
            sqlBuffer.append(";");
            String sql = new String(sqlBuffer);
            PreparedStatement pst = conn.prepareStatement(sql);
            System.out.println("start insert data");
            long beginTime = System.currentTimeMillis();
            for (int i = 0; i < 100; i++) {
                for (int j = 0; j < 10000; j++) {
                    pst.setString(3 * j + 1, col1Values[(int) (Math.random() * 10)]);
                    pst.setString(3 * j + 2, col2Values[(int) (Math.random() * 3125)]);
                    pst.setInt(3 * j + 3, (int) (Math.random() * 20) + 1);
                }
                pst.execute();
            }
            conn.commit();
            pst.close();
            long endTime = System.currentTimeMillis();
            System.out.println("end insert data");
            System.out.println("insert time: " + (double) (endTime - beginTime) / 1000 + " s");
            System.out.println();
        } catch (SQLException ce) {
            System.out.println(ce);
        }
    }

    public static void query1() {
        try {
            System.out.println("start query1: 'select count(*) from test group by col1 order by count(*);'");
            long beginTime = System.currentTimeMillis();
            Statement st = conn.createStatement();
            String sql = "select count(*) from test group by col1 order by count(*);";
            ResultSet rs = st.executeQuery(sql);
            long endTime = System.currentTimeMillis();
            System.out.println("result:");
            while (rs.next()) {
                System.out.println(rs.getInt(1));
            }
            System.out.println("query1 time: " + (double) (endTime - beginTime) / 1000 + " s");
            st.close();
            conn.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void clearData() {
        try {
            System.out.println("start delete all data");
            long beginTime = System.currentTimeMillis();
            Statement st = conn.createStatement();
            String sql = "delete from test";
            st.execute(sql);
            st.close();
            conn.close();
            long endTime = System.currentTimeMillis();
            System.out.println("end delete all data");
            System.out.println("delete time: " + (double) (endTime - beginTime) / 1000 + " s");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    static {
        try {
            Class.forName(dbClassName).newInstance();
            conn = DriverManager.getConnection(dbUrl, dbUser, dbPwd);
        } catch (Exception e) {
            e.printStackTrace();
        }
        point = 0;
        initCol2Value(5, new StringBuffer(""));
    }

    private static void initCol2Value(int n, StringBuffer str) {
        if (n == 0) {
            col2Values[point++] = new String(str);
            return;
        }
        for (int i = 0; i < 5; i++) {
            StringBuffer strTemp = new StringBuffer(str);
            initCol2Value(n - 1, strTemp.append((char) ('a' + i)));
        }
    }
}