Importing large data files in Java

I recently had to import large data files in a project, so here is a write-up of the journey and the lessons learned.

The data file could be supplied in one of two formats, xls or txt, at 200 MB+.

Since I had previous experience with JXL and POI, I started with the xls file. But during implementation the JVM kept running out of heap space, and raising the heap size several times did not help.
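For context, the xls-based attempt looked roughly like the sketch below. This is a minimal reconstruction, not the exact project code; the class name and the column handling are placeholders.

import java.io.File;

import jxl.Sheet;
import jxl.Workbook;

public class XlsImportSketch {
    // Minimal reconstruction of the JXL-based attempt (not the exact project code).
    // Workbook.getWorkbook() parses the whole xls file into memory up front,
    // which is what exhausted the heap on a 200 MB+ file.
    public void importFromXls(String filePath) throws Exception {
        Workbook workbook = Workbook.getWorkbook(new File(filePath));
        Sheet sheet = workbook.getSheet(0);
        for (int row = 0; row < sheet.getRows(); row++) {
            for (int col = 0; col < sheet.getColumns(); col++) {
                String value = sheet.getCell(col, row).getContents();
                // ... bind value to the insert statement here ...
            }
        }
        workbook.close();
    }
}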

The reason is that JXL reads and parses the entire file into memory in one go. So I took a different route: use Java's most basic IO streams and do the parsing myself, one line at a time, inserting each row as it is read.

FileInputStream fis = null;
InputStreamReader isr = null;
BufferedReader br = null;
Connection conn = null;
PreparedStatement stmt = null;
try {
    Class.forName(jdbc_driver);
    conn = DriverManager.getConnection(jdbc_url, jdbc_user, jdbc_pwd);
    String sql = "insert into pmc values(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)";
    stmt = conn.prepareStatement(sql);

    String str = "";

    // plain IO streams: read the file line by line instead of loading it all at once
    fis = new FileInputStream(filePath);
    isr = new InputStreamReader(fis);
    br = new BufferedReader(isr);

    while ((str = br.readLine()) != null) {
        String[] rowData = str.split("\\|"); // fields are '|'-separated

        if (rowData.length >= 20) {
            for (int i = 0; i < 20; i++) {
                stmt.setString(i + 1, rowData[i]);
            }
            stmt.execute(); // one insert per line
        }
    }
    // catch/finally omitted here; see the complete version below

That solved the heap problem, but the import was still far too slow, so I switched to addBatch and inserted records in batches of 1000. The final code looks like this:

private static int batchsize = 1000;

public void importFormTxt(String filePath) {
    FileInputStream fis = null;
    InputStreamReader isr = null;
    BufferedReader br = null;
    Connection conn = null;
    PreparedStatement stmt = null;
    try {
        Class.forName(jdbc_driver);
        conn = DriverManager.getConnection(jdbc_url, jdbc_user, jdbc_pwd);
        String sql = "insert into pmc values(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)";
        stmt = conn.prepareStatement(sql);

        String str = "";

        fis = new FileInputStream(filePath);
        isr = new InputStreamReader(fis);
        br = new BufferedReader(isr);

        int rowNum = 0;   // total rows bound so far
        int batchNo = 1;  // number of the batch currently being filled
        long tmpT1 = System.currentTimeMillis();
        System.out.println("import PMC start at:"
                + (new SimpleDateFormat("yyyy.MM.dd HH:mm:ss")).format(tmpT1));

        while ((str = br.readLine()) != null) {
            String[] rowData = str.split("\\|"); // fields are '|'-separated

            if (rowData.length >= 20) {
                rowNum++;
                for (int i = 0; i < 20; i++) {
                    stmt.setString(i + 1, rowData[i]);
                }
                stmt.addBatch();
            }

            // flush every <batchsize> rows
            if (rowNum == batchNo * batchsize) {
                ++batchNo;
                stmt.executeBatch();
                System.out.println("inserted " + rowNum + " rows");
                stmt.clearBatch();
            }
        }
        // flush the final, partially filled batch
        if ((batchNo - 1) * batchsize < rowNum) {
            stmt.executeBatch();
            System.out.println("inserted " + rowNum + " rows");
            stmt.clearBatch();
        }
        long tmpT2 = System.currentTimeMillis();
        System.out.println("import PMC end at:"
                + (new SimpleDateFormat("yyyy.MM.dd HH:mm:ss")).format(tmpT2));
        System.out.println("use time:" + (tmpT2 - tmpT1) / 1000 + "s");

    } catch (FileNotFoundException e) {
        System.out.println("no file found");
    } catch (IOException e) {
        System.out.println("read file failure");
    } catch (ClassNotFoundException e) {
        e.printStackTrace();
    } catch (SQLException e) {
        e.printStackTrace();
    } finally {
        // null checks avoid a NullPointerException when an earlier step
        // (e.g. opening the file) failed before the resource was created
        try {
            if (br != null) br.close();
            if (isr != null) isr.close();
            if (fis != null) fis.close();
            if (stmt != null) stmt.close();
            if (conn != null) conn.close();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}
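One further tweak that is not in the code above, but worth noting: with auto-commit enabled (the JDBC default), how batched statements are committed is driver-dependent, and the usual recommendation for bulk inserts is to disable auto-commit and commit once per batch. A minimal sketch, assuming the same 20-column pmc table; the rows parameter here is a stand-in for the already-parsed lines.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchCommitSketch {
    // Optional tweak, not in the original code: with auto-commit disabled,
    // the driver no longer commits statement by statement, and we commit
    // once per executed batch instead.
    public void insertBatch(String jdbcUrl, String user, String pwd,
                            String[][] rows) throws Exception {
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, pwd);
             PreparedStatement stmt = conn.prepareStatement(
                     "insert into pmc values(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)")) {
            conn.setAutoCommit(false);
            int count = 0;
            for (String[] row : rows) {
                for (int i = 0; i < 20; i++) {
                    stmt.setString(i + 1, row[i]);
                }
                stmt.addBatch();
                if (++count % 1000 == 0) {
                    stmt.executeBatch();
                    conn.commit();   // one commit per 1000-row batch
                }
            }
            stmt.executeBatch();     // flush the final partial batch
            conn.commit();
        }
    }
}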
It suddenly struck me that the most basic Java can solve the most practical problems. Sometimes a third-party jar only makes things more complicated.

