bulk insert

 

http://dev.mysql.com/tech-resources/articles/xml-in-mysql5.1-6.0.html

http://dev.mysql.com/doc/refman/5.1/en/load-data.html

 

MySQL的批插入 BULK INSERT和load data的速度差不多,并且可靠。

语法如下

假设有表test (ID NUMBER,NAME VARCHAR(10))

insert into test values(1,'aa'),(2,'bb'),.....(100000,'bb');

也就是每个记录间用逗号隔开,最后用分号结束,插入10万条记录只需不到10秒

 

Multiple Inserts with MySQL

Andy Jarrett posted this technique for insterting multiple rows (bulk insert) in one SQL statement a few months ago. It's a handy trick:

INSERT INTO x (a,b)
VALUES 
 ('1', 'one'),
 ('2', 'two'),
 ('3', 'three')

I tried that on a couple hundred thousand rows today, and I got an error that my statement was bigger than the Max Allowed Packet size in MySQL. So keep that in mind when using. You can either change the setting, or go about it a different way.

 

http://metabetageek.com/2010/02/04/learning-mysql-find-in-set-and-bulk-insert-options/

Bulk INSERTs FTW

A short while ago, I had to research some API for a company I’m consulting for. This API yields very good quality data, but isn’t comfortable enough to process it for further research.
The obvious solution was to dump this data into some kind of database, and process it there.
Our first attempt was pickle files. It worked nicely enough, but when the input data was 850 megs, it died horribly with a memory error.

(It should be mentioned that just starting to work with the API costs about a 1.2 gigs of RAM.)

Afterwards, we tried sqlite, with similar results. After clearing it of memory errors, the code (sqlite + sqlalchemy + our code) was still not stable, and apart from that, dumping the data took too much time.

We decided that we needed some *real* database engine, and we arranged to get some nice sql-server with plenty of RAM and CPUs. We used the same sqlalchemy code, and for smaller sized inputs (a few megs) it worked very well. However, for our real input the processing, had it not died in a fiery MemoryError (again!) would have taken more than two weeks to finish.

(As my defense regarding the MemoryError I’ll add that we added an id cache for records, to try and shorten the timings. We could have avoided this cache and the MemoryError, but the timings would have been worse. Not to mention that most of the memory was taken by the API…)

At this point, we asked for help from someone who knows *a little bit* more about databases than us, and he suggested bulk inserts.

The recipe is simple: dump all your information into a csv file (tabs and newlines as delimiters).
Then do BULK INSERT, and a short while later, you’ll have your information inside.
We implemented the changes, and some tens of millions of records later, we had a database full of interesting stuff.

My suggestion: add FTW as a possible extension for the bulk insert syntax. It won’t do anything, but it will certainly fit.

Tags: 

Related posts:

One Response to “Bulk INSERTs FTW”

  1. budowski Says: 

    In your case (really large CSV files), the BULK INSERT process probably takes up a lot of time.
    You could speed up the process by hundreds of percents – instead of using CSV files, you could use a binary dump, where each record and column is saved in a binary format.
    The BULK INSERT command is given a format file (bcp format file – FORMATFILE parameter), that defines how each record looks like.

    I know that in a project I worked on, it dramatically increased processing speed (we processed tens of GBs of information).

    More information on the bcp format files:
    http://msdn.microsoft.com/en-us/library/ms191516.aspx
    http://msdn.microsoft.com/en-us/library/ms191479.aspx

    (Note: BULK INSERT can be also be executed using the bcp command line utility)

 

你可能感兴趣的:(insert)