A very clearly written blog post, reposted here for the benefit of readers passing by. Translating the whole thing would be a lot of work, but if anyone needs a translation, feel free to leave a comment and we can translate and discuss it together.
Vincent McBurney (IBM Information Champion)
Every now and then I come across a blog entry that reminds me there are people out there who know a lot more about my niche than I do! This is fortunate as this week it has helped me understand ELT tools.
ETL versus ELT and ETLT
The world of data integration has its own Coke versus Pepsi challenge - it's called ETL versus ELT. Not as exciting as Aliens versus Predator, not as compelling as Ali versus Frazier and not as sexy as Erin Brockovich versus whatever company that was ... but an important battle if you are in charge of a multi-million-dollar IT budget.
ETL (Extract, Transform and Load) is the Coca-Cola in the challenge, with Informatica and DataStage the champions in terms of license fees and market share. It is made up of software that transforms and migrates data on most platforms, with or without source and target databases. Business Objects, SAS, Microsoft SSIS, Ab Initio and Cognos are also in the ETL corner.
ELT (Extract, Load and Transform) is the challenger and is now largely driven by RDBMS vendor Oracle with Oracle Warehouse Builder and Sunopsis. It consists of software that transforms and migrates data in a database engine, often by generating SQL statements and procedures and moving data between tables.
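To make that contrast concrete, here is a minimal sketch of the ELT pattern, with Python's built-in sqlite3 standing in for a real RDBMS purely for illustration; the staging and warehouse table names and the currency rule are invented, not taken from any of the tools above. Raw rows are landed first, then a generated SQL statement asks the database engine itself to do the transformation between tables.

    # Sketch of ELT: land raw data, then transform inside the database engine.
    # sqlite3 and the table names are illustrative assumptions only.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stg_orders (order_id INTEGER, amount TEXT, currency TEXT)")
    conn.execute("CREATE TABLE dw_orders (order_id INTEGER, amount_usd REAL)")

    # Extract + Load: rows are landed in a staging table unchanged.
    conn.executemany(
        "INSERT INTO stg_orders VALUES (?, ?, ?)",
        [(1, "19.99", "USD"), (2, "5.00", "EUR")],
    )

    # Transform: a generated SQL statement moves data between tables, so the
    # database engine, not the integration tool, does the heavy lifting.
    conn.execute("""
        INSERT INTO dw_orders (order_id, amount_usd)
        SELECT order_id,
               CASE currency
                   WHEN 'USD' THEN CAST(amount AS REAL)
                   ELSE CAST(amount AS REAL) * 1.1   -- illustrative exchange rate
               END
        FROM stg_orders
    """)
    conn.commit()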
ELT technology was constrained by what the database was capable of, and since many ELT tools came from RDBMS vendors they tended to be suitable for just one database platform, e.g. Oracle Warehouse Builder and Microsoft DTS. They were also lacking functionality, as the vendor was more concerned with building a database than an ELT tool. Sunopsis was an exception as an ELT tool not owned by an RDBMS vendor (until Oracle acquired them).
Informatica has recently moved into the ETLT (Extract, Transform, Load and Transform) area with database pushdown optimization. This is standard ETL delivering to a target database, plus some extra sexy moves as the data is transformed further into more tables. Microsoft SSIS also has good ETLT capabilities within the SQL Server database.
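As a rough illustration of the ETLT idea (and not of Informatica's or SSIS's actual pushdown features), the sketch below cleanses rows in the integration engine first, loads them to a staging table, and then pushes a second, set-based transformation down to the database as SQL. The table names and data are invented, and sqlite3 again stands in for the target RDBMS.

    # Sketch of ETLT: transform in the engine, load, then transform again via SQL.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stg_sales (sale_id INTEGER, region TEXT, amount REAL)")
    conn.execute("CREATE TABLE fact_sales (region TEXT, total REAL)")

    raw = [("1", "  north ", "10.5"), ("2", "SOUTH", "7.25"), ("3", " north", "4.0")]

    # First T, in the integration engine: cleanse and type the rows before loading.
    cleansed = [(int(i), r.strip().lower(), float(a)) for i, r, a in raw]
    conn.executemany("INSERT INTO stg_sales VALUES (?, ?, ?)", cleansed)

    # Second T, pushed down to the database: a set-based aggregate into another table.
    conn.execute(
        "INSERT INTO fact_sales SELECT region, SUM(amount) FROM stg_sales GROUP BY region"
    )
    conn.commit()
    print(conn.execute("SELECT * FROM fact_sales ORDER BY region").fetchall())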
Pros of each
I haven't had a lot of experience with ELT products, but fortunately Dan Linstedt from the B-Eye-Network blogs has been talking about this topic for years now, and his recent entry ETL, ELT - Challenges and Metadata has a great comparison. Here are his pros of each listed below; visit his blog for further discussion and the cons of each tool:
Pros:
* ETL can balance the workload / share the workload with the RDBMS
* ETL can perform more complex operations in single data flow diagrams (data maps)
* ETL can scale with separate hardware.
* ETL can handle Partitioning and parallelism independent of the data model, database layout, and source data model architecture.
* ETL can process data in-stream, as it transfers from source to target (see the sketch after this list)
* ETL does not require co-location of data sets in order to do its work.
* ETL captures huge amounts of metadata lineage today.
* ETL can run on SMP or MPP hardware
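The in-stream point above can be illustrated with a small generator pipeline. This is only a sketch of the idea, using a CSV extract and an in-memory SQLite target as stand-ins for real source and target systems, with a spot marked where a data-quality hook could sit; no vendor tool's API is implied.

    # Sketch of in-stream, row-by-row ETL: extract, transform and load in one pass.
    import csv
    import io
    import sqlite3

    # Toy "source system": in practice a file, queue or source database.
    source = io.StringIO("order_id,amount\n1,19.99\n2,bad\n3,5.00\n")

    def extract(fh):
        yield from csv.DictReader(fh)          # rows stream out one at a time

    def transform(rows):
        for row in rows:                       # transform while data is in flight,
            try:                               # with no staging table required
                yield (int(row["order_id"]), round(float(row["amount"]), 2))
            except ValueError:
                pass                           # a data-quality hook could reject or repair here

    target = sqlite3.connect(":memory:")
    target.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
    target.executemany("INSERT INTO orders VALUES (?, ?)", transform(extract(source)))
    target.commit()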
I would add data quality to this list. The ETL tools have a head start over ELT in terms of data quality integration, with Informatica and DataStage integrating closely. The row-by-row processing method of ETL works well with third-party products such as data quality or business rule engines.
And the Pros of ELT.
Pros:
* ELT leverages RDBMS engine hardware for scalability
* ELT keeps all data in the RDBMS all the time
* ELT is parallelized according to the data set, and disk I/O is usually optimized at the engine level for faster throughput.
* ELT Scales as long as the hardware and RDBMS engine can continue to scale.
* ELT can achieve 3x to 4x the throughput rates on an appropriately tuned MPP RDBMS platform.
I'm not sure whether that final point refers to data that is being processed from one RDBMS table to another on the same server or whether it also applies to cross database migrations. If you are moving between database platforms ELT may need to transfer the data and save it before it commences transformation whereas ETL can transfer and transform in the one stream.
Another Pro of ELT is that once the data is on the target platform it no longer places stress on the network; all further transformation is done on the RDBMS server. These Pros of ELT are why ETL tools have an ELT capability, whether the simple interface of DataStage for running user-defined SQL and procedures or the advanced capabilities of Informatica to generate SQL transformation code.
I agree that the ELT approach can be up to 10 times faster than the ETL approach, but it also has some cons:
1. ELT relies heavily on the performance and tuning of the RDBMS instance. If the instance is slow, ELT has nowhere to go! It will run only as fast as the RDBMS server allows.
2. ELT with huge batches of data can eat tremendous resources on an RDBMS server. If you're running extremely large data sets, you had better have a super-duty RDBMS engine, and it had better be water-cooled, twin engine, air-intake with overhead cam shaft. In other words, your DBAs have to be the cream of the crop and real whizzes at making your RDBMS hum.
3. Some ELT engines don't allow control over the "array batch size" within the RDBMS; this can easily blow out log segments, redo logs and temp space.
4. Some ETL vendors will tell you that their engine is an ELT engine; it only qualifies as one if it generates optimized native RDBMS SQL code with advanced functionality.
5. ELT MUST stage the data in order to run deltas. If a vendor claims in-memory delta processing, then they are an ETL engine, not an ELT engine (unless, again, they generate native RDBMS SQL code), in which case they might be an ETL-T engine (a new breed).
6. ELT software today usually doesn't have all the connectivity options that ETL has (but that will change soon).
7. ELT engines frequently stage to flat file for bulk-loader processes. If your ELT engine loads through an OS pipe, be careful! OS pipe sizes can be limited and become a bottleneck in the flow. In other words, loading through a pipe directly into an RDBMS bulk-load facility can be slower than staging to a flat file and blasting the bulk load with buffering mechanisms (see the sketch after this list).
8. ELT engines REQUIRE extra RDBMS space to transform data, particularly when dealing with VLDB (very large databases). Why? Because READS must be processed in a batch form, so they don't conflict with WRITES, especially if the machine itself or the RDBMS cannot show a linear performance increase with the increase in the size of the hardware.
9. ELT vendors (most of them) need to show their integration with business rules (this is where the EII vendors have really thrived lately, along with metadata).
10. If you can't tune your SQL, then you're better off with an ETL engine (today). ELT will require technicians with a high proficiency in SQL tuning and RDBMS tuning.
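Points 3 and 7 above can be tied together in a small sketch: stage the extract to a flat file, then load it in bounded batches so the commit size (the "array batch size") stays under control. SQLite, the file layout and the batch size here are only stand-ins for a real bulk-load facility and are not taken from the original post.

    # Sketch of flat-file staging plus a bounded "array batch size" on load.
    import csv
    import sqlite3
    import tempfile

    # Stage the extract to a flat file first, rather than streaming through a pipe.
    stage = tempfile.NamedTemporaryFile("w+", newline="", suffix=".csv")
    csv.writer(stage).writerows([(i, i * 1.5) for i in range(10_000)])
    stage.flush()
    stage.seek(0)

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE measures (id INTEGER, value REAL)")

    BATCH = 1_000              # bounded batch keeps log/redo/temp usage in check
    batch = []
    for row in csv.reader(stage):
        batch.append((int(row[0]), float(row[1])))
        if len(batch) == BATCH:
            conn.executemany("INSERT INTO measures VALUES (?, ?)", batch)
            conn.commit()      # commit per batch instead of one giant transaction
            batch.clear()
    if batch:                  # load whatever is left over in the final partial batch
        conn.executemany("INSERT INTO measures VALUES (?, ?)", batch)
        conn.commit()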