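In The Data Warehouse ETL Toolkit, Ralph Kimball proposes the ECCD (Extract-Clean-Conform-Deliver) architecture. This article walks through those four steps: the source system delivers a data file over FTP; an external table is built on that file with the ORACLE_LOADER access driver (extract); the rows are validated against the standard reference tables in the database and written to the corresponding staging tables (clean and conform); and the processed result is written to a target file through the ORACLE_DATAPUMP access driver (deliver).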
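Preparation
Before using external tables, first create the DIRECTORY objects, and grant the appropriate privileges to the user who will work with the external tables.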
SQL> grant connect, dba to stenny identified by stenny;

Grant succeeded.

SQL> CREATE OR REPLACE DIRECTORY source_dir AS 'C:\oracle\oradata\source';  -- source file directory

Directory created.

SQL> CREATE OR REPLACE DIRECTORY target_dir AS 'C:\oracle\oradata\target';

Directory created.

SQL> CREATE OR REPLACE DIRECTORY log_dir AS 'C:\oracle\oradata\log';

Directory created.

SQL> grant read on directory source_dir to stenny;

Grant succeeded.

SQL> grant write on directory target_dir to stenny;

Grant succeeded.

SQL> grant write on directory log_dir to stenny;

Grant succeeded.

Data extraction: ORACLE_LOADER
SQL> CREATE TABLE stenny_ext_product
(
  product_id   NUMBER(4),
  product_name VARCHAR2(20),
  location     VARCHAR2(25)
)
ORGANIZATION EXTERNAL
(
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY source_dir
  ACCESS PARAMETERS
  (
    records delimited by newline
    badfile log_dir:'bad_product.dat'
    logfile log_dir:'product.log'
    fields terminated by ','
    missing field values are null
    ( product_id, product_name, location )
  )
  LOCATION ('product1.dat')
)
REJECT LIMIT UNLIMITED
/

Table created.

SQL> select * from stenny_ext_product;

PRODUCT_ID PRODUCT_NAME         LOCATION
---------- -------------------- -------------------------
         1 Bicycle              JiangSu
         2 Camps                ZheJiang
         3 Wearings             SiChuan
         4 Gloves               SiChuan
         5 Food                 YunNan
         6 Shoes                NULL

6 rows selected.

With the steps above, a flat file outside the database is mapped to the database table STENNY_EXT_PRODUCT through the ORACLE_LOADER access driver. The external table supports read-only operations such as sorting and joins.
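The read-only operations mentioned above might look like the following. This is a minimal sketch, assuming the LOC_STD reference table used by this article already exists; the column alias product_count is only illustrative.

```sql
-- Aggregate and sort directly on the external table
SELECT location, COUNT(*) AS product_count
FROM   stenny_ext_product
GROUP  BY location
ORDER  BY product_count DESC, location;

-- Outer join to the reference table to preview which rows
-- have no matching standard location (loc_id comes back NULL)
SELECT e.product_id, e.product_name, e.location, l.loc_id
FROM   stenny_ext_product e
LEFT   JOIN loc_std l ON l.loc_name = e.location
ORDER  BY e.product_id;

-- DML is not allowed on an ORACLE_LOADER external table;
-- a statement like the one below is rejected (ORA-30657)
-- DELETE FROM stenny_ext_product;
```

Each query re-reads product1.dat, so results always reflect the current contents of the file.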
-- proc_txn_product
CREATE OR REPLACE PROCEDURE proc_txn_product AS
BEGIN
  -- Conform: keep only rows whose location matches a standard location
  INSERT INTO stg_product
    SELECT product_id, product_name, loc_id
    FROM   stenny_ext_product, loc_std
    WHERE  loc_std.loc_name = stenny_ext_product.location;

  -- Clean: route every row that was not loaded above to the exception table
  INSERT INTO stg_excep
    SELECT *
    FROM   stenny_ext_product
    WHERE  product_id NOT IN (SELECT product_id FROM stg_product);

  COMMIT;
END proc_txn_product;
/
SQL> exec proc_txn_product;

PL/SQL procedure successfully completed.

SQL> select * from stg_product;

PRODUCT_ID PRODUCT_NAME             LOC_ID
---------- -------------------- ----------
         1 Bicycle                       1
         2 Camps                         2
         3 Wearings                      3
         4 Gloves                        3
         5 Food                          4

SQL> select * from stg_excep;

PRODUCT_ID PRODUCT_NAME         LOCATION
---------- -------------------- -------------------------
         6 Shoes                NULL
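A quick reconciliation can confirm that every extracted row ended up in exactly one of the two staging tables. The query below is a sketch added for illustration, not part of the original procedure; the column aliases are arbitrary.

```sql
-- Row counts for the three stages; after a single run on empty
-- staging tables, extracted should equal conformed + rejected
SELECT (SELECT COUNT(*) FROM stenny_ext_product) AS extracted,
       (SELECT COUNT(*) FROM stg_product)        AS conformed,
       (SELECT COUNT(*) FROM stg_excep)          AS rejected
FROM   dual;
```

Note that proc_txn_product only inserts, so the counts balance only if the staging tables were empty before the run.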
SQL> CREATE TABLE tgt_product
ORGANIZATION EXTERNAL
(
  TYPE ORACLE_DATAPUMP
  DEFAULT DIRECTORY target_dir
  LOCATION ('tgt_product.dmp')
)
PARALLEL 2
AS
SELECT product_id,
       product_name,
       loc_id
FROM   stg_product
/

Table created.

SQL> select * from tgt_product;

PRODUCT_ID PRODUCT_NAME             LOC_ID
---------- -------------------- ----------
         1 Bicycle                       1
         2 Camps                         2
         3 Wearings                      3
         4 Gloves                        3
         5 Food                          4
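On the receiving side, the delivered file can be read back by declaring another ORACLE_DATAPUMP external table over it, this time without the AS SELECT clause. The sketch below rests on assumptions: tgt_product_reader is a hypothetical table name, and the consuming database is assumed to have a target_dir directory object pointing at the folder that holds a copy of tgt_product.dmp. The column definitions have to match the ones stored in the dump file, here taken from STG_PRODUCT.

```sql
-- Hypothetical reader table on the consuming database
CREATE TABLE tgt_product_reader
(
  product_id   NUMBER,
  product_name VARCHAR2(20),
  loc_id       NUMBER
)
ORGANIZATION EXTERNAL
(
  TYPE ORACLE_DATAPUMP
  DEFAULT DIRECTORY target_dir      -- assumed to exist on this database
  LOCATION ('tgt_product.dmp')
);

-- The file is only opened when the table is queried
SELECT * FROM tgt_product_reader;
```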
create or replace procedure proc_file_watcher is
  v_exists      boolean;
  v_file_length number;
  v_blocksize   number;
begin
  <<L_sleeping_child>>
  if to_char(sysdate, 'hh24') >= '08' then
    -- Timed out; UTL_SMTP could be called here to send a notification
    null;
  else
    -- Check whether the source file has arrived in SOURCE_DIR
    utl_file.fgetattr('SOURCE_DIR', 'product1.dat', v_exists, v_file_length, v_blocksize);
    if v_exists then
      dbms_output.put_line('File there!');
      proc_txn_product;
    else
      dbms_output.put_line('404 Error');
      -- Wait five minutes, then check again
      dbms_lock.sleep(300);
      goto L_sleeping_child;
    end if;
  end if;
end proc_file_watcher;
/
-- Create the program
SQL> BEGIN
  DBMS_SCHEDULER.CREATE_PROGRAM(
    program_name   => 'STENNY.STP_PROC_FILE_WATCHER',
    program_action => 'STENNY.PROC_FILE_WATCHER',
    program_type   => 'STORED_PROCEDURE',
    comments       => 'Firing the ETL process if file arrives',
    enabled        => TRUE);
END;
/

PL/SQL procedure successfully completed.

-- Create the schedule
SQL> BEGIN
  SYS.DBMS_SCHEDULER.CREATE_SCHEDULE(
    repeat_interval => 'FREQ=WEEKLY;BYDAY=TUE;BYHOUR=8;BYMINUTE=0;BYSECOND=0',
    start_date      => to_timestamp_tz('2004-04-27 US/Central', 'YYYY-MM-DD TZR'),
    comments        => 'Tuesday AM Schedule',
    schedule_name   => '"STENNY"."SCS_TXN_PROD"');
END;
/

PL/SQL procedure successfully completed.

-- Create the job
SQL> BEGIN
  SYS.DBMS_SCHEDULER.CREATE_JOB(
    job_name      => 'STENNY.SCJ_TXN_PROD',
    program_name  => 'STENNY.STP_PROC_FILE_WATCHER',
    schedule_name => 'STENNY.SCS_TXN_PROD',
    comments      => 'Start the ETL process on Tuesday',
    auto_drop     => FALSE,
    enabled       => TRUE);
END;
/

PL/SQL procedure successfully completed.

-- Test
SQL> select count(*) from stenny.stg_product;

  COUNT(*)
----------
         0

SQL> EXEC DBMS_SCHEDULER.RUN_JOB('STENNY.SCJ_TXN_PROD', FALSE);

PL/SQL procedure successfully completed.

SQL> select count(*) from stenny.stg_product;

  COUNT(*)
----------
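Because RUN_JOB is called here with use_current_session set to FALSE, the job runs in the background, so a row count taken immediately afterwards may not yet show the result. The scheduler's dictionary views give a more direct picture. A minimal sketch, assuming the session is connected as STENNY so that the USER_ views cover the job:

```sql
-- Current state of the job and when it last ran / will run next
SELECT job_name, state, run_count, failure_count,
       last_start_date, next_run_date
FROM   user_scheduler_jobs
WHERE  job_name = 'SCJ_TXN_PROD';

-- Outcome of each individual run, most recent first
SELECT log_date, status, run_duration, additional_info
FROM   user_scheduler_job_run_details
WHERE  job_name = 'SCJ_TXN_PROD'
ORDER  BY log_date DESC;
```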
-- stg_excep
CREATE TABLE stg_excep AS
  SELECT * FROM stenny_ext_product WHERE 1 = 2;

-- stg_product
CREATE TABLE STG_PRODUCT
(
  PRODUCT_ID   NUMBER,
  PRODUCT_NAME VARCHAR2(20),
  LOC_ID       NUMBER
);

-- loc_std
CREATE TABLE LOC_STD
(
  LOC_ID   NUMBER,
  LOC_NAME VARCHAR2(20)
);

INSERT INTO LOC_STD ( LOC_ID, LOC_NAME ) VALUES ( 1, 'JiangSu' );
INSERT INTO LOC_STD ( LOC_ID, LOC_NAME ) VALUES ( 2, 'ZheJiang' );
INSERT INTO LOC_STD ( LOC_ID, LOC_NAME ) VALUES ( 3, 'SiChuan' );
INSERT INTO LOC_STD ( LOC_ID, LOC_NAME ) VALUES ( 4, 'YunNan' );