ETL tools you can build with Python: the best ETL packages for Python

I have 2 use cases:

Extract, Transform and Load from Oracle / PostgreSQL / Redshift / S3 / CSV to my own Redshift cluster

Schedule the job so it runs daily/weekly (INSERT + TABLE or INSERT + NONE options preferable).

I am currently using:

SQLAlchemy for extracts (works well generally).

PETL for transforms and loads (works well on smaller data sets, but for ~50m+ rows it is slow and the connection to the database(s) times out; see the chunked-load sketch after this list).

An internal tool for the scheduling component (it stores the transforms in XML and then loads from that XML, and seems rather long and complicated).
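For reference, here is a minimal sketch of the kind of chunked extract/load that sidesteps the timeouts, assuming pandas on top of SQLAlchemy; the connection URLs and table name are placeholders, and the redshift+psycopg2 dialect requires the sqlalchemy-redshift package:

import pandas as pd
from sqlalchemy import create_engine

src = create_engine("postgresql://user:pass@source-host:5432/source_db")    # placeholder URL
dst = create_engine("redshift+psycopg2://user:pass@cluster:5439/warehouse")  # placeholder URL

# Stream the source table in fixed-size chunks so no single query holds
# a connection long enough to hit a server-side timeout.
for chunk in pd.read_sql("SELECT * FROM big_table", src, chunksize=100_000):
    chunk.to_sql("big_table", dst, if_exists="append", index=False)

That said, row-wise inserts into Redshift are slow at this scale; staging the chunks to S3 and issuing a COPY is the usual faster path.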

I have been looking through this link but would welcome additional suggestions. Exporting to Spark or similar is also welcome if there is an "easier" process where I can just do everything through Python (I'm only using Redshift because it seems like the best option).

Solution

You can try pyetl, an ETL framework written in Python 3:

from pyetl import Task, DatabaseReader, DatabaseWriter

reader = DatabaseReader("sqlite:///db.sqlite3", table_name="source")
writer = DatabaseWriter("sqlite:///db.sqlite3", table_name="target")

# Mapping between target and source column names
columns = {"id": "uuid", "name": "full_name"}
# Per-column transform functions applied as rows pass through
functions = {"id": str, "name": lambda x: x.strip()}

Task(reader, writer, columns=columns, functions=functions).start()
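Hypothetically, the same reader/writer pair could cover the Oracle/PostgreSQL-to-Redshift use case above, assuming pyetl accepts any SQLAlchemy-style URL (as the sqlite example suggests) and the matching drivers (e.g. cx_Oracle, sqlalchemy-redshift) are installed; the hosts, credentials, and column names here are placeholders:

from pyetl import Task, DatabaseReader, DatabaseWriter

# Placeholder connection URLs and table/column names
reader = DatabaseReader("postgresql://user:pass@source-host:5432/source_db", table_name="events")
writer = DatabaseWriter("redshift+psycopg2://user:pass@cluster:5439/warehouse", table_name="events")
columns = {"id": "id", "event_name": "event_name"}  # identity mapping for matching schemas
Task(reader, writer, columns=columns).start()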
