Written by a Brazilian developer. I used it a few years ago in an Oracle-to-PostgreSQL data migration. Many thanks to him.
[Original article]:
http://cunha17.cristianoduarte.pro.br/postgresql/snapshots.en_us.php
Excerpts:
1. Problem and motivation
When migrating from Oracle to PostgreSQL, we may face situations where we need the same data on both database servers. This could be "easily" resolved if we could create "database links" (dblinks) between a PostgreSQL server and an Oracle server. That's where the problem arises.
2. Available solutions
We have some solutions, like DBLink and DBI-Link. The first one creates "views" that return the result of a function, which in turn queries the remote database server. The second works almost the same way, but creates a whole set of structures (tables, types, views and functions) to make the handling of remote tables as transparent as possible.
3. Current implementation problems
Comparing the two approaches, it's evident that DBI-Link is much more elegant and elaborate. But both suffer from a severe problem: they bring the whole result of the remote query into the local database server. For instance:
Take a table in an Oracle database holding a list of persons (the PERSON table). The primary key is the social security number (SSN); the table holds many other fields of personal information and has about 250,000 records. The filesystem size of the table is about 122 MB.
If we run the following query on Oracle using SQL*Plus:
SELECT name FROM person WHERE ssn='012345678'; (1)
the database server will look up the given SSN in the primary key index and will return almost immediately (time < 1 second) the name of the person whose SSN is 012345678.
Now we want a new application, which is being written for PostgreSQL, to access this Oracle table transparently, that is, to access it through the PostgreSQL server as if it were a native table.
Using DBLink we may create a view named "PERSON" based on a call to dblink's query function, passing "SELECT * FROM PERSON" (2).
Using DBI-Link we may create a local mirror of the remote schema structure. In the local schema we would then find, besides many types and associated functions, a view called "PERSON" that invokes a function that makes the remote query (2).
If we run the same query as we did before on Oracle (1), we will see a considerable delay and an absurd waste of computing resources (in my experience, time ~ 5 minutes and process memory consumption ~ 1 GB), which makes these solutions unviable for any corporate project.
4. Internals and the origin of the problems
In both solutions, the request that originated the function execution was:
SELECT name FROM person WHERE ssn='012345678'; (1)
However, we must notice that both functions were created to make the following remote query:
SELECT * FROM PERSON; (2)
What really happens is that the query embedded in these functions (2) is always executed, no matter which filter clauses are present in the request query that originated the function execution (1). The result of these functions (2) is then processed sequentially by PostgreSQL (in this case, filtered by the SSN column) and the final result is returned (3).
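To make the cost concrete, here is a minimal Python sketch that uses an in-memory SQLite database as a stand-in for the remote Oracle server. The function names and the row count are assumptions made for illustration only; they are not part of DBLink or DBI-Link. It compares how many rows cross the "link" when the filter is applied locally after query (2) and when query (1) runs on the remote side.

```python
import sqlite3


def make_remote(n_rows=1000):
    """Build an in-memory stand-in for the remote PERSON table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE person (ssn TEXT PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO person VALUES (?, ?)",
        ((f"{i:09d}", f"person-{i}") for i in range(n_rows)),
    )
    conn.commit()
    return conn


def lookup_via_wrapper(remote, ssn):
    """Mimic the view-over-function approach: run query (2), filter locally."""
    rows = remote.execute("SELECT * FROM person").fetchall()  # whole table is fetched
    names = [name for (row_ssn, name) in rows if row_ssn == ssn]
    return names, len(rows)


def lookup_pushed_down(remote, ssn):
    """Mimic query (1) running directly on the remote server."""
    rows = remote.execute(
        "SELECT name FROM person WHERE ssn = ?", (ssn,)
    ).fetchall()
    return [name for (name,) in rows], len(rows)
```

Both functions return the matching names together with the number of rows that had to be transferred, so the difference grows linearly with the size of the remote table.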
A not so simple way out would be to hook into the PostgreSQL planner, gather information about the query that originated the function execution and dynamically pass additional parameters to the remote database query. Unfortunately, this doesn't exist yet.
However, we can take another route when the information doesn't need to be accessed in real time. In these cases, we can make a local copy of the remote table. The previous problem is solved, right? Nope.
How would this copy be made? Using a DBLink or DBI-Link connection we could execute a CREATE TABLE PERSON AS SELECT * FROM REMOTE_PERSON, which would create a copy of the remote table in the local database. Now any query executed on PostgreSQL would use the local table, which is much faster and more efficient. But this approach suffers from one of the problems described earlier: excessive memory consumption when the function gets called (in my experience, the process grew to ~ 1 GB). Even if this operation needs to be done only a few times per day, the high memory consumption makes this approach impractical.
5. Solving with Materialized Views: the PostgreSQL::Snapshots
That's why I created the PostgreSQL::Snapshots project, as an efficient, corporate-grade way to address the problems of current DBLink solutions. This solution was based on and inspired by the DBI-Link project.
The implemented functionalities include:
CREATE DBLINK
This command, available on Oracle but not on PostgreSQL, creates a link between two databases, using a user, a password and the location of the server on the network.
In our case, we use a PL/Perl function (create_dblink) that takes the DBLink name, a Perl DBI connection string, the user name, the password and some additional attributes needed to establish the connection.
These data are saved in the pg_dblink table. It is worth remembering that access to this table should only be allowed to the DBA (postgres user), even though it is created in the public schema.
Before inserting a record into this table, we check that we can successfully establish a connection with the remote database using the given parameters.
DROP DBLINK
This command, available on Oracle but not on PostgreSQL, removes a link between two databases, taking only the DBLink name.
In our case, we use a PL/Perl function (drop_dblink) that takes the DBLink name and removes the entry from the pg_dblinks table. The foreign key disallows the deletion if any snapshot references the DBLink to be removed.
CREATE SNAPSHOT
This command, available on Oracle but not on PostgreSQL, creates a materialized view (aka SNAPSHOT) based on a query. This query may or may not reference a DBLink.
In our case, we use a PL/Perl function (create_snapshot) that takes the schema name, the Snapshot name, the query, the DBLink name and the refresh method. The DBLink name is optional and, when not given (NULL), the snapshot is created from a query on the local database. The refresh method can be "COMPLETE", "FAST" or "FORCE".
The query is executed with a WHERE 1=0 clause so as to bring back only the structure of the query result. With this structure, a type mapping is done and an empty local table is created with that same structure. Finally, an entry is added to the snapshots table (pg_snapshots).
The table is not filled by this command.
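The WHERE 1=0 trick can be sketched in a few lines of Python, with SQLite standing in for both databases. This is only an illustration under stated assumptions: the function name is hypothetical, every column is mapped to TEXT instead of going through a real type mapping, and no entry is written to a catalog table such as pg_snapshots.

```python
import sqlite3


def create_empty_snapshot(source, target, snapshot_name, query):
    """Create an empty table in `target` shaped like the result of `query`."""
    # Wrapping the query and adding WHERE 1=0 returns no rows, but the
    # cursor still exposes the result columns through cursor.description.
    cur = source.execute(f"SELECT * FROM ({query}) AS q WHERE 1=0")
    columns = [d[0] for d in cur.description]
    # Assumption: every column becomes TEXT; real type mapping is omitted.
    col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
    target.execute(f'CREATE TABLE "{snapshot_name}" ({col_defs})')
    return columns
```

As in the description above, the new table stays empty: filling it is left to a separate refresh step.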
DROP SNAPSHOT
This command, available on Oracle but not on PostgreSQL, removes a materialized view (aka SNAPSHOT), taking only the Snapshot name.
In our case, we use a PL/Perl function (drop_snapshot) that takes the schema name and the Snapshot name, and removes the object and its entry in the pg_snapshots table.
CREATE SNAPSHOT LOG
This command, available on Oracle but not on PostgreSQL, creates a materialized view log (aka SNAPSHOT LOG) bound to another table (called the master table). When a snapshot is created with a query that references this master table, it becomes possible to use fast refreshes (REFRESH FAST) based on the log. This means, for instance, that at snapshot refresh time only the deleted, updated and inserted records are retrieved, greatly improving performance and reducing the refresh time.
In our case, we use a PL/Perl function (create_snapshot_log) that takes the schema name, the master table name and the comma-separated field list on which the log filter will be applied. This list can contain a keyword like "primary key" or "oid", or field names.
To make the "snapshot log" functional, this function creates a log table with the "mlog$_" prefix and a dynamically coded trigger that monitors any modification of the master table and records the necessary information in the log table. Finally, an entry is created in the snapshot log table (pg_mlogs), along with the necessary entries in the snapshot log columns table (pg_mlog_refcols).
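The log-table-plus-trigger idea can be illustrated with SQLite triggers driven from Python. This is a rough sketch, not the project's PL/Perl code: the function name, the "mlog_" table name (used here instead of the "mlog$_" prefix) and the two-column log layout are all assumptions chosen to keep the example small.

```python
import sqlite3


def create_log(conn, master, pk):
    """Create a change-log table and triggers for `master`, keyed by `pk`."""
    log = f"mlog_{master}"  # hypothetical name; the project uses an "mlog$_" prefix
    conn.execute(f"CREATE TABLE {log} (pk_value TEXT, dml_type TEXT)")
    events = (("INSERT", "I", "NEW"), ("UPDATE", "U", "NEW"), ("DELETE", "D", "OLD"))
    for event, code, ref in events:
        # One generated trigger per DML event records the affected key.
        conn.execute(
            f"CREATE TRIGGER {log}_{event.lower()} AFTER {event} ON {master} "
            f"BEGIN INSERT INTO {log} VALUES ({ref}.{pk}, '{code}'); END"
        )
    return log
```

After calling create_log(conn, "person", "ssn"), every insert, update or delete on person leaves one row in the log, which a later refresh can consume instead of rereading the whole master table.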
DROP SNAPSHOT LOG
This command, available on Oracle but not on PostgreSQL, removes a materialized view log (aka SNAPSHOT LOG), taking only the schema name and the master table name.
In our case, we use a PL/Perl function (drop_snapshot_log) that takes the schema name and the master table name and removes the materialized view log table, along with the master table trigger, the pg_mlogs entry and the pg_mlog_refcols entries.
REFRESH SNAPSHOT
This command, available on Oracle as a "Stored Procedure" but not on PostgreSQL, refreshes the data in a materialized view (aka SNAPSHOT), taking only the Snapshot name.
In our case, we use a PL/Perl function (refresh_snapshot) that takes the schema name and the snapshot name and fills the snapshot with the results of its creation query.
The secret behind it is the use of distinct connections for SPI communication with the backend, for data insertion into the Snapshot and for remote data reading. The insertion process is done in one transaction of 1000 records at a time and makes use of "Prepared Statements" to make the process faster and smarter (in my test with the PERSON table, the rate was ~ 650 records/second).
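The batching idea can be sketched as follows, again with SQLite connections standing in for the remote and local databases. The function name and parameters are hypothetical, and committing once per batch is my reading of the description above, not a confirmed detail of the implementation. The point is that memory use is bounded by the batch size rather than by the table size.

```python
import sqlite3


def copy_in_batches(source, target, query, insert_sql, batch_size=1000):
    """Stream rows from `source` into `target`, committing once per batch."""
    reader = source.execute(query)
    copied = 0
    while True:
        batch = reader.fetchmany(batch_size)  # never holds more than one batch
        if not batch:
            break
        # executemany reuses one parameterized statement for the whole batch.
        target.executemany(insert_sql, batch)
        target.commit()
        copied += len(batch)
    return copied
```

A call such as copy_in_batches(remote, local, "SELECT ssn, name FROM person", "INSERT INTO person VALUES (?, ?)") reads and writes over two separate connections, which mirrors the separation of reading and insertion described above.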
In this function we also check the refresh method and do fast refreshes when configured (or allowed). The fast refreshes depend on the chosen method, on the presence of a materialized view log, on the log row count and on the row count of the query result. For big tables with few updates/inserts/deletes, the FAST method can refresh the snapshot within a few seconds.
The fast refresh (REFRESH FAST) is only available on driver-supported databases (at the moment, only PostgreSQL and Oracle), since operations will be performed on internal system tables at the master database (where the query will be executed). On Oracle, for example, tables like SLOG$, MLOG$, MLOG_REFCOL$, etc. need to be accessed directly, and the SLOG$ table must be accessed for modifications.
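A fast refresh can be pictured as replaying the change log against the snapshot. The sketch below is a simplified illustration in Python with SQLite, not the project's algorithm: the function name, the mlog_person log table with a pk_value column, and the PERSON(ssn, name) layout are all assumptions, and the driver-specific system tables mentioned above are not modeled.

```python
import sqlite3


def fast_refresh(master, snapshot, log_table="mlog_person"):
    """Apply logged changes to the snapshot instead of recopying the table."""
    changed = [
        r[0] for r in master.execute(f"SELECT DISTINCT pk_value FROM {log_table}")
    ]
    for ssn in changed:
        # Drop the stale copy, then re-fetch the current row if it still exists.
        snapshot.execute("DELETE FROM person WHERE ssn = ?", (ssn,))
        row = master.execute(
            "SELECT ssn, name FROM person WHERE ssn = ?", (ssn,)
        ).fetchone()
        if row is not None:
            snapshot.execute("INSERT INTO person VALUES (?, ?)", row)
    snapshot.commit()
    master.execute(f"DELETE FROM {log_table}")  # the log has been consumed
    master.commit()
    return len(changed)
```

Only the keys that appear in the log are touched, which is why a big table with few changes can be refreshed quickly.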
6. Conclusion
With PostgreSQL::Snapshots, the basic functionalities of Materialized Views are implemented, which does not preclude a future association with an efficient DBLink solution. The use of Materialized Views is not restricted to table copies from other databases: they can also be used to persist the results of highly complex and slow queries, giving front-end systems the responsiveness they need.
################ Usage ############
As database superuser (often postgres, but check for your system), do the following:
INSTALLATION
1. Load PL/Perlu into your database. See the createlang documents for details on how to do this;
2. Make sure that DBI is installed on your Perl system and that the DBD of the database you choose is also installed;
3. Edit the Makefile.sh file and change the KEY variable to a better "secret" value and the BASE_SCHEMA variable to the schema where the base (internal) Pg::Snapshot tables should be placed. Also remember to set up the remaining variables, like SUPERUSER.
4. On the PostgreSQL::Snapshots root, execute:
# ./Makefile.sh
5. Load the database driver:
- On PostgreSQL:
# psql -d <database> -h <host> -U <user> -f ./drivers/pg/snapshot.sql
- On Oracle, inside SQL*Plus:
SQL> @./drivers/oracle/snapshot.sql
6. Load the pgsnapshots.sql file:
# psql -d <database> -h <host> -U <user> -f pgsnapshots.sql
7. Allow access from your workstation (or remote server) to one or more master tables on the current database:
- Inside psql, connected as the POSTGRES user:
db=# select snapshot_do('<key>', 'ALLOW', '<masterschema>', '<mastername>', '<ip>');
- or inside SQL*Plus, connected as the SYS user:
SQL> begin
snapshot_do('<key>', 'ALLOW', '<masterschema>', '<mastername>', '<ip>');
end;
/
Where:
<key> is the "secret" value placed on the KEY variable inside the Makefile.sh file.
<masterschema> is the schema name of the master table you wish to allow access to
<mastername> is the name of the master table you wish to allow access to
<ip> is the IP address of the workstation/server to which you wish to give access
8. Use the functions listed under AVAILABLE FUNCTIONS as needed.
AVAILABLE FUNCTIONS
1. create_dblink (implementation of "CREATE DBLINK")
This function creates a link between databases. It takes the name of the DBLINK to be created and the parameters necessary to establish the remote connection.
Syntax :
create_dblink(dblinkname text, datasource text, username text, password text, attributes text)
dblinkname : name of the DBLINK to be created
datasource : Perl:DBI CONNECTION string to the remote database
username : NAME of the remote database user
password : PASSWORD of the remote database user
attributes : connection ATTRIBUTES, like AutoCommit, RaiseError, etc.
2. drop_dblink (implementation of "DROP DBLINK")
This function removes a link between databases taking only the DBLink name as a parameter.
Syntax :
drop_dblink(dblinkname text)
dblinkname : name of the DBLINK to be removed
3. create_snapshot (implementation of "CREATE SNAPSHOT" or "CREATE MATERIALIZED VIEW")
This function creates a materialized view or snapshot based on a query. The query may or may not reference a database link.
Syntax :
create_snapshot(schemaname text, snapshotname text, query text, dblink text, refresh_method text, prebuilt_table text)
schemaname : name of the schema where the snapshot will be created
snapshotname : name of the snapshot to be created
query : SQL query that will be executed at the remote database and whose result will fill the snapshot
dblink : optional parameter that takes the name of the DBLink to be used. If the value is NULL, the query will be executed by the local database.
refresh_method : can be "COMPLETE", "FAST" or "FORCE".
prebuilt_table : name of the prebuilt table, on the same schema of the snapshot, over which the snapshot will be created (existing data are preserved). This is an optional parameter.
IMPORTANT: the table will not be filled by this function.
4. drop_snapshot (implementation of "DROP SNAPSHOT" or "DROP MATERIALIZED VIEW")
This function removes a materialized view or snapshot taking the schema name and the snapshot name as parameters.
Syntax :
drop_snapshot (schemaname text, snapshotname text)
schemaname : name of the schema where the snapshot resides
snapshotname : name of the snapshot to be removed
5. create_snapshot_log (implementation of "CREATE MATERIALIZED VIEW LOG" or "CREATE SNAPSHOT LOG")
This function creates a log table bound to a master table. This log table allows the creation of fast-refreshing snapshots (FAST REFRESH).
Syntax :
create_snapshot_log (schemaname text, mastername text, withwhat text)
schemaname : name of the schema where the master table resides
mastername : name of the master table
withwhat : use this clause to indicate whether the snapshot log should record the primary key, the rowid, or both the primary key and rowid when rows in the master are updated. This clause also specifies whether the snapshot records filter columns, which are non-primary-key columns referenced by subquery snapshots. The syntax is:
1) "PRIMARY KEY": indicates that the primary key of all rows updated in the master table should be recorded in the snapshot log;
2) "OID": indicates that the OID of all rows updated in the master table should be recorded in the snapshot log;
3) "(<filter-columns>)" : a parenthesis-delimited comma-separated list that specifies the filter columns to be recorded in the snapshot log. For fast-refreshable primary-key snapshots defined with subqueries, all filter columns referenced by the defining subquery must be recorded in the snapshot log;
4) Any combination of the above in any order.
6. drop_snapshot_log (implementation of "DROP MATERIALIZED VIEW LOG" or "DROP SNAPSHOT LOG")
This function removes a log table previously bound to a master table.
Syntax :
drop_snapshot_log (schemaname text, mastername text)
schemaname : name of the schema where the master table resides
mastername : name of the master table
7. refresh_snapshot (implementation of "DBMS_SNAPSHOTS.REFRESH")
This function refreshes the data on a materialized view or snapshot taking the schema and snapshot names as parameters.
Syntax :
refresh_snapshot (schemaname text, snapshotname text)
schemaname : name of the schema where the snapshot resides
snapshotname : name of the snapshot to be refreshed