配了快一个星期吧。看的英文文档。不停的google。英文真是我的硬伤。
参考资料:http://wiki.postgresql.org/wiki/Streaming_Replication
http://hi.baidu.com/magicgrass/blog/item/69fcdb1e297248e9e0fe0bef.html
http://brandonkonkle.com/blog/2010/oct/20/postgres-9-streaming-replication-and-django-balanc/
http://www.postgresql.org/docs/devel/static/#STANDBY-SERVER-OPERATION
---------------------------------------------------------------------------------------------------
我的主机做master/primary,ip:172.29.0.43.
slave/的ip是:172.29.0.44 ,端口都是5432.
现在我的主机上创建个角色用于Replication时slave的登录role,假设角色名为slave1
CREATE ROLE name REPLICATION LOGIN
然后配置主机postgresql.conf和pg_hba.conf
往pg_hba.conf添加如下:
host all all 172.29.0.44/32 trust host replication slave1 172.29.0.44/32 trust
修改postgresql.conf如下:
wal_level = hot_standby archive_mode = on archive_command = 'cp -i %p /opt/PostgreSQL/9.1/archive/%f' archive_timeout = 600 max_wal_senders = 10 wal_keep_segments = 32
然后创建在archive_command中出现的目录。并给予777权限。
接着开始在主机上创建备份
[root@ShadowFiend data]# su postgres bash-4.1$ pg_ctl start bash-4.1$ psql postgres -c "select pg_start_backup('backup',true)"
然后用scp 传送到 172.29.0.44 或者 用 tar -czvf ****.tar.gz data;/打包 传到172.29.0.44 上的 PostgreSQL 下 的 data 下 并且覆盖 除了 postgresql.conf ,pg_hba.conf 的所有文件。删除 data/postmaster.pid。
然后主机执行
bash-4.1 $ psql postgres -c "select pg_stop_backup" bash-4.1 $ pg_ctl restart
接着在172.29.0.44上 用root 用户给覆盖过的 data/目录 更改属主
chown -R postgres data
并且修改data下的postgresql.conf :
wal_level=hot_standby hot_standby=on
从share/postgresql/复制一个recovery.conf的sample到data/下
cp share/postgresql/recovery.conf.sample data/recovery.conf
并添加
restore_command=‘cp /opt/archive/%f%p’ archive_cleanup_command=‘pg_archivecleanup/archive%r’ standby_mode=‘on’ primary_conninfo=‘host=172.29.0.43 port=5432 user=slave1 password=’ trigger_file= '/opt/archive/trigger_file.172.29.0.43'
接着启动172.29.0.44的服务吧。
ok。
---------------------------------------------------------------------------------------------------------------------------------------------
有错误看data/pg_log/下的日志。
中间有次不小心吧primary上的/data/pg_xlog/删光了,数据库就启动不起来,重新拷贝了其他机器上的/data 下来覆盖。。。。。
-----------------------------------------------------------------------------------------------------------------------------------------------
:-),出错了,大概是primary 和 slave 都关机后,再启动起来报错
DEBUG: received replication command: IDENTIFY_SYSTEM DEBUG: received replication command: START_REPLICATION 0/18000000 2011-12-09 11:07:52 CST LOG: streaming replication successfully connected to primary 2011-12-09 11:07:52 CST FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 000000010000000000000018 has already been removed
google了一下,解答如下、
The cause is very simple - the standby needs all WAL segments created since the backup started, but the master removes some of them. There are two ways to fix this: 1) increase the wal_keep_segments so that enough segments is kept It seems that the slave asked for B110000000D when master already created B110000000D. That's almost 4000 segments if I'm counting correctly. That means your database is either quite busy or the backup takes very long time. This stores all the data on master, so you'll have to keep that in mind when planning the capacity. For example the 4000 segments are almost 64GB. 2) Set up a WAL archive - a separate instance where the WAL segments are kept. See how the archive_command works. And there's pg_archivecleanup for maintenance of the archive. Tomas