Outline
I. What is DRBD
II. How DRBD Works
III. DRBD Concepts
IV. DRBD Configuration Walkthrough
I. What is DRBD
Distributed Replicated Block Device (DRBD) is a software-based, shared-nothing, replicated storage solution that mirrors block devices (disks, partitions, logical volumes, and so on) between servers.
How DRBD mirrors data:
Real-time: replication happens immediately, as applications modify the data on disk.
Transparency: applications store data on the mirrored device as if it were an ordinary block device; they need not be aware that copies live on more than one server.
Synchronous or asynchronous mirroring: with synchronous mirroring, a local write request completes only after the data has been written to both servers; with asynchronous mirroring, a local write request completes as soon as the local write finishes, and the write to the peer server proceeds afterwards.
II. How DRBD Works
DRBD's position within the Linux I/O stack
III. DRBD Concepts
1. Replication modes
Single-primary mode
In single-primary mode, each resource has exactly one primary node in the cluster at any given time. Since only one node can manipulate the data at a time, this mode works with any conventional file system (ext3, ext4, XFS, and so on).
Dual-primary mode
In dual-primary mode, a resource has two primary nodes in the cluster at the same time. Because both sides may write concurrently, this mode requires a shared cluster file system that uses a distributed lock manager, such as GFS or OCFS2. When DRBD is deployed in dual-primary mode as part of a load-balanced cluster, the application must choose which of the two concurrent primaries to access for a given piece of data. This mode is disabled by default and must be enabled explicitly in the configuration file. (Supported since DRBD 8.0.)
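Enabling dual-primary mode comes down to a flag in the resource's net section. A minimal sketch for DRBD 8.3, with a hypothetical resource name r0 (check the option names against the drbd.conf man page for your version):

```
resource r0 {
  net {
    allow-two-primaries;       # permit both nodes to hold the Primary role at once
  }
  startup {
    become-primary-on both;    # optional: promote both nodes when the service starts
  }
  # ... device/disk/address definitions as usual ...
}
```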
2. Replication protocols
Protocol A
Asynchronous replication. A write is considered complete as soon as the local disk write has finished and the replication packet has been placed in the local send queue. If a node fails, data loss is possible, because writes bound for the remote node may still be sitting in the send queue. The data on the failover node is consistent, but the most recent updates may be missing. Protocol A is typically used between geographically separated nodes.
Protocol B
Memory-synchronous (semi-synchronous) replication. A write on the primary is considered complete once the local disk write has finished and the replication packet has reached the peer node. Data loss is possible only if both nodes fail simultaneously, because in-flight data may not yet have been committed to the peer's disk.
Protocol C
Synchronous replication. A write is considered complete only after both the local and the remote disks have confirmed it. No data can be lost, which makes this the most popular choice for cluster nodes, but I/O throughput depends on network bandwidth.
Protocol C is the usual choice, but it adds network round-trips to every write and therefore increases latency. For production use, weigh data safety against performance carefully when choosing a protocol.
In short:
A: a write is complete once the data is on the local disk and handed to the local TCP/IP stack.
B: a write is complete once the data reaches the peer's TCP/IP stack, i.e. a reception acknowledgement has arrived.
C: a write is complete once the data reaches the peer's disk, i.e. a write acknowledgement has arrived.
Protocol A gives the best performance; protocol C gives the strongest data safety.
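The chosen protocol is declared with a single keyword in the configuration, either per resource or once in the common section; a minimal sketch:

```
common {
        protocol C;   # A = asynchronous, B = memory-synchronous, C = fully synchronous
}
```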
3、DRBD资源
DRBD主要是对磁盘资源的管控,因此在DRBD模块中,资源是所有可复制移动存储设备的总称。
资源名(Resource name):资源名可以指定除了空格外 us-ascii 中的任意字符。
DRBD 设备(DRBD device):DRBD 的虚拟块设备。它有一个主设备号为 147 的设备,默认的它的次要号码编从 0 开始。相关的块设备需命名为/ dev/ drbdm,其中 M 是设备的次要号码。
磁盘配置(Disk configuration):DRBD 内部应用需要本地数据副本,元数据。
网络配置(Network configuration):各个对等接点间需要进行数据通信。
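These four pieces come together in a single resource definition; a minimal sketch with hypothetical host names, disks, and addresses:

```
resource r0 {                       # resource name
  on alpha {                        # one section per node
    device    /dev/drbd0;           # the DRBD virtual block device (major 147, minor 0)
    disk      /dev/sda7;            # backing disk holding the local data copy
    address   10.0.0.1:7789;        # network endpoint used for replication
    meta-disk internal;             # metadata stored on the backing disk itself
  }
  on beta {
    device    /dev/drbd0;
    disk      /dev/sda7;
    address   10.0.0.2:7789;
    meta-disk internal;
  }
}
```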
4. Resource roles
In DRBD, every node holds a role for each resource: primary or secondary.
Primary: the DRBD device can be read from and written to without restriction. It can be used to create and mount a file system, or as a raw block device for direct I/O, and so on.
Secondary: the DRBD device receives all updates from the peer node but otherwise refuses all access: it can neither be mounted nor be read from or written to by applications. Denying even read access is necessary to keep caches coherent, which is why a secondary cannot be accessed in any form.
A resource's role can be changed either by manual intervention or automatically by cluster-management software. Changing a resource from secondary to primary is a promotion; the reverse is a demotion.
In short: the primary node can read and write; the secondary node cannot even be mounted, so it can neither write nor read.
IV. DRBD Configuration Walkthrough
System environment
CentOS 5.8 x86_64
node1.network.com node1 172.16.1.101
node2.network.com node2 172.16.1.105
Software versions
drbd83-8.3.15-2.el5.centos.x86_64.rpm
kmod-drbd83-8.3.15-3.el5.centos.x86_64.rpm
Note: the two packages must be of matching versions.
Topology diagram
1. Preparation
(1) Time synchronization
[root@node1 ~]# ntpdate s2c.time.edu.cn
[root@node2 ~]# ntpdate s2c.time.edu.cn
If needed, define a crontab job on each node:
[root@node1 ~]# which ntpdate
/sbin/ntpdate
[root@node1 ~]# echo "*/5 * * * * /sbin/ntpdate s2c.time.edu.cn &> /dev/null" >> /var/spool/cron/root
[root@node1 ~]# crontab -l
*/5 * * * * /sbin/ntpdate s2c.time.edu.cn &> /dev/null
(2) The hostname must match the output of uname -n and be resolvable through /etc/hosts
node1
[root@node1 ~]# hostname node1.network.com
[root@node1 ~]# uname -n
node1.network.com
[root@node1 ~]# sed -i 's@\(HOSTNAME=\).*@\1node1.network.com@g' /etc/sysconfig/network
node2
[root@node2 ~]# hostname node2.network.com
[root@node2 ~]# uname -n
node2.network.com
[root@node2 ~]# sed -i 's@\(HOSTNAME=\).*@\1node2.network.com@g' /etc/sysconfig/network
Add hosts entries on node1:
[root@node1 ~]# vim /etc/hosts
[root@node1 ~]# cat /etc/hosts
127.0.0.1    localhost localhost.localdomain localhost4 localhost4.localdomain4
::1          localhost localhost.localdomain localhost6 localhost6.localdomain6
172.16.1.101 node1.network.com node1
172.16.1.105 node2.network.com node2
Copy this hosts file to node2:
[root@node1 ~]# scp /etc/hosts node2:/etc/
The authenticity of host 'node2 (172.16.1.105)' can't be established.
RSA key fingerprint is 13:42:92:7b:ff:61:d8:f3:7c:97:5f:22:f6:71:b3:24.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'node2,172.16.1.105' (RSA) to the list of known hosts.
root@node2's password:
hosts                                100%  233     0.2KB/s   00:00
(3) Passwordless SSH trust between the nodes
node1
[root@node1 ~]# ssh-keygen -t rsa -f ~/.ssh/id_rsa -P ''
Generating public/private rsa key pair.
/root/.ssh/id_rsa already exists.
Overwrite (y/n)? n          # a key pair was already generated here
[root@node1 ~]# ssh-copy-id -i ~/.ssh/id_rsa.pub node2
root@node2's password:
Now try logging into the machine, with "ssh 'node2'", and check in:
  .ssh/authorized_keys
to make sure we haven't added extra keys that you weren't expecting.
[root@node1 ~]# setenforce 0
[root@node1 ~]# ssh node2 'ifconfig'
eth0      Link encap:Ethernet  HWaddr 00:0C:29:D6:03:52
          inet addr:172.16.1.105  Bcast:255.255.255.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fed6:352/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:9881 errors:0 dropped:0 overruns:0 frame:0
          TX packets:11220 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:5898514 (5.6 MiB)  TX bytes:1850217 (1.7 MiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:16 errors:0 dropped:0 overruns:0 frame:0
          TX packets:16 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1112 (1.0 KiB)  TX bytes:1112 (1.0 KiB)
node2 needs the same mutual trust set up with the same steps; they are not repeated here.
(4) Disable iptables and SELinux
node1
[root@node1 ~]# service iptables stop
[root@node1 ~]# vim /etc/sysconfig/selinux
[root@node1 ~]# cat /etc/sysconfig/selinux
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#       enforcing - SELinux security policy is enforced.
#       permissive - SELinux prints warnings instead of enforcing.
#       disabled - SELinux is fully disabled.
#SELINUX=permissive
SELINUX=disabled
# SELINUXTYPE= type of policy in use. Possible values are:
#       targeted - Only targeted network daemons are protected.
#       strict - Full SELinux protection.
SELINUXTYPE=targeted
node2
[root@node2 ~]# service iptables stop
[root@node2 ~]# vim /etc/sysconfig/selinux
[root@node2 ~]# cat /etc/sysconfig/selinux
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#       enforcing - SELinux security policy is enforced.
#       permissive - SELinux prints warnings instead of enforcing.
#       disabled - SELinux is fully disabled.
#SELINUX=permissive
SELINUX=disabled
# SELINUXTYPE= type of policy in use. Possible values are:
#       targeted - Only targeted network daemons are protected.
#       strict - Full SELinux protection.
SELINUXTYPE=targeted
(5) Configure the 163.com yum repository (you could also download the two packages by hand; the yum repository is used here)
node1
[root@node1 ~]# wget http://mirrors.163.com/.help/CentOS5-Base-163.repo
[root@node1 ~]# yum repolist
Loaded plugins: fastestmirror, security
Loading mirror speeds from cached hostfile
 * epel: mirrors.hustunique.com
repo id   repo name                                          status
addons    CentOS-5 - Addons - 163.com                            0
base      CentOS-5 - Base - 163.com                          3,667
epel      Extra Packages for Enterprise Linux 5 - x86_64     6,755
extras    CentOS-5 - Extras - 163.com                          266
updates   CentOS-5 - Updates - 163.com                         593
repolist: 11,281
node2
[root@node2 ~]# wget http://mirrors.163.com/.help/CentOS5-Base-163.repo
[root@node2 ~]# yum repolist
Loaded plugins: fastestmirror, security
Loading mirror speeds from cached hostfile
 * epel: mirrors.hustunique.com
repo id   repo name                                          status
addons    CentOS-5 - Addons - 163.com                            0
base      CentOS-5 - Base - 163.com                          3,667
epel      Extra Packages for Enterprise Linux 5 - x86_64     6,755
extras    CentOS-5 - Extras - 163.com                          266
updates   CentOS-5 - Updates - 163.com                         593
repolist: 11,281
2. Install drbd and kmod-drbd
node1
[root@node1 ~]# yum install -y drbd83 kmod-drbd83
node2
[root@node2 ~]# yum install -y drbd83 kmod-drbd83
3. Prepare a partition as the DRBD backing device
node1
[root@node1 ~]# fdisk /dev/hda

The number of cylinders for this disk is set to 44384.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-44384, default 1):
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-44384, default 44384): +1G

Command (m for help): p

Disk /dev/hda: 21.4 GB, 21474836480 bytes
15 heads, 63 sectors/track, 44384 cylinders
Units = cylinders of 945 * 512 = 483840 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/hda1               1        2068      977098+  83  Linux

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.
[root@node1 ~]# partprobe /dev/hda
[root@node1 ~]# cat /proc/partitions
major minor  #blocks  name

   3     0   20971520 hda
   3     1     977098 hda1
   8     0   20971520 sda
   8     1     104391 sda1
   8     2   20860402 sda2
 253     0   18776064 dm-0
 253     1    2064384 dm-1
node2
[root@node2 ~]# fdisk /dev/hda
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel. Changes will remain in memory only,
until you decide to write them. After that, of course, the previous
content won't be recoverable.

The number of cylinders for this disk is set to 44384.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-44384, default 1):
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-44384, default 44384): +1G

Command (m for help): p

Disk /dev/hda: 21.4 GB, 21474836480 bytes
15 heads, 63 sectors/track, 44384 cylinders
Units = cylinders of 945 * 512 = 483840 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/hda1               1        2068      977098+  83  Linux

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.
[root@node2 ~]# partprobe /dev/hda
[root@node2 ~]# cat /proc/partitions
major minor  #blocks  name

   3     0   20971520 hda
   3     1     977098 hda1
   8     0   20971520 sda
   8     1     104391 sda1
   8     2   20860402 sda2
 253     0   18776064 dm-0
 253     1    2064384 dm-1
4. Edit the main configuration file
[root@node1 ~]# cat /etc/drbd.conf
#
# please have a look at the example configuration file in
# /usr/share/doc/drbd83/drbd.conf
#
[root@node1 ~]# cp /usr/share/doc/drbd83-8.3.15/drbd.conf /etc/drbd.conf
cp: overwrite `/etc/drbd.conf'? y
[root@node1 ~]# cat /etc/drbd.conf
# You can find an example in  /usr/share/doc/drbd.../drbd.conf.example

include "drbd.d/global_common.conf";
include "drbd.d/*.res";
Edit the global and common definitions:
[root@node1 ~]# vim /etc/drbd.d/global_common.conf
[root@node1 ~]# cat /etc/drbd.d/global_common.conf
global {
        usage-count no;        # disable usage statistics reporting
        # minor-count dialog-refresh disable-ip-verification
}

common {
        protocol C;

        handlers {
                # These are EXAMPLE handlers only.
                # They may have severe implications,
                # like hard resetting the node under certain circumstances.
                # Be careful when chosing your poison.

                pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
                # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
                # split-brain "/usr/lib/drbd/notify-split-brain.sh root";
                # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
                # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
                # after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
        }

        startup {
                # wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
        }

        disk {
                on-io-error detach;        # detach the disk on I/O errors
                # on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
                # no-disk-drain no-md-flushes max-bio-bvecs
        }

        net {
                cram-hmac-alg "sha1";         # peer authentication algorithm
                shared-secret "qYQ1cwOFC6E="; # shared secret for peer authentication
                # sndbuf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
                # max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
                # after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
        }

        syncer {
                rate 300M;        # maximum resynchronization rate
                # rate after al-extents use-rle cpu-mask verify-alg csums-alg
        }
}
5. Define a resource file, /etc/drbd.d/web.res
[root@node1 ~]# vim /etc/drbd.d/web.res
[root@node1 ~]# cat /etc/drbd.d/web.res
resource web {
  on node1.network.com {
    device    /dev/drbd0;
    disk      /dev/hda1;
    address   172.16.1.101:7789;
    meta-disk internal;
  }
  on node2.network.com {
    device    /dev/drbd0;
    disk      /dev/hda1;
    address   172.16.1.105:7789;
    meta-disk internal;
  }
}
Copy the configuration and resource files to node2:
[root@node1 ~]# scp -r /etc/drbd.* node2:/etc/
drbd.conf                            100%  133     0.1KB/s   00:00
global_common.conf                   100% 1688     1.7KB/s   00:00
web.res                              100%  292     0.3KB/s   00:00
6. Initialize the resource on both nodes and start the service
Initialize the resource; run on both nodes:
[root@node1 ~]# drbdadm create-md web
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
[root@node2 ~]# drbdadm create-md web
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
Start the drbd service; run on both nodes:
[root@node1 ~]# service drbd start
Starting DRBD resources: [ d(web) s(web) n(web) ]......
[root@node2 ~]# service drbd start
Starting DRBD resources: [ d(web) s(web) n(web) ].
Check the status:
[root@node1 ~]# cat /proc/drbd
version: 8.3.15 (api:88/proto:86-97)
GIT-hash: 0ce4d235fc02b5c53c1c52c53433d11a694eab8c build by [email protected], 2013-03-27 16:01:26
 0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:977028
The drbd-overview command shows the same information:
[root@node1 ~]# drbd-overview
  0:web  Connected Secondary/Secondary Inconsistent/Inconsistent C r-----
At this point both nodes are Secondary, so one of them must be promoted to Primary. On the node to be promoted, run:
[root@node1 ~]# drbdadm -- --overwrite-data-of-peer primary web
Checking the status again shows that the initial synchronization has started:
[root@node1 ~]# cat /proc/drbd
version: 8.3.15 (api:88/proto:86-97)
GIT-hash: 0ce4d235fc02b5c53c1c52c53433d11a694eab8c build by [email protected], 2013-03-27 16:01:26
 0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r---n-
    ns:169252 nr:0 dw:0 dr:177024 al:0 bm:10 lo:4 pe:12 ua:64 ap:0 ep:1 wo:b oos:809220
        [==>.................] sync'ed: 17.6% (809220/977028)K
        finish: 0:00:09 speed: 83,904 (83,904) K/sec
Once synchronization completes, the status shows both copies up to date and the nodes in a primary/secondary relationship:
[root@node1 ~]# drbd-overview    # Primary/Secondary: this node is primary, the peer is secondary
  0:web  Connected Primary/Secondary UpToDate/UpToDate C r-----
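The status line's key fields (cs: connection state, ro: local/peer roles, ds: local/peer disk states) can be extracted with a short awk script when checking nodes programmatically. A sketch that parses a sample /proc/drbd-style line from a here-document so it runs anywhere; on a real node, feed it /proc/drbd instead:

```shell
# Pull the cs/ro/ds fields out of a DRBD status line.
parse_drbd_status() {
    awk '/cs:/ {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^cs:/) cs = substr($i, 4)   # connection state
            if ($i ~ /^ro:/) ro = substr($i, 4)   # roles, local/peer
            if ($i ~ /^ds:/) ds = substr($i, 4)   # disk states, local/peer
        }
        print "connection=" cs " roles=" ro " disks=" ds
    }'
}

# Sample input standing in for `cat /proc/drbd`:
parse_drbd_status <<'EOF'
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
EOF
# prints: connection=Connected roles=Primary/Secondary disks=UpToDate/UpToDate
```

On a live node this would be invoked as `parse_drbd_status < /proc/drbd`.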
7. Create a file system and mount it
A file system can only be mounted on the Primary node, so the drbd device can be formatted only after a primary node has been designated.
[root@node1 ~]# mke2fs -j /dev/drbd0
mke2fs 1.39 (29-May-2006)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
122368 inodes, 244257 blocks
12212 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=251658240
8 block groups
32768 blocks per group, 32768 fragments per group
15296 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376

Writing inode tables: done
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 25 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
Create a mount point and mount the device:
[root@node1 ~]# mkdir /mydata
[root@node1 ~]# mount /dev/drbd0 /mydata
[root@node1 ~]# ls /mydata/
lost+found
Copy a file over for later verification:
[root@node1 ~]# cp /etc/fstab /mydata/
[root@node1 ~]# ls /mydata/
fstab  lost+found
8. Switch the Primary and Secondary roles
In a primary/secondary DRBD setup, only one node may be Primary at any moment. To swap the two nodes' roles, the current Primary must first be demoted to Secondary; only then can the former Secondary be promoted to Primary.
First unmount:
[root@node1 ~]# umount /mydata/
Then demote the current node:
[root@node1 ~]# drbdadm secondary web
Check the status:
[root@node1 ~]# drbd-overview
  0:web  Connected Secondary/Secondary UpToDate/UpToDate C r-----
Now switch to node2 and promote it:
[root@node2 ~]# drbdadm primary web
Check the status:
[root@node2 ~]# drbd-overview
  0:web  Connected Primary/Secondary UpToDate/UpToDate C r----- /mydata ext3 940M 18M 875M 2%
Create the mount point on node2 and mount the device:
[root@node2 ~]# mkdir /mydata
[root@node2 ~]# mount /dev/drbd0 /mydata/
[root@node2 ~]# cat /mydata/fstab
/dev/VolGroup00/LogVol00 /                       ext3    defaults        1 1
LABEL=/boot              /boot                   ext3    defaults        1 2
tmpfs                    /dev/shm                tmpfs   defaults        0 0
devpts                   /dev/pts                devpts  gid=5,mode=620  0 0
sysfs                    /sys                    sysfs   defaults        0 0
proc                     /proc                   proc    defaults        0 0
/dev/VolGroup00/LogVol01 swap                    swap    defaults        0 0
The fstab copied on node1 is present and identical, confirming the data was replicated.
Further reading: the official DRBD documentation