CephFS Environment Setup (Part 2)

Preface
"CephFS Environment Setup (Part 1)" showed how to build a minimal CephFS with a single Monitor and a single MDS and export it for use. This part goes a step further: we create three pools, keeping SSD and SATA disks in separate pools, deploy multiple Monitors and multiple MDSes, and combine erasure coding with cache tiering to build a more complete CephFS.
I. Layout
        There are three hosts, node1, node2 and node3, each with three OSDs, as shown in the figure below. OSDs 1, 3, 5, 6, 7 and 8 are SSDs, while OSDs 0, 2 and 4 are SATA disks.
        Each of the three hosts runs one Monitor and one MDS.
        We use osd 1, 3 and 5 to build a pool named ssd, using three-way replication; osd 0, 2 and 4 to build a pool named sata, using erasure coding with k=2, m=1, i.e. two OSDs hold data chunks and one OSD holds the coding (parity) chunk, so the pool can survive the loss of any single OSD; and osd 6, 7 and 8 to build a pool named metadata that stores the CephFS metadata.
        The ssd and sata pools form a cache tier in writeback mode, with ssd as the hot storage (the cache) and sata as the cold storage (the backing store). The sata and metadata pools are then combined into a CephFS, which is mounted at /mnt/cephfs.

[Figure 1: cluster layout of nodes, OSDs, Monitors and MDSes]

II. Steps

1. Install the software

(1) Install the dependencies
apt-get install autoconf automake autotools-dev libbz2-dev debhelper default-jdk git javahelper junit4 libaio-dev libatomic-ops-dev libbabeltrace-ctf-dev libbabeltrace-dev libblkid-dev libboost-dev libboost-program-options-dev libboost-system-dev libboost-thread-dev libcurl4-gnutls-dev libedit-dev libexpat1-dev libfcgi-dev libfuse-dev libgdata-common libgdata13 libgoogle-perftools-dev libkeyutils-dev libleveldb-dev libnss3-dev libsnappy-dev liblttng-ust-dev libtool libudev-dev libxml2-dev pkg-config python python-argparse python-nose uuid-dev uuid-runtime xfslibs-dev yasm

(2) Build and install the source package
wget http://ceph.com/download/ceph-0.89.tar.gz
tar -xzf ceph-0.89.tar.gz && cd ceph-0.89
./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var
make -j4
make install
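As a quick sanity check, the ceph binary that is now on the PATH should report the release that was just built (0.89):
ceph --version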

Reference: http://docs.ceph.com/docs/master/install/manual-deployment/

(3) Synchronize the clocks
Run ntpdate cn.pool.ntp.org on every node.

2. Set up the Monitor


(1) Copy the init script into place
cp src/init-ceph /etc/init.d/ceph

(2) Generate an fsid
uuidgen
2fc115bf-b7bf-439a-9c23-8f39f025a9da
Edit /etc/ceph/ceph.conf and set fsid = 2fc115bf-b7bf-439a-9c23-8f39f025a9da
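At this stage only a minimal [global] section is needed for the monitor bootstrap; a sketch, assuming the cluster starts with node1 alone (the full configuration used in this article, with all three monitors, is listed at the end of this section):
[global]
fsid = 2fc115bf-b7bf-439a-9c23-8f39f025a9da
mon initial members = node1
mon host = 172.10.2.171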

(3) Generate a monitor keyring in /tmp/ceph.mon.keyring
ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *'

(4) Generate an administrator keyring in /etc/ceph/ceph.client.admin.keyring
ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring --gen-key -n client.admin --set-uid=0 --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *'

(5) Import ceph.client.admin.keyring into ceph.mon.keyring
ceph-authtool /tmp/ceph.mon.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring

(6) On node1, generate a monitor map containing one Mon named node1 and save it to /tmp/monmap
monmaptool --create --add node1 172.10.2.171 --fsid 2fc115bf-b7bf-439a-9c23-8f39f025a9da /tmp/monmap

(7) Create the directory that will hold the monitor's data (mainly a keyring and store.db)
mkdir -p /var/lib/ceph/mon/ceph-node1

(8) Populate the monitor daemon's initial data using the monitor map and keyring
ceph-mon --mkfs -i node1 --monmap /tmp/monmap --keyring /tmp/ceph.mon.keyring

(9) Mark the monitor as initialized
touch /var/lib/ceph/mon/ceph-node1/done

(10) Start the monitor
/etc/init.d/ceph start mon.node1
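To check that the monitor is up, either of the following should show mon.node1 in the monitor map:
ceph mon stat
ceph -s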



3. Add OSDs

(1) Prepare (partition and format) the disk
ceph-disk prepare --cluster ceph --cluster-uuid 2fc115bf-b7bf-439a-9c23-8f39f025a9da --fs-type xfs /dev/sdb

mkdir -p /var/lib/ceph/bootstrap-osd/
mkdir -p /var/lib/ceph/osd/ceph-0

(2) Activate (mount) the OSD
ceph-disk activate /dev/sdb1 --activate-key /var/lib/ceph/bootstrap-osd/ceph.keyring

(3) After adding the [osd] entries to /etc/ceph/ceph.conf, /etc/init.d/ceph start will bring up all of the OSDs.
If ceph osd stat still shows the OSDs as not up after starting, remove the upstart marker file, e.g.:
rm -rf /var/lib/ceph/osd/ceph-2/upstart
then start again with /etc/init.d/ceph start

(4) Set up passwordless SSH to node2 (this step is optional)
ssh-keygen
ssh-copy-id node2

(5) Adding OSDs on the second node requires copying some configuration first:
scp /etc/ceph/ceph.conf [email protected]:/etc/ceph/
scp /etc/ceph/ceph.client.admin.keyring [email protected]:/etc/ceph/
scp /var/lib/ceph/bootstrap-osd/ceph.keyring [email protected]:/var/lib/ceph/bootstrap-osd/
Then repeat steps (1)-(3) above.
Proceed in the same way until node1, node2 and node3 each have 3 OSDs.
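Once all nine OSDs have been created and started, they should all report as up and in; a quick check (the same commands are mentioned again in step 7):
ceph osd stat
ceph osd tree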

4. Create the MDSes for the file system

(1) Create the directory for the MDS data
mkdir -p /var/lib/ceph/mds/ceph-node1/

(2) Generate the MDS keyring (required when cephx authentication is used)
ceph auth get-or-create mds.node1 mon 'allow rwx' osd 'allow *' mds 'allow *' -o /var/lib/ceph/mds/ceph-node1/keyring

(3) Start mds.node1
/etc/init.d/ceph start mds.node1
Repeat these steps so that node1, node2 and node3 each run an MDS.

5. Add a Monitor on the second node
(1)ssh node2
(2)mkdir -p /var/lib/ceph/mon/ceph-node2
(3)ceph auth get mon. -o /tmp/ceph.mon.keyring
(4)ceph-authtool /tmp/ceph.mon.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring
(5)ceph mon getmap -o /tmp/monmap
(6)ceph-mon --mkfs -i node2 --monmap /tmp/monmap --keyring /tmp/ceph.mon.keyring
(7)touch /var/lib/ceph/mon/ceph-node2/done
(8)rm -f /var/lib/ceph/mon/ceph-node2/upstart
(9)/etc/init.d/ceph start mon.node2
Proceed in the same way so that node1, node2 and node3 each run a Monitor.
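With all three monitors running, they should form a quorum, which can be verified with:
ceph quorum_status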

At this point ps -ef | grep ceph should show one mon process, one mds process and three osd processes on every node, and ceph -s can also be used to check the cluster status. The configuration file is as follows:
[global]
fsid = 2fc115bf-b7bf-439a-9c23-8f39f025a9da
mon initial members = node1,node2,node3
mon host = 172.10.2.171,172.10.2.172,172.10.2.173
public network = 172.10.2.0/24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd journal size = 1024
filestore xattr use omap = true
osd pool default size = 3
osd pool default min size = 1
osd pool default pg num = 333
osd pool default pgp num = 333
osd crush chooseleaf type = 1
[mon.node1]
host = node1
mon addr = 172.10.2.171:6789
[mon.node2]
host = node2
mon addr = 172.10.2.172:6789
[mon.node3]
host = node3
mon addr = 172.10.2.173:6789
[osd]
osd crush update on start = false
[osd.0]
host = node1
addr = 172.10.2.171:6789
[osd.1]
host = node1
addr = 172.10.2.171:6789
[osd.2]
host = node2
addr = 172.10.2.172:6789
[osd.3]
host = node2
addr = 172.10.2.172:6789
[osd.4]
host = node3
addr = 172.10.2.173:6789
[osd.5]
host = node3
addr = 172.10.2.173:6789
[osd.6]
host = node3
addr = 172.10.2.173:6789
[osd.7]
host = node2
addr = 172.10.2.172:6789
[osd.8]
host = node1
addr = 172.10.2.171:6789
[mds.node1]
host = node1
[mds.node2]
host = node2
[mds.node3]
host = node3

6. Modify the CRUSH map
(1) Get the current CRUSH map
ceph osd getcrushmap -o compiled-crushmap-filename

(2) Decompile it
crushtool -d compiled-crushmap-filename -o decompiled-crushmap-filename

(3) Edit decompiled-crushmap-filename: add rulesets, define three roots corresponding to the three pools, map the OSDs into those roots, reference each root from its rule with step take, and set the rule type and so on.

(4) Recompile it
crushtool -c decompiled-crushmap-filename -o compiled-crushmap-filename

(5) Install the new CRUSH map
ceph osd setcrushmap -i  compiled-crushmap-filename

The edited CRUSH map is as follows:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

root sata {
  id -1   # do not change unnecessarily
  # weight 0.000
  alg straw
  hash 0  # rjenkins1
  item osd.0 weight 0.1
  item osd.2 weight 0.1
  item osd.4 weight 0.1
}
root ssd {
  id -8    # do not change unnecessarily
  #weight 0.000
  alg straw
  hash 0  # rjenkins1
  item osd.1 weight 0.1
  item osd.3 weight 0.1
  item osd.5 weight 0.1
}
root metadata {
  id -9    # do not change unnecessarily
  #weight 0.000
  alg straw
  hash 0  # rjenkins1
  item osd.7 weight 0.1
  item osd.6 weight 0.1
  item osd.8 weight 0.1
}

rule ssd {
 ruleset 1
 type replicated
 min_size 1
 max_size 10
 step take ssd
 step chooseleaf firstn 0 type osd
 step emit
}

rule sata {
 ruleset 0
 type erasure
 min_size 1
 max_size 10
 step take sata
 step chooseleaf firstn 0 type osd
 step emit
}

rule metadata {
 ruleset 2
 type replicated
 min_size 1
 max_size 10
 step take metadata
 step chooseleaf firstn 0 type osd
 step emit
}
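After the new map is installed, the three rules should be visible to the cluster; a quick check:
ceph osd crush rule ls
ceph osd crush rule dump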

7. Create the pools

(1) Create the ssd pool with type replicated
Command template: ceph osd pool create {pool-name} {pg-num} [{pgp-num}] [replicated] [crush-ruleset-name]
Actual command: ceph osd pool create ssd 128 128 replicated ssd

(2) Create the sata pool with type erasure
Command template: ceph osd pool create {pool-name} {pg-num} {pgp-num} erasure [erasure-code-profile] [crush-ruleset-name]
Actual command: ceph osd pool create sata 128 128 erasure default sata
To list the available erasure-code profiles, run ceph osd erasure-code-profile ls.
To show the contents of a profile, run ceph osd erasure-code-profile get default, which prints:
directory=/usr/lib/ceph/erasure-code
k=2
m=1
plugin=jerasure
technique=reed_sol_van
The erasure-code profile matters: once it has been applied to a pool it cannot be changed. A profile is created with:
ceph osd erasure-code-profile set myprofile \
   k=3 \
   m=2 \
   ruleset-failure-domain=rack
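A profile created this way is then passed to ceph osd pool create in place of default. For example, a hypothetical pool named ecpool built from the myprofile profile above (note that on this three-node cluster a k=3, m=2 profile with a rack failure domain could not actually be satisfied; the command only illustrates the syntax):
ceph osd pool create ecpool 128 128 erasure myprofile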

(3) Create the metadata pool with type replicated
  ceph osd pool create metadata 128 128 replicated metadata

ceph pg dump shows the PG status and which OSDs each PG maps to.
Use ceph osd lspools to list the pools and ceph osd tree to inspect the OSDs.

8. Set up the cache tier

A cache tier has two modes, writeback and readonly.
writeback: writes go to the cache pool first and the ACK is returned to the client; the cache pool later flushes the data down to the storage pool. On a read, data found only in the storage pool is first promoted into the cache pool and then returned to the client.
readonly: writes go directly to the storage pool. On a read, data found in the storage pool is copied into the cache pool (stale objects already in the cache pool are removed) and then returned to the client. This mode does not guarantee consistency between the two tiers, so reads may well return stale data from the cache pool; it is therefore unsuitable for workloads whose data changes frequently.

(1) Add the tier
Command template: ceph osd tier add {storagepool} {cachepool}
Actual command: ceph osd tier add sata ssd

(2) Set the tier mode (writeback or readonly)
Command template: ceph osd tier cache-mode {cachepool} {cache-mode}
Actual command: ceph osd tier cache-mode ssd writeback

(3) A writeback cache additionally needs an overlay
Command template: ceph osd tier set-overlay {storagepool} {cachepool}
Actual command: ceph osd tier set-overlay sata ssd

(4) Set the cache parameters
Command template: ceph osd pool set {cachepool} {key} {value}
To read a value back: ceph osd pool get {cachepool} {key}

ceph osd pool set ssd hit_set_type bloom
ceph osd pool set ssd hit_set_count 1
ceph osd pool set ssd hit_set_period 3600
ceph osd pool set ssd target_max_bytes 1000000000000
ceph osd pool set ssd cache_target_dirty_ratio 0.4
ceph osd pool set ssd cache_target_full_ratio 0.8
ceph osd pool set ssd target_max_objects 1000000
ceph osd pool set ssd cache_min_flush_age 600
ceph osd pool set ssd cache_min_evict_age 1800
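With these values (and assuming, as written above, that the ratios are applied to the ssd cache pool), the tiering agent starts flushing dirty objects once roughly 0.4 x 1 TB = 400 GB of the cache is dirty, and starts evicting objects once the cache holds about 0.8 x 1 TB = 800 GB or 1,000,000 objects, while cache_min_flush_age and cache_min_evict_age keep objects younger than 600 s and 1800 s from being flushed or evicted.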

9. Create the CephFS

Command template: ceph fs new {fs-name} {metadata-pool} {data-pool}
Actual command: ceph fs new cephfs metadata sata
Use ceph fs ls to inspect the file system.
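To confirm that one of the MDSes has picked up the new file system and gone active:
ceph mds stat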

10. Mount the CephFS

(1) Create the mount point
mkdir /mnt/cephfs

(2) Obtain the secret key
ceph-authtool --print-key /etc/ceph/ceph.client.admin.keyring
AQBNw5dU9K5MCxAAxnDaE0f9UCA/zAWo/hfnSg==
Alternatively, read the key directly from the ceph.client.admin.keyring file.

(3) Mount
If the kernel supports CephFS natively, mount it directly with: mount -t ceph node1:6789:/ /mnt/cephfs -o name=admin,secret=AQBNw5dU9K5MCxAAxnDaE0f9UCA/zAWo/hfnSg==

If the kernel does not support CephFS (Red Hat, for example, does not), mount it with FUSE instead: ceph-fuse -m node1:6789 /mnt/cephfs
When cephx authentication is enabled, make sure ceph.client.admin.keyring (which contains the key) is present under /etc/ceph.
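Passing the key on the command line exposes it in the process list; as an alternative sketch, the key can be saved to a file (the path /etc/ceph/admin.secret here is just an example) and referenced with the secretfile option:
echo "AQBNw5dU9K5MCxAAxnDaE0f9UCA/zAWo/hfnSg==" > /etc/ceph/admin.secret
mount -t ceph node1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret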
