Below are some of the most commonly used tasks; many more are not listed here, and you can look them up in the tasks directory yourself. This will probably also be the last post in this series. Other topics, such as the code execution flow, are simple enough that there is not much to write about, though I may cover them later if needed.
Setting Up a Teuthology Ceph Automated Testing Platform on CentOS (Part 1)
Setting Up a Teuthology Ceph Automated Testing Platform on CentOS (Part 2)
Setting Up a Teuthology Ceph Automated Testing Platform on CentOS (Part 3)
Setting Up a Teuthology Ceph Automated Testing Platform on CentOS (Part 4)
Setting Up a Teuthology Ceph Automated Testing Platform on CentOS (Part 5)
Deploying Teuthology Nodes: Ceph Automated Testing Platform (Part 6)
Using Teuthology and Writing Ceph Automated Test Cases (Part 1)
Using Teuthology and Writing Ceph Automated Test Cases (Part 2)
Install packages for a given project.
tasks:
- install:
project: ceph
branch: bar
- install:
project: samba
branch: foo
extra_packages: ['samba']
- install:
rhbuild: 1.3.0
playbook: downstream_setup.yml
vars:
yum_repos:
- url: "http://location.repo"
name: "ceph_repo"
Overrides are project specific:
overrides:
install:
ceph:
sha1: ...
Debug packages may optionally be installed:
overrides:
install:
ceph:
debuginfo: true
Default package lists (which come from packages.yaml) may be overridden:
overrides:
install:
ceph:
packages:
deb:
- ceph-osd
- ceph-mon
rpm:
- ceph-devel
- rbd-fuse
When tag, branch and sha1 do not reference the same commit hash, the
tag takes precedence over the branch and the branch takes precedence
over the sha1.
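As a hedged illustration (the tag, branch, and sha1 values below are made up for this sketch), if a single install task specified all three, the tag would win::
    tasks:
    - install:
        project: ceph
        tag: v0.94.1     # takes precedence; this is what gets installed
        branch: hammer   # ignored in favour of the tag
        sha1: 1234       # ignored in favour of the tag and branch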
When the overrides have a sha1 that is different from the sha1 of
the project to be installed, it will be a noop if the project has
a branch or tag, because they take precedence over the sha1. For
instance:
overrides:
install:
ceph:
sha1: 1234
tasks:
- install:
project: ceph
sha1: 4567
branch: foobar # which has sha1 4567
The override will transform the tasks as follows:
tasks:
- install:
project: ceph
sha1: 1234
branch: foobar # which has sha1 4567
But the branch takes precedence over the sha1 and foobar
will be installed. The override of the sha1 has no effect.
When passed 'rhbuild' as a key, it will attempt to install an RH Ceph build using ceph-deploy.
Reminder regarding teuthology-suite side effects:
The teuthology-suite command always adds the following:
overrides:
install:
ceph:
sha1: 1234
where sha1 matches the --ceph argument. For instance if
teuthology-suite is called with --ceph master, the sha1 will be
the tip of master. If called with --ceph v0.94.1, the sha1 will be
the v0.94.1 (as returned by git rev-parse v0.94.1 which is not to
be confused with git rev-parse v0.94.1^{commit})
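As a hedged sketch of the net effect (the sha1 is only a placeholder), a job scheduled with --ceph v0.94.1 whose yaml contains a bare install task therefore behaves as if it were written as::
    overrides:
      install:
        ceph:
          sha1: <sha1 of v0.94.1>   # as returned by git rev-parse v0.94.1
    tasks:
    - install:
        project: ceph               # picks up the sha1 from the overrides above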
Upgrades packages for a given project, e.g. {cmd_parameter} = ceph_deploy_upgrade.
For example::
tasks:
- install.{cmd_parameter}:
all:
branch: end
or specify specific roles::
tasks:
- install.{cmd_parameter}:
mon.a:
branch: end
osd.0:
branch: other
or rely on the overrides for the target version::
overrides:
install:
ceph:
sha1: ...
tasks:
- install.{cmd_parameter}:
all:
(HACK: the overrides will *only* apply the sha1/branch/tag if those
keys are not present in the config.)
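For instance (a hedged sketch; the branch name is only illustrative), an entry that already carries a branch keeps it, and the sha1 from the overrides is not applied::
    overrides:
      install:
        ceph:
          sha1: ...
    tasks:
    - install.{cmd_parameter}:
        all:
          branch: hammer   # explicit here, so the overridden sha1 is ignored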
It is also possible to attempt to exclude packages from the upgrade set:
tasks:
- install.{cmd_parameter}:
exclude_packages: ['ceph-test', 'ceph-test-dbg']
Set up and tear down a Ceph cluster.
For example::
tasks:
- ceph:
- interactive:
You can also specify what branch to run::
tasks:
- ceph:
branch: foo
Or a tag::
tasks:
- ceph:
tag: v0.42.13
Or a sha1::
tasks:
- ceph:
sha1: 1376a5ab0c89780eab39ffbbe436f6a6092314ed
Or a local source dir::
tasks:
- ceph:
path: /home/sage/ceph
To capture code coverage data, use::
tasks:
- ceph:
coverage: true
To use btrfs, ext4, or xfs on the target's scratch disks, use::
tasks:
- ceph:
fs: xfs
mkfs_options: [-b,size=65536,-l,logdev=/dev/sdc1]
mount_options: [nobarrier, inode64]
Note, this will cause the task to check the /scratch_devs file on each node
for available devices. If no such file is found, /dev/sdb will be used.
To run some daemons under valgrind, include their names
and the tool/args to use in a valgrind section::
tasks:
- ceph:
valgrind:
mds.1: --tool=memcheck
osd.1: [--tool=memcheck, --leak-check=no]
Those nodes which are using memcheck or valgrind will get
checked for bad results.
To adjust or modify config options, use::
tasks:
- ceph:
conf:
section:
key: value
For example::
tasks:
- ceph:
conf:
mds.0:
some option: value
other key: other value
client.0:
debug client: 10
debug ms: 1
By default, the cluster log is checked for errors and warnings,
and the run marked failed if any appear. You can ignore log
entries by giving a list of egrep compatible regexes, i.e.:
tasks:
- ceph:
log-whitelist: ['foo.*bar', 'bad message']
To run multiple ceph clusters, use multiple ceph tasks, and roles
with a cluster name prefix, e.g. cluster1.client.0. Roles with no
cluster use the default cluster name, 'ceph'. OSDs from separate
clusters must be on separate hosts. Clients and non-osd daemons
from multiple clusters may be colocated. For each cluster, add an
instance of the ceph task with the cluster name specified, e.g.::
roles:
- [mon.a, osd.0, osd.1]
- [backup.mon.a, backup.osd.0, backup.osd.1]
- [client.0, backup.client.0]
tasks:
- ceph:
cluster: ceph
- ceph:
cluster: backup
Wait for a failure of a ceph daemon. For example::
tasks:
- ceph.wait_for_failure: [mds.*]
tasks:
- ceph.wait_for_failure: [osd.0, osd.2]
tasks:
- ceph.wait_for_failure:
daemons: [osd.0, osd.2]
Stop ceph daemons. For example::
tasks:
- ceph.stop: [mds.*]
tasks:
- ceph.stop: [osd.0, osd.2]
tasks:
- ceph.stop:
daemons: [osd.0, osd.2]
Restart ceph daemons. For example::
tasks:
- ceph.restart: [all]
For example::
tasks:
- ceph.restart: [osd.0, mon.1, mds.*]
or::
tasks:
- ceph.restart:
daemons: [osd.0, mon.1]
wait-for-healthy: false
wait-for-osds-up: true
Run an interactive Python shell, with the cluster accessible via the ``ctx`` variable.
Hit ``control-D`` to continue.
This is also useful to pause the execution of the test between two
tasks, either to perform ad hoc operations, or to examine the
state of the cluster. You can also use it to easily bring up a
Ceph cluster for ad hoc testing.
For example::
tasks:
- ceph:
- interactive:
Run RadosModel-based integration tests. These exercise the basic internal operations such as reads, writes, snapshots, rollback, erasure coding, and cloning; under the hood they invoke osd/TestRados.cc under ceph/src/test.
The config should be as follows::
rados:
clients: [client list]
ops: <number of ops>
objects: <number of objects to use>
max_in_flight: <max number of operations in flight>
object_size: <size of objects in bytes>
min_stride_size: <minimum write stride size in bytes>
max_stride_size: <maximum write stride size in bytes>
op_weights: <dictionary mapping operation type to integer weight>
runs: <number of times to run> - the pool is remade between runs
ec_pool: use an ec pool
erasure_code_profile: profile to use with the erasure coded pool
fast_read: enable ec_pool's fast_read
min_size: set the min_size of created pool
pool_snaps: use pool snapshots instead of selfmanaged snapshots
write_fadvise_dontneed: write behavior like with LIBRADOS_OP_FLAG_FADVISE_DONTNEED. This means the data will not be accessed again in the near future, so the OSD backend need not keep it in cache.
For example::
tasks:
- ceph:
- rados:
clients: [client.0]
ops: 1000
max_seconds: 0 # 0 for no limit
objects: 25
max_in_flight: 16
object_size: 4000000
min_stride_size: 1024
max_stride_size: 4096
op_weights:
read: 20
write: 10
delete: 2
snap_create: 3
rollback: 2
snap_remove: 0
ec_pool: create an ec pool, defaults to False
erasure_code_use_overwrites: test overwrites, default false
erasure_code_profile:
name: teuthologyprofile
k: 2
m: 1
crush-failure-domain: osd
pool_snaps: true
write_fadvise_dontneed: true
runs: 10
- interactive:
Optionally, you can provide the pool name to run against:
tasks:
- ceph:
- exec:
client.0:
- ceph osd pool create foo
- rados:
clients: [client.0]
pools: [foo]
...
Alternatively, you can provide a pool prefix:
tasks:
- ceph:
- exec:
client.0:
- ceph osd pool create foo.client.0
- rados:
clients: [client.0]
pool_prefix: foo
...
The tests are run asynchronously; they are not complete when the task
returns. For instance:
- rados:
clients: [client.0]
pools: [ecbase]
ops: 4000
objects: 500
op_weights:
read: 100
write: 100
delete: 50
copy_from: 50
- print: "**** done rados ec-cache-agent (part 2)"
will run the print task immediately after the rados task begins, but
not after it completes. To make the rados task a blocking / sequential
task, use:
- sequential:
- rados:
clients: [client.0]
pools: [ecbase]
ops: 4000
objects: 500
op_weights:
read: 100
write: 100
delete: 50
copy_from: 50
- print: "**** done rados ec-cache-agent (part 2)"
Under overrides, you can add ceph configuration with higher priority. This lets you add configuration without modifying the existing yaml files, which improves reuse.
overrides: override behavior. Typically, this includes sub-tasks being overridden. Overrides technically is not a task (there is no ‘def task’ in an overrides.py file), but from a user’s standpoint can be described as behaving like one. Sub-tasks can nest further information. For example, overrides of install tasks are project specific, so the following section of a yaml file would cause all ceph installations to default to using the jewel branch:
overrides:
install:
ceph:
branch: jewel
For example::
tasks:
- ceph:
- ceph-fuse: [client.0]
- workunit:
clients:
client.0: [direct_io, xattrs.sh]
client.1: [snaps]
branch: foo
You can also run a list of workunits on all clients:
tasks:
- ceph:
- ceph-fuse:
- workunit:
tag: v0.47
clients:
all: [direct_io, xattrs.sh, snaps]
If you have an "all" section it will run all the workunits
on each client simultaneously, AFTER running any workunits specified
for individual clients. (This prevents unintended simultaneous runs.)
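A hedged sketch combining both forms (reusing the workunit names from above): the client.0 list runs first, then the all list runs on every client::
    tasks:
    - ceph:
    - ceph-fuse:
    - workunit:
        clients:
          client.0: [direct_io]
          all: [snaps]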
To customize tests, you can specify environment variables as a dict. You
can also specify a time limit for each work unit (defaults to 3h):
tasks:
- ceph:
- ceph-fuse:
- workunit:
sha1: 9b28948635b17165d17c1cf83d4a870bd138ddf6
clients:
all: [snaps]
env:
FOO: bar
BAZ: quux
timeout: 3h
This task supports roles that include a ceph cluster, e.g.::
tasks:
- ceph:
- workunit:
clients:
backup.client.0: [foo]
client.1: [bar] # cluster is implicitly 'ceph'
You can also specify an alternative top-level dir to 'qa/workunits', like
'qa/standalone', with::
tasks:
- install:
- workunit:
basedir: qa/standalone
clients:
client.0:
- test-ceph-helpers.sh
How it works::
- pick a monitor
- kill it
- wait for quorum to be formed
- sleep for 'revive_delay' seconds
- revive monitor
- wait for quorum to be formed
- sleep for 'thrash_delay' seconds
Options::
seed Seed to use on the RNG to reproduce a previous
behaviour (default: None; i.e., not set)
revive_delay Number of seconds to wait before reviving
the monitor (default: 10)
thrash_delay Number of seconds to wait in-between
test iterations (default: 0)
thrash_store Thrash monitor store before killing the monitor being thrashed (default: False)
thrash_store_probability Probability of thrashing a monitor's store
(default: 50)
thrash_many Thrash multiple monitors instead of just one. If
'maintain-quorum' is set to False, then we will
thrash up to as many monitors as there are
available. (default: False)
maintain_quorum Always maintain quorum, taking care on how many
monitors we kill during the thrashing. If we
happen to only have one or two monitors configured,
if this option is set to True, then we won't run
this task as we cannot guarantee maintenance of
quorum. Setting it to false however would allow the
task to run with as many as just one single monitor.
(default: True)
freeze_mon_probability: how often to freeze the mon instead of killing it,
in % (default: 0)
freeze_mon_duration: how many seconds to freeze the mon (default: 15)
scrub Scrub after each iteration (default: True)
Note: if 'thrash_store' is set to True, then 'maintain_quorum' must also
be set to True.
For example::
tasks:
- ceph:
- mon_thrash:
revive_delay: 20
thrash_delay: 1
thrash_store: true
thrash_store_probability: 40
seed: 31337
maintain_quorum: true
thrash_many: true
- ceph-fuse:
- workunit:
clients:
all:
- mon/workloadgen.sh
“Thrash” the OSDs by randomly marking them out/down (and then back in) until the task is ended. This loops, and every op_delay seconds it randomly chooses to add or remove an OSD (even odds) unless there are fewer than min_out OSDs out of the cluster, or more than min_in OSDs in the cluster.
All commands are run on mon0 and it stops when __exit__ is called.
The config is optional, and is a dict containing some or all of:
cluster: (default 'ceph') the name of the cluster to thrash
min_in: (default 4) the minimum number of OSDs to keep in the cluster
min_out: (default 0) the minimum number of OSDs to keep out of the cluster
op_delay: (5) the length of time to sleep between changing an OSD's status
min_dead: (0) minimum number of osds to leave down/dead.
max_dead: (0) maximum number of osds to leave down/dead before waiting
for clean. This should probably be num_replicas - 1.
clean_interval: (60) the approximate length of time to loop before
waiting until the cluster goes clean. (In reality this is used
to probabilistically choose when to wait, and the method used
makes it closer to -- but not identical to -- the half-life.)
scrub_interval: (-1) the approximate length of time to loop before
waiting until a scrub is performed while cleaning. (In reality
this is used to probabilistically choose when to wait, and it
only applies to the cases where cleaning is being performed).
-1 is used to indicate that no scrubbing will be done.
chance_down: (0.4) the probability that the thrasher will mark an
OSD down rather than marking it out. (The thrasher will not
consider that OSD out of the cluster, since presently an OSD
wrongly marked down will mark itself back up again.) This value
can be either an integer (eg, 75) or a float probability (eg
0.75).
chance_test_min_size: (0) chance to run test_pool_min_size,
which:
- kills all but one osd
- waits
- kills that osd
- revives all other osds
- verifies that the osds fully recover
timeout: (360) the number of seconds to wait for the cluster
to become clean after each cluster change. If this doesn't
happen within the timeout, an exception will be raised.
revive_timeout: (150) number of seconds to wait for an osd asok to
appear after attempting to revive the osd
thrash_primary_affinity: (true) randomly adjust primary-affinity
chance_pgnum_grow: (0) chance to increase a pool's size
chance_pgpnum_fix: (0) chance to adjust pgpnum to pg for a pool
pool_grow_by: (10) amount to increase pgnum by
max_pgs_per_pool_osd: (1200) don't expand pools past this size per osd
pause_short: (3) duration of short pause
pause_long: (80) duration of long pause
pause_check_after: (50) assert osd down after this long
chance_inject_pause_short: (1) chance of injecting short stall
chance_inject_pause_long: (0) chance of injecting long stall
clean_wait: (0) duration to wait before resuming thrashing once clean
sighup_delay: (0.1) duration to delay between sending signal.SIGHUP to a
random live osd
powercycle: (false) whether to power cycle the node instead
of just the osd process. Note that this assumes that a single
osd is the only important process on the node.
bdev_inject_crash: (0) seconds to delay while inducing a synthetic crash.
The delay lets the BlockDevice "accept" more aio operations but blocks
any flush, and then eventually crashes (losing some or all ios). If 0, no bdev failure injection is enabled.
bdev_inject_crash_probability: (.5) probability of doing a bdev failure
injection crash vs a normal OSD kill.
chance_test_backfill_full: (0) chance to simulate full disks stopping
Backfill
chance_test_map_discontinuity: (0) chance to test map discontinuity
map_discontinuity_sleep_time: (40) time to wait for map trims
ceph_objectstore_tool: (true) whether to export/import a pg while an osd is down
chance_move_pg: (1.0) chance of moving a pg if more than 1 osd is down (default 100%)
optrack_toggle_delay: (2.0) duration to delay between toggling op tracker
enablement to all osds
dump_ops_enable: (true) continuously dump ops on all live osds
noscrub_toggle_delay: (2.0) duration to delay between toggling noscrub
disable_objectstore_tool_tests: (false) disable ceph_objectstore_tool
based tests
chance_thrash_cluster_full: .05
chance_thrash_pg_upmap: 1.0
chance_thrash_pg_upmap_items: 1.0
example:
tasks:
- ceph:
- thrashosds:
cluster: ceph
chance_down: 10
op_delay: 3
min_in: 1
timeout: 600
- interactive:
Check if there are any clock skews among the monitors in the quorum.
This task accepts the following options:
interval amount of seconds to wait before check. (default: 30.0)
expect-skew 'true' or 'false', to indicate whether to expect a skew during
the run or not. If 'true', the test will fail if no skew is
found, and succeed if a skew is indeed found; if 'false', it's
the other way around. (default: false)
- mon_clock_skew_check:
expect-skew: true
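A slightly fuller hedged sketch, using the documented defaults and run after a ceph task::
    tasks:
    - ceph:
    - mon_clock_skew_check:
        interval: 30.0
        expect-skew: false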