Author: Haohao Zhang
Summary
To manage thousands of servers, we need a deployment tool that can handle all kinds of tasks.
The most widely used tools are Puppet, SaltStack, and Ansible.
Puppet and SaltStack both rely on an agent, while Ansible is agentless. That is an advantage, because you do not need another tool to manage the agents.
Ansible is also written in Python and ships with many modules.
You can develop your own modules and contribute them back to the community.
Ansible uses the SSH protocol to transfer data.
Here are some best practices that we want to share with you.
Practice1
Problem:
By default, the result is printed to the terminal after you execute the ansible or ansible-playbook command. Sometimes you want to run it in the background and write the result to a log. You might use “nohup” to do that, but you will find it is a disaster:
nohup ansible-playbook -i inventory main.yml -k -K -U root -u test -s -f 10 > ansible_log
Output:
  File "/usr/lib/python2.6/site-packages/ansible/runner/connection_plugins/ssh.py", line 162, in _communicate
    rfd, wfd, efd = select.select(rpipes, [], rpipes, 1)
ValueError: filedescriptor out of range in select()
Reason:
The Python client uses select() to wait for socket activity. select() is used because it is available on most platforms. However, select() has a hard limit on the value of a file descriptor. If a socket is created whose file descriptor is at or above the value of FD_SETSIZE, the exception shown above is thrown.
Note well: this is caused by the value of the fd, not the number of open fds.
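The point that the descriptor's value matters, not the count, can be checked with a short Python 3 sketch. It is only an illustration, not Ansible code: it assumes Linux, where FD_SETSIZE is normally 1024, and the function name and the descriptor number 1034 are made up for the demo. It opens a single pipe, copies the read end to a high descriptor number, and asks select() to watch it. poll() is included for comparison because it takes descriptor numbers directly.

```python
import os
import resource
import select

ASSUMED_FD_SETSIZE = 1024  # assumption: the usual Linux value; Python does not expose it


def probe_high_fd(high_fd=ASSUMED_FD_SETSIZE + 10):
    """Return (select_rejected, poll_accepted) for one descriptor numbered high_fd.

    Returns None when the demo cannot be set up on this system.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    needed = high_fd + 1
    if soft != resource.RLIM_INFINITY and soft < needed:
        if hard != resource.RLIM_INFINITY and hard < needed:
            return None
        try:
            # Raise only the soft limit so descriptor number high_fd becomes legal.
            resource.setrlimit(resource.RLIMIT_NOFILE, (needed, hard))
        except (ValueError, OSError):
            return None
    try:
        os.fstat(high_fd)
        return None  # that number is already in use; do not clobber it
    except OSError:
        pass

    read_end, write_end = os.pipe()
    try:
        # Only a handful of descriptors are open, but this one has a high number.
        os.dup2(read_end, high_fd)
        try:
            try:
                select.select([high_fd], [], [], 0)
                select_rejected = False
            except ValueError:
                select_rejected = True
            poller = select.poll()
            poller.register(high_fd, select.POLLIN)
            poller.poll(0)  # returns immediately; nothing was written to the pipe
            poll_accepted = True
        finally:
            os.close(high_fd)
    finally:
        os.close(read_end)
        os.close(write_end)
    return select_rejected, poll_accepted
```

The first element of the result tells you whether select() refused the descriptor with a ValueError, the second whether poll() accepted the same descriptor. If the soft limit on open files cannot be raised, the function returns None instead of guessing.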
Related issues in community:
https://github.com/ansible/ansible/issues/10157
https://issues.apache.org/jira/browse/QPID-5588
https://github.com/ansible/ansible/issues/14143
Reproduce:
cat test.py
#!/usr/bin/env python
# Python 2 script, matching the python2.6 environment in the traceback above.
import subprocess
import os
import select
import getpass

host = 'example.com'
timeout = 5
password = getpass.getpass(prompt="Enter your password:")

for i in range(10):
    # sshpass -d<fd> reads the password from the given file descriptor
    (r, w) = os.pipe()
    ssh_cmd = ['sshpass', '-d%d ' % r]
    ssh_cmd += ['ssh', '%s' % host, 'uptime']
    print ssh_cmd
    os.write(w, password + '\n')
    os.close(w)
    os.close(r)
    p = subprocess.Popen(ssh_cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    rpipes = [p.stdout, p.stderr]
    # watch the descriptor numbers handed to select()
    print "file descriptor: %r" % [x.fileno() for x in rpipes]
    rfd, wfd, efd = select.select([p.stdout, p.stderr], [], [p.stderr], timeout)
    if rfd:
        print p.stdout.read()
        print p.stderr.read()
    p.stdout.close()
    p.stderr.close()
Solution:
Running the ansible-playbook command under nohup leads to a file descriptor leak.
The right way to do it:
ansible-playbook -i inventory main.yml -k -K -U root -u test -s -f 10 >ansible_log 2>&1
Or you could use “screen” to keep the session alive.
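If you prefer to launch the run from a script, the redirection ">ansible_log 2>&1" can be sketched in Python 3 as below. This is an assumption-laden illustration, not something Ansible provides: run_logged is a hypothetical helper, and the demo uses a harmless stand-in command instead of ansible-playbook so it can be tried anywhere.

```python
import os
import subprocess
import sys
import tempfile


def run_logged(cmd, log_path):
    """Run cmd, send stdout and stderr to log_path, and return the exit code."""
    with open(log_path, "wb") as log:
        # stderr=subprocess.STDOUT plays the role of the shell's 2>&1
        return subprocess.call(cmd, stdout=log, stderr=subprocess.STDOUT)


# Stand-in command that writes one line to each stream.
demo_cmd = [
    sys.executable,
    "-c",
    "import sys; print('to stdout'); sys.stderr.write('to stderr\\n')",
]
demo_log = os.path.join(tempfile.mkdtemp(), "ansible_log")
exit_code = run_logged(demo_cmd, demo_log)
```

For a real run you would pass the ansible-playbook command line as a list of arguments in place of demo_cmd. Keep in mind that -k and -K prompt for passwords, so the process still needs a terminal.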
Practice2
Problem:
Imagine that we change the Hadoop configuration files and then need to restart the Hadoop service.
We do not want to restart the whole cluster at once, so we need a rolling restart.
Let's say 10 servers per batch.
How do we do this?
Solution:
Add "serial: $NUM" to the main playbook. Sample:
cat main.yml
---
- hosts: upgrade_rack
  gather_facts: yes
  vars_files:
    - vars.yml
  pre_tasks:
    - include: turn_off_monitor.yml
  tasks:
    - include: pre_upgrade.yml
    - include: upgrade.yml
    - include: post_upgrade.yml
  post_tasks:
    - include: turn_on_monitor.yml
  serial: 10
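To see what "serial: 10" means for a larger group, here is a small Python sketch of the batching idea. It only mimics the grouping, it is not how Ansible is implemented, and the host names are invented. With serial set, the whole play finishes on one batch before the next batch starts.

```python
def serial_batches(hosts, serial):
    """Split hosts into consecutive batches of at most `serial` hosts."""
    if serial < 1:
        raise ValueError("serial must be at least 1")
    return [hosts[i:i + serial] for i in range(0, len(hosts), serial)]


# Hypothetical inventory group with 25 hosts.
hosts = ["node%02d.example.com" % n for n in range(1, 26)]

for number, batch in enumerate(serial_batches(hosts, 10), start=1):
    print("batch %d: %d hosts, %s .. %s" % (number, len(batch), batch[0], batch[-1]))
```

With 25 hosts and a batch size of 10, the last batch is simply smaller than the others.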
Practice3
Problem:
We know that Ansible is used to deploy things to remote hosts, but sometimes, while the playbook tasks are running, we want to log in to another server and do something there.
How do we do this?
Solution:
Ansible provides the "delegate_to" feature for this.
Ansible hands the task over to the delegated host, executes the command there, and then switches back to the original host. Sample:
- name: "Refresh nodes on resourcemanager"
  shell: "yarn rmadmin -refreshNodes"
  delegate_to: "example.com"
Practice4
Problem:
Sometimes we need to change files when using Ansible. Ansible provides several modules for this, such as “lineinfile”, “replace”, and “blockinfile”.
Now consider a more complex case: we use the “replace” module to modify a configuration file on one and the same server, with forks set to 10.
What will happen?
The configuration file will be messed up, because multiple processes write to it at the same time.
- name: "Add hosts into mapred-exclude"
  replace: dest=mapred-exclude regexp='\Z' replace='{{inventory_hostname}}\n' owner=hadoop group=hadoop mode=644 backup=yes
  delegate_to: "example.com"
Solution:
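We can add a file lock to the source code of the “replace” module: take the lock before reading the file, and release it after the write.
import fcntl

f = open(dest, 'rb+')
fcntl.flock(f, fcntl.LOCK_EX)      # block until we hold the exclusive lock
contents = f.read()
result = do_something_to_contents  # placeholder for the substitution step; result[0] is the new contents
f.seek(0)
f.write(result[0])
f.truncate()
f.flush()                          # make sure the data is written before the lock is released
fcntl.flock(f, fcntl.LOCK_UN)
f.close()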
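The locking pattern above can be tried outside Ansible with a self-contained Python 3 sketch. The helper name append_host_locked, the temporary file, and the host names are assumptions for illustration; this is not the patch to the replace module itself. Threads stand in for Ansible's forked workers here. flock() applies per open file, so separate open() calls contend for the lock in the same way separate processes would.

```python
import fcntl
import os
import tempfile
import threading


def append_host_locked(path, hostname):
    """Append hostname to path as a read-modify-write under an exclusive lock."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # wait until no other writer holds the lock
        try:
            contents = f.read()
            if contents and not contents.endswith("\n"):
                contents += "\n"
            contents += hostname + "\n"
            f.seek(0)
            f.write(contents)
            f.truncate()
            f.flush()  # push the data out before the lock is released
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


# Ten concurrent writers, like a play delegated to one host with forks set to 10.
demo_path = os.path.join(tempfile.mkdtemp(), "mapred-exclude")
hosts = ["node%02d.example.com" % n for n in range(1, 11)]
workers = [
    threading.Thread(target=append_host_locked, args=(demo_path, h)) for h in hosts
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

with open(demo_path) as f:
    written = sorted(f.read().splitlines())
```

Each writer reads the whole file, changes it, and writes it back, which is the same shape as the module's edit. Without the lock, two writers could read the same old contents and one update would be lost.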
Practice5
Problem:
When running ansible-playbook against a batch of hosts, we often hit the following error in the “gather facts” step:
failed: [example.com] => {"cmd": "/bin/lsblk -ln --output UUID /dev/sdn1", "failed": true, "rc": 257}
msg: Traceback (most recent call last):
…………………
TimeoutError: Timer expired
Solution:
The cause is that the timeout for the get_mount_facts function in /usr/lib/python2.6/site-packages/ansible/module_utils/facts.py is hardcoded to 10 seconds.
Hadoop nodes often have high I/O load, so disks can be slow to respond, and 10 seconds is not enough.
This problem has been fixed in Ansible 2.2, which introduces a gather_timeout parameter.
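To show why a fixed timer bites on slow disks, here is a minimal Python 3 sketch of a timer-based timeout built on SIGALRM. It is an illustration under assumptions, not a copy of Ansible's facts.py: with_timeout, TimerExpired, and slow_disk_probe are hypothetical names, and the sleep stands in for a slow lsblk call. It works only on Unix and only in the main thread.

```python
import functools
import signal
import time


class TimerExpired(Exception):
    """Raised when the wrapped call runs longer than allowed (hypothetical name)."""


def with_timeout(seconds):
    """Decorator factory: abort the wrapped call after `seconds` seconds."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            def on_alarm(signum, frame):
                raise TimerExpired("Timer expired")

            previous = signal.signal(signal.SIGALRM, on_alarm)
            signal.setitimer(signal.ITIMER_REAL, seconds)
            try:
                return func(*args, **kwargs)
            finally:
                signal.setitimer(signal.ITIMER_REAL, 0)  # cancel the timer
                signal.signal(signal.SIGALRM, previous)  # restore the old handler
        return wrapper
    return decorator


def slow_disk_probe(delay):
    """Stand-in for a disk query that takes `delay` seconds to answer."""
    time.sleep(delay)
    return "uuid"


def probe_outcome(limit, delay):
    """Return 'ok' if the probe finishes within limit, else 'timeout'."""
    try:
        with_timeout(limit)(slow_disk_probe)(delay)
        return "ok"
    except TimerExpired:
        return "timeout"
```

When the limit is a constant in the source, a busy disk that needs a little longer always ends in a failure. Making the limit a parameter, as gather_timeout does, lets you raise it for hosts that are known to be slow.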