Create the project lab and the user lab, and grant the user role
source ~/.openstack/.admin-openrc
openstack project create --domain default --description "Lab Project" lab
openstack user create --domain default --password-prompt lab
openstack role add --project lab --user lab user
Note: you can also log in to the Dashboard as admin and create the project lab and the user lab graphically.
Set the environment variables for the OpenStack lab user
① Create the lab user's environment file: vi ~/.openstack/.lab-openrc
# Add environment variables for the lab user
export OS_PROJECT_DOMAIN_NAME=default
export OS_USER_DOMAIN_NAME=default
export OS_PROJECT_NAME=lab
export OS_USERNAME=lab
# To avoid security problems, remove the OS_PASSWORD variable
# Use the --password parameter with OpenStack client commands instead
export OS_PASSWORD=lab@a112
export OS_AUTH_URL=http://controller:5000/v3
export OS_AUTH_TYPE=password
export OS_IDENTITY_API_VERSION=3
export OS_IMAGE_API_VERSION=2
② Load the environment variables
source ~/.openstack/.lab-openrc
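To sanity-check that the credentials are loaded correctly, requesting a token is a quick test:
openstack token issue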
Cloudera Plugin image
List the plugins available to sahara
source ~/.openstack/.lab-openrc
openstack dataprocessing plugin list
Note: this article uses cdh 5.5.0.
Download the image for that version
wget http://sahara-files.mirantis.com/images/upstream/mitaka/sahara-mitaka-cloudera-5.5.0-ubuntu.qcow2
Create the image
source ~/.openstack/.lab-openrc
openstack image create "sahara-mitaka-cloudera-5.5.0-ubuntu" --file sahara-mitaka-cloudera-5.5.0-ubuntu.qcow2 --disk-format qcow2 --container-format bare
List the images
openstack image list
Create the network, subnet, and router used to deploy the cluster
source ~/.openstack/.lab-openrc
neutron net-create selfservice-sahara-cluster
neutron subnet-create --name selfservice-sahara-cluster --dns-nameserver 8.8.4.4 --gateway 172.16.100.1 selfservice-sahara-cluster 172.16.100.0/24
neutron router-create router
neutron router-interface-add router selfservice-sahara-cluster
neutron router-gateway-set router provider
openstack network list
openstack subnet list
neutron router-list
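Note: on newer python-openstackclient releases the neutron CLI is deprecated; a rough, untested equivalent of the commands above is:
openstack network create selfservice-sahara-cluster
openstack subnet create --network selfservice-sahara-cluster --subnet-range 172.16.100.0/24 --gateway 172.16.100.1 --dns-nameserver 8.8.4.4 selfservice-sahara-cluster
openstack router create router
openstack router add subnet router selfservice-sahara-cluster
openstack router set --external-gateway provider router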
Create the flavor used to deploy the cluster
source ~/.openstack/.admin-openrc
openstack flavor create --vcpus 4 --ram 8192 --disk 20 sahara-flavor
openstack flavor list
Sahara
Get the ID of the image sahara-mitaka-cloudera-5.5.0-ubuntu
source ~/.openstack/.lab-openrc
export IMAGE_ID=$(openstack image list | awk '/ sahara-mitaka-cloudera-5.5.0-ubuntu / { print $2 }')
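Note: the same ID can usually be fetched directly by image name, assuming the client supports the -f/-c output filters:
export IMAGE_ID=$(openstack image show sahara-mitaka-cloudera-5.5.0-ubuntu -f value -c id)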
Get the username and tags for the image sahara-mitaka-cloudera-5.5.0-ubuntu
Reference: http://docs.openstack.org/developer/sahara/userdoc/cdh_plugin.html
① Username: ubuntu
② Tags: cdh, 5.5.0
Register the image
openstack dataprocessing image register $IMAGE_ID --username ubuntu
Add the tags
openstack dataprocessing image tags add $IMAGE_ID --tags cdh 5.5.0
Get the basic information
① Flavor ID: 8d824f5a-a829-42ad-9878-f38318cc9821
openstack flavor list | awk '/ sahara-flavor / { print $2 }'
② Floating IP pool ID: 20b2a466-cd25-4b9a-9194-2b8005a8b547
openstack ip floating pool list
openstack network list | awk '/ provider / { print $2 }'
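A small sketch that captures both IDs into shell variables, which is handy when filling in the JSON templates below:
export FLAVOR_ID=$(openstack flavor list | awk '/ sahara-flavor / { print $2 }')
export POOL_ID=$(openstack network list | awk '/ provider / { print $2 }')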
Create the cdh-550-default-namenode node group template
① Create the file namenode.json with the following content:
{
"plugin_name": "cdh",
"hadoop_version": "5.5.0",
"node_processes": [
"HDFS_NAMENODE",
"YARN_RESOURCEMANAGER",
"HIVE_SERVER2",
"HIVE_METASTORE",
"CLOUDERA_MANAGER"
],
"name": "cdh-550-default-namenode",
"floating_ip_pool": "20b2a466-cd25-4b9a-9194-2b8005a8b547",
"flavor_id": "8d824f5a-a829-42ad-9878-f38318cc9821",
"auto_security_group": true,
"is_protected": true
}
② Create the node group template:
openstack dataprocessing node group template create --json namenode.json
Note: the template can also be created directly from the command line, for example:
openstack dataprocessing node group template create --name vanilla-default-worker --plugin <plugin> --plugin-version <version> --processes HDFS_NAMENODE YARN_RESOURCEMANAGER HIVE_SERVER2 HIVE_METASTORE CLOUDERA_MANAGER --flavor <flavor-id> --floating-ip-pool <pool-id> --auto-security-group
Create the cdh-550-default-secondary-namenode node group template
① Create the file secondary-namenode.json with the following content:
{
"plugin_name": "cdh",
"hadoop_version": "5.5.0",
"node_processes": [
"HDFS_SECONDARYNAMENODE",
"OOZIE_SERVER",
"YARN_JOBHISTORY",
"SPARK_YARN_HISTORY_SERVER"
],
"name": "cdh-550-default-secondary-namenode",
"floating_ip_pool": "20b2a466-cd25-4b9a-9194-2b8005a8b547",
"flavor_id": "8d824f5a-a829-42ad-9878-f38318cc9821",
"auto_security_group": true,
"is_protected": true
}
② Create the node group template:
openstack dataprocessing node group template create --json secondary-namenode.json
Create the cdh-550-default-datanode node group template
① Create the file datanode.json with the following content:
{
"plugin_name": "cdh",
"hadoop_version": "5.5.0",
"node_processes": [
"HDFS_DATANODE",
"YARN_NODEMANAGER"
],
"name": "cdh-550-default-datanode",
"floating_ip_pool": "20b2a466-cd25-4b9a-9194-2b8005a8b547",
"flavor_id": "8d824f5a-a829-42ad-9878-f38318cc9821",
"auto_security_group": true,
"is_protected": true
}
② Create the node group template:
openstack dataprocessing node group template create --json datanode.json
Get the node group template IDs
① List the node group templates
openstack dataprocessing node group template list
② Get the ID of each node group template:
| Node Group Template | ID |
| --- | --- |
| cdh-550-default-namenode | f8eb08e6-80d5-4409-af7e-13009e694603 |
| cdh-550-default-secondary-namenode | a4ebb4d5-67b4-41f2-969a-2ac6db4f892f |
| cdh-550-default-datanode | c80540fe-98b7-4dc8-9e94-0cd93c23c0f7 |
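A hedged sketch that extracts each ID into a shell variable instead of copying it by hand; it assumes, as the cluster template command later does, that the ID is the second column of the list output (adjust the awk field if your client prints a different layout):
export NAMENODE_TPL_ID=$(openstack dataprocessing node group template list | awk '/ cdh-550-default-namenode / { print $4 }')
export SECONDARY_TPL_ID=$(openstack dataprocessing node group template list | awk '/ cdh-550-default-secondary-namenode / { print $4 }')
export DATANODE_TPL_ID=$(openstack dataprocessing node group template list | awk '/ cdh-550-default-datanode / { print $4 }')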
Create the cluster template cdh-550-default-cluster
① Create the file cluster.json with the following content:
{
"plugin_name": "cdh",
"hadoop_version": "5.5.0",
"node_groups": [
{
"name": "datanode",
"count": 8,
"node_group_template_id": "c80540fe-98b7-4dc8-9e94-0cd93c23c0f7"
},
{
"name": "secondary-namenode",
"count": 1,
"node_group_template_id": "a4ebb4d5-67b4-41f2-969a-2ac6db4f892f"
},
{
"name": "namenode",
"count": 1,
"node_group_template_id": "f8eb08e6-80d5-4409-af7e-13009e694603"
}
],
"name": "cdh-550-default-cluster",
"cluster_configs": {},
"is_protected": true
}
② Create the cluster template
openstack dataprocessing cluster template create --json cluster.json
List the cluster templates
openstack dataprocessing cluster template list
Get the basic information needed to create the cluster
① Create a key pair
source ~/.openstack/.lab-openrc
openstack keypair create labkey --public-key ~/.ssh/id_rsa.pub
openstack keypair list
② Get the ID of the cluster template cdh-550-default-cluster
openstack dataprocessing cluster template list | awk '/ cdh-550-default-cluster / { print $4 }'
③ Get the ID of the image registered with sahara (the cluster's default image)
openstack dataprocessing image list | awk '/ sahara-mitaka-cloudera-5.5.0-ubuntu / { print $4 }'
④ Get the ID of the cluster network selfservice-sahara-cluster
openstack network list | awk '/ selfservice-sahara-cluster / { print $2 }'
Create the cluster creation configuration file cluster_create.json with the following content:
{
"plugin_name": "cdh",
"hadoop_version": "5.5.0",
"name": "cluster-1",
"cluster_template_id" : "b55ef1b7-b5df-4642-9543-71a9fe972ac0",
"user_keypair_id": "labkey",
"default_image_id": "1b0a2a22-26d5-4a0f-b186-f19dbacbb971",
"neutron_management_network": "548e06a1-f86c-4dd7-bdcd-dfa1c3bdc24f",
"is_protected": true
}
Create the cluster
openstack dataprocessing cluster create --json cluster_create.json
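Cluster creation takes a while; progress can be followed with:
openstack dataprocessing cluster list
openstack dataprocessing cluster show cluster-1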
Sahara cluster cannot be deleted
A delete was issued while the cluster was still being created; the cluster status then stays at Deleting and the cluster can no longer be deleted.
Cause
Not yet identified.
Solution
Connect to the sahara database and manually delete the records for this cluster from the clusters and node_groups tables.
mysql -usahara -p
use sahara;
show tables;
delete from node_groups;
delete from clusters;
Note: here the clusters and node_groups tables contained only the data for the cluster that had just been created, so all rows were deleted. It is safer to add a WHERE clause so that only the rows for that cluster ID are removed, as sketched below.
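A minimal sketch of the targeted delete; it assumes node_groups references the cluster through a cluster_id column, and <cluster-id> stands for the cluster's UUID:
delete from node_groups where cluster_id = '<cluster-id>';
delete from clusters where id = '<cluster-id>';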
RAM quota exceeded
Cluster creation fails with a RAM quota error:
Quota exceeded for RAM: Requested 81920, but available 51200
Error ID: c196131b-047d-4ed8-9dbd-4cc074cb8147
Cause
The total amount of RAM requested for the cluster exceeds the project's RAM quota.
Solution
Log in as admin and raise the RAM quota of the lab project.
source .openstack/.admin-openrc
openstack quota show lab
openstack quota set --ram 81920 lab
floating_ip quota exceeded
Quota exceeded for floating ip: Requested 10, but available 0
Error ID: d5d04298-ba8b-466c-80cc-aa12ca989d8f
Cause
The error claims the lab project's floating IP quota is insufficient, yet checking the quota showed plenty of floating IPs available. Deleting the cluster and retrying produced the same error. According to the official documentation
http://docs.openstack.org/developer/sahara/userdoc/configuration.guide.html#floating-ip-management
the real problem was a misconfiguration in /etc/sahara/sahara.conf: use_floating_ips had been set to False.
Solution
① Edit the configuration file (sudo vi /etc/sahara/sahara.conf) and change use_floating_ips=False to use_floating_ips=True
② Apply the change to the database
su root
sahara-db-manage --config-file /etc/sahara/sahara.conf upgrade head
③ Restart the sahara services
sudo service sahara-api restart
sudo service sahara-engine restart
Sahara cluster creation fails with status Error
Problem
After creating a cluster, the cluster status shows Error:
2016-07-22 11:01:39.339 7763 ERROR sahara.service.trusts [req-5414693a-cb53-4974-a222-e4431dacc834 4a2d8a220ac94aa0a2056a50c35a88c5 b6a282f2d53a4c9ebca385ace50042e8 - - -] [instance: none, cluster: 44c81451-200b-4711-9801-1d49f0468da9] Unable to create trust (reason: Expecting to find id or name in user - the server could not comply with the request since it is either malformed or otherwise incorrect. The client is assumed to be in error. (HTTP 400) (Request-ID: req-3981c44d-4c09-4254-aacb-d67ee74746f8))
2016-07-22 11:01:39.476 7763 ERROR sahara.service.ops [req-5414693a-cb53-4974-a222-e4431dacc834 4a2d8a220ac94aa0a2056a50c35a88c5 b6a282f2d53a4c9ebca385ace50042e8 - - -] [instance: none, cluster: 44c81451-200b-4711-9801-1d49f0468da9] Error during operating on cluster (reason: Failed to create trust
Cause
Analysis showed that keystone_authtoken authentication failed: sahara lacked administrator privileges.
Solution
① Edit the configuration (sudo vi /etc/sahara/sahara.conf) and add the following to the keystone authentication section:
Note: replace SAHARA_PASS with the actual password
[keystone_authtoken]
identity_uri = http://controller:35357
admin_tenant_name = service
admin_user = sahara
admin_password = SAHARA_PASS
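Note: identity_uri and the admin_* options are deprecated in newer keystonemiddleware releases; a rough, untested sketch of the newer style would be:
[keystone_authtoken]
auth_uri = http://controller:5000
auth_url = http://controller:35357
auth_type = password
project_domain_name = default
user_domain_name = default
project_name = service
username = sahara
password = SAHARA_PASS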
② Apply the change to the database
su root
sahara-db-manage --config-file /etc/sahara/sahara.conf upgrade head
③ Restart the sahara services
sudo service sahara-api restart
sudo service sahara-engine restart
Problem
When creating a cluster, the status first shows Spawning and then Error. The log file /var/log/sahara/sahara-engine.log reports:
2016-07-22 21:18:27.317 110479 WARNING sahara.service.heat.heat_engine [req-8e497091-c8ce-429c-a0bf-4141753c5582 4a2d8a220ac94aa0a2056a50c35a88c5 b6a282f2d53a4c9ebca385ace50042e8 - - -] [instance: none, cluster: 05ed7935-3063-4410-9277-25754a73726f] Cluster creation rollback (reason: Heat stack failed with status Resource CREATE failed: OverQuotaClient: resources.cdh-550-default-namenode.resources[0].resources.cluster-1-cdh-550-default-namenode-3e806c9b: Quota exceeded for resources: ['security_group_rule'].
Neutron server returns request_ids: ['req-4d91968f-451f-426c-aa67-d8827f1ad426']
Error ID: 03aa7921-898e-4444-9c7f-c2321a5f8bdb)
2016-07-22 21:18:27.911 110479 INFO sahara.utils.cluster [req-8e497091-c8ce-429c-a0bf-4141753c5582 4a2d8a220ac94aa0a2056a50c35a88c5 b6a282f2d53a4c9ebca385ace50042e8 - - -] [instance: none, cluster: 05ed7935-3063-4410-9277-25754a73726f] Cluster status has been changed. New status=Error
Cause
The lab project's security group rule quota was exceeded.
Solution
Log in as admin and raise the secgroups and secgroup-rules quotas of the lab project.
Note: a negative quota means unlimited.
source .openstack/.admin-openrc
openstack quota show lab
openstack quota set --secgroups -1 lab
openstack quota set --secgroup-rules -1 lab
Problem
Instances can ping external networks, but their floating IPs cannot be pinged.
Cause
The rules of the default security group allow neither ICMP nor SSH; adding the corresponding rules fixes it.
Solution
source ~/.openstack/.lab-openrc
openstack security group rule create --proto icmp default
openstack security group rule create --proto tcp --dst-port 22 default
Heat stack resource creation timeout
References:
OS::Heat::WaitCondition doesnt work after upgrade to Liberty
wait condition in HOT heat template
OpenStack Orchestration In Depth, Part II: Single Instance Deployments
https://aws.amazon.com/premiumsupport/knowledge-center/cloudformation-waitcondition-timed-out/
https://ask.openstack.org/en/question/42657/how-to-debug-scripts-at-heats-softwareconfig-resource
Problem
When creating a cluster, the status stays at Spawning. The log file /var/log/sahara/sahara-engine.log shows:
2016-07-22 22:51:00.470 119076 WARNING sahara.service.heat.heat_engine [req-7c025790-b9b3-40a0-bc36-e04421ecfd21 4a2d8a220ac94aa0a2056a50c35a88c5 b6a282f2d53a4c9ebca385ace50042e8 - - -] [instance: none, cluster: 03fb1184-4cc2-49c3-a0f5-c5b156908f48] Cluster creation rollback (reason: Heat stack failed with status Resource CREATE failed: WaitConditionTimeout: resources.cdh-550-default-secondary-namenode.resources[0].resources.cdh-550-default-secondary-namenode-wc-waiter: 0 of 1 received
Error ID: 8a34fc47-4e84-4818-8728-78e543c97efb)
2016-07-22 22:51:01.069 119076 INFO sahara.utils.cluster [req-7c025790-b9b3-40a0-bc36-e04421ecfd21 4a2d8a220ac94aa0a2056a50c35a88c5 b6a282f2d53a4c9ebca385ace50042e8 - - -] [instance: none, cluster: 03fb1184-4cc2-49c3-a0f5-c5b156908f48] Cluster status has been changed. New status=Error
The log file /var/log/heat/heat-engine.log shows:
2016-07-28 10:21:12.748 7561 INFO heat.engine.resources.openstack.heat.wait_condition [-] HeatWaitCondition "cdh-550-default-secondary-namenode-wc-waiter" Stack "testcb185f1e-cdh-550-default-secondary-namenode-htifx6ojz65i-0-yrd44ot5ddwk" [8eb36d2d-02b6-42ab-8f6f-1ae0baeea159] Timed out (0 of 1 received)
2016-07-28 10:21:12.762 7563 DEBUG heat.engine.scheduler [-] Task stack_task from Stack "testcb185f1e-cdh-550-default-namenode-bym45epemale-0-az3fffva54ro" [58bb1eac-744b-4130-b486-98f726975dc0] running step /usr/lib/python2.7/dist-packages/heat/engine/scheduler.py:216
2016-07-28 10:21:12.763 7563 DEBUG heat.engine.scheduler [-] Task create running step /usr/lib/python2.7/dist-packages/heat/engine/scheduler.py:216
2016-07-28 10:21:12.749 7561 INFO heat.engine.resource [-] CREATE: HeatWaitCondition "cdh-550-default-secondary-namenode-wc-waiter" Stack "testcb185f1e-cdh-550-default-secondary-namenode-htifx6ojz65i-0-yrd44ot5ddwk" [8eb36d2d-02b6-42ab-8f6f-1ae0baeea159]
2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource Traceback (most recent call last):
2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource File "/usr/lib/python2.7/dist-packages/heat/engine/resource.py", line 704, in _action_recorder
2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource yield
2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource File "/usr/lib/python2.7/dist-packages/heat/engine/resource.py", line 775, in _do_action
2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource yield self.action_handler_task(action, args=handler_args)
2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource File "/usr/lib/python2.7/dist-packages/heat/engine/scheduler.py", line 314, in wrapper
2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource step = next(subtask)
2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource File "/usr/lib/python2.7/dist-packages/heat/engine/resource.py", line 749, in action_handler_task
2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource while not check(handler_data):
2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource File "/usr/lib/python2.7/dist-packages/heat/engine/resources/openstack/heat/wait_condition.py", line 130, in check_create_complete
2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource return self._wait(*data)
2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource File "/usr/lib/python2.7/dist-packages/heat/engine/resources/openstack/heat/wait_condition.py", line 108, in _wait
2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource raise exc
2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource WaitConditionTimeout: 0 of 1 received
2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource
Cause
① From the logs, the initial analysis was that stack creation failed: the stack creation request timed out, which ultimately left the cluster in the Error state.
② During cluster creation, once the instances had reached ACTIVE, the ID of the stack that remained in CREATE_IN_PROGRESS was taken from the logs and its resource list was inspected on the controller node with openstack stack resource list <stack-id>. Only the wc-waiter resource failed to complete; it stayed in CREATE_IN_PROGRESS, as shown in the figure below.
③ Following the AWS hint above, the guess was that the instances never notified heat that they had finished booting.
④ After SSHing into an instance and inspecting /var/lib/cloud/data/cfn-userdata, there is a section that sends an HTTP request to heat-api to signal successful creation, but the heat-api address uses the hostname controller instead of an IP address. ping showed that the controller's IP address was reachable while the hostname controller was not resolvable. After replacing the hostname with the IP address and running the HTTP request by hand, the reply was OK, as shown below.
After logging in to every cluster node and running that HTTP request by hand, the cluster was created successfully.
⑤ Testing and analysis therefore point to the heat-api endpoint having been created with a hostname instead of an IP address. The instances cannot resolve the hostname, so they cannot reach heat-api or send the creation-complete signal; the heat stack blocks on the wait condition and creation eventually times out.
With the workaround applied, cluster creation succeeded, as shown below.
Solution
① Method 1: during cluster creation, once the instances show ACTIVE, log in to every cluster node and manually add the record 192.168.1.11 controller to each node's /etc/hosts (a minimal sketch follows this list). This is tedious and only a temporary workaround.
② Method 2: when creating the heat-api endpoint, use the IP address instead of the hostname. (Untested, but it should resolve the problem.)
③ Method 3: configure dnsmasq so that instances can resolve the hostname. (Several configuration attempts were unsuccessful; whether this is feasible needs further study.)
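A minimal sketch of Method 1, assuming the nodes' floating IPs are collected in a file (node-ips.txt here is hypothetical) and the image's ubuntu user has passwordless sudo:
for ip in $(cat node-ips.txt); do
    ssh ubuntu@"$ip" "echo '192.168.1.11 controller' | sudo tee -a /etc/hosts"
done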
Problem
When creating a sahara cluster, the instances are created successfully but the cluster status stays at Starting; the backend logs report insufficient disk space.
Cause
df -h showed disk usage close to 100%. Running du -h --max-depth=1 in the suspect directories showed that /var/lib and /var/log used the most space, with the system log /var/log/syslog alone taking 20 GB, so log files were the culprit.
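The checks described above, roughly:
df -h
du -h --max-depth=1 /var/log /var/lib | sort -h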
Solution
① For log files, truncate them with cat /dev/null > /var/log/syslog and do the same for any other logs that take a lot of space. In a test environment, the following script can clear them all.
#!/bin/bash
for i in $(find /var/log -name "*.log")
do
echo "$i"
cat /dev/null > "$i"
done
exit 0
Note: for an oversized mongodb log, rotate the log instead, as follows.
Reference: https://docs.mongodb.com/manual/tutorial/rotate-log-files/
ps -ef | grep mongod    # find the mongod process ID
kill -SIGUSR1 <mongod process id>
Alternatively, log in to mongodb from the command line and run:
use admin
db.runCommand( { logRotate : 1 } )
② Under /var/lib, the largest directories were /var/lib/mongodb and /var/lib/glance. /var/lib/glance holds the OpenStack image files, which is normal usage, while /var/lib/mongodb/ contains many files named ceilometer.*, the ceilometer metering data stored in mongodb. In a test environment where they are not needed, they can simply be deleted.
Reference: mongodb does not release disk space after a collection is dropped
mongo --host controller
show dbs
use ceilometer
db // show the current database
show collections
db.meter.count()
db.meter.remove({}) // removing a large collection takes a while
db.repairDatabase() // reclaim the disk space
Problem
When sahara creates a cluster, the cluster services fail to start.
① The log sahara-engine.log shows:
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [req-ee794aca-7c2c-4f55-a2c9-4e7f4233c4a8 4a2d8a220ac94aa0a2056a50c35a88c5 b6a282f2d53a4c9ebca385ace50042e8 - - -] [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] Error during operating on cluster (reason: Failed to Provision Hadoop Cluster: Failed to start service.
Error ID: df99b010-46e0-41df-89ac-95490a52fc90)
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] Traceback (most recent call last):
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] File "/usr/lib/python2.7/dist-packages/sahara/service/ops.py", line 192, in wrapper
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] f(cluster_id, *args, **kwds)
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] File "/usr/lib/python2.7/dist-packages/sahara/service/ops.py", line 301, in _provision_cluster
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] plugin.start_cluster(cluster)
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] File "/usr/lib/python2.7/dist-packages/sahara/plugins/cdh/plugin.py", line 51, in start_cluster
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] cluster.hadoop_version).start_cluster(cluster)
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] File "/usr/lib/python2.7/dist-packages/sahara/plugins/cdh/abstractversionhandler.py", line 109, in start_cluster
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] self.deploy.start_cluster(cluster)
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] File "/usr/lib/python2.7/dist-packages/sahara/plugins/cdh/v5_5_0/deploy.py", line 165, in start_cluster
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] CU.first_run(cluster)
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] File "/usr/lib/python2.7/dist-packages/sahara/utils/cluster_progress_ops.py", line 139, in handler
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] add_fail_event(instance, e)
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] self.force_reraise()
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] six.reraise(self.type_, self.value, self.tb)
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] File "/usr/lib/python2.7/dist-packages/sahara/utils/cluster_progress_ops.py", line 136, in handler
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] value = func(*args, **kwargs)
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] File "/usr/lib/python2.7/dist-packages/sahara/plugins/cdh/cloudera_utils.py", line 42, in wrapper
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] raise ex.HadoopProvisionError(c.resultMessage)
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] HadoopProvisionError: Failed to Provision Hadoop Cluster: Failed to start service.
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] Error ID: df99b010-46e0-41df-89ac-95490a52fc90
2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e]
2016-08-12 21:28:08.185 7531 INFO sahara.utils.cluster [req-ee794aca-7c2c-4f55-a2c9-4e7f4233c4a8 4a2d8a220ac94aa0a2056a50c35a88c5 b6a282f2d53a4c9ebca385ace50042e8 - - -] [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] Cluster status has been changed. New status=Error
② The instance logs under /var/log/hdfs/ show the error:
java.io.IOException: File /user/ubuntu/pies could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1448)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:690)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:342)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1350)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1346)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1344)
at org.apache.hadoop.ipc.Client.call(Client.java:905)
at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:198)
at $Proxy0.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy0.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:928)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:811)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427)
Cause
References:
http://stackoverflow.com/questions/5293446/hdfs-error-could-only-be-replicated-to-0-nodes-instead-of-1
http://stackoverflow.com/questions/36015864/hadoop-be-replicated-to-0-nodes-instead-of-minreplication-1-there-are-1
Not yet resolved; the preliminary conclusion is that poor network connectivity between the datanodes and the namenode makes the datanodes unavailable.
Solution
Problem
After the sahara cluster hosts are rebooted, the corresponding CDH or Spark services do not start automatically, leaving the services unavailable.
Cause
For the Spark Plugin, the official documentation notes that Spark is not deployed as a standard Ubuntu service, so it is not restarted when the virtual machines reboot:
Spark is not deployed as a standard Ubuntu service and if the virtual machines are rebooted, Spark will not be restarted.
Reference:
http://docs.openstack.org/developer/sahara/userdoc/spark_plugin.html
Solution