Deploying a Hadoop & Spark Cluster with Sahara on OpenStack (Mitaka)

  • Deploying a Hadoop & Spark Cluster with Sahara on OpenStack (Mitaka)
    • 1. Preparation
      • 1.1 Create the project and user for the cluster deployment
      • 1.2 Prepare the Cloudera plugin image
      • 1.3 Create the cluster network and flavor
    • 2. Deploying a CDH 5.5 Cluster with Sahara
      • 2.1 Register the image with Sahara
      • 2.2 Create node group templates
      • 2.3 Create the cluster template
      • 2.4 Create the cluster
    • 3. Elastic Data Processing (EDP)
    • 4. Problems Encountered
      • 4.1 Sahara cluster cannot be deleted
      • 4.2 Insufficient RAM quota
      • 4.3 Insufficient floating IP quota
      • 4.4 Sahara cluster creation fails with status Error
      • 4.5 Insufficient security group rule quota
      • 4.6 Instance floating IPs cannot be pinged
      • 4.7 Cluster creation fails with a Heat stack resource creation timeout
      • 4.8 Controller node runs out of disk space
      • 4.9 Sahara cluster fails to start Hadoop services
      • 4.10 CDH/Spark services do not start automatically after a cluster host reboot
    • References

1. Preparation

1.1 Create the project and user for the cluster deployment

  • Create project lab and user lab, and grant the user the regular user role

    source ~/.openstack/.admin-openrc
    openstack project create --domain default --description "Lab Project" lab
    openstack user create --domain default --password-prompt lab
    openstack role add --project lab --user lab user

    Note: you can also log in to the Dashboard as admin and create project lab and user lab through the GUI.

  • Set the OpenStack environment variables for user lab 
    ① Create the environment file for user lab: vi ~/.openstack/.lab-openrc

    # Environment variables for the lab user
    export OS_PROJECT_DOMAIN_NAME=default
    export OS_USER_DOMAIN_NAME=default
    export OS_PROJECT_NAME=lab
    export OS_USERNAME=lab
    # To avoid security problems, remove the OS_PASSWORD variable
    # and use the --password parameter with OpenStack client commands instead
    export OS_PASSWORD=lab@a112
    export OS_AUTH_URL=http://controller:5000/v3
    export OS_AUTH_TYPE=password
    export OS_IDENTITY_API_VERSION=3
    export OS_IMAGE_API_VERSION=2

    ② Load the environment variables

    source ~/.openstack/.lab-openrc
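
    To confirm that the credentials work, an optional sanity check:

    # Should return a token if the lab user, password, and auth URL are correct
    openstack token issue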

1.2 Prepare the Cloudera plugin image

  • List the plugins available in Sahara

    source ~/.openstack/.lab-openrc
    openstack dataprocessing plugin list

    Note: this article uses CDH 5.5.0.

  • Download the image for that version

    wget http://sahara-files.mirantis.com/images/upstream/mitaka/sahara-mitaka-cloudera-5.5.0-ubuntu.qcow2
  • Create the image

    source ~/.openstack/.lab-openrc
    openstack image create "sahara-mitaka-cloudera-5.5.0-ubuntu" --file sahara-mitaka-cloudera-5.5.0-ubuntu.qcow2 --disk-format qcow2 --container-format bare
  • List the images

    openstack image list

1.3 Create the cluster network and flavor

  • Create the network, subnet, and router for the cluster

    source ~/.openstack/.lab-openrc
    neutron net-create selfservice-sahara-cluster
    neutron subnet-create --name selfservice-sahara-cluster --dns-nameserver 8.8.4.4 --gateway 172.16.100.1 selfservice-sahara-cluster 172.16.100.0/24
    neutron router-create router
    neutron router-interface-add router selfservice-sahara-cluster
    neutron router-gateway-set router provider
    openstack network list
    openstack subnet list
    neutron router-list
  • Create the flavor for the cluster nodes

    source ~/.openstack/.admin-openrc
    openstack flavor create --vcpus 4 --ram 8192 --disk 20 sahara-flavor
    openstack flavor list

2. Deploying a CDH 5.5 Cluster with Sahara

2.1 Register the image with Sahara

  • Get the ID of the sahara-mitaka-cloudera-5.5.0-ubuntu image

    source ~/.openstack/.lab-openrc
    export IMAGE_ID=$(openstack image list | awk '/ sahara-mitaka-cloudera-5.5.0-ubuntu / { print $2 }')
  • Determine the username and tags for the sahara-mitaka-cloudera-5.5.0-ubuntu image 
    See: http://docs.openstack.org/developer/sahara/userdoc/cdh_plugin.html 
    ① Username: ubuntu 
    ② Tags: cdh and 5.5.0

  • Register the image

    openstack dataprocessing image register $IMAGE_ID --username ubuntu
  • Add the tags

    openstack dataprocessing image tags add $IMAGE_ID --tags cdh 5.5.0
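
    To confirm the registration and tags took effect, the registered images can be listed (the same command is used again in section 2.4):

    openstack dataprocessing image list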

2.2 Create node group templates

  • Gather the required IDs 
    ① Flavor ID: 8d824f5a-a829-42ad-9878-f38318cc9821

    openstack flavor list | awk '/ sahara-flavor / { print $2 }'

    ② Floating IP pool ID: 20b2a466-cd25-4b9a-9194-2b8005a8b547

    openstack ip floating pool list
    openstack network list | awk '/ provider / { print $2 }'
  • Create the cdh-550-default-namenode node group template 
    ① Create a file namenode.json with the following content:

    {
        "plugin_name": "cdh",
        "hadoop_version": "5.5.0",
        "node_processes": [
            "HDFS_NAMENODE",
            "YARN_RESOURCEMANAGER",
            "HIVE_SERVER2",
            "HIVE_METASTORE",
            "CLOUDERA_MANAGER"
        ],
        "name": "cdh-550-default-namenode",
        "floating_ip_pool": "20b2a466-cd25-4b9a-9194-2b8005a8b547",
        "flavor_id": "8d824f5a-a829-42ad-9878-f38318cc9821",
        "auto_security_group": true,
        "is_protected": true
    }

    ② Create the node group template:

    openstack dataprocessing node group template create --json namenode.json

    Note: the template can also be created directly on the command line, for example:

    openstack dataprocessing node group template create --name cdh-550-default-namenode --plugin <plugin> --plugin-version <plugin-version> --processes HDFS_NAMENODE YARN_RESOURCEMANAGER HIVE_SERVER2 HIVE_METASTORE CLOUDERA_MANAGER --flavor <flavor-id> --floating-ip-pool <pool-id> --auto-security-group
  • Create the cdh-550-default-secondary-namenode node group template 
    ① Create a file secondary-namenode.json with the following content:

    {
        "plugin_name": "cdh",
        "hadoop_version": "5.5.0",
        "node_processes": [
            "HDFS_SECONDARYNAMENODE",
            "OOZIE_SERVER",
            "YARN_JOBHISTORY",
            "SPARK_YARN_HISTORY_SERVER"
        ],
        "name": "cdh-550-default-secondary-namenode",
        "floating_ip_pool": "20b2a466-cd25-4b9a-9194-2b8005a8b547",
        "flavor_id": "8d824f5a-a829-42ad-9878-f38318cc9821",
        "auto_security_group": true,
        "is_protected": true
    }

    ② Create the node group template:

    openstack dataprocessing node group template create --json secondary-namenode.json
  • Create the cdh-550-default-datanode node group template 
    ① Create a file datanode.json with the following content:

    {
        "plugin_name": "cdh",
        "hadoop_version": "5.5.0",
        "node_processes": [
            "HDFS_DATANODE",
            "YARN_NODEMANAGER"
        ],
        "name": "cdh-550-default-datanode",
        "floating_ip_pool": "20b2a466-cd25-4b9a-9194-2b8005a8b547",
        "flavor_id": "8d824f5a-a829-42ad-9878-f38318cc9821",
        "auto_security_group": true,
        "is_protected": true
    }

    ② Create the node group template:

    openstack dataprocessing node group template create --json datanode.json

2.3 Create the cluster template

  • Get the node group template IDs 
    ① List the node group templates

    openstack dataprocessing node group template list

    ② Note the ID of each node group template:

    Node group template                  ID
    cdh-550-default-namenode             f8eb08e6-80d5-4409-af7e-13009e694603
    cdh-550-default-secondary-namenode   a4ebb4d5-67b4-41f2-969a-2ac6db4f892f
    cdh-550-default-datanode             c80540fe-98b7-4dc8-9e94-0cd93c23c0f7
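
    The IDs can also be captured in shell variables with the same awk pattern used elsewhere in this guide (the variable names are only for illustration; the ID is assumed to be the fourth column of the list output):

    NAMENODE_NGT_ID=$(openstack dataprocessing node group template list | awk '/ cdh-550-default-namenode / { print $4 }')
    SECONDARY_NGT_ID=$(openstack dataprocessing node group template list | awk '/ cdh-550-default-secondary-namenode / { print $4 }')
    DATANODE_NGT_ID=$(openstack dataprocessing node group template list | awk '/ cdh-550-default-datanode / { print $4 }')
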
  • Create the cdh-550-default-cluster cluster template 
    ① Create a file cluster.json with the following content:

    {
        "plugin_name": "cdh",
        "hadoop_version": "5.5.0",
        "node_groups": [
            {
                "name": "datanode",
                "count": 8,
                "node_group_template_id": "c80540fe-98b7-4dc8-9e94-0cd93c23c0f7"
            },
            {
                "name": "secondary-namenode",
                "count": 1,
                "node_group_template_id": "a4ebb4d5-67b4-41f2-969a-2ac6db4f892f"
            },
            {
                "name": "namenode",
                "count": 1,
                "node_group_template_id": "f8eb08e6-80d5-4409-af7e-13009e694603"
            }
        ],
        "name": "cdh-550-default-cluster",
        "cluster_configs": {},
        "is_protected": true
    }

    ② Create the cluster template

    openstack dataprocessing cluster template create --json cluster.json
  • List the cluster templates

    openstack dataprocessing cluster template list

2.4 Create the cluster

  • Gather the information needed to create the cluster 
    ① Create a keypair

    source ~/.openstack/.lab-openrc
    openstack keypair create labkey --public-key ~/.ssh/id_rsa.pub
    openstack keypair list

    ② Get the ID of the cdh-550-default-cluster cluster template

    openstack dataprocessing cluster template list | awk '/ cdh-550-default-cluster / { print $4 }'

    ③ Get the ID of the Sahara-registered image that the cluster will use as its default image

    openstack dataprocessing image list | awk '/ sahara-mitaka-cloudera-5.5.0-ubuntu / { print $4 }'

    ④ Get the ID of the selfservice-sahara-cluster network

    openstack network list | awk '/ selfservice-sahara-cluster / { print $2 }'
  • Create the cluster creation file cluster_create.json 
    with the following content (the IDs come from the lookups above):

    {
        "plugin_name": "cdh",
        "hadoop_version": "5.5.0",
        "name": "cluster-1",
        "cluster_template_id": "b55ef1b7-b5df-4642-9543-71a9fe972ac0",
        "user_keypair_id": "labkey",
        "default_image_id": "1b0a2a22-26d5-4a0f-b186-f19dbacbb971",
        "neutron_management_network": "548e06a1-f86c-4dd7-bdcd-dfa1c3bdc24f",
        "is_protected": true
    }
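
    Instead of pasting the IDs by hand, the same file can also be generated from the lookups above. A sketch, assuming the awk column positions shown earlier match your client's output (note the closing EOF must start at the beginning of the line):

    CLUSTER_TEMPLATE_ID=$(openstack dataprocessing cluster template list | awk '/ cdh-550-default-cluster / { print $4 }')
    DEFAULT_IMAGE_ID=$(openstack dataprocessing image list | awk '/ sahara-mitaka-cloudera-5.5.0-ubuntu / { print $4 }')
    NETWORK_ID=$(openstack network list | awk '/ selfservice-sahara-cluster / { print $2 }')
    cat > cluster_create.json <<EOF
    {
        "plugin_name": "cdh",
        "hadoop_version": "5.5.0",
        "name": "cluster-1",
        "cluster_template_id": "$CLUSTER_TEMPLATE_ID",
        "user_keypair_id": "labkey",
        "default_image_id": "$DEFAULT_IMAGE_ID",
        "neutron_management_network": "$NETWORK_ID",
        "is_protected": true
    }
EOF
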
  • Create the cluster

    openstack dataprocessing cluster create --json cluster_create.json
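
    Provisioning takes a while; the cluster status can be watched from the CLI (or in the Dashboard, if the Data Processing panel is enabled) until it reaches Active:

    openstack dataprocessing cluster list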

3. Elastic Data Processing (EDP)

4. Problems Encountered

4.1 Sahara cluster cannot be deleted

  • Problem 
    A delete was issued while the cluster was still being created, before creation had finished. The cluster status stays at Deleting and the cluster cannot be removed.

  • Cause 
    Not yet identified.

  • Solution 
    Connect to the sahara database and manually delete the cluster's records from the clusters and node_groups tables.

    mysql -usahara -p
    use sahara;
    show tables;
    delete from node_groups;
    delete from clusters;

    Note: in this case the clusters and node_groups tables only held the data for the just-created cluster, so all rows were deleted. It is safer to add a condition and delete only the rows belonging to that cluster's ID, as sketched below.
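
    A minimal sketch of that safer, condition-based deletion, run from the shell; the column names (node_groups.cluster_id and clusters.id) are assumptions about the sahara schema, and <cluster-id> is a placeholder for the stuck cluster's ID:

    mysql -usahara -p sahara -e "DELETE FROM node_groups WHERE cluster_id = '<cluster-id>'; DELETE FROM clusters WHERE id = '<cluster-id>';"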

4.2 Insufficient RAM quota

  • Problem 
    Cluster creation fails with a RAM quota error:

    Quota exceeded for RAM: Requested 81920, but available 51200
    Error ID: c196131b-047d-4ed8-9dbd-4cc074cb8147
  • Cause 
    The total amount of RAM requested for the cluster exceeds the project's RAM quota.

  • Solution 
    Log in as admin and raise the RAM quota of project lab:

    source .openstack/.admin-openrc
    openstack quota show lab
    openstack quota set --ram 81920 lab

4.3 Insufficient floating IP quota

  • Problem

    Quota exceeded for floating ip: Requested 10, but available 0
    Error ID: d5d04298-ba8b-466c-80cc-aa12ca989d8f
  • Cause 
    The error claims the floating IP quota of project lab is exhausted, but checking the quota shows plenty of floating IPs available. Deleting the cluster and retrying gives the same error. According to the official documentation 
    http://docs.openstack.org/developer/sahara/userdoc/configuration.guide.html#floating-ip-management 
    the real problem is a misconfiguration in /etc/sahara/sahara.conf: use_floating_ips had been set to False.

  • Solution 
    ① Edit the configuration file (sudo vi /etc/sahara/sahara.conf) and change use_floating_ips=False to use_floating_ips=true (a quick check is sketched at the end of this subsection). 
    ② Write the change to the database

    su root
    sahara-db-manage --config-file /etc/sahara/sahara.conf upgrade head

    ③ Restart the Sahara services

    sudo service sahara-api restart
    sudo service sahara-engine restart
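
    A quick check that the edit from step ① took effect (assuming use_floating_ips lives in the [DEFAULT] section of sahara.conf):

    grep -n 'use_floating_ips' /etc/sahara/sahara.conf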

4.4 Sahara cluster creation fails with status Error

  • Problem 
    After the cluster is created its status shows Error; the log contains:

    2016-07-22 11:01:39.339 7763 ERROR sahara.service.trusts [req-5414693a-cb53-4974-a222-e4431dacc834 4a2d8a220ac94aa0a2056a50c35a88c5 b6a282f2d53a4c9ebca385ace50042e8 - - -] [instance: none, cluster: 44c81451-200b-4711-9801-1d49f0468da9] Unable to create trust (reason: Expecting to find id or name in user - the server could not comply with the request since it is either malformed or otherwise incorrect. The client is assumed to be in error. (HTTP 400) (Request-ID: req-3981c44d-4c09-4254-aacb-d67ee74746f8))
    2016-07-22 11:01:39.476 7763 ERROR sahara.service.ops [req-5414693a-cb53-4974-a222-e4431dacc834 4a2d8a220ac94aa0a2056a50c35a88c5 b6a282f2d53a4c9ebca385ace50042e8 - - -] [instance: none, cluster: 44c81451-200b-4711-9801-1d49f0468da9] Error during operating on cluster (reason: Failed to create trust
  • Cause 
    Analysis shows the keystone_authtoken authentication fails because Sahara does not have admin privileges.

  • Solution 
    ① Edit the configuration (sudo vi /etc/sahara/sahara.conf) and add the following to the keystone authentication section. 
    Note: replace SAHARA_PASS with the actual password.

    [keystone_authtoken]
    identity_uri = http://controller:35357
    admin_tenant_name = service
    admin_user = sahara
    admin_password = SAHARA_PASS

    ② Write the change to the database

    su root
    sahara-db-manage --config-file /etc/sahara/sahara.conf upgrade head

    ③ Restart the Sahara services

    sudo service sahara-api restart
    sudo service sahara-engine restart

4.5 Insufficient security group rule quota

  • Problem 
    During cluster creation the status shows Spawning and then Error. The log file /var/log/sahara/sahara-engine.log contains:

    2016-07-22 21:18:27.317 110479 WARNING sahara.service.heat.heat_engine [req-8e497091-c8ce-429c-a0bf-4141753c5582 4a2d8a220ac94aa0a2056a50c35a88c5 b6a282f2d53a4c9ebca385ace50042e8 - - -] [instance: none, cluster: 05ed7935-3063-4410-9277-25754a73726f] Cluster creation rollback (reason: Heat stack failed with status Resource CREATE failed: OverQuotaClient: resources.cdh-550-default-namenode.resources[0].resources.cluster-1-cdh-550-default-namenode-3e806c9b: Quota exceeded for resources: ['security_group_rule'].
    Neutron server returns request_ids: ['req-4d91968f-451f-426c-aa67-d8827f1ad426']
    Error ID: 03aa7921-898e-4444-9c7f-c2321a5f8bdb)
    2016-07-22 21:18:27.911 110479 INFO sahara.utils.cluster [req-8e497091-c8ce-429c-a0bf-4141753c5582 4a2d8a220ac94aa0a2056a50c35a88c5 b6a282f2d53a4c9ebca385ace50042e8 - - -] [instance: none, cluster: 05ed7935-3063-4410-9277-25754a73726f] Cluster status has been changed. New status=Error
  • Cause 
    The security group rule quota of project lab has been exhausted.

  • Solution 
    Log in as admin and adjust the secgroups and secgroup-rules quotas of project lab. 
    Note: a negative quota means unlimited.

    source .openstack/.admin-openrc
    openstack quota show lab
    openstack quota set --secgroups -1 lab
    openstack quota set --secgroup-rules -1 lab

4.6 Instance floating IPs cannot be pinged

  • Problem 
    The instances can reach the external network, but their floating IPs cannot be pinged.

  • Cause 
    The default security group has no rules allowing ICMP or SSH; adding the corresponding rules fixes it.

  • Solution

    source ~/.openstack/.lab-openrc
    openstack security group rule create --proto icmp default
    openstack security group rule create --proto tcp --dst-port 22 default

4.7 Cluster creation fails with a Heat stack resource creation timeout

References: 
OS::Heat::WaitCondition doesn't work after upgrade to Liberty 
wait condition in HOT heat template 
OpenStack Orchestration In Depth, Part II: Single Instance Deployments 
https://aws.amazon.com/premiumsupport/knowledge-center/cloudformation-waitcondition-timed-out/ 
https://ask.openstack.org/en/question/42657/how-to-debug-scripts-at-heats-softwareconfig-resource

  • Problem 
    During cluster creation the status stays at Spawning. The log file /var/log/sahara/sahara-engine.log shows:

    2016-07-22 22:51:00.470 119076 WARNING sahara.service.heat.heat_engine [req-7c025790-b9b3-40a0-bc36-e04421ecfd21 4a2d8a220ac94aa0a2056a50c35a88c5 b6a282f2d53a4c9ebca385ace50042e8 - - -] [instance: none, cluster: 03fb1184-4cc2-49c3-a0f5-c5b156908f48] Cluster creation rollback (reason: Heat stack failed with status Resource CREATE failed: WaitConditionTimeout: resources.cdh-550-default-secondary-namenode.resources[0].resources.cdh-550-default-secondary-namenode-wc-waiter: 0 of 1 received
    Error ID: 8a34fc47-4e84-4818-8728-78e543c97efb)
    2016-07-22 22:51:01.069 119076 INFO sahara.utils.cluster [req-7c025790-b9b3-40a0-bc36-e04421ecfd21 4a2d8a220ac94aa0a2056a50c35a88c5 b6a282f2d53a4c9ebca385ace50042e8 - - -] [instance: none, cluster: 03fb1184-4cc2-49c3-a0f5-c5b156908f48] Cluster status has been changed. New status=Error

    The log file /var/log/heat/heat-engine.log shows:

    2016-07-28 10:21:12.748 7561 INFO heat.engine.resources.openstack.heat.wait_condition [-] HeatWaitCondition "cdh-550-default-secondary-namenode-wc-waiter" Stack "testcb185f1e-cdh-550-default-secondary-namenode-htifx6ojz65i-0-yrd44ot5ddwk" [8eb36d2d-02b6-42ab-8f6f-1ae0baeea159] Timed out (0 of 1 received)
    2016-07-28 10:21:12.762 7563 DEBUG heat.engine.scheduler [-] Task stack_task from Stack "testcb185f1e-cdh-550-default-namenode-bym45epemale-0-az3fffva54ro" [58bb1eac-744b-4130-b486-98f726975dc0] running step /usr/lib/python2.7/dist-packages/heat/engine/scheduler.py:216
    2016-07-28 10:21:12.763 7563 DEBUG heat.engine.scheduler [-] Task create running step /usr/lib/python2.7/dist-packages/heat/engine/scheduler.py:216
    2016-07-28 10:21:12.749 7561 INFO heat.engine.resource [-] CREATE: HeatWaitCondition "cdh-550-default-secondary-namenode-wc-waiter" Stack "testcb185f1e-cdh-550-default-secondary-namenode-htifx6ojz65i-0-yrd44ot5ddwk" [8eb36d2d-02b6-42ab-8f6f-1ae0baeea159]
    2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource Traceback (most recent call last):
    2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource File "/usr/lib/python2.7/dist-packages/heat/engine/resource.py", line 704, in _action_recorder
    2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource yield
    2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource File "/usr/lib/python2.7/dist-packages/heat/engine/resource.py", line 775, in _do_action
    2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource yield self.action_handler_task(action, args=handler_args)
    2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource File "/usr/lib/python2.7/dist-packages/heat/engine/scheduler.py", line 314, in wrapper
    2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource step = next(subtask)
    2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource File "/usr/lib/python2.7/dist-packages/heat/engine/resource.py", line 749, in action_handler_task
    2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource while not check(handler_data):
    2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource File "/usr/lib/python2.7/dist-packages/heat/engine/resources/openstack/heat/wait_condition.py", line 130, in check_create_complete
    2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource return self._wait(*data)
    2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource File "/usr/lib/python2.7/dist-packages/heat/engine/resources/openstack/heat/wait_condition.py", line 108, in _wait
    2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource raise exc
    2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource WaitConditionTimeout: 0 of 1 received
    2016-07-28 10:21:12.749 7561 ERROR heat.engine.resource
  • Cause 
    ① From the logs, the initial analysis is that the Heat stack fails to create: the create request times out, and the cluster ends up in the Error state. 
    ② During cluster creation, after the instances reach ACTIVE, take the stack ID that stays in CREATE_IN_PROGRESS from the logs and list that stack's resources on the controller node with openstack stack resource list <stack-id> (a sketch of this check follows below). The listing shows that every resource is created except the wc-waiter, which stays in CREATE_IN_PROGRESS. 
    ③ Following the AWS hints referenced above, the guess is that the instances never notify Heat that they have finished booting. 
    ④ SSH into an instance and inspect /var/lib/cloud/data/cfn-userdata: it contains an HTTP request that reports boot completion to heat-api, but the heat-api address uses the host name controller instead of an IP address. The controller's IP address can be pinged from the instance, yet the host name controller cannot be resolved. After replacing the host name with the IP address and issuing that HTTP request by hand, the reply is OK. 
    After logging in to every cluster node and issuing the HTTP request manually, the cluster creation completes successfully. 
    ⑤ Based on these tests, the conclusion is that the heat-api endpoint was created with a host name rather than an IP address. Because the instances cannot resolve that host name, they cannot reach heat-api to report that they have booted, so the Heat stack creation blocks until the wait condition times out and the cluster fails. 
    A subsequent cluster creation test succeeds.
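
    A quick way to reproduce this check; <stack-id> is a placeholder for the stack ID taken from the sahara/heat logs:

    # On the controller: list the stack's resources and look for anything stuck in CREATE_IN_PROGRESS
    openstack stack resource list <stack-id>
    # From inside an affected instance: check whether the heat-api host name resolves
    ping -c 3 controller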



  • Solution 
    ① Method 1: while the cluster is being created, once the instances are ACTIVE, log in to every cluster node and add a record such as 192.168.1.11 controller to its /etc/hosts (see the sketch after this list). This works, but it is tedious and only a temporary workaround. 
    ② Method 2: recreate the heat-api endpoint with an IP address instead of the host name. (Not tested, but it should resolve the problem.) 
    ③ Method 3: configure dnsmasq so that the instances can resolve the host name. (Several configuration attempts did not succeed; whether this is feasible needs further investigation.)
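
    A minimal sketch of method 1, assuming SSH access as the ubuntu user works and 192.168.1.11 is the controller's IP as above; the node IPs are placeholders to be replaced with the cluster nodes' addresses:

    # Append the controller host entry on every cluster node
    for ip in <node-ip-1> <node-ip-2> <node-ip-3>; do
        ssh ubuntu@$ip "echo '192.168.1.11 controller' | sudo tee -a /etc/hosts"
    done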

4.8 Controller node runs out of disk space

  • Problem 
    While creating a Sahara cluster, the instances launch successfully but the cluster status stays at Starting. The backend logs report insufficient disk space.

  • Cause 
    df -h shows disk usage close to 100%. Running du -h --max-depth=1 in the affected directories shows that /var/lib and /var/log use the most space; the system log /var/log/syslog alone takes 20 GB, so log files are the main culprit.

  • Solution 
    ① For the log files, truncate them with cat /dev/null > /var/log/syslog and clear the other large log files the same way. In a test environment the following script can empty them all:

    #!/bin/bash
    for i in `find /var/log -name "*.log"`
    do
        echo $i
        cat /dev/null > $i
    done
    exit 0

    Note: for oversized MongoDB logs, log rotation is the better fix. 
    Reference: https://docs.mongodb.com/manual/tutorial/rotate-log-files/

    ps -ef | grep mongod            # find the mongod process ID
    kill -SIGUSR1 <mongod process id>

    Or connect to MongoDB from the command line and run:

    use admin
    db.runCommand( { logRotate : 1 } )

    ② Under /var/lib, the largest directories are /var/lib/mongodb and /var/lib/glance. /var/lib/glance holds the OpenStack images, which is expected usage; /var/lib/mongodb, however, contains many files named ceilometer.*, which hold the Ceilometer metering data stored in MongoDB. In a test environment this data is not needed and can be deleted. 
    Reference: MongoDB does not release disk space after dropping a collection

    mongo --host controller
    show dbs
    use ceilometer
    db                      // show the current database
    show collections
    db.meter.count()
    db.meter.remove({})     // slow when the collection is large
    db.repairDatabase()     // reclaims disk space

4.9 Sahara cluster fails to start Hadoop services

  • Problem 
    When Sahara creates the cluster, the cluster services fail to start. 
    ① sahara-engine.log shows:

    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [req-ee794aca-7c2c-4f55-a2c9-4e7f4233c4a8 4a2d8a220ac94aa0a2056a50c35a88c5 b6a282f2d53a4c9ebca385ace50042e8 - - -] [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] Error during operating on cluster (reason: Failed to Provision Hadoop Cluster: Failed to start service.
    Error ID: df99b010-46e0-41df-89ac-95490a52fc90)
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] Traceback (most recent call last):
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] File "/usr/lib/python2.7/dist-packages/sahara/service/ops.py", line 192, in wrapper
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] f(cluster_id, *args, **kwds)
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] File "/usr/lib/python2.7/dist-packages/sahara/service/ops.py", line 301, in _provision_cluster
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] plugin.start_cluster(cluster)
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] File "/usr/lib/python2.7/dist-packages/sahara/plugins/cdh/plugin.py", line 51, in start_cluster
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] cluster.hadoop_version).start_cluster(cluster)
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] File "/usr/lib/python2.7/dist-packages/sahara/plugins/cdh/abstractversionhandler.py", line 109, in start_cluster
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] self.deploy.start_cluster(cluster)
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] File "/usr/lib/python2.7/dist-packages/sahara/plugins/cdh/v5_5_0/deploy.py", line 165, in start_cluster
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] CU.first_run(cluster)
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] File "/usr/lib/python2.7/dist-packages/sahara/utils/cluster_progress_ops.py", line 139, in handler
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] add_fail_event(instance, e)
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] self.force_reraise()
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] six.reraise(self.type_, self.value, self.tb)
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] File "/usr/lib/python2.7/dist-packages/sahara/utils/cluster_progress_ops.py", line 136, in handler
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] value = func(*args, **kwargs)
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] File "/usr/lib/python2.7/dist-packages/sahara/plugins/cdh/cloudera_utils.py", line 42, in wrapper
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] raise ex.HadoopProvisionError(c.resultMessage)
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] HadoopProvisionError: Failed to Provision Hadoop Cluster: Failed to start service.
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] Error ID: df99b010-46e0-41df-89ac-95490a52fc90
    2016-08-12 21:28:03.201 7531 ERROR sahara.service.ops [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e]
    2016-08-12 21:28:08.185 7531 INFO sahara.utils.cluster [req-ee794aca-7c2c-4f55-a2c9-4e7f4233c4a8 4a2d8a220ac94aa0a2056a50c35a88c5 b6a282f2d53a4c9ebca385ace50042e8 - - -] [instance: none, cluster: 0f9aa4ef-65e7-4b89-b587-5485fe59fc1e] Cluster status has been changed. New status=Error

    ② The logs under /var/log/hdfs/ on the instances show the error:

    java.io.IOException: File /user/ubuntu/pies could only be replicated to 0 nodes, instead of 1
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1448)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:690)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:342)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1350)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1346)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1344)
        at org.apache.hadoop.ipc.Client.call(Client.java:905)
        at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:198)
        at $Proxy0.addBlock(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at $Proxy0.addBlock(Unknown Source)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:928)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:811)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427)
  • Cause 
    References: 
    http://stackoverflow.com/questions/5293446/hdfs-error-could-only-be-replicated-to-0-nodes-instead-of-1 
    http://stackoverflow.com/questions/36015864/hadoop-be-replicated-to-0-nodes-instead-of-minreplication-1-there-are-1 
    Not yet resolved. The preliminary conclusion is that poor network connectivity between the DataNodes and the NameNode leaves the DataNodes unavailable.

  • Solution

4.10 CDH/Spark services do not start automatically after a cluster host reboot

  • Problem 
    After a Sahara cluster host is rebooted, the CDH and Spark services do not start automatically, so they are unavailable.

  • Cause 
    For the Spark plugin this is documented behavior: Spark is not deployed as a standard Ubuntu service, so it is not restarted when the virtual machines reboot.

    Spark is not deployed as a standard Ubuntu service and if the virtual machines are rebooted, Spark will not be restarted.

    Reference: 
    http://docs.openstack.org/developer/sahara/userdoc/spark_plugin.html

  • Solution

References

  • Sahara (Data Processing) UI User Guide
  • Deploying a Hadoop cluster with Sahara
  • Quickly deploying a Cloudera Hadoop cluster with OpenStack Sahara
  • Spark cluster installation and usage
  • OpenStack Cinder multi-backend
  • Sahara Quickstart Guide
  • Sahara cluster statuses at a glance
  • Sahara Cluster Statuses Overview
  • Data Processing service command-line client
  • Sahara cluster creation/deletion stuck
  • How-to: Get Started with CDH on OpenStack with Sahara
