Some basic usage notes for Hadoop and Spark

The new project uses Python + Hadoop + Spark. Two things worth noting up front:

  • HDFS has no notion of a current working directory (no pwd), so always keep track of full paths.
  • Environment: ipython3, Python 3.5, spark-2.1.1-bin-hadoop2.7


----------
# Import the module and create a connection

In [1]: from hdfs import Client

In [2]: client = Client('http://master:50070')
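Client talks to the NameNode over WebHDFS (port 50070 on Hadoop 2.x). If the cluster checks file ownership, the hdfs package also provides InsecureClient, which sends a user name with every request; a minimal sketch, assuming the same master host and using 'cpda' purely as an example user:

from hdfs import InsecureClient

# Same WebHDFS endpoint, but operations run as the given user
# ('cpda' is just an illustrative name here)
client = InsecureClient('http://master:50070', user='cpda')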


----------
# List the files under a directory
In [6]: client.list("/")
Out[6]: ['ligq', 'test', 'tmp', 'user']
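list can also return each entry's status dict alongside its name via the status flag; a small sketch:

# status=True yields (name, status) tuples instead of bare names
for name, status in client.list('/', status=True):
    print(name, status['type'], status['length'])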
----------
# Inspect a file's status
In [7]: client.status('/tmp')
Out[7]:
{'pathSuffix': '',
 'fileId': 16391,
 'storagePolicy': 0,
 'childrenNum': 0,
 'permission': '644',
 'accessTime': 1495003819798,
 'length': 22628,
 'blockSize': 134217728,
 'type': 'FILE',
 'modificationTime': 1494923718027,
 'group': 'supergroup',
 'owner': 'cpda',
 'replication': 2}
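Beyond metadata, the everyday operations are moving file contents in and out. A minimal sketch using the package's upload, read, and download methods; all paths here are hypothetical:

# Copy a local file up to HDFS ('example.csv' is a placeholder)
client.upload('/tmp/example.csv', 'example.csv')

# Read it back into memory; read() is used as a context manager
with client.read('/tmp/example.csv', encoding='utf-8') as reader:
    content = reader.read()

# Or pull it down to a local path
client.download('/tmp/example.csv', 'example_copy.csv')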
----------
# Get a file's checksum. This is not a plain MD5 of the contents: as the algorithm field below shows, HDFS computes an MD5 of per-block MD5s of 512-byte CRC32C chunk checksums, so it will not match Python's hashlib.md5 over the same file.
In [43]: client.checksum('/admin/iris_noheader.csv')
Out[43]:
{'algorithm': 'MD5-of-0MD5-of-512CRC32C',
 'bytes': '0000020000000000000000005918d3777bd0fc6ef00741feb23fe79d00000000',
 'length': 28}
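To see the mismatch concretely: hashing the raw bytes of the same file with hashlib.md5 gives a different digest from the 'bytes' field above, because HDFS hashes chunk checksums rather than file contents. A quick sketch with a hypothetical local copy of the file:

import hashlib

# Plain MD5 over the raw file bytes; will NOT match client.checksum()
with open('iris_noheader.csv', 'rb') as f:
    print(hashlib.md5(f.read()).hexdigest())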

Spark SQL

In [1]: from pyspark import SparkConf,SparkContext

In [2]: from pyspark.sql import SQLContext

In [3]: conf = SparkConf().setAppName("spar_sql_test")

In [4]: sc = SparkContext(conf = conf)

In [5]: sqlContext=SQLContext(sc)

In [6]: from pyspark.sql import HiveContext

In [7]: hc = HiveContext(sc)
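SQLContext and HiveContext still work on Spark 2.1, but since Spark 2.0 the recommended single entry point is SparkSession, which wraps all of the above. A minimal equivalent sketch:

from pyspark.sql import SparkSession

# One builder replaces SparkConf/SparkContext/SQLContext/HiveContext;
# enableHiveSupport() is only needed when querying Hive tables
spark = (SparkSession.builder
         .appName("spar_sql_test")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SELECT 1 AS x").show()  # runs SQL and prints a DataFrame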

Apache Spark: how to use pyspark with Python 3?


1. Edit the profile: vim ~/.profile

2. Add this line to the file: export PYSPARK_PYTHON=python3

3. Reload it: source ~/.profile

4. Start the shell: ./bin/pyspark
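To confirm the shell really picked up Python 3, check the interpreter from inside pyspark; a quick sketch (sc is the SparkContext the shell predefines):

import sys

print(sys.version)   # should report 3.5.x if PYSPARK_PYTHON took effect
print(sc.pythonVer)  # the Python major.minor version Spark recorded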
