Importing MongoDB Data into Hive

First we need MongoDB's export tool, mongoexport. Let's start by looking at its usage help.

/usr/local/hive » mongoexport --help                                                                                                   penelope@wjj-PC
Usage:
  mongoexport <options>

Export data from MongoDB in CSV or JSON format.

See http://docs.mongodb.org/manual/reference/program/mongoexport/ for more information.

general options:
      --help                                      print usage
      --version                                   print the tool version and exit

verbosity options:
  -v, --verbose=<level>                           more detailed log output (include multiple times for more verbosity, e.g. -vvvvv, or specify a
                                                  numeric value, e.g. --verbose=N)
      --quiet                                     hide all log output

connection options:
  -h, --host=<hostname>                           mongodb host to connect to (setname/host1,host2 for replica sets)
      --port=<port>                               server port (can also use --host hostname:port)

ssl options:
      --ssl                                       connect to a mongod or mongos that has ssl enabled
      --sslCAFile=<filename>                      the .pem file containing the root certificate chain from the certificate authority
      --sslPEMKeyFile=<filename>                  the .pem file containing the certificate and key
      --sslPEMKeyPassword=<password>              the password to decrypt the sslPEMKeyFile, if necessary
      --sslCRLFile=<filename>                     the .pem file containing the certificate revocation list
      --sslAllowInvalidCertificates               bypass the validation for server certificates
      --sslAllowInvalidHostnames                  bypass the validation for server name
      --sslFIPSMode                               use FIPS mode of the installed openssl library

authentication options:
  -u, --username=<username>                       username for authentication
  -p, --password=<password>                       password for authentication
      --authenticationDatabase=<database-name>    database that holds the user's credentials
      --authenticationMechanism=<mechanism>       authentication mechanism to use

namespace options:
  -d, --db=<database-name>                        database to use
  -c, --collection=<collection-name>              collection to use

uri options:
      --uri=mongodb-uri                           mongodb uri connection string

output options:
  -f, --fields=<field>[,<field>]*                 comma separated list of field names (required for exporting CSV) e.g. -f "name,age"
      --fieldFile=<filename>                      file with field names - 1 per line
      --type=<type>                               the output format, either json or csv (defaults to 'json') (default: json)
  -o, --out=<filename>                            output file; if not specified, stdout is used
      --jsonArray                                 output to a JSON array rather than one object per line
      --pretty                                    output JSON formatted to be human-readable
      --noHeaderLine                              export CSV data without a list of field names at the first line

querying options:
  -q, --query=<json>                              query filter, as a JSON string, e.g., '{x:{$gt:1}}'
      --queryFile=<filename>                      path to a file containing a query filter (JSON)
  -k, --slaveOk                                   allow secondary reads if available (default true) (default: false)
      --readPreference=<string>|<json>            specify either a preference name or a preference json object
      --forceTableScan                            force a table scan (do not use $snapshot)
      --skip=<count>                              number of documents to skip
      --limit=<count>                             limit the number of documents to export
      --sort=<json>                               sort order, as a JSON string, e.g. '{x:1}'
      --assertExists                              if specified, export fails if the collection does not exist (default: false)

What mongoexport does

mongoexport does exactly what the help text says: "Export data from MongoDB in CSV or JSON format", i.e. it dumps data from MongoDB as CSV or JSON.

CSV

CSV (Comma-Separated Values) is a plain-text file format for tabular data (numbers and text). Records are separated by line breaks; each record consists of fields, and the fields are separated by some delimiter character or string, most commonly a comma or a tab.
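For example, a tiny CSV file (made-up values, purely to illustrate the format) looks like this:

name,age,city
Alice,30,Beijing
Bob,25,Shanghai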

Parameter notes

  • -h, --host: the MongoDB host address to connect to.
  • --port: the MongoDB server port; defaults to 27017.
  • -d, --db: the database to export from.
  • -c, --collection: the collection name.
  • --type: the output format; defaults to json.
  • -f, --fields: the fields to export, as a comma-separated list.
  • --fieldFile: when exporting to CSV with a long field list, point this at a file that lists one field per line instead of using --fields.
  • -o, --out: path and file name for the exported file.
  • -u, --username: user name for authentication; omit it if authentication is not enabled.
  • -p, --password: password for authentication; likewise omit it if authentication is not enabled.

Examples

Export to CSV. With only a few fields, list them with --fields:

mongoexport --host 127.0.0.1 --port 27017 --db book --collection 'csbook' --type csv --fields _id,book_name,book_url,book_detail,book_author,price,book_pub --out /home/penelope/Desktop/csbook.csv --limit 5

Export to CSV. With many fields, use --fieldFile instead; the file lists one field per line:

mongoexport --host 127.0.0.1 --port 27017 --db book --collection 'csbook' --type csv --fieldFile /home/penelope/Desktop/csbook.txt  --out /home/penelope/Desktop/csbook.csv --limit 5
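
The field file referenced above (/home/penelope/Desktop/csbook.txt) simply lists one field name per line; presumably it contains the same fields as the first example:

_id
book_name
book_url
book_detail
book_author
price
book_pub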

Use --limit to cap the number of documents exported (both commands above export only 5).
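
--limit combines naturally with --skip and -q. A purely illustrative command, assuming the collection has a numeric price field:

mongoexport --host 127.0.0.1 --port 27017 --db book --collection 'csbook' --type csv --fields _id,book_name,price --query '{"price": {"$gt": 50}}' --skip 10 --limit 100 --out /home/penelope/Desktop/csbook_sample.csv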

Create the Hive table

hive> create table cs_book(
    > id string,
    > book_name string,
    > book_url string,
    > book_detail string,
    > book_author string,
    > price string,
    > book_pub string)
    > row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde' with serdeproperties (
    > "separatorChar"=",",
    > "quotechar"="\""
    > )stored as textfile ;
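
Note that OpenCSVSerde always hands every column back as a string, which is why all columns above are declared string. If a numeric view of price is needed, one option (a sketch; the view name is made up, and non-numeric values would become NULL) is to cast at query time:

hive -e "
create view if not exists book.cs_book_typed as
select id, book_name, book_url, book_detail, book_author,
       cast(price as double) as price, book_pub
from book.cs_book;"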

Import the data with LOAD DATA

hive> load data local inpath '/home/penelope/Desktop/csbook.csv' into table book.cs_book;
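
A quick spot check right after loading (same table as above; pulls back a few rows as a sanity check):

hive -e "select * from book.cs_book limit 3;"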

Query the Hive table

hive> select count(*) from cs_book;
Query ID = penelope_20190130183036_4416c634-b715-41e7-8b2c-d0306ccb96ce
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1548840204909_0001, Tracking URL = http://wjj-PC:8088/proxy/application_1548840204909_0001/
Kill Command = /usr/local/hadoop-2.9.2/bin/mapred job  -kill job_1548840204909_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-01-30 18:32:12,645 Stage-1 map = 0%,  reduce = 0%
2019-01-30 18:32:31,826 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.51 sec
2019-01-30 18:32:40,137 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 8.7 sec
MapReduce Total cumulative CPU time: 8 seconds 700 msec
Ended Job = job_1548840204909_0001
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 8.7 sec   HDFS Read: 4145470 HDFS Write: 104 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 700 msec
OK
5941
Time taken: 126.276 seconds, Fetched: 1 row(s)
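
One caveat: unless mongoexport is run with --noHeaderLine, the first line of the CSV is the field-name header, and LOAD DATA loads it like any other line, so it can show up as an extra row in counts like the one above. A sketch of one way to make Hive skip it (assuming Hive 0.13 or later, which supports this table property):

hive -e "alter table book.cs_book set tblproperties('skip.header.line.count'='1');"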
