First we need MongoDB's export tool, mongoexport. Let's start by looking at its usage help.
/usr/local/hive » mongoexport --help
Usage:
  mongoexport <options>

Export data from MongoDB in CSV or JSON format.

See http://docs.mongodb.org/manual/reference/program/mongoexport/ for more information.

general options:
      --help                                      print usage
      --version                                   print the tool version and exit

verbosity options:
  -v, --verbose=<level>                           more detailed log output (include multiple times for more verbosity, e.g. -vvvvv, or specify a
                                                  numeric value, e.g. --verbose=N)
      --quiet                                     hide all log output

connection options:
  -h, --host=<hostname>                           mongodb host to connect to (setname/host1,host2 for replica sets)
      --port=<port>                               server port (can also use --host hostname:port)

ssl options:
      --ssl                                       connect to a mongod or mongos that has ssl enabled
      --sslCAFile=<filename>                      the .pem file containing the root certificate chain from the certificate authority
      --sslPEMKeyFile=<filename>                  the .pem file containing the certificate and key
      --sslPEMKeyPassword=<password>              the password to decrypt the sslPEMKeyFile, if necessary
      --sslCRLFile=<filename>                     the .pem file containing the certificate revocation list
      --sslAllowInvalidCertificates               bypass the validation for server certificates
      --sslAllowInvalidHostnames                  bypass the validation for server name
      --sslFIPSMode                               use FIPS mode of the installed openssl library

authentication options:
  -u, --username=<username>                       username for authentication
  -p, --password=<password>                       password for authentication
      --authenticationDatabase=<database-name>    database that holds the user's credentials
      --authenticationMechanism=<mechanism>       authentication mechanism to use

namespace options:
  -d, --db=<database-name>                        database to use
  -c, --collection=<collection-name>              collection to use

uri options:
      --uri=mongodb-uri                           mongodb uri connection string

output options:
  -f, --fields=<field>[,<field>]*                 comma separated list of field names (required for exporting CSV) e.g. -f "name,age"
      --fieldFile=<filename>                      file with field names - 1 per line
      --type=<type>                               the output format, either json or csv (defaults to 'json') (default: json)
  -o, --out=<filename>                            output file; if not specified, stdout is used
      --jsonArray                                 output to a JSON array rather than one object per line
      --pretty                                    output JSON formatted to be human-readable
      --noHeaderLine                              export CSV data without a list of field names at the first line

querying options:
  -q, --query=<json>                              query filter, as a JSON string, e.g., '{x:{$gt:1}}'
      --queryFile=<filename>                      path to a file containing a query filter (JSON)
  -k, --slaveOk                                   allow secondary reads if available (default true) (default: false)
      --readPreference=<string>|<json>            specify either a preference name or a preference json object
      --forceTableScan                            force a table scan (do not use $snapshot)
      --skip=<count>                              number of documents to skip
      --limit=<count>                             limit the number of documents to export
      --sort=<json>                               sort order, as a JSON string, e.g. '{x:1}'
      --assertExists                              if specified, export fails if the collection does not exist (default: false)
As the help says, the job of mongoexport is to "Export data from MongoDB in CSV or JSON format", i.e. to dump the data stored in MongoDB as CSV or JSON.
CSV (Comma-Separated Values) is a plain-text format for storing tabular data (numbers and text). Records are separated by line breaks, and each record consists of fields separated by some delimiter character or string, most commonly a comma or a tab.
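As a purely made-up illustration (none of these rows come from the actual collection), a small CSV file with a header line followed by one record per line looks like this:

book_name,book_author,price
"Book A","Author A",10.00
"Book B","Author B",25.50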
The options we care about here:

-h, --host=        the MongoDB host address to connect to.
--port=            the MongoDB server port; the default is 27017.
-d, --db=          the database to export from.
-c, --collection=  the collection name.
--type=            the output format; the default is json.
-f, --fields=      the fields to export, comma separated.
--fieldFile=       when exporting to CSV, if the field list gets long, point this at a file that lists one field per line instead.
-o, --out=         the path and file name of the output file.
-u, --username=    the username; not needed if authentication is not set up.
-p, --password=    the password; likewise not needed if it is not set.

Export to CSV with only a few fields, listing them directly with --fields:
mongoexport --host 127.0.0.1 --port 27017 --db book --collection 'csbook' --type csv --fields _id,book_name,book_url,book_detail,book_author,price,book_pub --out /home/penelope/Desktop/csbook.csv --limit 5
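If only a subset of documents is needed, the -q/--query option from the help output can be added to the same command. The filter and output path below are only a hypothetical illustration (it assumes price is stored as a number in the documents):

mongoexport --host 127.0.0.1 --port 27017 --db book --collection 'csbook' --type csv --fields _id,book_name,price --query '{"price": {"$gt": 50}}' --out /home/penelope/Desktop/expensive_books.csv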
When exporting to CSV with many fields, use --fieldFile instead; the file lists one field per line:
mongoexport --host 127.0.0.1 --port 27017 --db book --collection 'csbook' --type csv --fieldFile /home/penelope/Desktop/csbook.txt --out /home/penelope/Desktop/csbook.csv --limit 5
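The field file referenced above is just a plain text file with one field name per line; for this collection, /home/penelope/Desktop/csbook.txt would contain:

_id
book_name
book_url
book_detail
book_author
price
book_pub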
--limit caps the number of documents exported; the two commands above use --limit 5 only to keep the demonstration small.
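The Hive count further down returns 5941 rows, so the file that was actually loaded was presumably exported without --limit; dropping the flag exports the whole collection:

mongoexport --host 127.0.0.1 --port 27017 --db book --collection 'csbook' --type csv --fieldFile /home/penelope/Desktop/csbook.txt --out /home/penelope/Desktop/csbook.csv

With the CSV file in place, create a Hive table backed by OpenCSVSerde and load the file into it: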
hive> create table cs_book(
> id string,
> book_name string,
> book_url string,
> book_detail string,
> book_author string,
> price string,
> book_pub string)
> row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde' with serdeproperties (
> "separatorChar"=",",
> "quotechar"="\""
> )stored as textfile ;
hive> load data local inpath '/home/penelope/Desktop/csbook.csv' into table book.cs_book;
hive> select count(*) from cs_book;
Query ID = penelope_20190130183036_4416c634-b715-41e7-8b2c-d0306ccb96ce
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1548840204909_0001, Tracking URL = http://wjj-PC:8088/proxy/application_1548840204909_0001/
Kill Command = /usr/local/hadoop-2.9.2/bin/mapred job -kill job_1548840204909_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-01-30 18:32:12,645 Stage-1 map = 0%, reduce = 0%
2019-01-30 18:32:31,826 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.51 sec
2019-01-30 18:32:40,137 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 8.7 sec
MapReduce Total cumulative CPU time: 8 seconds 700 msec
Ended Job = job_1548840204909_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 8.7 sec HDFS Read: 4145470 HDFS Write: 104 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 700 msec
OK
5941
Time taken: 126.276 seconds, Fetched: 1 row(s)
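One caveat worth remembering: OpenCSVSerde exposes every column as STRING (which is why all the columns above are declared string), so numeric fields such as price need an explicit cast before you can do arithmetic or sorting on them. A quick sketch, assuming price holds a plain numeric value:

hive> select book_name, cast(price as double) from cs_book limit 10;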