1/列出mysql数据库中的所有数据库sqoop list-databases -connect jdbc:mysql://localhost:3306/ -username root -password 1234562/连接mysql并 列出test数据库中的表sqoop list-tables -connect jdbc:mysql://localhost:3306/test -username root -password 1234563/将关系型数据的表结构 复制到hive中,只是复制表结构 内容不复制sqoop create-hive-table -connect jdbc:mysql://localhost:3306/test -table sqoop_testTabinMySql -username root -password 123456 -hive-table testNewTabInHive4/从关系数据库导入文件到hive中sqoop import -connect jdbc:mysql://localhost:3306/zxtest -username root -password 123456 -table sqoop_test -hive-import -hive-table s_test -m 15/将hive中的表数据导入到mysql 中,在进行导入前 mysql中的表hive_test 必须提前创建好sqoop export -connect jdbc:mysql://localhost:3306/zxtest -username root -password root -table hive_test -export-dir /user/hive/warehouse/new_test_partition/dt=2012-03-056/从数据库导出表的数据到HDFS上的文件sqoop import -connect jdbc:mysql://localhost:3306/compression -username=hadoop -password=123456 -table HADOOP_USER_INFO -m 1 -target -dir /user/test7/数据库增量导入表数据到hdfs中sqoop import -connect jdbc:mysql://localhost:3306/compression -username=hadoop -password=123456 -table HADOOP_USER_INFO -m 1 -target -dir /user/test -check -column id -incremental append -last-value 3Importsqoop 数据导入具有以下特点:1.支持文本文件(--as-textfile)、avro(--as-avrodatafile)、SequenceFiles(--as-sequencefile)。 RCFILE暂未支持,默认为文本2.支持数据追加,通过--apend指定3.支持table列选取(--column),支持数据选取(--where),和--table一起使用4.支持数据选取,例如读入多表join后的数据'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) ‘,不可以和--table同时使用5.支持map数定制(-m)6.支持压缩(--compress)7.支持将关系数据库中的数据导入到Hive(--hive-import)、HBase(--hbase-table) 数据导入Hive分三步:1)导入数据到HDFS 2)Hive建表 3)使用“LOAD DATA INPAHT”将数据LOAD到表中 数据导入HBase分二部:1)导入数据到HDFS 2)调用HBase put操作逐行将数据写入表*import是将关系数据库迁移到HDFS上 默认目录是/user/${user.name}/${tablename},可以通过--target-dir设置hdfs上的目标目录。export是import的反向过程,将hdfs上的数据导入到关系数据库中 由于sqoop是通过map完成数据的导入,各个map过程是独立的,没有事物的概念,可能会有部分map数据导入失败的情况。为了解决这一问题,sqoop中有一个折中的办法,即是指定中间 staging表,成功后再由中间表导入到结果表。这一功能是通过 --staging-table指定,同时staging表结构也是需要提前创建出来的:sqoop export --connect jdbc:mysql://192.168.81.176/sqoop --username root -password passwd --table sds --export-dir /user/guojian/sds --staging-table sds_tmp需要说明的是,在使用 --direct, --update-key或者--call存储过程的选项时,staging中间表是不可用的。create-hive-table将关系数据库表导入到hive表中参数说明–hive-homeHive的安装目录,可以通过该参数覆盖掉默认的hive目录–hive-overwrite覆盖掉在hive表中已经存在的数据–create-hive-table默认是false,如果目标表已经存在了,那么创建任务会失败–hive-table后面接要创建的hive表–table指定关系数据库表名sqoop create-hive-table --connect jdbc:mysql://192.168.81.176/sqoop --username root -password passwd --table sds --hive-table sds_bak默认sds_bak是在default数据库的。这一步需要依赖HCatalog,需要先安装HCatalog,否则报如下错误:Hive history file=/tmp/guojian/hive_job_log_cfbe2de9-a358-4130-945c-b97c0add649d_1628102887.txtFAILED: ParseException line 1:44 mismatched input ')' expecting Identifier near '(' in column specificationmetastore 配置sqoop job的共享元数据信息,这样多个用户定义和执行sqoop job在这一 metastore中。默认存储在~/.sqoop启动:sqoop metastore关闭:sqoop metastore --shutdownmetastore文件的存储位置是在 conf/sqoop-site.xml中 sqoop.metastore.server.location 配置,指向本地文件。metastore可以通过TCP/IP访问,端口号可以通过 sqoop.metastore.server.port配置,默认是16000。客户端可以通过 指定 sqoop.metastore.client.autoconnect.url或使用 --meta-connect,配置为 jdbc:hsqldb:hsql://:/sqoop,例如 jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop。Sqoop will read entire content of the password file and use it as a password. This will include any trailing white space characters such as new line characters that are added by default by most of the text editors. You need to make sure that your password file contains only characters that belongs to your password. On the command line you can use command echo with switch -n to store password without any trailing white space characters. For example to store password secret you would call echo -n "secret" > password.file.Sqoop automatically supports several databases, including MySQL. Connect strings beginning with jdbc:mysql:// are handled automatically in Sqoop. (A full list of databases with built-in support is provided in the "Supported Databases" section. For some, you may need to install the JDBC driver yourself.)You can use Sqoop with any other JDBC-compliant database. First, download the appropriate JDBC driver for the type of database you want to import, and install the .jar file in the $SQOOP_HOME/lib directory on your client machine. (This will be /usr/lib/sqoop/lib if you installed from an RPM or Debian package.) Each driver .jar file also has a specific driver class which defines the entry-point to the driver. For example, MySQL’s Connector/J library has a driver class of com.mysql.jdbc.Driver. Refer to your database vendor-specific documentation to determine the main driver class. This class must be provided as an argument to Sqoop with --driver.For example, to connect to a SQLServer database, first download the driver from microsoft.com and install it in your Sqoop lib path.Sqoop can also import the result set of an arbitrary SQL query. Instead of using the --table, --columns and --where arguments, you can specify a SQL statement with the --query argument.When importing a free-form query, you must specify a destination directory with --target-dir.NoteIf you are issuing the query wrapped with double quotes ("), you will have to use \$CONDITIONS instead of just $CONDITIONS to disallow your shell from treating it as a shell variable. For example, a double quoted query may look like: "SELECT * FROM x WHERE a='foo' AND \$CONDITIONS"The facility of using free-form query in the current version of Sqoop is limited to simple queries where there are no ambiguous projections and no OR conditions in the WHERE clause. Use of complex queries such as queries that have sub-queries or joins leading to ambiguous projections can lead to unexpected results. 即 where中不能有orSqoop imports data in parallel from most database sources. You can specify the number of map tasks (parallel processes) to use to perform the import by using the -m or --num-mappers argument. Each of these arguments takes an integer value which corresponds to the degree of parallelism to employ. By default, four tasks are used. Some databases may see improved performance by increasing this value to 8 or 16. 默认开启4个taskWhen performing parallel imports, Sqoop needs a criterion by which it can split the workload. Sqoop uses a splitting column to split the workload. By default, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly-sized components of the total range. For example, if you had a table with a primary key column of id whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes which each execute SQL statements of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.Sqoop cannot currently split on multi-column indices. If your table has no index column, or has a multi-column key, then you must also manually choose a splitting column.If a table does not have a primary key defined and the --split-by is not provided, then import will fail unless the number of mappers is explicitly set to one with the --num-mappers 1 option or the --autoreset-to-one-mapper option is used. The option --autoreset-to-one-mapper is typically used with the import-all-tables tool to automatically handle tables without a primary key in a schema.Sqoop will copy the jars in $SQOOP_HOME/lib folder to job cache every time when start a Sqoop job. When launched by Oozie this is unnecessary since Oozie use its own Sqoop share lib which keeps Sqoop dependencies in the distributed cache. Oozie will do the localization on each worker node for the Sqoop dependencies only once during the first Sqoop job and reuse the jars on worker node for subsquencial jobs. Using option --skip-dist-cache in Sqoop command when launched by Oozie will skip the step which Sqoop copies its dependencies to job cache and save massive I/O.MySQL provides the mysqldump tool which can export data from MySQL to other systems very quickly. By supplying the --direct argument, you are specifying that Sqoop should attempt the direct import channel. This channel may be higher performance than using JDBC.By default, Sqoop will import a table named foo to a directory named foo inside your home directory in HDFS. For example, if your username is someuser, then the import tool will write to /user/someuser/foo/(files). You can adjust the parent directory of the import with the --warehouse-dir argument. For example:$ sqoop import --connnect--table foo --warehouse-dir /shared \When using direct mode, you can specify additional arguments which should be passed to the underlying tool. If the argument -- is given on the command-line, then subsequent arguments are sent directly to the underlying tool. For example, the following adjusts the character set used by mysqldump:$ sqoop import --connect jdbc:mysql://server.foo.com/db --table bar \ --direct -- --default-character-set=latin1By default, imports go to a new target location. If the destination directory already exists in HDFS, Sqoop will refuse to import and overwrite that directory’s contents. If you use the --append argument, Sqoop will import data to a temporary directory and then rename the files into the normal target directory in a manner that does not conflict with existing filenames in that directory.Sqoop is preconfigured to map most SQL types to appropriate Java or Hive representatives. However the default mapping might not be suitable for everyone and might be overridden by --map-column-java (for changing mapping to Java) or --map-column-hive (for changing Hive mapping).Sqoop is expecting comma separated list of mapping in form=. For example:
$ sqoop import ... --map-column-java id=String,value=Integer
Sqoop will rise exception in case that some configured mapping will not be used.
You should specify append mode when importing a table where new rows are continually being added with increasing row id values. You specify the column containing the row’s id with --check-column. Sqoop imports rows where the check column has a value greater than the one specified with --last-value.
At the end of an incremental import, the value which should be specified as --last-value for a subsequent import is printed to the screen. When running a subsequent import, you should specify --last-value in this way to ensure you import only the new or updated data.
You can import data in one of two file formats: delimited text or SequenceFiles.
Delimited text is appropriate for most non-binary data types. It also readily supports further manipulation by other tools, such as Hive.
reading from SequenceFiles is higher-performance than reading from text files, as records do not need to be parsed
By default, data is not compressed. You can compress your data by using the deflate (gzip) algorithm with the -z or --compress argument, or specify any Hadoop compression codec using the --compression-codec argument. This applies to SequenceFile, text, and Avro files.
While the choice of delimiters is most important for a text-mode import, it is still relevant if you import to SequenceFiles with --as-sequencefile. The generated class' toString() method will use the delimiters you specify, so subsequent formatting of the output data will rely on the delimiters you choose.
When Sqoop imports data to HDFS, it generates a Java class which can reinterpret the text files that it creates when doing a delimited-format import.
When Sqoop imports data to HDFS, it generates a Java class which can reinterpret the text files that it creates when doing a delimited-format import. The delimiters are chosen with arguments such as --fields-terminated-by; this controls both how the data is written to disk, and how the generated parse() method reinterprets this data. The delimiters used by the parse() method can be chosen independently of the output arguments, by using --input-fields-terminated-by, and so on. This is useful, for example, to generate classes which can parse records created with one set of delimiters, and emit the records to a different set of files using a separate set of delimiters.
2016年9月2日 阅读笔记
Hive can put data into partitions for more efficient query performance. You can tell a Sqoop job to import data for Hive into a particular partition by specifying the --hive-partition-key and --hive-partition-value arguments. The partition value must be a string. Please see the Hive documentation for more details on partitioning.
7.3. Example Invocations
The following examples illustrate how to use the import tool in a variety of situations.
A basic import of a table named EMPLOYEES in the corp database:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES
A basic import requiring a login:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--username SomeUser -P
Enter password: (hidden)
Selecting specific columns from the EMPLOYEES table:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--columns "employee_id,first_name,last_name,job_title"
Controlling the import parallelism (using 8 parallel tasks):
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
-m 8
Storing data in SequenceFiles, and setting the generated class name to com.foocorp.Employee:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--class-name com.foocorp.Employee --as-sequencefile
Specifying the delimiters to use in a text-mode import:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--fields-terminated-by '\t' --lines-terminated-by '\n' \
--optionally-enclosed-by '\"'
Importing the data to Hive:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--hive-import
Importing only new employees:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--where "start_date > '2010-01-01'"
Changing the splitting column from the default:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--split-by dept_id
Verifying that an import was successful:
$ hadoop fs -ls EMPLOYEES
Found 5 items
drwxr-xr-x - someuser somegrp 0 2010-04-27 16:40 /user/someuser/EMPLOYEES/_logs
-rw-r--r-- 1 someuser somegrp 2913511 2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00000
-rw-r--r-- 1 someuser somegrp 1683938 2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00001
-rw-r--r-- 1 someuser somegrp 7245839 2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00002
-rw-r--r-- 1 someuser somegrp 7842523 2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00003
$ hadoop fs -cat EMPLOYEES/part-m-00000 | head -n 10
0,joe,smith,engineering
1,jane,doe,marketing
...
Performing an incremental import of new data, after having already imported the first 100,000 rows of a table:
$ sqoop import --connect jdbc:mysql://db.foo.com/somedb --table sometable \
--where "id > 100000" --target-dir /incremental_dataset --append
An import of a table named EMPLOYEES in the corp database that uses validation to validate the import using the table row count and number of rows copied into HDFS: More Details
$ sqoop import --connect jdbc:mysql://db.foo.com/corp \
--table EMPLOYEES --validate
The import-all-tables tool imports a set of tables from an RDBMS to HDFS. Data from each table is stored in a separate directory in HDFS.
For the import-all-tables tool to be useful, the following conditions must be met:
Each table must have a single-column primary key or --autoreset-to-one-mapper option must be used.
You must intend to import all columns of each table.
You must not intend to use non-default splitting column, nor impose any conditions via a WHERE clause.
These arguments behave in the same manner as they do when used for the sqoop-import tool, but the --table, --split-by, --columns, and --where arguments are invalid for sqoop-import-all-tables. The --exclude-tables argument is for +sqoop-import-all-tables only.
8.3. Example Invocations
Import all tables from the corp database:
$ sqoop import-all-tables --connect jdbc:mysql://db.foo.com/corp
Verifying that it worked:
$ hadoop fs -ls
Found 4 items
drwxr-xr-x - someuser somegrp 0 2010-04-27 17:15 /user/someuser/EMPLOYEES
drwxr-xr-x - someuser somegrp 0 2010-04-27 17:15 /user/someuser/PAYCHECKS
drwxr-xr-x - someuser somegrp 0 2010-04-27 17:15 /user/someuser/DEPARTMENTS
drwxr-xr-x - someuser somegrp 0 2010-04-27 17:15 /user/someuser/OFFICE_SUPPLIES
The export tool exports a set of files from HDFS back to an RDBMS. The target table must already exist in the database. The input files are read and parsed into a set of records according to the user-specified delimiters.
The default operation is to transform these into a set of INSERT statements that inject the records into the database. In "update mode," Sqoop will generate UPDATE statements that replace existing records in the database, and in "call mode" Sqoop will make a stored procedure call for each record.
Although the Hadoop generic arguments must preceed any export arguments, the export arguments can be entered in any order with respect to one another.参数无顺序
Sqoop supports additional import targets beyond HDFS and Hive. Sqoop can also import records into a table in HBase.
By specifying --hbase-table, you instruct Sqoop to import to a table in HBase rather than a directory in HDFS. Sqoop will import data to the table specified as the argument to --hbase-table. Each row of the input table will be transformed into an HBase Put operation to a row of the output table. The key for each row is taken from a column of the input. By default Sqoop will use the split-by column as the row key column. If that is not specified, it will try to identify the primary key column, if any, of the source table. You can manually specify the row key column with --hbase-row-key. Each output column will be placed in the same column family, which must be specified with --column-family.
[Note] Note
This function is incompatible with direct import (parameter --direct).
This function is incompatible with direct import (parameter --direct), and cannot be used in the same operation as an HBase import.
The --export-dir argument and one of --table or --call are required. These specify the table to populate in the database (or the stored procedure to call), and the directory in HDFS that contains the source data.
By default, all columns within a table are selected for export. You can select a subset of columns and control their ordering by using the --columns argument. This should include a comma-delimited list of columns to export. For example: --columns "col1,col2,col3". Note that columns that are not included in the --columns parameter need to have either defined default value or allow NULL values. Otherwise your database will reject the imported data which in turn will make Sqoop job fail. 导出数据时 未选择的列在数据库表中需要为可空类型 否则数据库将拒绝接受数据导入
By default, Sqoop will use four tasks in parallel for the export process. This may not be optimal; you will need to experiment with your own particular setup. Additional tasks may offer better concurrency, but if the database is already bottlenecked on updating indices, invoking triggers, and so on, then additional load may decrease performance. The --num-mappers or -m arguments control the number of map tasks, which is the degree of parallelism used.可以增加map数,但是当数据库遇到性能瓶颈时,增加map反而会降低性能。
If --input-null-string is not specified, then the string "null" will be interpreted as null for string-type columns. If --input-null-non-string is not specified, then both the string "null" and the empty string will be interpreted as null for non-string columns. Note that, the empty string will be always interpreted as null for non-string columns, in addition to other string if specified by --input-null-non-string.导出数据时,由于非空列 的null数据导出至表将会被打断任务
Since Sqoop breaks down export process into multiple transactions, it is possible that a failed export job may result in partial data being committed to the database. This can further lead to subsequent jobs failing due to insert collisions in some cases, or lead to duplicated data in others. You can overcome this problem by specifying a staging table via the --staging-table option which acts as an auxiliary table that is used to stage exported data. The staged data is finally moved to the destination table in a single transaction.导出任务被分为多个部分 多事务 其中一部分错误将导致只有部分数据被导入至表 如果使用--staging-table 将数据保存至临时表内 最终作为单一事务将临时表数据导入至目标表
Support for staging data prior to pushing it into the destination table is not always available for --direct exports. It is also not available when export is invoked using the --update-key option for updating existing data, and when stored procedures are used to insert the data.
By default, sqoop-export appends new rows to a table; each input record is transformed into an INSERT statement that adds a row to the target database table.
By default, sqoop-export appends new rows to a table; each input record is transformed into an INSERT statement that adds a row to the target database table. If your table has constraints (e.g., a primary key column whose values must be unique) and already contains data, you must take care to avoid inserting records that violate these constraints. The export process will fail if an INSERT statement fails. This mode is primarily intended for exporting records to a new, empty table intended to receive these results.
If you specify the --update-key argument, Sqoop will instead modify an existing dataset in the database. Each input record is treated as an UPDATE statement that modifies an existing row. The row a statement modifies is determined by the column name(s) specified with --update-key. For example, consider the following table definition:
Depending on the target database, you may also specify the --update-mode argument with allowinsert mode if you want to update rows if they exist in the database already or insert rows if they do not exist yet.
If an UPDATE statement modifies no rows, this is not considered an error; the export will silently continue. (In effect, this means that an update-based export will not insert new rows into the database.) Likewise, if the column specified with --update-key does not uniquely identify rows and multiple rows are updated by a single statement, this condition is also undetected.
The argument --update-key can also be given a comma separated list of column names. In which case, Sqoop will match all keys from this list before updating any existing record.
Exports are performed by multiple writers in parallel. Each writer uses a separate connection to the database; these have separate transactions from one another. Sqoop uses the multi-row INSERT syntax to insert up to 100 records per statement. Every 100 statements, the current transaction within a writer task is committed, causing a commit every 10,000 rows. This ensures that transaction buffers do not grow without bound, and cause out-of-memory conditions. Therefore, an export is not an atomic process. Partial results from the export will become visible before the export is complete. 一次导出任务并行,每一个写操作都是一个独立的事务。注意事务缓存的增长不能超过警戒线,以及内存溢出的清空。因此,一次导出是一件原子事务。可能在数据尚未完全导入之前就可以在表中看见。
Exports may fail for a number of reasons: 导出任务可能失败的原因
Loss of connectivity from the Hadoop cluster to the database (either due to hardware fault, or server software crashes) 失去连接
Attempting to INSERT a row which violates a consistency constraint (for example, inserting a duplicate primary key value) 违反一致性约束插入
Attempting to parse an incomplete or malformed record from the HDFS source data 解析一个不完整的或残缺的记录
Attempting to parse records using incorrect delimiters 使用不正确的分割符解析记录
Capacity issues (such as insufficient RAM or disk space) 容量问题 内存不足或磁盘空间不足
If an export map task fails due to these or other reasons, it will cause the export job to fail. The results of a failed export are undefined. Each export map task operates in a separate transaction. Furthermore, individual map tasks commit their current transaction periodically. If a task fails, the current transaction will be rolled back. Any previously-committed transactions will remain durable in the database, leading to a partially-complete export.
每一个export map任务 都在独立的事务中进行作业,相互独立的提交他们各自的任务。如果任务失败,当前事务会回滚而先前已完成的事务将继续保留在数据库内,就会造成部分完成的导出任务。
A basic export to populate a table named bar:
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
--export-dir /results/bar_data
This example takes the files in /results/bar_data and injects their contents in to the bar table in the foo database on db.example.com. The target table must already exist in the database. Sqoop performs a set of INSERT INTO operations, without regard for existing content. If Sqoop attempts to insert rows which violate constraints in the database (for example, a particular primary key value already exists), then the export fails.
插入操作 保证表已存在 违反一致性约束 将失败
Alternatively, you can specify the columns to be exported by providing --columns "col1,col2,col3". Please note that columns that are not included in the --columns parameter need to have either defined default value or allow NULL values. Otherwise your database will reject the imported data which in turn will make Sqoop job fail. 那些未被选择导出数据的列 在数据库中要么具有默认值 要么允许null值 否则将会失败
Another basic export to populate a table named bar with validation enabled: More Details
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
--export-dir /results/bar_data --validate
An export that calls a stored procedure named barproc for every record in /results/bar_data would look like: 导出数据时调用存储过程
$ sqoop export --connect jdbc:mysql://db.example.com/foo --call barproc \
--export-dir /results/bar_data
Validation
Validate the data copied, either import or export by comparing the row counts from the source and the target post copy. 比较源表和目标表的行数 验证操作效果
Imports and exports can be repeatedly performed by issuing the same command multiple times. Especially when using the incremental import capability, this is an expected scenario.
Sqoop allows you to define saved jobs which make this process easier. A saved job records the configuration information required to execute a Sqoop command at a later time. The section on the sqoop-job tool describes how to create and work with saved jobs.
By default, job descriptions are saved to a private repository stored in $HOME/.sqoop/. You can configure Sqoop to instead use a shared metastore, which makes saved jobs available to multiple users across a shared cluster. Starting the metastore is covered by the section on the sqoop-metastore tool.
使用sqoop job 定义多次import export
The job tool allows you to create and work with saved jobs. Saved jobs remember the parameters used to specify a job, so they can be re-executed by invoking the job by its handle.
If a saved job is configured to perform an incremental import, state regarding the most recently imported rows is updated in the saved job to allow the job to continually import only the newest rows.
job 重复执行 增量导入时 允许新增数据持续性加入当前文件
Creating saved jobs is done with the --create action. This operation requires a -- followed by a tool name and its arguments. The tool and its arguments will form the basis of the saved job. Consider:
$ sqoop job --create myjob -- import --connect jdbc:mysql://example.com/db \
--table mytable
This creates a job named myjob which can be executed later. The job is not run. This job is now available in the list of saved jobs:
$ sqoop job --list
Available jobs:
myjob
We can inspect the configuration of a job with the show action:
$ sqoop job --show myjob
Job: myjob
Tool: import
Options:
----------------------------
direct.import = false
codegen.input.delimiters.record = 0
hdfs.append.dir = false
db.table = mytable
...
And if we are satisfied with it, we can run the job with exec:
$ sqoop job --exec myjob
10/08/19 13:08:45 INFO tool.CodeGenTool: Beginning code generation
...
The exec action allows you to override arguments of the saved job by supplying them after a --. For example, if the database were changed to require a username, we could specify the username and password with:
$ sqoop job --exec myjob -- --username someuser -P
Enter password:
...
Incremental imports are performed by comparing the values in a check column against a reference value for the most recent import. For example, if the --incremental append argument was specified, along with --check-column id and --last-value 100, all rows with id > 100 will be imported. If an incremental import is run from the command line, the value which should be specified as --last-value in a subsequent incremental import will be printed to the screen for your reference. If an incremental import is run from a saved job, this value will be retained in the saved job. Subsequent runs of sqoop job --exec someIncrementalJob will continue to import only newer rows than those previously imported.
增量导入 --incremental append --check-column id --last-value 100
The metastore tool configures Sqoop to host a shared metadata repository. Multiple users and/or remote users can define and execute saved jobs (created with sqoop job) defined in this metastore.
Clients must be configured to connect to the metastore in sqoop-site.xml or with the --meta-connect argument.
The metastore is available over TCP/IP. The port is controlled by the sqoop.metastore.server.port configuration parameter, and defaults to 16000.
The merge tool runs a MapReduce job that takes two directories as input: a newer dataset, and an older one. These are specified with --new-data and --onto respectively. The output of the MapReduce job will be placed in the directory in HDFS specified by --target-dir. 合并数据集时指定新旧数据集且指定目标目录
When merging the datasets, it is assumed that there is a unique primary key value in each record. The column for the primary key is specified with --merge-key. Multiple rows in the same dataset should not have the same primary key, or else data loss may occur. 合并时保证一致性约束 保证主键的唯一性
The codegen tool generates Java classes which encapsulate and interpret imported records. The Java definition of a record is instantiated as part of the import process, but can also be performed separately. For example, if Java source is lost, it can be recreated. New versions of a class can be created which use different delimiters between fields, and so on. 自动生成java类文件
Recreate the record interpretation code for the employees table of a corporate database:
$ sqoop codegen --connect jdbc:mysql://db.example.com/corp \
--table employees
The create-hive-table tool populates a Hive metastore with a definition for a table based on a database table previously imported to HDFS, or one planned to be imported. This effectively performs the "--hive-import" step of sqoop-import without running the preceeding import.
If data was already loaded to HDFS, you can use this tool to finish the pipeline of importing the data to Hive. You can also create Hive tables with this tool; data then can be imported and populated into the target after a preprocessing step run by the user.
Define in Hive a table named emps with a definition based on a database table named employees:
$ sqoop create-hive-table --connect jdbc:mysql://db.example.com/corp \
--table employees --hive-table emps
The eval tool allows users to quickly run simple SQL queries against a database; results are printed to the console. This allows users to preview their import queries to ensure they import the data they expect.
The eval tool is provided for evaluation purpose only. You can use it to verify database connection from within the Sqoop or to test simple queries. It’s not suppose to be used in production workflows.
使用eval快速简单评估sql 查询 仅仅用于评估目的 验证数据库链接等目的 在生产工作流中并不支持
Select ten records from the employees table:
$ sqoop eval --connect jdbc:mysql://db.example.com/corp \
--query "SELECT * FROM employees LIMIT 10"
Insert a row into the foo table:
$ sqoop eval --connect jdbc:mysql://db.example.com/corp \
-e "INSERT INTO foo VALUES(42, 'bar')"
List database schemas available on a MySQL server:
$ sqoop list-databases --connect jdbc:mysql://database.example.com/
information_schema
employees
This only works with HSQLDB, MySQL and Oracle. When using with Oracle, it is necessary that the user connecting to the database has DBA privileges.
Netezza connector supports an optimized data transfer facility using the Netezza external tables feature. Each map tasks of Netezza connector’s import job will work on a subset of the Netezza partitions and transparently create and use an external table to transport data. Similarly, export jobs will use the external table to push data fast onto the NZ system. Direct mode does not support staging tables, upsert options etc.
Here is an example of complete command line for import using the Netezza external table feature.
$ sqoop import \
--direct \
--connect jdbc:netezza://nzhost:5480/sqoop \
--table nztable \
--username nzuser \
--password nzpass \
--target-dir hdfsdir
Here is an example of complete command line for export with tab as the field terminator character.
$ sqoop export \
--direct \
--connect jdbc:netezza://nzhost:5480/sqoop \
--table nztable \
--username nzuser \
--password nzpass \
--export-dir hdfsdir \
--input-fields-terminated-by "\t"
Netezza direct connector supports the null-string features of Sqoop. The null string values are converted to appropriate external table options during export and import operations.