SQOOP-1337: Doc refactoring - Consolidate documentation of --direct
(Gwen Shapira via Jarek Jarcec Cecho)
parent d902d2449f
commit c320b4fe03
@@ -1,4 +1,3 @@
////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
@@ -127,42 +126,6 @@ Sqoop currently does not support import from views in direct mode. Use
JDBC-based (non-direct) mode if you need to import a view (simply
omit the +--direct+ parameter).

Direct-mode Transactions
^^^^^^^^^^^^^^^^^^^^^^^^

For performance, each writer will commit the current transaction
approximately every 32 MB of exported data. You can control this
by specifying the following argument _before_ any tool-specific arguments: +-D
sqoop.mysql.export.checkpoint.bytes=size+, where _size_ is a value in
bytes. Set _size_ to 0 to disable intermediate checkpoints,
but individual files being exported will continue to be committed
independently of one another.
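
For instance, a minimal sketch of such an export, committing roughly every 64 MB instead of the default (the connect string, table, and export directory are illustrative placeholders):

----
$ sqoop export -D sqoop.mysql.export.checkpoint.bytes=67108864 \
    --connect jdbc:mysql://db.example.com/corp --table bar \
    --export-dir /results/bar_data --direct
----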

Sometimes you need to export a large volume of data with Sqoop to a live MySQL cluster that
is under high load serving random queries from the users of your application.
While data consistency issues during the export can be easily solved with a
staging table, there is still a problem with the performance impact caused by
the heavy export.

First, the MySQL resources dedicated to the import process can affect
the performance of the live product, both on the master and on the slaves.
Second, even if the servers can handle the import with no significant
performance impact (mysqlimport should be relatively "cheap"), importing big
tables can cause serious replication lag in the cluster, risking data
inconsistency.

With +-D sqoop.mysql.export.sleep.ms=time+, where _time_ is a value in
milliseconds, you can let the server relax between checkpoints and let the replicas
catch up by pausing the export process after transferring the number of bytes
specified in +sqoop.mysql.export.checkpoint.bytes+. Experiment with different
settings of these two parameters to achieve an export pace that doesn't
endanger the stability of your MySQL cluster.

IMPORTANT: Note that any arguments to Sqoop that are of the form +-D
parameter=value+ are Hadoop _generic arguments_ and must appear before
any tool-specific arguments (for example, +\--connect+, +\--table+, etc.).
Don't forget that these parameters are only supported with the +\--direct+
flag set.

PostgreSQL
~~~~~~~~~~
@@ -1,4 +1,3 @@
////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
@@ -17,7 +16,7 @@
limitations under the License.
////

[[connectors]]
Notes for specific connectors
-----------------------------

@@ -39,6 +38,80 @@ it will update the appropriate row instead. As a result, Sqoop ignores the values specified
in the +\--update-key+ parameter; however, the user needs to specify at least one valid column
to turn on update mode itself.
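
For illustration, a hypothetical update-mode export (the connect string, table, and key column are placeholders, not taken from the original text):

----
$ sqoop export --connect jdbc:mysql://db.example.com/corp --table bar \
    --update-key id --export-dir /results/bar_data
----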

MySQL Direct Connector
~~~~~~~~~~~~~~~~~~~~~~

The MySQL Direct Connector allows faster import and export to/from MySQL by using the +mysqldump+ and +mysqlimport+ tools
instead of SQL selects and inserts.

To use the MySQL Direct Connector, specify the +\--direct+ argument for your import or export job.

Example:

----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --direct
----

Passing additional parameters to mysqldump:

----
$ sqoop import --connect jdbc:mysql://server.foo.com/db --table bar \
    --direct -- --default-character-set=latin1
----

Requirements
^^^^^^^^^^^^

The +mysqldump+ and +mysqlimport+ utilities should be present in the shell path of the user running the Sqoop command on
all nodes. To validate, SSH to all nodes as this user and execute these commands. If you get an error, so will Sqoop.
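
For example, a quick check from an edge host (the hostname is a placeholder):

----
$ ssh node1 'which mysqldump mysqlimport'
----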

Limitations
^^^^^^^^^^^

* Currently the direct connector does not support import of large object columns (BLOB and CLOB).
* Importing to HBase and Accumulo is not supported.
* Use of a staging table when exporting data is not supported.
* Import of views is not supported.

Direct-mode Transactions
^^^^^^^^^^^^^^^^^^^^^^^^

For performance, each writer will commit the current transaction
approximately every 32 MB of exported data. You can control this
by specifying the following argument _before_ any tool-specific arguments: +-D
sqoop.mysql.export.checkpoint.bytes=size+, where _size_ is a value in
bytes. Set _size_ to 0 to disable intermediate checkpoints,
but individual files being exported will continue to be committed
independently of one another.

Sometimes you need to export a large volume of data with Sqoop to a live MySQL cluster that
is under high load serving random queries from the users of your application.
While data consistency issues during the export can be easily solved with a
staging table, there is still a problem with the performance impact caused by
the heavy export.

First, the MySQL resources dedicated to the import process can affect
the performance of the live product, both on the master and on the slaves.
Second, even if the servers can handle the import with no significant
performance impact (mysqlimport should be relatively "cheap"), importing big
tables can cause serious replication lag in the cluster, risking data
inconsistency.

With +-D sqoop.mysql.export.sleep.ms=time+, where _time_ is a value in
milliseconds, you can let the server relax between checkpoints and let the replicas
catch up by pausing the export process after transferring the number of bytes
specified in +sqoop.mysql.export.checkpoint.bytes+. Experiment with different
settings of these two parameters to achieve an export pace that doesn't
endanger the stability of your MySQL cluster.
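
As a concrete sketch, a throttled export that pauses for one second after every 16 MB (all values and connection details are illustrative only):

----
$ sqoop export -D sqoop.mysql.export.checkpoint.bytes=16777216 \
    -D sqoop.mysql.export.sleep.ms=1000 \
    --connect jdbc:mysql://db.example.com/corp --table bar \
    --export-dir /results/bar_data --direct
----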

IMPORTANT: Note that any arguments to Sqoop that are of the form +-D
parameter=value+ are Hadoop _generic arguments_ and must appear before
any tool-specific arguments (for example, +\--connect+, +\--table+, etc.).
Don't forget that these parameters are only supported with the +\--direct+
flag set.

Microsoft SQL Connector
~~~~~~~~~~~~~~~~~~~~~~~

@@ -60,6 +133,7 @@ Argument Description

Schema support
^^^^^^^^^^^^^^

If you need to work with tables that are located in non-default schemas, you can
specify schema names via the +\--schema+ argument. Custom schemas are supported for
both import and export jobs. For example:
@@ -98,8 +172,31 @@ Argument Description
Default is "public".
---------------------------------------------------------------------------------

The direct connector (used when the +\--direct+ parameter is specified) also offers
additional arguments:
Schema support
^^^^^^^^^^^^^^

If you need to work with a table that is located in a schema other than the default one,
you need to specify the extra argument +\--schema+. Custom schemas are supported for
both import and export jobs (an optional staging table, however, must be present in the
same schema as the target table). Example invocation:

----
$ sqoop import ... --table custom_table -- --schema custom_schema
----

PostgreSQL Direct Connector
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The PostgreSQL Direct Connector allows faster import and export to/from PostgreSQL using the "COPY" command.

To use the PostgreSQL Direct Connector, specify the +\--direct+ argument for your import or export job.

When importing from PostgreSQL in conjunction with direct mode, you
can split the import into separate files after
individual files reach a certain size. This size limit is controlled
with the +\--direct-split-size+ argument.
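
For instance, a hypothetical direct-mode import that starts a new file every 256 MB (the connect string and table are placeholders):

----
$ sqoop import --connect jdbc:postgresql://db.example.com/corp --table bar \
    --direct --direct-split-size 268435456
----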

The direct connector also offers additional arguments:

.Additional supported PostgreSQL extra arguments in direct mode:
[grid="all"]
@@ -114,19 +211,19 @@ Argument Description
Default is "FALSE".
---------------------------------------------------------------------------------

Schema support
^^^^^^^^^^^^^^
Requirements
^^^^^^^^^^^^

If you need to work with a table that is located in a schema other than the default one,
you need to specify the extra argument +\--schema+. Custom schemas are supported for
both import and export jobs (an optional staging table, however, must be present in the
same schema as the target table). Example invocation:

----
$ sqoop import ... --table custom_table -- --schema custom_schema
----
The +psql+ utility should be present in the shell path of the user running the Sqoop command on
all nodes. To validate, SSH to all nodes as this user and execute this command. If you get an error, so will Sqoop.
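
For example (the hostname is a placeholder):

----
$ ssh node1 'which psql'
----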

Limitations
^^^^^^^^^^^

* Currently the direct connector does not support import of large object columns (BLOB and CLOB).
* Importing to HBase and Accumulo is not supported.
* Import of views is not supported.

pg_bulkload connector
~~~~~~~~~~~~~~~~~~~~~
@@ -1,4 +1,3 @@
////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
@@ -93,13 +92,10 @@ additional load may decrease performance. The +\--num-mappers+ or +-m+
arguments control the number of map tasks, which is the degree of
parallelism used.

MySQL provides a direct mode for exports as well, using the
+mysqlimport+ tool. When exporting to MySQL, use the +\--direct+ argument
to specify this codepath. This may be
higher-performance than the standard JDBC codepath.

NOTE: When using export in direct mode with MySQL, the MySQL bulk utility
+mysqlimport+ must be available in the shell path of the task process.
Some databases provide a direct mode for exports as well. Use the +\--direct+ argument
to specify this codepath. This may be higher-performance than the standard JDBC codepath.
Details about use of direct mode with each specific RDBMS, installation requirements, available
options and limitations can be found in <<connectors>>.
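
As an illustrative sketch, a direct-mode export to PostgreSQL (connection details, table, and paths are placeholders):

----
$ sqoop export --connect jdbc:postgresql://db.example.com/corp --table bar \
    --export-dir /results/bar_data --direct
----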

The +\--input-null-string+ and +\--input-null-non-string+ arguments are
optional. If +\--input-null-string+ is not specified, then the string
@@ -127,9 +123,9 @@ If the staging table contains data and the +\--clear-staging-table+ option is
specified, Sqoop will delete all of the data before starting the export job.

NOTE: Support for staging data prior to pushing it into the destination
table is not available for +--direct+ exports. It is also not available when
table is not always available for +--direct+ exports. It is also not available when
export is invoked using the +--update-key+ option for updating existing data,
and when stored procedures are used to insert the data.
and when stored procedures are used to insert the data. It is best to check the <<connectors>> section to validate.
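
For reference, a hypothetical staged export in JDBC mode (the table and staging-table names are placeholders):

----
$ sqoop export --connect jdbc:mysql://db.example.com/corp --table bar \
    --staging-table bar_stg --clear-staging-table \
    --export-dir /results/bar_data
----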

Inserts vs. Updates

@@ -1,4 +1,3 @@
////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
@@ -49,8 +48,6 @@ Argument Description
+\--as-sequencefile+        Imports data to SequenceFiles
+\--as-textfile+            Imports data as plain text (default)
+\--direct+                 Use direct import fast path
+\--direct-split-size <n>+  Split the input stream every 'n' bytes when\
                            importing in direct mode
+\--inline-lob-limit <n>+   Set the maximum size for an inline LOB
+-m,\--num-mappers <n>+     Use 'n' map tasks to import in parallel
+\--warehouse-dir <dir>+    HDFS parent for table destination
@@ -1,4 +1,3 @@
////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
@@ -64,9 +63,7 @@ Argument Description
+\--columns <col,col,col...>+  Columns to import from table
+\--delete-target-dir+         Delete the import target directory\
                               if it exists
+\--direct+                    Use direct import fast path
+\--direct-split-size <n>+     Split the input stream every 'n' bytes\
                               when importing in direct mode
+\--direct+                    Use direct connector if it exists for the database
+\--fetch-size <n>+            Number of entries to read from database\
                               at once.
+\--inline-lob-limit <n>+      Set the maximum size for an inline LOB
@@ -231,13 +228,10 @@ data movement tools. For example, MySQL provides the +mysqldump+ tool
which can export data from MySQL to other systems very quickly. By
supplying the +\--direct+ argument, you are specifying that Sqoop
should attempt the direct import channel. This channel may be
higher performance than using JDBC. Currently, direct mode does not
support imports of large object columns.
higher performance than using JDBC.

When importing from PostgreSQL in conjunction with direct mode, you
can split the import into separate files after
individual files reach a certain size. This size limit is controlled
with the +\--direct-split-size+ argument.
Details about use of direct mode with each specific RDBMS, installation requirements, available
options and limitations can be found in <<connectors>>.

By default, Sqoop will import a table named +foo+ to a directory named
+foo+ inside your home directory in HDFS. For example, if your
@@ -280,10 +274,6 @@ data to a temporary directory and then rename the files into the normal
target directory in a manner that does not conflict with existing filenames
in that directory.

NOTE: When using the direct mode of import, certain database client utilities
are expected to be present in the shell path of the task process. For MySQL
the utilities +mysqldump+ and +mysqlimport+ are required, whereas for
PostgreSQL the utility +psql+ is required.

Controlling transaction isolation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -683,13 +673,6 @@ $ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    -m 8
----

Enabling the MySQL "direct mode" fast path:

----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --direct
----

Storing data in SequenceFiles, and setting the generated class name to
+com.foocorp.Employee+: