
SQOOP-1337: Doc refactoring - Consolidate documentation of --direct

(Gwen Shapira via Jarek Jarcec Cecho)
Jarek Jarcec Cecho 2014-06-23 08:34:03 -07:00
parent d902d2449f
commit c320b4fe03
5 changed files with 121 additions and 85 deletions


@@ -1,4 +1,3 @@
////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
@@ -127,42 +126,6 @@ Sqoop does not currently support importing from views in direct mode. Use
JDBC-based (non-direct) mode if you need to import a view (simply
omit the +--direct+ parameter).
PostgreSQL
~~~~~~~~~~


@@ -1,4 +1,3 @@
////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
@@ -17,7 +16,7 @@
limitations under the License.
////
[[connectors]]
Notes for specific connectors
-----------------------------
@@ -39,6 +38,80 @@ it will update the appropriate row instead. As a result, Sqoop ignores values specified
in the +\--update-key+ parameter; however, the user needs to specify at least one valid column
to turn on update mode itself.
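As a hedged sketch, update mode is turned on by naming a key column (connect string, table, directory, and column name are illustrative):
----
$ sqoop export --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --export-dir /results/employees --update-key id
----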
MySQL Direct Connector
~~~~~~~~~~~~~~~~~~~~~~
The MySQL Direct Connector allows faster import and export to/from MySQL by using the +mysqldump+ and +mysqlimport+ tools
instead of SQL selects and inserts.
To use the MySQL Direct Connector, specify the +\--direct+ argument for your import or export job.
Example:
----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--direct
----
Passing additional parameters to +mysqldump+:
----
$ sqoop import --connect jdbc:mysql://server.foo.com/db --table bar \
--direct -- --default-character-set=latin1
----
Requirements
^^^^^^^^^^^^
The +mysqldump+ and +mysqlimport+ utilities should be present in the shell path of the user running the Sqoop command on
all nodes. To validate, SSH to each node as this user and execute these commands. If you get an error, so will Sqoop.
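For example, a quick check against one worker node (the hostname is illustrative):
----
$ ssh node1.example.com 'command -v mysqldump mysqlimport'
----
If either utility is not found, install the MySQL client tools on that node or adjust the user's +PATH+.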
Limitations
^^^^^^^^^^^^
* Currently the direct connector does not support import of large object columns (BLOB and CLOB).
* Importing to HBase and Accumulo is not supported
* Use of a staging table when exporting data is not supported
* Import of views is not supported
Direct-mode Transactions
^^^^^^^^^^^^^^^^^^^^^^^^
For performance, each writer will commit the current transaction
approximately every 32 MB of exported data. You can control this
by specifying the following argument _before_ any tool-specific arguments: +-D
sqoop.mysql.export.checkpoint.bytes=size+, where _size_ is a value in
bytes. Set _size_ to 0 to disable intermediate checkpoints,
but individual files being exported will continue to be committed
independently of one another.
Sometimes you need to export a large volume of data with Sqoop to a live MySQL cluster that
is under high load, serving random queries from the users of your application.
While data consistency issues during the export can easily be solved with a
staging table, there is still the problem of the performance impact caused by
the heavy export.
First off, the resources of MySQL dedicated to the import process can affect
the performance of the live product, both on the master and on the slaves.
Second, even if the servers can handle the import with no significant
performance impact (+mysqlimport+ should be relatively "cheap"), importing big
tables can cause serious replication lag in the cluster, risking data
inconsistency.
With +-D sqoop.mysql.export.sleep.ms=time+, where _time_ is a value in
milliseconds, you can let the server relax between checkpoints and the replicas
catch up by pausing the export process after transferring the number of bytes
specified in +sqoop.mysql.export.checkpoint.bytes+. Experiment with different
settings of these two parameters to achieve an export pace that doesn't
endanger the stability of your MySQL cluster.
IMPORTANT: Note that any arguments to Sqoop that are of the form +-D
parameter=value+ are Hadoop _generic arguments_ and must appear before
any tool-specific arguments (for example, +\--connect+, +\--table+, etc.).
Don't forget that these parameters are only supported with the +\--direct+
flag set.
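For example, a sketch of an export tuned with both properties (connect string, table, and directory are illustrative); note that the +-D+ arguments come before everything else and take effect only together with +\--direct+:
----
$ sqoop export -D sqoop.mysql.export.checkpoint.bytes=67108864 \
    -D sqoop.mysql.export.sleep.ms=1000 \
    --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --export-dir /results/employees --direct
----
This commits approximately every 64 MB and pauses for one second after each checkpoint so that the replicas can catch up.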
Microsoft SQL Connector
~~~~~~~~~~~~~~~~~~~~~~~
@@ -60,6 +133,7 @@ Argument Description
Schema support
^^^^^^^^^^^^^^
If you need to work with tables that are located in non-default schemas, you can
specify schema names via the +\--schema+ argument. Custom schemas are supported for
both import and export jobs. For example:
@@ -98,8 +172,31 @@ Argument Description
Default is "public".
---------------------------------------------------------------------------------
Schema support
^^^^^^^^^^^^^^
If you need to work with a table that is located in a schema other than the default one,
you need to specify the extra argument +\--schema+. Custom schemas are supported for
both import and export jobs (an optional staging table, however, must be present in the
same schema as the target table). Example invocation:
----
$ sqoop import ... --table custom_table -- --schema custom_schema
----
PostgreSQL Direct Connector
~~~~~~~~~~~~~~~~~~~~~~~~~~~
The PostgreSQL Direct Connector allows faster import and export to/from PostgreSQL by using the "COPY" command.
To use the PostgreSQL Direct Connector, specify the +\--direct+ argument for your import or export job.
When importing from PostgreSQL in conjunction with direct mode, you
can split the import into separate files after
individual files reach a certain size. This size limit is controlled
with the +\--direct-split-size+ argument.
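For example, a sketch of a direct import that splits the output into files of roughly 512 MB (connect string and table are illustrative):
----
$ sqoop import --connect jdbc:postgresql://db.foo.com/corp --table EMPLOYEES \
    --direct --direct-split-size 536870912
----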
The direct connector also offers additional arguments:
.Additional supported PostgreSQL extra arguments in direct mode:
[grid="all"]
@@ -114,19 +211,19 @@ Argument Description
Default is "FALSE". Default is "FALSE".
--------------------------------------------------------------------------------- ---------------------------------------------------------------------------------
Requirements
^^^^^^^^^^^^
The +psql+ utility should be present in the shell path of the user running the Sqoop command on
all nodes. To validate, SSH to each node as this user and execute this command. If you get an error, so will Sqoop.
Limitations
^^^^^^^^^^^^
* Currently the direct connector does not support import of large object columns (BLOB and CLOB).
* Importing to HBase and Accumulo is not supported
* Import of views is not supported
pg_bulkload connector
~~~~~~~~~~~~~~~~~~~~~


@@ -1,4 +1,3 @@
////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
@@ -93,13 +92,10 @@ additional load may decrease performance. The +\--num-mappers+ or +-m+
arguments control the number of map tasks, which is the degree of
parallelism used.
Some databases provide a direct mode for exports as well. Use the +\--direct+ argument
to specify this codepath. This may be higher-performance than the standard JDBC codepath.
Details about the use of direct mode with each specific RDBMS, installation requirements, and
available options and limitations can be found in <<connectors>>.
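For instance, a minimal sketch of a direct export (connect string, table, and export directory are illustrative):
----
$ sqoop export --connect jdbc:mysql://db.foo.com/corp --table bar \
    --export-dir /results/bar_data --direct
----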
The +\--input-null-string+ and +\--input-null-non-string+ arguments are
optional. If +\--input-null-string+ is not specified, then the string
@@ -127,9 +123,9 @@ If the staging table contains data and the +\--clear-staging-table+ option is
specified, Sqoop will delete all of the data before starting the export job.
NOTE: Support for staging data prior to pushing it into the destination
table is not always available for +--direct+ exports. It is also not available when
export is invoked using the +--update-key+ option for updating existing data,
and when stored procedures are used to insert the data. It is best to check the <<connectors>> section to validate support.
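A hedged sketch of a staged (non-direct) export, assuming a pre-created staging table named bar_stage (all names are illustrative):
----
$ sqoop export --connect jdbc:mysql://db.foo.com/corp --table bar \
    --export-dir /results/bar_data \
    --staging-table bar_stage --clear-staging-table
----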
Inserts vs. Updates


@@ -1,4 +1,3 @@
////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
@@ -49,8 +48,6 @@ Argument Description
+\--as-sequencefile+ Imports data to SequenceFiles
+\--as-textfile+ Imports data as plain text (default)
+\--direct+ Use direct import fast path
+\--inline-lob-limit <n>+ Set the maximum size for an inline LOB
+-m,\--num-mappers <n>+ Use 'n' map tasks to import in parallel
+\--warehouse-dir <dir>+ HDFS parent for table destination


@@ -1,4 +1,3 @@
////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
@@ -64,9 +63,7 @@ Argument Description
+\--columns <col,col,col...>+ Columns to import from table
+\--delete-target-dir+ Delete the import target directory\
if it exists
+\--direct+ Use direct connector if it exists for the database
+\--fetch-size <n>+ Number of entries to read from database\
at once.
+\--inline-lob-limit <n>+ Set the maximum size for an inline LOB
@@ -231,13 +228,10 @@ data movement tools. For example, MySQL provides the +mysqldump+ tool
which can export data from MySQL to other systems very quickly. By
supplying the +\--direct+ argument, you are specifying that Sqoop
should attempt the direct import channel. This channel may be
higher performance than using JDBC.
Details about the use of direct mode with each specific RDBMS, installation requirements, and
available options and limitations can be found in <<connectors>>.
By default, Sqoop will import a table named +foo+ to a directory named
+foo+ inside your home directory in HDFS. For example, if your
@@ -280,10 +274,6 @@ data to a temporary directory and then rename the files into the normal
target directory in a manner that does not conflict with existing filenames
in that directory.
Controlling transaction isolation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -683,13 +673,6 @@ $ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    -m 8
----
Storing data in SequenceFiles, and setting the generated class name to
+com.foocorp.Employee+: