SQOOP-1337: Doc refactoring - Consolidate documentation of --direct
(Gwen Shapira via Jarek Jarcec Cecho)
parent d902d2449f
commit c320b4fe03
@@ -1,4 +1,3 @@
////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
@@ -127,42 +126,6 @@ Sqoop currently does not support import from views in direct mode. Use
JDBC-based (non-direct) mode if you need to import a view (simply
omit the +--direct+ parameter).

Direct-mode Transactions
^^^^^^^^^^^^^^^^^^^^^^^^

For performance, each writer will commit the current transaction
approximately every 32 MB of exported data. You can control this
by specifying the following argument _before_ any tool-specific arguments: +-D
sqoop.mysql.export.checkpoint.bytes=size+, where _size_ is a value in
bytes. Set _size_ to 0 to disable intermediate checkpoints,
but individual files being exported will continue to be committed
independently of one another.
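
For instance, a minimal sketch of such an export, committing roughly every 64 MB instead of the default (the connect string, table, and export directory are illustrative placeholders):

----
$ sqoop export -D sqoop.mysql.export.checkpoint.bytes=67108864 \
    --connect jdbc:mysql://db.example.com/corp --table bar \
    --export-dir /results/bar_data --direct
----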

Sometimes you need to export a large volume of data with Sqoop to a live MySQL cluster that
is under high load serving random queries from the users of your application.
While data consistency issues during the export can be easily solved with a
staging table, there is still a problem with the performance impact caused by
the heavy export.

First, the MySQL resources dedicated to the import process can affect
the performance of the live product, both on the master and on the slaves.
Second, even if the servers can handle the import with no significant
performance impact (mysqlimport should be relatively "cheap"), importing big
tables can cause serious replication lag in the cluster, risking data
inconsistency.

With +-D sqoop.mysql.export.sleep.ms=time+, where _time_ is a value in
milliseconds, you can let the server relax between checkpoints and let the replicas
catch up by pausing the export process after transferring the number of bytes
specified in +sqoop.mysql.export.checkpoint.bytes+. Experiment with different
settings of these two parameters to achieve an export pace that doesn't
endanger the stability of your MySQL cluster.

IMPORTANT: Note that any arguments to Sqoop that are of the form +-D
parameter=value+ are Hadoop _generic arguments_ and must appear before
any tool-specific arguments (for example, +\--connect+, +\--table+, etc.).
Don't forget that these parameters are only supported with the +\--direct+
flag set.

PostgreSQL
~~~~~~~~~~
@@ -1,4 +1,3 @@
////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
@@ -17,7 +16,7 @@
limitations under the License.
////

[[connectors]]
Notes for specific connectors
-----------------------------

@@ -39,6 +38,80 @@ it will update the appropriate row instead. As a result, Sqoop ignores the values specified
in the +\--update-key+ parameter; however, the user needs to specify at least one valid column
to turn on update mode itself.
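
For illustration, a hypothetical update-mode export (the connect string, table, and key column are placeholders, not taken from the original text):

----
$ sqoop export --connect jdbc:mysql://db.example.com/corp --table bar \
    --update-key id --export-dir /results/bar_data
----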

MySQL Direct Connector
~~~~~~~~~~~~~~~~~~~~~~

The MySQL Direct Connector allows faster import and export to/from MySQL by using the +mysqldump+ and +mysqlimport+ tools
instead of SQL selects and inserts.

To use the MySQL Direct Connector, specify the +\--direct+ argument for your import or export job.

Example:

----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --direct
----

Passing additional parameters to mysqldump:

----
$ sqoop import --connect jdbc:mysql://server.foo.com/db --table bar \
    --direct -- --default-character-set=latin1
----

Requirements
^^^^^^^^^^^^

The +mysqldump+ and +mysqlimport+ utilities should be present in the shell path of the user running the Sqoop command on
all nodes. To validate, SSH to all nodes as this user and execute these commands. If you get an error, so will Sqoop.
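
For example, a quick check from an edge host (the hostname is a placeholder):

----
$ ssh node1 'which mysqldump mysqlimport'
----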

Limitations
^^^^^^^^^^^

* Currently the direct connector does not support import of large object columns (BLOB and CLOB).
* Importing to HBase and Accumulo is not supported.
* Use of a staging table when exporting data is not supported.
* Import of views is not supported.

Direct-mode Transactions
^^^^^^^^^^^^^^^^^^^^^^^^

For performance, each writer will commit the current transaction
approximately every 32 MB of exported data. You can control this
by specifying the following argument _before_ any tool-specific arguments: +-D
sqoop.mysql.export.checkpoint.bytes=size+, where _size_ is a value in
bytes. Set _size_ to 0 to disable intermediate checkpoints,
but individual files being exported will continue to be committed
independently of one another.

Sometimes you need to export a large volume of data with Sqoop to a live MySQL cluster that
is under high load serving random queries from the users of your application.
While data consistency issues during the export can be easily solved with a
staging table, there is still a problem with the performance impact caused by
the heavy export.

First, the MySQL resources dedicated to the import process can affect
the performance of the live product, both on the master and on the slaves.
Second, even if the servers can handle the import with no significant
performance impact (mysqlimport should be relatively "cheap"), importing big
tables can cause serious replication lag in the cluster, risking data
inconsistency.

With +-D sqoop.mysql.export.sleep.ms=time+, where _time_ is a value in
milliseconds, you can let the server relax between checkpoints and let the replicas
catch up by pausing the export process after transferring the number of bytes
specified in +sqoop.mysql.export.checkpoint.bytes+. Experiment with different
settings of these two parameters to achieve an export pace that doesn't
endanger the stability of your MySQL cluster.
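
As a concrete sketch, a throttled export that pauses for one second after every 16 MB (all values and connection details are illustrative only):

----
$ sqoop export -D sqoop.mysql.export.checkpoint.bytes=16777216 \
    -D sqoop.mysql.export.sleep.ms=1000 \
    --connect jdbc:mysql://db.example.com/corp --table bar \
    --export-dir /results/bar_data --direct
----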

IMPORTANT: Note that any arguments to Sqoop that are of the form +-D
parameter=value+ are Hadoop _generic arguments_ and must appear before
any tool-specific arguments (for example, +\--connect+, +\--table+, etc.).
Don't forget that these parameters are only supported with the +\--direct+
flag set.

Microsoft SQL Connector
~~~~~~~~~~~~~~~~~~~~~~~

@@ -60,6 +133,7 @@ Argument Description

Schema support
^^^^^^^^^^^^^^

If you need to work with tables that are located in non-default schemas, you can
specify schema names via the +\--schema+ argument. Custom schemas are supported for
both import and export jobs. For example:
@@ -98,8 +172,31 @@ Argument Description
Default is "public".
---------------------------------------------------------------------------------

The direct connector (used when the +\--direct+ parameter is specified) also offers
additional arguments:
Schema support
^^^^^^^^^^^^^^

If you need to work with a table that is located in a schema other than the default one,
you need to specify the extra argument +\--schema+. Custom schemas are supported for
both import and export jobs (an optional staging table, however, must be present in the
same schema as the target table). Example invocation:

----
$ sqoop import ... --table custom_table -- --schema custom_schema
----

PostgreSQL Direct Connector
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The PostgreSQL Direct Connector allows faster import and export to/from PostgreSQL using the "COPY" command.

To use the PostgreSQL Direct Connector, specify the +\--direct+ argument for your import or export job.

When importing from PostgreSQL in conjunction with direct mode, you
can split the import into separate files after
individual files reach a certain size. This size limit is controlled
with the +\--direct-split-size+ argument.
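
For instance, a hypothetical direct-mode import that starts a new file every 256 MB (the connect string and table are placeholders):

----
$ sqoop import --connect jdbc:postgresql://db.example.com/corp --table bar \
    --direct --direct-split-size 268435456
----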

The direct connector also offers additional arguments:

.Additional supported PostgreSQL extra arguments in direct mode:
[grid="all"]
@@ -114,19 +211,19 @@ Argument Description
Default is "FALSE".
---------------------------------------------------------------------------------

Schema support
^^^^^^^^^^^^^^
Requirements
^^^^^^^^^^^^

If you need to work with a table that is located in a schema other than the default one,
you need to specify the extra argument +\--schema+. Custom schemas are supported for
both import and export jobs (an optional staging table, however, must be present in the
same schema as the target table). Example invocation:

----
$ sqoop import ... --table custom_table -- --schema custom_schema
----
The +psql+ utility should be present in the shell path of the user running the Sqoop command on
all nodes. To validate, SSH to all nodes as this user and execute this command. If you get an error, so will Sqoop.
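
For example (the hostname is a placeholder):

----
$ ssh node1 'which psql'
----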

Limitations
^^^^^^^^^^^

* Currently the direct connector does not support import of large object columns (BLOB and CLOB).
* Importing to HBase and Accumulo is not supported.
* Import of views is not supported.

pg_bulkload connector
~~~~~~~~~~~~~~~~~~~~~
@@ -1,4 +1,3 @@
////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
@@ -93,13 +92,10 @@ additional load may decrease performance. The +\--num-mappers+ or +-m+
arguments control the number of map tasks, which is the degree of
parallelism used.

MySQL provides a direct mode for exports as well, using the
+mysqlimport+ tool. When exporting to MySQL, use the +\--direct+ argument
to specify this codepath. This may be
higher-performance than the standard JDBC codepath.

NOTE: When using export in direct mode with MySQL, the MySQL bulk utility
+mysqlimport+ must be available in the shell path of the task process.
Some databases provide a direct mode for exports as well. Use the +\--direct+ argument
to specify this codepath. This may be higher-performance than the standard JDBC codepath.
Details about use of direct mode with each specific RDBMS, installation requirements, available
options and limitations can be found in <<connectors>>.
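
As an illustrative sketch, a direct-mode export to PostgreSQL (connection details, table, and paths are placeholders):

----
$ sqoop export --connect jdbc:postgresql://db.example.com/corp --table bar \
    --export-dir /results/bar_data --direct
----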

The +\--input-null-string+ and +\--input-null-non-string+ arguments are
optional. If +\--input-null-string+ is not specified, then the string
@@ -127,9 +123,9 @@ If the staging table contains data and the +\--clear-staging-table+ option is
specified, Sqoop will delete all of the data before starting the export job.

NOTE: Support for staging data prior to pushing it into the destination
table is not available for +--direct+ exports. It is also not available when
table is not always available for +--direct+ exports. It is also not available when
export is invoked using the +--update-key+ option for updating existing data,
and when stored procedures are used to insert the data.
and when stored procedures are used to insert the data. It is best to check the <<connectors>> section to validate.
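
For reference, a hypothetical staged export in JDBC mode (the table and staging-table names are placeholders):

----
$ sqoop export --connect jdbc:mysql://db.example.com/corp --table bar \
    --staging-table bar_stg --clear-staging-table \
    --export-dir /results/bar_data
----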

Inserts vs. Updates

@@ -1,4 +1,3 @@
////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
@@ -49,8 +48,6 @@ Argument Description
+\--as-sequencefile+        Imports data to SequenceFiles
+\--as-textfile+            Imports data as plain text (default)
+\--direct+                 Use direct import fast path
+\--direct-split-size <n>+  Split the input stream every 'n' bytes when\
                            importing in direct mode
+\--inline-lob-limit <n>+   Set the maximum size for an inline LOB
+-m,\--num-mappers <n>+     Use 'n' map tasks to import in parallel
+\--warehouse-dir <dir>+    HDFS parent for table destination
@@ -1,4 +1,3 @@
////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
@@ -64,9 +63,7 @@ Argument Description
+\--columns <col,col,col...>+  Columns to import from table
+\--delete-target-dir+         Delete the import target directory\
                               if it exists
+\--direct+                    Use direct import fast path
+\--direct-split-size <n>+     Split the input stream every 'n' bytes\
                               when importing in direct mode
+\--direct+                    Use direct connector if it exists for the database
+\--fetch-size <n>+            Number of entries to read from database\
                               at once.
+\--inline-lob-limit <n>+      Set the maximum size for an inline LOB
@@ -231,13 +228,10 @@ data movement tools. For example, MySQL provides the +mysqldump+ tool
which can export data from MySQL to other systems very quickly. By
supplying the +\--direct+ argument, you are specifying that Sqoop
should attempt the direct import channel. This channel may be
higher performance than using JDBC. Currently, direct mode does not
support imports of large object columns.
higher performance than using JDBC.

When importing from PostgreSQL in conjunction with direct mode, you
can split the import into separate files after
individual files reach a certain size. This size limit is controlled
with the +\--direct-split-size+ argument.
Details about use of direct mode with each specific RDBMS, installation requirements, available
options and limitations can be found in <<connectors>>.

By default, Sqoop will import a table named +foo+ to a directory named
+foo+ inside your home directory in HDFS. For example, if your
@@ -280,10 +274,6 @@ data to a temporary directory and then rename the files into the normal
target directory in a manner that does not conflict with existing filenames
in that directory.

NOTE: When using the direct mode of import, certain database client utilities
are expected to be present in the shell path of the task process. For MySQL
the utilities +mysqldump+ and +mysqlimport+ are required, whereas for
PostgreSQL the utility +psql+ is required.

Controlling transaction isolation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -683,13 +673,6 @@ $ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    -m 8
----

Enabling the MySQL "direct mode" fast path:

----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --direct
----

Storing data in SequenceFiles, and setting the generated class name to
+com.foocorp.Employee+: