From c320b4fe03e3ca7e16b30f382c03cef7d047d616 Mon Sep 17 00:00:00 2001
From: Jarek Jarcec Cecho
Date: Mon, 23 Jun 2014 08:34:03 -0700
Subject: [PATCH] SQOOP-1337: Doc refactoring - Consolidate documentation of
 --direct (Gwen Shapira via Jarek Jarcec Cecho)
---
 src/docs/user/compatibility.txt     |  37 --------
 src/docs/user/connectors.txt        | 125 ++++++++++++++++++++++++----
 src/docs/user/export.txt            |  16 ++--
 src/docs/user/import-all-tables.txt |   3 -
 src/docs/user/import.txt            |  25 +----
 5 files changed, 121 insertions(+), 85 deletions(-)

diff --git a/src/docs/user/compatibility.txt b/src/docs/user/compatibility.txt
index 37e07b2c..a7344e7f 100644
--- a/src/docs/user/compatibility.txt
+++ b/src/docs/user/compatibility.txt
@@ -1,4 +1,3 @@
-
 ////
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
@@ -127,42 +126,6 @@ Sqoop is currently not supporting import from view in direct mode. Use
 JDBC based (non direct) mode in case that you need to import view (simply
 omit +--direct+ parameter).
 
-Direct-mode Transactions
-^^^^^^^^^^^^^^^^^^^^^^^^
-
-For performance, each writer will commit the current transaction
-approximately every 32 MB of exported data. You can control this
-by specifying the following argument _before_ any tool-specific arguments: +-D
-sqoop.mysql.export.checkpoint.bytes=size+, where _size_ is a value in
-bytes. Set _size_ to 0 to disable intermediate checkpoints,
-but individual files being exported will continue to be committed
-independently of one another.
-
-Sometimes you need to export large data with Sqoop to a live MySQL cluster that
-is under a high load serving random queries from the users of your application.
-While data consistency issues during the export can be easily solved with a
-staging table, there is still a problem with the performance impact caused by
-the heavy export.
-
-First off, the resources of MySQL dedicated to the import process can affect
-the performance of the live product, both on the master and on the slaves.
-Second, even if the servers can handle the import with no significant
-performance impact (mysqlimport should be relatively "cheap"), importing big
-tables can cause serious replication lag in the cluster risking data
-inconsistency.
-
-With +-D sqoop.mysql.export.sleep.ms=time+, where _time_ is a value in
-milliseconds, you can let the server relax between checkpoints and the replicas
-catch up by pausing the export process after transferring the number of bytes
-specified in +sqoop.mysql.export.checkpoint.bytes+. Experiment with different
-settings of these two parameters to archieve an export pace that doesn't
-endanger the stability of your MySQL cluster.
-
-IMPORTANT: Note that any arguments to Sqoop that are of the form +-D
-parameter=value+ are Hadoop _generic arguments_ and must appear before
-any tool-specific arguments (for example, +\--connect+, +\--table+, etc).
-Don't forget that these parameters are only supported with the +\--direct+
-flag set.
 
 PostgreSQL
 ~~~~~~~~~~
diff --git a/src/docs/user/connectors.txt b/src/docs/user/connectors.txt
index cf661120..379cbd99 100644
--- a/src/docs/user/connectors.txt
+++ b/src/docs/user/connectors.txt
@@ -1,4 +1,3 @@
-
 ////
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
@@ -17,7 +16,7 @@ limitations under the License.
 ////
-
+[[connectors]]
 Notes for specific connectors
 -----------------------------
@@ -39,6 +38,80 @@ it will update appropriate row instead. As a result, Sqoop is ignoring values sp
 in parameter +\--update-key+, however user needs to specify at least one valid column
 to turn on update mode itself.
 
+
+MySQL Direct Connector
+~~~~~~~~~~~~~~~~~~~~~~
+
+MySQL Direct Connector allows faster import and export to/from MySQL using
+the +mysqldump+ and +mysqlimport+ tools instead of SQL selects and inserts.
+
+To use the MySQL Direct Connector, specify the +\--direct+ argument for your import or export job.
+
+Example:
+
+----
+$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
+    --direct
+----
+
+Passing additional parameters to +mysqldump+:
+
+----
+$ sqoop import --connect jdbc:mysql://server.foo.com/db --table bar \
+    --direct -- --default-character-set=latin1
+----
+
+Requirements
+^^^^^^^^^^^^
+
+Utilities +mysqldump+ and +mysqlimport+ should be present in the shell path of the user running the Sqoop command on
+all nodes. To validate, SSH to all nodes as this user and execute these utilities. If you get an error, so will Sqoop.
+
+Limitations
+^^^^^^^^^^^
+
+* Currently the direct connector does not support import of large object columns (BLOB and CLOB).
+* Importing to HBase and Accumulo is not supported.
+* Use of a staging table when exporting data is not supported.
+* Import of views is not supported.
+
+Direct-mode Transactions
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+For performance, each writer will commit the current transaction
+approximately every 32 MB of exported data. You can control this
+by specifying the following argument _before_ any tool-specific arguments: +-D
+sqoop.mysql.export.checkpoint.bytes=size+, where _size_ is a value in
+bytes. Set _size_ to 0 to disable intermediate checkpoints,
+but individual files being exported will continue to be committed
+independently of one another.
+
+Sometimes you need to export a large amount of data with Sqoop to a live MySQL cluster that
+is under high load serving random queries from the users of your application.
+While data consistency issues during the export can be easily solved with a
+staging table, there is still a problem with the performance impact caused by
+the heavy export.
+
+First off, the resources of MySQL dedicated to the import process can affect
+the performance of the live product, both on the master and on the slaves.
+Second, even if the servers can handle the import with no significant
+performance impact (+mysqlimport+ should be relatively "cheap"), importing big
+tables can cause serious replication lag in the cluster, risking data
+inconsistency.
+
+With +-D sqoop.mysql.export.sleep.ms=time+, where _time_ is a value in
+milliseconds, you can let the server relax between checkpoints and let the replicas
+catch up by pausing the export process after transferring the number of bytes
+specified in +sqoop.mysql.export.checkpoint.bytes+. Experiment with different
+settings of these two parameters to achieve an export pace that doesn't
+endanger the stability of your MySQL cluster.
+
+IMPORTANT: Note that any arguments to Sqoop that are of the form +-D
+parameter=value+ are Hadoop _generic arguments_ and must appear before
+any tool-specific arguments (for example, +\--connect+, +\--table+, etc.).
+Don't forget that these parameters are only supported with the +\--direct+
+flag set.
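+
+For example, the following export (connection details, paths and values are
+illustrative; tune them for your cluster) commits approximately every 10 MB
+and then sleeps for one second, giving the replicas time to catch up. Note
+that both +-D+ options are placed before any tool-specific arguments:
+
+----
+$ sqoop export -D sqoop.mysql.export.checkpoint.bytes=10485760 \
+    -D sqoop.mysql.export.sleep.ms=1000 \
+    --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
+    --export-dir /results/employees --direct
+----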
+
 Microsoft SQL Connector
 ~~~~~~~~~~~~~~~~~~~~~~~
@@ -60,6 +133,7 @@ Argument                                 Description
 
 Schema support
 ^^^^^^^^^^^^^^
+
 If you need to work with tables that are located in non-default schemas, you can
 specify schema names via the +\--schema+ argument. Custom schemas are supported
 for both import and export jobs. For example:
@@ -98,8 +172,31 @@ Argument                             Description
                                      Default is "public".
 ---------------------------------------------------------------------------------
 
-The direct connector (used when specified +\--direct+ parameter), offers also
-additional extra arguments:
+Schema support
+^^^^^^^^^^^^^^
+
+If you need to work with a table that is located in a schema other than the default one,
+you need to specify the extra argument +\--schema+. Custom schemas are supported for
+both import and export jobs (the optional staging table, however, must be present in the
+same schema as the target table). Example invocation:
+
+----
+$ sqoop import ... --table custom_table -- --schema custom_schema
+----
+
+PostgreSQL Direct Connector
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+PostgreSQL Direct Connector allows faster import and export to/from PostgreSQL using the "COPY" command.
+
+To use the PostgreSQL Direct Connector, specify the +\--direct+ argument for your import or export job.
+
+When importing from PostgreSQL in conjunction with direct mode, you
+can split the import into separate files after
+individual files reach a certain size. This size limit is controlled
+with the +\--direct-split-size+ argument.
+
+The direct connector also offers additional extra arguments:
 
 .Additional supported PostgreSQL extra arguments in direct mode:
 [grid="all"]
 `---------------------------------------------------------------------------------
@@ -114,19 +211,19 @@ Argument                             Description
                                      Default is "FALSE".
 ---------------------------------------------------------------------------------
 
-Schema support
-^^^^^^^^^^^^^^
+Requirements
+^^^^^^^^^^^^
 
-If you need to work with table that is located in schema other than default one,
-you need to specify extra argument +\--schema+. Custom schemas are supported for
-both import and export job (optional staging table however must be present in the
-same schema as target table). Example invocation:
-
-----
-$ sqoop import ... --table custom_table -- --schema custom_schema
-----
+Utility +psql+ should be present in the shell path of the user running the Sqoop command on
+all nodes. To validate, SSH to all nodes as this user and execute the +psql+ command. If you get an error, so will Sqoop.
 
+Limitations
+^^^^^^^^^^^
+
+* Currently the direct connector does not support import of large object columns (BLOB and CLOB).
+* Importing to HBase and Accumulo is not supported.
+* Import of views is not supported.
 
 pg_bulkload connector
 ~~~~~~~~~~~~~~~~~~~~~
diff --git a/src/docs/user/export.txt b/src/docs/user/export.txt
index 8b9e4738..304810a7 100644
--- a/src/docs/user/export.txt
+++ b/src/docs/user/export.txt
@@ -1,4 +1,3 @@
-
 ////
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
@@ -93,13 +92,10 @@ additional load may decrease performance. The +\--num-mappers+ or +-m+
 arguments control the number of map tasks, which is the degree of
 parallelism used.
 
-MySQL provides a direct mode for exports as well, using the
-+mysqlimport+ tool. When exporting to MySQL, use the +\--direct+ argument
-to specify this codepath. This may be
-higher-performance than the standard JDBC codepath.
-
-NOTE: When using export in direct mode with MySQL, the MySQL bulk utility
-+mysqlimport+ must be available in the shell path of the task process.
+Some databases provide a direct mode for exports as well. Use the +\--direct+ argument
+to specify this codepath. This may be higher-performance than the standard JDBC codepath.
+Details about the use of direct mode with each specific RDBMS, installation requirements, available
+options and limitations can be found in <<connectors>>.
 
 The +\--input-null-string+ and +\--input-null-non-string+ arguments
 are optional. If +\--input-null-string+ is not specified, then the string
@@ -127,9 +123,9 @@ If the staging table contains data and the +\--clear-staging-table+ option is
 specified, Sqoop will delete all of the data before starting the export job.
 
 NOTE: Support for staging data prior to pushing it into the destination
-table is not available for +--direct+ exports. It is also not available when
+table is not always available for +--direct+ exports. It is also not available when
 export is invoked using the +--update-key+ option for updating existing data,
-and when stored procedures are used to insert the data.
+and when stored procedures are used to insert the data. It is best to check the <<connectors>> section to verify.
 
 
 Inserts vs. Updates
diff --git a/src/docs/user/import-all-tables.txt b/src/docs/user/import-all-tables.txt
index 8c3a4f51..60645f14 100644
--- a/src/docs/user/import-all-tables.txt
+++ b/src/docs/user/import-all-tables.txt
@@ -1,4 +1,3 @@
-
 ////
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
@@ -49,8 +48,6 @@ Argument                                 Description
 +\--as-sequencefile+                     Imports data to SequenceFiles
 +\--as-textfile+                         Imports data as plain text (default)
 +\--direct+                              Use direct import fast path
-+\--direct-split-size <n>+               Split the input stream every 'n' bytes when\
-                                         importing in direct mode
 +\--inline-lob-limit <n>+                Set the maximum size for an inline LOB
 +-m,\--num-mappers <n>+                  Use 'n' map tasks to import in parallel
 +\--warehouse-dir <dir>+                 HDFS parent for table destination
diff --git a/src/docs/user/import.txt b/src/docs/user/import.txt
index 7a3fa435..192e97e3 100644
--- a/src/docs/user/import.txt
+++ b/src/docs/user/import.txt
@@ -1,4 +1,3 @@
-
 ////
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
@@ -64,9 +63,7 @@ Argument                                 Description
 +\--columns <col,col,col...>+            Columns to import from table
 +\--delete-target-dir+                   Delete the import target directory\
                                          if it exists
-+\--direct+                              Use direct import fast path
-+\--direct-split-size <n>+               Split the input stream every 'n' bytes\
-                                         when importing in direct mode
++\--direct+                              Use direct connector if it exists for the database
 +\--fetch-size <n>+                      Number of entries to read from database\
                                          at once.
 +\--inline-lob-limit <n>+                Set the maximum size for an inline LOB
@@ -231,13 +228,10 @@ data movement tools. For example, MySQL provides the +mysqldump+ tool
 which can export data from MySQL to other systems very quickly. By
 supplying the +\--direct+ argument, you are specifying that Sqoop
 should attempt the direct import channel. This channel may be
-higher performance than using JDBC. Currently, direct mode does not
-support imports of large object columns.
+higher performance than using JDBC.
 
-When importing from PostgreSQL in conjunction with direct mode, you
-can split the import into separate files after
-individual files reach a certain size. This size limit is controlled
-with the +\--direct-split-size+ argument.
+Details about the use of direct mode with each specific RDBMS, installation requirements, available
+options and limitations can be found in <<connectors>>.
 
 By default, Sqoop will import a table named +foo+ to a directory named
 +foo+ inside your home directory in HDFS. For example, if your
@@ -280,10 +274,6 @@ data to a temporary directory and then rename the files into the normal
 target directory in a manner that does not conflict with existing filenames
 in that directory.
 
-NOTE: When using the direct mode of import, certain database client utilities
-are expected to be present in the shell path of the task process. For MySQL
-the utilities +mysqldump+ and +mysqlimport+ are required, whereas for
-PostgreSQL the utility +psql+ is required.
 
 Controlling transaction isolation
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -683,13 +673,6 @@ $ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
     -m 8
 ----
 
-Enabling the MySQL "direct mode" fast path:
-
-----
-$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
-    --direct
-----
-
 Storing data in SequenceFiles, and setting the generated class name to
 +com.foocorp.Employee+:
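 
 ----
 $ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
     --class-name com.foocorp.Employee --as-sequencefile
 ----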