
SQOOP-1337: Doc refactoring - Consolidate documentation of --direct

(Gwen Shapira via Jarek Jarcec Cecho)
Jarek Jarcec Cecho 2014-06-23 08:34:03 -07:00
parent d902d2449f
commit c320b4fe03
5 changed files with 121 additions and 85 deletions


@@ -1,4 +1,3 @@
////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
@@ -127,42 +126,6 @@ Sqoop does not currently support importing from views in direct mode. Use
JDBC-based (non-direct) mode if you need to import a view (simply
omit the +--direct+ parameter).
PostgreSQL
~~~~~~~~~~


@@ -1,4 +1,3 @@
////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
@@ -17,7 +16,7 @@
limitations under the License.
////
[[connectors]]
Notes for specific connectors
-----------------------------
@@ -39,6 +38,80 @@ it will update the appropriate row instead. As a result, Sqoop ignores values specified
in the +\--update-key+ parameter; however, the user needs to specify at least one valid column
to turn on update mode itself.
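As a hedged sketch, update mode is turned on by naming a key column (connect string, table, directory, and column name are illustrative):
----
$ sqoop export --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --export-dir /results/employees --update-key id
----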
MySQL Direct Connector
~~~~~~~~~~~~~~~~~~~~~~
The MySQL Direct Connector allows faster import and export to/from MySQL by using the +mysqldump+ and +mysqlimport+ tools
instead of SQL selects and inserts.
To use the MySQL Direct Connector, specify the +\--direct+ argument for your import or export job.
Example:
----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--direct
----
Passing additional parameters to +mysqldump+:
----
$ sqoop import --connect jdbc:mysql://server.foo.com/db --table bar \
--direct -- --default-character-set=latin1
----
Requirements
^^^^^^^^^^^^
The +mysqldump+ and +mysqlimport+ utilities should be present in the shell path of the user running the Sqoop command on
all nodes. To validate, SSH to each node as this user and execute these commands. If you get an error, so will Sqoop.
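For example, a quick check against one worker node (the hostname is illustrative):
----
$ ssh node1.example.com 'command -v mysqldump mysqlimport'
----
If either utility is not found, install the MySQL client tools on that node or adjust the user's +PATH+.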
Limitations
^^^^^^^^^^^^
* Currently the direct connector does not support import of large object columns (BLOB and CLOB).
* Importing to HBase and Accumulo is not supported
* Use of a staging table when exporting data is not supported
* Import of views is not supported
Direct-mode Transactions
^^^^^^^^^^^^^^^^^^^^^^^^
For performance, each writer will commit the current transaction
approximately every 32 MB of exported data. You can control this
by specifying the following argument _before_ any tool-specific arguments: +-D
sqoop.mysql.export.checkpoint.bytes=size+, where _size_ is a value in
bytes. Set _size_ to 0 to disable intermediate checkpoints,
but individual files being exported will continue to be committed
independently of one another.
Sometimes you need to export a large volume of data with Sqoop to a live MySQL cluster that
is under high load, serving random queries from the users of your application.
While data consistency issues during the export can easily be solved with a
staging table, there is still the problem of the performance impact caused by
the heavy export.
First off, the resources of MySQL dedicated to the import process can affect
the performance of the live product, both on the master and on the slaves.
Second, even if the servers can handle the import with no significant
performance impact (+mysqlimport+ should be relatively "cheap"), importing big
tables can cause serious replication lag in the cluster, risking data
inconsistency.
With +-D sqoop.mysql.export.sleep.ms=time+, where _time_ is a value in
milliseconds, you can let the server relax between checkpoints and the replicas
catch up by pausing the export process after transferring the number of bytes
specified in +sqoop.mysql.export.checkpoint.bytes+. Experiment with different
settings of these two parameters to achieve an export pace that doesn't
endanger the stability of your MySQL cluster.
IMPORTANT: Note that any arguments to Sqoop that are of the form +-D
parameter=value+ are Hadoop _generic arguments_ and must appear before
any tool-specific arguments (for example, +\--connect+, +\--table+, etc.).
Don't forget that these parameters are only supported with the +\--direct+
flag set.
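For example, a sketch of an export tuned with both properties (connect string, table, and directory are illustrative); note that the +-D+ arguments come before everything else and take effect only together with +\--direct+:
----
$ sqoop export -D sqoop.mysql.export.checkpoint.bytes=67108864 \
    -D sqoop.mysql.export.sleep.ms=1000 \
    --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --export-dir /results/employees --direct
----
This commits approximately every 64 MB and pauses for one second after each checkpoint so that the replicas can catch up.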
Microsoft SQL Connector
~~~~~~~~~~~~~~~~~~~~~~~
@@ -60,6 +133,7 @@ Argument Description
Schema support
^^^^^^^^^^^^^^
If you need to work with tables that are located in non-default schemas, you can
specify schema names via the +\--schema+ argument. Custom schemas are supported for
both import and export jobs. For example:
@@ -98,8 +172,31 @@ Argument Description
Default is "public".
---------------------------------------------------------------------------------
Schema support
^^^^^^^^^^^^^^
If you need to work with a table that is located in a schema other than the default one,
you need to specify the extra argument +\--schema+. Custom schemas are supported for
both import and export jobs (an optional staging table, however, must be present in the
same schema as the target table). Example invocation:
----
$ sqoop import ... --table custom_table -- --schema custom_schema
----
PostgreSQL Direct Connector
~~~~~~~~~~~~~~~~~~~~~~~~~~~
The PostgreSQL Direct Connector allows faster import and export to/from PostgreSQL by using the "COPY" command.
To use the PostgreSQL Direct Connector, specify the +\--direct+ argument for your import or export job.
When importing from PostgreSQL in conjunction with direct mode, you
can split the import into separate files after
individual files reach a certain size. This size limit is controlled
with the +\--direct-split-size+ argument.
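For example, a sketch of a direct import that splits the output into files of roughly 512 MB (connect string and table are illustrative):
----
$ sqoop import --connect jdbc:postgresql://db.foo.com/corp --table EMPLOYEES \
    --direct --direct-split-size 536870912
----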
The direct connector also offers additional arguments:
.Additional supported PostgreSQL extra arguments in direct mode:
[grid="all"]
@@ -114,19 +211,19 @@ Argument Description
Default is "FALSE". Default is "FALSE".
--------------------------------------------------------------------------------- ---------------------------------------------------------------------------------
Requirements
^^^^^^^^^^^^
The +psql+ utility should be present in the shell path of the user running the Sqoop command on
all nodes. To validate, SSH to each node as this user and execute this command. If you get an error, so will Sqoop.
Limitations
^^^^^^^^^^^^
* Currently the direct connector does not support import of large object columns (BLOB and CLOB).
* Importing to HBase and Accumulo is not supported
* Import of views is not supported
pg_bulkload connector
~~~~~~~~~~~~~~~~~~~~~


@@ -1,4 +1,3 @@
////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
@@ -93,13 +92,10 @@ additional load may decrease performance. The +\--num-mappers+ or +-m+
arguments control the number of map tasks, which is the degree of
parallelism used.
Some databases provide a direct mode for exports as well. Use the +\--direct+ argument
to specify this codepath. This may be higher-performance than the standard JDBC codepath.
Details about the use of direct mode with each specific RDBMS, installation requirements, and
available options and limitations can be found in <<connectors>>.
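For instance, a minimal sketch of a direct export (connect string, table, and export directory are illustrative):
----
$ sqoop export --connect jdbc:mysql://db.foo.com/corp --table bar \
    --export-dir /results/bar_data --direct
----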
The +\--input-null-string+ and +\--input-null-non-string+ arguments are
optional. If +\--input-null-string+ is not specified, then the string
@@ -127,9 +123,9 @@ If the staging table contains data and the +\--clear-staging-table+ option is
specified, Sqoop will delete all of the data before starting the export job.
NOTE: Support for staging data prior to pushing it into the destination
table is not always available for +--direct+ exports. It is also not available when
export is invoked using the +--update-key+ option for updating existing data,
and when stored procedures are used to insert the data. It is best to check the <<connectors>> section to validate support.
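A hedged sketch of a staged (non-direct) export, assuming a pre-created staging table named bar_stage (all names are illustrative):
----
$ sqoop export --connect jdbc:mysql://db.foo.com/corp --table bar \
    --export-dir /results/bar_data \
    --staging-table bar_stage --clear-staging-table
----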
Inserts vs. Updates


@@ -1,4 +1,3 @@
////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
@@ -49,8 +48,6 @@ Argument Description
+\--as-sequencefile+ Imports data to SequenceFiles
+\--as-textfile+ Imports data as plain text (default)
+\--direct+ Use direct import fast path
+\--inline-lob-limit <n>+ Set the maximum size for an inline LOB
+-m,\--num-mappers <n>+ Use 'n' map tasks to import in parallel
+\--warehouse-dir <dir>+ HDFS parent for table destination


@@ -1,4 +1,3 @@
////
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
@@ -64,9 +63,7 @@ Argument Description
+\--columns <col,col,col...>+ Columns to import from table
+\--delete-target-dir+ Delete the import target directory\
if it exists
+\--direct+ Use direct connector if it exists for the database
+\--fetch-size <n>+ Number of entries to read from database\
at once.
+\--inline-lob-limit <n>+ Set the maximum size for an inline LOB
@@ -231,13 +228,10 @@ data movement tools. For example, MySQL provides the +mysqldump+ tool
which can export data from MySQL to other systems very quickly. By
supplying the +\--direct+ argument, you are specifying that Sqoop
should attempt the direct import channel. This channel may be
higher performance than using JDBC.
Details about the use of direct mode with each specific RDBMS, installation requirements, and
available options and limitations can be found in <<connectors>>.
By default, Sqoop will import a table named +foo+ to a directory named
+foo+ inside your home directory in HDFS. For example, if your
@@ -280,10 +274,6 @@ data to a temporary directory and then rename the files into the normal
target directory in a manner that does not conflict with existing filenames
in that directory.
Controlling transaction isolation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -683,13 +673,6 @@ $ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    -m 8
----
Storing data in SequenceFiles, and setting the generated class name to
+com.foocorp.Employee+: