From c320b4fe03e3ca7e16b30f382c03cef7d047d616 Mon Sep 17 00:00:00 2001
From: Jarek Jarcec Cecho
Date: Mon, 23 Jun 2014 08:34:03 -0700
Subject: [PATCH] SQOOP-1337: Doc refactoring - Consolidate documentation of
 --direct (Gwen Shapira via Jarek Jarcec Cecho)
---
 src/docs/user/compatibility.txt     |  37 --------
 src/docs/user/connectors.txt        | 125 ++++++++++++++++++++++++----
 src/docs/user/export.txt            |  16 ++--
 src/docs/user/import-all-tables.txt |   3 -
 src/docs/user/import.txt            |  25 +----
 5 files changed, 121 insertions(+), 85 deletions(-)

diff --git a/src/docs/user/compatibility.txt b/src/docs/user/compatibility.txt
index 37e07b2c..a7344e7f 100644
--- a/src/docs/user/compatibility.txt
+++ b/src/docs/user/compatibility.txt
@@ -1,4 +1,3 @@
-
 ////
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
@@ -127,42 +126,6 @@ Sqoop is currently not supporting import from view in direct mode. Use
 JDBC based (non direct) mode in case that you need to import view (simply
 omit +--direct+ parameter).
 
-Direct-mode Transactions
-^^^^^^^^^^^^^^^^^^^^^^^^
-
-For performance, each writer will commit the current transaction
-approximately every 32 MB of exported data. You can control this
-by specifying the following argument _before_ any tool-specific arguments: +-D
-sqoop.mysql.export.checkpoint.bytes=size+, where _size_ is a value in
-bytes. Set _size_ to 0 to disable intermediate checkpoints,
-but individual files being exported will continue to be committed
-independently of one another.
-
-Sometimes you need to export large data with Sqoop to a live MySQL cluster that
-is under a high load serving random queries from the users of your application.
-While data consistency issues during the export can be easily solved with a
-staging table, there is still a problem with the performance impact caused by
-the heavy export.
-
-First off, the resources of MySQL dedicated to the import process can affect
-the performance of the live product, both on the master and on the slaves.
-Second, even if the servers can handle the import with no significant
-performance impact (mysqlimport should be relatively "cheap"), importing big
-tables can cause serious replication lag in the cluster risking data
-inconsistency.
-
-With +-D sqoop.mysql.export.sleep.ms=time+, where _time_ is a value in
-milliseconds, you can let the server relax between checkpoints and the replicas
-catch up by pausing the export process after transferring the number of bytes
-specified in +sqoop.mysql.export.checkpoint.bytes+. Experiment with different
-settings of these two parameters to archieve an export pace that doesn't
-endanger the stability of your MySQL cluster.
-
-IMPORTANT: Note that any arguments to Sqoop that are of the form +-D
-parameter=value+ are Hadoop _generic arguments_ and must appear before
-any tool-specific arguments (for example, +\--connect+, +\--table+, etc).
-Don't forget that these parameters are only supported with the +\--direct+
-flag set.
 
 PostgreSQL
 ~~~~~~~~~~
diff --git a/src/docs/user/connectors.txt b/src/docs/user/connectors.txt
index cf661120..379cbd99 100644
--- a/src/docs/user/connectors.txt
+++ b/src/docs/user/connectors.txt
@@ -1,4 +1,3 @@
-
 ////
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
@@ -17,7 +16,7 @@ limitations under the License.
 ////
-
+[[connectors]]
 Notes for specific connectors
 -----------------------------
@@ -39,6 +38,80 @@ it will update appropriate row instead. As a result, Sqoop is ignoring values sp
 in parameter +\--update-key+, however user needs to specify at least one valid column
 to turn on update mode itself.
 
+
+MySQL Direct Connector
+~~~~~~~~~~~~~~~~~~~~~~
+
+MySQL Direct Connector allows faster import and export to/from MySQL using
+the +mysqldump+ and +mysqlimport+ tools instead of SQL selects and inserts.
+
+To use the MySQL Direct Connector, specify the +\--direct+ argument for your import or export job.
+
+Example:
+
+----
+$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
+    --direct
+----
+
+Passing additional parameters to +mysqldump+:
+
+----
+$ sqoop import --connect jdbc:mysql://server.foo.com/db --table bar \
+    --direct -- --default-character-set=latin1
+----
+
+Requirements
+^^^^^^^^^^^^
+
+Utilities +mysqldump+ and +mysqlimport+ should be present in the shell path of the user running the Sqoop command on
+all nodes. To validate, SSH to all nodes as this user and execute these utilities. If you get an error, so will Sqoop.
+
+Limitations
+^^^^^^^^^^^
+
+* Currently the direct connector does not support import of large object columns (BLOB and CLOB).
+* Importing to HBase and Accumulo is not supported.
+* Use of a staging table when exporting data is not supported.
+* Import of views is not supported.
+
+Direct-mode Transactions
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+For performance, each writer will commit the current transaction
+approximately every 32 MB of exported data. You can control this
+by specifying the following argument _before_ any tool-specific arguments: +-D
+sqoop.mysql.export.checkpoint.bytes=size+, where _size_ is a value in
+bytes. Set _size_ to 0 to disable intermediate checkpoints,
+but individual files being exported will continue to be committed
+independently of one another.
+
+Sometimes you need to export a large amount of data with Sqoop to a live MySQL cluster that
+is under high load serving random queries from the users of your application.
+While data consistency issues during the export can be easily solved with a
+staging table, there is still a problem with the performance impact caused by
+the heavy export.
+
+First off, the resources of MySQL dedicated to the import process can affect
+the performance of the live product, both on the master and on the slaves.
+Second, even if the servers can handle the import with no significant
+performance impact (+mysqlimport+ should be relatively "cheap"), importing big
+tables can cause serious replication lag in the cluster, risking data
+inconsistency.
+
+With +-D sqoop.mysql.export.sleep.ms=time+, where _time_ is a value in
+milliseconds, you can let the server relax between checkpoints and let the replicas
+catch up by pausing the export process after transferring the number of bytes
+specified in +sqoop.mysql.export.checkpoint.bytes+. Experiment with different
+settings of these two parameters to achieve an export pace that doesn't
+endanger the stability of your MySQL cluster.
+
+IMPORTANT: Note that any arguments to Sqoop that are of the form +-D
+parameter=value+ are Hadoop _generic arguments_ and must appear before
+any tool-specific arguments (for example, +\--connect+, +\--table+, etc.).
+Don't forget that these parameters are only supported with the +\--direct+
+flag set.
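+
+For example, the following export (connection details, paths and values are
+illustrative; tune them for your cluster) commits approximately every 10 MB
+and then sleeps for one second, giving the replicas time to catch up. Note
+that both +-D+ options are placed before any tool-specific arguments:
+
+----
+$ sqoop export -D sqoop.mysql.export.checkpoint.bytes=10485760 \
+    -D sqoop.mysql.export.sleep.ms=1000 \
+    --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
+    --export-dir /results/employees --direct
+----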
+
 Microsoft SQL Connector
 ~~~~~~~~~~~~~~~~~~~~~~~
@@ -60,6 +133,7 @@ Argument                                 Description
 
 Schema support
 ^^^^^^^^^^^^^^
+
 If you need to work with tables that are located in non-default schemas, you can
 specify schema names via the +\--schema+ argument. Custom schemas are supported
 for both import and export jobs. For example:
@@ -98,8 +172,31 @@ Argument                             Description
                                      Default is "public".
 ---------------------------------------------------------------------------------
 
-The direct connector (used when specified +\--direct+ parameter), offers also
-additional extra arguments:
+Schema support
+^^^^^^^^^^^^^^
+
+If you need to work with a table that is located in a schema other than the default one,
+you need to specify the extra argument +\--schema+. Custom schemas are supported for
+both import and export jobs (the optional staging table, however, must be present in the
+same schema as the target table). Example invocation:
+
+----
+$ sqoop import ... --table custom_table -- --schema custom_schema
+----
+
+PostgreSQL Direct Connector
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+PostgreSQL Direct Connector allows faster import and export to/from PostgreSQL using the "COPY" command.
+
+To use the PostgreSQL Direct Connector, specify the +\--direct+ argument for your import or export job.
+
+When importing from PostgreSQL in conjunction with direct mode, you
+can split the import into separate files after
+individual files reach a certain size. This size limit is controlled
+with the +\--direct-split-size+ argument.
+
+The direct connector also offers additional extra arguments:
 
 .Additional supported PostgreSQL extra arguments in direct mode:
 [grid="all"]
 `---------------------------------------------------------------------------------
@@ -114,19 +211,19 @@ Argument                             Description
                                      Default is "FALSE".
 ---------------------------------------------------------------------------------
 
-Schema support
-^^^^^^^^^^^^^^
+Requirements
+^^^^^^^^^^^^
 
-If you need to work with table that is located in schema other than default one,
-you need to specify extra argument +\--schema+. Custom schemas are supported for
-both import and export job (optional staging table however must be present in the
-same schema as target table). Example invocation:
-
-----
-$ sqoop import ... --table custom_table -- --schema custom_schema
-----
+Utility +psql+ should be present in the shell path of the user running the Sqoop command on
+all nodes. To validate, SSH to all nodes as this user and execute the +psql+ command. If you get an error, so will Sqoop.
 
+Limitations
+^^^^^^^^^^^
+
+* Currently the direct connector does not support import of large object columns (BLOB and CLOB).
+* Importing to HBase and Accumulo is not supported.
+* Import of views is not supported.
 
 pg_bulkload connector
 ~~~~~~~~~~~~~~~~~~~~~
diff --git a/src/docs/user/export.txt b/src/docs/user/export.txt
index 8b9e4738..304810a7 100644
--- a/src/docs/user/export.txt
+++ b/src/docs/user/export.txt
@@ -1,4 +1,3 @@
-
 ////
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
@@ -93,13 +92,10 @@ additional load may decrease performance. The +\--num-mappers+ or +-m+
 arguments control the number of map tasks, which is the degree of
 parallelism used.
 
-MySQL provides a direct mode for exports as well, using the
-+mysqlimport+ tool. When exporting to MySQL, use the +\--direct+ argument
-to specify this codepath. This may be
-higher-performance than the standard JDBC codepath.
-
-NOTE: When using export in direct mode with MySQL, the MySQL bulk utility
-+mysqlimport+ must be available in the shell path of the task process.
+Some databases provide a direct mode for exports as well. Use the +\--direct+ argument
+to specify this codepath. This may be higher-performance than the standard JDBC codepath.
+Details about the use of direct mode with each specific RDBMS, installation requirements, available
+options and limitations can be found in <<connectors>>.
 
 The +\--input-null-string+ and +\--input-null-non-string+ arguments
 are optional. If +\--input-null-string+ is not specified, then the string
@@ -127,9 +123,9 @@ If the staging table contains data and the +\--clear-staging-table+ option is
 specified, Sqoop will delete all of the data before starting the export job.
 
 NOTE: Support for staging data prior to pushing it into the destination
-table is not available for +--direct+ exports. It is also not available when
+table is not always available for +--direct+ exports. It is also not available when
 export is invoked using the +--update-key+ option for updating existing data,
-and when stored procedures are used to insert the data.
+and when stored procedures are used to insert the data. It is best to check the <<connectors>> section to verify.
 
 
 Inserts vs. Updates
diff --git a/src/docs/user/import-all-tables.txt b/src/docs/user/import-all-tables.txt
index 8c3a4f51..60645f14 100644
--- a/src/docs/user/import-all-tables.txt
+++ b/src/docs/user/import-all-tables.txt
@@ -1,4 +1,3 @@
-
 ////
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
@@ -49,8 +48,6 @@ Argument                                 Description
 +\--as-sequencefile+                     Imports data to SequenceFiles
 +\--as-textfile+                         Imports data as plain text (default)
 +\--direct+                              Use direct import fast path
-+\--direct-split-size <n>+               Split the input stream every 'n' bytes when\
-                                         importing in direct mode
 +\--inline-lob-limit <n>+                Set the maximum size for an inline LOB
 +-m,\--num-mappers <n>+                  Use 'n' map tasks to import in parallel
 +\--warehouse-dir <dir>+                 HDFS parent for table destination
diff --git a/src/docs/user/import.txt b/src/docs/user/import.txt
index 7a3fa435..192e97e3 100644
--- a/src/docs/user/import.txt
+++ b/src/docs/user/import.txt
@@ -1,4 +1,3 @@
-
 ////
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
@@ -64,9 +63,7 @@ Argument                                 Description
 +\--columns <col,col,col...>+            Columns to import from table
 +\--delete-target-dir+                   Delete the import target directory\
                                          if it exists
-+\--direct+                              Use direct import fast path
-+\--direct-split-size <n>+               Split the input stream every 'n' bytes\
-                                         when importing in direct mode
++\--direct+                              Use direct connector if it exists for the database
 +\--fetch-size <n>+                      Number of entries to read from database\
                                          at once.
 +\--inline-lob-limit <n>+                Set the maximum size for an inline LOB
@@ -231,13 +228,10 @@ data movement tools. For example, MySQL provides the +mysqldump+ tool
 which can export data from MySQL to other systems very quickly. By
 supplying the +\--direct+ argument, you are specifying that Sqoop
 should attempt the direct import channel. This channel may be
-higher performance than using JDBC. Currently, direct mode does not
-support imports of large object columns.
+higher performance than using JDBC.
 
-When importing from PostgreSQL in conjunction with direct mode, you
-can split the import into separate files after
-individual files reach a certain size. This size limit is controlled
-with the +\--direct-split-size+ argument.
+Details about the use of direct mode with each specific RDBMS, installation requirements, available
+options and limitations can be found in <<connectors>>.
 
 By default, Sqoop will import a table named +foo+ to a directory named
 +foo+ inside your home directory in HDFS. For example, if your
@@ -280,10 +274,6 @@ data to a temporary directory and then rename the files into the normal
 target directory in a manner that does not conflict with existing filenames
 in that directory.
 
-NOTE: When using the direct mode of import, certain database client utilities
-are expected to be present in the shell path of the task process. For MySQL
-the utilities +mysqldump+ and +mysqlimport+ are required, whereas for
-PostgreSQL the utility +psql+ is required.
 
 Controlling transaction isolation
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -683,13 +673,6 @@ $ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
     -m 8
 ----
 
-Enabling the MySQL "direct mode" fast path:
-
-----
-$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
-    --direct
-----
-
 Storing data in SequenceFiles, and setting the generated class name to
 +com.foocorp.Employee+:
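 
 ----
 $ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
     --class-name com.foocorp.Employee --as-sequencefile
 ----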