SQOOP-42. Document saved jobs, metastore, and incremental imports.

Added manual pages and user guide sections for sqoop-job and sqoop-metastore.
Updated sqoop-import documentation to describe incremental imports.

From: Aaron Kimball <aaron@cloudera.com>

git-svn-id: https://svn.apache.org/repos/asf/incubator/sqoop/trunk@1149968 13f79535-47bb-0310-9956-ffa450edef68
2025-05-03 22:20:52 +08:00 · 2011-07-22 20:04:14 +00:00 · 2011-07-22 20:04:14 +00:00 · d656663a14
commit d656663a14
parent 36f93eac1d
8 changed files with 508 additions and 0 deletions
--- a/src/docs/man/sqoop-import.txt
+++ b/src/docs/man/sqoop-import.txt
@ -86,6 +86,32 @@ include::hbase-args.txt[]

 include::codegen-args.txt[]

+
+Incremental import options
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Sqoop can be configured to import only "new" data from the source database
+using the arguments in this section. While any import job can make use of
+these arguments, they are most powerful when used to initialize a recurring
+job with +sqoop job --create ...+. After executing a saved job, the last
+observed value of the check column is updated in the saved job.
+
+--incremental (mode)::
+  Specifies that this is an incremental import. Determines how Sqoop should
+  discover new data. "mode" may be +append+, in which case new rows are
+  expected to be added with increasing id values, or +lastmodified+, in which
+  case new data is discovered by comparing a timestamp column with the
+  timestamp at which the last import was performed.
+
+--check-column (col)::
+  Specifies a column whose value should be compared to the last imported
+  id or the last import timestamp to determine rows to import.
+
+--last-value (value)::
+  Specifies the most recent id imported, or the timestamp of the most recent
+  import. This argument is unnecessary for an initial import.
+
+
 Database-specific options
 ~~~~~~~~~~~~~~~~~~~~~~~~~

--- a/src/docs/man/sqoop-job.txt
+++ b/src/docs/man/sqoop-job.txt
@ -0,0 +1,94 @@
+sqoop-job(1)
+============
+
+NAME
+----
+sqoop-job - Define, execute, and manipulate saved Sqoop jobs.
+
+SYNOPSIS
+--------
+'sqoop-job' <generic-options> <tool-options> [-- [<subtool-name>]
+<subtool-options>]
+
+'sqoop job' <generic-options> <tool-options> [-- [<subtool-name>]
+<subtool-options>]
+
+
+DESCRIPTION
+-----------
+
+include::../user/job-purpose.txt[]
+
+OPTIONS
+-------
+
+One of the job operation arguments is required. 
+
+Job management options
+~~~~~~~~~~~~~~~~~~~~~~
+
+--create (job-id)::
+  Define a new saved job with the specified job-id (name). A
+  second Sqoop command-line, separated by a +--+, should be specified;
+  this defines the saved job.
+
+--delete (job-id)::
+  Delete a saved job. (This deletes the job definition, but does not
+  remove any data from HDFS.)
+
+--exec (job-id)::
+  Given a job id defined with +--create+, run the saved job. Any arguments
+  following a +--+ are applied on top of the saved job, overriding the saved
+  parameters.
+
+--show (job-id)::
+  Show the parameters for a saved job.
+
+--list::
+  List all saved jobs
+
+
+Metastore connection options
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+By default, Sqoop will use a private, local embedded database to store saved
+jobs. An alternate metastore can be configured in +conf/sqoop-site.xml+. You
+can also specify the metastore connect string here:
+
+--meta-connect (jdbc-uri)::
+  Specifies the JDBC connect string used to connect to the metastore
+
+
+Common options
+~~~~~~~~~~~~~~
+
+--help::
+  Print usage instructions
+
+--verbose::
+  Print more information while working
+
+
+ENVIRONMENT
+-----------
+
+See 'sqoop(1)'
+
+
+////
+   Licensed to Cloudera, Inc. under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   Cloudera, Inc. licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+////
+
--- a/src/docs/man/sqoop-metastore.txt
+++ b/src/docs/man/sqoop-metastore.txt
@ -0,0 +1,51 @@
+sqoop-metastore(1)
+==================
+
+NAME
+----
+sqoop-metastore - Host a shared repository for saved Sqoop jobs.
+
+SYNOPSIS
+--------
+'sqoop-metastore' <generic-options> <tool-options>
+
+'sqoop metastore' <generic-options> <tool-options>
+
+
+DESCRIPTION
+-----------
+
+include::../user/metastore-purpose.txt[]
+
+OPTIONS
+-------
+
+Metastore management options
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+--shutdown::
+  Shuts down a running metastore instance on the same machine
+
+ENVIRONMENT
+-----------
+
+See 'sqoop(1)'
+
+
+////
+   Licensed to Cloudera, Inc. under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   Cloudera, Inc. licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+////
+
--- a/src/docs/user/SqoopUserGuide.txt
+++ b/src/docs/user/SqoopUserGuide.txt
@ -35,6 +35,8 @@ include::import-all-tables.txt[]

 include::export.txt[]

+include::saved-jobs.txt[]
+
 include::codegen.txt[]

 include::create-hive-table.txt[]
--- a/src/docs/user/import.txt
+++ b/src/docs/user/import.txt
@ -234,6 +234,55 @@ data to a temporary directory and then rename the files into the normal
 target directory in a manner that does not conflict with existing filenames
 in that directory.

+Incremental Imports
+^^^^^^^^^^^^^^^^^^^
+
+Sqoop provides an incremental import mode which can be used to retrieve
+only rows newer than some previously-imported set of rows.
+
+The following arguments control incremental imports:
+
+
+.Incremental import arguments:
+[grid="all"]
+`-----------------------------`--------------------------------------
+Argument                      Description
+---------------------------------------------------------------------
++\--check-column (col)+       Specifies the column to be examined \
+                              when determining which rows to import.
++\--incremental (mode)+       Specifies how Sqoop determines which \
+                              rows are new. Legal values for +mode+ \
+                              include +append+ and +lastmodified+.
++\--last-value (value)+       Specifies the maximum value of the \
+                              check column from the previous import.
+---------------------------------------------------------------------
+
+
+Sqoop supports two types of incremental imports: +append+ and +lastmodified+.
+You can use the +\--incremental+ argument to specify the type of incremental
+import to perform.
+
+You should specify +append+ mode when importing a table where new rows are
+continually being added with increasing row id values. You specify the column
+containing the row's id with +\--check-column+. Sqoop imports rows where the
+check column has a value greater than the one specified with +\--last-value+.
+
+An alternate table update strategy supported by Sqoop is called +lastmodified+
+mode. You should use this when rows of the source table may be updated, and
+each such update will set the value of a last-modified column to the current
+timestamp.  Rows where the check column holds a timestamp more recent than the
+timestamp specified with +\--last-value+ are imported.
+
+At the end of an incremental import, the value which should be specified as
++\--last-value+ for a subsequent import is printed to the screen. When running
+a subsequent import, you should specify +\--last-value+ in this way to ensure
+you import only the new or updated data. This is handled automatically by
+creating an incremental import as a saved job, which is the preferred
+mechanism for performing a recurring incremental import. See the section on
+saved jobs later in this document for more information.
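+
+As a minimal sketch of +append+ mode, a command along the following lines
+would import only rows whose check column exceeds 100. The connect string,
+table name, column name (id), and last value are placeholders chosen for
+illustration:
+
+----
+$ sqoop import --connect jdbc:mysql://example.com/db --table mytable \
+    --incremental append --check-column id --last-value 100
+----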
+
+
+
 File Formats
 ^^^^^^^^^^^^

--- a/src/docs/user/job-purpose.txt
+++ b/src/docs/user/job-purpose.txt
@ -0,0 +1,27 @@
+
+////
+   Licensed to Cloudera, Inc. under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   Cloudera, Inc. licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+////
+
+The job tool allows you to create and work with saved jobs. Saved jobs
+remember the parameters used to specify a job, so they can be
+re-executed by invoking the job by its handle.
+
+If a saved job is configured to perform an incremental import, state regarding
+the most recently imported rows is updated in the saved job to allow the job
+to continually import only the newest rows.
+
+
--- a/src/docs/user/metastore-purpose.txt
+++ b/src/docs/user/metastore-purpose.txt
@ -0,0 +1,26 @@
+
+////
+   Licensed to Cloudera, Inc. under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   Cloudera, Inc. licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+////
+
+The +metastore+ tool configures Sqoop to host a shared metadata repository.
+Multiple users and/or remote users can define and execute saved jobs (created
+with +sqoop job+) defined in this metastore.
+
+Clients must be configured to connect to the metastore in +sqoop-site.xml+ or
+with the +--meta-connect+ argument.
+
+
--- a/src/docs/user/saved-jobs.txt
+++ b/src/docs/user/saved-jobs.txt
@ -0,0 +1,233 @@
+
+////
+   Licensed to Cloudera, Inc. under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   Cloudera, Inc. licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+////
+
+Saved Jobs
+----------
+
+Imports and exports can be repeatedly performed by issuing the same command
+multiple times. Especially when using the incremental import capability,
+this is an expected scenario.
+
+Sqoop allows you to define _saved jobs_ which make this process easier. A
+saved job records the configuration information required to execute a
+Sqoop command at a later time. The section on the +sqoop-job+ tool
+describes how to create and work with saved jobs.
+
+By default, job descriptions are saved to a private repository stored
+in +$HOME/.sqoop/+. You can configure Sqoop to instead use a shared
+_metastore_, which makes saved jobs available to multiple users across a
+shared cluster. Starting the metastore is covered by the section on the
++sqoop-metastore+ tool.
+
+
++sqoop-job+
+-----------
+
+Purpose
+~~~~~~~
+
+include::job-purpose.txt[]
+
+Syntax
+~~~~~~
+
+----
+$ sqoop job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]
+$ sqoop-job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]
+----
+
+Although the Hadoop generic arguments must precede any job arguments,
+the job arguments can be entered in any order with respect to one
+another.
+
+.Job management options:
+[grid="all"]
+`---------------------------`------------------------------------------
+Argument                    Description
+-----------------------------------------------------------------------
++\--create <job-id>+        Define a new saved job with the specified \
+                            job-id (name). A second Sqoop \
+                            command-line, separated by a +\--+ should \
+                            be specified; this defines the saved job.
++\--delete <job-id>+        Delete a saved job.
++\--exec <job-id>+          Given a job defined with +\--create+, run \
+                            the saved job.
++\--show <job-id>+          Show the parameters for a saved job.
++\--list+                   List all saved jobs
+-----------------------------------------------------------------------
+
+Creating saved jobs is done with the +\--create+ action. This operation
+requires a +\--+ followed by a tool name and its arguments. The tool and
+its arguments will form the basis of the saved job. Consider:
+
+----
+$ sqoop job --create myjob -- import --connect jdbc:mysql://example.com/db \
+    --table mytable
+----
+
+This creates a job named +myjob+ which can be executed later. The job is not
+run. This job is now available in the list of saved jobs:
+
+----
+$ sqoop job --list
+Available jobs:
+  myjob
+----
+
+We can inspect the configuration of a job with the +show+ action:
+
+----
+ $ sqoop job --show myjob
+ Job: myjob
+ Tool: import
+ Options:
+ ----------------------------
+ direct.import = false
+ codegen.input.delimiters.record = 0
+ hdfs.append.dir = false
+ db.table = mytable
+ ...
+----
+
+And if we are satisfied with it, we can run the job with +exec+:
+
+----
+$ sqoop job --exec myjob
+10/08/19 13:08:45 INFO tool.CodeGenTool: Beginning code generation
+...
+----
+
+The +exec+ action allows you to override arguments of the saved job
+by supplying them after a +\--+. For example, if the database were
+changed to require a username, we could specify the username and
+password with:
+
+----
+$ sqoop job --exec myjob -- --username someuser -P
+Enter password:
+...
+----
+
+.Metastore connection options:
+[grid="all"]
+`----------------------------`-----------------------------------------
+Argument                     Description
+-----------------------------------------------------------------------
++\--meta-connect <jdbc-uri>+ Specifies the JDBC connect string used \
+                             to connect to the metastore
+-----------------------------------------------------------------------
+
+By default, a private metastore is instantiated in +$HOME/.sqoop+. If
+you have configured a hosted metastore with the +sqoop-metastore+
+tool, you can connect to it by specifying the +\--meta-connect+
+argument. This is a JDBC connect string just like the ones used to
+connect to databases for import.
+
+In +conf/sqoop-site.xml+, you can configure
++sqoop.metastore.client.autoconnect.url+ with this address, so you do not have
+to supply +\--meta-connect+ to use a remote metastore. This parameter can
+also be modified to move the private metastore to a location on your
+filesystem other than your home directory.
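+
+For example, a +conf/sqoop-site.xml+ fragment along these lines would point
+clients at a hosted metastore. This is a sketch: the host name is a
+placeholder, and the URL follows the form described in the section on the
+sqoop-metastore tool:
+
+----
+<property>
+  <name>sqoop.metastore.client.autoconnect.url</name>
+  <value>jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop</value>
+</property>
+----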
+
+If you configure +sqoop.metastore.client.enable.autoconnect+ with the
+value +false+, then you must explicitly supply +\--meta-connect+.
+
+.Common options:
+[grid="all"]
+`---------------------------`------------------------------------------
+Argument                    Description
+-----------------------------------------------------------------------
++\--help+                   Print usage instructions
++\--verbose+                Print more information while working
+-----------------------------------------------------------------------
+
+Saved jobs and passwords
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+The Sqoop metastore is not a secure resource. Multiple users can access
+its contents. For this reason, Sqoop does not store passwords in the
+metastore. If you create a job that requires a password, you will be
+prompted for that password each time you execute the job.
+
+You can enable passwords in the metastore by setting
++sqoop.metastore.client.record.password+ to +true+ in the configuration.
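+
+As a sketch, the corresponding entry in +conf/sqoop-site.xml+ would take the
+usual Hadoop-style property form:
+
+----
+<property>
+  <name>sqoop.metastore.client.record.password</name>
+  <value>true</value>
+</property>
+----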
+
+
+Saved jobs and incremental imports
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Incremental imports are performed by comparing the values in a _check column_
+against a reference value for the most recent import. For example, if the
++\--incremental append+ argument was specified, along with +\--check-column
+id+ and +\--last-value 100+, all rows with +id > 100+ will be imported.
+If an incremental import is run from the command line, the value which
+should be specified as +\--last-value+ in a subsequent incremental import
+will be printed to the screen for your reference. If an incremental import is
+run from a saved job, this value will be retained in the saved job. Subsequent
+runs of +sqoop job \--exec someIncrementalJob+ will continue to import only
+newer rows than those previously imported.
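+
+A sketch of this workflow, reusing the placeholder connect string and table
+from the earlier examples and assuming a check column named id:
+
+----
+$ sqoop job --create someIncrementalJob -- import \
+    --connect jdbc:mysql://example.com/db --table mytable \
+    --incremental append --check-column id --last-value 100
+$ sqoop job --exec someIncrementalJob
+----
+
+Each later +\--exec+ of the job would then start from the value retained by
+the previous run rather than from 100.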
+
+
++sqoop-metastore+
+-----------------
+
+Purpose
+~~~~~~~
+
+include::metastore-purpose.txt[]
+
+Syntax
+~~~~~~
+
+----
+$ sqoop metastore (generic-args) (metastore-args)
+$ sqoop-metastore (generic-args) (metastore-args)
+----
+
+Although the Hadoop generic arguments must precede any metastore arguments,
+the metastore arguments can be entered in any order with respect to one
+another.
+
+.Metastore management options:
+[grid="all"]
+`---------------------------`------------------------------------------
+Argument                    Description
+-----------------------------------------------------------------------
++\--shutdown+               Shuts down a running metastore instance \
+                            on the same machine.
+-----------------------------------------------------------------------
+
+Running +sqoop-metastore+ launches a shared HSQLDB database instance on
+the current machine. Clients can connect to this metastore and create jobs
+which can be shared between users for execution.
+
+The location of the metastore's files on disk is controlled by the
++sqoop.metastore.server.location+ property in +conf/sqoop-site.xml+.
+This should point to a directory on the local filesystem.
+
+The metastore is available over TCP/IP. The port is controlled by the
++sqoop.metastore.server.port+ configuration parameter, and defaults to 16000.
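+
+A +conf/sqoop-site.xml+ fragment setting both properties might look like the
+following sketch. The directory path is a hypothetical placeholder; the port
+shown is the default:
+
+----
+<property>
+  <name>sqoop.metastore.server.location</name>
+  <value>/var/lib/sqoop/metastore</value>
+</property>
+<property>
+  <name>sqoop.metastore.server.port</name>
+  <value>16000</value>
+</property>
+----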
+
+Clients should connect to the metastore by specifying
++sqoop.metastore.client.autoconnect.url+ or +\--meta-connect+ with the
+value +jdbc:hsqldb:hsql://<server-name>:<port>/sqoop+. For example,
++jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop+.
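+
+As a sketch, a client could list the jobs stored in that metastore by passing
+the connect string on the command line (the host name is the same placeholder
+used above):
+
+----
+$ sqoop job --meta-connect jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop \
+    --list
+----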
+
+This metastore may be hosted on a machine within the Hadoop cluster, or
+elsewhere on the network.
+