mirror of https://github.com/apache/sqoop.git synced 2025-05-03 22:20:52 +08:00

SQOOP-42. Document saved jobs, metastore, and incremental imports.

Added manual pages and user guide sections for sqoop-job and sqoop-metastore.
Updated sqoop-import documentation to describe incremental imports.

From: Aaron Kimball <aaron@cloudera.com>

git-svn-id: https://svn.apache.org/repos/asf/incubator/sqoop/trunk@1149968 13f79535-47bb-0310-9956-ffa450edef68
Andrew Bayer 2011-07-22 20:04:14 +00:00
parent 36f93eac1d
commit d656663a14
8 changed files with 508 additions and 0 deletions

@ -86,6 +86,32 @@ include::hbase-args.txt[]
include::codegen-args.txt[]
Incremental import options
~~~~~~~~~~~~~~~~~~~~~~~~~~
Sqoop can be configured to import only "new" data from the source database
using the arguments in this section. While any import job can make use of
these arguments, they are most powerful when used to initialize a recurring
job with +sqoop job --create ...+. After executing a saved job, the last
observed value of the check column is updated in the saved job.
--incremental (mode)::
Specifies that this is an incremental import. Determines how Sqoop should
discover new data. "mode" may be +append+, in which case new rows are
expected to be added with increasing id values, or +lastmodified+, in which
case new data is discovered by comparing a timestamp column with the
timestamp at which the last import was performed.
--check-column (col)::
Specifies a column whose value should be compared to the last imported
id or the last import timestamp to determine rows to import.
--last-value (value)::
Specifies the most recent id imported, or the timestamp of the most recent
import. This argument is unnecessary for an initial import.
Database-specific options
~~~~~~~~~~~~~~~~~~~~~~~~~

@ -0,0 +1,94 @@
sqoop-job(1)
============
NAME
----
sqoop-job - Define, execute, and manipulate saved Sqoop jobs.
SYNOPSIS
--------
'sqoop-job' <generic-options> <tool-options> [-- [<subtool-name>]
<subtool-options>]
'sqoop job' <generic-options> <tool-options> [-- [<subtool-name>]
<subtool-options>]
DESCRIPTION
-----------
include::../user/job-purpose.txt[]
OPTIONS
-------
One of the job operation arguments is required.
Job management options
~~~~~~~~~~~~~~~~~~~~~~
--create (job-id)::
Define a new saved job with the specified job-id (name). A
second Sqoop command line, separated by a +--+, should be specified;
this defines the saved job.
--delete (job-id)::
Delete a saved job. (This deletes the job definition, but does not
remove any data from HDFS.)
--exec (job-id)::
Given a job id defined with +--create+, run the saved job. Any arguments
following a +--+ are applied on top of the saved job, overriding the saved
parameters.
--show (job-id)::
Show the parameters for a saved job.
--list::
List all saved jobs.
Metastore connection options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
By default, Sqoop will use a private, local embedded database to store saved
jobs. An alternate metastore can be configured in +conf/sqoop-site.xml+. You
can also specify the metastore connect string here:
--meta-connect (jdbc-uri)::
Specifies the JDBC connect string used to connect to the metastore
Common options
~~~~~~~~~~~~~~
--help::
Print usage instructions
--verbose::
Print more information while working
ENVIRONMENT
-----------
See 'sqoop(1)'
////
Licensed to Cloudera, Inc. under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
Cloudera, Inc. licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////

@ -0,0 +1,51 @@
sqoop-metastore(1)
==================
NAME
----
sqoop-metastore - Host a shared repository for saved Sqoop jobs.
SYNOPSIS
--------
'sqoop-metastore' <generic-options> <tool-options>
'sqoop metastore' <generic-options> <tool-options>
DESCRIPTION
-----------
include::../user/metastore-purpose.txt[]
OPTIONS
-------
Metastore management options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--shutdown::
Shuts down a running metastore instance on the same machine
ENVIRONMENT
-----------
See 'sqoop(1)'
////
Licensed to Cloudera, Inc. under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
Cloudera, Inc. licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////

@ -35,6 +35,8 @@ include::import-all-tables.txt[]
include::export.txt[]
include::saved-jobs.txt[]
include::codegen.txt[]
include::create-hive-table.txt[]

@ -234,6 +234,55 @@ data to a temporary directory and then rename the files into the normal
target directory in a manner that does not conflict with existing filenames
in that directory.
Incremental Imports
^^^^^^^^^^^^^^^^^^^
Sqoop provides an incremental import mode which can be used to retrieve
only rows newer than some previously-imported set of rows.
The following arguments control incremental imports:
.Incremental import arguments:
[grid="all"]
`-----------------------------`--------------------------------------
Argument Description
---------------------------------------------------------------------
+\--check-column (col)+ Specifies the column to be examined \
when determining which rows to import.
+\--incremental (mode)+ Specifies how Sqoop determines which \
rows are new. Legal values for +mode+ \
include +append+ and +lastmodified+.
+\--last-value (value)+ Specifies the maximum value of the \
check column from the previous import.
---------------------------------------------------------------------
Sqoop supports two types of incremental imports: +append+ and +lastmodified+.
You can use the +\--incremental+ argument to specify the type of incremental
import to perform.
You should specify +append+ mode when importing a table where new rows are
continually being added with increasing row id values. You specify the column
containing the row's id with +\--check-column+. Sqoop imports rows where the
check column has a value greater than the one specified with +\--last-value+.
An alternate table update strategy supported by Sqoop is called +lastmodified+
mode. You should use this when rows of the source table may be updated, and
each such update will set the value of a last-modified column to the current
timestamp. Rows where the check column holds a timestamp more recent than the
timestamp specified with +\--last-value+ are imported.
At the end of an incremental import, the value which should be specified as
+\--last-value+ for a subsequent import is printed to the screen. When running
a subsequent import, you should specify +\--last-value+ in this way to ensure
you import only the new or updated data. This is handled automatically by
creating an incremental import as a saved job, which is the preferred
mechanism for performing a recurring incremental import. See the section on
saved jobs later in this document for more information.
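As a minimal sketch of the two modes described above, the following shell function prints (rather than runs) an incremental import command. The helper name +incremental_import_cmd+, the connect string, the table, and the column name are illustrative assumptions; only the +sqoop import+ arguments come from this section.

```shell
# Hypothetical helper: print an incremental import command without running it.
# Usage: incremental_import_cmd <mode> <check-column> <last-value>
incremental_import_cmd() {
  mode="$1"      # append or lastmodified
  column="$2"    # column examined to find new rows
  last="$3"      # maximum check-column value from the previous import
  echo sqoop import \
    --connect jdbc:mysql://example.com/db \
    --table mytable \
    --incremental "$mode" \
    --check-column "$column" \
    --last-value "$last"
}

# Append mode: import rows whose id is greater than 100 (placeholder value).
incremental_import_cmd append id 100
```

For +lastmodified+ mode, pass the timestamp column as the check column and the value printed at the end of the previous import as the last value. Because the helper only echoes its arguments, a value containing spaces would need to be quoted again before the printed command is pasted into a shell.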
File Formats
^^^^^^^^^^^^

@ -0,0 +1,27 @@
////
Licensed to Cloudera, Inc. under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
Cloudera, Inc. licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////
The job tool allows you to create and work with saved jobs. Saved jobs
remember the parameters used to specify a job, so they can be
re-executed by invoking the job by its handle.
If a saved job is configured to perform an incremental import, state regarding
the most recently imported rows is updated in the saved job to allow the job
to continually import only the newest rows.

@ -0,0 +1,26 @@
////
Licensed to Cloudera, Inc. under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
Cloudera, Inc. licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////
The +metastore+ tool configures Sqoop to host a shared metadata repository.
Multiple users, including remote users, can define and execute saved jobs
(created with +sqoop job+) in this metastore.
Clients must be configured to connect to the metastore in +sqoop-site.xml+ or
with the +--meta-connect+ argument.

@ -0,0 +1,233 @@
////
Licensed to Cloudera, Inc. under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
Cloudera, Inc. licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////
Saved Jobs
----------
Imports and exports can be repeated by issuing the same command
multiple times. This is the expected pattern when using the incremental
import capability.
Sqoop allows you to define _saved jobs_ which make this process easier. A
saved job records the configuration information required to execute a
Sqoop command at a later time. The section on the +sqoop-job+ tool
describes how to create and work with saved jobs.
By default, job descriptions are saved to a private repository stored
in +$HOME/.sqoop/+. You can configure Sqoop to instead use a shared
_metastore_, which makes saved jobs available to multiple users across a
shared cluster. Starting the metastore is covered by the section on the
+sqoop-metastore+ tool.
+sqoop-job+
-----------
Purpose
~~~~~~~
include::job-purpose.txt[]
Syntax
~~~~~~
----
$ sqoop job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]
$ sqoop-job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]
----
Although the Hadoop generic arguments must precede any job arguments,
the job arguments can be entered in any order with respect to one
another.
.Job management options:
[grid="all"]
`---------------------------`------------------------------------------
Argument Description
-----------------------------------------------------------------------
+\--create <job-id>+ Define a new saved job with the specified \
job-id (name). A second Sqoop \
command line, separated by a +\--+, should \
be specified; this defines the saved job.
+\--delete <job-id>+ Delete a saved job.
+\--exec <job-id>+ Given a job defined with +\--create+, run \
the saved job.
+\--show <job-id>+ Show the parameters for a saved job.
+\--list+ List all saved jobs.
-----------------------------------------------------------------------
Creating saved jobs is done with the +\--create+ action. This operation
requires a +\--+ followed by a tool name and its arguments. The tool and
its arguments will form the basis of the saved job. Consider:
----
$ sqoop job --create myjob -- import --connect jdbc:mysql://example.com/db \
--table mytable
----
This creates a job named +myjob+ which can be executed later. The job is not
run. This job is now available in the list of saved jobs:
----
$ sqoop job --list
Available jobs:
myjob
----
We can inspect the configuration of a job with the +show+ action:
----
$ sqoop job --show myjob
Job: myjob
Tool: import
Options:
----------------------------
direct.import = false
codegen.input.delimiters.record = 0
hdfs.append.dir = false
db.table = mytable
...
----
And if we are satisfied with it, we can run the job with +exec+:
----
$ sqoop job --exec myjob
10/08/19 13:08:45 INFO tool.CodeGenTool: Beginning code generation
...
----
The +exec+ action allows you to override arguments of the saved job
by supplying them after a +\--+. For example, if the database were
changed to require a username, we could specify the username and
password with:
----
$ sqoop job --exec myjob -- --username someuser -P
Enter password:
...
----
.Metastore connection options:
[grid="all"]
`----------------------------`-----------------------------------------
Argument Description
-----------------------------------------------------------------------
+\--meta-connect <jdbc-uri>+ Specifies the JDBC connect string used \
to connect to the metastore
-----------------------------------------------------------------------
By default, a private metastore is instantiated in +$HOME/.sqoop+. If
you have configured a hosted metastore with the +sqoop-metastore+
tool, you can connect to it by specifying the +\--meta-connect+
argument. This is a JDBC connect string just like the ones used to
connect to databases for import.
In +conf/sqoop-site.xml+, you can configure
+sqoop.metastore.client.autoconnect.url+ with this address, so you do not have
to supply +\--meta-connect+ to use a remote metastore. This parameter can
also be modified to move the private metastore to a location on your
filesystem other than your home directory.
If you configure +sqoop.metastore.client.enable.autoconnect+ with the
value +false+, then you must explicitly supply +\--meta-connect+.
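The properties named above are ordinary Hadoop-style configuration entries. As a rough sketch, the shell helper below prints one +<property>+ element that could be pasted inside the +<configuration>+ element of +conf/sqoop-site.xml+. The helper name +site_property+ is hypothetical, and the host and port in the example URL are placeholders.

```shell
# Hypothetical helper: print one Hadoop-style <property> element.
# Usage: site_property <name> <value>
site_property() {
  printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' "$1" "$2"
}

# Point clients at a shared metastore (host and port are placeholders).
site_property sqoop.metastore.client.autoconnect.url \
  jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop
```

The same helper can emit +sqoop.metastore.client.enable.autoconnect+ with the value +false+ when every invocation should be required to pass +\--meta-connect+ explicitly.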
.Common options:
[grid="all"]
`---------------------------`------------------------------------------
Argument Description
-----------------------------------------------------------------------
+\--help+ Print usage instructions
+\--verbose+ Print more information while working
-----------------------------------------------------------------------
Saved jobs and passwords
~~~~~~~~~~~~~~~~~~~~~~~~
The Sqoop metastore is not a secure resource. Multiple users can access
its contents. For this reason, Sqoop does not store passwords in the
metastore. If you create a job that requires a password, you will be
prompted for that password each time you execute the job.
You can enable passwords in the metastore by setting
+sqoop.metastore.client.record.password+ to +true+ in the configuration.
Saved jobs and incremental imports
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Incremental imports are performed by comparing the values in a _check column_
against a reference value for the most recent import. For example, if the
+\--incremental append+ argument was specified, along with +\--check-column
id+ and +\--last-value 100+, all rows with +id > 100+ will be imported.
If an incremental import is run from the command line, the value which
should be specified as +\--last-value+ in a subsequent incremental import
will be printed to the screen for your reference. If an incremental import is
run from a saved job, this value will be retained in the saved job. Subsequent
runs of +sqoop job \--exec someIncrementalJob+ will continue to import only
newer rows than those previously imported.
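The workflow described above might look like the following sketch, which only prints the two commands involved. The job name is taken from the text; the connect string, table, and check column are placeholders rather than values required by Sqoop.

```shell
# Print (do not run) the commands for a recurring incremental import.
# The connect string, table, and check column are placeholders.
job=someIncrementalJob

create_cmd="sqoop job --create $job -- import --connect jdbc:mysql://example.com/db --table mytable --incremental append --check-column id"
exec_cmd="sqoop job --exec $job"

echo "$create_cmd"   # define the job once; --last-value is omitted for the initial import
echo "$exec_cmd"     # each later run resumes from the value retained in the saved job
```

Because the saved job retains the last observed value of the check column, the +\--exec+ command can be repeated unchanged, for example from a scheduler.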
+sqoop-metastore+
-----------------
Purpose
~~~~~~~
include::metastore-purpose.txt[]
Syntax
~~~~~~
----
$ sqoop metastore (generic-args) (metastore-args)
$ sqoop-metastore (generic-args) (metastore-args)
----
Although the Hadoop generic arguments must precede any metastore arguments,
the metastore arguments can be entered in any order with respect to one
another.
.Metastore management options:
[grid="all"]
`---------------------------`------------------------------------------
Argument Description
-----------------------------------------------------------------------
+\--shutdown+ Shuts down a running metastore instance \
on the same machine.
-----------------------------------------------------------------------
Running +sqoop-metastore+ launches a shared HSQLDB database instance on
the current machine. Clients can connect to this metastore and create jobs
which can be shared between users for execution.
The location of the metastore's files on disk is controlled by the
+sqoop.metastore.server.location+ property in +conf/sqoop-site.xml+.
This should point to a directory on the local filesystem.
The metastore is available over TCP/IP. The port is controlled by the
+sqoop.metastore.server.port+ configuration parameter, and defaults to 16000.
Clients should connect to the metastore by specifying
+sqoop.metastore.client.autoconnect.url+ or +\--meta-connect+ with the
value +jdbc:hsqldb:hsql://<server-name>:<port>/sqoop+. For example,
+jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop+.
This metastore may be hosted on a machine within the Hadoop cluster, or
elsewhere on the network.
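To illustrate the connect string format above, here is a small sketch that builds the URL from a host name and an optional port, falling back to the documented default of 16000. The helper name +metastore_url+ is hypothetical, and the final line prints a client command instead of running it.

```shell
# Hypothetical helper: build the metastore connect string for a host.
# Usage: metastore_url <server-name> [port]
metastore_url() {
  host="$1"
  port="${2:-16000}"   # documented default port
  echo "jdbc:hsqldb:hsql://${host}:${port}/sqoop"
}

# Print a job listing command that targets the shared metastore.
echo sqoop job --list --meta-connect "$(metastore_url metaserver.example.com)"
```

If +sqoop.metastore.server.port+ has been changed on the server, pass the same port as the second argument so that clients and server agree.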