SQOOP-42. Document saved jobs, metastore, and incremental imports.
Added manual pages and user guide sections for sqoop-job and sqoop-metastore. Updated sqoop-import documentation to describe incremental imports.

From: Aaron Kimball <aaron@cloudera.com>

git-svn-id: https://svn.apache.org/repos/asf/incubator/sqoop/trunk@1149968 13f79535-47bb-0310-9956-ffa450edef68
This commit is contained in:
parent 36f93eac1d
commit d656663a14
@@ -86,6 +86,32 @@ include::hbase-args.txt[]

include::codegen-args.txt[]

Incremental import options
~~~~~~~~~~~~~~~~~~~~~~~~~~

Sqoop can be configured to import only "new" data from the source database
using the arguments in this section. While any import job can make use of
these arguments, they are most powerful when used to initialize a recurring
job with +sqoop job --create ...+. After executing a saved job, the last
observed value of the check column is updated in the saved job.

--incremental (mode)::
  Specifies that this is an incremental import. Determines how Sqoop should
  discover new data. "mode" may be +append+, in which case new rows are
  expected to be added with increasing id values, or +lastmodified+, in which
  case new data is discovered by comparing a timestamp column with the
  timestamp at which the last import was performed.

--check-column (col)::
  Specifies a column whose value should be compared to the last imported
  id or the last import timestamp to determine rows to import.

--last-value (value)::
  Specifies the most recent id imported, or the timestamp of the most recent
  id. This argument is unnecessary for an initial import.
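
For illustration only, an initial incremental import might be invoked as
follows (a minimal sketch; the connect string, table, and check column are
placeholder names, and +--last-value+ is omitted because this is an initial
import):

----
$ sqoop import --connect jdbc:mysql://db.example.com/mydb --table orders \
    --incremental append --check-column id
----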

Database-specific options
~~~~~~~~~~~~~~~~~~~~~~~~~

94 src/docs/man/sqoop-job.txt Normal file
@@ -0,0 +1,94 @@
sqoop-job(1)
============

NAME
----
sqoop-job - Define, execute, and manipulate saved Sqoop jobs.

SYNOPSIS
--------
'sqoop-job' <generic-options> <tool-options> [-- [<subtool-name>]
<subtool-options>]

'sqoop job' <generic-options> <tool-options> [-- [<subtool-name>]
<subtool-options>]


DESCRIPTION
-----------

include::../user/job-purpose.txt[]

OPTIONS
-------

One of the job operation arguments is required.

Job management options
~~~~~~~~~~~~~~~~~~~~~~

--create (job-id)::
  Define a new saved job with the specified job-id (name). A
  second Sqoop command-line, separated by a +--+, should be specified;
  this defines the saved job.

--delete (job-id)::
  Delete a saved job. (This deletes the job definition, but does not
  remove any data from HDFS.)

--exec (job-id)::
  Given a job id defined with +--create+, run the saved job. Any arguments
  following a +--+ are applied on top of the saved job, overriding the saved
  parameters.

--show (job-id)::
  Show the parameters for a saved job.

--list::
  List all saved jobs.
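
As an illustration, a job can be defined once and then re-run by name (a
minimal sketch; the connect string and table name are placeholders):

----
$ sqoop job --create myjob -- import --connect jdbc:mysql://example.com/db \
    --table mytable
$ sqoop job --exec myjob
----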

Metastore connection options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, Sqoop will use a private, local embedded database to store saved
jobs. An alternate metastore can be configured in +conf/sqoop-site.xml+. You
can also specify the metastore connect string here:

--meta-connect (jdbc-uri)::
  Specifies the JDBC connect string used to connect to the metastore
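
For example, jobs stored in a shared metastore could be listed with a command
along these lines (a sketch; the host and port reuse the example address given
in the user guide):

----
$ sqoop job --list \
    --meta-connect jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop
----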

Common options
~~~~~~~~~~~~~~

--help::
  Print usage instructions

--verbose::
  Print more information while working


ENVIRONMENT
-----------

See 'sqoop(1)'


////
Licensed to Cloudera, Inc. under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
Cloudera, Inc. licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////

51 src/docs/man/sqoop-metastore.txt Normal file
@@ -0,0 +1,51 @@
sqoop-metastore(1)
==================

NAME
----
sqoop-metastore - Host a shared repository for saved Sqoop jobs.

SYNOPSIS
--------
'sqoop-metastore' <generic-options> <tool-options>

'sqoop metastore' <generic-options> <tool-options>


DESCRIPTION
-----------

include::../user/metastore-purpose.txt[]

OPTIONS
-------

Metastore management options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--shutdown::
  Shuts down a running metastore instance on the same machine
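
For example, the metastore can be started in the foreground on a machine and
later stopped from that same machine (a minimal sketch):

----
$ sqoop metastore
...
$ sqoop metastore --shutdown
----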

ENVIRONMENT
-----------

See 'sqoop(1)'


////
Licensed to Cloudera, Inc. under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
Cloudera, Inc. licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////

@@ -35,6 +35,8 @@ include::import-all-tables.txt[]

include::export.txt[]

include::saved-jobs.txt[]

include::codegen.txt[]

include::create-hive-table.txt[]

@@ -234,6 +234,55 @@ data to a temporary directory and then rename the files into the normal
target directory in a manner that does not conflict with existing filenames
in that directory.

Incremental Imports
^^^^^^^^^^^^^^^^^^^

Sqoop provides an incremental import mode which can be used to retrieve
only rows newer than some previously-imported set of rows.

The following arguments control incremental imports:

.Incremental import arguments:
[grid="all"]
`-----------------------------`--------------------------------------
Argument                        Description
---------------------------------------------------------------------
+\--check-column (col)+         Specifies the column to be examined \
                                when determining which rows to import.
+\--incremental (mode)+         Specifies how Sqoop determines which \
                                rows are new. Legal values for +mode+ \
                                include +append+ and +lastmodified+.
+\--last-value (value)+         Specifies the maximum value of the \
                                check column from the previous import.
---------------------------------------------------------------------

Sqoop supports two types of incremental imports: +append+ and +lastmodified+.
You can use the +\--incremental+ argument to specify the type of incremental
import to perform.

You should specify +append+ mode when importing a table where new rows are
continually being added with increasing row id values. You specify the column
containing the row's id with +\--check-column+. Sqoop imports rows where the
check column has a value greater than the one specified with +\--last-value+.
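
For example, an +append+-mode import over a table keyed by a growing +id+
column could be run as follows (a sketch only; the connect string and table
name are placeholders, and the last value continues the +id > 100+ example
used later in this guide):

----
$ sqoop import --connect jdbc:mysql://db.example.com/mydb --table mytable \
    --incremental append --check-column id --last-value 100
----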

An alternate table update strategy supported by Sqoop is called +lastmodified+
mode. You should use this when rows of the source table may be updated, and
each such update will set the value of a last-modified column to the current
timestamp. Rows where the check column holds a timestamp more recent than the
timestamp specified with +\--last-value+ are imported.
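
A +lastmodified+ import can be sketched similarly (the timestamp column name
and the last value shown are purely illustrative):

----
$ sqoop import --connect jdbc:mysql://db.example.com/mydb --table mytable \
    --incremental lastmodified --check-column last_update_time \
    --last-value "2010-08-19 13:08:45"
----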

At the end of an incremental import, the value which should be specified as
+\--last-value+ for a subsequent import is printed to the screen. When running
a subsequent import, you should specify +\--last-value+ in this way to ensure
you import only the new or updated data. This is handled automatically by
creating an incremental import as a saved job, which is the preferred
mechanism for performing a recurring incremental import. See the section on
saved jobs later in this document for more information.


File Formats
^^^^^^^^^^^^

27 src/docs/user/job-purpose.txt Normal file
@@ -0,0 +1,27 @@

////
Licensed to Cloudera, Inc. under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
Cloudera, Inc. licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////

The job tool allows you to create and work with saved jobs. Saved jobs
remember the parameters used to specify a job, so they can be
re-executed by invoking the job by its handle.

If a saved job is configured to perform an incremental import, state regarding
the most recently imported rows is updated in the saved job to allow the job
to continually import only the newest rows.

26 src/docs/user/metastore-purpose.txt Normal file
@@ -0,0 +1,26 @@

////
Licensed to Cloudera, Inc. under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
Cloudera, Inc. licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////

The +metastore+ tool configures Sqoop to host a shared metadata repository.
Multiple users and/or remote users can define and execute saved jobs (created
with +sqoop job+) defined in this metastore.

Clients must be configured to connect to the metastore in +sqoop-site.xml+ or
with the +--meta-connect+ argument.

233 src/docs/user/saved-jobs.txt Normal file
@@ -0,0 +1,233 @@

////
Licensed to Cloudera, Inc. under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
Cloudera, Inc. licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////

Saved Jobs
----------

Imports and exports can be repeatedly performed by issuing the same command
multiple times. Especially when using the incremental import capability,
this is an expected scenario.

Sqoop allows you to define _saved jobs_ which make this process easier. A
saved job records the configuration information required to execute a
Sqoop command at a later time. The section on the +sqoop-job+ tool
describes how to create and work with saved jobs.

By default, job descriptions are saved to a private repository stored
in +$HOME/.sqoop/+. You can configure Sqoop to instead use a shared
_metastore_, which makes saved jobs available to multiple users across a
shared cluster. Starting the metastore is covered by the section on the
+sqoop-metastore+ tool.


+sqoop-job+
-----------

Purpose
~~~~~~~

include::job-purpose.txt[]

Syntax
~~~~~~

----
$ sqoop job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]
$ sqoop-job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]
----

Although the Hadoop generic arguments must precede any job arguments,
the job arguments can be entered in any order with respect to one
another.

.Job management options:
[grid="all"]
`---------------------------`------------------------------------------
Argument                     Description
-----------------------------------------------------------------------
+\--create <job-id>+         Define a new saved job with the specified \
                             job-id (name). A second Sqoop \
                             command-line, separated by a +\--+, should \
                             be specified; this defines the saved job.
+\--delete <job-id>+         Delete a saved job.
+\--exec <job-id>+           Given a job defined with +\--create+, run \
                             the saved job.
+\--show <job-id>+           Show the parameters for a saved job.
+\--list+                    List all saved jobs
-----------------------------------------------------------------------

Creating saved jobs is done with the +\--create+ action. This operation
requires a +\--+ followed by a tool name and its arguments. The tool and
its arguments will form the basis of the saved job. Consider:

----
$ sqoop job --create myjob -- import --connect jdbc:mysql://example.com/db \
    --table mytable
----

This creates a job named +myjob+ which can be executed later. The job is not
run. This job is now available in the list of saved jobs:

----
$ sqoop job --list
Available jobs:
  myjob
----

We can inspect the configuration of a job with the +show+ action:

----
$ sqoop job --show myjob
Job: myjob
Tool: import
Options:
----------------------------
direct.import = false
codegen.input.delimiters.record = 0
hdfs.append.dir = false
db.table = mytable
...
----

And if we are satisfied with it, we can run the job with +exec+:

----
$ sqoop job --exec myjob
10/08/19 13:08:45 INFO tool.CodeGenTool: Beginning code generation
...
----

The +exec+ action allows you to override arguments of the saved job
by supplying them after a +\--+. For example, if the database were
changed to require a username, we could specify the username and
password with:

----
$ sqoop job --exec myjob -- --username someuser -P
Enter password:
...
----

.Metastore connection options:
[grid="all"]
`----------------------------`-----------------------------------------
Argument                      Description
-----------------------------------------------------------------------
+\--meta-connect <jdbc-uri>+  Specifies the JDBC connect string used \
                              to connect to the metastore
-----------------------------------------------------------------------

By default, a private metastore is instantiated in +$HOME/.sqoop+. If
you have configured a hosted metastore with the +sqoop-metastore+
tool, you can connect to it by specifying the +\--meta-connect+
argument. This is a JDBC connect string just like the ones used to
connect to databases for import.

In +conf/sqoop-site.xml+, you can configure
+sqoop.metastore.client.autoconnect.url+ with this address, so you do not have
to supply +\--meta-connect+ to use a remote metastore. This parameter can
also be modified to move the private metastore to a location on your
filesystem other than your home directory.

If you configure +sqoop.metastore.client.enable.autoconnect+ with the
value +false+, then you must explicitly supply +\--meta-connect+.
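
As a sketch of what the client-side entry in +conf/sqoop-site.xml+ might look
like (the host name is a placeholder; the property name is the one described
above):

----
<property>
  <name>sqoop.metastore.client.autoconnect.url</name>
  <value>jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop</value>
</property>
----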

.Common options:
[grid="all"]
`---------------------------`------------------------------------------
Argument                     Description
-----------------------------------------------------------------------
+\--help+                    Print usage instructions
+\--verbose+                 Print more information while working
-----------------------------------------------------------------------

Saved jobs and passwords
~~~~~~~~~~~~~~~~~~~~~~~~

The Sqoop metastore is not a secure resource. Multiple users can access
its contents. For this reason, Sqoop does not store passwords in the
metastore. If you create a job that requires a password, you will be
prompted for that password each time you execute the job.

You can enable passwords in the metastore by setting
+sqoop.metastore.client.record.password+ to +true+ in the configuration.
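
If you choose to do this, the corresponding +conf/sqoop-site.xml+ entry could
be sketched as:

----
<property>
  <name>sqoop.metastore.client.record.password</name>
  <value>true</value>
</property>
----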

Saved jobs and incremental imports
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Incremental imports are performed by comparing the values in a _check column_
against a reference value for the most recent import. For example, if the
+\--incremental append+ argument was specified, along with +\--check-column
id+ and +\--last-value 100+, all rows with +id > 100+ will be imported.
If an incremental import is run from the command line, the value which
should be specified as +\--last-value+ in a subsequent incremental import
will be printed to the screen for your reference. If an incremental import is
run from a saved job, this value will be retained in the saved job. Subsequent
runs of +sqoop job \--exec someIncrementalJob+ will continue to import only
newer rows than those previously imported.
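
Such a job could be defined along these lines (a sketch; the connect string,
table, and starting value are placeholders, and the job name matches the one
used above):

----
$ sqoop job --create someIncrementalJob -- import \
    --connect jdbc:mysql://example.com/db --table mytable \
    --incremental append --check-column id --last-value 100
----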

+sqoop-metastore+
-----------------

Purpose
~~~~~~~

include::metastore-purpose.txt[]

Syntax
~~~~~~

----
$ sqoop metastore (generic-args) (metastore-args)
$ sqoop-metastore (generic-args) (metastore-args)
----

Although the Hadoop generic arguments must precede any metastore arguments,
the metastore arguments can be entered in any order with respect to one
another.

.Metastore management options:
[grid="all"]
`---------------------------`------------------------------------------
Argument                     Description
-----------------------------------------------------------------------
+\--shutdown+                Shuts down a running metastore instance \
                             on the same machine.
-----------------------------------------------------------------------

Running +sqoop-metastore+ launches a shared HSQLDB database instance on
the current machine. Clients can connect to this metastore and create jobs
which can be shared between users for execution.

The location of the metastore's files on disk is controlled by the
+sqoop.metastore.server.location+ property in +conf/sqoop-site.xml+.
This should point to a directory on the local filesystem.

The metastore is available over TCP/IP. The port is controlled by the
+sqoop.metastore.server.port+ configuration parameter, and defaults to 16000.
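
The corresponding server-side entries in +conf/sqoop-site.xml+ might be
sketched as follows (the directory shown is only an example; the property
names are the ones described above):

----
<property>
  <name>sqoop.metastore.server.location</name>
  <value>/var/lib/sqoop/metastore</value>
</property>
<property>
  <name>sqoop.metastore.server.port</name>
  <value>16000</value>
</property>
----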

Clients should connect to the metastore by specifying
+sqoop.metastore.client.autoconnect.url+ or +\--meta-connect+ with the
value +jdbc:hsqldb:hsql://<server-name>:<port>/sqoop+. For example,
+jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop+.
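
For instance, a client on another machine could define and run a job in the
shared metastore roughly as follows (a sketch; the job name, connect string,
and table are placeholders, while the metastore address reuses the example
above):

----
$ sqoop job --meta-connect jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop \
    --create sharedjob -- import --connect jdbc:mysql://example.com/db --table mytable
$ sqoop job --meta-connect jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop \
    --exec sharedjob
----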

This metastore may be hosted on a machine within the Hadoop cluster, or
elsewhere on the network.