diff --git a/src/docs/man/sqoop-import.txt b/src/docs/man/sqoop-import.txt index 28255567..dc4fc8ac 100644 --- a/src/docs/man/sqoop-import.txt +++ b/src/docs/man/sqoop-import.txt @@ -86,6 +86,32 @@ include::hbase-args.txt[] include::codegen-args.txt[] + +Incremental import options +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Sqoop can be configured to import only "new" data from the source database +using the arguments in this section. While any import job can make use of +these arguments, they are most powerful when used to initialize a recurring +job with +sqoop job --create ...+. After executing a saved job, the last +observed value of the check column is updated in the saved job. + +--incremental (mode):: + Specifies that this is an incremental import. Determines how Sqoop should + discover new data. "mode" may be +append+, in which case new rows are + expected to be added with increasing id values, or +lastmodified+, in which + case new data is discovered by comparing a timestamp column with the + timestamp at which the last import was performed. + +--check-column (col):: + Specifies a column whose value should be compared to the last imported + id or the last import timestamp to determine rows to import. + +--last-value (value):: + Specifies the most recent id imported, or the timestamp of the most recent + id. This argument is unnecessary for an initial import. + + Database-specific options ~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/src/docs/man/sqoop-job.txt b/src/docs/man/sqoop-job.txt new file mode 100644 index 00000000..6244f34b --- /dev/null +++ b/src/docs/man/sqoop-job.txt @@ -0,0 +1,94 @@ +sqoop-job(1) +============ + +NAME +---- +sqoop-job - Define, execute, and manipulate saved Sqoop jobs. + +SYNOPSIS +-------- +'sqoop-job' [-- [] +] + +'sqoop job' [-- [] +] + + +DESCRIPTION +----------- + +include::../user/job-purpose.txt[] + +OPTIONS +------- + +One of the job operation arguments is required. 
+ +Job management options +~~~~~~~~~~~~~~~~~~~~~~ + +--create (job-id):: + Define a new saved job with the specified job-id (name). A + second Sqoop command-line, separated by a +--+ should be specified; + this defines the saved job. + +--delete (job-id):: + Delete a saved job. (This deletes the job definition, but does not + remove any data from HDFS.) + +--exec (job-id):: + Given a job id defined with +--create+, run the saved job. Any arguments + following a +--+ are applied on top of the saved job, overriding the saved + parameters. + +--show (job-id):: + Shows the parameters for a saved job + +--list:: + List all saved jobs + + +Metastore connection options +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +By default, Sqoop will use a private, local embedded database to store saved +jobs. An alternate metastore can be configured in +conf/sqoop-site.xml+. You +can also specify the metastore connect string here: + +--meta-connect (jdbc-uri):: + Specifies the JDBC connect string used to connect to the metastore + + +Common options +~~~~~~~~~~~~~~ + +--help:: + Print usage instructions + +--verbose:: + Print more information while working + + +ENVIRONMENT +----------- + +See 'sqoop(1)' + + +//// + Licensed to Cloudera, Inc. under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + Cloudera, Inc. licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. 
+//// + diff --git a/src/docs/man/sqoop-metastore.txt b/src/docs/man/sqoop-metastore.txt new file mode 100644 index 00000000..f00504c1 --- /dev/null +++ b/src/docs/man/sqoop-metastore.txt @@ -0,0 +1,51 @@ +sqoop-metastore(1) +================== + +NAME +---- +sqoop-metastore - Host a shared repository for saved Sqoop jobs. + +SYNOPSIS +-------- +'sqoop-metastore' + +'sqoop metastore' + + +DESCRIPTION +----------- + +include::../user/metastore-purpose.txt[] + +OPTIONS +------- + +Metastore management options +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +--shutdown:: + Shuts down a running metastore instance on the same machine + +ENVIRONMENT +----------- + +See 'sqoop(1)' + + +//// + Licensed to Cloudera, Inc. under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + Cloudera, Inc. licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. 
+//// + diff --git a/src/docs/user/SqoopUserGuide.txt b/src/docs/user/SqoopUserGuide.txt index 921ec5fc..aa7514c2 100644 --- a/src/docs/user/SqoopUserGuide.txt +++ b/src/docs/user/SqoopUserGuide.txt @@ -35,6 +35,8 @@ include::import-all-tables.txt[] include::export.txt[] +include::saved-jobs.txt[] + include::codegen.txt[] include::create-hive-table.txt[] diff --git a/src/docs/user/import.txt b/src/docs/user/import.txt index 4a6519a4..966d3c56 100644 --- a/src/docs/user/import.txt +++ b/src/docs/user/import.txt @@ -234,6 +234,55 @@ data to a temporary directory and then rename the files into the normal target directory in a manner that does not conflict with existing filenames in that directory. +Incremental Imports +^^^^^^^^^^^^^^^^^^^ + +Sqoop provides an incremental import mode which can be used to retrieve +only rows newer than some previously-imported set of rows. + +The following arguments control incremental imports: + + +.Incremental import arguments: +[grid="all"] +`-----------------------------`-------------------------------------- +Argument Description +--------------------------------------------------------------------- ++\--check-column (col)+ Specifies the column to be examined \ + when determining which rows to import. ++\--incremental (mode)+ Specifies how Sqoop determines which \ + rows are new. Legal values for +mode+\ + include +append+ and +lastmodified+. ++\--last-value (value)+ Specifies the maximum value of the \ + check column from the previous import. +--------------------------------------------------------------------- + + +Sqoop supports two types of incremental imports: +append+ and +lastmodified+. +You can use the +\--incremental+ argument to specify the type of incremental +import to perform. + +You should specify +append+ mode when importing a table where new rows are +continually being added with increasing row id values. You specify the column +containing the row's id with +\--check-column+. 
Sqoop imports rows where the +check column has a value greater than the one specified with +\--last-value+. + +An alternate table update strategy supported by Sqoop is called +lastmodified+ +mode. You should use this when rows of the source table may be updated, and +each such update will set the value of a last-modified column to the current +timestamp. Rows where the check column holds a timestamp more recent than the +timestamp specified with +\--last-value+ are imported. + +At the end of an incremental import, the value which should be specified as ++\--last-value+ for a subsequent import is printed to the screen. When running +a subsequent import, you should specify +\--last-value+ in this way to ensure +you import only the new or updated data. This is handled automatically by +creating an incremental import as a saved job, which is the preferred +mechanism for performing a recurring incremental import. See the section on +saved jobs later in this document for more information. + + + File Formats ^^^^^^^^^^^^ diff --git a/src/docs/user/job-purpose.txt b/src/docs/user/job-purpose.txt new file mode 100644 index 00000000..db15b59d --- /dev/null +++ b/src/docs/user/job-purpose.txt @@ -0,0 +1,27 @@ + +//// + Licensed to Cloudera, Inc. under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + Cloudera, Inc. licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. 
+//// + +The job tool allows you to create and work with saved jobs. Saved jobs +remember the parameters used to specify a job, so they can be +re-executed by invoking the job by its handle. + +If a saved job is configured to perform an incremental import, state regarding +the most recently imported rows is updated in the saved job to allow the job +to continually import only the newest rows. + + diff --git a/src/docs/user/metastore-purpose.txt b/src/docs/user/metastore-purpose.txt new file mode 100644 index 00000000..60eb1734 --- /dev/null +++ b/src/docs/user/metastore-purpose.txt @@ -0,0 +1,26 @@ + +//// + Licensed to Cloudera, Inc. under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + Cloudera, Inc. licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +//// + +The +metastore+ tool configures Sqoop to host a shared metadata repository. +Multiple users and/or remote users can define and execute saved jobs (created +with +sqoop job+) defined in this metastore. + +Clients must be configured to connect to the metastore in +sqoop-site.xml+ or +with the +--meta-connect+ argument. + + diff --git a/src/docs/user/saved-jobs.txt b/src/docs/user/saved-jobs.txt new file mode 100644 index 00000000..6fc3e641 --- /dev/null +++ b/src/docs/user/saved-jobs.txt @@ -0,0 +1,233 @@ + +//// + Licensed to Cloudera, Inc. under one or more + contributor license agreements. 
See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + Cloudera, Inc. licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +//// + +Saved Jobs +---------- + +Imports and exports can be repeatedly performed by issuing the same command +multiple times. Especially when using the incremental import capability, +this is an expected scenario. + +Sqoop allows you to define _saved jobs_ which make this process easier. A +saved job records the configuration information required to execute a +Sqoop command at a later time. The section on the +sqoop-job+ tool +describes how to create and work with saved jobs. + +By default, job descriptions are saved to a private repository stored +in +$HOME/.sqoop/+. You can configure Sqoop to instead use a shared +_metastore_, which makes saved jobs available to multiple users across a +shared cluster. Starting the metastore is covered by the section on the ++sqoop-metastore+ tool. + + ++sqoop-job+ +----------- + +Purpose +~~~~~~~ + +include::job-purpose.txt[] + +Syntax +~~~~~~ + +---- +$ sqoop job (generic-args) (job-args) [-- [subtool-name] (subtool-args)] +$ sqoop-job (generic-args) (job-args) [-- [subtool-name] (subtool-args)] +---- + +Although the Hadoop generic arguments must precede any job arguments, +the job arguments can be entered in any order with respect to one +another. 
+ +.Job management options: +[grid="all"] +`---------------------------`------------------------------------------ +Argument Description +----------------------------------------------------------------------- ++\--create <job-id>+ Define a new saved job with the specified \ + job-id (name). A second Sqoop \ + command-line, separated by a +\--+ should \ + be specified; this defines the saved job. ++\--delete <job-id>+ Delete a saved job. ++\--exec <job-id>+ Given a job defined with +\--create+, run \ + the saved job. ++\--show <job-id>+ Show the parameters for a saved job. ++\--list+ List all saved jobs. +----------------------------------------------------------------------- + +Creating saved jobs is done with the +\--create+ action. This operation +requires a +\--+ followed by a tool name and its arguments. The tool and +its arguments will form the basis of the saved job. Consider: + +---- +$ sqoop job --create myjob -- import --connect jdbc:mysql://example.com/db \ + --table mytable +---- + +This creates a job named +myjob+ which can be executed later. The job is not +run. This job is now available in the list of saved jobs: + +---- +$ sqoop job --list +Available jobs: + myjob +---- + +We can inspect the configuration of a job with the +show+ action: + +---- + $ sqoop job --show myjob + Job: myjob + Tool: import + Options: + ---------------------------- + direct.import = false + codegen.input.delimiters.record = 0 + hdfs.append.dir = false + db.table = mytable + ... +---- + +And if we are satisfied with it, we can run the job with +exec+: + +---- +$ sqoop job --exec myjob +10/08/19 13:08:45 INFO tool.CodeGenTool: Beginning code generation +... +---- + +The +exec+ action allows you to override arguments of the saved job +by supplying them after a +\--+. For example, if the database were +changed to require a username, we could specify the username and +password with: + +---- +$ sqoop job --exec myjob -- --username someuser -P +Enter password: +... 
+---- + +.Metastore connection options: +[grid="all"] +`----------------------------`----------------------------------------- +Argument Description +----------------------------------------------------------------------- ++\--meta-connect <jdbc-uri>+ Specifies the JDBC connect string used \ + to connect to the metastore +----------------------------------------------------------------------- + +By default, a private metastore is instantiated in +$HOME/.sqoop+. If +you have configured a hosted metastore with the +sqoop-metastore+ +tool, you can connect to it by specifying the +\--meta-connect+ +argument. This is a JDBC connect string just like the ones used to +connect to databases for import. + +In +conf/sqoop-site.xml+, you can configure ++sqoop.metastore.client.autoconnect.url+ with this address, so you do not have +to supply +\--meta-connect+ to use a remote metastore. This parameter can +also be modified to move the private metastore to a location on your +filesystem other than your home directory. + +If you configure +sqoop.metastore.client.enable.autoconnect+ with the +value +false+, then you must explicitly supply +\--meta-connect+. + +.Common options: +[grid="all"] +`---------------------------`------------------------------------------ +Argument Description +----------------------------------------------------------------------- ++\--help+ Print usage instructions ++\--verbose+ Print more information while working +----------------------------------------------------------------------- + +Saved jobs and passwords +~~~~~~~~~~~~~~~~~~~~~~~~ + +The Sqoop metastore is not a secure resource. Multiple users can access +its contents. For this reason, Sqoop does not store passwords in the +metastore. If you create a job that requires a password, you will be +prompted for that password each time you execute the job. + +You can enable passwords in the metastore by setting ++sqoop.metastore.client.record.password+ to +true+ in the configuration. 
+ + +Saved jobs and incremental imports +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Incremental imports are performed by comparing the values in a _check column_ +against a reference value for the most recent import. For example, if the ++\--incremental append+ argument was specified, along with +\--check-column +id+ and +\--last-value 100+, all rows with +id > 100+ will be imported. +If an incremental import is run from the command line, the value which +should be specified as +\--last-value+ in a subsequent incremental import +will be printed to the screen for your reference. If an incremental import is +run from a saved job, this value will be retained in the saved job. Subsequent +runs of +sqoop job \--exec someIncrementalJob+ will continue to import only +newer rows than those previously imported. + + ++sqoop-metastore+ +----------------- + +Purpose +~~~~~~~ + +include::metastore-purpose.txt[] + +Syntax +~~~~~~ + +---- +$ sqoop metastore (generic-args) (metastore-args) +$ sqoop-metastore (generic-args) (metastore-args) +---- + +Although the Hadoop generic arguments must precede any metastore arguments, +the metastore arguments can be entered in any order with respect to one +another. + +.Metastore management options: +[grid="all"] +`---------------------------`------------------------------------------ +Argument Description +----------------------------------------------------------------------- ++\--shutdown+ Shuts down a running metastore instance \ + on the same machine. +----------------------------------------------------------------------- + +Running +sqoop-metastore+ launches a shared HSQLDB database instance on +the current machine. Clients can connect to this metastore and create jobs +which can be shared between users for execution. + +The location of the metastore's files on disk is controlled by the ++sqoop.metastore.server.location+ property in +conf/sqoop-site.xml+. +This should point to a directory on the local filesystem. 
+ +The metastore is available over TCP/IP. The port is controlled by the ++sqoop.metastore.server.port+ configuration parameter, and defaults to 16000. + +Clients should connect to the metastore by specifying ++sqoop.metastore.client.autoconnect.url+ or +\--meta-connect+ with the +value +jdbc:hsqldb:hsql://<server-name>:<port>/sqoop+. For example, ++jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop+. + +This metastore may be hosted on a machine within the Hadoop cluster, or +elsewhere on the network. +
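Reviewer sketch (not part of the patch): the pieces documented above can be combined into one recurring incremental import stored in a shared metastore. The host and port come from the example connect string in the docs; the job name `nightly_import`, the `DRY_RUN` switch, and the `run` helper are hypothetical illustration aids and not part of Sqoop. The block only prints the commands unless `DRY_RUN=0` is set, so it is safe to run on a machine without Sqoop installed.

```shell
#!/bin/sh
# Host and port taken from the documentation's example; adjust for your site.
METASTORE_HOST="metaserver.example.com"
METASTORE_PORT=16000
META_URL="jdbc:hsqldb:hsql://${METASTORE_HOST}:${METASTORE_PORT}/sqoop"

# Hypothetical job name used only for this illustration.
JOB_NAME="nightly_import"

# Hypothetical helper: print the command by default, execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Define the saved job once. --last-value is omitted because the docs state
# it is unnecessary for an initial import.
run sqoop job --meta-connect "$META_URL" --create "$JOB_NAME" -- \
  import --connect jdbc:mysql://example.com/db --table mytable \
  --incremental append --check-column id

# Execute the job. Per the docs, the last observed check-column value is
# retained in the saved job, so later runs import only newer rows.
run sqoop job --meta-connect "$META_URL" --exec "$JOB_NAME"
```

If `sqoop.metastore.client.autoconnect.url` is set in `conf/sqoop-site.xml`, the docs indicate that the `--meta-connect` argument can be dropped from both commands.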