From 739733853e2daeef9e76aa1240b8b552bc881261 Mon Sep 17 00:00:00 2001 From: Jarek Jarcec Cecho Date: Wed, 8 Apr 2015 20:14:19 -0700 Subject: [PATCH] SQOOP-2264: Exclude and remove SqoopUserGuide.xml from git repository (Qian Xu via Jarek Jarcec Cecho) (cherry picked from commit c51c7daeeda1704375ceacb97fc5d5793a6b9f71) Conflicts: src/docs/user/SqoopUserGuide.xml --- src/docs/.gitignore | 2 + src/docs/Makefile | 1 + src/docs/user/SqoopUserGuide.xml | 1151 ------------------------------ 3 files changed, 3 insertions(+), 1151 deletions(-) delete mode 100644 src/docs/user/SqoopUserGuide.xml diff --git a/src/docs/.gitignore b/src/docs/.gitignore index 1ac4643e..519ebf4b 100644 --- a/src/docs/.gitignore +++ b/src/docs/.gitignore @@ -17,3 +17,5 @@ /Sqoop-manpage.xml /sqoop.1 /Sqoop-web.html +/dev/SqoopDevGuide.xml +/user/SqoopUserGuide.xml diff --git a/src/docs/Makefile b/src/docs/Makefile index 5bdf2f8c..224c0234 100644 --- a/src/docs/Makefile +++ b/src/docs/Makefile @@ -105,6 +105,7 @@ supportfiles: clean: -rm $(BUILD_DIR)/sqoop-*.1.gz + -rm $(BUILD_DIR)/SqoopDevGuide.html -rm $(BUILD_DIR)/SqoopUserGuide.html -rm user/SqoopUserGuide.html -rm dev/SqoopDevGuide.html diff --git a/src/docs/user/SqoopUserGuide.xml b/src/docs/user/SqoopUserGuide.xml deleted file mode 100644 index 61784264..00000000 --- a/src/docs/user/SqoopUserGuide.xml +++ /dev/null @@ -1,1151 +0,0 @@ - - - - - -
- - Sqoop User Guide (v1.4.6) - - Licensed to the Apache Software Foundation (ASF) under one - or more contributor license agreements. See the NOTICE file - distributed with this work for additional information - regarding copyright ownership. The ASF licenses this file - to you under the Apache License, Version 2.0 (the - "License"); you may not use this file except in compliance - with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. -
Introduction

Sqoop is a tool designed to transfer data between Hadoop and relational databases or mainframes. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle, or from a mainframe, into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.

Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.

This document describes how to get started using Sqoop to move data between databases and Hadoop, or from a mainframe to Hadoop, and provides reference information for the operation of the Sqoop command-line tool suite. This document is intended for:

- System and application programmers
- System administrators
- Database administrators
- Data analysts
- Data engineers
-
-Supported Releases -This documentation applies to Sqoop v1.4.6. -
-
Sqoop Releases

Sqoop is an open source software product of the Apache Software Foundation. Software development for Sqoop occurs at http://sqoop.apache.org. At that site you can obtain:

- New releases of Sqoop as well as its most recent source code
- An issue tracker
- A wiki that contains Sqoop documentation
-
Prerequisites

The following prerequisite knowledge is required for this product:

- Basic computer technology and terminology
- Familiarity with command-line interfaces such as bash
- Relational database management systems
- Basic familiarity with the purpose and operation of Hadoop

Before you can use Sqoop, a release of Hadoop must be installed and configured. Sqoop currently supports four major Hadoop release lines: 0.20, 0.23, 1.0, and 2.0.

This document assumes you are using a Linux or Linux-like environment. If you are using Windows, you may be able to use cygwin to accomplish most of the following tasks. If you are using Mac OS X, you should see few (if any) compatibility errors. Sqoop is predominantly operated and tested on Linux.
-
Basic Usage

With Sqoop, you can import data from a relational database system or a mainframe into HDFS. The input to the import process is either a database table or a set of mainframe datasets. For databases, Sqoop will read the table row by row into HDFS. For mainframe datasets, Sqoop will read records from each mainframe dataset into HDFS. The output of this import process is a set of files containing a copy of the imported table or datasets.

The import process is performed in parallel. For this reason, the output will be in multiple files. These files may be delimited text files (for example, with commas or tabs separating each field), or binary Avro or SequenceFiles containing serialized record data.

A by-product of the import process is a generated Java class which can encapsulate one row of the imported table. This class is used during the import process by Sqoop itself. The Java source code for this class is also provided to you, for use in subsequent MapReduce processing of the data. This class can serialize and deserialize data to and from the SequenceFile format. It can also parse the delimited-text form of a record. These abilities allow you to quickly develop MapReduce applications that use the HDFS-stored records in your processing pipeline. You are also free to parse the delimited record data yourself, using any other tools you prefer.

After manipulating the imported records (for example, with MapReduce or Hive) you may have a result data set which you can then export back to the relational database. Sqoop's export process will read a set of delimited text files from HDFS in parallel, parse them into records, and insert them as new rows in a target database table, for consumption by external applications or users.

Sqoop includes some other commands which allow you to inspect the database you are working with. For example, you can list the available database schemas (with the sqoop-list-databases tool) and tables within a schema (with the sqoop-list-tables tool). Sqoop also includes a primitive SQL execution shell (the sqoop-eval tool).

Most aspects of the import, code generation, and export processes can be customized. For databases, you can control the specific row range or columns imported. You can specify particular delimiters and escape characters for the file-based representation of the data, as well as the file format used. You can also control the class or package names used in generated code. Subsequent sections of this document explain how to specify these and other arguments to Sqoop.
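For illustration, the inspection tools mentioned above can be run with the same kind of connect string used by the import examples later in this guide; a minimal sketch in which the host, database, and credential file are placeholders:

$ sqoop list-databases --connect jdbc:mysql://database.example.com/ \
    --username venkatesh --password-file ${user.home}/.password
$ sqoop list-tables --connect jdbc:mysql://database.example.com/employees \
    --username venkatesh --password-file ${user.home}/.password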
-
Sqoop Tools

Sqoop is a collection of related tools. To use Sqoop, you specify the tool you want to use and the arguments that control the tool.

If Sqoop is compiled from its own source, you can run Sqoop without a formal installation process by running the bin/sqoop program. Users of a packaged deployment of Sqoop (such as an RPM shipped with Apache Bigtop) will see this program installed as /usr/bin/sqoop. The remainder of this documentation will refer to this program as sqoop. For example:

$ sqoop tool-name [tool-arguments]

Examples that begin with a $ character indicate that the commands must be entered at a terminal prompt (such as bash). The $ character represents the prompt itself; you should not start these commands by typing a $. You can also enter commands inline in the text of a paragraph; for example, sqoop help. These examples do not show a $ prefix, but you should enter them the same way. Don't confuse the $ shell prompt in the examples with the $ that precedes an environment variable name. For example, the string literal $HADOOP_HOME includes a "$".

Sqoop ships with a help tool. To display a list of all available tools, type the following command:

$ sqoop help
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  import-mainframe   Import mainframe datasets to HDFS
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.

You can display help for a specific tool by entering: sqoop help (tool-name); for example, sqoop help import. You can also add the --help argument to any command: sqoop import --help.
-Using Command Aliases -In addition to typing the sqoop (toolname) syntax, you can use alias -scripts that specify the sqoop-(toolname) syntax. For example, the -scripts sqoop-import, sqoop-export, etc. each select a specific -tool. -
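For example, the following two invocations are equivalent (the connect string and table are the placeholder values used elsewhere in this guide):

$ sqoop import --connect jdbc:mysql://localhost/db --username foo --table TEST
$ sqoop-import --connect jdbc:mysql://localhost/db --username foo --table TEST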
-
Controlling the Hadoop Installation

You invoke Sqoop through the program launch capability provided by Hadoop. The sqoop command-line program is a wrapper which runs the bin/hadoop script shipped with Hadoop. If you have multiple installations of Hadoop present on your machine, you can select the Hadoop installation by setting the $HADOOP_COMMON_HOME and $HADOOP_MAPRED_HOME environment variables.

For example:

$ HADOOP_COMMON_HOME=/path/to/some/hadoop \
  HADOOP_MAPRED_HOME=/path/to/some/hadoop-mapreduce \
  sqoop import --arguments...

or:

$ export HADOOP_COMMON_HOME=/some/path/to/hadoop
$ export HADOOP_MAPRED_HOME=/some/path/to/hadoop-mapreduce
$ sqoop import --arguments...

If either of these variables is not set, Sqoop will fall back to $HADOOP_HOME. If that is not set either, Sqoop will use the default installation locations for Apache Bigtop, /usr/lib/hadoop and /usr/lib/hadoop-mapreduce, respectively.

The active Hadoop configuration is loaded from $HADOOP_HOME/conf/, unless the $HADOOP_CONF_DIR environment variable is set.
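For illustration, a sketch of pointing Sqoop at an alternate Hadoop configuration directory via $HADOOP_CONF_DIR; the directory path and connection values are placeholders:

$ export HADOOP_CONF_DIR=/etc/hadoop/conf.cluster2
$ sqoop import --connect jdbc:mysql://localhost/db --username foo --table TEST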
-
Using Generic and Specific Arguments

To control the operation of each Sqoop tool, you use generic and specific arguments.

For example:

$ sqoop help import
usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]

Common arguments:
   --connect <jdbc-uri>            Specify JDBC connect string
   --connect-manager <class-name>  Specify connection manager class to use
   --driver <class-name>           Manually specify JDBC driver class to use
   --hadoop-mapred-home <dir>      Override $HADOOP_MAPRED_HOME
   --help                          Print usage instructions
   --password-file                 Set path for file containing authentication password
   -P                              Read password from console
   --password <password>           Set authentication password
   --username <username>           Set authentication username
   --verbose                       Print more information while working
   --hadoop-home <dir>             Deprecated. Override $HADOOP_HOME

[...]

Generic Hadoop command-line arguments:
(must precede any tool-specific arguments)
Generic options supported are
-conf <configuration file>                    specify an application configuration file
-D <property=value>                           use value for given property
-fs <local|namenode:port>                     specify a namenode
-jt <local|jobtracker:port>                   specify a job tracker
-files <comma separated list of files>        specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>       specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>  specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

You must supply the generic arguments -conf, -D, and so on after the tool name but before any tool-specific arguments (such as --connect). Note that generic Hadoop arguments are preceded by a single dash character (-), whereas tool-specific arguments start with two dashes (--), unless they are single character arguments such as -P.

The -conf, -D, -fs and -jt arguments control the configuration and Hadoop server settings. For example, -D mapred.job.name=<job_name> can be used to set the name of the MR job that Sqoop launches; if it is not specified, the name defaults to the jar name for the job, which is derived from the table name used.

The -files, -libjars, and -archives arguments are not typically used with Sqoop, but they are included as part of Hadoop's internal argument-parsing system.
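As a sketch of this ordering, the generic -D option below is placed immediately after the tool name and before the tool-specific options; the job name and connection values are placeholders:

$ sqoop import -D mapred.job.name=nightly-test-import \
    --connect jdbc:mysql://localhost/db --username foo --table TEST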
-
Using Options Files to Pass Arguments

When using Sqoop, command line options that do not change from invocation to invocation can be put in an options file for convenience. An options file is a text file where each line identifies an option in the order that it would otherwise appear on the command line. Options files allow specifying a single option on multiple lines by using the back-slash character at the end of intermediate lines. Also supported are comments within options files that begin with the hash character. Comments must be specified on a new line and may not be mixed with option text. All comments and empty lines are ignored when options files are expanded. Unless options appear as quoted strings, any leading or trailing spaces are ignored. Quoted strings, if used, must not extend beyond the line on which they are specified.

Options files can be specified anywhere on the command line as long as the options within them follow the otherwise prescribed rules of option ordering. For instance, regardless of where the options are loaded from, they must follow the ordering such that generic options appear first, tool-specific options next, finally followed by options that are intended to be passed to child programs.

To specify an options file, simply create an options file in a convenient location and pass it to the command line via the --options-file argument.

Whenever an options file is specified, it is expanded on the command line before the tool is invoked. You can specify more than one options file within the same invocation if needed.

For example, the following Sqoop invocation for import can alternatively be specified as shown below:

$ sqoop import --connect jdbc:mysql://localhost/db --username foo --table TEST

$ sqoop --options-file /users/homer/work/import.txt --table TEST

where the options file /users/homer/work/import.txt contains the following:

import
--connect
jdbc:mysql://localhost/db
--username
foo

The options file can have empty lines and comments for readability purposes. So the above example would work exactly the same if the options file /users/homer/work/import.txt contained the following:

#
# Options file for Sqoop import
#

# Specifies the tool being invoked
import

# Connect parameter and value
--connect
jdbc:mysql://localhost/db

# Username parameter and value
--username
foo

#
# Remaining options should be specified in the command line.
#
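A connection-only options file can also be shared between tools. A sketch, assuming a hypothetical file /users/homer/work/connection.txt that contains only the --connect and --username lines shown above (and not the tool name):

$ sqoop import --options-file /users/homer/work/connection.txt --table TEST
$ sqoop list-tables --options-file /users/homer/work/connection.txt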
-
-Using Tools -The following sections will describe each tool’s operation. The -tools are listed in the most likely order you will find them useful. -
-
-
-<literal>sqoop-import</literal> -
-Purpose -The import tool imports an individual table from an RDBMS to HDFS. -Each row from a table is represented as a separate record in HDFS. -Records can be stored as text files (one record per line), or in -binary representation as Avro or SequenceFiles. -
-
Syntax

$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)

While the Hadoop generic arguments must precede any import arguments, you can type the import arguments in any order with respect to one another.

In this document, arguments are grouped into collections organized by function. Some collections are present in several tools (for example, the "common" arguments). An extended description of their functionality is given only on the first presentation in this document.

Common arguments:

  --connect <jdbc-uri>                 Specify JDBC connect string
  --connection-manager <class-name>    Specify connection manager class to use
  --driver <class-name>                Manually specify JDBC driver class to use
  --hadoop-mapred-home <dir>           Override $HADOOP_MAPRED_HOME
  --help                               Print usage instructions
  --password-file                      Set path for a file containing the authentication password
  -P                                   Read password from console
  --password <password>                Set authentication password
  --username <username>                Set authentication username
  --verbose                            Print more information while working
  --connection-param-file <filename>   Optional properties file that provides connection parameters
  --relaxed-isolation                  Set connection transaction isolation to read uncommitted for the mappers.
-
Connecting to a Database Server

Sqoop is designed to import tables from a database into HDFS. To do so, you must specify a connect string that describes how to connect to the database. The connect string is similar to a URL, and is communicated to Sqoop with the --connect argument. This describes the server and database to connect to; it may also specify the port. For example:

$ sqoop import --connect jdbc:mysql://database.example.com/employees

This string will connect to a MySQL database named employees on the host database.example.com. It's important that you do not use the URL localhost if you intend to use Sqoop with a distributed Hadoop cluster. The connect string you supply will be used on TaskTracker nodes throughout your MapReduce cluster; if you specify the literal name localhost, each node will connect to a different database (or more likely, no database at all). Instead, you should use the full hostname or IP address of the database host that can be seen by all your remote nodes.

You might need to authenticate against the database before you can access it. You can use the --username argument to supply a username to the database. Sqoop provides a couple of different ways, secure and non-secure, to supply a password to the database, which are detailed below.

Secure way of supplying a password to the database: Save the password in a file in the user's home directory with 400 permissions and specify the path to that file using the --password-file argument; this is the preferred method of entering credentials. Sqoop will then read the password from the file and pass it to the MapReduce cluster using secure means, without exposing the password in the job configuration. The file containing the password can be on either the local FS or HDFS. For example:

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --username venkatesh --password-file ${user.home}/.password

Sqoop will read the entire content of the password file and use it as the password. This will include any trailing white space characters, such as newline characters, that are added by default by most text editors. You need to make sure that your password file contains only the characters that belong to your password. On the command line you can use the echo command with the -n switch to store a password without any trailing white space characters. For example, to store the password secret you would call echo -n "secret" > password.file.

Another way of supplying passwords is using the -P argument, which will read a password from a console prompt.

Protecting the password from prying eyes: Hadoop 2.6.0 provides an API to separate password storage from applications. This API is called the credential provider API, and there is a new credential command line tool to manage passwords and their aliases. The passwords are stored with their aliases in a keystore that is password protected. The keystore password can be provided at a password prompt on the command line, via an environment variable, or defaulted to a software-defined constant. Please check the Hadoop documentation on the usage of this facility.

Once the password is stored using the Credential Provider facility and the Hadoop configuration has been suitably updated, all applications can optionally use the alias in place of the actual password and at runtime resolve the alias to the password to use.
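For illustration, one way to create and stage the password file described above, keeping it on HDFS with restricted permissions (the file and directory paths are examples only):

$ echo -n "secret" > .password
$ hadoop fs -put .password /user/venkatesh/.password
$ hadoop fs -chmod 400 /user/venkatesh/.password
$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --username venkatesh --password-file /user/venkatesh/.password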
Since the keystore or similar technology used for storing the credential provider is shared across components, passwords for various applications, various databases, and other passwords can be securely stored in them, and only the alias needs to be exposed in configuration files, protecting the password from being visible.

Sqoop has been enhanced to allow usage of this functionality if it is available in the underlying Hadoop version being used. One new option has been introduced to provide the alias on the command line instead of the actual password (--password-alias). The argument value of this option is the alias on the storage associated with the actual password. Example usage is as follows:

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --username dbuser --password-alias mydb.password.alias

Similarly, if the command line option is not preferred, the alias can be saved in the file provided with the --password-file option. Along with this, the Sqoop configuration parameter org.apache.sqoop.credentials.loader.class should be set to the classname that provides the alias resolution: org.apache.sqoop.util.password.CredentialProviderPasswordLoader

Example usage is as follows (assuming .password-alias contains the alias for the real password):

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --username dbuser --password-file ${user.home}/.password-alias

The --password parameter is insecure, as other users may be able to read your password from the command-line arguments via the output of programs such as ps. The -P argument is the preferred method over using the --password argument. Credentials may still be transferred between nodes of the MapReduce cluster using insecure means. For example:

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --username aaron --password 12345

Sqoop automatically supports several databases, including MySQL. Connect strings beginning with jdbc:mysql:// are handled automatically in Sqoop. (A full list of databases with built-in support is provided in the "Supported Databases" section. For some, you may need to install the JDBC driver yourself.)

You can use Sqoop with any other JDBC-compliant database. First, download the appropriate JDBC driver for the type of database you want to import, and install the .jar file in the $SQOOP_HOME/lib directory on your client machine. (This will be /usr/lib/sqoop/lib if you installed from an RPM or Debian package.) Each driver .jar file also has a specific driver class which defines the entry-point to the driver. For example, MySQL's Connector/J library has a driver class of com.mysql.jdbc.Driver. Refer to your database vendor-specific documentation to determine the main driver class. This class must be provided as an argument to Sqoop with --driver.

For example, to connect to a SQLServer database, first download the driver from microsoft.com and install it in your Sqoop lib path. Then run Sqoop. For example:

$ sqoop import --driver com.microsoft.jdbc.sqlserver.SQLServerDriver \
    --connect <connect-string> ...

When connecting to a database using JDBC, you can optionally specify extra JDBC parameters via a property file using the option --connection-param-file. The contents of this file are parsed as standard Java properties and passed into the driver while creating a connection. The parameters specified via the optional property file are only applicable to JDBC connections.
Any fastpath connectors that use connections other than JDBC will ignore these parameters.

Validation arguments (more details in the Validation section):

  --validate                                 Enable validation of data copied, supports single table copy only.
  --validator <class-name>                   Specify validator class to use.
  --validation-threshold <class-name>        Specify validation threshold class to use.
  --validation-failurehandler <class-name>   Specify validation failure handler class to use.
Import control arguments:

  --append                          Append data to an existing dataset in HDFS
  --as-avrodatafile                 Imports data to Avro Data Files
  --as-sequencefile                 Imports data to SequenceFiles
  --as-textfile                     Imports data as plain text (default)
  --as-parquetfile                  Imports data to Parquet Files
  --boundary-query <statement>      Boundary query to use for creating splits
  --columns <col,col,col…>          Columns to import from table
  --delete-target-dir               Delete the import target directory if it exists
  --direct                          Use direct connector if it exists for the database
  --fetch-size <n>                  Number of entries to read from database at once.
  --inline-lob-limit <n>            Set the maximum size for an inline LOB
  -m,--num-mappers <n>              Use n map tasks to import in parallel
  -e,--query <statement>            Import the results of statement.
  --split-by <column-name>          Column of the table used to split work units
  --table <table-name>              Table to read
  --target-dir <dir>                HDFS destination dir
  --warehouse-dir <dir>             HDFS parent for table destination
  --where <where clause>            WHERE clause to use during import
  -z,--compress                     Enable compression
  --compression-codec <c>           Use Hadoop codec (default gzip)
  --null-string <null-string>       The string to be written for a null value for string columns
  --null-non-string <null-string>   The string to be written for a null value for non-string columns
The --null-string and --null-non-string arguments are optional. If not specified, then the string "null" will be used.
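For example, a sketch of overriding both null representations at import time; the value \\N is a common choice when the data will later be processed by Hive, and the table name is illustrative:

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --null-string '\\N' --null-non-string '\\N'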
-
Selecting the Data to Import

Sqoop typically imports data in a table-centric fashion. Use the --table argument to select the table to import. For example, --table employees. This argument can also identify a VIEW or other table-like entity in a database.

By default, all columns within a table are selected for import. Imported data is written to HDFS in its "natural order"; that is, a table containing columns A, B, and C results in an import of data such as:

A1,B1,C1
A2,B2,C2
...

You can select a subset of columns and control their ordering by using the --columns argument. This should include a comma-delimited list of columns to import. For example: --columns "name,employee_id,jobtitle".

You can control which rows are imported by adding a SQL WHERE clause to the import statement. By default, Sqoop generates statements of the form SELECT <column list> FROM <table name>. You can append a WHERE clause to this with the --where argument. For example: --where "id > 400". Only rows where the id column has a value greater than 400 will be imported.

By default, Sqoop will use the query select min(<split-by>), max(<split-by>) from <table name> to find out boundaries for creating splits. In some cases this query is not the most optimal, so you can specify any arbitrary query returning two numeric columns using the --boundary-query argument.
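Putting these options together, a sketch of a column- and row-filtered import using the illustrative names above:

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --columns "name,employee_id,jobtitle" \
    --where "id > 400" --target-dir /user/foo/employees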
-
Free-form Query Imports

Sqoop can also import the result set of an arbitrary SQL query. Instead of using the --table, --columns and --where arguments, you can specify a SQL statement with the --query argument.

When importing a free-form query, you must specify a destination directory with --target-dir.

If you want to import the results of a query in parallel, then each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop. Your query must include the token $CONDITIONS, which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with --split-by.

For example:

$ sqoop import \
  --query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
  --split-by a.id --target-dir /user/foo/joinresults

Alternately, the query can be executed once and imported serially, by specifying a single map task with -m 1:

$ sqoop import \
  --query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
  -m 1 --target-dir /user/foo/joinresults

If you are issuing the query wrapped with double quotes ("), you will have to use \$CONDITIONS instead of just $CONDITIONS to prevent your shell from treating it as a shell variable. For example, a double quoted query may look like: "SELECT * FROM x WHERE a='foo' AND \$CONDITIONS"

The facility of using free-form queries in the current version of Sqoop is limited to simple queries where there are no ambiguous projections and no OR conditions in the WHERE clause. Use of complex queries, such as queries that have sub-queries or joins leading to ambiguous projections, can lead to unexpected results.
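For illustration, the double-quoted form above as a complete serial command; the connect string, table, and column names are placeholders:

$ sqoop import --connect jdbc:mysql://database.example.com/db \
    --query "SELECT * FROM x WHERE a='foo' AND \$CONDITIONS" \
    -m 1 --target-dir /user/foo/x_results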
-
Controlling Parallelism

Sqoop imports data in parallel from most database sources. You can specify the number of map tasks (parallel processes) to use to perform the import by using the -m or --num-mappers argument. Each of these arguments takes an integer value which corresponds to the degree of parallelism to employ. By default, four tasks are used. Some databases may see improved performance by increasing this value to 8 or 16. Do not increase the degree of parallelism beyond that available within your MapReduce cluster; tasks will run serially and will likely increase the amount of time required to perform the import. Likewise, do not increase the degree of parallelism higher than that which your database can reasonably support. Connecting 100 concurrent clients to your database may increase the load on the database server to a point where performance suffers as a result.

When performing parallel imports, Sqoop needs a criterion by which it can split the workload. Sqoop uses a splitting column to split the workload. By default, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly-sized components of the total range. For example, if you had a table with a primary key column of id whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes which each execute SQL statements of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.

If the actual values for the primary key are not uniformly distributed across its range, then this can result in unbalanced tasks. You should explicitly choose a different column with the --split-by argument. For example, --split-by employee_id. Sqoop cannot currently split on multi-column indices. If your table has no index column, or has a multi-column key, then you must also manually choose a splitting column.
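A brief sketch combining the two controls discussed above, eight map tasks and an explicit split column, using the illustrative names from this section:

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --split-by employee_id --num-mappers 8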
-
Controlling Distributed Cache

Sqoop will copy the jars in the $SQOOP_HOME/lib folder to the job cache every time it starts a Sqoop job. When launched by Oozie this is unnecessary, since Oozie uses its own Sqoop share lib, which keeps the Sqoop dependencies in the distributed cache. Oozie will localize the Sqoop dependencies on each worker node only once, during the first Sqoop job, and reuse the jars on the worker nodes for subsequent jobs. Using the option --skip-dist-cache in the Sqoop command when launched by Oozie will skip the step in which Sqoop copies its dependencies to the job cache, saving massive I/O.
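For illustration, the option is simply added to the normal invocation that Oozie launches; the connect string and table are placeholders:

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --skip-dist-cache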
-
Controlling the Import Process

By default, the import process will use JDBC, which provides a reasonable cross-vendor import channel. Some databases can perform imports in a more high-performance fashion by using database-specific data movement tools. For example, MySQL provides the mysqldump tool, which can export data from MySQL to other systems very quickly. By supplying the --direct argument, you are specifying that Sqoop should attempt the direct import channel. This channel may be higher performance than using JDBC.

Details about the use of direct mode with each specific RDBMS, installation requirements, available options and limitations can be found in the connector-specific notes later in this document.

By default, Sqoop will import a table named foo to a directory named foo inside your home directory in HDFS. For example, if your username is someuser, then the import tool will write to /user/someuser/foo/(files). You can adjust the parent directory of the import with the --warehouse-dir argument. For example:

$ sqoop import --connect <connect-str> --table foo --warehouse-dir /shared \
    ...

This command would write to a set of files in the /shared/foo/ directory.

You can also explicitly choose the target directory, like so:

$ sqoop import --connect <connect-str> --table foo --target-dir /dest \
    ...

This will import the files into the /dest directory. --target-dir is incompatible with --warehouse-dir.

When using direct mode, you can specify additional arguments which should be passed to the underlying tool. If the argument -- is given on the command line, then subsequent arguments are sent directly to the underlying tool. For example, the following adjusts the character set used by mysqldump:

$ sqoop import --connect jdbc:mysql://server.foo.com/db --table bar \
    --direct -- --default-character-set=latin1

By default, imports go to a new target location. If the destination directory already exists in HDFS, Sqoop will refuse to import and overwrite that directory's contents. If you use the --append argument, Sqoop will import data to a temporary directory and then rename the files into the normal target directory in a manner that does not conflict with existing filenames in that directory.
-
Controlling transaction isolation

By default, Sqoop uses the read committed transaction isolation level in the mappers to import data. This may not be ideal in all ETL workflows, and it may be desirable to reduce the isolation guarantees. The --relaxed-isolation option can be used to instruct Sqoop to use the read uncommitted isolation level.

The read-uncommitted isolation level is not supported on all databases (for example, Oracle), so specifying the option --relaxed-isolation may not be supported on all databases.
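A minimal sketch of relaxing the isolation level for an import (placeholders as before; confirm that your database supports read uncommitted):

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --relaxed-isolation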
-
Controlling type mapping

Sqoop is preconfigured to map most SQL types to appropriate Java or Hive representatives. However, the default mapping might not be suitable for everyone and can be overridden with --map-column-java (for changing the mapping to Java) or --map-column-hive (for changing the Hive mapping).

Parameters for overriding mapping:

  --map-column-java <mapping>   Override mapping from SQL to Java type for configured columns.
  --map-column-hive <mapping>   Override mapping from SQL to Hive type for configured columns.
Sqoop expects a comma-separated list of mappings in the form <name of column>=<new type>. For example:

$ sqoop import ... --map-column-java id=String,value=Integer

Sqoop will raise an exception if a configured mapping is not used.
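A Hive-side override follows the same pattern; in this sketch the --hive-import option and the salary column are assumptions used only for illustration:

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --hive-import --map-column-hive salary=DECIMAL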
-
Incremental Imports

Sqoop provides an incremental import mode which can be used to retrieve only rows newer than some previously-imported set of rows.

The following arguments control incremental imports:

Incremental import arguments:

  --check-column (col)    Specifies the column to be examined when determining which rows to import. (The column should not be of type CHAR/NCHAR/VARCHAR/VARNCHAR/LONGVARCHAR/LONGNVARCHAR.)
  --incremental (mode)    Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.
  --last-value (value)    Specifies the maximum value of the check column from the previous import.
-Sqoop supports two types of incremental imports: append and lastmodified. -You can use the --incremental argument to specify the type of incremental -import to perform. -You should specify append mode when importing a table where new rows are -continually being added with increasing row id values. You specify the column -containing the row’s id with --check-column. Sqoop imports rows where the -check column has a value greater than the one specified with --last-value. -An alternate table update strategy supported by Sqoop is called lastmodified -mode. You should use this when rows of the source table may be updated, and -each such update will set the value of a last-modified column to the current -timestamp. Rows where the check column holds a timestamp more recent than the -timestamp specified with --last-value are imported. -At the end of an incremental import, the value which should be specified as ---last-value for a subsequent import is printed to the screen. When running -a subsequent import, you should specify --last-value in this way to ensure -you import only the new or updated data. This is handled automatically by -creating an incremental import as a saved job, which is the preferred -mechanism for performing a recurring incremental import. See the section on -saved jobs later in this document for more information. -
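For illustration, an append-mode incremental import using the arguments above (table, column, and last value are placeholders):

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --incremental append --check-column id --last-value 1000

At the end of this run Sqoop prints the new maximum value of id, which would be supplied as --last-value on the next run, or tracked automatically if the import is saved as a job.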
-
File Formats

You can import data in one of two file formats: delimited text or SequenceFiles.

Delimited text is the default import format. You can also specify it explicitly by using the --as-textfile argument. This argument will write string-based representations of each record to the output files, with delimiter characters between individual columns and rows. These delimiters may be commas, tabs, or other characters. (The delimiters can be selected; see "Output line formatting arguments.") The following is the result of an example text-based import:

1,here is a message,2010-05-01
2,happy new year!,2010-01-01
3,another message,2009-11-12

Delimited text is appropriate for most non-binary data types. It also readily supports further manipulation by other tools, such as Hive.