sqoop(1)
========

////
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
////

NAME
----
sqoop - SQL-to-Hadoop import tool

SYNOPSIS
--------
'sqoop' <options>

DESCRIPTION
-----------
Sqoop is a tool designed to help users import data from existing
relational databases into their Hadoop clusters. Sqoop uses JDBC to
connect to a database, examine each table's schema, and auto-generate
the classes needed to import data into HDFS. It then runs a MapReduce
job that reads the table from the database via DBInputFormat (the
JDBC-based InputFormat). Tables are read into a set of files written
to HDFS. Both SequenceFile and text-based targets are supported. Sqoop
also supports high-performance direct imports from select databases,
including MySQL.

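For example, a minimal import of a single table might look like the
following (the connect string, table name, and username here are
illustrative):

  $ sqoop --connect jdbc:mysql://db.example.com/corp \
      --table employees --username aaron -P

This reads the +employees+ table in parallel and writes its rows as
text files in HDFS.
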
OPTIONS
-------

The +--connect+ option is always required. To perform an import, one of
+--table+ or +--all-tables+ is required as well. Alternatively, you can
specify +--generate-only+ or one of the arguments in "Additional commands."

Database connection options
~~~~~~~~~~~~~~~~~~~~~~~~~~~

--connect (jdbc-uri)::
  Specify JDBC connect string (required)

--driver (class-name)::
  Manually specify JDBC driver class to use

--username (username)::
  Set authentication username

--password (password)::
  Set authentication password
  (Note: This is very insecure. You should use -P instead.)

-P::
  Prompt for user password

--direct::
  Use direct import fast path (MySQL only)

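When no built-in support matches the connect string, a driver class
can be named explicitly. The following hypothetical example assumes
the PostgreSQL JDBC driver is on the classpath:

  $ sqoop --connect jdbc:postgresql://db.example.com/corp \
      --driver org.postgresql.Driver --table employees \
      --username aaron -P
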
Import control options
~~~~~~~~~~~~~~~~~~~~~~

--all-tables::
  Import all tables in database
  (Ignores +--table+, +--columns+, +--order-by+, and +--where+)

--columns (col,col,col...)::
  Columns to import from the table

--split-by (column-name)::
  Column of the table used to split the table for parallel import

--hadoop-home (dir)::
  Override $HADOOP_HOME

--hive-home (dir)::
  Override $HIVE_HOME

--warehouse-dir (dir)::
  Tables are uploaded to the HDFS path +(dir)/(tablename)/+

--as-sequencefile::
  Imports data to SequenceFiles

--as-textfile::
  Imports data as plain text (default)

--hive-import::
  If set, then import the table into Hive

--hive-create-only::
  Creates the table in Hive and skips the data import step

--hive-overwrite::
  Overwrites the existing table in Hive.
  By default, an existing table is not overwritten.

--table (table-name)::
  The table to import

--hive-table (table-name)::
  When used with +--hive-import+, overrides the destination table name

--where (clause)::
  Import only the rows for which _clause_ is true.
  e.g.: `--where "user_id > 400 AND hidden = 0"`

--compress::
-z::
  Uses gzip to compress data as it is written to HDFS

--direct-split-size (size)::
  When using direct mode, write to multiple files of
  approximately _size_ bytes each

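To illustrate how these options combine, the following sketch imports
one table as gzip-compressed SequenceFiles, splitting the parallel
load on a hypothetical +emp_id+ column:

  $ sqoop --connect jdbc:mysql://db.example.com/corp \
      --table employees --split-by emp_id \
      --as-sequencefile --compress \
      --warehouse-dir /shared/imports

Per the +--warehouse-dir+ description above, the resulting files land
under +/shared/imports/employees/+.
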
Export control options
~~~~~~~~~~~~~~~~~~~~~~

--export-dir (dir)::
  Export from an HDFS path into a table (set with +--table+)

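For example, rows previously written under an HDFS directory could be
exported back into a database table like this (paths are illustrative):

  $ sqoop --connect jdbc:mysql://db.example.com/corp \
      --table employees --export-dir /shared/imports/employees
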
Output line formatting options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

include::output-formatting.txt[]
include::output-formatting-args.txt[]

Input line parsing options
~~~~~~~~~~~~~~~~~~~~~~~~~~

include::input-formatting.txt[]
include::input-formatting-args.txt[]

Code generation options
~~~~~~~~~~~~~~~~~~~~~~~

--bindir (dir)::
  Output directory for compiled objects

--class-name (name)::
  Sets the name of the class to generate. By default, classes are
  named after the table they represent. Using this parameter
  ignores +--package-name+.

--generate-only::
  Stop after code generation; do not import

--outdir (dir)::
  Output directory for generated code

--package-name (package)::
  Puts auto-generated classes in the named Java package

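As a sketch, the following generates source for one table without
importing any data (the package name and directories are illustrative):

  $ sqoop --connect jdbc:mysql://db.example.com/corp \
      --table employees --generate-only \
      --package-name com.example.corp --outdir /tmp/sqoop-src

Since generated classes are named after the table they represent, this
should produce +com.example.corp.employees+ under +/tmp/sqoop-src+.
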
Library loading options
~~~~~~~~~~~~~~~~~~~~~~~

--jar-file (file)::
  Disable code generation; use specified jar

--class-name (name)::
  The class within the jar that represents the table to import/export

Additional commands
~~~~~~~~~~~~~~~~~~~

These commands cause Sqoop to report information and exit;
no import or code generation is performed.

--debug-sql (statement)::
  Execute 'statement' in SQL and display the results

--help::
  Display usage information and exit

--list-databases::
  List all available databases and exit

--list-tables::
  List tables in the database and exit

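For example, to see which tables are available for import:

  $ sqoop --connect jdbc:mysql://db.example.com/corp --list-tables
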
Database-specific options
~~~~~~~~~~~~~~~~~~~~~~~~~

Additional arguments may be passed to the database manager
after a lone '-' on the command line.

In MySQL direct mode, additional arguments are passed directly to
mysqldump.

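For example, a character-set flag could plausibly be forwarded to
mysqldump in direct mode; +--default-character-set+ below is
mysqldump's own option, not Sqoop's:

  $ sqoop --connect jdbc:mysql://db.example.com/corp \
      --table employees --direct - --default-character-set=latin1
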
ENVIRONMENT
-----------

JAVA_HOME::
  As part of its import process, Sqoop generates and compiles Java code
  by invoking the Java compiler *javac*(1). As a result, JAVA_HOME must
  be set to the location of your JDK (note: this cannot be just a JRE),
  e.g., +/usr/java/default+. Hadoop (and Sqoop) requires Sun Java 1.6,
  which can be downloaded from http://java.sun.com.

HADOOP_HOME::
  The location of the Hadoop jar files. If you installed Hadoop via RPM
  or DEB, these are in +/usr/lib/hadoop-20+.

HIVE_HOME::
  If you are performing a Hive import, you must identify the location of
  Hive's jars and configuration. If you installed Hive via RPM or DEB,
  these are in +/usr/lib/hive+.
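
A typical setup before invoking Sqoop might therefore look like the
following (the exact paths depend on your installation):

  export JAVA_HOME=/usr/java/default
  export HADOOP_HOME=/usr/lib/hadoop-20
  export HIVE_HOME=/usr/lib/hive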