////
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////

Automatic Full-database Import
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you want to import all the tables in a database, you can use the
+--all-tables+ argument to do so:

----
$ sqoop --connect jdbc:mysql://database.example.com/employees --all-tables
----

This will query the database for the available tables, generate an ORM
class for each table, and run a MapReduce job to import each one.

Hadoop uses the DBInputFormat to read from a database into a Mapper
instance. Reading a table into a MapReduce program requires a class
that can hold the fields of one row of the table. One of the benefits
of Sqoop is that it generates this class definition for you, based on
the table definition in the database.
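
To make this concrete, here is a hand-written sketch of the general shape of
such a generated class, for a hypothetical +employee_names+ table with an
integer +id+ column and a string +name+ column. The code Sqoop actually emits
is more elaborate (null handling, helper methods, and so on), but it plays the
same roles: hold one row, move it over JDBC via +DBWritable+, and serialize it
for Hadoop via +Writable+.

----
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.lib.db.DBWritable;

// Sketch only: one field per table column, plus per-column accessors.
public class employee_names implements Writable, DBWritable {
  private int id;
  private String name;

  public int get_id() { return id; }
  public String get_name() { return name; }

  // DBWritable: populate the fields from one row of a JDBC ResultSet.
  public void readFields(ResultSet rs) throws SQLException {
    this.id = rs.getInt(1);
    this.name = rs.getString(2);
  }

  // DBWritable: bind the fields back to a PreparedStatement.
  public void write(PreparedStatement ps) throws SQLException {
    ps.setInt(1, this.id);
    ps.setString(2, this.name);
  }

  // Writable: deserialize from Hadoop's binary form (e.g. a SequenceFile).
  public void readFields(DataInput in) throws IOException {
    this.id = in.readInt();
    this.name = Text.readString(in);
  }

  // Writable: serialize to Hadoop's binary form.
  public void write(DataOutput out) throws IOException {
    out.writeInt(this.id);
    Text.writeString(out, this.name);
  }

  // Render the row as delimited text.
  public String toString() {
    return id + "," + name;
  }
}
----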

The generated +.java+ files are, by default, placed in the current
directory. You can supply a different directory with the +--outdir+
parameter. These are then compiled into +.class+ and +.jar+ files for use
by the MapReduce job that Sqoop launches. These files are created in a
temporary directory. You can redirect this target with +--bindir+.
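
For example, to keep the generated sources together and send the compiled
artifacts to a scratch location, you might run something like the following
(the directory paths here are arbitrary placeholders):

----
$ sqoop --connect jdbc:mysql://database.example.com/employees --all-tables \
    --outdir ./generated-src --bindir /tmp/sqoop-classes
----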

Each table will be imported into a separate directory in HDFS, with
the same name as the table. For instance, if my Hadoop username is
aaron, the above command would have generated the following
directories in HDFS:

----
/user/aaron/employee_names
/user/aaron/payroll_checks
/user/aaron/job_descriptions
/user/aaron/office_supplies
----

You can change the base directory under which the tables are loaded
with the +--warehouse-dir+ parameter. For example:

----
$ sqoop --connect jdbc:mysql://database.example.com/employees --all-tables \
    --warehouse-dir /common/warehouse
----

This would create the following directories instead:

----
/common/warehouse/employee_names
/common/warehouse/payroll_checks
/common/warehouse/job_descriptions
/common/warehouse/office_supplies
----

By default the data will be read into text files in HDFS. Each of the
columns will be represented as comma-delimited text. Each row is
terminated by a newline. See the section on "Controlling the Output
Format" below for information on how to change these delimiters.

If you want to leverage compression and binary file formats, the
+--as-sequencefile+ argument to Sqoop will import the table
to a set of SequenceFiles instead. This stores each record of the
database table as a separate object in a SequenceFile.

This representation is also likely to be higher performance when used
as an input to subsequent MapReduce programs as it does not require
parsing. For completeness, Sqoop provides an +--as-textfile+ option, which is
implied by default. An +--as-textfile+ on the command line will override
a previous +--as-sequencefile+ argument.
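
For example, adding the flag to the earlier command imports every table as
SequenceFiles rather than text files:

----
$ sqoop --connect jdbc:mysql://database.example.com/employees --all-tables \
    --as-sequencefile
----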

The SequenceFile format will embed the records from the database as
objects using the code generated by Sqoop. It is important that you
retain the +.java+ file for this class, as you will need to be able to
instantiate the same type to read the objects back later, in other
user-defined applications.
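
As an illustration of that reuse, the sketch below reads one imported file
back with Hadoop's +SequenceFile.Reader+. It assumes the class Sqoop generated
for the table is on the classpath; the file path and class names are
hypothetical, and this program is not part of Sqoop itself.

----
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class ReadImportedTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // One part file produced by the import job; the exact name is hypothetical.
    Path path = new Path("/user/aaron/employee_names/part-00000");

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      // Instantiate key and value objects of whatever types the file declares.
      // The value class is the one Sqoop generated for this table, so its
      // compiled .class/.jar must be available on the classpath.
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable val = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      while (reader.next(key, val)) {
        // Print each record; the generated class provides a readable toString().
        System.out.println(val);
      }
    } finally {
      reader.close();
    }
  }
}
----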