MAPREDUCE-1036. Document Sqoop API. Contributed by Aaron Kimball
From: Christopher Douglas <cdouglas@apache.org> git-svn-id: https://svn.apache.org/repos/asf/incubator/sqoop/trunk@1149838 13f79535-47bb-0310-9956-ffa450edef68
parent 9afc7a8aee
commit a0229d9738
@@ -61,3 +61,5 @@ include::hive.txt[]
 include::supported-dbs.txt[]
+include::api-reference.txt[]
doc/api-reference.txt (new file, 243 lines)
@@ -0,0 +1,243 @@
////
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////

Developer API Reference
-----------------------

This section specifies the APIs available to application writers integrating
with Sqoop, and to those modifying Sqoop. The next three subsections are
written from three perspectives: those using classes generated by Sqoop and its
public library; those writing Sqoop extensions (i.e., additional ConnManager
implementations that interact with more databases); and those modifying Sqoop's
internals. Each section describes the system in successively greater depth.


The External API
~~~~~~~~~~~~~~~~

Sqoop auto-generates classes that represent the tables imported into HDFS. Each
generated class contains member fields for each column of the imported table;
an instance of the class holds one row of the table. The generated classes
implement the serialization APIs used in Hadoop, namely the _Writable_ and
_DBWritable_ interfaces. They also contain other convenience methods: a
+parse()+ method that interprets delimited text fields, and a +toString()+
method that preserves the user's chosen delimiters. The full set of methods
guaranteed to exist in an auto-generated class is specified in the interface
+org.apache.hadoop.sqoop.lib.SqoopRecord+.
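
For example, client code might use a generated class roughly as follows.
+SomeTable+ stands in for whatever class Sqoop generates for the imported
table, and the +parse(CharSequence)+ overload shown is an assumption; only the
methods named in +SqoopRecord+ are guaranteed to exist.

----
// Hypothetical use of a Sqoop-generated record class ("SomeTable" stands in
// for whatever class Sqoop generates for the imported table). Assumes a no-arg
// constructor and a parse() overload accepting a CharSequence.
public class GeneratedClassExample {
  public static void main(String[] args) throws Exception {
    SomeTable row = new SomeTable();      // implements SqoopRecord, Writable, DBWritable
    row.parse("1,Aaron,engineering");     // populate fields from delimited text
    System.out.println(row.toString());   // re-emits the row with the chosen delimiters
  }
}
----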

Instances of _SqoopRecord_ may depend on Sqoop's public API, which consists of
all classes in the +org.apache.hadoop.sqoop.lib+ package. These are briefly
described below. Clients of Sqoop should not need to interact directly with any
of these classes, although classes generated by Sqoop will depend on them.
Therefore, these APIs are considered public and care will be taken when
forward-evolving them.

* The +RecordParser+ class will parse a line of text into a list of fields,
  using controllable delimiters and quote characters.
* The static +FieldFormatter+ class provides a method which handles quoting and
  escaping of characters in a field, for use in +SqoopRecord.toString()+
  implementations.
* Marshaling data between _ResultSet_ and _PreparedStatement_ objects and
  _SqoopRecords_ is done via +JdbcWritableBridge+.
* +BigDecimalSerializer+ contains a pair of methods that facilitate
  serialization of +BigDecimal+ objects over the _Writable_ interface.

The Extension API
~~~~~~~~~~~~~~~~~

This section covers the API and primary classes used by Sqoop extensions, which
allow Sqoop to interface with more database vendors.

While Sqoop uses JDBC and +DBInputFormat+ (and +DataDrivenDBInputFormat+) to
read from databases, differences in the SQL supported by different vendors, as
well as in JDBC metadata, necessitate vendor-specific codepaths for most
databases. Sqoop solves this problem by introducing the ConnManager API
(+org.apache.hadoop.sqoop.manager.ConnManager+).

+ConnManager+ is an abstract class defining all methods that interact with the
database itself. Most implementations of +ConnManager+ will extend the
+org.apache.hadoop.sqoop.manager.SqlManager+ abstract class, which uses standard
SQL to perform most actions. Subclasses are required to implement the
+getConnection()+ method, which returns the actual JDBC connection to the
database. Subclasses are free to override all other methods as well. The
+SqlManager+ class itself exposes a protected API that allows developers to
selectively override behavior. For example, the +getColNamesQuery()+ method
allows the SQL query used by +getColNames()+ to be modified without needing to
rewrite the majority of +getColNames()+.
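
A minimal sketch of a vendor-specific manager is shown below, assuming that
+SqlManager+ exposes a constructor taking +ImportOptions+ and that
+ImportOptions+ offers accessors for the connect string and credentials. The
class name and method signatures are illustrative, not copied from real code.

----
// Sketch of a hypothetical ConnManager for a database "somedb".
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

import org.apache.hadoop.sqoop.ImportOptions;
import org.apache.hadoop.sqoop.manager.SqlManager;

public class SomeDbManager extends SqlManager {
  private final ImportOptions opts;
  private Connection connection;

  public SomeDbManager(final ImportOptions opts) {
    super(opts);                 // assumes SqlManager(ImportOptions) exists
    this.opts = opts;
  }

  public Connection getConnection() throws SQLException {
    if (null == this.connection) {
      this.connection = DriverManager.getConnection(opts.getConnectString(),
          opts.getUsername(), opts.getPassword());
    }
    return this.connection;
  }

  protected String getColNamesQuery(String tableName) {
    // Selectively override SqlManager's protected API, e.g. to quote
    // identifiers the way this vendor expects.
    return "SELECT t.* FROM \"" + tableName + "\" AS t WHERE 1 = 0";
  }
}
----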

+ConnManager+ implementations receive a lot of their configuration data from a
Sqoop-specific class, +ImportOptions+. While +ImportOptions+ does not currently
contain many setter methods, clients should not assume that +ImportOptions+
instances are immutable; more setter methods may be added in the future.
+ImportOptions+ does not directly store specific per-manager options. Instead,
it contains a reference to the +Configuration+ returned by +Tool.getConf()+
after parsing command-line arguments with the +GenericOptionsParser+. This
allows extension arguments via "+-D any.specific.param=any.value+" without
requiring any layering of options parsing or modification of +ImportOptions+.
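
For instance, a vendor-specific manager could read such a parameter from the
underlying +Configuration+. The +getConf()+ accessor on +ImportOptions+ is
assumed here for illustration; the +Configuration+ API itself is standard
Hadoop.

----
// Sketch: reading an extension-specific parameter passed on the command line
// as -D somedb.fetch.size=1000. The getConf() accessor on ImportOptions is an
// assumption.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.sqoop.ImportOptions;

public class ExtensionParamExample {
  public static int getFetchSize(ImportOptions opts) {
    Configuration conf = opts.getConf();
    return conf.getInt("somedb.fetch.size", 500);  // default if not supplied
  }
}
----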

All existing +ConnManager+ implementations are stateless. Thus, the system that
instantiates +ConnManagers+ may create multiple instances of the same
+ConnManager+ class over Sqoop's lifetime. If a caching layer is required, we
can add one later, but it is not currently available.

+ConnManagers+ are currently created by instances of the abstract class
+ManagerFactory+ (see MAPREDUCE-750). One +ManagerFactory+ implementation
currently serves all of Sqoop:
+org.apache.hadoop.sqoop.manager.DefaultManagerFactory+. Extensions should not
modify +DefaultManagerFactory+. Instead, an extension-specific +ManagerFactory+
implementation should be provided along with the new +ConnManager+.
+ManagerFactory+ has a single method of note, named +accept()+. This method
determines whether the factory can instantiate a +ConnManager+ for the user's
+ImportOptions+. If so, it returns the +ConnManager+ instance; otherwise, it
returns +null+.
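
A sketch of such a factory is shown below, assuming that +accept()+ takes the
+ImportOptions+ and that the connect string can be read via an accessor such as
+getConnectString()+; both details are assumptions for illustration.

----
// Sketch of an extension-specific ManagerFactory. The accept() signature and
// the getConnectString() accessor are assumptions.
import org.apache.hadoop.sqoop.ImportOptions;
import org.apache.hadoop.sqoop.manager.ConnManager;
import org.apache.hadoop.sqoop.manager.ManagerFactory;

public class SomeDbManagerFactory extends ManagerFactory {
  public ConnManager accept(ImportOptions options) {
    String connectStr = options.getConnectString();
    if (connectStr != null && connectStr.startsWith("jdbc:somedb:")) {
      return new SomeDbManager(options);  // the manager sketched earlier
    }
    return null;  // let another factory handle this connect string
  }
}
----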

The +ManagerFactory+ implementations used are governed by the
+sqoop.connection.factories+ setting in sqoop-site.xml. Users of extension
libraries can install the third-party library containing a new +ManagerFactory+
and +ConnManager+(s), and configure sqoop-site.xml to use the new
+ManagerFactory+. The +DefaultManagerFactory+ principally discriminates between
databases by parsing the connect string stored in +ImportOptions+.
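
A sketch of the corresponding sqoop-site.xml entry follows; whether multiple
factories may be listed, and the comma-separated format shown, are assumptions
rather than something documented above.

----
<!-- sqoop-site.xml: sketch of registering an extension ManagerFactory.
     The comma-separated list format is an assumption for illustration. -->
<configuration>
  <property>
    <name>sqoop.connection.factories</name>
    <value>com.example.sqoop.SomeDbManagerFactory,org.apache.hadoop.sqoop.manager.DefaultManagerFactory</value>
  </property>
</configuration>
----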

Extension authors may make use of classes in the +org.apache.hadoop.sqoop.io+,
+mapred+, +mapreduce+, and +util+ packages to facilitate their implementations.
These packages and classes are described in more detail in the following
section.


Sqoop Internals
~~~~~~~~~~~~~~~

This section describes the internal architecture of Sqoop.

The Sqoop program is driven by the +org.apache.hadoop.sqoop.Sqoop+ main class.
A limited number of additional classes are in the same package: +ImportOptions+
(described earlier) and +ConnFactory+ (which manipulates +ManagerFactory+
instances).

General program flow
^^^^^^^^^^^^^^^^^^^^

The general program flow is as follows:

+org.apache.hadoop.sqoop.Sqoop+ is the main class and implements _Tool_. A new
instance is launched with +ToolRunner+. It parses its arguments using the
+ImportOptions+ class. Within the +ImportOptions+, an +ImportAction+ will be
chosen by the user. This may be importing all tables, importing one specific
table, executing a SQL statement, or another action.
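
The launch follows the standard Hadoop _Tool_ pattern, roughly as sketched
below. This is a schematic, not the actual +Sqoop+ source; the argument
handling inside +run()+ is only indicated by comments.

----
// Schematic of the Tool/ToolRunner launch pattern Sqoop uses; not the actual
// Sqoop class. GenericOptionsParser-handled arguments (-D ...) are stripped by
// ToolRunner before run() receives the remaining arguments.
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SqoopDriverSketch extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    // Parse the remaining arguments into ImportOptions, choose an
    // ImportAction, instantiate a ConnManager, and dispatch (see the
    // following paragraphs).
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int ret = ToolRunner.run(new SqoopDriverSketch(), args);
    System.exit(ret);
  }
}
----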

A +ConnManager+ is then instantiated based on the data in the +ImportOptions+.
The +ConnFactory+ is used to get a +ConnManager+ from a +ManagerFactory+; the
mechanics of this were described in an earlier section.

Then, in the +run()+ method, a case statement over the +ImportAction+ enum
determines which actions the user needs performed. Usually this involves
determining a list of tables to import, generating user code for them, and
running a MapReduce job per table to read the data. The import itself does not
specifically need to be run via a MapReduce job; the +ConnManager.importTable()+
method is left to determine how best to run the import. Each of these actions is
controlled by the +ConnManager+, except for code generation, which is done by
the +CompilationManager+ and +ClassWriter+ (both in the
+org.apache.hadoop.sqoop.orm+ package). Importing into Hive is also taken care
of via the +org.apache.hadoop.sqoop.hive.HiveImport+ class after
+importTable()+ has completed. This is done without concern for the
+ConnManager+ implementation used.
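
Schematically, the dispatch looks something like the following. The enum
constants and helper methods are illustrative stand-ins, not the real Sqoop
identifiers.

----
// Illustrative dispatch over an ImportAction-style enum; the enum constants
// and helper methods are assumptions, not real Sqoop identifiers.
public class DispatchSketch {
  enum Action { ALL_TABLES, SINGLE_TABLE, GENERATE_ONLY }

  void dispatch(Action action, String tableName) {
    switch (action) {
    case ALL_TABLES:
      for (String table : listTables()) {
        importTable(table);          // generate code, then run the import job
      }
      break;
    case SINGLE_TABLE:
      importTable(tableName);
      break;
    default:
      // other actions: generate code only, execute a SQL statement, etc.
      break;
    }
  }

  java.util.List<String> listTables() { return java.util.Collections.emptyList(); }
  void importTable(String table) { /* delegate to ConnManager.importTable() */ }
}
----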

A ConnManager's +importTable()+ method receives a single argument of type
+ImportJobContext+, which contains the parameters to the method. This class may
be extended with additional parameters in the future, which optionally further
direct the import operation. Similarly, the +exportTable()+ method receives an
argument of type +ExportJobContext+. These classes contain the name of the
table to import/export, a reference to the +ImportOptions+ object, and other
related data.
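
A sketch of a +ConnManager+ reading its parameters from the context is shown
below; the package location of +ImportJobContext+ and its accessor names
(+getTableName()+, +getOptions()+) are assumptions for illustration.

----
// Sketch of a ConnManager implementation reading its parameters from the
// ImportJobContext; accessor names and package location are assumptions.
import java.io.IOException;

import org.apache.hadoop.sqoop.ImportOptions;
import org.apache.hadoop.sqoop.manager.ImportJobContext;

public class ImportTableSketch {
  public void importTable(ImportJobContext context) throws IOException {
    String table = context.getTableName();
    ImportOptions options = context.getOptions();
    // Run the import however is best for this database: a MapReduce job,
    // a vendor bulk-export tool streamed to HDFS, etc.
  }
}
----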

Subpackages
^^^^^^^^^^^

The following subpackages under +org.apache.hadoop.sqoop+ exist:

* +hive+ - Facilitates importing data to Hive.
* +io+ - Implementations of +java.io.*+ interfaces (namely, _OutputStream_ and
  _Writer_).
* +lib+ - The external public API (described earlier).
* +manager+ - The +ConnManager+ and +ManagerFactory+ classes and their
  implementations.
* +mapred+ - Classes interfacing with the old (pre-0.20) MapReduce API.
* +mapreduce+ - Classes interfacing with the new (0.20+) MapReduce API.
* +orm+ - Code auto-generation.
* +util+ - Miscellaneous utility classes.

The +io+ package contains _OutputStream_ and _BufferedWriter_ implementations
used by direct writers to HDFS. The +SplittableBufferedWriter+ presents a single
_BufferedWriter_ to the client but, under the hood, writes to multiple files in
series as each reaches a target threshold size. This allows unsplittable
compression libraries (e.g., gzip) to be used in conjunction with Sqoop imports
while still allowing subsequent MapReduce jobs to use multiple input splits per
dataset.
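
The underlying idea can be sketched independently of Sqoop's own classes: a
writer that rotates to a new file once a size threshold is crossed. This is
illustrative only and is not the actual +SplittableBufferedWriter+
implementation.

----
// Generic illustration of "write to multiple files in series": rotate to a new
// file whenever the current one passes a size threshold. A real implementation
// would only rotate at safe record boundaries.
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

public class RotatingWriterSketch extends Writer {
  private final String basePath;
  private final long threshold;
  private long written = 0;
  private int fileIndex = 0;
  private Writer current;

  public RotatingWriterSketch(String basePath, long threshold) throws IOException {
    this.basePath = basePath;
    this.threshold = threshold;
    this.current = new FileWriter(basePath + "-00000");
  }

  public void write(char[] buf, int off, int len) throws IOException {
    if (written >= threshold) {
      current.close();
      fileIndex++;
      current = new FileWriter(String.format("%s-%05d", basePath, fileIndex));
      written = 0;
    }
    current.write(buf, off, len);
    written += len;
  }

  public void flush() throws IOException { current.flush(); }
  public void close() throws IOException { current.close(); }
}
----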

Code in the +mapred+ package should be considered deprecated. The +mapreduce+
package contains +DataDrivenImportJob+, which uses the +DataDrivenDBInputFormat+
introduced in 0.21. The +mapred+ package contains +ImportJob+, which uses the
older +DBInputFormat+. Most +ConnManager+ implementations use
+DataDrivenImportJob+; +DataDrivenDBInputFormat+ does not currently work with
Oracle in all circumstances, so it remains on the old code-path.

The +orm+ package contains code used for class generation. It depends on the
JDK's +tools.jar+, which provides the +com.sun.tools.javac+ package.

The +util+ package contains various utilities used throughout Sqoop:

* +ClassLoaderStack+ manages a stack of +ClassLoader+ instances used by the
  current thread. This is principally used to load auto-generated code into the
  current thread when running MapReduce in local (standalone) mode.
* +DirectImportUtils+ contains convenience methods used by direct HDFS
  importers.
* +Executor+ launches external processes and connects these to stream handlers
  generated by an +AsyncSink+ (see more detail below).
* +ExportError+ is thrown by +ConnManagers+ when exports fail.
* +ImportError+ is thrown by +ConnManagers+ when imports fail.
* +JdbcUrl+ handles parsing of connect strings, which are URL-like but not
  specification-conforming. (In particular, JDBC connect strings may have
  +multi:part:scheme://+ components.)
* +PerfCounters+ are used to estimate transfer rates for display to the user.
* +ResultSetPrinter+ will pretty-print a _ResultSet_.

In several places, Sqoop reads the stdout from external processes. The most
straightforward cases are direct-mode imports as performed by the
+LocalMySQLManager+ and +DirectPostgresqlManager+. After a process is spawned by
+Runtime.exec()+, its stdout (+Process.getInputStream()+) and potentially stderr
(+Process.getErrorStream()+) must be handled. Failure to read enough data from
both of these streams will cause the external process to block before writing
more. Consequently, both streams must be handled, and preferably asynchronously.

In Sqoop parlance, an "async sink" is a thread that takes an +InputStream+ and
reads it to completion. These are realized by +AsyncSink+ implementations. The
+org.apache.hadoop.sqoop.util.AsyncSink+ abstract class defines the operations
an implementation must perform. +processStream()+ will spawn another thread to
immediately begin handling the data read from the +InputStream+ argument; it
must read this stream to completion. The +join()+ method allows external threads
to wait until this processing is complete.
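
A minimal sketch of such a sink is shown below; the exact signatures of
+processStream()+ and +join()+ are assumptions inferred from the description
above, not copied from the real class.

----
// Sketch of an AsyncSink-style consumer: processStream() hands the stream to a
// background thread that drains it; join() waits for that thread to finish.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class DrainingSinkSketch {
  private Thread worker;

  public void processStream(final InputStream in) {
    worker = new Thread(new Runnable() {
      public void run() {
        try {
          BufferedReader reader = new BufferedReader(new InputStreamReader(in));
          String line;
          while ((line = reader.readLine()) != null) {
            // e.g., log the line, or forward it to HDFS after reformatting.
          }
        } catch (IOException ioe) {
          // Swallow or log; the external process has likely exited.
        }
      }
    });
    worker.start();
  }

  public int join() throws InterruptedException {
    worker.join();
    return 0;  // the real join() might report a status; assumed here
  }
}
----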

Some "stock" +AsyncSink+ implementations are provided: the +LoggingAsyncSink+
will repeat everything on the +InputStream+ as log4j INFO statements, and the
+NullAsyncSink+ consumes all its input and does nothing.

The various +ConnManagers+ that make use of external processes have their own
+AsyncSink+ implementations as inner classes, which read from the database tools
and forward the data along to HDFS, possibly performing formatting conversions
along the way.