diff --git a/doc/SqoopUserGuide.txt b/doc/SqoopUserGuide.txt
index 3c84767b..f1bf350d 100644
--- a/doc/SqoopUserGuide.txt
+++ b/doc/SqoopUserGuide.txt
@@ -61,3 +61,5 @@
 include::hive.txt[]
 include::supported-dbs.txt[]
+include::api-reference.txt[]
+
diff --git a/doc/api-reference.txt b/doc/api-reference.txt
new file mode 100644
index 00000000..6fdfd12a
--- /dev/null
+++ b/doc/api-reference.txt
@@ -0,0 +1,243 @@
+
+////
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+////
+
+Developer API Reference
+-----------------------
+
+This section specifies the APIs available to application writers integrating with Sqoop, and to those modifying Sqoop. The next three subsections are written for three audiences: those using classes generated by Sqoop and its public library; those writing Sqoop extensions (i.e., additional ConnManager implementations that interact with more databases); and those modifying Sqoop's internals. Each subsection describes the system in successively greater depth.
+
+
+The External API
+~~~~~~~~~~~~~~~~
+
+Sqoop auto-generates classes that represent the tables imported into HDFS. Each generated class contains member fields for each column of the imported table; an instance of the class holds one row of the table. The generated classes implement the serialization APIs used in Hadoop, namely the _Writable_ and _DBWritable_ interfaces. They also contain other convenience methods: a +parse()+ method that interprets delimited text fields, and a +toString()+ method that preserves the user's chosen delimiters. The full set of methods guaranteed to exist in an auto-generated class is specified in the interface +org.apache.hadoop.sqoop.lib.SqoopRecord+.
+
+Instances of _SqoopRecord_ may depend on Sqoop's public API, which comprises all classes in the +org.apache.hadoop.sqoop.lib+ package. These are briefly described below, followed by a short usage sketch. Clients of Sqoop should not need to interact directly with any of these classes, although classes generated by Sqoop will depend on them. Therefore, these APIs are considered public and care will be taken when forward-evolving them.
+
+* The +RecordParser+ class will parse a line of text into a list of fields, using controllable delimiters and quote characters.
+* The static +FieldFormatter+ class provides a method, used in +SqoopRecord.toString()+ implementations, which handles quoting and escaping of characters in a field.
+* Marshaling data between _ResultSet_ and _PreparedStatement_ objects and _SqoopRecords_ is done via +JdbcWritableBridge+.
+* +BigDecimalSerializer+ contains a pair of methods that facilitate serialization of +BigDecimal+ objects over the _Writable_ interface.
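+
+For illustration, the following is a minimal sketch of client code driving a generated class. The +Employees+ class, its sample row, and the exact +parse()+ argument type are hypothetical; the real class and field names are derived from the imported table's schema, and only the behavior of +parse()+ and +toString()+ described above is assumed.
+
+----
+// Hypothetical client of a Sqoop-generated class; "Employees" stands in for
+// whatever class Sqoop generated for your imported table.
+public class GeneratedClassExample {
+  public static void main(String[] args) throws Exception {
+    Employees record = new Employees();
+
+    // parse() interprets one line of delimited text and fills the member fields.
+    record.parse("1,Aaron,engineering");
+
+    // toString() re-emits the row using the user's chosen delimiters, so the
+    // record can be written back out in the same format it was imported in.
+    System.out.println(record.toString());
+  }
+}
+----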
+
+The Extension API
+~~~~~~~~~~~~~~~~~
+
+This section covers the API and primary classes used by extensions for Sqoop which allow Sqoop to interface with more database vendors.
+
+While Sqoop uses JDBC and +DBInputFormat+ (and +DataDrivenDBInputFormat+) to read from databases, differences in the SQL supported by different vendors, as well as in JDBC metadata, necessitate vendor-specific codepaths for most databases. Sqoop solves this problem by introducing the ConnManager API (+org.apache.hadoop.sqoop.manager.ConnManager+).
+
++ConnManager+ is an abstract class defining all methods that interact with the database itself. Most implementations of +ConnManager+ will extend the +org.apache.hadoop.sqoop.manager.SqlManager+ abstract class, which uses standard SQL to perform most actions. Subclasses are required to implement the +getConnection()+ method, which returns the actual JDBC connection to the database. Subclasses are free to override all other methods as well. The +SqlManager+ class itself exposes a protected API that allows developers to selectively override behavior. For example, the +getColNamesQuery()+ method allows the SQL query used by +getColNames()+ to be modified without needing to rewrite the majority of +getColNames()+.
+
++ConnManager+ implementations receive much of their configuration data from a Sqoop-specific class, +ImportOptions+. While +ImportOptions+ does not currently contain many setter methods, clients should not assume +ImportOptions+ is immutable; more setter methods may be added in the future. +ImportOptions+ does not directly store specific per-manager options. Instead, it contains a reference to the +Configuration+ returned by +Tool.getConf()+ after parsing command-line arguments with the +GenericOptionsParser+. This allows extension arguments via "+-D any.specific.param=any.value+" without requiring any layering of options parsing or modification of +ImportOptions+.
+
+All existing +ConnManager+ implementations are stateless. Thus, the system which instantiates +ConnManagers+ may create multiple instances of the same +ConnManager+ class over Sqoop's lifetime. If a caching layer is required, we can add one later, but it is not currently available.
+
++ConnManagers+ are currently created by instances of the abstract class +ManagerFactory+ (see MAPREDUCE-750). One +ManagerFactory+ implementation currently serves all of Sqoop: +org.apache.hadoop.sqoop.manager.DefaultManagerFactory+. Extensions should not modify +DefaultManagerFactory+. Instead, an extension-specific +ManagerFactory+ implementation should be provided with the new ConnManager. +ManagerFactory+ has a single method of note, named +accept()+. This method determines whether it can instantiate a +ConnManager+ for the user's +ImportOptions+. If so, it returns the +ConnManager+ instance; otherwise, it returns +null+.
+
+The +ManagerFactory+ implementations used are governed by the +sqoop.connection.factories+ setting in sqoop-site.xml. Users of extension libraries can install the third-party library containing a new +ManagerFactory+ and +ConnManager+(s), and configure sqoop-site.xml to use the new +ManagerFactory+. The +DefaultManagerFactory+ principally discriminates between databases by parsing the connect string stored in +ImportOptions+.
+
+Extension authors may make use of classes in the +org.apache.hadoop.sqoop.io+, +mapred+, +mapreduce+, and +util+ packages to facilitate their implementations. These packages and classes are described in more detail in the following section. A sketch of a minimal extension follows.
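+
+As a concrete illustration of these extension points, here is a minimal sketch of a third-party +ManagerFactory+. The package, class names, connect-string prefix, and the +getConnectString()+ accessor are hypothetical; only the +accept()+ contract described above is taken from Sqoop itself.
+
+----
+package com.example.sqoop;  // hypothetical extension library
+
+import org.apache.hadoop.sqoop.ImportOptions;
+import org.apache.hadoop.sqoop.manager.ConnManager;
+import org.apache.hadoop.sqoop.manager.ManagerFactory;
+
+public class ExampleManagerFactory extends ManagerFactory {
+  public ConnManager accept(ImportOptions options) {
+    // Discriminate on the connect string, as DefaultManagerFactory does.
+    String connectStr = options.getConnectString();  // assumed accessor name
+    if (connectStr != null && connectStr.startsWith("jdbc:exampledb:")) {
+      return new ExampleDbManager(options);  // the extension's SqlManager subclass
+    }
+    return null;  // decline; another factory may accept these options
+  }
+}
+----
+
+The jar containing such a factory would be installed on Sqoop's classpath and named in the +sqoop.connection.factories+ setting of sqoop-site.xml so that it is consulted alongside +DefaultManagerFactory+.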
+
+
+Sqoop Internals
+~~~~~~~~~~~~~~~
+
+This section describes the internal architecture of Sqoop.
+
+The Sqoop program is driven by the +org.apache.hadoop.sqoop.Sqoop+ main class. A limited number of additional classes are in the same package: +ImportOptions+ (described earlier) and +ConnFactory+ (which manipulates +ManagerFactory+ instances).
+
+General program flow
+^^^^^^^^^^^^^^^^^^^^
+
+The general program flow is as follows:
+
++org.apache.hadoop.sqoop.Sqoop+ is the main class and implements _Tool_. A new instance is launched with +ToolRunner+. It parses its arguments using the +ImportOptions+ class. Within the +ImportOptions+, an +ImportAction+ is chosen by the user; this may be to import all tables, import one specific table, execute a SQL statement, and so on.
+
+A +ConnManager+ is then instantiated based on the data in the +ImportOptions+. The +ConnFactory+ is used to get a +ConnManager+ from a +ManagerFactory+; the mechanics of this were described in an earlier section.
+
+In the +run()+ method, a case statement over the +ImportAction+ enum determines which actions the user needs performed. Usually this involves determining a list of tables to import, generating user code for them, and running a MapReduce job per table to read the data. The import itself does not specifically need to be run via a MapReduce job; the +ConnManager.importTable()+ method is left to determine how best to run the import. Each of these actions is controlled by the +ConnManager+, except for code generation, which is done by the +CompilationManager+ and +ClassWriter+ (both in the +org.apache.hadoop.sqoop.orm+ package). Importing into Hive is handled by the +org.apache.hadoop.sqoop.hive.HiveImport+ class after +importTable()+ has completed, without concern for the +ConnManager+ implementation used.
+
+A ConnManager's +importTable()+ method receives a single argument of type +ImportJobContext+, which contains parameters to the method. This class may be extended with additional parameters in the future, which optionally further direct the import operation. Similarly, the +exportTable()+ method receives an argument of type +ExportJobContext+. These classes contain the name of the table to import/export, a reference to the +ImportOptions+ object, and other related data. A sketch of the +importTable()+ entry point follows.
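+
+The following fragment sketches such an +importTable()+ implementation for the hypothetical +ExampleDbManager+ introduced in the Extension API sketch above. The +getTableName()+ and +getOptions()+ accessors and the exact +throws+ clause are assumptions; only the single +ImportJobContext+ argument is specified above.
+
+----
+// Sketch only: the body shows where an extension decides how to run the
+// import (a MapReduce job, a direct-mode pipe to HDFS, etc.).
+public void importTable(ImportJobContext context) throws IOException, ImportError {
+  String tableName = context.getTableName();     // assumed accessor
+  ImportOptions options = context.getOptions();  // assumed accessor
+
+  // An implementation is free to choose its own strategy here: for example,
+  // configuring and submitting a MapReduce job that reads tableName, or
+  // spawning a vendor-specific dump tool and streaming its output to HDFS.
+}
+----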
+
+Subpackages
+^^^^^^^^^^^
+
+The following subpackages under +org.apache.hadoop.sqoop+ exist:
+
+* +hive+ - Facilitates importing data to Hive.
+* +io+ - Implementations of +java.io.*+ interfaces (namely, _OutputStream_ and _Writer_).
+* +lib+ - The external public API (described earlier).
+* +manager+ - The +ConnManager+ and +ManagerFactory+ classes and their implementations.
+* +mapred+ - Classes interfacing with the old (pre-0.20) MapReduce API.
+* +mapreduce+ - Classes interfacing with the new (0.20+) MapReduce API.
+* +orm+ - Code auto-generation.
+* +util+ - Miscellaneous utility classes.
+
+The +io+ package contains _OutputStream_ and _BufferedWriter_ implementations used by direct writers to HDFS. The +SplittableBufferedWriter+ allows a single _BufferedWriter_ to be opened to a client; under the hood, it writes to multiple files in series as they reach a target threshold size. This allows unsplittable compression libraries (e.g., gzip) to be used in conjunction with Sqoop import while still allowing subsequent MapReduce jobs to use multiple input splits per dataset.
+
+Code in the +mapred+ package should be considered deprecated. The +mapreduce+ package contains +DataDrivenImportJob+, which uses the +DataDrivenDBInputFormat+ introduced in 0.21. The +mapred+ package contains +ImportJob+, which uses the older +DBInputFormat+. Most +ConnManager+ implementations use +DataDrivenImportJob+; +DataDrivenDBInputFormat+ does not currently work with Oracle in all circumstances, so Oracle imports remain on the old code path.
+
+The +orm+ package contains code used for class generation. It depends on the JDK's +tools.jar+, which provides the +com.sun.tools.javac+ package.
+
+The +util+ package contains various utilities used throughout Sqoop:
+
+* +ClassLoaderStack+ manages a stack of +ClassLoader+ instances used by the current thread. This is principally used to load auto-generated code into the current thread when running MapReduce in local (standalone) mode.
+* +DirectImportUtils+ contains convenience methods used by direct HDFS importers.
+* +Executor+ launches external processes and connects these to stream handlers generated by an +AsyncSink+ (described in more detail below).
+* +ExportError+ is thrown by +ConnManagers+ when exports fail.
+* +ImportError+ is thrown by +ConnManagers+ when imports fail.
+* +JdbcUrl+ handles parsing of connect strings, which are URL-like but not specification-conforming. (In particular, JDBC connect strings may have +multi:part:scheme://+ components.)
+* +PerfCounters+ are used to estimate transfer rates for display to the user.
+* +ResultSetPrinter+ will pretty-print a _ResultSet_.
+
+In several places, Sqoop reads the stdout from external processes. The most straightforward cases are direct-mode imports as performed by the +LocalMySQLManager+ and +DirectPostgresqlManager+. After a process is spawned by +Runtime.exec()+, its stdout (+Process.getInputStream()+) and potentially stderr (+Process.getErrorStream()+) must be handled. Failure to read enough data from both of these streams will cause the external process to block before writing more. Consequently, both streams must be handled, preferably asynchronously.
+
+In Sqoop parlance, an "async sink" is a thread that takes an +InputStream+ and reads it to completion. These are realized by +AsyncSink+ implementations. The +org.apache.hadoop.sqoop.util.AsyncSink+ abstract class defines the operations these classes must perform. +processStream()+ will spawn another thread to immediately begin handling the data read from the +InputStream+ argument; it must read this stream to completion. The +join()+ method allows external threads to wait until this processing is complete.
+
+Some "stock" +AsyncSink+ implementations are provided: the +LoggingAsyncSink+ will repeat everything on the +InputStream+ as log4j INFO statements. The +NullAsyncSink+ consumes all its input and does nothing.
+
+The various +ConnManagers+ that make use of external processes have their own +AsyncSink+ implementations as inner classes, which read from the database tools and forward the data along to HDFS, possibly performing formatting conversions in the meantime. A sketch of typical +AsyncSink+ usage follows.
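+
+For example, a direct-mode manager might handle an external dump tool's streams roughly as follows. This is a sketch only: the sink constructor arguments, the +join()+ signature, and the +LOG+ field are assumptions, and error handling is largely elided.
+
+----
+// Sketch of the stream-handling pattern described above.
+private void runDumpTool(String... argv) throws IOException, InterruptedException {
+  Process p = Runtime.getRuntime().exec(argv);
+
+  AsyncSink stdoutSink = new LoggingAsyncSink(LOG);  // LOG: the manager's logger (assumed)
+  AsyncSink stderrSink = new NullAsyncSink();
+
+  // Both streams must be consumed, or the child may block on a full pipe.
+  stdoutSink.processStream(p.getInputStream());
+  stderrSink.processStream(p.getErrorStream());
+
+  int status = p.waitFor();  // wait for the external tool to exit
+  stdoutSink.join();         // ...and for its output to be fully read
+  stderrSink.join();
+  if (status != 0) {
+    throw new IOException("external tool exited with status " + status);
+  }
+}
+----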