
SQOOP-1225: Sqoop 2 documentation for connector development

(Masatake Iwasaki via Jarek Jarcec Cecho)
Jarek Jarcec Cecho 2013-11-10 18:50:21 -08:00
parent ec252cadc1
commit 7353df9809


Sqoop 2 Connector Development
=============================

This document describes how to implement a connector for Sqoop 2,
using the code of the built-in connector ( ``GenericJdbcConnector`` ) as an example.

.. contents::

What is a Connector?
++++++++++++++++++++

Interaction with Hadoop is taken care of by common modules of the Sqoop 2 framework.

Connector Implementation
++++++++++++++++++++++++
The ``SqoopConnector`` class defines the functionality
which must be provided by connectors.
Each connector must extend ``SqoopConnector`` and override the methods shown below.

::

  public abstract String getVersion();
  public abstract ResourceBundle getBundle(Locale locale);
  public abstract Class getConnectionConfigurationClass();
  public abstract Class getJobConfigurationClass(MJob.Type jobType);
  public abstract Importer getImporter();
  public abstract Exporter getExporter();
  public abstract Validator getValidator();
  public abstract MetadataUpgrader getMetadataUpgrader();

The ``getImporter`` method returns an Importer_ instance,
which is a placeholder for the modules needed for import.
The ``getExporter`` method returns an Exporter_ instance,
which is a placeholder for the modules needed for export.
Methods such as ``getBundle``, ``getConnectionConfigurationClass``,
``getJobConfigurationClass`` and ``getValidator``
are concerned with `Connector configurations`_.

Importer
========
A connector's ``getImporter`` method returns an ``Importer`` instance,
which is a placeholder for the modules needed for import,
such as Partitioner_ and Extractor_.
The built-in ``GenericJdbcConnector`` defines its ``Importer`` like this.

::

  private static final Importer IMPORTER = new Importer(
      GenericJdbcImportInitializer.class,
      GenericJdbcImportPartitioner.class,
      GenericJdbcImportExtractor.class,
      GenericJdbcImportDestroyer.class);

Extractor
---------

Extractor (the E of ETL) extracts data from an external database and
writes it to the Sqoop framework for import.
An Extractor must override the ``extract`` method.

::

  public abstract void extract(ExtractorContext context,
                               ConnectionConfiguration connectionConfiguration,
                               JobConfiguration jobConfiguration,
                               Partition partition);

The ``extract`` method extracts data from the database in some way and
writes it to the ``DataWriter`` (provided by the context) as the `Intermediate representation`_.
The Extractor must iterate in the ``extract`` method until the data from the database is exhausted.

::

  while (resultSet.next()) {
    ...
    context.getDataWriter().writeArrayRecord(array);
    ...
  }

Partitioner
-----------
The Partitioner creates ``Partition`` instances based on configurations.
The number of ``Partition`` instances is decided by the value that the user
specified as the number of extractors in the job configuration.
``Partition`` instances are passed to the Extractor_ as an argument of the ``extract`` method.
The Extractor_ determines which portion of the data to extract based on the ``Partition``.
There is no actual convention for ``Partition`` classes
other than being ``Writable`` and ``toString()``-able.

::

  public abstract class Partition {
    public abstract void readFields(DataInput in) throws IOException;
    public abstract void write(DataOutput out) throws IOException;
    public abstract String toString();
  }

Connectors can define the design of ``Partition`` on their own.
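For example, a partition over a numeric key range might look like the following. This is a hypothetical sketch: a real class would extend the framework's ``Partition`` base class, which is omitted here so the example stands alone, and the ``id`` column and ``split`` helper are illustrative only.

```java
import java.io.*;
import java.util.ArrayList;
import java.util.List;

// Hypothetical example: a Partition over a numeric key range.
// A real implementation would extend
// org.apache.sqoop.job.etl.Partition; the base class is omitted
// so this sketch compiles on its own.
public class RangePartition {
    private long lower;   // inclusive
    private long upper;   // exclusive

    public RangePartition(long lower, long upper) {
        this.lower = lower;
        this.upper = upper;
    }

    // Writable-style serialization, as the framework requires.
    public void write(DataOutput out) throws IOException {
        out.writeLong(lower);
        out.writeLong(upper);
    }

    public void readFields(DataInput in) throws IOException {
        lower = in.readLong();
        upper = in.readLong();
    }

    @Override
    public String toString() {
        // For a JDBC connector the string form can feed a WHERE clause.
        return lower + " <= id AND id < " + upper;
    }

    // Sketch of partitioning logic: split [low, high) into at most
    // n roughly equal ranges, one per requested extractor.
    public static List<RangePartition> split(long low, long high, int n) {
        List<RangePartition> parts = new ArrayList<>();
        long size = Math.max(1, (high - low + n - 1) / n);
        for (long start = low; start < high; start += size) {
            parts.add(new RangePartition(start, Math.min(start + size, high)));
        }
        return parts;
    }
}
```

Splitting the range ``[0, 10)`` for three extractors yields three partitions whose string forms can be embedded into per-task queries.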
Initializer and Destroyer
-------------------------

Initializer is instantiated before the submission of the MapReduce job,
for preparation such as adding dependent jar files.
Destroyer is instantiated after the MapReduce job is finished, for clean up.

Exporter
========
A connector's ``getExporter`` method returns an ``Exporter`` instance,
which is a placeholder for the modules needed for export,
such as Loader_.
The built-in ``GenericJdbcConnector`` defines its ``Exporter`` like this.

::

  private static final Exporter EXPORTER = new Exporter(
      GenericJdbcExportInitializer.class,
      GenericJdbcExportLoader.class,
      GenericJdbcExportDestroyer.class);

Loader
------

Loader (the L of ETL) receives data from the Sqoop framework and
loads it into an external database.
A Loader must override the ``load`` method.

::

  public abstract void load(LoaderContext context,
                            ConnectionConfiguration connectionConfiguration,
                            JobConfiguration jobConfiguration) throws Exception;

The ``load`` method reads data from the ``DataReader`` (provided by the context)
in the `Intermediate representation`_ and loads it into the database in some way.
The Loader must iterate in the ``load`` method until the data from the ``DataReader`` is exhausted.

::

  while ((array = context.getDataReader().readArrayRecord()) != null) {
    ...
  }

Connector Configurations
++++++++++++++++++++++++
Connector specifications
========================
The Sqoop framework loads the definitions of connectors
from a file named ``sqoopconnector.properties``,
which each connector implementation provides.

::

  # Generic JDBC Connector Properties
  org.apache.sqoop.connector.class = org.apache.sqoop.connector.jdbc.GenericJdbcConnector
  org.apache.sqoop.connector.name = generic-jdbc-connector

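For illustration, the two keys above can be read back with plain ``java.util.Properties``. This is a sketch of the lookup only, not the framework's actual loading code, and the ``ConnectorPropsDemo`` helper is hypothetical.

```java
import java.io.StringReader;
import java.util.Properties;

// Sketch: parse sqoopconnector.properties-style text and pull out
// the connector class name. Hypothetical helper, not part of the
// Sqoop framework.
public class ConnectorPropsDemo {
    public static String connectorClass(String propsText) throws Exception {
        Properties props = new Properties();
        props.load(new StringReader(propsText));
        return props.getProperty("org.apache.sqoop.connector.class");
    }
}
```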
Configurations
==============
Implementations of ``SqoopConnector`` override methods such as
``getConnectionConfigurationClass`` and ``getJobConfigurationClass``,
returning configuration classes.

::

  @Override
  public Class getConnectionConfigurationClass() {
    return ConnectionConfiguration.class;
  }

  @Override
  public Class getJobConfigurationClass(MJob.Type jobType) {
    switch (jobType) {
      case IMPORT:
        return ImportJobConfiguration.class;
      case EXPORT:
        return ExportJobConfiguration.class;
      default:
        return null;
    }
  }

Configurations are represented by models defined in the ``org.apache.sqoop.model`` package.
Annotations such as
``ConfigurationClass``, ``FormClass``, ``Form`` and ``Input``
are provided for defining the configurations of each connector
using these models.

``ConfigurationClass`` is a placeholder for ``FormClass`` definitions.

::

  @ConfigurationClass
  public class ConnectionConfiguration {

    @Form public ConnectionForm connection;

    public ConnectionConfiguration() {
      connection = new ConnectionForm();
    }
  }

Each ``FormClass`` defines the names and types of configuration inputs.

::

  @FormClass
  public class ConnectionForm {
    @Input(size = 128) public String jdbcDriver;
    @Input(size = 128) public String connectionString;
    @Input(size = 40) public String username;
    @Input(size = 40, sensitive = true) public String password;
    @Input public Map<String, String> jdbcProperties;
  }

ResourceBundle
==============
Resources used by client user interfaces are defined in a properties file.

::

  # jdbc driver
  connection.jdbcDriver.label = JDBC Driver Class
  connection.jdbcDriver.help = Enter the fully qualified class name of the JDBC \
  driver that will be used for establishing this connection.

  # connect string
  connection.connectionString.label = JDBC Connection String
  connection.connectionString.help = Enter the value of JDBC connection string to be \
  used by this connector for creating connections.

  ...

These resources are loaded by the ``getBundle`` method of the connector.

::

  @Override
  public ResourceBundle getBundle(Locale locale) {
    return ResourceBundle.getBundle(
        GenericJdbcConnectorConstants.RESOURCE_BUNDLE_NAME, locale);
  }

Validator
=========

A Validator validates the configurations set by users.

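As an illustration of the kind of check a validator might run on the ``ConnectionForm`` inputs, here is a hypothetical, self-contained sketch. A real implementation extends the framework's ``Validator`` class and reports validation statuses rather than returning a boolean; the ``ConnectionFormChecks`` helper below is invented for this example.

```java
// Hypothetical sketch of one check a connector validator might
// perform on the connection string input. Not part of the Sqoop
// framework API.
public class ConnectionFormChecks {
    public static boolean looksLikeJdbcUrl(String connectionString) {
        return connectionString != null && connectionString.startsWith("jdbc:");
    }
}
```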
Internals of the Sqoop 2 MapReduce Job
++++++++++++++++++++++++++++++++++++++
Sqoop 2 provides common MapReduce modules such as ``SqoopMapper`` and ``SqoopReducer``
for both import and export.

- For import, the ``Extractor`` provided by the connector extracts data from databases,
  and the ``Loader`` provided by Sqoop 2 loads the data into Hadoop.

- For export, the ``Extractor`` provided by Sqoop 2 extracts data from Hadoop,
  and the ``Loader`` provided by the connector loads the data into databases.

The diagram below describes the initialization phase of an IMPORT job.
``SqoopInputFormat`` creates splits using ``Partitioner``.
::

,----------------. ,-----------.
|SqoopInputFormat| |Partitioner|
`-------+--------' `-----+-----'
getSplits | |
----------->| |
| getPartitions |
|------------------------>|
| | ,---------.
| |-------> |Partition|
| | `----+----'
|<- - - - - - - - - - - - | |
| | | ,----------.
|-------------------------------------------------->|SqoopSplit|
| | | `----+-----'
The diagram below describes the map phase of an IMPORT job.
``SqoopMapper`` invokes the ``Extractor``'s ``extract`` method.
::

,-----------.
|SqoopMapper|
`-----+-----'
run |
--------->| ,-------------.
|---------------------------------->|MapDataWriter|
| `------+------'
| ,---------. |
|--------------> |Extractor| |
| `----+----' |
| extract | |
|-------------------->| |
| | |
read from DB | |
<-------------------------------| write* |
| |------------------->|
| | | ,----.
| | |---------->|Data|
| | | `-+--'
| | |
| | | context.write
| | |-------------------------->
The diagram below describes the reduce phase of an EXPORT job.
``OutputFormat`` invokes the ``Loader``'s ``load`` method (via ``SqoopOutputFormatLoadExecutor``).
::

,-------. ,---------------------.
|Reducer| |SqoopNullOutputFormat|
`---+---' `----------+----------'
| | ,-----------------------------.
| |-> |SqoopOutputFormatLoadExecutor|
| | `--------------+--------------' ,----.
| | |---------------------> |Data|
| | | `-+--'
| | | ,-----------------. |
| | |-> |SqoopRecordWriter| |
getRecordWriter | | `--------+--------' |
----------------------->| getRecordWriter | | |
| |----------------->| | | ,--------------.
| | |-----------------------------> |ConsumerThread|
| | | | | `------+-------'
| |<- - - - - - - - -| | | | ,------.
<- - - - - - - - - - - -| | | | |--->|Loader|
| | | | | | `--+---'
| | | | | | |
| | | | | | load |
run | | | | | |------>|
----->| | write | | | | |
|------------------------------------------------>| setContent | | read* |
| | | |----------->| getContent |<------|
| | | | |<-----------| |
| | | | | | - - ->|
| | | | | | | write into DB
| | | | | | |-------------->
.. _`Intermediate representation`: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Intermediate+representation