
SQOOP-1481: SQOOP2: Document the public apis and end-end design for the SQ2 Connector

(Abraham Elmahrek via Jarek Jarcec Cecho)
Jarek Jarcec Cecho 2014-10-09 08:16:52 -07:00 committed by Abraham Elmahrek
parent 4c964a97be
commit b600036573
4 changed files with 97 additions and 123 deletions


@@ -329,14 +329,14 @@ Create new job object.
 +------------------------+------------------------------------------------------------------+
 | Argument               | Description                                                      |
 +========================+==================================================================+
-| ``-x``, ``--xid <x>``  | Create new job object for connection with id ``<x>``             |
+| ``-f``, ``--from <x>`` | Create new job object with a FROM connection with id ``<x>``     |
 +------------------------+------------------------------------------------------------------+
-| ``-t``, ``--type <t>`` | Create new job object with type ``<t>`` (``import``, ``export``) |
+| ``-t``, ``--to <t>``   | Create new job object with a TO connection with id ``<t>``       |
 +------------------------+------------------------------------------------------------------+

 Example: ::

-  create job --xid 1
+  create job --from 1 --to 2

 Update Command
 --------------

View File

@@ -26,17 +26,15 @@ using the code of built-in connector ( ``GenericJdbcConnector`` ) as example.
 What is Connector?
 ++++++++++++++++++

-Connector provides interaction with external databases.
-Connector reads data from databases for import,
-and write data to databases for export.
-Interaction with Hadoop is taken cared by common modules of Sqoop 2 framework.
+The connector provides the facilities to interact with external data sources.
+The connector can read from, or write to, a data source.

 When do we add a new connector?
 ===============================
 You add a new connector when you need to extract data from a new data source, or load
 data to a new target.

 In addition to the connector API, Sqoop 2 also has an engine interface.
-At the moment the only engine is MapReduce,but we may support additional engines in the future.
+At the moment the only engine is MapReduce, but we may support additional engines in the future.
 Since many parallel execution engines are capable of reading/writing data
 there may be a question of whether support for specific data stores should be done
 through a new connector or new engine.
@@ -51,57 +49,73 @@ Connector Implementation

 The ``SqoopConnector`` class defines functionality
 which must be provided by Connectors.
-Each Connector must extends ``SqoopConnector`` and overrides methods shown below.
+Each Connector must extend ``SqoopConnector`` and override the methods shown below.
 ::

   public abstract String getVersion();
   public abstract ResourceBundle getBundle(Locale locale);
-  public abstract Class getConnectionConfigurationClass();
-  public abstract Class getJobConfigurationClass(MJob.Type jobType);
-  public abstract Importer getImporter();
-  public abstract Exporter getExporter();
+  public abstract Class getLinkConfigurationClass();
+  public abstract Class getJobConfigurationClass(Direction direction);
+  public abstract From getFrom();
+  public abstract To getTo();
   public abstract Validator getValidator();
   public abstract MetadataUpgrader getMetadataUpgrader();

-The ``getImporter`` method returns Importer_ instance
-which is a placeholder for the modules needed for import.
-The ``getExporter`` method returns Exporter_ instance
-which is a placeholder for the modules needed for export.
+Connectors can optionally override the following methods:
+::
+
+  public List<Direction> getSupportedDirections();
+
+The ``getFrom`` method returns a From_ instance,
+which is a placeholder for the modules needed to read from a data source.
+The ``getTo`` method returns a To_ instance,
+which is a placeholder for the modules needed to write to a data source.

 Methods such as ``getBundle`` , ``getConnectionConfigurationClass`` ,
 ``getJobConfigurationClass`` and ``getValidator``
 are concerned with `Connector configurations`_ .

-Importer
-========
-
-Connector's ``getImporter`` method returns ``Importer`` instance
-which is a placeholder for the modules needed for import
-such as Partitioner_ and Extractor_ .
-
-Built-in ``GenericJdbcConnector`` defines ``Importer`` like this.
+The ``getSupportedDirections`` method returns a list of directions
+that a connector supports. This should be some subset of:
 ::

-  private static final Importer IMPORTER = new Importer(
-      GenericJdbcImportInitializer.class,
-      GenericJdbcImportPartitioner.class,
-      GenericJdbcImportExtractor.class,
-      GenericJdbcImportDestroyer.class);
+  public List<Direction> getSupportedDirections() {
+    return Arrays.asList(new Direction[]{
+        Direction.FROM,
+        Direction.TO
+    });
+  }
+
+From
+====
+
+The connector's ``getFrom`` method returns a ``From`` instance,
+which is a placeholder for the modules needed for reading
+from a data source, such as Partitioner_ and Extractor_ .
+
+The built-in ``GenericJdbcConnector`` defines ``From`` like this.
+::
+
+  private static final From FROM = new From(
+        GenericJdbcFromInitializer.class,
+        GenericJdbcPartitioner.class,
+        GenericJdbcExtractor.class,
+        GenericJdbcFromDestroyer.class);

   ...

   @Override
-  public Importer getImporter() {
-    return IMPORTER;
+  public From getFrom() {
+    return FROM;
   }

 Extractor
 ---------

-Extractor (E for ETL) extracts data from external database and
-writes it to Sqoop framework for import.
+Extractor (E for ETL) extracts data from an external database.

-Extractor must overrides ``extract`` method.
+Extractor must override the ``extract`` method.
 ::
@@ -114,7 +128,13 @@ Extractor must override the ``extract`` method.
 The ``extract`` method extracts data from database in some way and
 writes it to ``DataWriter`` (provided by context) as `Intermediate representation`_ .

-Extractor must iterates in the ``extract`` method until the data from database exhausts.
+Extractors use Writers provided by the ExtractorContext to send a record through the
+framework.
+::
+
+  context.getDataWriter().writeArrayRecord(array);
+
+The extractor must iterate through the entire dataset in the ``extract`` method.
 ::

   while (resultSet.next()) {
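The loop above is cut off by the hunk boundary. For orientation, a complete ``extract`` in the spirit of the generic JDBC connector might look roughly like the sketch below; the ``Example*`` parameter types and the ``runQuery`` helper are hypothetical stand-ins, not the connector's actual code::

  // Hypothetical extractor: stream every row of a JDBC result set
  // through the framework as array records.
  public void extract(ExtractorContext context,
                      ExampleLinkConfiguration linkConfig,
                      ExampleFromJobConfiguration jobConfig,
                      ExamplePartition partition) {
    try (ResultSet resultSet = runQuery(partition)) { // hypothetical helper
      int columns = resultSet.getMetaData().getColumnCount();
      while (resultSet.next()) {
        Object[] array = new Object[columns];
        for (int i = 0; i < columns; i++) {
          array[i] = resultSet.getObject(i + 1); // JDBC columns are 1-based
        }
        context.getDataWriter().writeArrayRecord(array);
      }
    } catch (SQLException e) {
      throw new RuntimeException("Extraction failed", e);
    }
  }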
@@ -127,9 +147,9 @@ Extractor must iterates in the ``extract`` method until the data from database exhausts.
 Partitioner
 -----------

-Partitioner creates ``Partition`` instances based on configurations.
+The Partitioner creates ``Partition`` instances based on configurations.
 The number of ``Partition`` instances is decided
-based on the value users specified as the numbers of ectractors
+based on the value users specified as the number of extractors
 in job configuration.

 ``Partition`` instances are passed to Extractor_ as the argument of ``extract`` method.
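As an illustration of that contract, a numeric range partitioner could be sketched as follows, assuming ``PartitionerContext`` exposes the requested extractor count as ``getMaxPartitions()``; ``ExamplePartition`` and the config fields are hypothetical, and real partitioners also need type-aware splitting::

  // Hypothetical partitioner: split [min, max] into at most
  // context.getMaxPartitions() contiguous ranges.
  public List<Partition> getPartitions(PartitionerContext context,
                                       ExampleLinkConfiguration linkConfig,
                                       ExampleFromJobConfiguration jobConfig) {
    long min = jobConfig.partitionMin;  // hypothetical: discovered by Initializer
    long max = jobConfig.partitionMax;
    long slices = context.getMaxPartitions();

    List<Partition> partitions = new LinkedList<Partition>();
    long step = Math.max(1, (max - min + 1) / slices);
    for (long lo = min; lo <= max; lo += step) {
      long hi = Math.min(lo + step - 1, max);
      partitions.add(new ExamplePartition(lo, hi)); // hypothetical Partition subclass
    }
    return partitions;
  }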
@@ -157,35 +177,35 @@ for doing preparation such as adding dependent jar files.

 Destroyer is instantiated after MapReduce job is finished for clean up.

-Exporter
-========
+To
+==

-Connector's ``getExporter`` method returns ``Exporter`` instance
-which is a placeholder for the modules needed for export
-such as Loader_ .
+The Connector's ``getTo`` method returns a ``To`` instance,
+which is a placeholder for the modules needed for writing
+to a data source, such as Loader_ .

-Built-in ``GenericJdbcConnector`` defines ``Exporter`` like this.
+The built-in ``GenericJdbcConnector`` defines ``To`` like this.
 ::

-  private static final Exporter EXPORTER = new Exporter(
-      GenericJdbcExportInitializer.class,
-      GenericJdbcExportLoader.class,
-      GenericJdbcExportDestroyer.class);
+  private static final To TO = new To(
+      GenericJdbcToInitializer.class,
+      GenericJdbcLoader.class,
+      GenericJdbcToDestroyer.class);

   ...

   @Override
-  public Exporter getExporter() {
-    return EXPORTER;
+  public To getTo() {
+    return TO;
   }

 Loader
 ------

-Loader (L for ETL) receives data from Sqoop framework and
-loads it to external database.
+A loader (L for ETL) receives data from the Sqoop framework and
+loads it to an external database.

-Loader must overrides ``load`` method.
+Loaders must override the ``load`` method.
 ::

   public abstract void load(LoaderContext context,
@@ -195,7 +215,7 @@ Loaders must override the ``load`` method.
 The ``load`` method reads data from ``DataReader`` (provided by context)
 in `Intermediate representation`_ and loads it to database in some way.

-Loader must iterates in the ``load`` method until the data from ``DataReader`` exhausts.
+Loader must iterate in the ``load`` method until the data from ``DataReader`` is exhausted.
 ::

   while ((array = context.getDataReader().readArrayRecord()) != null) {
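The hunk cuts the loop off; filled out, a minimal ``load`` might read as follows (hypothetical ``Example*`` types and ``openInsertStatement`` helper; batching and error handling omitted)::

  // Hypothetical loader: drain records from the DataReader and
  // insert each one into the target table over JDBC.
  public void load(LoaderContext context,
                   ExampleLinkConfiguration linkConfig,
                   ExampleToJobConfiguration jobConfig) throws Exception {
    PreparedStatement insert = openInsertStatement(jobConfig); // hypothetical helper
    Object[] array;
    while ((array = context.getDataReader().readArrayRecord()) != null) {
      for (int i = 0; i < array.length; i++) {
        insert.setObject(i + 1, array[i]); // JDBC parameters are 1-based
      }
      insert.executeUpdate();
    }
    insert.close();
  }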
@@ -218,7 +238,7 @@ Connector Configurations

 Connector specifications
 ========================

-Framework of the Sqoop loads definitions of connectors
+Sqoop loads definitions of connectors
 from the file named ``sqoopconnector.properties``
 which each connector implementation provides.
 ::
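The example file contents fall outside this hunk. For reference, the file is a short Java-properties declaration along these lines (the key shown follows the generic JDBC connector; treat it as illustrative)::

  # Declares the connector class for Sqoop to load (illustrative).
  org.apache.sqoop.connector.class = org.apache.sqoop.connector.jdbc.GenericJdbcConnector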
@@ -231,7 +251,7 @@ which each connector implementation provides.

 Configurations
 ==============

-Implementation of ``SqoopConnector`` overrides methods such as
+Implementations of ``SqoopConnector`` override methods such as
 ``getConnectionConfigurationClass`` and ``getJobConfigurationClass``
 returning a configuration class.
 ::
@@ -242,12 +262,12 @@ returning a configuration class.
   }

   @Override
-  public Class getJobConfigurationClass(MJob.Type jobType) {
-    switch (jobType) {
-      case IMPORT:
-        return ImportJobConfiguration.class;
-      case EXPORT:
-        return ExportJobConfiguration.class;
+  public Class getJobConfigurationClass(Direction direction) {
+    switch (direction) {
+      case FROM:
+        return FromJobConfiguration.class;
+      case TO:
+        return ToJobConfiguration.class;
       default:
         return null;
     }
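To see these overrides in one place, here is a skeletal connector over the new direction-based API; every ``Example*`` name is hypothetical, and the validator and metadata-upgrader overrides are elided::

  // Hypothetical connector skeleton over the direction-based API.
  public class ExampleConnector extends SqoopConnector {

    private static final From FROM = new From(
        ExampleFromInitializer.class,
        ExamplePartitioner.class,
        ExampleExtractor.class,
        ExampleFromDestroyer.class);

    @Override
    public String getVersion() {
      return "1.0";
    }

    @Override
    public ResourceBundle getBundle(Locale locale) {
      return ResourceBundle.getBundle("example-connector-config", locale);
    }

    @Override
    public Class getLinkConfigurationClass() {
      return ExampleLinkConfiguration.class;
    }

    @Override
    public Class getJobConfigurationClass(Direction direction) {
      // This sketch only reads, so only FROM is configured.
      return direction == Direction.FROM
          ? ExampleFromJobConfiguration.class : null;
    }

    @Override
    public From getFrom() {
      return FROM;
    }

    @Override
    public To getTo() {
      return null; // read-only connector in this sketch
    }

    @Override
    public List<Direction> getSupportedDirections() {
      return Arrays.asList(new Direction[]{ Direction.FROM });
    }

    // getValidator() and getMetadataUpgrader() are elided for brevity.
  }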
@@ -260,7 +280,7 @@ Annotations such as
 are provided for defining configurations of each connector
 using these models.

-``ConfigurationClass`` is place holder for ``FormClasses`` .
+``ConfigurationClass`` is a placeholder for ``FormClasses`` .
 ::

   @ConfigurationClass
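The body of the annotated class is elided here. A representative sketch, assuming the Sqoop 2 model annotations (``@ConfigurationClass``, ``@FormClass``, ``@Form``, ``@Input``) and hypothetical field names, would be::

  @ConfigurationClass
  public class ConnectionConfiguration {

    // Each @Form groups related inputs (hypothetical form name).
    @Form public ConnectionForm connection;
  }

  @FormClass
  public class ConnectionForm {

    // A plain string input rendered by clients (hypothetical input name).
    @Input(size = 128) public String jdbcDriver;
  }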
@@ -323,16 +343,12 @@ Validator validates configurations set by users.

 Internal of Sqoop2 MapReduce Job
 ++++++++++++++++++++++++++++++++

-Sqoop 2 provides common MapReduce modules such as ``SqoopMapper`` and ``SqoopReducer``
-for the both of import and export.
+Sqoop 2 provides common MapReduce modules such as ``SqoopMapper`` and ``SqoopReducer``.

-- For import, ``Extractor`` provided by connector extracts data from databases,
-  and ``Loader`` provided by Sqoop2 loads data into Hadoop.
-- For export, ``Extractor`` provided by Sqoop2 extracts data from Hadoop,
-  and ``Loader`` provided by connector loads data into databases.
+When reading from a data source, the ``Extractor`` provided by the FROM connector extracts data from a database,
+and the ``Loader``, provided by the TO connector, loads data into another database.

-The diagram below describes the initialization phase of IMPORT job.
+The diagram below describes the initialization phase of a job.
 ``SqoopInputFormat`` creates splits using ``Partitioner`` .
 ::
@@ -351,8 +367,8 @@ The diagram below describes the initialization phase of a job.
     |-------------------------------------------------->|SqoopSplit|
     |                 |                                  `----+-----'

-The diagram below describes the map phase of IMPORT job.
-``SqoopMapper`` invokes extractor's ``extract`` method.
+The diagram below describes the map phase of a job.
+``SqoopMapper`` invokes the FROM connector's extractor's ``extract`` method.
 ::

     ,-----------.
@@ -378,8 +394,8 @@ The diagram below describes the map phase of a job.
     |                 |                    | context.write
     |                 |                    |-------------------------->

-The diagram below decribes the reduce phase of EXPORT job.
-``OutputFormat`` invokes loader's ``load`` method (via ``SqoopOutputFormatLoadExecutor`` ).
+The diagram below describes the reduce phase of a job.
+``OutputFormat`` invokes the TO connector's loader's ``load`` method (via ``SqoopOutputFormatLoadExecutor`` ).
 ::

     ,-------.                 ,---------------------.


@@ -418,7 +418,7 @@ So far, the resource contains only explanations for fields of forms. For example
       "name":"generic-jdbc-connector",
       "class":"org.apache.sqoop.connector.jdbc.GenericJdbcConnector",
       "job-forms":{
-        "IMPORT":[
+        "FROM":[
           {
             "id":2,
             "inputs":[
@@ -475,7 +475,7 @@ So far, the resource contains only explanations for fields of forms. For example
             "type":"CONNECTION"
           }
         ],
-        "EXPORT":[
+        "TO":[
           {
             "id":3,
             "inputs":[
@@ -624,35 +624,7 @@ the id of the form to track these parameter inputs.
       },
       "framework-version":"1",
       "job-forms":{
-        "IMPORT":[
-          {
-            "id":5,
-            "inputs":[
-              {
-                "id":18,
-                "values":"HDFS",
-                "name":"output.storageType",
-                "type":"ENUM",
-                "sensitive":false
-              },
-              {
-                "id":19,
-                "values":"TEXT_FILE,SEQUENCE_FILE",
-                "name":"output.outputFormat",
-                "type":"ENUM",
-                "sensitive":false
-              },
-              {
-                "id":20,
-                "name":"output.outputDirectory",
-                "type":"STRING",
-                "size":255,
-                "sensitive":false
-              }
-            ],
-            "name":"output",
-            "type":"CONNECTION"
-          },
+        "FROM":[
           {
             "id":6,
             "inputs":[
@@ -673,23 +645,9 @@ the id of the form to track these parameter inputs.
             "type":"CONNECTION"
           }
         ],
-        "EXPORT":[
+        "TO":[
           {
-            "id":7,
-            "inputs":[
-              {
-                "id":23,
-                "name":"input.inputDirectory",
-                "type":"STRING",
-                "size":255,
-                "sensitive":false
-              }
-            ],
-            "name":"input",
-            "type":"CONNECTION"
-          },
-          {
-            "id":8,
+            "id":1,
             "inputs":[
               {
                 "id":24,
@@ -1272,7 +1230,7 @@ Provide the id of the job in the url [jid] part. If you provide ``all`` in the [
         }
       ],
       "connector-id":1,
-      "type":"IMPORT",
+      "type":"FROM",
       "framework":[
         {
           "id":5,


@@ -65,7 +65,7 @@ public List<Direction> getSupportedDirections() {
   /**
    * @return Get job configuration group class per direction type or null if not supported
    */
-  public abstract Class getJobConfigurationClass(Direction jobType);
+  public abstract Class getJobConfigurationClass(Direction direction);

   /**
    * @return an <tt>From</tt> that provides classes for performing import.