From 7353df98091c1e002b441a1b053e9c1feeef1867 Mon Sep 17 00:00:00 2001
From: Jarek Jarcec Cecho
Date: Sun, 10 Nov 2013 18:50:21 -0800
Subject: [PATCH] SQOOP-1225: Sqoop 2 documentation for connector development
 (Masatake Iwasaki via Jarek Jarcec Cecho)

---
 docs/src/site/sphinx/ConnectorDevelopment.rst | 234 +++++++++++++++---
 1 file changed, 193 insertions(+), 41 deletions(-)

diff --git a/docs/src/site/sphinx/ConnectorDevelopment.rst b/docs/src/site/sphinx/ConnectorDevelopment.rst
index 918ca007..51213822 100644
--- a/docs/src/site/sphinx/ConnectorDevelopment.rst
+++ b/docs/src/site/sphinx/ConnectorDevelopment.rst
@@ -18,8 +18,10 @@ Sqoop 2 Connector Development
 =============================
 
-This document describes you how to implement connector for Sqoop 2.
+This document describes how to implement a connector for Sqoop 2,
+using the code of the built-in connector ( ``GenericJdbcConnector`` ) as an example.
 
+.. contents::
 
 What is Connector?
 ++++++++++++++++++
@@ -33,9 +35,9 @@ Interaction with Hadoop is taken cared by common modules of Sqoop 2 framework.
 Connector Implementation
 ++++++++++++++++++++++++
 
-The SqoopConnector class defines functionality
+The ``SqoopConnector`` class defines functionality
 which must be provided by Connectors.
-Each Connector must extends SqoopConnector and overrides methods shown below.
+Each Connector must extend ``SqoopConnector`` and override the methods shown below.
 ::
 
   public abstract String getVersion();
@@ -47,24 +49,24 @@ Each Connector must extends SqoopConnector and overrides methods shown below.
   public abstract Validator getValidator();
   public abstract MetadataUpgrader getMetadataUpgrader();
 
-The getImporter method returns Importer_ instance
+The ``getImporter`` method returns an Importer_ instance,
 which is a placeholder for the modules needed for import.
-The getExporter method returns Exporter_ instance
+The ``getExporter`` method returns an Exporter_ instance,
 which is a placeholder for the modules needed for export.
-Methods such as getBundle, getConnectionConfigurationClass,
-getJobConfigurationClass and getValidator
+Methods such as ``getBundle`` , ``getConnectionConfigurationClass`` ,
+``getJobConfigurationClass`` and ``getValidator``
 are concerned to `Connector configurations`_ .
 
 
 Importer
 ========
 
-Connector#getImporter method returns Importer instance
+The connector's ``getImporter`` method returns an ``Importer`` instance,
 which is a placeholder for the modules needed for import
 such as Partitioner_ and Extractor_ .
 
-Built-in GenericJdbcConnector defines Importer like this.
+The built-in ``GenericJdbcConnector`` defines its ``Importer`` like this.
 ::
 
   private static final Importer IMPORTER = new Importer(
@@ -87,7 +89,7 @@ Extractor
 Extractor (E for ETL) extracts data from external database and writes it
 to Sqoop framework for import.
 
-Extractor must overrides extract method.
+An Extractor must override the ``extract`` method.
 ::
 
   public abstract void extract(ExtractorContext context,
@@ -95,10 +97,10 @@ Extractor must overrides extract method.
                                JobConfiguration jobConfiguration,
                                Partition partition);
 
-The extract method extracts data from database in some way and
-writes it to DataWriter (provided by context) as `Intermediate representation`_ .
+The ``extract`` method extracts data from the database in some way and
+writes it to the ``DataWriter`` (provided by the context) as `Intermediate representation`_ .
 
-Extractor must iterates in the extract method until the data from database exhausts.
+The Extractor must iterate in the ``extract`` method until the data from the database is exhausted.
 ::
 
   while (resultSet.next()) {
@@ -111,13 +113,16 @@ Extractor must iterates in the extract method until the data from database exhau
 Partitioner
 -----------
 
-Partitioner creates Partition instances based on configurations.
-The number of Partition instances is interpreted as the number of map tasks.
-Partition instances are passed to Extractor_ as the argument of extract method.
+The Partitioner creates ``Partition`` instances based on configurations.
+The number of ``Partition`` instances is decided
+based on the value users specified as the number of extractors
+in the job configuration.
+
+``Partition`` instances are passed to Extractor_ as an argument of the ``extract`` method.
 Extractor_ determines which portion of the data to extract by Partition.
 
 There is no actual convention for Partition classes
-other than being actually Writable and toString()-able.
+other than being actually ``Writable`` and ``toString()`` -able.
 ::
 
   public abstract class Partition {
@@ -126,7 +131,7 @@ other than being actually Writable and toString()-able.
     public abstract String toString();
   }
 
-Connectors can define the design of Partition on their own.
+Connectors can define the design of ``Partition`` on their own.
 
 
 Initializer and Destroyer
 -------------------------
@@ -141,10 +146,10 @@ Destroyer is instantiated after MapReduce job is finished for clean up.
 Exporter
 ========
 
-Connector#getExporter method returns Exporter instance
+The connector's ``getExporter`` method returns an ``Exporter`` instance,
 which is a placeholder for the modules needed for export
 such as Loader_ .
 
-Built-in GenericJdbcConnector defines Exporter like this.
+The built-in ``GenericJdbcConnector`` defines its ``Exporter`` like this.
 ::
 
   private static final Exporter EXPORTER = new Exporter(
@@ -166,17 +171,17 @@ Loader
 Loader (L for ETL) receives data from Sqoop framework and loads it to
 external database.
 
-Loader must overrides load method.
+A Loader must override the ``load`` method.
 ::
 
   public abstract void load(LoaderContext context,
                             ConnectionConfiguration connectionConfiguration,
                             JobConfiguration jobConfiguration) throws Exception;
 
-The load method reads data from DataReader (provided by context)
+The ``load`` method reads data from the ``DataReader`` (provided by the context)
 in `Intermediate representation`_ and loads it to database in some way.
 
-Loader must iterates in the load method until the data from DataReader exhausts.
+The Loader must iterate in the ``load`` method until the data from the ``DataReader`` is exhausted.
 ::
 
   while ((array = context.getDataReader().readArrayRecord()) != null) {
@@ -196,26 +201,103 @@ Destroyer is instantiated after MapReduce job is finished for clean up.
 Connector Configurations
 ++++++++++++++++++++++++
 
+Connector specifications
+========================
+
+The Sqoop framework loads the definitions of connectors
+from the file named ``sqoopconnector.properties`` ,
+which each connector implementation provides.
+::
+
+  # Generic JDBC Connector Properties
+  org.apache.sqoop.connector.class = org.apache.sqoop.connector.jdbc.GenericJdbcConnector
+  org.apache.sqoop.connector.name = generic-jdbc-connector
+
+
 Configurations
 ==============
 
-The definition of the configurations are represented
-by models defined in org.apache.sqoop.model package.
+An implementation of ``SqoopConnector`` overrides methods such as
+``getConnectionConfigurationClass`` and ``getJobConfigurationClass``
+to return its configuration classes.
+::
 
+  @Override
+  public Class getConnectionConfigurationClass() {
+    return ConnectionConfiguration.class;
+  }
 
-ConnectionConfigurationClass
-----------------------------
+  @Override
+  public Class getJobConfigurationClass(MJob.Type jobType) {
+    switch (jobType) {
+      case IMPORT:
+        return ImportJobConfiguration.class;
+      case EXPORT:
+        return ExportJobConfiguration.class;
+      default:
+        return null;
+    }
+  }
 
+Configurations are represented
+by models defined in the ``org.apache.sqoop.model`` package.
+Annotations such as
+``ConfigurationClass`` , ``FormClass`` , ``Form`` and ``Input``
+are provided for defining the configurations of each connector
+using these models.
 
-JobConfigurationClass
----------------------
+A ``ConfigurationClass`` is a placeholder for ``FormClass`` instances.
+::
+
+  @ConfigurationClass
+  public class ConnectionConfiguration {
+
+    @Form public ConnectionForm connection;
+
+    public ConnectionConfiguration() {
+      connection = new ConnectionForm();
+    }
+  }
+
+Each ``FormClass`` defines the names and types of configuration inputs.
+::
+
+  @FormClass
+  public class ConnectionForm {
+    @Input(size = 128) public String jdbcDriver;
+    @Input(size = 128) public String connectionString;
+    @Input(size = 40) public String username;
+    @Input(size = 40, sensitive = true) public String password;
+    @Input public Map jdbcProperties;
+  }
 
 
 ResourceBundle
 ==============
 
-Resources for Configurations_ are stored in properties file
-accessed by getBundle method of the Connector.
+Resources used by client user interfaces are defined in a properties file.
+::
+
+  # jdbc driver
+  connection.jdbcDriver.label = JDBC Driver Class
+  connection.jdbcDriver.help = Enter the fully qualified class name of the JDBC \
+     driver that will be used for establishing this connection.
+
+  # connect string
+  connection.connectionString.label = JDBC Connection String
+  connection.connectionString.help = Enter the value of JDBC connection string to be \
+     used by this connector for creating connections.
+
+  ...
+
+Those resources are loaded by the ``getBundle`` method of the connector.
+::
+
+  @Override
+  public ResourceBundle getBundle(Locale locale) {
+    return ResourceBundle.getBundle(
+        GenericJdbcConnectorConstants.RESOURCE_BUNDLE_NAME, locale);
+  }
 
 
 Validator
@@ -227,24 +309,94 @@ Validator validates configurations set by users.
 Internal of Sqoop2 MapReduce Job
 ++++++++++++++++++++++++++++++++
 
-Sqoop 2 provides common MapReduce modules such as SqoopMapper and SqoopReducer
+Sqoop 2 provides common MapReduce modules such as ``SqoopMapper`` and ``SqoopReducer``
 for the both of import and export.
 
-- InputFormat create splits using Partitioner.
+- For import, the ``Extractor`` provided by the connector extracts data from databases,
+  and the ``Loader`` provided by Sqoop 2 loads data into Hadoop.
-- SqoopMapper invokes Extractor's extract method.
+- For export, the ``Extractor`` provided by Sqoop 2 extracts data from Hadoop,
+  and the ``Loader`` provided by the connector loads data into databases.
 
-- SqoopReducer do no actual works.
+The diagram below describes the initialization phase of an IMPORT job.
+``SqoopInputFormat`` creates splits using ``Partitioner`` .
+::
 
-- OutputFormat invokes Loader's load method (via SqoopOutputFormatLoadExecutor).
+    ,----------------.          ,-----------.
+    |SqoopInputFormat|          |Partitioner|
+    `-------+--------'          `-----+-----'
+ getSplits  |                         |
+ ---------->|                         |
+            |      getPartitions      |
+            |------------------------>|
+            |                         |          ,---------.
+            |                         |--------->|Partition|
+            |                         |          `----+----'
+            |<- - - - - - - - - - - - |               |
+            |                         |               |          ,----------.
+            |---------------------------------------------------->|SqoopSplit|
+            |                         |               |          `----+-----'
 
-.. todo: sequence diagram like figure.
+The diagram below describes the map phase of an IMPORT job.
+``SqoopMapper`` invokes the extractor's ``extract`` method.
+::
 
-For import, Extractor provided by Connector extracts data from databases,
-and Loader provided by Sqoop2 loads data into Hadoop.
+    ,-----------.
+    |SqoopMapper|
+    `-----+-----'
+  run     |
+ -------->|                            ,-------------.
+          |--------------------------->|MapDataWriter|
+          |                            `------+------'
+          |     ,---------.                   |
+          |---->|Extractor|                   |
+          |     `----+----'                   |
+          | extract  |                        |
+          |--------->|                        |
+          |          |                        |
+  read from DB       |                        |
+ <-------------------|        write*          |
+          |          |----------------------->|
+          |          |                        |     ,----.
+          |          |                        |---->|Data|
+          |          |                        |     `-+--'
+          |          |                        |
+          |          |                        |  context.write
+          |          |                        |-------------------------->
+
+The diagram below describes the reduce phase of an EXPORT job.
+``OutputFormat`` invokes the loader's ``load`` method (via ``SqoopOutputFormatLoadExecutor`` ).
+::
+
+  ,-------.            ,---------------------.
+  |Reducer|            |SqoopNullOutputFormat|
+  `---+---'            `----------+----------'
+      |                           |   ,-----------------------------.
+ | |-> |SqoopOutputFormatLoadExecutor| + | | `--------------+--------------' ,----. + | | |---------------------> |Data| + | | | `-+--' + | | | ,-----------------. | + | | |-> |SqoopRecordWriter| | + getRecordWriter | | `--------+--------' | + ----------------------->| getRecordWriter | | | + | |----------------->| | | ,--------------. + | | |-----------------------------> |ConsumerThread| + | | | | | `------+-------' + | |<- - - - - - - - -| | | | ,------. + <- - - - - - - - - - - -| | | | |--->|Loader| + | | | | | | `--+---' + | | | | | | | + | | | | | | load | + run | | | | | |------>| + ----->| | write | | | | | + |------------------------------------------------>| setContent | | read* | + | | | |----------->| getContent |<------| + | | | | |<-----------| | + | | | | | | - - ->| + | | | | | | | write into DB + | | | | | | |--------------> -For export, Extractor provided Sqoop2 exracts data from Hadoop, -and Loader provided by Connector loads data into databases. .. _`Intermediate representation`: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Intermediate+representation