From d4ff097ea33721b706811968cb1903fa4be6fa46 Mon Sep 17 00:00:00 2001 From: Jarek Jarcec Cecho Date: Sun, 13 Jul 2014 20:18:04 -0700 Subject: [PATCH] SQOOP-1344: Add documentation for Oracle connector (David Robson via Jarek Jarcec Cecho) --- src/docs/user/connectors.txt | 1299 ++++++++++++++++++++++++++++++++++ 1 file changed, 1299 insertions(+) diff --git a/src/docs/user/connectors.txt b/src/docs/user/connectors.txt index 379cbd99..c04900f3 100644 --- a/src/docs/user/connectors.txt +++ b/src/docs/user/connectors.txt @@ -463,3 +463,1302 @@ representation. It is suggested that the null value be specified as empty string for performance and consistency. + +Data Connector for Oracle and Hadoop +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +About +^^^^^ + +The Data Connector for Oracle and Hadoop is now included in Sqoop. + +It can be enabled by specifying the +\--direct+ argument for your import or +export job. + +Jobs +++++ + +The Data Connector for Oracle and Hadoop inspects each Sqoop job and assumes +responsibility for the ones it can perform better than the Oracle manager built +into Sqoop. + +Data Connector for Oracle and Hadoop accepts responsibility for the following +Sqoop Job types: + +- *Import* jobs that are *Non-Incremental*. +- *Export* jobs +- Data Connector for Oracle and Hadoop does not accept responsibility for other +Sqoop job types. For example Data Connector for Oracle and Hadoop does not +accept *eval* jobs etc. + +Data Connector for Oracle and Hadoop accepts responsibility for those Sqoop Jobs +with the following attributes: + +- Oracle-related +- Table-Based - Jobs where the table argument is used and the specified object +is a table. ++ +NOTE: Data Connector for Oracle and Hadoop does not process index-organized +tables. + +- There are at least 2 mappers — Jobs where the Sqoop command-line does not +include: +--num-mappers 1+ + +How The Standard Oracle Manager Works for Imports ++++++++++++++++++++++++++++++++++++++++++++++++++ + +The Oracle manager built into Sqoop uses a range-based query for each mapper. +Each mapper executes a query of the form: + +---- +SELECT * FROM sometable WHERE id >= lo AND id < hi +---- + +The *lo* and *hi* values are based on the number of mappers and the minimum and +maximum values of the data in the column the table is being split by. + +If no suitable index exists on the table then these queries result in full +table-scans within Oracle. Even with a suitable index, multiple mappers may +fetch data stored within the same Oracle blocks, resulting in redundant IO +calls. + +How The Data Connector for Oracle and Hadoop Works for Imports +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ + +The Data Connector for Oracle and Hadoop generates queries for the mappers of +the form: + +---- +SELECT * + FROM sometable + WHERE rowid >= dbms_rowid.rowid_create(1, 893, 1, 279, 0) AND + rowid <= dbms_rowid.rowid_create(1, 893, 1, 286, 32767) +---- + +The Data Connector for Oracle and Hadoop queries ensure that: + +- No two mappers read data from the same Oracle block. This minimizes +redundant IO. +- The table does not require indexes. +- The Sqoop command line does not need to specify a +--split-by+ column. + +Data Connector for Oracle and Hadoop Exports +++++++++++++++++++++++++++++++++++++++++++++ + +Benefits of the Data Connector for Oracle and Hadoop: + +- *Merge-Export facility* - Update Oracle tables by modifying changed rows AND +inserting rows from the HDFS file that did not previously exist in the Oracle +table. 
The Connector for Oracle and Hadoop's Merge-Export is unique - there is
+no Sqoop equivalent.
+- *Lower impact on the Oracle database* - Update the rows in the Oracle table
+that have changed, not all rows in the Oracle table. This has performance
+benefits and reduces the impact of the query on Oracle (for example, the Oracle
+redo logs).
+- *Improved performance* - With partitioned tables, mappers utilize temporary
+Oracle tables which allow parallel inserts and direct path writes.
+
+Requirements
+^^^^^^^^^^^^
+
+Ensure The Oracle Database JDBC Driver Is Set Up Correctly
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+
+Ensure the Oracle Database 11g Release 2 JDBC driver is set up correctly on
+your system. This driver is required for Sqoop to work with Oracle.
+
+The Oracle Database 11g Release 2 JDBC driver file is +ojdbc6.jar+ (3.2 MB).
+
+If this file is not on your system then download it from:
+http://www.oracle.com/technetwork/database/features/jdbc/index-091264.html
+
+This file should be put into the +$SQOOP_HOME/lib+ directory.
+
+Oracle Roles and Privileges
++++++++++++++++++++++++++++
+
+The Oracle user for The Data Connector for Oracle and Hadoop requires the
+following roles and privileges:
+
+- +create session+
+
+In addition, the user must have the +select any dictionary+ privilege or the
++select_catalog_role+ role or all of the following object privileges:
+
+- +select on v_$instance+
+- +select on dba_tables+
+- +select on dba_tab_columns+
+- +select on dba_objects+
+- +select on dba_extents+
+- +select on dba_segments+ - Required for Sqoop imports only
+- +select on v_$database+ - Required for Sqoop imports only
+- +select on v_$parameter+ - Required for Sqoop imports only
+
+NOTE: The user also requires the +alter session+ privilege to make use of
+session tracing functionality. See
+"oraoop.oracle.session.initialization.statements" for more information.
+
+Additional Oracle Roles And Privileges Required for Export
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+
+The Oracle user for Data Connector for Oracle and Hadoop requires:
+
+- Quota on the tablespace in which the Oracle export tables are located.
++ +An example Oracle command to achieve this is ++ +---- +alter user username quota unlimited on tablespace +---- + +- The following privileges: ++ +[frame="topbot",options="header",cols="2*v"] +|=============================================================================== +|Type of Export |Privileges required +|All Export +|+create table+ ++select on dba_tab_partitions+ ++select on dba_tab_subpartitions+ ++select on dba_indexes+ ++select on dba_ind_columns+ +|Insert-Export with a template +table into another schema +|+select any table+ ++create any table+ ++insert any table+ ++alter any table+ (partitioning) +|Insert-Export without a +template table into another +schema +|+select,insert on table+ (no partitioning) ++select,alter on table+ (partitioning) +|Update-Export into another +schema +|+select,update on table+ (no partitioning) ++select,delete,alter,insert on table+ +(partitioning) +|Merge-Export into another +schema +|+select,insert,update on table+ (no +partitioning) ++select,insert,delete,alter on table+ +(partitioning) +|=============================================================================== + +Supported Data Types +++++++++++++++++++++ + +The following Oracle data types are supported by the Data Connector for +Oracle and Hadoop: + +[frame="none",grid="none",cols="2,1l,2"] +|=============================================================================== +|BINARY_DOUBLE | |NCLOB +|BINARY_FLOAT | |NUMBER +|BLOB | |NVARCHAR2 +|CHAR | |RAW +|CLOB | |ROWID +|DATE | |TIMESTAMP +|FLOAT | |TIMESTAMP WITH TIME ZONE +|INTERVAL DAY TO SECOND | |TIMESTAMP WITH LOCAL TIME ZONE +|INTERVAL YEAR TO MONTH | |URITYPE +|LONG | |VARCHAR2 +|NCHAR | | +|=============================================================================== + +All other Oracle column types are NOT supported. Example Oracle column types NOT +supported by Data Connector for Oracle and Hadoop include: + +[frame="none",grid="none",cols="2,1l,2"] +|=============================================================================== +|All of the ANY types | |BFILE +|All of the MEDIA types | |LONG RAW +|All of the SPATIAL types | |MLSLABEL +|Any type referred to as UNDEFINED | |UROWID +|All custom (user-defined) URI types| |XMLTYPE +|=============================================================================== + +NOTE: Data types RAW, LONG and LOB (BLOB, CLOB and NCLOB) are supported for +Data Connector for Oracle and Hadoop imports. They are not supported for Data +Connector for Oracle and Hadoop exports. + +Execute Sqoop With Data Connector for Oracle and Hadoop +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Connect to Oracle / Oracle RAC +++++++++++++++++++++++++++++++ + +The Sqoop +--connect+ parameter defines the Oracle instance or Oracle RAC to +connect to. It is required with all Sqoop import and export commands. + +Data Connector for Oracle and Hadoop expects the associated connection string +to be of a specific format dependent on whether the Oracle SID or Service +is defined. + ++--connect jdbc:oracle:thin:@OracleServer:OraclePort:OracleSID+ + ++--connect jdbc:oracle:thin:@//OracleServer:OraclePort/OracleService+ + +Connect to An Oracle Database Instance +++++++++++++++++++++++++++++++++++++++ + +[frame="topbot",options="header",cols="2*v"] +|=============================================================================== +|Parameter / Component |Description +|+jdbc:oracle:thin+ +|The Data Connector for Oracle and Hadoop requires the +connection string starts with jdbc:oracle. 
+ +The Data Connector for Oracle and Hadoop has been tested +with the thin driver however it should work equally well +with other drivers such as OCI. +|+OracleServer+ +|The host name of the Oracle server. +|+OraclePort+ +|The port to connect to the Oracle server. +|+OracleSID+ +|The Oracle instance. +|+OracleService+ +|The Oracle Service. +|=============================================================================== + +[NOTE] +================================================================================ +The Hadoop mappers connect to the Oracle database using a dynamically +generated JDBC URL. This is designed to improve performance however it can be +disabled by specifying: + ++-D oraoop.jdbc.url.verbatim=true+ +================================================================================ + +Connect to An Oracle RAC +++++++++++++++++++++++++ + +Use the +--connect+ parameter as above. The connection string should point to +one instance of the Oracle RAC. The listener of the host of this Oracle +instance will locate the other instances of the Oracle RAC. + +NOTE: To improve performance, The Data Connector for Oracle and Hadoop +identifies the active instances of the Oracle RAC and connects each Hadoop +mapper to them in a roundrobin manner. + +If services are defined for this Oracle RAC then use the following parameter +to specify the service name: + ++-D oraoop.oracle.rac.service.name=ServiceName+ + +[frame="topbot",options="header",cols="2*v"] +|=============================================================================== +|Parameter / Component |Description +|+OracleServer:OraclePort:OracleInstance+ +|Name one instance of the Oracle RAC. + +The Data Connector for Oracle and +Hadoop assumes the same port number for +all instances of the Oracle RAC. + +The listener of the host of this Oracle +instance is used to locate other instances of +the Oracle RAC. For more information +enter this command on the host command +line: + ++lsnrctl status+ +|+-D oraoop.oracle.rac.service.name=ServiceName+ +|The service to connect to in the Oracle RAC. + +A connection is made to all instances of +the Oracle RAC associated with the service +given by +ServiceName+. + +If omitted, a connection is made to all +instances of the Oracle RAC. + +The listener of the host of this Oracle +instance needs to know the +ServiceName+ +and all instances of the Oracle RAC. For +more information enter this command on +the host command line: + ++lsnrctl status+ +|=============================================================================== + +Login to The Oracle Instance +++++++++++++++++++++++++++++ + +Login to the Oracle instance on the Sqoop command line: + ++--connect jdbc:oracle:thin:@OracleServer:OraclePort:OracleInstance --username +UserName -P+ + +[frame="topbot",options="header",cols="2*v"] +|=============================================================================== +|Parameter / Component |Description +|+--username UserName+ +|The username to login to the Oracle instance (SID). +|+-P+ +|You will be prompted for the password to login to the Oracle +instance. +|=============================================================================== + +Kill Data Connector for Oracle and Hadoop Jobs +++++++++++++++++++++++++++++++++++++++++++++++ + +Use the Hadoop Job Tracker to kill the Sqoop job, just as you would kill any +other Map-Reduce job. 
+ +$ +hadoop job -kill jobid+ + +To allow an Oracle DBA to kill a Data Connector for Oracle and Hadoop +job (via killing the sessions in Oracle) you need to prevent Map-Reduce from +re-attempting failed jobs. This is done via the following Sqoop +command-line switch: + ++-D mapred.map.max.attempts=1+ + +This sends instructions similar to the following to the console: + +---- +14/07/07 15:24:51 INFO oracle.OraOopManagerFactory: +Note: This Data Connector for Oracle and Hadoop job can be killed via Oracle +by executing the following statement: + begin + for row in (select sid,serial# from v$session where + module='Data Connector for Oracle and Hadoop' and + action='import 20140707152451EST') loop + execute immediate 'alter system kill session ''' || row.sid || + ',' || row.serial# || ''''; + end loop; + end; +---- + +Import Data from Oracle +^^^^^^^^^^^^^^^^^^^^^^^ + +Execute Sqoop. Following is an example command: + +$ +sqoop import --direct --connect ... --table OracleTableName+ + +If The Data Connector for Oracle and Hadoop accepts the job then the following +text is output: + +---- +************************************************** +*** Using Data Connector for Oracle and Hadoop *** +************************************************** +---- + +[NOTE] +================================================================================ +- More information is available on the +--connect+ parameter. See "Connect to +Oracle / Oracle RAC" for more information. + +- If Java runs out of memory the workaround is to specify each mapper's +JVM memory allocation. Add the following parameter for example to allocate 4GB: ++ ++-Dmapred.child.java.opts=-Xmx4000M+ + +- An Oracle optimizer hint is included in the SELECT statement by default. +See "oraoop.import.hint" for more information. ++ +You can alter the hint on the command line as follows: ++ ++-Doraoop.import.hint="NO_INDEX(t)"+ ++ +You can turn off the hint on the command line as follows (notice the space +between the double quotes): ++ ++-Doraoop.import.hint=" "+ +================================================================================ + +Match Hadoop Files to Oracle Table Partitions ++++++++++++++++++++++++++++++++++++++++++++++ + ++-Doraoop.chunk.method={ROWID|PARTITION}+ + +To import data from a partitioned table in such a way that the resulting HDFS +folder structure in Hadoop will match the table’s partitions, set the chunk +method to PARTITION. The alternative (default) chunk method is ROWID. + +[NOTE] +================================================================================ +- For the number of Hadoop files to match the number of Oracle partitions, +set the number of mappers to be greater than or equal to the number of +partitions. +- If the table is not partitioned then value PARTITION will lead to an error. +================================================================================ + +Specify The Partitions To Import +++++++++++++++++++++++++++++++++ + ++-Doraoop.import.partitions=PartitionA,PartitionB --table OracleTableName+ + +Imports +PartitionA+ and +PartitionB+ of +OracleTableName+. + +[NOTE] +================================================================================ +- You can enclose an individual partition name in double quotes to retain the +letter case or if the name has special characters. ++ ++-Doraoop.import.partitions=\'"PartitionA",PartitionB' --table OracleTableName+ ++ +If the partition name is not double quoted then its name will be automatically +converted to upper case, PARTITIONB for above. 
++
+When using double quotes the entire list of partition names must be enclosed in
+single quotes.
++
+If the last partition name in the list is double quoted then there must be a
+comma at the end of the list.
++
++-Doraoop.import.partitions=\'"PartitionA","PartitionB",' --table
+OracleTableName+
+
+- Name each partition to be included. There is no facility to provide a range of
+partition names.
+
+- There is no facility to define sub partitions. The entire partition is
+included/excluded as per the filter.
+================================================================================
+
+Consistent Read: All Mappers Read From The Same Point In Time
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+
++-Doraoop.import.consistent.read={true|false}+
+
+When set to +false+ (the default) each mapper runs a select query. This will
+return potentially inconsistent data if there are a lot of DML operations on
+the table at the time of import.
+
+Set to +true+ to ensure all mappers read from the same point in time. The
+System Change Number (SCN) is passed down to all mappers, which use the Oracle
+Flashback Query to query the table as at that SCN.
+
+[NOTE]
+================================================================================
+- Values +true+ | +false+ are case sensitive.
+- By default the SCN is taken from V$database. You can specify the SCN with the
+following parameter:
++
++-Doraoop.import.consistent.read.scn=12345+
+================================================================================
+
+Export Data into Oracle
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Execute Sqoop. Following is an example command:
+
+$ +sqoop export --direct --connect ... --table OracleTableName --export-dir
+/user/username/tablename+
+
+The Data Connector for Oracle and Hadoop accepts all jobs that export data to
+Oracle. You can verify The Data Connector for Oracle and Hadoop is in use by
+checking that the following text is output:
+
+----
+**************************************************
+*** Using Data Connector for Oracle and Hadoop ***
+**************************************************
+----
+
+[NOTE]
+================================================================================
+- +OracleTableName+ is the Oracle table the data will export into.
+- +OracleTableName+ can be in a schema other than that for the connecting user.
+Prefix the table name with the schema, for example +SchemaName.OracleTableName+.
+- Hadoop tables are picked up from the +/user/username/tablename+ directory.
+- The export will fail if the Hadoop file contains any fields of a data type
+not supported by The Data Connector for Oracle and Hadoop. See
+"Supported Data Types" for more information.
+- The export will fail if the column definitions in the Hadoop table do not
+exactly match the column definitions in the Oracle table.
+- The Data Connector for Oracle and Hadoop indicates if it finds temporary
+tables that it created more than a day ago that still exist. Usually these
+tables can be dropped. The only circumstance when these tables should not be
+dropped is when a Data Connector for Oracle and Hadoop job has been running
+for more than 24 hours and is still running.
+- More information is available on the +--connect+ parameter. See
+"Connect to Oracle / Oracle RAC" for more information.
+================================================================================
+
+Insert-Export
++++++++++++++
+
+Appends data to +OracleTableName+. It does not modify existing data in
++OracleTableName+.
+
+Insert-Export is the default method, executed in the absence of the
++--update-key+ parameter. All rows in the HDFS file in
++/user/UserName/TableName+ are inserted into +OracleTableName+. No
+change is made to pre-existing data in +OracleTableName+.
+
+$ +sqoop export --direct --connect ... --table OracleTableName --export-dir
+/user/username/tablename+
+
+[NOTE]
+================================================================================
+- If +OracleTableName+ was previously created by The Data Connector for Oracle
+and Hadoop with partitions then this export will create a new partition for the
+data being inserted.
+- When creating +OracleTableName+ specify a template. See
+"Create Oracle Tables" for more information.
+================================================================================
+
+Update-Export
++++++++++++++
+
++--update-key OBJECT+
+
+Updates existing rows in +OracleTableName+.
+
+Rows in the HDFS file in +/user/UserName/TableName+ are matched to rows in
++OracleTableName+ by the +OBJECT+ column. Rows that match are copied from the
+HDFS file to the Oracle table. No action is taken on rows that do not match.
+
+$ +sqoop export --direct --connect ... --update-key OBJECT --table
+OracleTableName --export-dir /user/username/tablename+
+
+[NOTE]
+================================================================================
+- If +OracleTableName+ was previously created by The Data Connector for Oracle
+and Hadoop with partitions then this export will create a new partition for the
+data being inserted. Updated rows will be moved to the new partition that was
+created for the export.
+- For performance reasons it is strongly recommended that, where more than a
+few rows are involved, column +OBJECT+ be an indexed column of
++OracleTableName+.
+- Ensure the column name defined with +--update-key OBJECT+ is specified in the
+correct letter case.
Sqoop will show an error if the letter case is incorrect. +- It is possible to match rows via multiple columns. See "Match Rows Via +Multiple Columns" for more information. +================================================================================ + +Create Oracle Tables +++++++++++++++++++++ + ++-Doraoop.template.table=TemplateTableName+ + +Creates +OracleTableName+ by replicating the structure and data types of ++TemplateTableName+. +TemplateTableName+ is a table that exists in Oracle prior +to executing the Sqoop command. + +[NOTE] +================================================================================ +- The export will fail if the Hadoop file contains any fields of a data type +not supported by The Data Connector for Oracle and Hadoop. See "Supported +Data Types" for more information. +- The export will fail if the column definitions in the Hadoop table do not +exactly match the column definitions in the Oracle table. +- This parameter is specific to creating an Oracle table. The export will fail +if +OracleTableName+ already exists in Oracle. +================================================================================ + +Example command: + +$ +sqoop export --direct --connect.. --table OracleTableName --export-dir +/user/username/tablename -Doraoop.template.table=TemplateTableName+ + +NOLOGGING ++++++++++ + ++-Doraoop.nologging=true+ + +Assigns the NOLOGGING option to +OracleTableName+. + +NOLOGGING may enhance performance but you will be unable to backup the table. + +Partitioning +++++++++++++ + ++-Doraoop.partitioned=true+ + +Partitions the table with the following benefits: + +- The speed of the export is improved by allowing each mapper to insert data +into a separate Oracle table using direct path writes. (An alter table exchange +subpartition SQL statement is subsequently executed to swap the data into the +export table.) +- You can selectively query or delete the data inserted by each Sqoop export +job. For example, you can delete old data by dropping old partitions from +the table. + +The partition value is the SYSDATE of when Sqoop export job was performed. + +The partitioned table created by The Data Connector for Oracle and Hadoop +includes the following columns that don't exist in the template table: + +- +oraoop_export_sysdate+ - This is the Oracle SYSDATE when the Sqoop export +job was performed. The created table will be partitioned by this column. +- +oraoop_mapper_id+ - This is the id of the Hadoop mapper that was used to +process the rows from the HDFS file. Each partition is subpartitioned by this +column. This column exists merely to facilitate the exchange subpartition +mechanism that is performed by each mapper during the export process. +- +oraoop_mapper_row+ - A unique row id within the mapper / partition. + +NOTE: If a unique row id is required for the table it can be formed by a +combination of oraoop_export_sysdate, oraoop_mapper_id and oraoop_mapper_row. + +Match Rows Via Multiple Columns ++++++++++++++++++++++++++++++++ + ++-Doraoop.update.key.extra.columns="ColumnA,ColumnB"+ + +Used with Update-Export and Merge-Export to match on more than one column. The +first column to be matched on is +--update-key OBJECT+. To match on additional +columns, specify those columns on this parameter. + +[NOTE] +================================================================================ +- Letter case for the column names on this parameter is not important. +- All columns used for matching should be indexed. 
The first three items on the +index should be +ColumnA+, +ColumnB+ and the column specified on ++--update-key+ - but the order in which the columns are specified is not +important. +================================================================================ + +Storage Clauses ++++++++++++++++ + ++-Doraoop.temporary.table.storage.clause="StorageClause"+ + ++-Doraoop.table.storage.clause="StorageClause"+ + +Use to customize storage with Oracle clauses as in TABLESPACE or COMPRESS + ++-Doraoop.table.storage.clause+ applies to the export table that is created +from +-Doraoop.template.table+. See "Create Oracle Tables" for more +information. +-Doraoop.temporary.table.storage.clause+ applies to all other +working tables that are created during the export process and then dropped at +the end of the export job. + +Manage Date And Timestamp Data Types +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Import Date And Timestamp Data Types from Oracle +++++++++++++++++++++++++++++++++++++++++++++++++ + +This section lists known differences in the data obtained by performing an +Data Connector for Oracle and Hadoop import of an Oracle table versus a native +Sqoop import of the same table. + +The Data Connector for Oracle and Hadoop Does Not Apply A Time Zone to DATE / TIMESTAMP Data Types +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ + +Data stored in a DATE or TIMESTAMP column of an Oracle table is not associated +with a time zone. Sqoop without the Data Connector for Oracle and Hadoop +inappropriately applies time zone information to this data. + +Take for example the following timestamp in an Oracle DATE or TIMESTAMP column: ++2am on 3rd October, 2010+. + +Request Sqoop without the Data Connector for Oracle and Hadoop import this data +using a system located in Melbourne Australia. The data is adjusted to Melbourne +Daylight Saving Time. The data is imported into Hadoop as: ++3am on 3rd October, 2010.+ + +The Data Connector for Oracle and Hadoop does not apply time zone information to +these Oracle data-types. Even from a system located in Melbourne Australia, The +Data Connector for Oracle and Hadoop ensures the Oracle and Hadoop timestamps +match. The Data Connector for Oracle and Hadoop correctly imports this +timestamp as: ++2am on 3rd October, 2010+. + +NOTE: In order for The Data Connector for Oracle and Hadoop to ensure data +accuracy, Oracle DATE and TIMESTAMP values must be represented by a String, +even when +--as-sequencefile+ is used on the Sqoop command-line to produce a +binary file in Hadoop. + +The Data Connector for Oracle and Hadoop Retains Time Zone Information in TIMEZONE Data Types ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ + +Data stored in a TIMESTAMP WITH TIME ZONE column of an Oracle table is +associated with a time zone. This data consists of two distinct parts: when the +event occurred and where the event occurred. + +When Sqoop without The Data Connector for Oracle and Hadoop is used to import +data it converts the timestamp to the time zone of the system running Sqoop and +omits the component of the data that specifies where the event occurred. + +Take for example the following timestamps (with time zone) in an Oracle +TIMESTAMP WITH TIME ZONE column: + +---- +2:59:00 am on 4th April, 2010. Australia/Melbourne +2:59:00 am on 4th April, 2010. 
America/New York +---- + +Request Sqoop without The Data Connector for Oracle and Hadoop import this data +using a system located in Melbourne Australia. From the data imported into +Hadoop we know when the events occurred, assuming we know the Sqoop command was +run from a system located in the Australia/Melbourne time zone, but we have lost +the information regarding where the event occurred. + +---- +2010-04-04 02:59:00.0 +2010-04-04 16:59:00.0 +---- + +Sqoop with The Data Connector for Oracle and Hadoop imports the example +timestamps as follows. The Data Connector for Oracle and Hadoop retains the +time zone portion of the data. +---- +2010-04-04 02:59:00.0 Australia/Melbourne +2010-04-04 02:59:00.0 America/New_York +---- + +Data Connector for Oracle and Hadoop Explicitly States Time Zone for LOCAL TIMEZONE Data Types +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ + +Data stored in a TIMESTAMP WITH LOCAL TIME ZONE column of an Oracle table is +associated with a time zone. Multiple end-users in differing time zones +(locales) will each have that data expressed as a timestamp within their +respective locale. + +When Sqoop without the Data Connector for Oracle and Hadoop is used to import +data it converts the timestamp to the time zone of the system running Sqoop and +omits the component of the data that specifies location. + +Take for example the following two timestamps (with time zone) in an Oracle +TIMESTAMP WITH LOCAL TIME ZONE column: + +---- +2:59:00 am on 4th April, 2010. Australia/Melbourne +2:59:00 am on 4th April, 2010. America/New York +---- + +Request Sqoop without the Data Connector for Oracle and Hadoop import this data +using a system located in Melbourne Australia. The timestamps are imported +correctly but the local time zone has to be guessed. If multiple systems in +different locale were executing the Sqoop import it would be very difficult to +diagnose the cause of the data corruption. + +---- +2010-04-04 02:59:00.0 +2010-04-04 16:59:00.0 +---- + +Sqoop with the Data Connector for Oracle and Hadoop explicitly states the time +zone portion of the data imported into Hadoop. The local time zone is GMT by +default. You can set the local time zone with parameter: + ++-Doracle.sessionTimeZone=Australia/Melbourne+ + +The Data Connector for Oracle and Hadoop would import these two timestamps as: + +---- +2010-04-04 02:59:00.0 Australia/Melbourne +2010-04-04 16:59:00.0 Australia/Melbourne +---- + +java.sql.Timestamp +++++++++++++++++++ + +To use Sqoop's handling of date and timestamp data types when importing data +from Oracle use the following parameter: + ++-Doraoop.timestamp.string=false+ + +NOTE: Sqoop's handling of date and timestamp data types does not store the +timezone. However, some developers may prefer Sqoop's handling as the Data +Connector for Oracle and Hadoop converts date and timestamp data types to +string. This may not work for some developers as the string will require +parsing later in the workflow. + +Export Date And Timestamp Data Types into Oracle +++++++++++++++++++++++++++++++++++++++++++++++++ + +Ensure the data in the HDFS file fits the required format exactly before using +Sqoop to export the data into Oracle. + +[NOTE] +================================================================================ +- The Sqoop export command will fail if the data is not in the required format. 
- ff = Fractional second
+- TZR = Time Zone Region
+================================================================================
+
+[frame="topbot",options="header"]
+|===============================================================================
+|Oracle Data Type |Required Format of The Data in the HDFS File
+|DATE |+yyyy-mm-dd hh24:mi:ss+
+|TIMESTAMP |+yyyy-mm-dd hh24:mi:ss.ff+
+|TIMESTAMPTZ |+yyyy-mm-dd hh24:mi:ss.ff TZR+
+|TIMESTAMPLTZ |+yyyy-mm-dd hh24:mi:ss.ff TZR+
+|===============================================================================
+
+Configure The Data Connector for Oracle and Hadoop
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+oraoop-site-template.xml
+++++++++++++++++++++++++
+
+The +oraoop-site-template.xml+ file is supplied with the Data Connector for
+Oracle and Hadoop. It contains a number of ALTER SESSION statements that are
+used to initialize the Oracle sessions created by the Data Connector for Oracle
+and Hadoop.
+
+If you need to customize these initializations to your environment then:
+
+1. Find +oraoop-site-template.xml+ in the Sqoop configuration directory.
+2. Copy +oraoop-site-template.xml+ to +oraoop-site.xml+.
+3. Edit the +ALTER SESSION+ statements in +oraoop-site.xml+.
+
+oraoop.oracle.session.initialization.statements
++++++++++++++++++++++++++++++++++++++++++++++++
+
+The value of this property is a semicolon-delimited list of Oracle SQL
+statements. These statements are executed, in order, for each Oracle session
+created by the Data Connector for Oracle and Hadoop.
+
+The default statements include:
+
++alter session set time_zone = \'{oracle.sessionTimeZone|GMT}';+::
+This statement initializes the timezone of the JDBC client. This ensures that
+data from columns of type TIMESTAMP WITH LOCAL TIMEZONE are correctly adjusted
+into the timezone of the client and not kept in the timezone of the Oracle
+database.
++
+[NOTE]
+================================================================================
+- The text within the curly-braces is an expression. See
+"Expressions in oraoop-site.xml" for more information.
+- A list of the time zones supported by your Oracle database is available by
+executing the following query: +SELECT TZNAME FROM V$TIMEZONE_NAMES;+
+================================================================================
+
++alter session disable parallel query;+::
+This statement instructs Oracle to not parallelize SQL statements executed by
+the Data Connector for Oracle and Hadoop sessions. This Oracle feature is
+disabled because the Map/Reduce job launched by Sqoop is the mechanism used
+for parallelization.
++
+It is recommended that you not enable parallel query because it can have an
+adverse effect on the load on the Oracle instance and on the balance between
+the Data Connector for Oracle and Hadoop mappers.
++
+Some export operations are performed in parallel where deemed appropriate by
+the Data Connector for Oracle and Hadoop. See "Parallelization" for
+more information.
+
++alter session set "_serial_direct_read"=true;+::
+This statement instructs Oracle to bypass the buffer cache. This is used to
+prevent Oracle from filling its buffers with the data being read by the Data
+Connector for Oracle and Hadoop, thereby diminishing its capacity to cache
+higher prioritized data. Hence, this statement is intended to minimize the
+Data Connector for Oracle and Hadoop's impact on the immediate future
+performance of the Oracle database.
+ ++--alter session set events \'10046 trace name context forever, level 8';+:: +This statement has been commented-out. To allow tracing, remove the comment +token "--" from the start of the line. + +[NOTE] +================================================================================ +- These statements are placed on separate lines for readability. They do not +need to be placed on separate lines. +- A statement can be commented-out via the standard Oracle double-hyphen +token: "--". The comment takes effect until the next semicolon. +================================================================================ + +oraoop.table.import.where.clause.location ++++++++++++++++++++++++++++++++++++++++++ + +SUBSPLIT (default):: +When set to this value, the where clause is applied to each subquery used to +retrieve data from the Oracle table. ++ +A Sqoop command like: ++ ++sqoop import -D oraoop.table.import.where.clause.location=SUBSPLIT --table +JUNK --where "owner like \'G%'"+ ++ +Generates SQL query of the form: ++ +---- +SELECT OWNER,OBJECT_NAME + FROM JUNK + WHERE ((rowid >= + dbms_rowid.rowid_create(1, 113320, 1024, 4223664, 0) + AND rowid <= + dbms_rowid.rowid_create(1, 113320, 1024, 4223671, 32767))) + AND (owner like 'G%') +UNION ALL +SELECT OWNER,OBJECT_NAME + FROM JUNK + WHERE ((rowid >= + dbms_rowid.rowid_create(1, 113320, 1024, 4223672, 0) + AND rowid <= + dbms_rowid.rowid_create(1, 113320, 1024, 4223679, 32767))) + AND (owner like 'G%') +---- + +SPLIT:: +When set to this value, the where clause is applied to the entire SQL +statement used by each split/mapper. ++ +A Sqoop command like: ++ ++sqoop import -D oraoop.table.import.where.clause.location=SPLIT --table +JUNK --where "rownum <= 10"+ ++ +Generates SQL query of the form: ++ +---- +SELECT OWNER,OBJECT_NAME + FROM ( + SELECT OWNER,OBJECT_NAME + FROM JUNK + WHERE ((rowid >= + dbms_rowid.rowid_create(1, 113320, 1024, 4223664, 0) + AND rowid <= + dbms_rowid.rowid_create(1, 113320, 1024, 4223671, 32767))) + UNION ALL + SELECT OWNER,OBJECT_NAME + FROM JUNK + WHERE ((rowid >= + dbms_rowid.rowid_create(1, 113320, 1024, 4223672, 0) + AND rowid <= + dbms_rowid.rowid_create(1, 113320, 1024, 4223679,32767))) + ) + WHERE rownum <= 10 +---- ++ +[NOTE] +================================================================================ +- In this example, there are up to 10 rows imported per mapper. +- The SPLIT clause may result in greater overhead than the SUBSPLIT +clause because the UNION statements need to be fully materialized +before the data can be streamed to the mappers. However, you may +wish to use SPLIT in the case where you want to limit the total +number of rows processed by each mapper. +================================================================================ + +oracle.row.fetch.size ++++++++++++++++++++++ + +The value of this property is an integer specifying the number of rows the +Oracle JDBC driver should fetch in each network round-trip to the database. +The default value is 5000. + +If you alter this setting, confirmation of the +change is displayed in the logs of the mappers during the Map-Reduce job. + +oraoop.import.hint +++++++++++++++++++ + +The Oracle optimizer hint is added to the SELECT statement for IMPORT jobs +as follows: + +---- +SELECT /*+ NO_INDEX(t) */ * FROM employees; +---- + +The default hint is +NO_INDEX(t)+ + +[NOTE] +================================================================================ +- The hint can be added to the command line. 
See "Import Data from Oracle" for +more information. +- See the Oracle Database Performance Tuning Guide (Using Optimizer Hints) +for more information on Oracle optimizer hints. +- To turn the hint off, insert a space between the elements. ++ +---- + + oraoop.import.hint + + +---- +================================================================================ + +oraoop.oracle.append.values.hint.usage +++++++++++++++++++++++++++++++++++++++ + +The value of this property is one of: AUTO / ON / OFF. + +AUTO:: +AUTO is the default value. ++ +Currently AUTO is equivalent to OFF. + +ON:: +During export the Data Connector for Oracle and Hadoop uses direct path +writes to populate the target Oracle table, bypassing the buffer cache. +Oracle only allows a single session to perform direct writes against a specific +table at any time, so this has the effect of serializing the writes to the +table. This may reduce throughput, especially if the number of mappers is high. +However, for databases where DBWR is very busy, or where the IO bandwidth to +the underlying table is narrow (table resides on a single disk spindle for +instance), then setting +oraoop.oracle.append.values.hint.usage+ to ON may +reduce the load on the Oracle database and possibly increase throughput. + +OFF:: +During export the Data Connector for Oracle and Hadoop does not use the ++APPEND_VALUES+ Oracle hint. + +NOTE: This parameter is only effective on Oracle 11g Release 2 and above. + +mapred.map.tasks.speculative.execution +++++++++++++++++++++++++++++++++++++++ + +By default speculative execution is disabled for the Data Connector for +Oracle and Hadoop. This avoids placing redundant load on the Oracle database. + +If Speculative execution is enabled, then Hadoop may initiate multiple mappers +to read the same blocks of data, increasing the overall load on the database. + +oraoop.block.allocation ++++++++++++++++++++++++ + +This setting determines how Oracle's data-blocks are assigned to Map-Reduce mappers. + +NOTE: Applicable to import. Not applicable to export. + +ROUNDROBIN (default):: +Each chunk of Oracle blocks is allocated to the mappers in a roundrobin +manner. This helps prevent one of the mappers from being +allocated a large proportion of typically small-sized blocks from the +start of Oracle data-files. In doing so it also helps prevent one of the +other mappers from being allocated a large proportion of typically +larger-sized blocks from the end of the Oracle data-files. ++ +Use this method to help ensure all the mappers are allocated a similar +amount of work. + +RANDOM:: +The list of Oracle blocks is randomized before being allocated to the +mappers via a round-robin approach. This has the benefit of increasing +the chance that, at any given instant in time, each mapper is reading +from a different Oracle data-file. If the Oracle data-files are located on +separate spindles, this should increase the overall IO throughput. + +SEQUENTIAL:: +Each chunk of Oracle blocks is allocated to the mappers sequentially. +This produces the tendency for each mapper to sequentially read a large, +contiguous proportion of an Oracle data-file. It is unlikely for the +performance of this method to exceed that of the round-robin method +and it is more likely to allocate a large difference in the work between +the mappers. ++ +Use of this method is generally not recommended. 
+ +oraoop.import.omit.lobs.and.long +++++++++++++++++++++++++++++++++ + +This setting can be used to omit all LOB columns (BLOB, CLOB and NCLOB) and LONG +column from an Oracle table being imported. This is advantageous in +troubleshooting, as it provides a convenient way to exclude all LOB-based data +from the import. + +oraoop.locations +++++++++++++++++ + +NOTE: Applicable to import. Not applicable to export. + +By default, four mappers are used for a Sqoop import job. The number of mappers +can be altered via the Sqoop +--num-mappers+ parameter. + +If the data-nodes in your Hadoop cluster have 4 task-slots (that is they are +4-CPU core machines) it is likely for all four mappers to execute on the +same machine. Therefore, IO may be concentrated between the Oracle database +and a single machine. + +This setting allows you to control which DataNodes in your Hadoop cluster each +mapper executes on. By assigning each mapper to a separate machine you may +improve the overall IO performance for the job. This will also have the +side-effect of the imported data being more diluted across the machines in +the cluster. (HDFS replication will dilute the data across the cluster anyway.) + +Specify the machine names as a comma separated list. The locations are +allocated to each of the mappers in a round-robin manner. + +If using EC2, specify the internal name of the machines. Here is an example +of using this parameter from the Sqoop command-line: + +$ +sqoop import -D +oraoop.locations=ip-10-250-23-225.ec2.internal,ip-10-250-107-32.ec2.internal,ip-10-250-207-2.ec2.internal,ip-10-250-27-114.ec2.internal +--direct --connect...+ + +sqoop.connection.factories +++++++++++++++++++++++++++ + +This setting determines behavior if the Data Connector for Oracle and Hadoop +cannot accept the job. By default Sqoop accepts the jobs that the Data Connector +for Oracle and Hadoop rejects. + +Set the value to +org.apache.sqoop.manager.oracle.OraOopManagerFactory+ when you +want the job to fail if the Data Connector for Oracle and Hadoop cannot +accept the job. + +Expressions in oraoop-site.xml +++++++++++++++++++++++++++++++ + +Text contained within curly-braces { and } are expressions to be evaluated +prior to the SQL statement being executed. The expression contains the name +of the configuration property optionally followed by a default value to use +if the property has not been set. A pipe | character is used to delimit the +property name and the default value. + +For example: + +When this Sqoop command is executed:: +$ +sqoop import -D oracle.sessionTimeZone=US/Hawaii --direct --connect+ + +The statement within oraoop-site.xml:: ++alter session set time_zone =\'{oracle.sessionTimeZone|GMT}\';+ + +Becomes:: ++alter session set time_zone = \'US/Hawaii'+ + +If the oracle.sessionTimeZone property had not been set, then this statement would use the specified default value and would become:: ++alter session set time_zone = \'GMT'+ + +NOTE: The +oracle.sessionTimeZone+ property can be specified within the ++sqoop-site.xml+ file if you want this setting to be used all the time. + +Troubleshooting The Data Connector for Oracle and Hadoop +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Quote Oracle Owners And Tables +++++++++++++++++++++++++++++++ + +[frame="topbot",cols="2*v"] +|=============================================================================== +|If the owner of the Oracle table needs to be +quoted, use: +|$ +sqoop import ... 
--table
+"\"\"Scott\".customers\""+
+
+This is the equivalent of:
+"Scott".customers
+|If the Oracle table needs to be quoted, use:
+|$ +sqoop import ... --table
+"\"scott.\"Customers\"\""+
+
+This is the equivalent of:
+scott."Customers"
+|If both the owner of the Oracle table and the
+table itself needs to be quoted, use:
+|$ +sqoop import ... --table
+"\"\"Scott\".\"Customers\"\""+
+
+This is the equivalent of:
+"Scott"."Customers"
+|===============================================================================
+
+[NOTE]
+================================================================================
+- The HDFS output directory is called something like:
+/user/username/"Scott"."Customers"
+- If a table name contains a $ character, it may need to be escaped within your
+Unix shell. For example, the dr$object table in the ctxsys schema would be
+referred to as: $ +sqoop import ... --table "ctxsys.dr\$object"+
+================================================================================
+
+Quote Oracle Columns
+++++++++++++++++++++
+
+If a column name of an Oracle table needs to be quoted, use::
+$ +sqoop import ... --table customers --columns "\"\"first name\"\""+
++
+This is the equivalent of: `select "first name" from customers`
+
+Confirm The Data Connector for Oracle and Hadoop Can Initialize The Oracle Session
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+
+If the Sqoop output includes feedback such as the following then the
+configuration properties contained within +oraoop-site-template.xml+ and
++oraoop-site.xml+ have been loaded by Hadoop and can be accessed by the Data
+Connector for Oracle and Hadoop.
+
++14/07/08 15:21:13 INFO oracle.OracleConnectionFactory:
+Initializing Oracle session with SQL+
+
+Check The Sqoop Debug Logs for Error Messages
++++++++++++++++++++++++++++++++++++++++++++++
+
+For more information about any errors encountered during the Sqoop import,
+refer to the log files generated by each of the (by default 4) mappers that
+performed the import.
+
+The logs can be obtained via your Map-Reduce Job Tracker's web page.
+
+Include these log files with any requests you make for assistance on the Sqoop
+User Group web site.
+
+Export: Check Tables Are Compatible
++++++++++++++++++++++++++++++++++++
+
+Check that the tables are compatible, particularly in the case of a parsing
+error.
+
+- Ensure the fields contained within the HDFS file and the columns within the
+Oracle table are identical. If they are not identical, the Java code
+dynamically generated by Sqoop to parse the HDFS file will throw an error when
+reading the file, causing the export to fail. When creating a table in Oracle,
+ensure the definitions for the table template are identical to the definitions
+for the HDFS file.
+- Ensure the data types in the table are supported. See "Supported Data Types"
+for more information.
+- Are date and time zone based data types used? See "Export Date And Timestamp
+Data Types into Oracle" for more information.
+
+Export: Parallelization
++++++++++++++++++++++++
+
++-D oraoop.export.oracle.parallelization.enabled=false+
+
+If you see a parallelization error you may decide to disable parallelization
+on Oracle queries.
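+
+For example, a minimal export command with parallelization disabled, following
+the form of the export examples above, is:
+
+$ +sqoop export -D oraoop.export.oracle.parallelization.enabled=false
+--direct --connect ... --table OracleTableName --export-dir
+/user/username/tablename+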
+
+Export: Check oraoop.oracle.append.values.hint.usage
+++++++++++++++++++++++++++++++++++++++++++++++++++++
+
+The +oraoop.oracle.append.values.hint.usage+ parameter should not be set to
+ON if the Oracle table contains either a BINARY_DOUBLE or BINARY_FLOAT column
+and the HDFS file being exported contains a NULL value in either of these
+column types. Doing so will result in the error: +ORA-12838: cannot
+read/modify an object after modifying it in parallel+.
+
+Turn On Verbose
++++++++++++++++
+
+Turn on verbose output on the Sqoop command line.
+
++--verbose+
+
+Check Sqoop stdout (standard output) and the mapper logs for information as to
+where the problem may be.
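+
+For example, to rerun the earlier import with verbose output enabled:
+
+$ +sqoop import --verbose --direct --connect ... --table OracleTableName+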