From d4ff097ea33721b706811968cb1903fa4be6fa46 Mon Sep 17 00:00:00 2001 From: Jarek Jarcec Cecho Date: Sun, 13 Jul 2014 20:18:04 -0700 Subject: [PATCH] SQOOP-1344: Add documentation for Oracle connector (David Robson via Jarek Jarcec Cecho) --- src/docs/user/connectors.txt | 1299 ++++++++++++++++++++++++++++++++++ 1 file changed, 1299 insertions(+) diff --git a/src/docs/user/connectors.txt b/src/docs/user/connectors.txt index 379cbd99..c04900f3 100644 --- a/src/docs/user/connectors.txt +++ b/src/docs/user/connectors.txt @@ -463,3 +463,1302 @@ representation. It is suggested that the null value be specified as empty string for performance and consistency. + +Data Connector for Oracle and Hadoop +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +About +^^^^^ + +The Data Connector for Oracle and Hadoop is now included in Sqoop. + +It can be enabled by specifying the +\--direct+ argument for your import or +export job. + +Jobs +++++ + +The Data Connector for Oracle and Hadoop inspects each Sqoop job and assumes +responsibility for the ones it can perform better than the Oracle manager built +into Sqoop. + +Data Connector for Oracle and Hadoop accepts responsibility for the following +Sqoop Job types: + +- *Import* jobs that are *Non-Incremental*. +- *Export* jobs +- Data Connector for Oracle and Hadoop does not accept responsibility for other +Sqoop job types. For example Data Connector for Oracle and Hadoop does not +accept *eval* jobs etc. + +Data Connector for Oracle and Hadoop accepts responsibility for those Sqoop Jobs +with the following attributes: + +- Oracle-related +- Table-Based - Jobs where the table argument is used and the specified object +is a table. ++ +NOTE: Data Connector for Oracle and Hadoop does not process index-organized +tables. + +- There are at least 2 mappers — Jobs where the Sqoop command-line does not +include: +--num-mappers 1+ + +How The Standard Oracle Manager Works for Imports ++++++++++++++++++++++++++++++++++++++++++++++++++ + +The Oracle manager built into Sqoop uses a range-based query for each mapper. +Each mapper executes a query of the form: + +---- +SELECT * FROM sometable WHERE id >= lo AND id < hi +---- + +The *lo* and *hi* values are based on the number of mappers and the minimum and +maximum values of the data in the column the table is being split by. + +If no suitable index exists on the table then these queries result in full +table-scans within Oracle. Even with a suitable index, multiple mappers may +fetch data stored within the same Oracle blocks, resulting in redundant IO +calls. + +How The Data Connector for Oracle and Hadoop Works for Imports +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ + +The Data Connector for Oracle and Hadoop generates queries for the mappers of +the form: + +---- +SELECT * + FROM sometable + WHERE rowid >= dbms_rowid.rowid_create(1, 893, 1, 279, 0) AND + rowid <= dbms_rowid.rowid_create(1, 893, 1, 286, 32767) +---- + +The Data Connector for Oracle and Hadoop queries ensure that: + +- No two mappers read data from the same Oracle block. This minimizes +redundant IO. +- The table does not require indexes. +- The Sqoop command line does not need to specify a +--split-by+ column. + +Data Connector for Oracle and Hadoop Exports +++++++++++++++++++++++++++++++++++++++++++++ + +Benefits of the Data Connector for Oracle and Hadoop: + +- *Merge-Export facility* - Update Oracle tables by modifying changed rows AND +inserting rows from the HDFS file that did not previously exist in the Oracle +table. 
The Connector for Oracle and Hadoop's Merge-Export is unique - there is
+no Sqoop equivalent.
+- *Lower impact on the Oracle database* - Update the rows in the Oracle table
+that have changed, not all rows in the Oracle table. This has performance
+benefits and reduces the impact of the query on Oracle (for example, the Oracle
+redo logs).
+- *Improved performance* - With partitioned tables, mappers utilize temporary
+Oracle tables which allow parallel inserts and direct path writes.
+
+Requirements
+^^^^^^^^^^^^
+
+Ensure The Oracle Database JDBC Driver Is Set Up Correctly
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+
+Ensure the Oracle Database 11g Release 2 JDBC driver is set up correctly on
+your system. This driver is required for Sqoop to work with Oracle.
+
+The Oracle Database 11g Release 2 JDBC driver file is +ojdbc6.jar+ (3.2 MB).
+
+If this file is not on your system then download it from:
+http://www.oracle.com/technetwork/database/features/jdbc/index-091264.html
+
+This file should be put into the +$SQOOP_HOME/lib+ directory.
+
+Oracle Roles and Privileges
++++++++++++++++++++++++++++
+
+The Oracle user for The Data Connector for Oracle and Hadoop requires the
+following roles and privileges:
+
+- +create session+
+
+In addition, the user must have the +select any dictionary+ privilege or the
++select_catalog_role+ role or all of the following object privileges:
+
+- +select on v_$instance+
+- +select on dba_tables+
+- +select on dba_tab_columns+
+- +select on dba_objects+
+- +select on dba_extents+
+- +select on dba_segments+ - Required for Sqoop imports only
+- +select on v_$database+ - Required for Sqoop imports only
+- +select on v_$parameter+ - Required for Sqoop imports only
+
+NOTE: The user also requires the +alter session+ privilege to make use of
+session tracing functionality. See
+"oraoop.oracle.session.initialization.statements" for more information.
+
+Additional Oracle Roles And Privileges Required for Export
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+
+The Oracle user for Data Connector for Oracle and Hadoop requires:
+
+- Quota on the tablespace in which the Oracle export tables are located.
++ +An example Oracle command to achieve this is ++ +---- +alter user username quota unlimited on tablespace +---- + +- The following privileges: ++ +[frame="topbot",options="header",cols="2*v"] +|=============================================================================== +|Type of Export |Privileges required +|All Export +|+create table+ ++select on dba_tab_partitions+ ++select on dba_tab_subpartitions+ ++select on dba_indexes+ ++select on dba_ind_columns+ +|Insert-Export with a template +table into another schema +|+select any table+ ++create any table+ ++insert any table+ ++alter any table+ (partitioning) +|Insert-Export without a +template table into another +schema +|+select,insert on table+ (no partitioning) ++select,alter on table+ (partitioning) +|Update-Export into another +schema +|+select,update on table+ (no partitioning) ++select,delete,alter,insert on table+ +(partitioning) +|Merge-Export into another +schema +|+select,insert,update on table+ (no +partitioning) ++select,insert,delete,alter on table+ +(partitioning) +|=============================================================================== + +Supported Data Types +++++++++++++++++++++ + +The following Oracle data types are supported by the Data Connector for +Oracle and Hadoop: + +[frame="none",grid="none",cols="2,1l,2"] +|=============================================================================== +|BINARY_DOUBLE | |NCLOB +|BINARY_FLOAT | |NUMBER +|BLOB | |NVARCHAR2 +|CHAR | |RAW +|CLOB | |ROWID +|DATE | |TIMESTAMP +|FLOAT | |TIMESTAMP WITH TIME ZONE +|INTERVAL DAY TO SECOND | |TIMESTAMP WITH LOCAL TIME ZONE +|INTERVAL YEAR TO MONTH | |URITYPE +|LONG | |VARCHAR2 +|NCHAR | | +|=============================================================================== + +All other Oracle column types are NOT supported. Example Oracle column types NOT +supported by Data Connector for Oracle and Hadoop include: + +[frame="none",grid="none",cols="2,1l,2"] +|=============================================================================== +|All of the ANY types | |BFILE +|All of the MEDIA types | |LONG RAW +|All of the SPATIAL types | |MLSLABEL +|Any type referred to as UNDEFINED | |UROWID +|All custom (user-defined) URI types| |XMLTYPE +|=============================================================================== + +NOTE: Data types RAW, LONG and LOB (BLOB, CLOB and NCLOB) are supported for +Data Connector for Oracle and Hadoop imports. They are not supported for Data +Connector for Oracle and Hadoop exports. + +Execute Sqoop With Data Connector for Oracle and Hadoop +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Connect to Oracle / Oracle RAC +++++++++++++++++++++++++++++++ + +The Sqoop +--connect+ parameter defines the Oracle instance or Oracle RAC to +connect to. It is required with all Sqoop import and export commands. + +Data Connector for Oracle and Hadoop expects the associated connection string +to be of a specific format dependent on whether the Oracle SID or Service +is defined. + ++--connect jdbc:oracle:thin:@OracleServer:OraclePort:OracleSID+ + ++--connect jdbc:oracle:thin:@//OracleServer:OraclePort/OracleService+ + +Connect to An Oracle Database Instance +++++++++++++++++++++++++++++++++++++++ + +[frame="topbot",options="header",cols="2*v"] +|=============================================================================== +|Parameter / Component |Description +|+jdbc:oracle:thin+ +|The Data Connector for Oracle and Hadoop requires the +connection string starts with jdbc:oracle. 
+ +The Data Connector for Oracle and Hadoop has been tested +with the thin driver however it should work equally well +with other drivers such as OCI. +|+OracleServer+ +|The host name of the Oracle server. +|+OraclePort+ +|The port to connect to the Oracle server. +|+OracleSID+ +|The Oracle instance. +|+OracleService+ +|The Oracle Service. +|=============================================================================== + +[NOTE] +================================================================================ +The Hadoop mappers connect to the Oracle database using a dynamically +generated JDBC URL. This is designed to improve performance however it can be +disabled by specifying: + ++-D oraoop.jdbc.url.verbatim=true+ +================================================================================ + +Connect to An Oracle RAC +++++++++++++++++++++++++ + +Use the +--connect+ parameter as above. The connection string should point to +one instance of the Oracle RAC. The listener of the host of this Oracle +instance will locate the other instances of the Oracle RAC. + +NOTE: To improve performance, The Data Connector for Oracle and Hadoop +identifies the active instances of the Oracle RAC and connects each Hadoop +mapper to them in a roundrobin manner. + +If services are defined for this Oracle RAC then use the following parameter +to specify the service name: + ++-D oraoop.oracle.rac.service.name=ServiceName+ + +[frame="topbot",options="header",cols="2*v"] +|=============================================================================== +|Parameter / Component |Description +|+OracleServer:OraclePort:OracleInstance+ +|Name one instance of the Oracle RAC. + +The Data Connector for Oracle and +Hadoop assumes the same port number for +all instances of the Oracle RAC. + +The listener of the host of this Oracle +instance is used to locate other instances of +the Oracle RAC. For more information +enter this command on the host command +line: + ++lsnrctl status+ +|+-D oraoop.oracle.rac.service.name=ServiceName+ +|The service to connect to in the Oracle RAC. + +A connection is made to all instances of +the Oracle RAC associated with the service +given by +ServiceName+. + +If omitted, a connection is made to all +instances of the Oracle RAC. + +The listener of the host of this Oracle +instance needs to know the +ServiceName+ +and all instances of the Oracle RAC. For +more information enter this command on +the host command line: + ++lsnrctl status+ +|=============================================================================== + +Login to The Oracle Instance +++++++++++++++++++++++++++++ + +Login to the Oracle instance on the Sqoop command line: + ++--connect jdbc:oracle:thin:@OracleServer:OraclePort:OracleInstance --username +UserName -P+ + +[frame="topbot",options="header",cols="2*v"] +|=============================================================================== +|Parameter / Component |Description +|+--username UserName+ +|The username to login to the Oracle instance (SID). +|+-P+ +|You will be prompted for the password to login to the Oracle +instance. +|=============================================================================== + +Kill Data Connector for Oracle and Hadoop Jobs +++++++++++++++++++++++++++++++++++++++++++++++ + +Use the Hadoop Job Tracker to kill the Sqoop job, just as you would kill any +other Map-Reduce job. 
+ +$ +hadoop job -kill jobid+ + +To allow an Oracle DBA to kill a Data Connector for Oracle and Hadoop +job (via killing the sessions in Oracle) you need to prevent Map-Reduce from +re-attempting failed jobs. This is done via the following Sqoop +command-line switch: + ++-D mapred.map.max.attempts=1+ + +This sends instructions similar to the following to the console: + +---- +14/07/07 15:24:51 INFO oracle.OraOopManagerFactory: +Note: This Data Connector for Oracle and Hadoop job can be killed via Oracle +by executing the following statement: + begin + for row in (select sid,serial# from v$session where + module='Data Connector for Oracle and Hadoop' and + action='import 20140707152451EST') loop + execute immediate 'alter system kill session ''' || row.sid || + ',' || row.serial# || ''''; + end loop; + end; +---- + +Import Data from Oracle +^^^^^^^^^^^^^^^^^^^^^^^ + +Execute Sqoop. Following is an example command: + +$ +sqoop import --direct --connect ... --table OracleTableName+ + +If The Data Connector for Oracle and Hadoop accepts the job then the following +text is output: + +---- +************************************************** +*** Using Data Connector for Oracle and Hadoop *** +************************************************** +---- + +[NOTE] +================================================================================ +- More information is available on the +--connect+ parameter. See "Connect to +Oracle / Oracle RAC" for more information. + +- If Java runs out of memory the workaround is to specify each mapper's +JVM memory allocation. Add the following parameter for example to allocate 4GB: ++ ++-Dmapred.child.java.opts=-Xmx4000M+ + +- An Oracle optimizer hint is included in the SELECT statement by default. +See "oraoop.import.hint" for more information. ++ +You can alter the hint on the command line as follows: ++ ++-Doraoop.import.hint="NO_INDEX(t)"+ ++ +You can turn off the hint on the command line as follows (notice the space +between the double quotes): ++ ++-Doraoop.import.hint=" "+ +================================================================================ + +Match Hadoop Files to Oracle Table Partitions ++++++++++++++++++++++++++++++++++++++++++++++ + ++-Doraoop.chunk.method={ROWID|PARTITION}+ + +To import data from a partitioned table in such a way that the resulting HDFS +folder structure in Hadoop will match the table’s partitions, set the chunk +method to PARTITION. The alternative (default) chunk method is ROWID. + +[NOTE] +================================================================================ +- For the number of Hadoop files to match the number of Oracle partitions, +set the number of mappers to be greater than or equal to the number of +partitions. +- If the table is not partitioned then value PARTITION will lead to an error. +================================================================================ + +Specify The Partitions To Import +++++++++++++++++++++++++++++++++ + ++-Doraoop.import.partitions=PartitionA,PartitionB --table OracleTableName+ + +Imports +PartitionA+ and +PartitionB+ of +OracleTableName+. + +[NOTE] +================================================================================ +- You can enclose an individual partition name in double quotes to retain the +letter case or if the name has special characters. ++ ++-Doraoop.import.partitions=\'"PartitionA",PartitionB' --table OracleTableName+ ++ +If the partition name is not double quoted then its name will be automatically +converted to upper case, PARTITIONB for above. 
++
+When using double quotes the entire list of partition names must be enclosed in
+single quotes.
++
+If the last partition name in the list is double quoted then there must be a
+comma at the end of the list.
++
++-Doraoop.import.partitions=\'"PartitionA","PartitionB",' --table
+OracleTableName+
+
+- Name each partition to be included. There is no facility to provide a range of
+partition names.
+
+- There is no facility to define sub partitions. The entire partition is
+included/excluded as per the filter.
+================================================================================
+
+Consistent Read: All Mappers Read From The Same Point In Time
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+
++-Doraoop.import.consistent.read={true|false}+
+
+When set to +false+ (the default) each mapper runs a select query. This will
+return potentially inconsistent data if there are a lot of DML operations on
+the table at the time of import.
+
+Set to +true+ to ensure all mappers read from the same point in time. The
+System Change Number (SCN) is passed down to all mappers, which use the Oracle
+Flashback Query to query the table as at that SCN.
+
+[NOTE]
+================================================================================
+- Values +true+ | +false+ are case sensitive.
+- By default the SCN is taken from V$database. You can specify the SCN with the
+following parameter:
++
++-Doraoop.import.consistent.read.scn=12345+
+================================================================================
+
+Export Data into Oracle
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Execute Sqoop. Following is an example command:
+
+$ +sqoop export --direct --connect ... --table OracleTableName --export-dir
+/user/username/tablename+
+
+The Data Connector for Oracle and Hadoop accepts all jobs that export data to
+Oracle. You can verify The Data Connector for Oracle and Hadoop is in use by
+checking that the following text is output:
+
+----
+**************************************************
+*** Using Data Connector for Oracle and Hadoop ***
+**************************************************
+----
+
+[NOTE]
+================================================================================
+- +OracleTableName+ is the Oracle table the data will export into.
+- +OracleTableName+ can be in a schema other than that for the connecting user.
+Prefix the table name with the schema, for example +SchemaName.OracleTableName+.
+- Hadoop tables are picked up from the +/user/username/tablename+ directory.
+- The export will fail if the Hadoop file contains any fields of a data type
+not supported by The Data Connector for Oracle and Hadoop. See
+"Supported Data Types" for more information.
+- The export will fail if the column definitions in the Hadoop table do not
+exactly match the column definitions in the Oracle table.
+- The Data Connector for Oracle and Hadoop indicates if it finds temporary
+tables that it created more than a day ago that still exist. Usually these
+tables can be dropped. The only circumstance when these tables should not be
+dropped is when a Data Connector for Oracle and Hadoop job has been running
+for more than 24 hours and is still running.
+- More information is available on the +--connect+ parameter. See
+"Connect to Oracle / Oracle RAC" for more information.
+================================================================================
+
+Insert-Export
++++++++++++++
+
+Appends data to +OracleTableName+. It does not modify existing data in
++OracleTableName+.
+
+Insert-Export is the default method, executed in the absence of the
++--update-key+ parameter. All rows in the HDFS file in
++/user/UserName/TableName+ are inserted into +OracleTableName+. No
+change is made to pre-existing data in +OracleTableName+.
+
+$ +sqoop export --direct --connect ... --table OracleTableName --export-dir
+/user/username/tablename+
+
+[NOTE]
+================================================================================
+- If +OracleTableName+ was previously created by The Data Connector for Oracle
+and Hadoop with partitions then this export will create a new partition for the
+data being inserted.
+- When creating +OracleTableName+ specify a template. See
+"Create Oracle Tables" for more information.
+================================================================================
+
+Update-Export
++++++++++++++
+
++--update-key OBJECT+
+
+Updates existing rows in +OracleTableName+.
+
+Rows in the HDFS file in +/user/UserName/TableName+ are matched to rows in
++OracleTableName+ by the +OBJECT+ column. Rows that match are copied from the
+HDFS file to the Oracle table. No action is taken on rows that do not match.
+
+$ +sqoop export --direct --connect ... --update-key OBJECT --table
+OracleTableName --export-dir /user/username/tablename+
+
+[NOTE]
+================================================================================
+- If +OracleTableName+ was previously created by The Data Connector for Oracle
+and Hadoop with partitions then this export will create a new partition for the
+data being inserted. Updated rows will be moved to the new partition that was
+created for the export.
+- For performance reasons it is strongly recommended that, where more than a
+few rows are involved, column +OBJECT+ be an indexed column of
++OracleTableName+.
+- Ensure the column name defined with +--update-key OBJECT+ is specified in the
+correct letter case.
Sqoop will show an error if the letter case is incorrect. +- It is possible to match rows via multiple columns. See "Match Rows Via +Multiple Columns" for more information. +================================================================================ + +Create Oracle Tables +++++++++++++++++++++ + ++-Doraoop.template.table=TemplateTableName+ + +Creates +OracleTableName+ by replicating the structure and data types of ++TemplateTableName+. +TemplateTableName+ is a table that exists in Oracle prior +to executing the Sqoop command. + +[NOTE] +================================================================================ +- The export will fail if the Hadoop file contains any fields of a data type +not supported by The Data Connector for Oracle and Hadoop. See "Supported +Data Types" for more information. +- The export will fail if the column definitions in the Hadoop table do not +exactly match the column definitions in the Oracle table. +- This parameter is specific to creating an Oracle table. The export will fail +if +OracleTableName+ already exists in Oracle. +================================================================================ + +Example command: + +$ +sqoop export --direct --connect.. --table OracleTableName --export-dir +/user/username/tablename -Doraoop.template.table=TemplateTableName+ + +NOLOGGING ++++++++++ + ++-Doraoop.nologging=true+ + +Assigns the NOLOGGING option to +OracleTableName+. + +NOLOGGING may enhance performance but you will be unable to backup the table. + +Partitioning +++++++++++++ + ++-Doraoop.partitioned=true+ + +Partitions the table with the following benefits: + +- The speed of the export is improved by allowing each mapper to insert data +into a separate Oracle table using direct path writes. (An alter table exchange +subpartition SQL statement is subsequently executed to swap the data into the +export table.) +- You can selectively query or delete the data inserted by each Sqoop export +job. For example, you can delete old data by dropping old partitions from +the table. + +The partition value is the SYSDATE of when Sqoop export job was performed. + +The partitioned table created by The Data Connector for Oracle and Hadoop +includes the following columns that don't exist in the template table: + +- +oraoop_export_sysdate+ - This is the Oracle SYSDATE when the Sqoop export +job was performed. The created table will be partitioned by this column. +- +oraoop_mapper_id+ - This is the id of the Hadoop mapper that was used to +process the rows from the HDFS file. Each partition is subpartitioned by this +column. This column exists merely to facilitate the exchange subpartition +mechanism that is performed by each mapper during the export process. +- +oraoop_mapper_row+ - A unique row id within the mapper / partition. + +NOTE: If a unique row id is required for the table it can be formed by a +combination of oraoop_export_sysdate, oraoop_mapper_id and oraoop_mapper_row. + +Match Rows Via Multiple Columns ++++++++++++++++++++++++++++++++ + ++-Doraoop.update.key.extra.columns="ColumnA,ColumnB"+ + +Used with Update-Export and Merge-Export to match on more than one column. The +first column to be matched on is +--update-key OBJECT+. To match on additional +columns, specify those columns on this parameter. + +[NOTE] +================================================================================ +- Letter case for the column names on this parameter is not important. +- All columns used for matching should be indexed. 
The first three items on the +index should be +ColumnA+, +ColumnB+ and the column specified on ++--update-key+ - but the order in which the columns are specified is not +important. +================================================================================ + +Storage Clauses ++++++++++++++++ + ++-Doraoop.temporary.table.storage.clause="StorageClause"+ + ++-Doraoop.table.storage.clause="StorageClause"+ + +Use to customize storage with Oracle clauses as in TABLESPACE or COMPRESS + ++-Doraoop.table.storage.clause+ applies to the export table that is created +from +-Doraoop.template.table+. See "Create Oracle Tables" for more +information. +-Doraoop.temporary.table.storage.clause+ applies to all other +working tables that are created during the export process and then dropped at +the end of the export job. + +Manage Date And Timestamp Data Types +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Import Date And Timestamp Data Types from Oracle +++++++++++++++++++++++++++++++++++++++++++++++++ + +This section lists known differences in the data obtained by performing an +Data Connector for Oracle and Hadoop import of an Oracle table versus a native +Sqoop import of the same table. + +The Data Connector for Oracle and Hadoop Does Not Apply A Time Zone to DATE / TIMESTAMP Data Types +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ + +Data stored in a DATE or TIMESTAMP column of an Oracle table is not associated +with a time zone. Sqoop without the Data Connector for Oracle and Hadoop +inappropriately applies time zone information to this data. + +Take for example the following timestamp in an Oracle DATE or TIMESTAMP column: ++2am on 3rd October, 2010+. + +Request Sqoop without the Data Connector for Oracle and Hadoop import this data +using a system located in Melbourne Australia. The data is adjusted to Melbourne +Daylight Saving Time. The data is imported into Hadoop as: ++3am on 3rd October, 2010.+ + +The Data Connector for Oracle and Hadoop does not apply time zone information to +these Oracle data-types. Even from a system located in Melbourne Australia, The +Data Connector for Oracle and Hadoop ensures the Oracle and Hadoop timestamps +match. The Data Connector for Oracle and Hadoop correctly imports this +timestamp as: ++2am on 3rd October, 2010+. + +NOTE: In order for The Data Connector for Oracle and Hadoop to ensure data +accuracy, Oracle DATE and TIMESTAMP values must be represented by a String, +even when +--as-sequencefile+ is used on the Sqoop command-line to produce a +binary file in Hadoop. + +The Data Connector for Oracle and Hadoop Retains Time Zone Information in TIMEZONE Data Types ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ + +Data stored in a TIMESTAMP WITH TIME ZONE column of an Oracle table is +associated with a time zone. This data consists of two distinct parts: when the +event occurred and where the event occurred. + +When Sqoop without The Data Connector for Oracle and Hadoop is used to import +data it converts the timestamp to the time zone of the system running Sqoop and +omits the component of the data that specifies where the event occurred. + +Take for example the following timestamps (with time zone) in an Oracle +TIMESTAMP WITH TIME ZONE column: + +---- +2:59:00 am on 4th April, 2010. Australia/Melbourne +2:59:00 am on 4th April, 2010. 
America/New York +---- + +Request Sqoop without The Data Connector for Oracle and Hadoop import this data +using a system located in Melbourne Australia. From the data imported into +Hadoop we know when the events occurred, assuming we know the Sqoop command was +run from a system located in the Australia/Melbourne time zone, but we have lost +the information regarding where the event occurred. + +---- +2010-04-04 02:59:00.0 +2010-04-04 16:59:00.0 +---- + +Sqoop with The Data Connector for Oracle and Hadoop imports the example +timestamps as follows. The Data Connector for Oracle and Hadoop retains the +time zone portion of the data. +---- +2010-04-04 02:59:00.0 Australia/Melbourne +2010-04-04 02:59:00.0 America/New_York +---- + +Data Connector for Oracle and Hadoop Explicitly States Time Zone for LOCAL TIMEZONE Data Types +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ + +Data stored in a TIMESTAMP WITH LOCAL TIME ZONE column of an Oracle table is +associated with a time zone. Multiple end-users in differing time zones +(locales) will each have that data expressed as a timestamp within their +respective locale. + +When Sqoop without the Data Connector for Oracle and Hadoop is used to import +data it converts the timestamp to the time zone of the system running Sqoop and +omits the component of the data that specifies location. + +Take for example the following two timestamps (with time zone) in an Oracle +TIMESTAMP WITH LOCAL TIME ZONE column: + +---- +2:59:00 am on 4th April, 2010. Australia/Melbourne +2:59:00 am on 4th April, 2010. America/New York +---- + +Request Sqoop without the Data Connector for Oracle and Hadoop import this data +using a system located in Melbourne Australia. The timestamps are imported +correctly but the local time zone has to be guessed. If multiple systems in +different locale were executing the Sqoop import it would be very difficult to +diagnose the cause of the data corruption. + +---- +2010-04-04 02:59:00.0 +2010-04-04 16:59:00.0 +---- + +Sqoop with the Data Connector for Oracle and Hadoop explicitly states the time +zone portion of the data imported into Hadoop. The local time zone is GMT by +default. You can set the local time zone with parameter: + ++-Doracle.sessionTimeZone=Australia/Melbourne+ + +The Data Connector for Oracle and Hadoop would import these two timestamps as: + +---- +2010-04-04 02:59:00.0 Australia/Melbourne +2010-04-04 16:59:00.0 Australia/Melbourne +---- + +java.sql.Timestamp +++++++++++++++++++ + +To use Sqoop's handling of date and timestamp data types when importing data +from Oracle use the following parameter: + ++-Doraoop.timestamp.string=false+ + +NOTE: Sqoop's handling of date and timestamp data types does not store the +timezone. However, some developers may prefer Sqoop's handling as the Data +Connector for Oracle and Hadoop converts date and timestamp data types to +string. This may not work for some developers as the string will require +parsing later in the workflow. + +Export Date And Timestamp Data Types into Oracle +++++++++++++++++++++++++++++++++++++++++++++++++ + +Ensure the data in the HDFS file fits the required format exactly before using +Sqoop to export the data into Oracle. + +[NOTE] +================================================================================ +- The Sqoop export command will fail if the data is not in the required format. 
- ff = Fractional second
+- TZR = Time Zone Region
+================================================================================
+
+[frame="topbot",options="header"]
+|===============================================================================
+|Oracle Data Type |Required Format of The Data in the HDFS File
+|DATE |+yyyy-mm-dd hh24:mi:ss+
+|TIMESTAMP |+yyyy-mm-dd hh24:mi:ss.ff+
+|TIMESTAMPTZ |+yyyy-mm-dd hh24:mi:ss.ff TZR+
+|TIMESTAMPLTZ |+yyyy-mm-dd hh24:mi:ss.ff TZR+
+|===============================================================================
+
+Configure The Data Connector for Oracle and Hadoop
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+oraoop-site-template.xml
+++++++++++++++++++++++++
+
+The +oraoop-site-template.xml+ file is supplied with the Data Connector for
+Oracle and Hadoop. It contains a number of ALTER SESSION statements that are
+used to initialize the Oracle sessions created by the Data Connector for Oracle
+and Hadoop.
+
+If you need to customize these initializations to your environment then:
+
+1. Find +oraoop-site-template.xml+ in the Sqoop configuration directory.
+2. Copy +oraoop-site-template.xml+ to +oraoop-site.xml+.
+3. Edit the +ALTER SESSION+ statements in +oraoop-site.xml+.
+
+oraoop.oracle.session.initialization.statements
++++++++++++++++++++++++++++++++++++++++++++++++
+
+The value of this property is a semicolon-delimited list of Oracle SQL
+statements. These statements are executed, in order, for each Oracle session
+created by the Data Connector for Oracle and Hadoop.
+
+The default statements include:
+
++alter session set time_zone = \'{oracle.sessionTimeZone|GMT}';+::
+This statement initializes the timezone of the JDBC client. This ensures that
+data from columns of type TIMESTAMP WITH LOCAL TIMEZONE are correctly adjusted
+into the timezone of the client and not kept in the timezone of the Oracle
+database.
++
+[NOTE]
+================================================================================
+- The text within the curly-braces is an expression. See
+"Expressions in oraoop-site.xml" for more information.
+- A list of the time zones supported by your Oracle database is available by
+executing the following query: +SELECT TZNAME FROM V$TIMEZONE_NAMES;+
+================================================================================
+
++alter session disable parallel query;+::
+This statement instructs Oracle to not parallelize SQL statements executed by
+the Data Connector for Oracle and Hadoop sessions. This Oracle feature is
+disabled because the Map/Reduce job launched by Sqoop is the mechanism used
+for parallelization.
++
+It is recommended that you not enable parallel query because it can have an
+adverse effect on the load on the Oracle instance and on the balance between
+the Data Connector for Oracle and Hadoop mappers.
++
+Some export operations are performed in parallel where deemed appropriate by
+the Data Connector for Oracle and Hadoop. See "Parallelization" for
+more information.
+
++alter session set "_serial_direct_read"=true;+::
+This statement instructs Oracle to bypass the buffer cache. This is used to
+prevent Oracle from filling its buffers with the data being read by the Data
+Connector for Oracle and Hadoop, thereby diminishing its capacity to cache
+higher prioritized data. Hence, this statement is intended to minimize the
+Data Connector for Oracle and Hadoop's impact on the immediate future
+performance of the Oracle database.
+ ++--alter session set events \'10046 trace name context forever, level 8';+:: +This statement has been commented-out. To allow tracing, remove the comment +token "--" from the start of the line. + +[NOTE] +================================================================================ +- These statements are placed on separate lines for readability. They do not +need to be placed on separate lines. +- A statement can be commented-out via the standard Oracle double-hyphen +token: "--". The comment takes effect until the next semicolon. +================================================================================ + +oraoop.table.import.where.clause.location ++++++++++++++++++++++++++++++++++++++++++ + +SUBSPLIT (default):: +When set to this value, the where clause is applied to each subquery used to +retrieve data from the Oracle table. ++ +A Sqoop command like: ++ ++sqoop import -D oraoop.table.import.where.clause.location=SUBSPLIT --table +JUNK --where "owner like \'G%'"+ ++ +Generates SQL query of the form: ++ +---- +SELECT OWNER,OBJECT_NAME + FROM JUNK + WHERE ((rowid >= + dbms_rowid.rowid_create(1, 113320, 1024, 4223664, 0) + AND rowid <= + dbms_rowid.rowid_create(1, 113320, 1024, 4223671, 32767))) + AND (owner like 'G%') +UNION ALL +SELECT OWNER,OBJECT_NAME + FROM JUNK + WHERE ((rowid >= + dbms_rowid.rowid_create(1, 113320, 1024, 4223672, 0) + AND rowid <= + dbms_rowid.rowid_create(1, 113320, 1024, 4223679, 32767))) + AND (owner like 'G%') +---- + +SPLIT:: +When set to this value, the where clause is applied to the entire SQL +statement used by each split/mapper. ++ +A Sqoop command like: ++ ++sqoop import -D oraoop.table.import.where.clause.location=SPLIT --table +JUNK --where "rownum <= 10"+ ++ +Generates SQL query of the form: ++ +---- +SELECT OWNER,OBJECT_NAME + FROM ( + SELECT OWNER,OBJECT_NAME + FROM JUNK + WHERE ((rowid >= + dbms_rowid.rowid_create(1, 113320, 1024, 4223664, 0) + AND rowid <= + dbms_rowid.rowid_create(1, 113320, 1024, 4223671, 32767))) + UNION ALL + SELECT OWNER,OBJECT_NAME + FROM JUNK + WHERE ((rowid >= + dbms_rowid.rowid_create(1, 113320, 1024, 4223672, 0) + AND rowid <= + dbms_rowid.rowid_create(1, 113320, 1024, 4223679,32767))) + ) + WHERE rownum <= 10 +---- ++ +[NOTE] +================================================================================ +- In this example, there are up to 10 rows imported per mapper. +- The SPLIT clause may result in greater overhead than the SUBSPLIT +clause because the UNION statements need to be fully materialized +before the data can be streamed to the mappers. However, you may +wish to use SPLIT in the case where you want to limit the total +number of rows processed by each mapper. +================================================================================ + +oracle.row.fetch.size ++++++++++++++++++++++ + +The value of this property is an integer specifying the number of rows the +Oracle JDBC driver should fetch in each network round-trip to the database. +The default value is 5000. + +If you alter this setting, confirmation of the +change is displayed in the logs of the mappers during the Map-Reduce job. + +oraoop.import.hint +++++++++++++++++++ + +The Oracle optimizer hint is added to the SELECT statement for IMPORT jobs +as follows: + +---- +SELECT /*+ NO_INDEX(t) */ * FROM employees; +---- + +The default hint is +NO_INDEX(t)+ + +[NOTE] +================================================================================ +- The hint can be added to the command line. 
See "Import Data from Oracle" for +more information. +- See the Oracle Database Performance Tuning Guide (Using Optimizer Hints) +for more information on Oracle optimizer hints. +- To turn the hint off, insert a space between the elements. ++ +---- + + oraoop.import.hint + + +---- +================================================================================ + +oraoop.oracle.append.values.hint.usage +++++++++++++++++++++++++++++++++++++++ + +The value of this property is one of: AUTO / ON / OFF. + +AUTO:: +AUTO is the default value. ++ +Currently AUTO is equivalent to OFF. + +ON:: +During export the Data Connector for Oracle and Hadoop uses direct path +writes to populate the target Oracle table, bypassing the buffer cache. +Oracle only allows a single session to perform direct writes against a specific +table at any time, so this has the effect of serializing the writes to the +table. This may reduce throughput, especially if the number of mappers is high. +However, for databases where DBWR is very busy, or where the IO bandwidth to +the underlying table is narrow (table resides on a single disk spindle for +instance), then setting +oraoop.oracle.append.values.hint.usage+ to ON may +reduce the load on the Oracle database and possibly increase throughput. + +OFF:: +During export the Data Connector for Oracle and Hadoop does not use the ++APPEND_VALUES+ Oracle hint. + +NOTE: This parameter is only effective on Oracle 11g Release 2 and above. + +mapred.map.tasks.speculative.execution +++++++++++++++++++++++++++++++++++++++ + +By default speculative execution is disabled for the Data Connector for +Oracle and Hadoop. This avoids placing redundant load on the Oracle database. + +If Speculative execution is enabled, then Hadoop may initiate multiple mappers +to read the same blocks of data, increasing the overall load on the database. + +oraoop.block.allocation ++++++++++++++++++++++++ + +This setting determines how Oracle's data-blocks are assigned to Map-Reduce mappers. + +NOTE: Applicable to import. Not applicable to export. + +ROUNDROBIN (default):: +Each chunk of Oracle blocks is allocated to the mappers in a roundrobin +manner. This helps prevent one of the mappers from being +allocated a large proportion of typically small-sized blocks from the +start of Oracle data-files. In doing so it also helps prevent one of the +other mappers from being allocated a large proportion of typically +larger-sized blocks from the end of the Oracle data-files. ++ +Use this method to help ensure all the mappers are allocated a similar +amount of work. + +RANDOM:: +The list of Oracle blocks is randomized before being allocated to the +mappers via a round-robin approach. This has the benefit of increasing +the chance that, at any given instant in time, each mapper is reading +from a different Oracle data-file. If the Oracle data-files are located on +separate spindles, this should increase the overall IO throughput. + +SEQUENTIAL:: +Each chunk of Oracle blocks is allocated to the mappers sequentially. +This produces the tendency for each mapper to sequentially read a large, +contiguous proportion of an Oracle data-file. It is unlikely for the +performance of this method to exceed that of the round-robin method +and it is more likely to allocate a large difference in the work between +the mappers. ++ +Use of this method is generally not recommended. 
+ +oraoop.import.omit.lobs.and.long +++++++++++++++++++++++++++++++++ + +This setting can be used to omit all LOB columns (BLOB, CLOB and NCLOB) and LONG +column from an Oracle table being imported. This is advantageous in +troubleshooting, as it provides a convenient way to exclude all LOB-based data +from the import. + +oraoop.locations +++++++++++++++++ + +NOTE: Applicable to import. Not applicable to export. + +By default, four mappers are used for a Sqoop import job. The number of mappers +can be altered via the Sqoop +--num-mappers+ parameter. + +If the data-nodes in your Hadoop cluster have 4 task-slots (that is they are +4-CPU core machines) it is likely for all four mappers to execute on the +same machine. Therefore, IO may be concentrated between the Oracle database +and a single machine. + +This setting allows you to control which DataNodes in your Hadoop cluster each +mapper executes on. By assigning each mapper to a separate machine you may +improve the overall IO performance for the job. This will also have the +side-effect of the imported data being more diluted across the machines in +the cluster. (HDFS replication will dilute the data across the cluster anyway.) + +Specify the machine names as a comma separated list. The locations are +allocated to each of the mappers in a round-robin manner. + +If using EC2, specify the internal name of the machines. Here is an example +of using this parameter from the Sqoop command-line: + +$ +sqoop import -D +oraoop.locations=ip-10-250-23-225.ec2.internal,ip-10-250-107-32.ec2.internal,ip-10-250-207-2.ec2.internal,ip-10-250-27-114.ec2.internal +--direct --connect...+ + +sqoop.connection.factories +++++++++++++++++++++++++++ + +This setting determines behavior if the Data Connector for Oracle and Hadoop +cannot accept the job. By default Sqoop accepts the jobs that the Data Connector +for Oracle and Hadoop rejects. + +Set the value to +org.apache.sqoop.manager.oracle.OraOopManagerFactory+ when you +want the job to fail if the Data Connector for Oracle and Hadoop cannot +accept the job. + +Expressions in oraoop-site.xml +++++++++++++++++++++++++++++++ + +Text contained within curly-braces { and } are expressions to be evaluated +prior to the SQL statement being executed. The expression contains the name +of the configuration property optionally followed by a default value to use +if the property has not been set. A pipe | character is used to delimit the +property name and the default value. + +For example: + +When this Sqoop command is executed:: +$ +sqoop import -D oracle.sessionTimeZone=US/Hawaii --direct --connect+ + +The statement within oraoop-site.xml:: ++alter session set time_zone =\'{oracle.sessionTimeZone|GMT}\';+ + +Becomes:: ++alter session set time_zone = \'US/Hawaii'+ + +If the oracle.sessionTimeZone property had not been set, then this statement would use the specified default value and would become:: ++alter session set time_zone = \'GMT'+ + +NOTE: The +oracle.sessionTimeZone+ property can be specified within the ++sqoop-site.xml+ file if you want this setting to be used all the time. + +Troubleshooting The Data Connector for Oracle and Hadoop +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Quote Oracle Owners And Tables +++++++++++++++++++++++++++++++ + +[frame="topbot",cols="2*v"] +|=============================================================================== +|If the owner of the Oracle table needs to be +quoted, use: +|$ +sqoop import ... 
--table
+"\"\"Scott\".customers\""+
+
+This is the equivalent of:
+"Scott".customers
+|If the Oracle table needs to be quoted, use:
+|$ +sqoop import ... --table
+"\"scott.\"Customers\"\""+
+
+This is the equivalent of:
+scott."Customers"
+|If both the owner of the Oracle table and the
+table itself needs to be quoted, use:
+|$ +sqoop import ... --table
+"\"\"Scott\".\"Customers\"\""+
+
+This is the equivalent of:
+"Scott"."Customers"
+|===============================================================================
+
+[NOTE]
+================================================================================
+- The HDFS output directory is called something like:
+/user/username/"Scott"."Customers"
+- If a table name contains a $ character, it may need to be escaped within your
+Unix shell. For example, the dr$object table in the ctxsys schema would be
+referred to as: $ +sqoop import ... --table "ctxsys.dr\$object"+
+================================================================================
+
+Quote Oracle Columns
+++++++++++++++++++++
+
+If a column name of an Oracle table needs to be quoted, use::
+$ +sqoop import ... --table customers --columns "\"\"first name\"\""+
++
+This is the equivalent of: `select "first name" from customers`
+
+Confirm The Data Connector for Oracle and Hadoop Can Initialize The Oracle Session
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+
+If the Sqoop output includes feedback such as the following then the
+configuration properties contained within +oraoop-site-template.xml+ and
++oraoop-site.xml+ have been loaded by Hadoop and can be accessed by the Data
+Connector for Oracle and Hadoop.
+
++14/07/08 15:21:13 INFO oracle.OracleConnectionFactory:
+Initializing Oracle session with SQL+
+
+Check The Sqoop Debug Logs for Error Messages
++++++++++++++++++++++++++++++++++++++++++++++
+
+For more information about any errors encountered during the Sqoop import,
+refer to the log files generated by each of the (by default 4) mappers that
+performed the import.
+
+The logs can be obtained via your Map-Reduce Job Tracker's web page.
+
+Include these log files with any requests you make for assistance on the Sqoop
+User Group web site.
+
+Export: Check Tables Are Compatible
++++++++++++++++++++++++++++++++++++
+
+Check that the tables are compatible, particularly in the case of a parsing
+error.
+
+- Ensure the fields contained within the HDFS file and the columns within the
+Oracle table are identical. If they are not identical, the Java code
+dynamically generated by Sqoop to parse the HDFS file will throw an error when
+reading the file, causing the export to fail. When creating a table in Oracle,
+ensure the definitions for the table template are identical to the definitions
+for the HDFS file.
+- Ensure the data types in the table are supported. See "Supported Data Types"
+for more information.
+- Are date and time zone based data types used? See "Export Date And Timestamp
+Data Types into Oracle" for more information.
+
+Export: Parallelization
++++++++++++++++++++++++
+
++-D oraoop.export.oracle.parallelization.enabled=false+
+
+If you see a parallelization error you may decide to disable parallelization
+on Oracle queries.
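+
+For example, a minimal export command with parallelization disabled, following
+the form of the export examples above, is:
+
+$ +sqoop export -D oraoop.export.oracle.parallelization.enabled=false
+--direct --connect ... --table OracleTableName --export-dir
+/user/username/tablename+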
+
+Export: Check oraoop.oracle.append.values.hint.usage
+++++++++++++++++++++++++++++++++++++++++++++++++++++
+
+The +oraoop.oracle.append.values.hint.usage+ parameter should not be set to
+ON if the Oracle table contains either a BINARY_DOUBLE or BINARY_FLOAT column
+and the HDFS file being exported contains a NULL value in either of these
+column types. Doing so will result in the error: +ORA-12838: cannot
+read/modify an object after modifying it in parallel+.
+
+Turn On Verbose
++++++++++++++++
+
+Turn on verbose output on the Sqoop command line.
+
++--verbose+
+
+Check Sqoop stdout (standard output) and the mapper logs for information as to
+where the problem may be.
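+
+For example, to rerun the earlier import with verbose output enabled:
+
+$ +sqoop import --verbose --direct --connect ... --table OracleTableName+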