diff --git a/src/docs/user/hive.txt b/src/docs/user/hive.txt
index 75a389be..03c2bff0 100644
--- a/src/docs/user/hive.txt
+++ b/src/docs/user/hive.txt
@@ -112,10 +112,14 @@ configuring a new Hive table with the correct InputFormat. This
 feature currently requires that all partitions of a table be compressed
 with the lzop codec.
 
-The user can specify the +\--external-table-dir+ option in the sqoop command to
+External table import
++++++++++++++++++++++
+
+You can specify the +\--external-table-dir+ option in the sqoop command to
 work with an external Hive table (instead of a managed table, i.e. the default behavior).
 To import data into an external table, one has to specify +\--hive-import+ in the command
-line arguments. Table creation is also supported with the use of +\--create-hive-table+.
+line arguments. Table creation is also supported with the use of the +\--create-hive-table+
+option.
 
 Importing into an external Hive table:
 ----
@@ -126,3 +130,35 @@ Create an external Hive table:
 ----
 $ sqoop import --hive-import --create-hive-table --connect $CONN --table $TABLENAME --username $USER --password $PASS --external-table-dir /tmp/foobar_example --hive-table foobar
 ----
+
+Type Mapping in a Hive import using parquet files
++++++++++++++++++++++++++++++++++++++++++++++++++
+
+As mentioned above, a Hive import is a two-step process in Sqoop:
+- Sqoop imports the data with the import tool onto HDFS first.
+- Then, Sqoop generates a Hive statement and executes it, effectively creating a table in Hive.
+
+Since Sqoop uses an Avro schema to write Parquet files, the SQL types of the source table's columns are first
+converted into Avro types and an Avro schema is created. This schema is then used in a regular Parquet import.
+After the data has been imported onto HDFS successfully, in the second step, Sqoop uses the Avro
+schema generated for the Parquet import to create the Hive statement and maps the Avro types to Hive
+types.
+
+Decimals are converted to String in a Parquet import by default, so Decimal columns appear as String
+columns in Hive. You can change this behavior and use logical types instead, so that Decimals
+will be mapped to the Hive type Decimal as well. This has to be enabled with the
++sqoop.parquet.logical_types.decimal.enable+ property. As noted in the section discussing
+'Padding number types in avro and parquet import', you should also specify the default precision and scale and
+enable decimal padding.
+
+A limitation of Hive is that the maximum supported precision and scale is 38. When converting SQL types to the Hive Decimal
+type, precision and scale are automatically adjusted to meet this limitation. The data itself, however, only
+has to adhere to the limitations of the Parquet import, so values with a precision or scale bigger than
+38 will be present on storage on HDFS, but they won't be visible in Hive (since Hive is a schema-on-read tool).
+
+Enabling padding and specifying a default precision and scale in a Hive import:
+----
+$ sqoop import -Dsqoop.parquet.decimal_padding.enable=true -Dsqoop.parquet.logical_types.decimal.enable=true
+    -Dsqoop.avro.logical_types.decimal.default.precision=38 -Dsqoop.avro.logical_types.decimal.default.scale=10
+    --hive-import --connect $CONN --table $TABLENAME --username $USER --password $PASS --as-parquetfile
+----
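+
+A quick way to verify the resulting type mapping is to describe the table from the Hive CLI
+once the import has finished. The example below is only a sketch: it assumes that the import
+above created a Hive table named foobar and that the source table contained a hypothetical
+column named price declared as NUMBER(20,5).
+
+----
+$ hive -e "DESCRIBE foobar"
+----
+
+With logical types and padding enabled, the price column is reported with the Hive type
+decimal(20,5); with the default settings it is reported as string.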
diff --git a/src/docs/user/import.txt b/src/docs/user/import.txt
index 79f71012..d878e216 100644
--- a/src/docs/user/import.txt
+++ b/src/docs/user/import.txt
@@ -476,33 +476,43 @@ i.e.
 used during both avro and parquet imports, one has to use the
 sqoop.avro.logical_types.decimal.enable flag. This is necessary if one
 wants to store values as decimals in the avro file format.
 
-Padding number types in avro import
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Padding number types in avro and parquet import
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Certain databases, such as Oracle and Postgres store number and decimal
 values without padding. For example 1.5 in a column declared
-as NUMBER (20,5) is stored as is in Oracle, while the equivalent
+as NUMBER (20, 5) is stored as is in Oracle, while the equivalent
 DECIMAL (20, 5) is stored as 1.50000 in an SQL server instance.
 This leads to a scale mismatch during avro import.
 To avoid this error, one can use the sqoop.avro.decimal_padding.enable flag
-to turn on padding with 0s. This flag has to be used together with the
-sqoop.avro.logical_types.decimal.enable flag set to true.
+to turn on padding with 0s. One also has to enable logical types with the
+sqoop.avro.logical_types.decimal.enable property set to true during an avro import,
+or with the sqoop.parquet.logical_types.decimal.enable property during a parquet import.
 
-Default precision and scale in avro import
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Default precision and scale in avro and parquet import
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 All of the databases allow users to specify numeric columns
 without a precision or scale. While MS SQL and MySQL translate these
 into a valid precision and scale values, Oracle and Postgres don't.
-Therefore, when a table contains NUMBER in a table in Oracle or
+When a table contains NUMBER in a table in Oracle or
 NUMERIC/DECIMAL in Postgres, one can specify a default precision and
 scale to be used in the avro schema by using the
 +sqoop.avro.logical_types.decimal.default.precision+
-and +sqoop.avro.logical_types.decimal.default.scale+ flags.
+and +sqoop.avro.logical_types.decimal.default.scale+ properties.
 Avro padding also has to be enabled, if the values are shorter than
 the specified default scale.
 
+Even though their name contains 'avro', the very same properties
+(+sqoop.avro.logical_types.decimal.default.precision+ and +sqoop.avro.logical_types.decimal.default.scale+)
+can be used to specify defaults during a parquet import as well.
+But please note that the padding has to be enabled with the parquet-specific property.
+
+The implementation of the padding logic is database independent.
+However, our tests cover only the Oracle, Postgres, MS SQL Server and MySQL databases,
+therefore these are the only supported ones.
+
 Large Objects
 ^^^^^^^^^^^^^
 
@@ -855,3 +865,13 @@ $ sqoop import -Dsqoop.avro.decimal_padding.enable=true -Dsqoop.avro.logical_typ
     --target-dir hdfs://nameservice1//etl/target_path --as-avrodatafile --verbose -m 1
 ----
 
+
+The same in a parquet import:
+
+----
+$ sqoop import -Dsqoop.parquet.decimal_padding.enable=true -Dsqoop.parquet.logical_types.decimal.enable=true
+    -Dsqoop.avro.logical_types.decimal.default.precision=38 -Dsqoop.avro.logical_types.decimal.default.scale=10
+    --connect $CON --username $USER --password $PASS --query "select * from table_name where \$CONDITIONS"
+    --target-dir hdfs://nameservice1//etl/target_path --as-parquetfile --verbose -m 1
+
+----
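+
+For the avro import above, one can check that the decimal columns were written with the
+expected logical type, precision and scale by inspecting the schema of one of the generated
+files with avro-tools. The commands below are only an illustration: the avro-tools version,
+its location and the part file name depend on your environment and on the actual job output.
+
+----
+$ hdfs dfs -get hdfs://nameservice1//etl/target_path/part-m-00000.avro
+$ java -jar avro-tools-1.8.1.jar getschema part-m-00000.avro
+----
+
+In the printed schema the decimal columns should carry the "logicalType": "decimal"
+attribute together with the configured (or default) precision and scale.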