
Documentation for parquet decimal support, and parquet decimal support in Hive import

Fero Szabo 2018-12-05 17:37:23 +01:00
parent a50394977b
commit 821cb6bfd3
2 changed files with 67 additions and 11 deletions


@@ -112,10 +112,14 @@ configuring a new Hive table with the correct InputFormat. This feature
currently requires that all partitions of a table be compressed with the lzop
codec.

External table import
+++++++++++++++++++++

You can specify the +\--external-table-dir+ option in the sqoop command to
work with an external Hive table (instead of a managed table, i.e. the default behavior).
To import data into an external table, one has to specify +\--hive-import+ in the command
line arguments. Table creation is also supported with the use of the +\--create-hive-table+
option.
Importing into an external Hive table:
----
@@ -126,3 +130,35 @@ Create an external Hive table:
----
$ sqoop import --hive-import --create-hive-table --connect $CONN --table $TABLENAME --username $USER --password $PASS --external-table-dir /tmp/foobar_example --hive-table foobar
----
Type Mapping in a Hive import using parquet files
+++++++++++++++++++++++++++++++++++++++++++++++++
As mentioned above, a Hive import is a two-step process in Sqoop:

- Sqoop imports the data with the import tool onto HDFS first.
- Then, Sqoop generates a Hive statement and executes it, effectively creating a table in Hive.

Since Sqoop uses an avro schema to write parquet files, the SQL types of the source table's columns are first
converted into avro types and an avro schema is created. This schema is then used in a regular Parquet import.
After the data has been imported onto HDFS successfully, in the second step, Sqoop uses the Avro
schema generated for the parquet import to create the Hive query and maps the Avro types to Hive
types.
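
For instance, a basic Hive import in parquet format triggers both steps with a single command. This is a minimal sketch that reuses the placeholder variables from the examples above and assumes default settings otherwise:

----
$ sqoop import --hive-import --connect $CONN --table $TABLENAME --username $USER --password $PASS --as-parquetfile
----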
Decimals are converted to String in a parquet import by default, so Decimal columns appear as String
columns in Hive by default. You can change this behavior and use logical types instead, so that Decimals
will be mapped to the Hive type Decimal as well. This has to be enabled with the
+sqoop.parquet.logical_types.decimal.enable+ property. As noted in the section discussing
'Padding number types in avro and parquet import', you should also specify the default precision and scale and
enable decimal padding.
A limitation of Hive is that the maximum precision and scale is 38. When converting SQL types to the Hive Decimal
type, precision and scale are automatically modified to meet this limitation. The data itself, however, only
has to adhere to the limitations of the Parquet import, so values with a precision or scale bigger than
38 will be present in the files stored on HDFS, but they won't be visible in Hive (since Hive is a schema-on-read tool).
Enabling padding and specifying a default precision and scale in a Hive import:
----
$ sqoop import -Dsqoop.parquet.decimal_padding.enable=true -Dsqoop.parquet.logical_types.decimal.enable=true
-Dsqoop.avro.logical_types.decimal.default.precision=38 -Dsqoop.avro.logical_types.decimal.default.scale=10
--hive-import --connect $CONN --table $TABLENAME --username $USER --password $PASS --as-parquetfile
----


@@ -476,33 +476,43 @@ i.e. used during both avro and parquet imports, one has to use the
sqoop.avro.logical_types.decimal.enable flag. This is necessary if one
wants to store values as decimals in the avro file format.
Padding number types in avro and parquet import
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Certain databases, such as Oracle and Postgres, store number and decimal
values without padding. For example, 1.5 in a column declared
as NUMBER (20, 5) is stored as is in Oracle, while the equivalent
DECIMAL (20, 5) is stored as 1.50000 in an SQL Server instance.

This leads to a scale mismatch during avro import.

To avoid this error, one can use the sqoop.avro.decimal_padding.enable flag
to turn on padding with 0s. One also has to enable logical types with the
sqoop.avro.logical_types.decimal.enable property set to true during an avro import,
or with the sqoop.parquet.logical_types.decimal.enable property during a parquet import.
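
For example, a minimal sketch of an avro import with both padding and logical types enabled could look like the following (placeholder connection options as in the other examples; the full examples later in this section also set a default precision and scale):

----
$ sqoop import -Dsqoop.avro.decimal_padding.enable=true -Dsqoop.avro.logical_types.decimal.enable=true
    --connect $CONN --table $TABLENAME --username $USER --password $PASS --as-avrodatafile
----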
Default precision and scale in avro and parquet import
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
All of the databases allow users to specify numeric columns without
a precision or scale. While MS SQL and MySQL translate these into
valid precision and scale values, Oracle and Postgres don't.

When a table contains a NUMBER column in Oracle or
a NUMERIC/DECIMAL column in Postgres, one can specify a default precision and scale
to be used in the avro schema by using the +sqoop.avro.logical_types.decimal.default.precision+
and +sqoop.avro.logical_types.decimal.default.scale+ properties.

Avro padding also has to be enabled if the values are shorter than
the specified default scale.
Even though their name contains 'avro', the very same properties
(+sqoop.avro.logical_types.decimal.default.precision+ and +sqoop.avro.logical_types.decimal.default.scale+)
can be used to specify defaults during a parquet import as well.
Note, however, that the padding has to be enabled with the parquet-specific property.

The implementation of the padding logic is database independent.
Our tests cover only the Oracle, Postgres, MS SQL Server and MySQL databases,
therefore these are the supported ones.
Large Objects
^^^^^^^^^^^^^
@@ -855,3 +865,13 @@ $ sqoop import -Dsqoop.avro.decimal_padding.enable=true -Dsqoop.avro.logical_typ
--target-dir hdfs://nameservice1//etl/target_path --as-avrodatafile --verbose -m 1
----
The same in a parquet import:
----
$ sqoop import -Dsqoop.parquet.decimal_padding.enable=true -Dsqoop.avro.logical_types.decimal.enable=true
-Dsqoop.avro.logical_types.decimal.default.precision=38 -Dsqoop.avro.logical_types.decimal.default.scale=10
--connect $CON --username $USER --password $PASS --query "select * from table_name where \$CONDITIONS"
--target-dir hdfs://nameservice1//etl/target_path --as-parquetfile --verbose -m 1
----