Documentation for parquet decimal support, and parquet decimal support in Hive import
commit 821cb6bfd3
parent a50394977b

@@ -112,10 +112,14 @@ configuring a new Hive table with the correct InputFormat. This feature
currently requires that all partitions of a table be compressed with the lzop
codec.

External table import
+++++++++++++++++++++

You can specify the +\--external-table-dir+ option in the sqoop command to
work with an external Hive table (instead of a managed table, i.e. the default behavior).
To import data into an external table, one has to specify +\--hive-import+ in the command
line arguments. Table creation is also supported with the +\--create-hive-table+ option.

Importing into an external Hive table:
----
@@ -126,3 +130,35 @@ Create an external Hive table:
----
$ sqoop import --hive-import --create-hive-table --connect $CONN --table $TABLENAME --username $USER --password $PASS --external-table-dir /tmp/foobar_example --hive-table foobar
----
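
If you want to double-check the result, the Hive CLI (or beeline) can show whether the table was created as an external table and where its data lives. This is just an optional verification step, assuming a +hive+ client is available and the table name +foobar+ from the example above:

----
$ hive -e "DESCRIBE FORMATTED foobar"
----

The output should list the table type as EXTERNAL_TABLE and the location as the directory given with +\--external-table-dir+.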

Type Mapping in a Hive import using parquet files
+++++++++++++++++++++++++++++++++++++++++++++++++

As mentioned above, a Hive import is a two-step process in Sqoop:

- Sqoop imports the data with the import tool onto HDFS first.
- Then, Sqoop generates a Hive statement and executes it, effectively creating a table in Hive.

Since Sqoop uses an avro schema to write parquet files, the SQL types of the source table's columns are first
converted into avro types and an avro schema is created. This schema is then used in a regular Parquet import.
After the data has been imported onto HDFS successfully, in the second step, Sqoop uses the Avro
schema generated for the parquet import to create the Hive query and maps the Avro types to Hive
types.
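
If you are curious about the intermediate schema, the parquet files written in the first step carry it in their metadata. One way to look at it is the +parquet-tools+ utility, if it is available on the cluster; the file path below is only an illustrative example, the actual output file names depend on your import:

----
$ parquet-tools schema /user/$USER/foobar/part-m-00000.parquet
----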

Decimals are converted to String in a parquet import by default, so Decimal columns appear as String
columns in Hive. You can change this behavior and use logical types instead, so that Decimals
are mapped to the Hive type Decimal as well. This has to be enabled with the
+sqoop.parquet.logical_types.decimal.enable+ property. As noted in the section discussing
'Padding number types in avro and parquet import', you should also specify the default precision and scale and
enable decimal padding.

A limitation of Hive is that the maximum precision and scale is 38. When converting SQL types to the Hive Decimal
type, precision and scale are automatically modified to meet this limitation. The data itself, however,
only has to adhere to the limitations of the Parquet import: values with a precision or scale bigger than
38 will be present in the files stored on HDFS, but they won't be visible in Hive (since Hive is a schema-on-read tool).

Enabling padding and specifying a default precision and scale in a Hive import:
----
$ sqoop import -Dsqoop.parquet.decimal_padding.enable=true -Dsqoop.parquet.logical_types.decimal.enable=true
-Dsqoop.avro.logical_types.decimal.default.precision=38 -Dsqoop.avro.logical_types.decimal.default.scale=10
--hive-import --connect $CONN --table $TABLENAME --username $USER --password $PASS --as-parquetfile
----
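
After such an import you can check how the decimal columns ended up being typed in Hive, for example (assuming a +hive+ client and the +$TABLENAME+ placeholder from above):

----
$ hive -e "DESCRIBE $TABLENAME"
----

Columns imported with logical types enabled should show up as +decimal(precision,scale)+, capped at 38 as described above, instead of +string+.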

@@ -476,33 +476,43 @@ i.e. used during both avro and parquet imports, one has to use the
sqoop.avro.logical_types.decimal.enable flag. This is necessary if one
wants to store values as decimals in the avro file format.

Padding number types in avro and parquet import
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Certain databases, such as Oracle and Postgres, store number and decimal
values without padding. For example, 1.5 in a column declared
as NUMBER (20, 5) is stored as is in Oracle, while the equivalent
DECIMAL (20, 5) is stored as 1.50000 in an SQL Server instance.
This leads to a scale mismatch during avro import.

To avoid this error, one can use the sqoop.avro.decimal_padding.enable flag
to turn on padding with 0s during the import. One also has to enable logical types with the
sqoop.avro.logical_types.decimal.enable property set to true during an avro import,
or with the sqoop.parquet.logical_types.decimal.enable property during a parquet import.
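
As a quick illustration, an avro import with padding and logical types turned on looks like this (using the same placeholders as the other examples; a complete example is given at the end of this section):

----
$ sqoop import -Dsqoop.avro.decimal_padding.enable=true -Dsqoop.avro.logical_types.decimal.enable=true
--connect $CONN --table $TABLENAME --username $USER --password $PASS --as-avrodatafile
----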

Default precision and scale in avro and parquet import
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

All of the databases allow users to specify numeric columns without
a precision or scale. While MS SQL and MySQL translate these into
valid precision and scale values, Oracle and Postgres don't.

When a table contains a NUMBER column in Oracle or
a NUMERIC/DECIMAL column in Postgres, one can specify a default precision and scale
to be used in the avro schema by using the +sqoop.avro.logical_types.decimal.default.precision+
and +sqoop.avro.logical_types.decimal.default.scale+ properties.
Avro padding also has to be enabled, if the values are shorter than
the specified default scale.

Even though their name contains 'avro', the very same properties
(+sqoop.avro.logical_types.decimal.default.precision+ and +sqoop.avro.logical_types.decimal.default.scale+)
can be used to specify defaults during a parquet import as well.
But please note that the padding has to be enabled with the parquet-specific property.

The implementation of the padding logic is database independent.
Our tests cover only Oracle, Postgres, MS SQL Server and MySQL databases,
therefore these are the supported ones.

Large Objects
^^^^^^^^^^^^^

@@ -855,3 +865,13 @@ $ sqoop import -Dsqoop.avro.decimal_padding.enable=true -Dsqoop.avro.logical_typ
--target-dir hdfs://nameservice1//etl/target_path --as-avrodatafile --verbose -m 1

----

The same in a parquet import:

----
$ sqoop import -Dsqoop.parquet.decimal_padding.enable=true -Dsqoop.avro.logical_types.decimal.enable=true
-Dsqoop.avro.logical_types.decimal.default.precision=38 -Dsqoop.avro.logical_types.decimal.default.scale=10
--connect $CON --username $USER --password $PASS --query "select * from table_name where \$CONDITIONS"
--target-dir hdfs://nameservice1//etl/target_path --as-parquetfile --verbose -m 1

----