
rephrased new lines a bit

Fero Szabo 2018-12-06 16:13:33 +01:00
parent 821cb6bfd3
commit 936116ff07


@@ -131,34 +131,32 @@ Create an external Hive table:
 $ sqoop import --hive-import --create-hive-table --connect $CONN --table $TABLENAME --username $USER --password $PASS --external-table-dir /tmp/foobar_example --hive-table foobar
 ----
-Type Mapping in a Hive import using parquet files
-+++++++++++++++++++++++++++++++++++++++++++++++++
+Decimals in Hive imports using parquet files
+++++++++++++++++++++++++++++++++++++++++++++
-As mentioned above, a hive import is a two-step process in Sqoop:
-- Sqoop imports the data with the import tool onto HDFS first.
-- Then, Sqoop generates a Hive statement and executes it, effectively creating a table in Hive.
+As mentioned above, a Hive import is a two-step process in Sqoop:
+first, the data is imported onto HDFS, then a statement is generated and executed to create a Hive table.
-Since Sqoop is using an avro schema to write parquet files, the SQL types of the source table's column are first
-converted into avro types and an avro schema is created. This schema is then used in a regular Parquet import.
-After the data was imported onto HDFS successfully, in the second step, Sqoop uses the Avro
-schema generated for the parquet import to create the Hive query and maps the Avro types to Hive
-types.
+Since Sqoop is using an avro schema to write parquet files, first an Avro schema is generated from the SQL types.
+This schema is then used in a regular Parquet import. After the data was imported onto HDFS successfully,
+Sqoop uses the Avro schema to create a Hive command to create a table in Hive and maps the Avro types to Hive
+types in this process.
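
(For illustration only, not part of the patch above: the statement generated in the second step is ordinary Hive DDL. A rough sketch of what it could look like for the foobar example used earlier on this page, with hypothetical column names and assuming an external, Parquet-backed table, is shown below; the exact statement Sqoop emits may differ.)

----
-- Hypothetical sketch of the statement Sqoop generates for a Parquet Hive import;
-- table and column names are examples only.
CREATE EXTERNAL TABLE `foobar` (
  `id` INT,
  `name` STRING)
STORED AS PARQUET
LOCATION '/tmp/foobar_example';
----
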
-Decimals are converted to String in a parquet import per default, so Decimal columns appear as String
+Decimal SQL types are converted to Strings in a parquet import per default, so Decimal columns appear as String
 columns in Hive per default. You can change this behavior and use logical types instead, so that Decimals
-will be mapped to the Hive type Decimal as well. This has to be enabled with the
-+sqoop.parquet.decimal_padding.enable+ property. As noted in the section discussing
+will be properly mapped to the Hive type Decimal as well. This has to be enabled with the
++sqoop.parquet.logical_types.decimal.enable+ property. As noted in the section discussing
 'Padding number types in avro and parquet import', you should also specify the default precision and scale and
-enable decimal padding.
+enable padding.
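
(Illustration, not part of the patch, with hypothetical column names: a source column declared as DECIMAL(10,2) is created as a string column in Hive by default, and only keeps a DECIMAL type once the logical-type property, together with padding and the default precision and scale, is enabled. Checking the generated table from Hive could then look roughly like this.)

----
-- Hypothetical check of the generated table; column names and the exact output are examples only.
DESCRIBE foobar;
-- With the defaults, a DECIMAL(10,2) source column would be listed as:
--   price    string
-- With sqoop.parquet.logical_types.decimal.enable=true (plus padding and default precision/scale):
--   price    decimal(10,2)
----
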
 A limitation of Hive is that the maximum precision and scale is 38. When converting SQL types to the Hive Decimal
 type, precision and scale will be modified to meet this limitation, automatically. The data itself however, will
-only have to adhere to the limitations of the Parquet import, thus values with a precision and scale bigger than
-38 will be present on storage on HDFS, but they won't be visible in Hive, (since Hive is a schema-on-read tool).
+only have to adhere to the limitations of the Parquet file format, thus values with a precision and scale bigger than
+38 will be present on storage, but they won't be readable by Hive, (since Hive is a schema-on-read tool).
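
(A hypothetical example of the limitation above; the exact adjustment Sqoop applies may differ: a source column defined as DECIMAL(45,20) cannot keep its precision in Hive, so the generated table would presumably declare it with the precision capped at Hive's maximum of 38, while the Parquet files on HDFS still hold the original 45-digit values; rows exceeding what the Hive schema can represent are then simply not readable through Hive.)

----
-- Hypothetical: source column amount DECIMAL(45, 20); table name and location are examples only.
CREATE EXTERNAL TABLE `foobar_wide` (
  `amount` DECIMAL(38, 20))
STORED AS PARQUET
LOCATION '/tmp/foobar_wide_example';
----
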
-Enable padding and specifying a default precision and scale in a Hive Import:
+Enabling padding and specifying a default precision and scale in a Hive Import:
 ----
-$ sqoop import -Dsqoop.parquet.decimal_padding.enable=true -Dsqoop.parquet.logical_types.decimal.enable=true
+$ sqoop import -Dsqoop.avro.decimal_padding.enable=true -Dsqoop.parquet.logical_types.decimal.enable=true
 -Dsqoop.avro.logical_types.decimal.default.precision=38 -Dsqoop.avro.logical_types.decimal.default.scale=10
 --hive-import --connect $CONN --table $TABLENAME --username $USER --password $PASS --as-parquetfile
 ----