From 17461e91db01bf67663caf0fb35e8920128c1aba Mon Sep 17 00:00:00 2001
From: Boglarka Egyed
Date: Mon, 16 Jul 2018 16:23:56 +0200
Subject: [PATCH] SQOOP-3338: Document Parquet support (Szabolcs Vasas via Boglarka Egyed)

---
 src/docs/user/hive-args.txt  |   3 +-
 src/docs/user/hive-notes.txt |   4 +-
 src/docs/user/import.txt     | 125 +++++++++++++++++++++++------------
 3 files changed, 88 insertions(+), 44 deletions(-)

diff --git a/src/docs/user/hive-args.txt b/src/docs/user/hive-args.txt
index 8af9a1c5..4edf3388 100644
--- a/src/docs/user/hive-args.txt
+++ b/src/docs/user/hive-args.txt
@@ -42,7 +42,8 @@ Argument Description
 +\--map-column-hive <map>+          Override default mapping from SQL type to\
                                     Hive type for configured columns. If specify commas in\
                                     this argument, use URL encoded keys and values, for example,\
-                                    use DECIMAL(1%2C%201) instead of DECIMAL(1, 1).
+                                    use DECIMAL(1%2C%201) instead of DECIMAL(1, 1). Note that when using the Parquet file format, users have\
+                                    to use +\--map-column-java+ instead of this option.
 +\--hs2-url+                        The JDBC connection string to HiveServer2 as you would specify in Beeline. If you use this option with \
                                     --hive-import then Sqoop will try to connect to HiveServer2 instead of using Hive CLI.
 +\--hs2-user+                       The user for creating the JDBC connection to HiveServer2. The default is the current OS user.
diff --git a/src/docs/user/hive-notes.txt b/src/docs/user/hive-notes.txt
index deee2702..af97d94b 100644
--- a/src/docs/user/hive-notes.txt
+++ b/src/docs/user/hive-notes.txt
@@ -32,7 +32,7 @@ informing you of the loss of precision.
 Parquet Support in Hive
 ~~~~~~~~~~~~~~~~~~~~~~~

-In order to contact the Hive MetaStore from a MapReduce job, a delegation token will
-be fetched and passed. HIVE_CONF_DIR and HIVE_HOME must be set appropriately to add
+When using the Kite Dataset API based Parquet implementation, a delegation token will be fetched and passed
+in order to contact the Hive MetaStore from a MapReduce job. HIVE_CONF_DIR and HIVE_HOME must be set appropriately to add
 Hive to the runtime classpath. Otherwise, importing/exporting into Hive in Parquet
 format may not work.
diff --git a/src/docs/user/import.txt b/src/docs/user/import.txt
index 2d074f49..a2c16d95 100644
--- a/src/docs/user/import.txt
+++ b/src/docs/user/import.txt
@@ -51,46 +51,47 @@ include::validation-args.txt[]

 .Import control arguments:
 [grid="all"]
-`----------------------------------`----------------------------------------
-Argument                           Description
+`--------------------------------------------`------------------------------
+Argument                                     Description
 --------------------------------------------------------------------------
-+\--append+                        Append data to an existing dataset\
-                                   in HDFS
-+\--as-avrodatafile+               Imports data to Avro Data Files
-+\--as-sequencefile+               Imports data to SequenceFiles
-+\--as-textfile+                   Imports data as plain text (default)
-+\--as-parquetfile+                Imports data to Parquet Files
-+\--boundary-query <statement>+    Boundary query to use for creating splits
-+\--columns <col,col,col...>+      Columns to import from table
-+\--delete-target-dir+             Delete the import target directory\
-                                   if it exists
-+\--direct+                        Use direct connector if exists for the database
-+\--fetch-size <n>+                Number of entries to read from database\
-                                   at once.
-+\--inline-lob-limit <n>+          Set the maximum size for an inline LOB
-+-m,\--num-mappers <n>+            Use 'n' map tasks to import in parallel
-+-e,\--query <statement>+          Import the results of '+statement+'.
-+\--split-by <column-name>+        Column of the table used to split work\
-                                   units. Cannot be used with\
-                                   +--autoreset-to-one-mapper+ option.
-+\--split-limit <n>+               Upper Limit for each split size.\
-                                   This only applies to Integer and Date columns.\
-                                   For date or timestamp fields it is calculated in seconds.
-+\--autoreset-to-one-mapper+       Import should use one mapper if a table\
-                                   has no primary key and no split-by column\
-                                   is provided. Cannot be used with\
-                                   +--split-by <col>+ option.
-+\--table <table-name>+            Table to read
-+\--target-dir <dir>+              HDFS destination dir
-+\--temporary-rootdir <dir>+       HDFS directory for temporary files created during import (overrides default "_sqoop")
-+\--warehouse-dir <dir>+           HDFS parent for table destination
-+\--where <where clause>+          WHERE clause to use during import
-+-z,\--compress+                   Enable compression
-+\--compression-codec <c>+         Use Hadoop codec (default gzip)
-+--null-string <null-string>+      The string to be written for a null\
-                                   value for string columns
-+--null-non-string <null-string>+  The string to be written for a null\
-                                   value for non-string columns
++\--append+                                  Append data to an existing dataset\
+                                             in HDFS
++\--as-avrodatafile+                         Imports data to Avro Data Files
++\--as-sequencefile+                         Imports data to SequenceFiles
++\--as-textfile+                             Imports data as plain text (default)
++\--as-parquetfile+                          Imports data to Parquet Files
++\--parquet-configurator-implementation+     Sets the implementation used during Parquet import. Supported values: kite, hadoop.
++\--boundary-query <statement>+              Boundary query to use for creating splits
++\--columns <col,col,col...>+                Columns to import from table
++\--delete-target-dir+                       Delete the import target directory\
+                                             if it exists
++\--direct+                                  Use direct connector if exists for the database
++\--fetch-size <n>+                          Number of entries to read from database\
+                                             at once.
++\--inline-lob-limit <n>+                    Set the maximum size for an inline LOB
++-m,\--num-mappers <n>+                      Use 'n' map tasks to import in parallel
++-e,\--query <statement>+                    Import the results of '+statement+'.
++\--split-by <column-name>+                  Column of the table used to split work\
+                                             units. Cannot be used with\
+                                             +--autoreset-to-one-mapper+ option.
++\--split-limit <n>+                         Upper Limit for each split size.\
+                                             This only applies to Integer and Date columns.\
+                                             For date or timestamp fields it is calculated in seconds.
++\--autoreset-to-one-mapper+                 Import should use one mapper if a table\
+                                             has no primary key and no split-by column\
+                                             is provided. Cannot be used with\
+                                             +--split-by <col>+ option.
++\--table <table-name>+                      Table to read
++\--target-dir <dir>+                        HDFS destination dir
++\--temporary-rootdir <dir>+                 HDFS directory for temporary files created during import (overrides default "_sqoop")
++\--warehouse-dir <dir>+                     HDFS parent for table destination
++\--where <where clause>+                    WHERE clause to use during import
++-z,\--compress+                             Enable compression
++\--compression-codec <c>+                   Use Hadoop codec (default gzip)
++--null-string <null-string>+                The string to be written for a null\
+                                             value for string columns
++--null-non-string <null-string>+            The string to be written for a null\
+                                             value for non-string columns
 --------------------------------------------------------------------------

 The +\--null-string+ and +\--null-non-string+ arguments are optional.\
@@ -402,8 +403,8 @@ saved jobs later in this document for more information.
 File Formats
 ^^^^^^^^^^^^

-You can import data in one of two file formats: delimited text or
-SequenceFiles.
+You can import data in one of these file formats: delimited text,
+SequenceFiles, Avro and Parquet.

 Delimited text is the default import format. You can also specify it
 explicitly by using the +\--as-textfile+ argument. This argument will write
@@ -444,6 +445,48 @@ argument, or specify any Hadoop compression codec using the
 +\--compression-codec+ argument. This applies to SequenceFile, text,
 and Avro files.

+Parquet support
++++++++++++++++
+
+Sqoop has two different implementations for importing data in Parquet format:
+
+- Kite Dataset API based implementation (default, legacy)
+- Parquet Hadoop API based implementation (recommended)
+
+Users can specify the desired implementation with the +\--parquet-configurator-implementation+ option:
+
+----
+$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES --as-parquetfile --parquet-configurator-implementation kite
+----
+
+----
+$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES --as-parquetfile --parquet-configurator-implementation hadoop
+----
+
+If the +\--parquet-configurator-implementation+ option is not present, Sqoop will check the value of the +parquetjob.configurator.implementation+
+property (which can be specified using -D in the Sqoop command or in the site.xml). If that value is also absent, Sqoop will
+default to the Kite Dataset API based implementation.
+
+The Kite Dataset API based implementation executes the import command on a different code
+path than the text import: it creates the Hive table based on the generated Avro schema by connecting to the Hive metastore.
+This can be a disadvantage, since moving from the text file format to the Parquet file format can lead to many
+unexpected behavioral changes. Kite checks the Hive table schema before importing, so if the data to be imported has a schema
+incompatible with the Hive table's schema, Sqoop will throw an error. This implementation
+uses the snappy codec for compression by default and also supports the bzip codec.
+
+The Parquet Hadoop API based implementation builds the Hive CREATE TABLE statement and executes the
+LOAD DATA INPATH command just like the text import does. Unlike Kite, it also supports connecting to HiveServer2 (using the +\--hs2-url+ option),
+so it provides better security features. This implementation does not check the Hive table's schema before importing, so
+the user may successfully import data into Hive but then get an error during a later Hive read operation.
+It does not use any compression by default but supports the snappy and bzip codecs.
+
+The example below demonstrates how to use Sqoop to import into Hive in Parquet format using HiveServer2 and the snappy codec:
+
+----
+$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES --as-parquetfile --compression-codec snappy \
+--parquet-configurator-implementation hadoop --hs2-url "jdbc:hive2://hs2.foo.com:10000" --hs2-keytab "/path/to/keytab"
+----
+
 Enabling Logical Types in Avro and Parquet import for numbers
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
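
Besides the +\--parquet-configurator-implementation+ option, the documentation above also mentions selecting the
implementation through the +parquetjob.configurator.implementation+ property, set either with -D on the command line
or in the site.xml, but no example is given for that form. The following is a minimal sketch, not taken from the patch
itself: it assumes the property name exactly as quoted above and the standard Sqoop/Hadoop generic-argument syntax,
where -D options are placed right after the tool name, before any tool-specific arguments. The connect string and
table name are the same placeholders used in the earlier examples.

----
$ sqoop import -Dparquetjob.configurator.implementation=hadoop \
    --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES --as-parquetfile
----

As described above, Sqoop only falls back to this property when the +\--parquet-configurator-implementation+ option
is absent, so an explicit command-line option always wins over the property value.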