
SQOOP-3338: Document Parquet support

(Szabolcs Vasas via Boglarka Egyed)
Boglarka Egyed 2018-07-16 16:23:56 +02:00
parent e639053251
commit 17461e91db
3 changed files with 88 additions and 44 deletions


@@ -42,7 +42,8 @@ Argument Description
+\--map-column-hive <map>+ Override default mapping from SQL type to\
Hive type for configured columns. If specify commas in\
this argument, use URL encoded keys and values, for example,\
use DECIMAL(1%2C%201) instead of DECIMAL(1, 1). Note that in case of the Parquet file format, users have\
to use +\--map-column-java+ instead of this option (see the example sketched below).
+\--hs2-url+ The JDBC connection string to HiveServer2 as you would specify in Beeline. If you use this option with \
--hive-import then Sqoop will try to connect to HiveServer2 instead of using Hive CLI.
+\--hs2-user+ The user for creating the JDBC connection to HiveServer2. The default is the current OS user.
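For illustration, a minimal sketch of a Parquet based Hive import that overrides column types with +\--map-column-java+ instead of +\--map-column-hive+ (the connection string, table and column names are placeholders):
----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --hive-import --as-parquetfile --map-column-java id=Integer,salary=Double
----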


@@ -32,7 +32,7 @@ informing you of the loss of precision.
Parquet Support in Hive
~~~~~~~~~~~~~~~~~~~~~~~
When using the Kite Dataset API based Parquet implementation, in order to contact the Hive MetaStore
from a MapReduce job, a delegation token will be fetched and passed. HIVE_CONF_DIR and HIVE_HOME must be set appropriately to add
Hive to the runtime classpath. Otherwise, importing/exporting into Hive in Parquet
format may not work.
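For example, the environment could be prepared along these lines before running the import (the paths below are placeholders for an actual Hive installation):
----
$ export HIVE_HOME=/usr/lib/hive
$ export HIVE_CONF_DIR=$HIVE_HOME/conf
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES --hive-import --as-parquetfile
----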


@@ -51,46 +51,47 @@ include::validation-args.txt[]
.Import control arguments:
[grid="all"]
`-------------------------------------------`----------------------------
Argument Description
-------------------------------------------------------------------------
+\--append+ Append data to an existing dataset\
in HDFS
+\--as-avrodatafile+ Imports data to Avro Data Files
+\--as-sequencefile+ Imports data to SequenceFiles
+\--as-textfile+ Imports data as plain text (default)
+\--as-parquetfile+ Imports data to Parquet Files
+\--parquet-configurator-implementation+ Sets the implementation used during Parquet import. Supported values: kite, hadoop.
+\--boundary-query <statement>+ Boundary query to use for creating splits
+\--columns <col,col,col...>+ Columns to import from table
+\--delete-target-dir+ Delete the import target directory\
if it exists
+\--direct+ Use direct connector if exists for the database
+\--fetch-size <n>+ Number of entries to read from database\
at once.
+\--inline-lob-limit <n>+ Set the maximum size for an inline LOB
+-m,\--num-mappers <n>+ Use 'n' map tasks to import in parallel
+-e,\--query <statement>+ Import the results of '+statement+'.
+\--split-by <column-name>+ Column of the table used to split work\
units. Cannot be used with\
+--autoreset-to-one-mapper+ option.
+\--split-limit <n>+ Upper Limit for each split size.\
This only applies to Integer and Date columns.\
For date or timestamp fields it is calculated in seconds.
+\--autoreset-to-one-mapper+ Import should use one mapper if a table\
has no primary key and no split-by column\
is provided. Cannot be used with\
+--split-by <col>+ option.
+\--table <table-name>+ Table to read
+\--target-dir <dir>+ HDFS destination dir
+\--temporary-rootdir <dir>+ HDFS directory for temporary files created during import (overrides default "_sqoop")
+\--warehouse-dir <dir>+ HDFS parent for table destination
+\--where <where clause>+ WHERE clause to use during import
+-z,\--compress+ Enable compression
+\--compression-codec <c>+ Use Hadoop codec (default gzip)
+--null-string <null-string>+ The string to be written for a null\
value for string columns
+--null-non-string <null-string>+ The string to be written for a null\
value for non-string columns
-------------------------------------------------------------------------
The +\--null-string+ and +\--null-non-string+ arguments are optional.\
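As an illustration of the null-handling arguments listed above, a sketch with placeholder connection string and table:
----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --null-string '\\N' --null-non-string '\\N'
----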
@@ -402,8 +403,8 @@ saved jobs later in this document for more information.
File Formats
^^^^^^^^^^^^
You can import data in one of these file formats: delimited text,
SequenceFiles, Avro and Parquet.
Delimited text is the default import format. You can also specify it
explicitly by using the +\--as-textfile+ argument. This argument will write
@@ -444,6 +445,48 @@ argument, or specify any Hadoop compression codec using the
+\--compression-codec+ argument. This applies to SequenceFile, text,
and Avro files.
Parquet support
+++++++++++++++
Sqoop has two different implementations for importing data in Parquet format:

- Kite Dataset API based implementation (default, legacy)
- Parquet Hadoop API based implementation (recommended)

Users can specify the desired implementation with the +\--parquet-configurator-implementation+ option:
----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES --as-parquetfile --parquet-configurator-implementation kite
----
----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES --as-parquetfile --parquet-configurator-implementation hadoop
----
If the +\--parquet-configurator-implementation+ option is not present, Sqoop will check the value of the +parquetjob.configurator.implementation+
property (which can be specified using -D in the Sqoop command or in the site.xml). If that value is also absent, Sqoop will
default to the Kite Dataset API based implementation.
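For example, the same selection could be made through the Hadoop property instead of the command line option (a sketch reusing the property name above; the connection string and table are placeholders):
----
$ sqoop import -Dparquetjob.configurator.implementation=hadoop \
    --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES --as-parquetfile
----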
The Kite Dataset API based implementation executes the import command on a different code
path than the text import: it creates the Hive table based on the generated Avro schema by connecting to the Hive metastore.
This can be a disadvantage, since moving from the text file format to the Parquet file format can sometimes lead to many
unexpected behavioral changes. Kite checks the Hive table schema before importing the data into it, so if the user tries
to import data whose schema is incompatible with the Hive table's schema, Sqoop will throw an error. This implementation
uses the snappy codec for compression by default and, apart from this, supports the bzip codec too.
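A possible way to request the bzip codec with the Kite implementation is to pass the Hadoop codec class explicitly (an illustrative sketch, assuming the BZip2 codec is available on the cluster):
----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES --as-parquetfile \
    --parquet-configurator-implementation kite \
    --compression-codec org.apache.hadoop.io.compress.BZip2Codec
----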
The Parquet Hadoop API based implementation builds the Hive CREATE TABLE statement and executes the
LOAD DATA INPATH command just like the text import does. Unlike Kite, it also supports connecting to HiveServer2 (using the +\--hs2-url+ option),
so it provides better security features. This implementation does not check the Hive table's schema before importing, so
it is possible that the user successfully imports data into Hive but then gets an error during a later Hive read operation.
It does not use any compression by default but supports the snappy and bzip codecs.
The example below demonstrates how to use Sqoop to import into Hive in Parquet format using HiveServer2 and the snappy codec:
----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES --hive-import --as-parquetfile --compression-codec snappy \
    --parquet-configurator-implementation hadoop --hs2-url "jdbc:hive2://hs2.foo.com:10000" --hs2-keytab "/path/to/keytab"
----
Enabling Logical Types in Avro and Parquet import for numbers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^