5
0
mirror of https://github.com/apache/sqoop.git synced 2025-05-10 22:00:16 +08:00
sqoop/doc/direct.txt
Andrew Bayer ec8f687d97 MAPREDUCE-1169. Improvements to mysqldump use in Sqoop. Contributed by Aaron Kimball.
From: Thomas White <tomwhite@apache.org>

git-svn-id: https://svn.apache.org/repos/asf/incubator/sqoop/trunk@1149840 13f79535-47bb-0310-9956-ffa450edef68
2011-07-22 20:03:28 +00:00

78 lines
3.3 KiB
Plaintext

////
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////
Direct-mode Imports
-------------------
While the JDBC-based import method used by Sqoop provides it with the
ability to read from a variety of databases using a generic driver, it
is not the most high-performance method available. Sqoop can read from
certain database systems faster by using their built-in export tools.
For example, Sqoop can read from a MySQL database by using the +mysqldump+
tool distributed with MySQL. You can take advantage of this faster
import method by running Sqoop with the +--direct+ argument. This
combined with a connect string that begins with +jdbc:mysql://+ will
inform Sqoop that it should select the faster access method.
If your delimiters exactly match the delimiters used by +mysqldump+,
then Sqoop will use a fast-path that copies the data directly from
+mysqldump+'s output into HDFS. Otherwise, Sqoop will parse +mysqldump+'s
output into fields and transcode them into the user-specified delimiter set.
This incurs additional processing, so performance may suffer.
For convenience, the +--mysql-delimiters+
argument will set all the output delimiters to be consistent with
+mysqldump+'s format.
Sqoop also provides a direct-mode backend for PostgreSQL that uses the
+COPY TO STDOUT+ protocol from +psql+. No specific delimiter set provides
better performance; Sqoop will forward delimiter control arguments to
+psql+.
The "Supported Databases" section provides a full list of database vendors
which have direct-mode support from Sqoop.
When writing to HDFS, direct mode will open a single output file to receive
the results of the import. You can instruct Sqoop to use multiple output
files by using the +--direct-split-size+ argument which takes a size in
bytes. Sqoop will generate files of approximately this size. e.g.,
+--direct-split-size 1000000+ will generate files of approximately 1 MB
each. If compressing the HDFS files with +--compress+, this will allow
subsequent MapReduce programs to use multiple mappers across your data
in parallel.
Tool-specific arguments
~~~~~~~~~~~~~~~~~~~~~~~
Sqoop will generate a set of command-line arguments with which it invokes
the underlying direct-mode tool (e.g., mysqldump). You can specify additional
arguments which should be passed to the tool by passing them to Sqoop
after a single '+-+' argument. e.g.:
----
$ sqoop --connect jdbc:mysql://localhost/db --table foo --direct - --lock-tables
----
The +--lock-tables+ argument (and anything else to the right of the +-+ argument)
will be passed directly to mysqldump.