diff --git a/src/docs/sip/INDEX.txt b/src/docs/sip/INDEX.txt new file mode 100644 index 00000000..887ff6ef --- /dev/null +++ b/src/docs/sip/INDEX.txt @@ -0,0 +1,24 @@ + + Licensed to Cloudera, Inc. under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + Cloudera, Inc. licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. + + +This is the index of all accepted SIPs: + +* SIP-1 – Providing multiple entry-points into Sqoop +* SIP-2 – Sqoop 1.0 release criteria and maintenance policy +* SIP-3 – File format for large object (LOB) storage +* SIP-4 – Public API for Sqoop v1.0.0 + diff --git a/src/docs/sip/README.txt b/src/docs/sip/README.txt new file mode 100644 index 00000000..cf26f510 --- /dev/null +++ b/src/docs/sip/README.txt @@ -0,0 +1,30 @@ + + Licensed to Cloudera, Inc. under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + Cloudera, Inc. licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. + + +This directory contains the archive of accepted Sqoop Improvement Proposals +(SIPs). An SIP describes a proposed modification to Sqoop in technical detail. +An accepted SIP is a design document that has been integrated into Sqoop. +This directory serves as a technical reference for Sqoop developers to +understand previous design decisions. + +The SIP home is on the Sqoop development wiki at: +http://wiki.github.com/cloudera/sqoop/sqoopimprovementproposals + +The wiki SIP home contains all SIPs, including those in the "proposal", +"implementation available", or "rejected" states. + + diff --git a/src/docs/sip/sip-1.txt b/src/docs/sip/sip-1.txt new file mode 100644 index 00000000..a9371161 --- /dev/null +++ b/src/docs/sip/sip-1.txt @@ -0,0 +1,134 @@ +== == + +|SIP | 1 | +|Title | Providing multiple entry-points into Sqoop | +|Author | Aaron Kimball (aaron at cloudera dot com) | +|Created | April 16, 2010 | +|Status | Accepted | +|Discussion | "http://github.com/cloudera/sqoop/issues/issue/7":http://github.com/cloudera/sqoop/issues/issue/7 | +|Implementation | "kimballa/sqoop@faf04bc4":http://github.com/kimballa/sqoop/commit/faf04bc472426f47f9a77febdc6ca0439bda90cc | + +h2. Abstract + +This SIP proposes breaking the monolithic "sqoop" program entry point into a number of separate programs (e.g., sqoop-import and sqoop-export), to provide users with a clearer set of arguments that affect each operation.
+ +h2. Problem statement + +The single "sqoop" program takes a large number of arguments which control all aspects of its operation. Its functionality has now grown to encompass several different operations. This single program can: + +* generate code +* import from an RDBMS to HDFS +* export from HDFS to an RDBMS +* manipulate Hive metadata and interact with HDFS files +* query databases for metadata (e.g., @--list-tables@) + +As the number of different operations supported by Sqoop grows, additional arguments will be required to handle each of these new operations. Furthermore, as each operation provides the user with more fine-grained control, the number of operation-specific arguments will increase. It is growing difficult to know which arguments one should set to effect a particular operation. + +Therefore, I propose that we break out operations into separate program entry points (e.g., sqoop-import, sqoop-export, sqoop-codegen, etc). + +h2. Specification + +h3. User-facing changes + +An abstract class called @SqoopTool@ represents a single operation that can be performed. It has a @run()@ method that returns a status integer, where 0 is success and non-zero is a failure code. A static, global mapping from strings to @SqoopTool@ instances determines what "subprogram" to run. + +The main @bin/sqoop@ program will accept as its first positional argument the name of the subprogram to run. e.g., @bin/sqoop import --table ... --connect ...@. This will dispatch to the @SqoopTool@ implementation bound to the string @"import"@. + +A @help@ subprogram will list all available tools. + +For convenience, users will be provided with git-style @bin/sqoop-tool@ programs (e.g., @bin/sqoop-import@, @bin/sqoop-export@, etc). + +h3. API changes + +Currently the entire set of arguments passed to Sqoop is manually processed in the @SqoopOptions@ class. This class also validates the arguments, prints the usage message for Sqoop, and stores the argument values. + +Different subprograms will require different (but overlapping) sets of arguments. Rather than create several different @FooOptions@ classes which redundantly store argument values, the single @SqoopOptions@ should remain as the store for all program initialization state. However, configuration of the @SqoopOptions@ will be left to the @SqoopTool@ implementations. Each @SqoopTool@ should expose a @configureOptions()@ method which configures a Commons-CLI @Options@ object to retrieve arguments from the command line. It then generates the appropriate @SqoopOptions@ based on the actual arguments in its @applyOptions()@ method. Finally, these values are validated in the @validateOptions()@ method; mutually exclusive arguments, arguments with out-of-range values, etc. are signalled here. + +Common methods from an abstract base class (@BaseSqoopTool@) will configure, apply, and verify all options which are common to all Sqoop programs (e.g., the connect string, username, and password for the database). + + +h3. Clarifications + +* The current Sqoop program may perform multiple operations in series. e.g., the default control-flow will both generate code and perform an import. By splitting Sqoop into multiple subprograms, it should still be possible for one subprogram to invoke another. +* Options are also currently extracted using Hadoop's @GenericOptionsParser@ as applied through @ToolRunner@. Sqoop will still use @ToolRunner@ to process all generic options before yielding to the internal options parsing logic.
Due to subtleties in how @ToolRunner@ interacts with some argument parsing, it is recommended to run Sqoop through the @public static int Sqoop.runSqoop(Sqoop sqoop, String [] argv)@ method which captures some arguments which @ToolRunner@ and @GenericOptionsParser@ would inadvertently strip out (specifically, the @"--"@ argument which separates normal arguments from subprogram arguments would be lost). @runSqoop()@ then uses @ToolRunner.run()@ in the usual way. If subprogram arguments are not required, then @ToolRunner@ can be used directly. + + +h3. Listings + +The proposed SqoopTool API can be summarized thusly: + +bc.. import org.apache.hadoop.sqoop.cli.ToolOptions; + +public abstract class SqoopTool { + + /** Main body of code to run the tool. */ + public abstract int run(SqoopOptions options); + + /** Configure the command-line arguments we expect to receive. + * ToolOptions is a wrapper around org.apache.commons.cli.Options + * that contains some additional information about option families. + */ + public void configureOptions(ToolOptions opts); + + /** Generate the SqoopOptions containing actual argument values from + * the Options extracted into a CommandLine. + * @param in the CLI CommandLine that contain the user's arguments. + * @param out the SqoopOptions with all fields applied. + */ + public void applyOptions(CommandLine in, SqoopOptions out); + + /** Print a usage message for this tool to the console. + * @param options the configured options for this tool. + */ + public void printHelp(ToolOptions options); + + /** + * Validates options and ensures that any required options are + * present and that any mutually-exclusive options are not selected. + * @throws InvalidOptionsException if there's a problem. + */ + public void validateOptions(SqoopOptions options) + throws InvalidOptionsException; +} + +p. This uses the Apache Commons CLI argument parsing library. The plan is to use v1.2 (the current stable release), as this is already included with and used by Hadoop. + +h2. Compatibility Issues + +This is an incompatible change. Existing program arguments to Sqoop will cease to function as expected. For example, we currently run imports and exports like this: + +bc.. bin/sqoop --connect jdbc://someserver/db --table sometable # import +bin/sqoop --connect jdbc://someserver/db --table othertable --export-dir /otherdata # export + +p. Both of these commands will return an error after this change, because we have not specified the subprogram to run. After the conversion, these operations would be effected with: + +bc.. bin/sqoop import --connect jdbc://someserver/db --table sometable +bin/sqoop export --connect jdbc://someserver/db --table othertable --export-dir /otherdata + +p. As Sqoop has not had a formal "release," I believe this incompatibility to be acceptable. As this proposes to restructure the argument-parsing semantics of Sqoop, I do not believe it would be in the best interest of the project to maintain the existing argument structure for backwards-compatibility reasons. + +Unversioned releases of Sqoop have been bundled with releases of Cloudera's Distribution for Hadoop. The version of Sqoop packaged in the still-beta CDH3 would be affected by this change. The version of Sqoop packaged in the now-stable CDH2 would not see this backported. + +h2. Test Plan + +A large number of the presently-available unit tests are end-to-end tests of a Sqoop subprogram; they actually pass an array of arguments to the @Sqoop.main()@ method to control its execution. 
These existing tests cover all functionality. The functionality of Sqoop is not expected to change as a result of this refactoring. Therefore, the existing tests will have their argument lists updated to the new argument sets; when all tests pass, then all currently-available functionality should be preserved. + +h2. Discussion + +Please provide feedback and comments at "http://github.com/cloudera/sqoop/issues/issue/7":http://github.com/cloudera/sqoop/issues/issue/7 + diff --git a/src/docs/sip/sip-2.txt b/src/docs/sip/sip-2.txt new file mode 100644 index 00000000..3b881548 --- /dev/null +++ b/src/docs/sip/sip-2.txt @@ -0,0 +1,141 @@ +== == + +|SIP | 2 | +|Title | Sqoop 1.0 release criteria and maintenance policy | +|Author | Aaron Kimball (aaron at cloudera dot com) | +|Created | April 29, 2010 | +|Status | Accepted | +|Discussion| "http://github.com/cloudera/sqoop/issues/issue/9":http://github.com/cloudera/sqoop/issues/issue/9 | + + +h2. Abstract + +This SIP describes a proposal for creating the first officially-tagged release of Sqoop. This outlines the remaining features which would need to be implemented before creating the release, as well as the version maintenance policy to be adopted moving forward. + +h2. Problem statement + +Sqoop has been provided to users through ad-hoc releases with editions of Cloudera's Distribution for Hadoop. But no version of Sqoop available to date has been deemed a canonical Sqoop release. This SIP proposes to answer the questions of: + +* What constitutes the first release. +* What support should be provided for this release going forward. +* What release policy should be adopted for subsequent releases. + +The specification below addresses each of these issues in turn. + +h2. Specification + +h3. Sqoop 1.0.0 Release + +The first Sqoop release, Sqoop 1.0.0, will include all features currently available in Sqoop. In addition, the following new features should be added as well: + +* The command-line API refactoring (proposed in [[SIP-1]]) +* A version information command +* Better support for exports of large volumes of data (with intermediate checkpointing) +* A file format for large object storage (proposed in [[SIP-3]]) +* A backwards-compatible public API (proposed in [[SIP-4]]) + +The version information command should be straightforward to implement. Improved export support will be performed by adding an OutputFormat that uses batch @INSERT@ statements and incremental spills. For the file format and API refactoring, separate improvement proposals will be filed. + +h3. Release Support + +This release will be included in CDH3, Cloudera's Distribution for Hadoop and marked as Sqoop 1.0.0. Subsequent releases of Sqoop with a _1.y.z_ number should remain API-compatible with Sqoop 1.0.0. + +*API compatibility* is defined as the following: + +* The command-line API will provide at least the same degree of functionality: command-line arguments will not be removed in the 1.0 line. (1.y releases may deprecate arguments as a message that they will be removed in a 2.0 release, but these deprecated arguments will remain present in all 1.y releases.) +** New arguments with additional functionality may be added in 1.y releases. +* Any code generated by Sqoop 1.0 will link and interoperate with any Sqoop 1.y library. +** More generally, code generated by Sqoop 1.x will link and interoperate with any Sqoop 1.y so long as x <= y. +* No internal APIs are currently considered stable. *Internal APIs may change* between releases in the 1.y series. 
This includes all programmatic APIs except those declared public in [[SIP-4]]. (e.g., the @public@ members of the @org.apache.hadoop.sqoop.lib@ package) + +Code generated by Sqoop is used to interpret records materialized to HDFS. This brings up the issue of data compatibility. + +The following *data compatibility* guarantees will be provided: + +* For data imported to HDFS by Sqoop 1.0, code generated by Sqoop 1.0 will be able to interpret this data in the same way, when used in conjunction with any subsequent Sqoop 1.y library. + +h3. Release Policy + +Sqoop should endeavor to provide releases as frequently as is reasonable. Bugfixes should be efficiently distributed to clients of Sqoop in a timely fashion. New features should be provided incrementally to elicit feedback and demonstrate forward progress. + +h4. Terminology + +The following terminology is used throughout this section: + +* Major version: the first digit in a version number. e.g., "1.3.7" has a major version of "1". +* Minor version: the first two digits in a version number. e.g., "1.3.7" has a minor version of "1.3". +* Major series: All versions with the same major version. e.g., "1.2.0" and "1.4.7" are in the same major series (series "1"). +* Minor series: All versions with the same minor version. e.g., "1.2.0" and "1.2.3" are in the same minor series (series "1.2"). +* Bugfix release: a fully specified version. e.g., 1.2.3. +* End-of-life: A major or minor series has reached end-of-life when no new bugfixes will be provided for it. Users are expected to immediately upgrade to the next major or minor series as appropriate. +* End-of-development: a major series has reached end-of-development when no new minor versions are planned in its series. All subsequent feature development will occur on the next major version. Bugfixes will still be provided on the last minor version in a major series that has reached end-of-development, until that major series later reaches end-of-life. + +h4. Bugfixes + +For any version @x.y.z@, the next bugfix release @x.y.(z+1)@ should be distributed when the following criteria are all met: + +* A sufficient number of bugfixes have been provided on top of @x.y.z@, or a bugfix of sufficient criticality has been provided. "Sufficient" shall be determined on a case-by-case basis. +* A minimum amount of "soak time" for the previous release has been provided. This should be a period of at least two weeks, to ensure that a "z+2" release does not need to be provided the day after a "z+1" release, except in fatally crippled cases (e.g., z+1 proves impossible to compile or install, or introduces a data loss bug, etc). +* A "y+2" release is not yet available. e.g., the 1.0.z minor series will automatically be considered end-of-life when Sqoop 1.2.0 is released. 1.0.z bugfix releases will be provided so long as the 1.1.z series remains the most current minor series. The last 1.y minor series will receive bugfix releases until end-of-life is proposed for the Sqoop 1 major series. This will occur sufficiently far in the future that Sqoop 2.y is considered stable. +** While end-of-life for a minor series is automatic when two new minor series are available that supersede it, end-of-life for a major series will be performed only by SIP. A major series will not reach end-of-life until the subsequent major series is ready for immediate migration by all clients. End-of-life for a major series will not take place without ample notice to allow clients to develop a migration plan. +* No four-digit versions will be provided. When a new bugfix release is provided in a given minor series, the previous bugfix releases in that minor series are considered obsolete. Users are expected to run the most current bugfix release in a given minor series.
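To make the versioning terminology above concrete, here is a small, purely illustrative Java sketch (a hypothetical helper, not something proposed for inclusion in Sqoop) that parses a version string into the major, minor, and bugfix components defined in the Terminology section:

bc.. /** Hypothetical helper illustrating the SIP-2 version terminology; not Sqoop code. */
public final class VersionSketch {
  public final int major;   // "1" in "1.3.7"
  public final int minor;   // "3" in "1.3.7"
  public final int bugfix;  // "7" in "1.3.7"

  public VersionSketch(String version) {
    // Expects a fully specified bugfix release such as "1.2.3".
    String[] parts = version.split("\\.");
    this.major = Integer.parseInt(parts[0]);
    this.minor = Integer.parseInt(parts[1]);
    this.bugfix = Integer.parseInt(parts[2]);
  }

  /** Same minor series: major and minor versions both match. */
  public boolean sameMinorSeries(VersionSketch other) {
    return major == other.major && minor == other.minor;
  }

  /** Same major series: major versions match. */
  public boolean sameMajorSeries(VersionSketch other) {
    return major == other.major;
  }
}

p. For example, @new VersionSketch("1.2.0").sameMinorSeries(new VersionSketch("1.2.3"))@ evaluates to true, matching the "1.2" minor series example above.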
h4. New features + +New features will be provided only in a new major or minor series. Bugfix releases do not include new features. + +* New features that do not represent backwards-incompatible changes will be provided in a new minor version. e.g., Sqoop 1.1 will contain new features not present in Sqoop 1.0. +* A new feature release will include all bugfixes present in the previous minor release's latest bugfix release. e.g., Sqoop 1.1 will include all bugfixes present in a hypothetical Sqoop 1.0.4 release. +* Forwards compatibility is not guaranteed. Sqoop 1.1 may generate code or import data in a fashion incompatible with Sqoop 1.0-based tools. Documentation will be included to highlight these differences. +* No new features in a minor release will result in backwards incompatibility. A feature that is fundamentally backwards-incompatible with Sqoop 1.0 will be included (at the earliest) in Sqoop 2.0. +** When a sufficient number of incompatible changes have accrued, the 1.0 line will be terminated as end-of-life. A change which significantly prohibits the ability of code to be backported in an engineer-efficient fashion will be taken as "sufficient." This will not occur without a SIP announcing this intent and proposing a date for the end-of-life to take effect. + +h4. Deprecation of features + +* Features, command-line arguments, API mechanisms, etc. which should eventually be removed will be preserved for a given major version. Any command line arguments or public APIs present in Sqoop 1.0 will remain present in Sqoop 1.1, 1.2, etc. These may disappear in Sqoop 2.0. +** When deprecation is certain for a feature, API, etc., then any subsequently released minor series in the current major series will mark them as such. e.g., if @--some-flag@ is to be deprecated in Sqoop 2.0, and the current minor series is 1.2, then Sqoop 1.3, 1.4, etc. will post warnings about the deprecated nature of this flag when it is used. +* The introduction of a sufficient number of deprecations will be used as grounds for end-of-development for the current major version. End-of-development for a major series will be preceded by a SIP announcing this intent. +* No features will be removed without a deprecation marker in at least one minor version. The final minor version on a major series may be identical to its predecessor with the exception of the introduction of deprecation flags publicizing all intended deprecations. + +h2. Compatibility Issues + +h3. Hadoop versions + +Sqoop 1.0 is, and intends to remain, compatible with the trunk of the Apache Hadoop project. Currently, Sqoop depends on no outstanding patches of Apache Hadoop; therefore it will likely be compatible with the planned Hadoop 0.21 release. Sqoop 1.0 should be compatible with future releases of Hadoop (e.g., 0.22) subject to Hadoop's API deprecation policy. Subsequent editions of Sqoop should target 0.21-based Apache Hadoop. + +Sqoop 1.0 is currently intended to be compatible with the beta release of CDH3 available at the Sqoop 1.0 ship date. Sqoop is currently tested with the (unreleased) CDH 3 beta 2. Sqoop 1.0 will not be compatible with the currently-released CDH 3 beta 1.
Nightly snapshots of CDH3b2 are available from Cloudera's maven repository at "https://repository.cloudera.com/nexus":https://repository.cloudera.com/nexus + +If Sqoop uncovers bugs in Hadoop's database interface APIs (or other aspects of Hadoop), then later versions of Hadoop (e.g., 0.21.1) may be required to provide full correctness. + +h3. Prior editions of Sqoop + +Sqoop 1.0 *will not* be API-compatible with unversioned editions of Sqoop present in CDH3 beta 1 or CDH2. These Sqoop releases did not have a version number and are not subject to the end-of-development/end-of-life policy articulated in this document. Upon release of Sqoop 1.0, all prior unversioned Sqoop editions in CDH3 (beta) will be immediately declared end-of-life. All prior unversioned Sqoop editions in CDH2 are already implicitly in the end-of-development phase. Bugfix patches to CDH2-based Sqoop releases will be provided with CDH2 updates in accordance with Cloudera's update policy. + +The code generated by unversioned releases of Sqoop will be incompatible with Sqoop 1.0.0. Sqoop 1.0.0 will generate code that interprets data in a compatible fashion to existing generated code created by unversioned releases of Sqoop. This guarantee does not extend to future releases in the 1.x major series. + +h2. Test Plan + +Current Sqoop functional and unit tests are executed daily against CDH3 beta 2 and Apache Hadoop (development trunk). These tests will be run on any release candidate and will be required to pass. A release candidate will be made available to the community before tagging the release as official. + +When Apache Hadoop makes a branch for version 0.21, tests will be run against that platform as well. + +h2. Discussion + +Please provide feedback and comments at "http://github.com/cloudera/sqoop/issues/issue/9":http://github.com/cloudera/sqoop/issues/issue/9 + diff --git a/src/docs/sip/sip-3.txt b/src/docs/sip/sip-3.txt new file mode 100644 index 00000000..867b9107 --- /dev/null +++ b/src/docs/sip/sip-3.txt @@ -0,0 +1,368 @@ +== == + +|SIP | 3 | +|Title | File format for large object (LOB) storage | +|Author | Aaron Kimball (aaron at cloudera dot com) | +|Created | May 5, 2010 | +|Status | Accepted | +|Discussion | "http://github.com/cloudera/sqoop/issues/issue/11":http://github.com/cloudera/sqoop/issues/issue/11 | +|Implementation| "kimballa/sqoop@c385de1d":http://github.com/kimballa/sqoop/commit/c385de1d17b522e486c63ea1e77c972a90092323 | + +h2. Abstract + +This is a proposal for a file format for the storage of large objects. This describes the file format itself, its reader/writer API, and how it would be integrated with the primary record storage. + +h2. Problem statement + +Large objects (columns of type CLOB and BLOB) can be very large; often larger than can be reasonably materialized completely in-memory. The main mechanisms Hadoop makes available to access records depend on them being fully materialized in memory, even if their contents are fully or partially ignored. Large objects in databases are often stored indirectly; a record locator is used to manipulate the record, but its contents are accessed lazily (e.g., through an InputStream) when desired. + +Sqoop has recently added a mechanism that allows indirect storage of large objects, but lacks the ability to efficiently store these large objects in consolidated files. + +This proposal outlines: + +* A file format for storage of large records. Individual records are stored as uninterpreted byte or character streams, and are accessed lazily. 
Iterating through the records is an inexpensive operation; users can skip records without needing to deserialize them. +* The API through which these files are accessed. Large object files can be manipulated directly (e.g., as a collection of large objects) by Java programs operating on individual files, or through MapReduce programs. +* The manner of integration between these large object storage files and the regular record storage. + +h2. Specification + +h3. File format requirements + +* Must support very large objects (at least several gigabytes in length). +* The exact lengths of the objects are not known ahead of time. +* Users must be able to partially read an object and then efficiently transition to the next object without reading the entire object they opened. +* Should support compression. The objects are assumed to be very large, so a per-record compression system is acceptable. +* Should support splitting for use as an InputFormat to MapReduce. +* Individual records must be addressable by byte offset or another marker easily computed at write time. + +h3. Anti-requirements + + +* Data can be restricted to uninterpreted byte and character streams. Further typing is unnecessary. +* Does not require (key, value) pairs; values alone are sufficient. + +h3. Data model + +A LobFile is an unordered collection of values with integer record id keys. Values are character arrays or byte arrays, with an arbitrary length. This length may be several gigabytes. Individual values are not expected to be fully materializable in memory at a point in time. Users will lazily consume data from values. Zero-length values are allowed. + +An arbitrary (but assumed small) amount of user-specified metadata may be included in the file. Some metadata elements are well-defined and are used as parameters to the specific encoding of values in the file. Other elements are left to the user to define and interpret. The metadata is assumed to be fully materializable. + +h3. Format specification + +The LobFile format includes an arbitrary number of variable-length records. The start of each record is demarcated by a RecordStartMark, a 16-byte string that is unique on a per-file basis. Following the RecordStartMark is a non-negative integer, indicating that the record is actually a user data record, or a negative integer, which describes one of several internal record formats. + +The lengths of the internal records are usually encoded following the +record type id. The lengths of the user's data records are stored in an +index at the end of the file. + +The LobFile format is as follows: + +bc.. LobFile ::= LobHeader LobRecord* LobIndex Finale + +LobHeader ::= "LOB" Integer(versionNum) RecordStartMark MetaBlock +RecordStartMark ::= Byte[16] +MetaBlock ::= Integer(numEntries) MetaEntry* +MetaEntry ::= Utf8String(key) BytesWritable(val) + +LobRecord ::= RecordStartMark Long(entryId) Long(claimedLen) Byte[*](data) + +LobIndex ::= IndexSegment* IndexTable +IndexSegment ::= RecordStartMark Long(-1) Long(segmentLen) + Long[entriesPerSegment](recordLen) + +IndexTable ::= RecordStartMark Long(-3) + Int(tableCount) IndexTableEntry* +IndexTableEntry ::= Long(segmentOffset) Long(firstIndexId) + Long(firstIndexOffset) Long(lastIndexOffset) + +Finale ::= RecordStartMark Long(-2) Long(indexStart) + +p. Listing of LobFile format expressed as a context-free grammar. + + +h4. Data serialization + +Values in the grammar above with type @Utf8String@ are UTF-8 encoded length-prefixed strings. In Java, this is performed using the Hadoop @Text.writeString(DataOutput, String)@ method, which writes the length of the string (in bytes) as a VInt followed by the UTF-8 encoded byte array itself. + +@Integer@ values are signed 32-bit quantities written with VInt encoding. In Java, this is performed using the Hadoop @WritableUtils.writeVInt(DataOutput, int)@ method. For values @-120 <= i <= 127@, the actual value is directly encoded as one byte. For other values of @i@, the first byte value indicates whether the integer is positive or negative, and the number of bytes that follow. If the first byte value v is between -121 and -124, the following integer is positive, and the number of bytes that follow is -(v+120). If the first byte value v is between -125 and -128, the following integer is negative, and the number of bytes that follow is -(v+124). Bytes are stored in high-non-zero-byte-first order. + +@Long@ values are signed 64-bit quantities written with VLong encoding. In Java, this is performed using the Hadoop @WritableUtils.writeVLong(DataOutput, long)@ method. For @-112 <= i <= 127@, only one byte is used, holding the actual value. For other values of @i@, the first byte value indicates whether the long is positive or negative, and the number of bytes that follow. If the first byte value v is between -113 and -120, the following long is positive, and the number of bytes that follow is -(v+112). If the first byte value v is between -121 and -128, the following long is negative, and the number of bytes that follow is -(v+120). Bytes are stored in high-non-zero-byte-first order. + +The MetaEntry array is a mapping from strings (length-prefixed UTF-8 encoded, as above) to @BytesWritable@ values, which are length-prefixed byte arrays. These byte arrays are written using instances of the @org.apache.hadoop.io.BytesWritable@ serializable class. This class writes the length prefix as a two's-complement 32-bit value in big-endian order (see "the DataOutput Javadoc":http://java.sun.com/javase/6/docs/api/java/io/DataOutput.html#writeInt(int) for the specific formula), followed by the byte array itself.
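The following minimal sketch (illustrative only; it is not part of this SIP or of the LobFile implementation) shows the Hadoop helpers named above being used to write each of the primitive value types from the grammar to a byte stream. It assumes only that Hadoop's @Text@, @WritableUtils@, and @BytesWritable@ classes are on the classpath:

bc.. import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableUtils;

/** Illustrative sketch of the primitive encodings used by the LobFile grammar. */
public class LobPrimitivesSketch {
  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(buf);

    // Utf8String: VInt byte-length prefix followed by the UTF-8 bytes.
    Text.writeString(out, "EntryEncoding");

    // Integer: VInt-encoded 32-bit value (e.g., numEntries in the MetaBlock).
    WritableUtils.writeVInt(out, 3);

    // Long: VLong-encoded 64-bit value (e.g., entryId or claimedLen).
    WritableUtils.writeVLong(out, 123456789012345L);

    // BytesWritable: 32-bit big-endian length prefix followed by the raw bytes.
    new BytesWritable("BLOB".getBytes("UTF-8")).write(out);

    out.flush();
    System.out.println("Wrote " + buf.size() + " bytes");
  }
}

p. These are the same @Text.writeString()@, @WritableUtils.writeVInt()@/@writeVLong()@, and @BytesWritable.write()@ calls referenced in the serialization description above; readers use the corresponding @Text.readString()@, @readVInt()@, @readVLong()@, and @BytesWritable.readFields()@ methods.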
h4. Header + +The header defines the basic properties of the file. + +The versionNum must currently be 0. Different values of versionNum imply other formats related to this one but not yet defined. + +The RecordStartMark is a randomly-chosen array of 16 bytes. It should be different on a per-file basis. It appears once in the header to define the RecordStartMark for the file, and then once before each actual data record. This allows clients to seek to the beginning of an arbitrary record. + +The MetaBlock contains a set of arbitrary (key, value) pairs. +Some of these key-value pairs are well-defined: + +* "EntryEncoding" -- should be the UTF-8 encoded string "CLOB" or "BLOB". BLOB is assumed if missing. +* "CompressionCodec" -- if present, should be the UTF-8 encoded name of the codec to use to decompress each entry. The compressor is reset between each record (they are encoded independently). Only the data byte array is encoded. +* "EntriesPerSegment" -- should be the VInt-encoded number of length values in a given IndexSegment. All IndexSegments (except the final IndexSegment in a file) should contain exactly this many entries. This metadata entry is required. + +Files with an EntryEncoding of "CLOB" should provide a character-based access mechanism (e.g., @java.io.Reader@) to records, but may be used in a byte-based fashion (e.g., @java.io.InputStream@). + +h4.
Data records + +Following the header are the user's data records. These records have sequentially increasing 'entryId' fields; the first record has entryId=0. These entryIds refer to the offset into the LobIndex. + +The claimedLen field for a record refers to the length to advertise to consumers of a record. It does not strictly specify the amount of data held in the file. For character-based records, it may refer to the length of the record in characters, not bytes. + +Following the per-record header information of entryId and claimedLen is the data byte array itself. This may be compressed on a per-record basis according to the CompressionCodec specified in the MetaBlock. + +The LobRecords are variable length. Their lengths may not be known ahead of time due to the use of compression. Their true in-file lengths are recorded in the LobIndex. + +h4. Index records + +The LobIndex is written to the end of the file. It contains an arbitrary number of IndexSegment records. Each IndexSegment begins with the VLong-encoded value -1 (to distinguish it from a LobRecord), and contains an array of record lengths. The LobIndex is a complete index of all lengths, and they run sequentially. i.e., the first IndexSegment may contain the lengths of records 0..4095. The next IndexSegment (usually immediately adjacent in the file, but not technically required) contains the lengths of records 4096..8191. + +The segmentLen field in an IndexSegment captures the number of bytes required to write the recordLen array. + +It then provides an array of recordLen values, which correspond to the true length of the entire LobRecord. This includes the RecordStartMark, the lengths of the VLong-encoded entryId and claimedLen fields, and the true +compressed length of the data byte array. An entry can be expediently retrieved by index using this mechanism. + +Following the IndexSegment array is the IndexTable record. This is a higher-order index used for seeking. IndexSegments may be read lazily out of the file as the reader requires their data. The IndexTable is always held completely in memory. The IndexTable begins with a RecordStartMark and a record type id of -3. It then encodes its own length. + +The IndexTable is an array of table entries. tableCount is the number of entries in the IndexTableEntry array. + +Each IndexTableEntry represents one IndexSegment. It contains the offset of the IndexSegment in the file, the first entryId whose length is present in the segment, and the first and last offsets of records indexed by the segment. This way, when seeking to an arbitrary offset in the file, one can scan through the IndexTable to find the correct IndexSegment, then seek directly to the correct IndexSegment, read a relatively small amount of data and determine the length of and absolute offset to the first record following the user's requested seek target. + +h4. Finale + +The Finale is always at the very end of the file. It contains the RecordStartMark, the record type id -2, and the offset in the file of the start of the IndexTable. + +h4. Recovery semantics + +The biggest vulnerability of these files is that the index is written to the end. Thus, an interrupted writer may close the file before writing the index data. Nevertheless, the data is still recoverable. + +Since the records are all demarcated by RecordStartMarks, then the data can be extracted from a truncated file by scanning forward and reading the data sequentially. + +The IndexSegments are complete indices of all the lengths of all the records. 
So by scanning forward through the IndexSegments, the IndexTable can be rebuilt if truncated. + +Since the IndexSegments and IndexTable are led by their record type ids of -1 and -3, which cannot be confused for regular entryIds, the locations of these records can be found by linear scanning, in case the finale was not written to the file. + + +h4. Compression + +LobFiles should support a variety of compression codecs. To support different codecs in a language-neutral manner, the LobFile defines support for a number of codecs, specified by strings. Each codec name is bound to a concrete implementation in Java. + +A user specifies that a file is compressed with a particular codec by specifying the @"CompressionCodec"@ key in the MetaBlock. The value associated with this key must be a string identifying the codec name. + +The following table describes the codecs that may be used: + +|_.name |_.Java implementation class| +|none | _(none)_ | +|deflate | org.apache.hadoop.io.compress.DefaultCodec | +|lzo | com.hadoop.compression.lzo.LzoCodec | + +If the @"CompressionCodec"@ key is not specified, then it is assumed that it has the value @"none"@; this implies that user data is written as an uninterpreted byte array directly to the file. + +The @"deflate"@ codec uses the deflate algorithm (specified in "RFC 1951":http://www.ietf.org/rfc/rfc1951.txt) and can be implemented using the zlib library. Hadoop's @DefaultCodec@ will use zlib if native bindings are installed, or a compatible Java implementation otherwise. + +The @"lzo"@ compression codec can be found in the external GPL hadoop lzo compression library at +"http://code.google.com/p/hadoop-gpl-compression/":http://code.google.com/p/hadoop-gpl-compression/ or "http://github.com/kevinweil/hadoop-lzo":http://github.com/kevinweil/hadoop-lzo. This must be separately installed. + +h3. File API + + +h4. Writing LobFiles + +LobFiles are created through a call to a method @LobFile.create()@, which allows the actual writer class to be selected dynamically. This will return an instance of type @LobFile.Writer@ with the following public API: + + +bc.. /** + * Class that writes out a LobFile. Instantiate via LobFile.create(). + */ + public static abstract class Writer implements Closeable { + + /** + * If this Writer is writing to a physical LobFile, then this returns + * the file path it is writing to. Otherwise it returns null. + * @return the fully-qualified path being written to by this writer. + */ + public abstract Path getPath(); + + /** + * Finishes writing the LobFile and closes underlying handles. + */ + public abstract void close() throws IOException; + + /** + * Terminates the current record and writes any trailing zero-padding + * required by the specified record size. + * This is implicitly called between consecutive writeBlobRecord() / + * writeClobRecord() calls. + */ + public abstract void finishRecord() throws IOException; + + /** + * Declares a new BLOB record to be written to the file. + * @param len the "claimed" number of bytes that will be present in this + * record. The actual record may contain more or fewer bytes than len. + */ + public abstract OutputStream writeBlobRecord(long len) throws IOException; + + /** + * Declares a new CLOB record to be written to the file. + * @param len the "claimed" number of characters that will be written + * to this record. The actual number of characters may differ. 
+ */ + public abstract java.io.Writer writeClobRecord(long len) + throws IOException; + + /** + * Report the current position in the output file + * @return the number of bytes written through this Writer. + */ + public abstract long tell() throws IOException; + } + +p. Listing of @LobFile.Writer@ API + +h4. Reading LobFiles + +The @LobFile.open()@ method will allow a user to read a LobFile. This will inspect the header of the file and determine the appropriate sub-format (as specified by the versionNum field of the file) and return an appropriate instance of type @LobFile.Reader@ whose API follows: + +bc.. /** + * Class that can read a LobFile. Create with LobFile.open(). + */ + public static abstract class Reader implements Closeable { + /** + * If this Reader is reading from a physical LobFile, then this returns + * the file path it is reading from. Otherwise it returns null. + * @return the fully-qualified path being read by this reader. + */ + public abstract Path getPath(); + + /** + * Report the current position in the file + * @return the current offset from the start of the file in bytes. + */ + public abstract long tell() throws IOException; + + /** + * Move the file pointer to the first available full record beginning at + * position 'pos', relative to the start of the file. After calling + * seek(), you will need to call next() to move to the record itself. + * @param pos the position to seek to or past. + */ + public abstract void seek(long pos) throws IOException; + + /** + * Advances to the next record in the file. + * @return true if another record exists, or false if the + * end of the file has been reached. + */ + public abstract boolean next() throws IOException; + + /** + * @return true if we have aligned the Reader (through a call to next()) + * onto a record. + */ + public abstract boolean isRecordAvailable(); + + /** + * Reports the length of the record to the user. + * If next() has not been called, or seek() has been called without + * a subsequent call to next(), or next() returned false, the return + * value of this method is undefined. + * @return the 'claimedLen' field of the current record. For + * character-based records, this is often in characters, not bytes. + * Records may have more bytes associated with them than are reported + * by this method, but never fewer. + */ + public abstract long getRecordLen(); + + /** + * Return the byte offset at which the current record starts. + * If next() has not been called, or seek() has been called without + * a subsequent call to next(), or next() returned false, the return + * value of this method is undefined. + * @return the byte offset of the beginning of the current record. + */ + public abstract long getRecordOffset(); + + /** + * Return the entryId of the current record to the user. + * If next() has not been called, or seek() has been called without + * a subsequent call to next(), or next() returned false, the return + * value of this method is undefined. + * @return the 'entryId' field of the current record. + */ + public abstract long getRecordId(); + + /** + * @return an InputStream allowing the user to read the next binary + * record from the file. + */ + public abstract InputStream readBlobRecord() throws IOException; + + /** + * @return a java.io.Reader allowing the user to read the next character + * record from the file. + */ + public abstract java.io.Reader readClobRecord() throws IOException; + + /** + * Closes the reader. 
+ */ + public abstract void close() throws IOException; + + /** + * @return true if the Reader.close() method has been called. + */ + public abstract boolean isClosed(); + } + +p. Listing of the LobFile.Reader API + +This API allows users to seek to arbitrary records by their position in the file (retrieved by calling @Writer.tell()@ immediately before calling @writeBlobRecord()@ or @writeClobRecord()@). + +Not yet implemented is efficient access by entryId, although this also should be possible given the current format specification. + +h3. Integration with primary record storage + +Records are ordinarily stored in fully-materialized form either as delimited text or in a binary form using SequenceFiles. + +Large objects that are actually small (e.g., a couple of megabytes; this value is configurable through @--inline-lob-limit@) are stored inline with the main record storage. Large objects that cross this threshold are written to a LobFile that is opened in tandem with the main file being written to by the import process. The large object will be referenced in the main record storage by a string-based record locator which provides the location of the large object itself. + +This locator is a string with the form "@externalLob(lf,filename,offset,len)@". The string @externalLob@ indicates that this is an externally-stored object. The @"lf"@ parameter notes that the large object is stored in a LobFile (other formats, e.g., SequenceFile, may be used in the future). Following this are the filename where the records are stored, the offset in the file where the record begins, and the length of the record. These are used to open the LobFile (through the LobFile.Reader API), seek to the start of the record, and read the data back to the user. The claimed length of the object is provided here so that the file need not be opened to determine the length of the object. The filename in the locator may be a relative path, in which case it is relative to the directory holding the file for the materialized primary record storage. Storing relative paths allows primary record storage files to be relocated along with their external large object storage. + +In either case, objects are accessed through a @LobRef@ which encapsulates the record. The @BlobRef@ type provides an InputStream-based accessor to the record data. The @ClobRef@ type provides a Reader-based accessor. Regardless of whether the object was stored inline with the rest of the record, or externally, the nature of the underlying storage is abstracted from the user. + +The reference classes will lazily open the underlying storage if called upon to do so by the user. Open file handles are cached in the current MapReduce process; opening a second large object stored in the same underlying file does not incur additional file-opening overhead (unless multiple LobRef instances open the same file concurrently). This ensures that records within the same file can be iterated over efficiently. If all records in the same file are accessed by the user sequentially, the user will see sequential performance (as seeks are only necessary every few thousand records to retrieve more of the LobIndex). When a seek is necessary to align on the position of another record, the Reader will heuristically determine whether to seek directly to the new location, or to read and consume the bytes between the current location and the target, to ensure that streaming performance is utilized where available.
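As a usage illustration (a sketch under stated assumptions, not code from the implementation), the following shows how a consumer might resolve an @externalLob(lf,filename,offset,len)@ locator by hand using the Reader API listed above. The @LobFile@ class is assumed to live in @org.apache.hadoop.sqoop.io@ as described in [[SIP-4]], and @LobFile.open()@ is assumed here to take a @Path@ and a Hadoop @Configuration@; the exact factory signature is left to the implementation.

bc.. import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.sqoop.io.LobFile;

/** Illustrative sketch of reading one external large object via LobFile.Reader. */
public class LobLocatorSketch {
  /**
   * Counts the bytes of the BLOB referenced by an externalLob(lf,filename,offset,len)
   * locator whose filename and offset components have already been parsed out and
   * resolved to an absolute path.
   */
  public static long countBlobBytes(String filename, long offset, Configuration conf)
      throws Exception {
    LobFile.Reader reader = LobFile.open(new Path(filename), conf); // assumed signature
    try {
      reader.seek(offset);       // position at (or before) the record start
      if (!reader.next()) {      // align the reader on the record itself
        throw new IllegalStateException("No record found at offset " + offset);
      }
      InputStream in = reader.readBlobRecord();
      byte[] chunk = new byte[8192];
      long total = 0;
      for (int n = in.read(chunk); n != -1; n = in.read(chunk)) {
        total += n;              // consume the stream lazily, a chunk at a time
      }
      return total;
    } finally {
      reader.close();
    }
  }
}

p. This mirrors the sequence described above (open the LobFile through the Reader API, seek to the record's offset, align on the record, and stream its data back), which is what @BlobRef@ and @ClobRef@ perform on the caller's behalf.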
h2. Compatibility Issues + +A large object storage mechanism has not yet been released, so there are no backwards compatibility issues to speak of. + +h2. Test Plan + +The reference implementation provides unit tests which access large objects through the LobFile API as well as through BlobRef and ClobRef, integrating these files with regular record storage. + +Furthermore, a longer-running "stress test" writes several large files which contain records of multiple gigabytes each. Operations such as seeking, iterating, and reading these records are then performed to ensure proper operation. + +h2. Discussion + +Please provide feedback and comments at "http://github.com/cloudera/sqoop/issues/issue/11":http://github.com/cloudera/sqoop/issues/issue/11 + diff --git a/src/docs/sip/sip-4.txt b/src/docs/sip/sip-4.txt new file mode 100644 index 00000000..0d546aca --- /dev/null +++ b/src/docs/sip/sip-4.txt @@ -0,0 +1,98 @@ +== == + +|SIP | 4 | +|Title | Public API for Sqoop v1.0.0 | +|Author | Aaron Kimball (aaron at cloudera dot com) | +|Created | May 14, 2010 | +|Status | Accepted | +|Discussion | "http://github.com/cloudera/sqoop/issues/issue/13":http://github.com/cloudera/sqoop/issues/issue/13 | +|Implementation| "http://review.hbase.org/r/73/":http://review.hbase.org/r/73/ | + +h2. Abstract + +This SIP defines the public API to be exposed in the first release of Sqoop. The @org.apache.hadoop.sqoop.lib@ package contains the public API relied upon by external clients of Sqoop. Generated code produced by Sqoop depends on these modules. Clients of imported data may also rely on additional modules specified here. + +h2. Problem statement + +To deal with the unique table schemas of each database a Sqoop user imports, Sqoop's current design requires that it generate a per-table class. This class is used to interact with the data after it is imported to Hadoop; data can be stored in SequenceFiles, requiring this class to deserialize records. Subsequent re-exports of the data rely on this class to push records back to the RDBMS. The generated class also includes support for parsing text-based representations of the data. + +This class, however, relies on reusable code modules provided with Sqoop. These code modules are all placed in the @org.apache.hadoop.sqoop.lib@ package. Clients of generated code must be able to rely on previously-generated code to work with later versions of Sqoop. While code regeneration is possible, Sqoop users should see the @lib@ package as the most stable API provided by Sqoop. + +Sqoop also provides a file format for large object data; while large objects can be manipulated in the context of their encapsulating records (e.g., through @BlobRef@ or @ClobRef@ references to the data), the large object file store may be inspected directly. + +This SIP defines the official "surface area" of the public packages which will be maintained. In order to ensure that future versions remain backwards compatible, some existing class definitions must be modified. It is hoped that these sorts of "breaking changes" will occur only before incrementing the major version number (1.0, 2.0, etc.), and are thus infrequent disruptions to Sqoop users. Sqoop clients who target only the APIs specified may be confident that their programs will work properly with all subsequent Sqoop releases in the 1.0 series (in accordance with the compatibility and deprecation policy specified in [[SIP-2]]). + +h2. Specification
h3. lib package + +As of 5/14/2010, the lib package contains the following classes: +* @BigDecimalSerializer@ +* @BlobRef@ +* @ClobRef@ +* @FieldFormatter@ +* @JdbcWritableBridge@ +* @LargeObjectLoader@ +* @LobRef@ +* @LobSerializer@ +* @RecordParser@ +* @TaskId@ + +and the following interface: +* @SqoopRecord@ + + +Classes generated by Sqoop fulfill the interface of @SqoopRecord@. The first change necessary in this package is to transform @SqoopRecord@ from an interface into an abstract class. This way, subsequent releases in the 1.0 series can introduce additional methods required by SqoopRecords along with a default implementation for previously-generated clients. + +The @TaskId@ class is improperly placed in this package. This class is Sqoop-internal and should be moved to the @util@ package. + +We should add a class called @DelimiterSet@ which encapsulates the parameters regarding formatting of delimiters around fields: the field terminator, the record terminator, the escape character, the enclosing character, and whether the enclosing character is optional. This would allow sets of delimiters to be manipulated easily. The @SqoopRecord@ class could then be extended with a @toString(DelimiterSet)@ method that allows users to format output with delimiters other than the ones specified at codegen time.
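As a rough illustration of this proposal (the field names and constructor shape below are hypothetical; the actual class is left to the implementation), a minimal @DelimiterSet@ might look like this:

bc.. /** Hypothetical sketch of the proposed DelimiterSet; not the final API. */
public class DelimiterSet {
  private final char fieldTerminator;    // e.g., ','
  private final char recordTerminator;   // e.g., '\n'
  private final char escapeChar;         // e.g., '\\'
  private final char enclosedBy;         // e.g., '"'
  private final boolean encloseRequired; // false when enclosing is optional

  public DelimiterSet(char fieldTerminator, char recordTerminator,
      char escapeChar, char enclosedBy, boolean encloseRequired) {
    this.fieldTerminator = fieldTerminator;
    this.recordTerminator = recordTerminator;
    this.escapeChar = escapeChar;
    this.enclosedBy = enclosedBy;
    this.encloseRequired = encloseRequired;
  }

  public char getFieldTerminator() { return fieldTerminator; }
  public char getRecordTerminator() { return recordTerminator; }
  public char getEscapeChar() { return escapeChar; }
  public char getEnclosedBy() { return enclosedBy; }
  public boolean isEncloseRequired() { return encloseRequired; }
}

p. A generated record could then be rendered with alternate delimiters at runtime, for example via something like @record.toString(new DelimiterSet('|', '\n', '\\', '"', false))@, rather than only the delimiters fixed at codegen time.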
@LobRef@ is an abstract base class that encapsulates common code in @BlobRef@ and @ClobRef@. The constructors for @LobRef@ are marked as @protected@. Clients of Sqoop should not subclass @LobRef@ directly. + +Classes in the lib package may depend on classes elsewhere in Sqoop's implementation. Clients should not do so directly. + + +h3. io package + +Clients of Sqoop who have imported large objects into HDFS may have large object files holding their data; this file format is defined in [[SIP-3]]. The large objects may be manipulated by iterating over their encapsulating records and calling @{B,C}lobRef.getDataStream()@, which will retrieve the data for a large object from its underlying store. However, the large objects may also be directly retrieved from their underlying LobFile storage. + +The @org.apache.hadoop.sqoop.io.LobFile@ class is considered part of the public API. Clients of Sqoop may depend on the @LobFile.Writer@ and @LobFile.Reader@ APIs. Clients should never instantiate subclasses of @Writer@ and @Reader@ directly; instead they should use the static methods @LobFile.create()@ and @LobFile.open()@ respectively. The underlying concrete Writer and Reader implementation classes are considered private. + +To allow users to verify the compression formats available in LobFiles, the @CodecMap.getCodecNames()@ method is also public. + +h3. Entry-points to Sqoop + +A well-defined programmatic entry-point to Sqoop is *not* defined by this specification. The only method of @org.apache.hadoop.sqoop.Sqoop@ considered stable is its @main()@ method; all others are currently internal. This restriction will be relaxed in a future specification, allowing programmatic client interaction with Sqoop. + +h3. Base package + +The base package in Sqoop is currently @org.apache.hadoop.sqoop@. To reflect Sqoop's migration from an Apache Hadoop subproject to its own project, the class hierarchy should be moved to @com.cloudera.sqoop@. + +h2. Compatibility Issues + +The modification of @SqoopRecord@ from interface to class will cause existing generated code to break. Such a change is expected prior to the 1.0.0 release. This is the last remaining interface in the @lib@ package; once it is transitioned to an abstract class, subsequent changes to the SqoopRecord API should be backwards-compatible. + +h2. Test Plan + +The changes required to implement this specification are minimal; the existing unit test suite should cover all necessary testing. + +h2. Discussion + +Please provide feedback and comments at "http://github.com/cloudera/sqoop/issues/issue/13":http://github.com/cloudera/sqoop/issues/issue/13