merge ali/master

This commit is contained in:
zyyang 2024-04-14 14:36:28 +08:00
commit 12fe8776a2
583 changed files with 28713 additions and 4580 deletions

README.md

@ -1,12 +1,13 @@
![Datax-logo](https://github.com/alibaba/DataX/blob/master/images/DataX-logo.jpg)
# DataX
DataX 是阿里云 [DataWorks数据集成](https://www.aliyun.com/product/bigdata/ide) 的开源版本,在阿里巴巴集团内被广泛使用的离线数据同步工具/平台。DataX 实现了包括 MySQL、Oracle、OceanBase、SqlServer、Postgre、HDFS、Hive、ADS、HBase、TableStore(OTS)、MaxCompute(ODPS)、Hologres、DRDS 等各种异构数据源之间高效的数据同步功能。
[![Leaderboard](https://img.shields.io/badge/DataX-%E6%9F%A5%E7%9C%8B%E8%B4%A1%E7%8C%AE%E6%8E%92%E8%A1%8C%E6%A6%9C-orange)](https://opensource.alibaba.com/contribution_leaderboard/details?projectValue=datax)
DataX 是阿里云 [DataWorks数据集成](https://www.aliyun.com/product/bigdata/ide) 的开源版本,在阿里巴巴集团内被广泛使用的离线数据同步工具/平台。DataX 实现了包括 MySQL、Oracle、OceanBase、SqlServer、Postgre、HDFS、Hive、ADS、HBase、TableStore(OTS)、MaxCompute(ODPS)、Hologres、DRDS, databend 等各种异构数据源之间高效的数据同步功能。
# DataX 商业版本
阿里云DataWorks数据集成是DataX团队在阿里云上的商业化产品致力于提供复杂网络环境下、丰富的异构数据源之间高速稳定的数据移动能力以及繁杂业务背景下的数据同步解决方案。目前已经支持云上近3000家客户单日同步数据超过3万亿条。DataWorks数据集成目前支持离线50+种数据源可以进行整库迁移、批量上云、增量同步、分库分表等各类同步解决方案。2020年更新实时同步能力2020年更新实时同步能力支持10+种数据源的读写任意组合。提供MySQLOracle等多种数据源到阿里云MaxComputeHologres等大数据引擎的一键全增量同步解决方案。
阿里云DataWorks数据集成是DataX团队在阿里云上的商业化产品致力于提供复杂网络环境下、丰富的异构数据源之间高速稳定的数据移动能力以及繁杂业务背景下的数据同步解决方案。目前已经支持云上近3000家客户单日同步数据超过3万亿条。DataWorks数据集成目前支持离线50+种数据源可以进行整库迁移、批量上云、增量同步、分库分表等各类同步解决方案。2020年更新实时同步能力支持10+种数据源的读写任意组合。提供MySQLOracle等多种数据源到阿里云MaxComputeHologres等大数据引擎的一键全增量同步解决方案。
商业版本参见: https://www.aliyun.com/product/bigdata/ide
@ -25,7 +26,7 @@ DataX本身作为数据同步框架将不同数据源的同步抽象为从源
# Quick Start
##### Download [DataX下载地址](https://datax-opensource.oss-cn-hangzhou.aliyuncs.com/20220530/datax.tar.gz)
##### Download [DataX下载地址](https://datax-opensource.oss-cn-hangzhou.aliyuncs.com/202308/datax.tar.gz)
##### 请点击:[Quick Start](https://github.com/alibaba/DataX/blob/master/userGuid.md)
@ -36,35 +37,49 @@ DataX本身作为数据同步框架将不同数据源的同步抽象为从源
DataX目前已经有了比较全面的插件体系主流的RDBMS数据库、NOSQL、大数据计算系统都已经接入目前支持数据如下图详情请点击[DataX数据源参考指南](https://github.com/alibaba/DataX/wiki/DataX-all-data-channels)
| 类型 | 数据源 | Reader(读) | Writer(写) |文档|
| ------------ | ---------- | :-------: | :-------: |:-------: |
| RDBMS 关系型数据库 | MySQL | √ | √ |[读](https://github.com/alibaba/DataX/blob/master/mysqlreader/doc/mysqlreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/mysqlwriter/doc/mysqlwriter.md)|
| | Oracle | √ | √ |[读](https://github.com/alibaba/DataX/blob/master/oraclereader/doc/oraclereader.md) 、[写](https://github.com/alibaba/DataX/blob/master/oraclewriter/doc/oraclewriter.md)|
| | OceanBase | √ | √ |[读](https://open.oceanbase.com/docs/community/oceanbase-database/V3.1.0/use-datax-to-full-migration-data-to-oceanbase) 、[写](https://open.oceanbase.com/docs/community/oceanbase-database/V3.1.0/use-datax-to-full-migration-data-to-oceanbase)|
| | SQLServer | √ | √ |[读](https://github.com/alibaba/DataX/blob/master/sqlserverreader/doc/sqlserverreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/sqlserverwriter/doc/sqlserverwriter.md)|
| | PostgreSQL | √ | √ |[读](https://github.com/alibaba/DataX/blob/master/postgresqlreader/doc/postgresqlreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/postgresqlwriter/doc/postgresqlwriter.md)|
| | DRDS | √ | √ |[读](https://github.com/alibaba/DataX/blob/master/drdsreader/doc/drdsreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/drdswriter/doc/drdswriter.md)|
| | 通用RDBMS(支持所有关系型数据库) | √ | √ |[读](https://github.com/alibaba/DataX/blob/master/rdbmsreader/doc/rdbmsreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/rdbmswriter/doc/rdbmswriter.md)|
| 阿里云数仓数据存储 | ODPS | √ | √ |[读](https://github.com/alibaba/DataX/blob/master/odpsreader/doc/odpsreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/odpswriter/doc/odpswriter.md)|
| | ADS | | √ |[写](https://github.com/alibaba/DataX/blob/master/adswriter/doc/adswriter.md)|
| | OSS | √ | √ |[读](https://github.com/alibaba/DataX/blob/master/ossreader/doc/ossreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/osswriter/doc/osswriter.md)|
| | OCS | | √ |[写](https://github.com/alibaba/DataX/blob/master/ocswriter/doc/ocswriter.md)|
| NoSQL数据存储 | OTS | √ | √ |[读](https://github.com/alibaba/DataX/blob/master/otsreader/doc/otsreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/otswriter/doc/otswriter.md)|
| | Hbase0.94 | √ | √ |[读](https://github.com/alibaba/DataX/blob/master/hbase094xreader/doc/hbase094xreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/hbase094xwriter/doc/hbase094xwriter.md)|
| | Hbase1.1 | √ | √ |[读](https://github.com/alibaba/DataX/blob/master/hbase11xreader/doc/hbase11xreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/hbase11xwriter/doc/hbase11xwriter.md)|
| | Phoenix4.x | √ | √ |[读](https://github.com/alibaba/DataX/blob/master/hbase11xsqlreader/doc/hbase11xsqlreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/hbase11xsqlwriter/doc/hbase11xsqlwriter.md)|
| | Phoenix5.x | √ | √ |[读](https://github.com/alibaba/DataX/blob/master/hbase20xsqlreader/doc/hbase20xsqlreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/hbase20xsqlwriter/doc/hbase20xsqlwriter.md)|
| | MongoDB | √ | √ |[读](https://github.com/alibaba/DataX/blob/master/mongodbreader/doc/mongodbreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/mongodbwriter/doc/mongodbwriter.md)|
| | Hive | √ | √ |[读](https://github.com/alibaba/DataX/blob/master/hdfsreader/doc/hdfsreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/hdfswriter/doc/hdfswriter.md)|
| | Cassandra | √ | √ |[读](https://github.com/alibaba/DataX/blob/master/cassandrareader/doc/cassandrareader.md) 、[写](https://github.com/alibaba/DataX/blob/master/cassandrawriter/doc/cassandrawriter.md)|
| 无结构化数据存储 | TxtFile | √ | √ |[读](https://github.com/alibaba/DataX/blob/master/txtfilereader/doc/txtfilereader.md) 、[写](https://github.com/alibaba/DataX/blob/master/txtfilewriter/doc/txtfilewriter.md)|
| | FTP | √ | √ |[读](https://github.com/alibaba/DataX/blob/master/ftpreader/doc/ftpreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/ftpwriter/doc/ftpwriter.md)|
| | HDFS | √ | √ |[读](https://github.com/alibaba/DataX/blob/master/hdfsreader/doc/hdfsreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/hdfswriter/doc/hdfswriter.md)|
| | Elasticsearch | | √ |[写](https://github.com/alibaba/DataX/blob/master/elasticsearchwriter/doc/elasticsearchwriter.md)|
| 时间序列数据库 | OpenTSDB | √ | |[读](https://github.com/alibaba/DataX/blob/master/opentsdbreader/doc/opentsdbreader.md)|
| | TSDB | √ | √ |[读](https://github.com/alibaba/DataX/blob/master/tsdbreader/doc/tsdbreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/tsdbwriter/doc/tsdbhttpwriter.md)|
| | TDengine2.0 | √ | √ |[读](https://github.com/taosdata/DataX/blob/master/tdengine20reader/doc/tdengine20reader-CN.md) 、[写](https://github.com/alibaba/DataX/blob/master/tdengine20writer/doc/tdengine20writer-CN.md)|
| | TDengine3.0 | √ | √ |[读](https://github.com/taosdata/DataX/blob/master/tdengine30reader/doc/tdengine30reader-CN.md) 、[写](https://github.com/alibaba/DataX/blob/master/tdengine30writer/doc/tdengine30writer-CN.md)|
| 类型 | 数据源 | Reader(读) | Writer(写) | 文档 |
|--------------|---------------------------|:---------:|:---------:|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| RDBMS 关系型数据库 | MySQL | √ | √ | [读](https://github.com/alibaba/DataX/blob/master/mysqlreader/doc/mysqlreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/mysqlwriter/doc/mysqlwriter.md) |
| | Oracle | √ | √ | [读](https://github.com/alibaba/DataX/blob/master/oraclereader/doc/oraclereader.md) 、[写](https://github.com/alibaba/DataX/blob/master/oraclewriter/doc/oraclewriter.md) |
| | OceanBase | √ | √ | [读](https://open.oceanbase.com/docs/community/oceanbase-database/V3.1.0/use-datax-to-full-migration-data-to-oceanbase) 、[写](https://open.oceanbase.com/docs/community/oceanbase-database/V3.1.0/use-datax-to-full-migration-data-to-oceanbase) |
| | SQLServer | √ | √ | [读](https://github.com/alibaba/DataX/blob/master/sqlserverreader/doc/sqlserverreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/sqlserverwriter/doc/sqlserverwriter.md) |
| | PostgreSQL | √ | √ | [读](https://github.com/alibaba/DataX/blob/master/postgresqlreader/doc/postgresqlreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/postgresqlwriter/doc/postgresqlwriter.md) |
| | DRDS | √ | √ | [读](https://github.com/alibaba/DataX/blob/master/drdsreader/doc/drdsreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/drdswriter/doc/drdswriter.md) |
| | Kingbase | √ | √ | [读](https://github.com/alibaba/DataX/blob/master/drdsreader/doc/drdsreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/drdswriter/doc/drdswriter.md) |
| | 通用RDBMS(支持所有关系型数据库) | √ | √ | [读](https://github.com/alibaba/DataX/blob/master/rdbmsreader/doc/rdbmsreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/rdbmswriter/doc/rdbmswriter.md) |
| 阿里云数仓数据存储 | ODPS | √ | √ | [读](https://github.com/alibaba/DataX/blob/master/odpsreader/doc/odpsreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/odpswriter/doc/odpswriter.md) |
| | ADB | | √ | [写](https://github.com/alibaba/DataX/blob/master/adbmysqlwriter/doc/adbmysqlwriter.md) |
| | ADS | | √ | [写](https://github.com/alibaba/DataX/blob/master/adswriter/doc/adswriter.md) |
| | OSS | √ | √ | [读](https://github.com/alibaba/DataX/blob/master/ossreader/doc/ossreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/osswriter/doc/osswriter.md) |
| | OCS | | √ | [写](https://github.com/alibaba/DataX/blob/master/ocswriter/doc/ocswriter.md) |
| | Hologres | | √ | [写](https://github.com/alibaba/DataX/blob/master/hologresjdbcwriter/doc/hologresjdbcwriter.md) |
| | AnalyticDB For PostgreSQL | | √ | 写 |
| 阿里云中间件 | datahub | √ | √ | 读 、写 |
| | SLS | √ | √ | 读 、写 |
| 图数据库 | 阿里云 GDB | √ | √ | [读](https://github.com/alibaba/DataX/blob/master/gdbreader/doc/gdbreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/gdbwriter/doc/gdbwriter.md) |
| | Neo4j | | √ | [写](https://github.com/alibaba/DataX/blob/master/neo4jwriter/doc/neo4jwriter.md) |
| NoSQL数据存储 | OTS | √ | √ | [读](https://github.com/alibaba/DataX/blob/master/otsreader/doc/otsreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/otswriter/doc/otswriter.md) |
| | Hbase0.94 | √ | √ | [读](https://github.com/alibaba/DataX/blob/master/hbase094xreader/doc/hbase094xreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/hbase094xwriter/doc/hbase094xwriter.md) |
| | Hbase1.1 | √ | √ | [读](https://github.com/alibaba/DataX/blob/master/hbase11xreader/doc/hbase11xreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/hbase11xwriter/doc/hbase11xwriter.md) |
| | Phoenix4.x | √ | √ | [读](https://github.com/alibaba/DataX/blob/master/hbase11xsqlreader/doc/hbase11xsqlreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/hbase11xsqlwriter/doc/hbase11xsqlwriter.md) |
| | Phoenix5.x | √ | √ | [读](https://github.com/alibaba/DataX/blob/master/hbase20xsqlreader/doc/hbase20xsqlreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/hbase20xsqlwriter/doc/hbase20xsqlwriter.md) |
| | MongoDB | √ | √ | [读](https://github.com/alibaba/DataX/blob/master/mongodbreader/doc/mongodbreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/mongodbwriter/doc/mongodbwriter.md) |
| | Cassandra | √ | √ | [读](https://github.com/alibaba/DataX/blob/master/cassandrareader/doc/cassandrareader.md) 、[写](https://github.com/alibaba/DataX/blob/master/cassandrawriter/doc/cassandrawriter.md) |
| 数仓数据存储 | StarRocks | √ | √ | 读 、[写](https://github.com/alibaba/DataX/blob/master/starrockswriter/doc/starrockswriter.md) |
| | ApacheDoris | | √ | [写](https://github.com/alibaba/DataX/blob/master/doriswriter/doc/doriswriter.md) |
| | ClickHouse | √ | √ | [读](https://github.com/alibaba/DataX/blob/master/clickhousereader/doc/clickhousereader.md) 、[写](https://github.com/alibaba/DataX/blob/master/clickhousewriter/doc/clickhousewriter.md) |
| | Databend | | √ | [写](https://github.com/alibaba/DataX/blob/master/databendwriter/doc/databendwriter.md) |
| | Hive | √ | √ | [读](https://github.com/alibaba/DataX/blob/master/hdfsreader/doc/hdfsreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/hdfswriter/doc/hdfswriter.md) |
| | kudu | | √ | [写](https://github.com/alibaba/DataX/blob/master/hdfswriter/doc/hdfswriter.md) |
| | selectdb | | √ | [写](https://github.com/alibaba/DataX/blob/master/selectdbwriter/doc/selectdbwriter.md) |
| 无结构化数据存储 | TxtFile | √ | √ | [读](https://github.com/alibaba/DataX/blob/master/txtfilereader/doc/txtfilereader.md) 、[写](https://github.com/alibaba/DataX/blob/master/txtfilewriter/doc/txtfilewriter.md) |
| | FTP | √ | √ | [读](https://github.com/alibaba/DataX/blob/master/ftpreader/doc/ftpreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/ftpwriter/doc/ftpwriter.md) |
| | HDFS | √ | √ | [读](https://github.com/alibaba/DataX/blob/master/hdfsreader/doc/hdfsreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/hdfswriter/doc/hdfswriter.md) |
| | Elasticsearch | | √ | [写](https://github.com/alibaba/DataX/blob/master/elasticsearchwriter/doc/elasticsearchwriter.md) |
| 时间序列数据库 | OpenTSDB | √ | | [读](https://github.com/alibaba/DataX/blob/master/opentsdbreader/doc/opentsdbreader.md) |
| | TSDB | √ | √ | [读](https://github.com/alibaba/DataX/blob/master/tsdbreader/doc/tsdbreader.md) 、[写](https://github.com/alibaba/DataX/blob/master/tsdbwriter/doc/tsdbhttpwriter.md) |
| | TDengine | √ | √ | [读](https://github.com/alibaba/DataX/blob/master/tdenginereader/doc/tdenginereader-CN.md) 、[写](https://github.com/alibaba/DataX/blob/master/tdenginewriter/doc/tdenginewriter-CN.md) |
# 阿里云DataWorks数据集成
@ -86,7 +101,7 @@ DataX目前已经有了比较全面的插件体系主流的RDBMS数据库、N
- 整库迁移https://help.aliyun.com/document_detail/137809.html
- 批量上云https://help.aliyun.com/document_detail/146671.html
- 更新更多能力请访问https://help.aliyun.com/document_detail/137663.html
-
# 我要开发新的插件
@ -96,6 +111,40 @@ DataX目前已经有了比较全面的插件体系主流的RDBMS数据库、N
DataX 后续计划月度迭代更新,也欢迎感兴趣的同学提交 Pull requests月度更新内容介绍如下。
- [datax_v202309](https://github.com/alibaba/DataX/releases/tag/datax_v202309)
- 支持Phoenix 同步数据添加 where条件
- 支持华为 GaussDB 读写插件
- 修复ClickReader 插件运行报错 Can't find bundle for base name
- 增加 DataX调试模块
- 修复 orc空文件报错问题
- 优化obwriter性能
- txtfilewriter 增加导出为insert语句功能支持
- HdfsReader/HdfsWriter 支持parquet读写能力
- [datax_v202308](https://github.com/alibaba/DataX/releases/tag/datax_v202308)
- OTS 插件更新
- databend 插件更新
- Oceanbase驱动修复
- [datax_v202306](https://github.com/alibaba/DataX/releases/tag/datax_v202306)
- 精简代码
- 新增插件neo4jwriter、clickhousewriter
- 优化插件、修复问题oceanbase、hdfs、databend、txtfile
- [datax_v202303](https://github.com/alibaba/DataX/releases/tag/datax_v202303)
- 精简代码
- 新增插件adbmysqlwriter、databendwriter、selectdbwriter
- 优化插件、修复问题sqlserver、hdfs、cassandra、kudu、oss
- fastjson 升级到 fastjson2
- [datax_v202210](https://github.com/alibaba/DataX/releases/tag/datax_v202210)
- 涉及通道能力更新OceanBase、Tdengine、Doris等
- [datax_v202209](https://github.com/alibaba/DataX/releases/tag/datax_v202209)
- 涉及通道能力更新MaxCompute、Datahub、SLS等、安全漏洞更新、通用打包更新等
- [datax_v202205](https://github.com/alibaba/DataX/releases/tag/datax_v202205)
- 涉及通道能力更新MaxCompute、Hologres、OSS、Tdengine等、安全漏洞更新、通用打包更新等


@ -0,0 +1,338 @@
# DataX AdbMysqlWriter
---
## 1 快速介绍
AdbMysqlWriter 插件实现了写入数据到 ADB MySQL 目的表的功能。在底层实现上, AdbMysqlWriter 通过 JDBC 连接远程 ADB MySQL 数据库,并执行相应的 `insert into ...` 或者 ( `replace into ...` ) 的 SQL 语句将数据写入 ADB MySQL内部会分批次提交入库。
AdbMysqlWriter 面向ETL开发工程师他们使用 AdbMysqlWriter 从数仓导入数据到 ADB MySQL。同时 AdbMysqlWriter 亦可以作为数据迁移工具为DBA等用户提供服务。
## 2 实现原理
AdbMysqlWriter 通过 DataX 框架获取 Reader 生成的协议数据AdbMysqlWriter 通过 JDBC 连接远程 ADB MySQL 数据库,并执行相应的 `insert into ...` 或者 ( `replace into ...` ) 的 SQL 语句将数据写入 ADB MySQL。
* `insert into...`(遇到主键重复时会自动忽略当前写入数据,不做更新,作用等同于`insert ignore into`)
##### 或者
* `replace into...`(没有遇到主键/唯一性索引冲突时,与 insert into 行为一致,冲突时会用新行替换原有行所有字段) 的语句写入数据到 MySQL。出于性能考虑采用了 `PreparedStatement + Batch`,并且设置了:`rewriteBatchedStatements=true`,将数据缓冲到线程上下文 Buffer 中,当 Buffer 累计到预定阈值时,才发起写入请求。
<br />
注意:整个任务至少需要具备 `insert/replace into...` 的权限,是否需要其他权限,取决于你任务配置中在 preSql 和 postSql 中指定的语句。
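下面给出一个示意性的 Java 片段(并非插件源码,仅用于演示上文所述 `PreparedStatement + Batch` 的攒批提交方式;其中表名 `demo_tbl`、列名与批量阈值均为假设):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class BatchReplaceDemo {
    // 缓冲到一定行数后再统一 executeBatch与上文描述的分批提交思路一致
    private static final int BATCH_SIZE = 2048;

    public static void write(List<Object[]> rows) throws Exception {
        // rewriteBatchedStatements=true 让驱动把批量语句改写为多值插入,减少网络往返
        String url = "jdbc:mysql://ip:port/database?useUnicode=true&rewriteBatchedStatements=true";
        try (Connection conn = DriverManager.getConnection(url, "root", "root");
             PreparedStatement ps = conn.prepareStatement(
                     "replace into demo_tbl(id, name) values(?, ?)")) {
            int buffered = 0;
            for (Object[] row : rows) {
                ps.setObject(1, row[0]);
                ps.setObject(2, row[1]);
                ps.addBatch();
                if (++buffered >= BATCH_SIZE) {   // 达到阈值时发起一次写入请求
                    ps.executeBatch();
                    buffered = 0;
                }
            }
            if (buffered > 0) {
                ps.executeBatch();                // 提交剩余不足一批的数据
            }
        }
    }
}
```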
## 3 功能说明
### 3.1 配置样例
* 这里使用一份从内存产生到 ADB MySQL 导入的数据。
```json
{
"job": {
"setting": {
"speed": {
"channel": 1
}
},
"content": [
{
"reader": {
"name": "streamreader",
"parameter": {
"column" : [
{
"value": "DataX",
"type": "string"
},
{
"value": 19880808,
"type": "long"
},
{
"value": "1988-08-08 08:08:08",
"type": "date"
},
{
"value": true,
"type": "bool"
},
{
"value": "test",
"type": "bytes"
}
],
"sliceRecordCount": 1000
}
},
"writer": {
"name": "adbmysqlwriter",
"parameter": {
"writeMode": "replace",
"username": "root",
"password": "root",
"column": [
"*"
],
"preSql": [
"truncate table @table"
],
"connection": [
{
"jdbcUrl": "jdbc:mysql://ip:port/database?useUnicode=true",
"table": [
"test"
]
}
]
}
}
}
]
}
}
```
### 3.2 参数说明
* **jdbcUrl**
* 描述:目的数据库的 JDBC 连接信息。作业运行时DataX 会在你提供的 jdbcUrl 后面追加如下属性yearIsDateType=false&zeroDateTimeBehavior=convertToNull&rewriteBatchedStatements=true
注意1、在一个数据库上只能配置一个 jdbcUrl
2、一个 AdbMySQL 写入任务仅能配置一个 jdbcUrl
3、jdbcUrl按照MySQL官方规范并可以填写连接附加控制信息比如想指定连接编码为 gbk ,则在 jdbcUrl 后面追加属性 useUnicode=true&characterEncoding=gbk。具体请参看 Mysql官方文档或者咨询对应 DBA。
* 必选:是 <br />
* 默认值:无 <br />
* **username**
* 描述:目的数据库的用户名 <br />
* 必选:是 <br />
* 默认值:无 <br />
* **password**
* 描述:目的数据库的密码 <br />
* 必选:是 <br />
* 默认值:无 <br />
* **table**
* 描述:目的表的表名称。只能配置一个 AdbMySQL 的表名称。
注意table 和 jdbcUrl 必须包含在 connection 配置单元中
* 必选:是 <br />
* 默认值:无 <br />
* **column**
* 描述:目的表需要写入数据的字段,字段之间用英文逗号分隔。例如: "column": ["id", "name", "age"]。如果要依次写入全部列,使用`*`表示, 例如: `"column": ["*"]`。
**column配置项必须指定不能留空**
注意1、我们强烈不推荐你这样配置因为当你目的表字段个数、类型等有改动时你的任务可能运行不正确或者失败
2、 column 不能配置任何常量值
* 必选:是 <br />
* 默认值:否 <br />
* **session**
* 描述: DataX在获取 ADB MySQL 连接时执行session指定的SQL语句修改当前connection session属性
* 必须: 否
* 默认值: 空
* **preSql**
* 描述:写入数据到目的表前,会先执行这里的标准语句。如果 Sql 中有你需要操作到的表名称,请使用 `@table` 表示,这样在实际执行 SQL 语句时,会对变量按照实际表名称进行替换。比如希望导入数据前,先对表中数据进行删除操作,那么你可以这样配置:`"preSql":["truncate table @table"]`,效果是:在执行到每个表写入数据前,会先执行对应的 `truncate table 对应表名称` <br />
* 必选:否 <br />
* 默认值:无 <br />
* **postSql**
* 描述:写入数据到目的表后,会执行这里的标准语句。(原理同 preSql <br />
* 必选:否 <br />
* 默认值:无 <br />
* **writeMode**
* 描述:控制写入数据到目标表采用 `insert into` 或者 `replace into` 或者 `ON DUPLICATE KEY UPDATE` 语句,三种模式对应的 SQL 形式可参考本节参数说明末尾的示意代码<br />
* 必选:是 <br />
* 所有选项insert/replace/update <br />
* 默认值replace <br />
* **batchSize**
* 描述一次性批量提交的记录数大小该值可以极大减少DataX与 Adb MySQL 的网络交互次数并提升整体吞吐量。但是该值设置过大可能会造成DataX运行进程OOM情况。<br />
* 必选:否 <br />
* 默认值2048 <br />
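下面用一个示意方法说明 writeMode 三种取值按本文档描述大致对应的 SQL 模板(仅为帮助理解的草图,表名 `demo_tbl` 与列名为假设,实际模板以插件公共写入模块的实现为准):

```java
// 示意:根据 writeMode 生成不同的写入 SQL 模板
public class WriteModeDemo {
    static String buildTemplate(String writeMode) {
        String columns = "(id, name)";
        String values = "values(?, ?)";
        switch (writeMode.toLowerCase()) {
            case "insert":
                // 文档描述其效果等同于 insert ignore into主键冲突时忽略当前行
                return "insert ignore into demo_tbl" + columns + " " + values;
            case "replace":
                // 主键/唯一键冲突时用新行整体替换旧行
                return "replace into demo_tbl" + columns + " " + values;
            case "update":
                // 冲突时按列更新,对应 ON DUPLICATE KEY UPDATE
                return "insert into demo_tbl" + columns + " " + values
                        + " on duplicate key update name = values(name)";
            default:
                throw new IllegalArgumentException("unsupported writeMode: " + writeMode);
        }
    }
}
```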
### 3.3 类型转换
目前 AdbMysqlWriter 支持大部分 MySQL 类型,但也存在部分个别类型没有支持的情况,请注意检查你的类型。
下面列出 AdbMysqlWriter 针对 MySQL 类型转换列表:
| DataX 内部类型 | AdbMysql 数据类型 |
|---------------|---------------------------------|
| Long | tinyint, smallint, int, bigint |
| Double | float, double, decimal |
| String | varchar |
| Date | date, time, datetime, timestamp |
| Boolean | boolean |
| Bytes | binary |
## 4 性能报告
### 4.1 环境准备
#### 4.1.1 数据特征
TPC-H 数据集 lineitem 表,共 17 个字段, 随机生成总记录行数 59986052。未压缩总数据量7.3GiB
建表语句:
CREATE TABLE `datax_adbmysqlwriter_perf_lineitem` (
`l_orderkey` bigint NOT NULL COMMENT '',
`l_partkey` int NOT NULL COMMENT '',
`l_suppkey` int NOT NULL COMMENT '',
`l_linenumber` int NOT NULL COMMENT '',
`l_quantity` decimal(15,2) NOT NULL COMMENT '',
`l_extendedprice` decimal(15,2) NOT NULL COMMENT '',
`l_discount` decimal(15,2) NOT NULL COMMENT '',
`l_tax` decimal(15,2) NOT NULL COMMENT '',
`l_returnflag` varchar(1024) NOT NULL COMMENT '',
`l_linestatus` varchar(1024) NOT NULL COMMENT '',
`l_shipdate` date NOT NULL COMMENT '',
`l_commitdate` date NOT NULL COMMENT '',
`l_receiptdate` date NOT NULL COMMENT '',
`l_shipinstruct` varchar(1024) NOT NULL COMMENT '',
`l_shipmode` varchar(1024) NOT NULL COMMENT '',
`l_comment` varchar(1024) NOT NULL COMMENT '',
`dummy` varchar(1024),
PRIMARY KEY (`l_orderkey`, `l_linenumber`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='datax perf test';
单行记录类似于:
l_orderkey: 2122789
l_partkey: 1233571
l_suppkey: 8608
l_linenumber: 1
l_quantity: 35.00
l_extendedprice: 52657.85
l_discount: 0.02
l_tax: 0.07
l_returnflag: N
l_linestatus: O
l_shipdate: 1996-11-03
l_commitdate: 1996-12-07
l_receiptdate: 1996-11-16
l_shipinstruct: COLLECT COD
l_shipmode: FOB
l_comment: ld, regular theodolites.
dummy:
#### 4.1.2 机器参数
* DataX ECS: 24Core48GB
* Adb MySQL 数据库
* 计算资源16Core64GB集群版
* 弹性IO资源3
#### 4.1.3 DataX jvm 参数
-Xms1G -Xmx10G -XX:+HeapDumpOnOutOfMemoryError
### 4.2 测试报告
| 通道数 | 批量提交行数 | DataX速度(Rec/s) | DataX流量(MB/s) | 导入用时(s) |
|-----|-------|------------------|---------------|---------|
| 1 | 512 | 23071 | 2.34 | 2627 |
| 1 | 1024 | 26080 | 2.65 | 2346 |
| 1 | 2048 | 28162 | 2.86 | 2153 |
| 1 | 4096 | 28978 | 2.94 | 2119 |
| 4 | 512 | 56590 | 5.74 | 1105 |
| 4 | 1024 | 81062 | 8.22 | 763 |
| 4 | 2048 | 107117 | 10.87 | 605 |
| 4 | 4096 | 113181 | 11.48 | 579 |
| 8 | 512 | 81062 | 8.22 | 786 |
| 8 | 1024 | 127629 | 12.95 | 519 |
| 8 | 2048 | 187456 | 19.01 | 369 |
| 8 | 4096 | 206848 | 20.98 | 341 |
| 16 | 512 | 130404 | 13.23 | 513 |
| 16 | 1024 | 214235 | 21.73 | 335 |
| 16 | 2048 | 299930 | 30.42 | 253 |
| 16 | 4096 | 333255 | 33.80 | 227 |
| 32 | 512 | 206848 | 20.98 | 347 |
| 32 | 1024 | 315716 | 32.02 | 241 |
| 32 | 2048 | 399907 | 40.56 | 199 |
| 32 | 4096 | 461431 | 46.80 | 184 |
| 64 | 512 | 333255 | 33.80 | 231 |
| 64 | 1024 | 399907 | 40.56 | 204 |
| 64 | 2048 | 428471 | 43.46 | 199 |
| 64 | 4096 | 461431 | 46.80 | 187 |
| 128 | 512 | 333255 | 33.80 | 235 |
| 128 | 1024 | 399907 | 40.56 | 203 |
| 128 | 2048 | 425432 | 43.15 | 197 |
| 128 | 4096 | 387006 | 39.26 | 211 |
说明:
1. datax 使用 txtfilereader 读取本地文件,避免源端存在性能瓶颈。
#### 性能测试小结
1. channel通道个数和batchSize对性能影响比较大
2. 写入数据库时,通常不建议通道个数超过 32
## 5 约束限制
## FAQ
***
**Q: AdbMysqlWriter 执行 postSql 语句报错,那么数据导入到目标数据库了吗?**
A: DataX 导入过程存在三块逻辑pre 操作、导入操作、post 操作其中任意一环报错DataX 作业报错。由于 DataX 不能保证在同一个事务完成上述几个操作,因此有可能数据已经落入到目标端。
***
**Q: 按照上述说法,那么有部分脏数据导入数据库,如果影响到线上数据库怎么办?**
A: 目前有两种解法,第一种配置 pre 语句,该 sql 可以清理当天导入数据, DataX 每次导入时候可以把上次清理干净并导入完整数据。第二种,向临时表导入数据,完成后再 rename 到线上表。
***
**Q: 上面第二种方法可以避免对线上数据造成影响,那我具体怎样操作?**
A: 可以配置临时表导入
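下面是该思路的一个示意片段(假设目标库支持 `create table ... like` 与 `RENAME TABLE` 语法,表名均为举例,请结合实际 ADB MySQL 版本确认):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SwapTableDemo {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://ip:port/database?useUnicode=true";
        try (Connection conn = DriverManager.getConnection(url, "root", "root");
             Statement stmt = conn.createStatement()) {
            // 1. 建一张结构相同的临时表DataX 作业的 table 配置指向它
            stmt.execute("create table demo_tbl_tmp like demo_tbl");
            // 2. DataX 作业向 demo_tbl_tmp 导入数据(此步骤由 DataX 完成,略)
            // 3. 导入成功后,把临时表切换为线上表
            stmt.execute("rename table demo_tbl to demo_tbl_bak, demo_tbl_tmp to demo_tbl");
        }
    }
}
```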

adbmysqlwriter/pom.xml

@ -0,0 +1,79 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>com.alibaba.datax</groupId>
<artifactId>datax-all</artifactId>
<version>0.0.1-SNAPSHOT</version>
</parent>
<artifactId>adbmysqlwriter</artifactId>
<name>adbmysqlwriter</name>
<packaging>jar</packaging>
<dependencies>
<dependency>
<groupId>com.alibaba.datax</groupId>
<artifactId>datax-common</artifactId>
<version>${datax-project-version}</version>
<exclusions>
<exclusion>
<artifactId>slf4j-log4j12</artifactId>
<groupId>org.slf4j</groupId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
</dependency>
<dependency>
<groupId>ch.qos.logback</groupId>
<artifactId>logback-classic</artifactId>
</dependency>
<dependency>
<groupId>com.alibaba.datax</groupId>
<artifactId>plugin-rdbms-util</artifactId>
<version>${datax-project-version}</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.40</version>
</dependency>
</dependencies>
<build>
<plugins>
<!-- compiler plugin -->
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>${jdk-version}</source>
<target>${jdk-version}</target>
<encoding>${project-sourceEncoding}</encoding>
</configuration>
</plugin>
<!-- assembly plugin -->
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<descriptors>
<descriptor>src/main/assembly/package.xml</descriptor>
</descriptors>
<finalName>datax</finalName>
</configuration>
<executions>
<execution>
<id>dwzip</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>


@ -0,0 +1,35 @@
<assembly
xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0 http://maven.apache.org/xsd/assembly-1.1.0.xsd">
<id></id>
<formats>
<format>dir</format>
</formats>
<includeBaseDirectory>false</includeBaseDirectory>
<fileSets>
<fileSet>
<directory>src/main/resources</directory>
<includes>
<include>plugin.json</include>
<include>plugin_job_template.json</include>
</includes>
<outputDirectory>plugin/writer/adbmysqlwriter</outputDirectory>
</fileSet>
<fileSet>
<directory>target/</directory>
<includes>
<include>adbmysqlwriter-0.0.1-SNAPSHOT.jar</include>
</includes>
<outputDirectory>plugin/writer/adbmysqlwriter</outputDirectory>
</fileSet>
</fileSets>
<dependencySets>
<dependencySet>
<useProjectArtifact>false</useProjectArtifact>
<outputDirectory>plugin/writer/adbmysqlwriter/libs</outputDirectory>
<scope>runtime</scope>
</dependencySet>
</dependencySets>
</assembly>


@ -0,0 +1,138 @@
package com.alibaba.datax.plugin.writer.adbmysqlwriter;
import com.alibaba.datax.common.element.Record;
import com.alibaba.datax.common.plugin.RecordReceiver;
import com.alibaba.datax.common.spi.Writer;
import com.alibaba.datax.common.util.Configuration;
import com.alibaba.datax.plugin.rdbms.util.DataBaseType;
import com.alibaba.datax.plugin.rdbms.writer.CommonRdbmsWriter;
import com.alibaba.datax.plugin.rdbms.writer.Key;
import org.apache.commons.lang3.StringUtils;
import java.sql.Connection;
import java.sql.SQLException;
import java.util.List;
public class AdbMysqlWriter extends Writer {
private static final DataBaseType DATABASE_TYPE = DataBaseType.ADB;
public static class Job extends Writer.Job {
private Configuration originalConfig = null;
private CommonRdbmsWriter.Job commonRdbmsWriterJob;
@Override
public void preCheck(){
this.init();
this.commonRdbmsWriterJob.writerPreCheck(this.originalConfig, DATABASE_TYPE);
}
@Override
public void init() {
this.originalConfig = super.getPluginJobConf();
this.commonRdbmsWriterJob = new CommonRdbmsWriter.Job(DATABASE_TYPE);
this.commonRdbmsWriterJob.init(this.originalConfig);
}
// 一般来说是需要推迟到 task 中进行pre 的执行单表情况例外
@Override
public void prepare() {
//实跑先不支持 权限 检验
//this.commonRdbmsWriterJob.privilegeValid(this.originalConfig, DATABASE_TYPE);
this.commonRdbmsWriterJob.prepare(this.originalConfig);
}
@Override
public List<Configuration> split(int mandatoryNumber) {
return this.commonRdbmsWriterJob.split(this.originalConfig, mandatoryNumber);
}
// 一般来说是需要推迟到 task 中进行post 的执行单表情况例外
@Override
public void post() {
this.commonRdbmsWriterJob.post(this.originalConfig);
}
@Override
public void destroy() {
this.commonRdbmsWriterJob.destroy(this.originalConfig);
}
}
public static class Task extends Writer.Task {
private Configuration writerSliceConfig;
private CommonRdbmsWriter.Task commonRdbmsWriterTask;
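// DelegateClass 包装 CommonRdbmsWriter.Task在批量写入前后累计耗时与行数并按固定时间间隔输出指标日志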
public static class DelegateClass extends CommonRdbmsWriter.Task {
private long writeTime = 0L;
private long writeCount = 0L;
private long lastLogTime = 0;
public DelegateClass(DataBaseType dataBaseType) {
super(dataBaseType);
}
@Override
protected void doBatchInsert(Connection connection, List<Record> buffer)
throws SQLException {
long startTime = System.currentTimeMillis();
super.doBatchInsert(connection, buffer);
writeCount = writeCount + buffer.size();
writeTime = writeTime + (System.currentTimeMillis() - startTime);
// log write metrics every 10 seconds
if (System.currentTimeMillis() - lastLogTime > 10000) {
lastLogTime = System.currentTimeMillis();
logTotalMetrics();
}
}
public void logTotalMetrics() {
LOG.info(Thread.currentThread().getName() + ", AdbMySQL writer take " + writeTime + " ms, write " + writeCount + " records.");
}
}
@Override
public void init() {
this.writerSliceConfig = super.getPluginJobConf();
if (StringUtils.isBlank(this.writerSliceConfig.getString(Key.WRITE_MODE))) {
this.writerSliceConfig.set(Key.WRITE_MODE, "REPLACE");
}
this.commonRdbmsWriterTask = new DelegateClass(DATABASE_TYPE);
this.commonRdbmsWriterTask.init(this.writerSliceConfig);
}
@Override
public void prepare() {
this.commonRdbmsWriterTask.prepare(this.writerSliceConfig);
}
//TODO 改用连接池确保每次获取的连接都是可用的注意连接可能需要每次都初始化其 session
public void startWrite(RecordReceiver recordReceiver) {
this.commonRdbmsWriterTask.startWrite(recordReceiver, this.writerSliceConfig,
super.getTaskPluginCollector());
}
@Override
public void post() {
this.commonRdbmsWriterTask.post(this.writerSliceConfig);
}
@Override
public void destroy() {
this.commonRdbmsWriterTask.destroy(this.writerSliceConfig);
}
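// 仅在 writeMode 为 replace 时支持失败重跑replace 写入可重复执行,重试不会造成数据重复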
@Override
public boolean supportFailOver(){
String writeMode = writerSliceConfig.getString(Key.WRITE_MODE);
return "replace".equalsIgnoreCase(writeMode);
}
}
}


@ -0,0 +1,6 @@
{
"name": "adbmysqlwriter",
"class": "com.alibaba.datax.plugin.writer.adbmysqlwriter.AdbMysqlWriter",
"description": "useScene: prod. mechanism: Jdbc connection using the database, execute insert sql. warn: The more you know about the database, the less problems you encounter.",
"developer": "alibaba"
}


@ -0,0 +1,20 @@
{
"name": "adbmysqlwriter",
"parameter": {
"username": "username",
"password": "password",
"column": ["col1", "col2", "col3"],
"connection": [
{
"jdbcUrl": "jdbc:mysql://<host>:<port>[/<database>]",
"table": ["table1", "table2"]
}
],
"preSql": [],
"postSql": [],
"batchSize": 65536,
"batchByteSize": 134217728,
"dryRun": false,
"writeMode": "insert"
}
}


@ -110,7 +110,6 @@ DataX 将数据直连ADS接口利用ADS暴露的INSERT接口直写到ADS。
"account": "xxx@aliyun.com",
"odpsServer": "xxx",
"tunnelServer": "xxx",
"accountType": "aliyun",
"project": "transfer_project"
},
"writeMode": "load",


@ -18,7 +18,7 @@ import com.alibaba.datax.plugin.writer.adswriter.AdsWriterErrorCode;
import com.alibaba.datax.plugin.writer.adswriter.ads.TableInfo;
import com.alibaba.datax.plugin.writer.adswriter.util.Constant;
import com.alibaba.datax.plugin.writer.adswriter.util.Key;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson2.JSON;
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.lang3.tuple.Pair;
import org.slf4j.Logger;


@ -12,7 +12,6 @@ public class TransferProjectConf {
public final static String KEY_ACCOUNT = "odps.account";
public final static String KEY_ODPS_SERVER = "odps.odpsServer";
public final static String KEY_ODPS_TUNNEL = "odps.tunnelServer";
public final static String KEY_ACCOUNT_TYPE = "odps.accountType";
public final static String KEY_PROJECT = "odps.project";
private String accessId;
@ -20,7 +19,6 @@ public class TransferProjectConf {
private String account;
private String odpsServer;
private String odpsTunnel;
private String accountType;
private String project;
public static TransferProjectConf create(Configuration adsWriterConf) {
@ -30,7 +28,6 @@ public class TransferProjectConf {
res.account = adsWriterConf.getString(KEY_ACCOUNT);
res.odpsServer = adsWriterConf.getString(KEY_ODPS_SERVER);
res.odpsTunnel = adsWriterConf.getString(KEY_ODPS_TUNNEL);
res.accountType = adsWriterConf.getString(KEY_ACCOUNT_TYPE, "aliyun");
res.project = adsWriterConf.getString(KEY_PROJECT);
return res;
}
@ -55,9 +52,6 @@ public class TransferProjectConf {
return odpsTunnel;
}
public String getAccountType() {
return accountType;
}
public String getProject() {
return project;


@ -70,7 +70,7 @@ public class DataType {
} else if ("datetime".equals(type)) {
return DATETIME;
} else {
throw new IllegalArgumentException("unkown type: " + type);
throw new IllegalArgumentException("unknown type: " + type);
}
}


@ -23,7 +23,7 @@ import com.alibaba.datax.common.element.StringColumn;
import com.alibaba.datax.common.exception.DataXException;
import com.alibaba.datax.common.plugin.TaskPluginCollector;
import com.alibaba.datax.common.util.Configuration;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson2.JSON;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.CodecRegistry;
@ -298,6 +298,7 @@ public class CassandraReaderHelper {
record.addColumn(new LongColumn(rs.getInt(i)));
break;
case COUNTER:
case BIGINT:
record.addColumn(new LongColumn(rs.getLong(i)));
break;
@ -558,26 +559,6 @@ public class CassandraReaderHelper {
String.format(
"配置信息有错误.列信息中需要包含'%s'字段 .",Key.COLUMN_NAME));
}
if( name.startsWith(Key.WRITE_TIME) ) {
String colName = name.substring(Key.WRITE_TIME.length(),name.length() - 1 );
ColumnMetadata col = tableMetadata.getColumn(colName);
if( col == null ) {
throw DataXException
.asDataXException(
CassandraReaderErrorCode.CONF_ERROR,
String.format(
"配置信息有错误.列'%s'不存在 .",colName));
}
} else {
ColumnMetadata col = tableMetadata.getColumn(name);
if( col == null ) {
throw DataXException
.asDataXException(
CassandraReaderErrorCode.CONF_ERROR,
String.format(
"配置信息有错误.列'%s'不存在 .",name));
}
}
}
}


@ -18,10 +18,10 @@ import java.util.UUID;
import com.alibaba.datax.common.element.Column;
import com.alibaba.datax.common.exception.DataXException;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONException;
import com.alibaba.fastjson.JSONObject;
import com.alibaba.fastjson2.JSON;
import com.alibaba.fastjson2.JSONArray;
import com.alibaba.fastjson2.JSONException;
import com.alibaba.fastjson2.JSONObject;
import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.CodecRegistry;
@ -204,7 +204,7 @@ public class CassandraWriterHelper {
case MAP: {
Map m = new HashMap();
for (JSONObject.Entry e : ((JSONObject)jsonObject).entrySet()) {
for (Map.Entry e : ((JSONObject)jsonObject).entrySet()) {
Object k = parseFromString((String) e.getKey(), type.getTypeArguments().get(0));
Object v = parseFromJson(e.getValue(), type.getTypeArguments().get(1));
m.put(k,v);
@ -233,7 +233,7 @@ public class CassandraWriterHelper {
case UDT: {
UDTValue t = ((UserType) type).newValue();
UserType userType = t.getType();
for (JSONObject.Entry e : ((JSONObject)jsonObject).entrySet()) {
for (Map.Entry e : ((JSONObject)jsonObject).entrySet()) {
DataType eleType = userType.getFieldType((String)e.getKey());
t.set((String)e.getKey(), parseFromJson(e.getValue(), eleType), registry.codecFor(eleType).getJavaType());
}


@ -0,0 +1,344 @@
# ClickhouseReader 插件文档
___
## 1 快速介绍
ClickhouseReader插件实现了从Clickhouse读取数据。在底层实现上ClickhouseReader通过JDBC连接远程Clickhouse数据库并执行相应的sql语句将数据从Clickhouse库中SELECT出来。
## 2 实现原理
简而言之ClickhouseReader通过JDBC连接器连接到远程的Clickhouse数据库并根据用户配置的信息生成查询SELECT SQL语句并发送到远程Clickhouse数据库并将该SQL执行返回结果使用DataX自定义的数据类型拼装为抽象的数据集并传递给下游Writer处理。
对于用户配置Table、Column、Where的信息ClickhouseReader将其拼接为SQL语句发送到Clickhouse数据库对于用户配置querySql信息ClickhouseReader直接将其发送到Clickhouse数据库。
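下面用一小段示意代码说明"根据 table、column、where 拼接 SELECT或直接使用 querySql"的大致逻辑(仅为理解用的草图,非插件源码,方法与变量名均为假设):

```java
// 示意:优先使用 querySql否则由 column/table/where 拼接查询语句
public class QueryBuildDemo {
    static String buildQuery(String querySql, String[] columns, String table, String where) {
        if (querySql != null && !querySql.isEmpty()) {
            return querySql;                       // 配置了 querySql 时table/column/where 被忽略
        }
        String sql = "select " + String.join(",", columns) + " from " + table;
        if (where != null && !where.isEmpty()) {
            sql += " where " + where;              // where 常用于增量同步的过滤条件
        }
        return sql;
    }
}
```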
## 3 功能说明
### 3.1 配置样例
* 配置一个从Clickhouse数据库同步抽取数据到本地的作业:
```
{
"job": {
"setting": {
"speed": {
//设置传输速度 byte/s 尽量逼近这个速度但是不高于它.
// channel 表示通道数量byte表示通道速度如果单通道速度1MB配置byte为1048576表示一个channel
"byte": 1048576
},
//出错限制
"errorLimit": {
//先选择record
"record": 0,
//百分比 1表示100%
"percentage": 0.02
}
},
"content": [
{
"reader": {
"name": "clickhousereader",
"parameter": {
// 数据库连接用户名
"username": "root",
// 数据库连接密码
"password": "root",
"column": [
"id","name"
],
"connection": [
{
"table": [
"table"
],
"jdbcUrl": [
"jdbc:clickhouse://[HOST_NAME]:PORT/[DATABASE_NAME]"
]
}
]
}
},
"writer": {
//writer类型
"name": "streamwriter",
// 是否打印内容
"parameter": {
"print": true
}
}
}
]
}
}
```
* 配置一个自定义SQL的数据库同步任务到本地内容的作业
```
{
"job": {
"setting": {
"speed": {
"channel": 5
}
},
"content": [
{
"reader": {
"name": "clickhousereader",
"parameter": {
"username": "root",
"password": "root",
"where": "",
"connection": [
{
"querySql": [
"select db_id,on_line_flag from db_info where db_id < 10"
],
"jdbcUrl": [
"jdbc:clickhouse://1.1.1.1:8123/default"
]
}
]
}
},
"writer": {
"name": "streamwriter",
"parameter": {
"visible": false,
"encoding": "UTF-8"
}
}
}
]
}
}
```
### 3.2 参数说明
* **jdbcUrl**
* 描述描述的是到对端数据库的JDBC连接信息使用JSON的数组描述并支持一个库填写多个连接地址。之所以使用JSON数组描述连接信息是因为阿里集团内部支持多个IP探测如果配置了多个ClickhouseReader可以依次探测ip的可连接性直到选择一个合法的IP。如果全部连接失败ClickhouseReader报错。 注意jdbcUrl必须包含在connection配置单元中。对于阿里集团外部使用情况JSON数组填写一个JDBC连接即可。
jdbcUrl按照Clickhouse官方规范并可以填写连接附加控制信息。具体请参看[Clickhouse官方文档](https://clickhouse.com/docs/en/engines/table-engines/integrations/jdbc)。
* 必选:是 <br />
* 默认值:无 <br />
* **username**
* 描述:数据源的用户名 <br />
* 必选:是 <br />
* 默认值:无 <br />
* **password**
* 描述:数据源指定用户名的密码 <br />
* 必选:是 <br />
* 默认值:无 <br />
* **table**
* 描述所选取的需要同步的表。使用JSON的数组描述因此支持多张表同时抽取。当配置为多张表时用户自己需保证多张表是同一schema结构ClickhouseReader不予检查表是否同一逻辑表。注意table必须包含在connection配置单元中。<br />
* 必选:是 <br />
* 默认值:无 <br />
* **column**
* 描述所配置的表中需要同步的列名集合使用JSON的数组描述字段信息。用户使用\*代表默认使用所有列配置,例如['\*']。
支持列裁剪,即列可以挑选部分列进行导出。
支持列换序即列可以不按照表schema信息进行导出。
支持常量配置用户需要按照JSON格式:
["id", "`table`", "1", "'bazhen.csy'", "null", "to_char(a + 1)", "2.3" , "true"]
id为普通列名\`table\`为包含保留字的列名1为整型数字常量'bazhen.csy'为字符串常量null为空指针to_char(a + 1)为表达式2.3为浮点数true为布尔值。
Column必须显式填写不允许为空
* 必选:是 <br />
* 默认值:无 <br />
* **splitPk**
* 描述ClickhouseReader进行数据抽取时如果指定splitPk表示用户希望使用splitPk代表的字段进行数据分片DataX因此会启动并发任务进行数据同步这样可以大大提高数据同步的效能。
推荐splitPk用户使用表主键因为表主键通常情况下比较均匀因此切分出来的分片也不容易出现数据热点。
目前splitPk仅支持整形数据切分`不支持浮点、日期等其他类型`。如果用户指定其他非支持类型ClickhouseReader将报错
splitPk如果不填写将视作用户不对单表进行切分ClickhouseReader使用单通道同步全量数据切分的大致过程可参考本节参数说明末尾的示意代码。
* 必选:否 <br />
* 默认值:无 <br />
* **where**
* 描述筛选条件MysqlReader根据指定的column、table、where条件拼接SQL并根据这个SQL进行数据抽取。在实际业务场景中往往会选择当天的数据进行同步可以将where条件指定为gmt_create > $bizdate 。注意不可以将where条件指定为limit 10limit不是SQL的合法where子句。<br />
where条件可以有效地进行业务增量同步。
* 必选:否 <br />
* 默认值:无 <br />
* **querySql**
* 描述在有些业务场景下where这一配置项不足以描述所筛选的条件用户可以通过该配置型来自定义筛选SQL。当用户配置了这一项之后DataX系统就会忽略tablecolumn这些配置型直接使用这个配置项的内容对数据进行筛选例如需要进行多表join后同步数据使用select a,b from table_a join table_b on table_a.id = table_b.id <br />
`当用户配置querySql时ClickhouseReader直接忽略table、column、where条件的配置`
* 必选:否 <br />
* 默认值:无 <br />
* **fetchSize**
* 描述该配置项定义了插件和数据库服务器端每次批量数据获取条数该值决定了DataX和服务器端的网络交互次数能够较大的提升数据抽取性能。<br />
`注意,该值过大(>2048)可能造成DataX进程OOM。`
* 必选:否 <br />
* 默认值1024 <br />
* **session**
* 描述:控制写入数据的时间格式,时区等的配置,如果表中有时间字段,配置该值以明确告知写入 clickhouse 的时间格式。通常配置的参数为NLS_DATE_FORMAT,NLS_TIME_FORMAT。其配置的值为 json 格式,例如:
```
"session": [
"alter session set NLS_DATE_FORMAT='yyyy-mm-dd hh24:mi:ss'",
"alter session set NLS_TIMESTAMP_FORMAT='yyyy-mm-dd hh24:mi:ss'",
"alter session set NLS_TIMESTAMP_TZ_FORMAT='yyyy-mm-dd hh24:mi:ss'",
"alter session set TIME_ZONE='US/Pacific'"
]
```
`(注意&quot;是 " 的转义字符串)`
* 必选:否 <br />
* 默认值:无 <br />
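作为对 splitPk 切分过程的示意(仅为概念草图,与插件内部实现不完全一致,表名与列名均为假设),可以按主键的最大、最小值把查询切成若干区间,分配给并发通道:

```java
import java.util.ArrayList;
import java.util.List;

// 示意:按整型 splitPk 的 [min, max] 区间均匀切分出若干查询,每个查询对应一个并发读取任务
public class SplitPkDemo {
    static List<String> split(String table, String pk, long min, long max, int channels) {
        List<String> queries = new ArrayList<>();
        long step = Math.max(1, (max - min + 1) / channels);
        for (long start = min; start <= max; start += step) {
            long end = Math.min(max, start + step - 1);
            queries.add(String.format("select * from %s where %s >= %d and %s <= %d",
                    table, pk, start, pk, end));
        }
        return queries;
    }
}
```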
### 3.3 类型转换
目前ClickhouseReader支持大部分Clickhouse类型但也存在部分个别类型没有支持的情况请注意检查你的类型。
下面列出ClickhouseReader针对Clickhouse类型转换列表:
| DataX 内部类型| Clickhouse 数据类型 |
| -------- |--------------------------------------------------------------------------------------------|
| Long | UInt8, UInt16, UInt32, UInt64, UInt128, UInt256, Int8, Int16, Int32, Int64, Int128, Int256 |
| Double | Float32, Float64, Decimal |
| String | String, FixedString |
| Date | DATE, Date32, DateTime, DateTime64 |
| Boolean | Boolean |
| Bytes | BLOB,BFILE,RAW,LONG RAW |
请注意:
* `除上述罗列字段类型外,其他类型均不支持`
## 4 性能报告
### 4.1 环境准备
#### 4.1.1 数据特征
为了模拟线上真实数据我们设计两个Clickhouse数据表分别为:
#### 4.1.2 机器参数
* 执行DataX的机器参数为:
* Clickhouse数据库机器参数为:
### 4.2 测试报告
#### 4.2.1 表1测试报告
| 并发任务数| DataX速度(Rec/s)|DataX流量|网卡流量|DataX运行负载|DB运行负载|
|--------| --------|--------|--------|--------|--------|
|1| DataX 统计速度(Rec/s)|DataX统计流量|网卡流量|DataX运行负载|DB运行负载|
## 5 约束限制
### 5.1 主备同步数据恢复问题
主备同步问题指Clickhouse使用主从灾备备库从主库不间断通过binlog恢复数据。由于主备数据同步存在一定的时间差特别在于某些特定情况例如网络延迟等问题导致备库同步恢复的数据与主库有较大差别导致从备库同步的数据不是一份当前时间的完整镜像。
针对这个问题我们提供了preSql功能该功能待补充。
### 5.2 一致性约束
Clickhouse在数据存储划分中属于RDBMS系统对外可以提供强一致性数据查询接口。例如当一次同步任务启动运行过程中当该库存在其他数据写入方写入数据时ClickhouseReader完全不会获取到写入更新数据这是由于数据库本身的快照特性决定的。关于数据库快照特性请参看[MVCC Wikipedia](https://en.wikipedia.org/wiki/Multiversion_concurrency_control)
上述是在ClickhouseReader单线程模型下数据同步一致性的特性由于ClickhouseReader可以根据用户配置信息使用了并发数据抽取因此不能严格保证数据一致性当ClickhouseReader根据splitPk进行数据切分后会先后启动多个并发任务完成数据同步。由于多个并发任务相互之间不属于同一个读事务同时多个并发任务存在时间间隔。因此这份数据并不是`完整的`、`一致的`数据快照信息。
针对多线程的一致性快照需求,在技术上目前无法实现,只能从工程角度解决,工程化的方式存在取舍,我们提供几个解决思路给用户,用户可以自行选择:
1. 使用单线程同步,即不再进行数据切片。缺点是速度比较慢,但是能够很好保证一致性。
2. 关闭其他数据写入方,保证当前数据为静态数据,例如,锁表、关闭备库同步等等。缺点是可能影响在线业务。
### 5.3 数据库编码问题
ClickhouseReader底层使用JDBC进行数据抽取JDBC天然适配各类编码并在底层进行了编码转换。因此ClickhouseReader不需用户指定编码可以自动获取编码并转码。
对于Clickhouse底层写入编码和其设定的编码不一致的混乱情况ClickhouseReader对此无法识别对此也无法提供解决方案对于这类情况`导出有可能为乱码`。
### 5.4 增量数据同步
ClickhouseReader使用JDBC SELECT语句完成数据抽取工作因此可以使用SELECT...WHERE...进行增量数据抽取,方式有多种:
* 数据库在线应用写入数据库时填充modify字段为更改时间戳包括新增、更新、删除(逻辑删)。对于这类应用ClickhouseReader只需要WHERE条件跟上一同步阶段时间戳即可。
* 对于新增流水型数据ClickhouseReader可以WHERE条件后跟上一阶段最大自增ID即可。
对于业务上无字段区分新增、修改数据情况ClickhouseReader也无法进行增量数据同步只能同步全量数据。
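一个增量抽取的示意片段(假设表中存在 modify_time 字段,字段名、表名与时间点均为举例):

```java
// 示意:记录上次同步成功的时间水位,本次只抽取其后的增量数据
public class IncrementalWhereDemo {
    public static void main(String[] args) {
        String lastSyncTime = "2024-04-13 00:00:00";   // 上一轮同步成功后保存的水位
        String where = "modify_time > '" + lastSyncTime + "'";
        // 将 where 写入 ClickhouseReader 的 where 配置项,或拼进 querySql
        System.out.println("select * from demo_tbl where " + where);
    }
}
```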
### 5.5 Sql安全性
ClickhouseReader提供querySql语句交给用户自己实现SELECT抽取语句ClickhouseReader本身对querySql不做任何安全性校验。这块交由DataX用户方自己保证。
## 6 FAQ
***
**Q: ClickhouseReader同步报错报错信息为XXX**
A: 网络或者权限问题请使用Clickhouse命令行测试
如果上述命令也报错那可以证实是环境问题请联系你的DBA。
**Q: ClickhouseReader抽取速度很慢怎么办**
A: 影响抽取时间的原因大概有如下几个:(来自专业 DBA 卫绾)
1. 由于SQL的plan异常导致的抽取时间长 在抽取时,尽可能使用全表扫描代替索引扫描;
2. 合理sql的并发度减少抽取时间
3. 抽取sql要简单尽量不用replace等函数这个非常消耗cpu会严重影响抽取速度;

clickhousereader/pom.xml

@ -0,0 +1,91 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<parent>
<artifactId>datax-all</artifactId>
<groupId>com.alibaba.datax</groupId>
<version>0.0.1-SNAPSHOT</version>
</parent>
<modelVersion>4.0.0</modelVersion>
<artifactId>clickhousereader</artifactId>
<name>clickhousereader</name>
<packaging>jar</packaging>
<dependencies>
<dependency>
<groupId>ru.yandex.clickhouse</groupId>
<artifactId>clickhouse-jdbc</artifactId>
<version>0.2.4</version>
</dependency>
<dependency>
<groupId>com.alibaba.datax</groupId>
<artifactId>datax-core</artifactId>
<version>${datax-project-version}</version>
</dependency>
<dependency>
<groupId>com.alibaba.datax</groupId>
<artifactId>datax-common</artifactId>
<version>${datax-project-version}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
</dependency>
<dependency>
<groupId>ch.qos.logback</groupId>
<artifactId>logback-classic</artifactId>
</dependency>
<dependency>
<groupId>com.alibaba.datax</groupId>
<artifactId>plugin-rdbms-util</artifactId>
<version>${datax-project-version}</version>
</dependency>
</dependencies>
<build>
<resources>
<resource>
<directory>src/main/java</directory>
<includes>
<include>**/*.properties</include>
</includes>
</resource>
</resources>
<plugins>
<!-- compiler plugin -->
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>${jdk-version}</source>
<target>${jdk-version}</target>
<encoding>${project-sourceEncoding}</encoding>
</configuration>
</plugin>
<!-- assembly plugin -->
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<descriptors>
<descriptor>src/main/assembly/package.xml</descriptor>
</descriptors>
<finalName>datax</finalName>
</configuration>
<executions>
<execution>
<id>dwzip</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>


@ -0,0 +1,35 @@
<assembly
xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0 http://maven.apache.org/xsd/assembly-1.1.0.xsd">
<id></id>
<formats>
<format>dir</format>
</formats>
<includeBaseDirectory>false</includeBaseDirectory>
<fileSets>
<fileSet>
<directory>src/main/resources</directory>
<includes>
<include>plugin.json</include>
<include>plugin_job_template.json</include>
</includes>
<outputDirectory>plugin/reader/clickhousereader</outputDirectory>
</fileSet>
<fileSet>
<directory>target/</directory>
<includes>
<include>clickhousereader-0.0.1-SNAPSHOT.jar</include>
</includes>
<outputDirectory>plugin/reader/clickhousereader</outputDirectory>
</fileSet>
</fileSets>
<dependencySets>
<dependencySet>
<useProjectArtifact>false</useProjectArtifact>
<outputDirectory>plugin/reader/clickhousereader/libs</outputDirectory>
<scope>runtime</scope>
</dependencySet>
</dependencySets>
</assembly>


@ -0,0 +1,85 @@
package com.alibaba.datax.plugin.reader.clickhousereader;
import java.sql.Array;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Types;
import java.util.List;
import com.alibaba.datax.common.element.Record;
import com.alibaba.datax.common.element.StringColumn;
import com.alibaba.datax.common.plugin.RecordSender;
import com.alibaba.datax.common.plugin.TaskPluginCollector;
import com.alibaba.datax.common.spi.Reader;
import com.alibaba.datax.common.util.Configuration;
import com.alibaba.datax.common.util.MessageSource;
import com.alibaba.datax.plugin.rdbms.reader.CommonRdbmsReader;
import com.alibaba.datax.plugin.rdbms.util.DataBaseType;
import com.alibaba.fastjson2.JSON;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class ClickhouseReader extends Reader {
private static final DataBaseType DATABASE_TYPE = DataBaseType.ClickHouse;
private static final Logger LOG = LoggerFactory.getLogger(ClickhouseReader.class);
public static class Job extends Reader.Job {
private Configuration jobConfig = null;
private CommonRdbmsReader.Job commonRdbmsReaderMaster;
@Override
public void init() {
this.jobConfig = super.getPluginJobConf();
this.commonRdbmsReaderMaster = new CommonRdbmsReader.Job(DATABASE_TYPE);
this.commonRdbmsReaderMaster.init(this.jobConfig);
}
@Override
public List<Configuration> split(int mandatoryNumber) {
return this.commonRdbmsReaderMaster.split(this.jobConfig, mandatoryNumber);
}
@Override
public void post() {
this.commonRdbmsReaderMaster.post(this.jobConfig);
}
@Override
public void destroy() {
this.commonRdbmsReaderMaster.destroy(this.jobConfig);
}
}
public static class Task extends Reader.Task {
private Configuration jobConfig;
private CommonRdbmsReader.Task commonRdbmsReaderSlave;
@Override
public void init() {
this.jobConfig = super.getPluginJobConf();
this.commonRdbmsReaderSlave = new CommonRdbmsReader.Task(DATABASE_TYPE, super.getTaskGroupId(), super.getTaskId());
this.commonRdbmsReaderSlave.init(this.jobConfig);
}
@Override
public void startRead(RecordSender recordSender) {
int fetchSize = this.jobConfig.getInt(com.alibaba.datax.plugin.rdbms.reader.Constant.FETCH_SIZE, 1000);
this.commonRdbmsReaderSlave.startRead(this.jobConfig, recordSender, super.getTaskPluginCollector(), fetchSize);
}
@Override
public void post() {
this.commonRdbmsReaderSlave.post(this.jobConfig);
}
@Override
public void destroy() {
this.commonRdbmsReaderSlave.destroy(this.jobConfig);
}
}
}


@ -0,0 +1,6 @@
{
"name": "clickhousereader",
"class": "com.alibaba.datax.plugin.reader.clickhousereader.ClickhouseReader",
"description": "useScene: prod. mechanism: Jdbc connection using the database, execute select sql.",
"developer": "alibaba"
}


@ -0,0 +1,16 @@
{
"name": "clickhousereader",
"parameter": {
"username": "username",
"password": "password",
"column": ["col1", "col2", "col3"],
"connection": [
{
"jdbcUrl": "jdbc:clickhouse://<host>:<port>[/<database>]",
"table": ["table1", "table2"]
}
],
"preSql": [],
"postSql": []
}
}


@ -0,0 +1,57 @@
{
"job": {
"setting": {
"speed": {
"channel": 5
}
},
"content": [
{
"reader": {
"name": "clickhousereader",
"parameter": {
"username": "XXXX",
"password": "XXXX",
"column": [
"uint8_col",
"uint16_col",
"uint32_col",
"uint64_col",
"int8_col",
"int16_col",
"int32_col",
"int64_col",
"float32_col",
"float64_col",
"bool_col",
"str_col",
"fixedstr_col",
"uuid_col",
"date_col",
"datetime_col",
"enum_col",
"ary_uint8_col",
"ary_str_col",
"tuple_col",
"nullable_col",
"nested_col.nested_id",
"nested_col.nested_str",
"ipv4_col",
"ipv6_col",
"decimal_col"
],
"connection": [
{
"table": [
"all_type_tbl"
],
"jdbcUrl":["jdbc:clickhouse://XXXX:8123/default"]
}
]
}
},
"writer": {}
}
]
}
}


@ -0,0 +1,34 @@
CREATE TABLE IF NOT EXISTS default.all_type_tbl
(
`uint8_col` UInt8,
`uint16_col` UInt16,
uint32_col UInt32,
uint64_col UInt64,
int8_col Int8,
int16_col Int16,
int32_col Int32,
int64_col Int64,
float32_col Float32,
float64_col Float64,
bool_col UInt8,
str_col String,
fixedstr_col FixedString(3),
uuid_col UUID,
date_col Date,
datetime_col DateTime,
enum_col Enum('hello' = 1, 'world' = 2),
ary_uint8_col Array(UInt8),
ary_str_col Array(String),
tuple_col Tuple(UInt8, String),
nullable_col Nullable(UInt8),
nested_col Nested
(
nested_id UInt32,
nested_str String
),
ipv4_col IPv4,
ipv6_col IPv6,
decimal_col Decimal(5,3)
)
ENGINE = MergeTree()
ORDER BY (uint8_col);


@ -10,8 +10,8 @@ import com.alibaba.datax.common.util.Configuration;
import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode;
import com.alibaba.datax.plugin.rdbms.util.DataBaseType;
import com.alibaba.datax.plugin.rdbms.writer.CommonRdbmsWriter;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson2.JSON;
import com.alibaba.fastjson2.JSONArray;
import java.sql.Array;
import java.sql.Connection;
@ -68,7 +68,7 @@ public class ClickhouseWriter extends Writer {
this.commonRdbmsWriterSlave = new CommonRdbmsWriter.Task(DATABASE_TYPE) {
@Override
protected PreparedStatement fillPreparedStatementColumnType(PreparedStatement preparedStatement, int columnIndex, int columnSqltype, Column column) throws SQLException {
protected PreparedStatement fillPreparedStatementColumnType(PreparedStatement preparedStatement, int columnIndex, int columnSqltype, String typeName, Column column) throws SQLException {
try {
if (column.getRawData() == null) {
preparedStatement.setNull(columnIndex + 1, columnSqltype);


@ -2,5 +2,5 @@
"name": "clickhousewriter",
"class": "com.alibaba.datax.plugin.writer.clickhousewriter.ClickhouseWriter",
"description": "useScene: prod. mechanism: Jdbc connection using the database, execute insert sql.",
"developer": "jiye.tjy"
"developer": "alibaba"
}


@ -17,8 +17,8 @@
<artifactId>commons-lang3</artifactId>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<groupId>com.alibaba.fastjson2</groupId>
<artifactId>fastjson2</artifactId>
</dependency>
<dependency>
<groupId>commons-io</groupId>


@ -1,6 +1,6 @@
package com.alibaba.datax.common.element;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson2.JSON;
import java.math.BigDecimal;
import java.math.BigInteger;


@ -5,6 +5,7 @@ import com.alibaba.datax.common.exception.DataXException;
import java.math.BigDecimal;
import java.math.BigInteger;
import java.sql.Time;
import java.util.Date;
/**
@ -12,18 +13,54 @@ import java.util.Date;
*/
public class DateColumn extends Column {
private DateType subType = DateType.DATETIME;
private DateType subType = DateType.DATETIME;
public static enum DateType {
DATE, TIME, DATETIME
}
private int nanos = 0;
/**
* 构建值为null的DateColumn使用Date子类型为DATETIME
* */
public DateColumn() {
this((Long)null);
}
private int precision = -1;
public static enum DateType {
DATE, TIME, DATETIME
}
/**
* 构建值为time(java.sql.Time)的DateColumn使用Date子类型为TIME只有时间没有日期
*/
public DateColumn(Time time, int nanos, int jdbcPrecision) {
this(time);
if (time != null) {
setNanos(nanos);
}
if (jdbcPrecision == 10) {
setPrecision(0);
}
if (jdbcPrecision >= 12 && jdbcPrecision <= 17) {
setPrecision(jdbcPrecision - 11);
}
}
public long getNanos() {
return nanos;
}
public void setNanos(int nanos) {
this.nanos = nanos;
}
public int getPrecision() {
return precision;
}
public void setPrecision(int precision) {
this.precision = precision;
}
/**
* 构建值为null的DateColumn使用Date子类型为DATETIME
*/
public DateColumn() {
this((Long) null);
}
/**
* 构建值为stamp(Unix时间戳)的DateColumn使用Date子类型为DATETIME


@ -31,7 +31,6 @@ public class PerfTrace {
private int taskGroupId;
private int channelNumber;
private int priority;
private int batchSize = 500;
private volatile boolean perfReportEnable = true;
@ -54,12 +53,12 @@ public class PerfTrace {
* @param taskGroupId
* @return
*/
public static PerfTrace getInstance(boolean isJob, long jobId, int taskGroupId, int priority, boolean enable) {
public static PerfTrace getInstance(boolean isJob, long jobId, int taskGroupId, boolean enable) {
if (instance == null) {
synchronized (lock) {
if (instance == null) {
instance = new PerfTrace(isJob, jobId, taskGroupId, priority, enable);
instance = new PerfTrace(isJob, jobId, taskGroupId, enable);
}
}
}
@ -76,22 +75,21 @@ public class PerfTrace {
LOG.error("PerfTrace instance not be init! must have some error! ");
synchronized (lock) {
if (instance == null) {
instance = new PerfTrace(false, -1111, -1111, 0, false);
instance = new PerfTrace(false, -1111, -1111, false);
}
}
}
return instance;
}
private PerfTrace(boolean isJob, long jobId, int taskGroupId, int priority, boolean enable) {
private PerfTrace(boolean isJob, long jobId, int taskGroupId, boolean enable) {
try {
this.perfTraceId = isJob ? "job_" + jobId : String.format("taskGroup_%s_%s", jobId, taskGroupId);
this.enable = enable;
this.isJob = isJob;
this.taskGroupId = taskGroupId;
this.instId = jobId;
this.priority = priority;
LOG.info(String.format("PerfTrace traceId=%s, isEnable=%s, priority=%s", this.perfTraceId, this.enable, this.priority));
LOG.info(String.format("PerfTrace traceId=%s, isEnable=%s", this.perfTraceId, this.enable));
} catch (Exception e) {
// do nothing
@ -398,7 +396,6 @@ public class PerfTrace {
jdo.setWindowEnd(this.windowEnd);
jdo.setJobStartTime(jobStartTime);
jdo.setJobRunTimeMs(System.currentTimeMillis() - jobStartTime.getTime());
jdo.setJobPriority(this.priority);
jdo.setChannelNum(this.channelNumber);
jdo.setCluster(this.cluster);
jdo.setJobDomain(this.jobDomain);
@ -609,7 +606,6 @@ public class PerfTrace {
private Date jobStartTime;
private Date jobEndTime;
private Long jobRunTimeMs;
private Integer jobPriority;
private Integer channelNum;
private String cluster;
private String jobDomain;
@ -680,10 +676,6 @@ public class PerfTrace {
return jobRunTimeMs;
}
public Integer getJobPriority() {
return jobPriority;
}
public Integer getChannelNum() {
return channelNum;
}
@ -816,10 +808,6 @@ public class PerfTrace {
this.jobRunTimeMs = jobRunTimeMs;
}
public void setJobPriority(Integer jobPriority) {
this.jobPriority = jobPriority;
}
public void setChannelNum(Integer channelNum) {
this.channelNum = channelNum;
}


@ -77,8 +77,8 @@ public class VMInfo {
garbageCollectorMXBeanList = java.lang.management.ManagementFactory.getGarbageCollectorMXBeans();
memoryPoolMXBeanList = java.lang.management.ManagementFactory.getMemoryPoolMXBeans();
osInfo = runtimeMXBean.getVmVendor() + " " + runtimeMXBean.getSpecVersion() + " " + runtimeMXBean.getVmVersion();
jvmInfo = osMXBean.getName() + " " + osMXBean.getArch() + " " + osMXBean.getVersion();
jvmInfo = runtimeMXBean.getVmVendor() + " " + runtimeMXBean.getSpecVersion() + " " + runtimeMXBean.getVmVersion();
osInfo = osMXBean.getName() + " " + osMXBean.getArch() + " " + osMXBean.getVersion();
totalProcessorCount = osMXBean.getAvailableProcessors();
//构建startPhyOSStatus


@ -3,8 +3,8 @@ package com.alibaba.datax.common.util;
import com.alibaba.datax.common.exception.CommonErrorCode;
import com.alibaba.datax.common.exception.DataXException;
import com.alibaba.datax.common.spi.ErrorCode;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.serializer.SerializerFeature;
import com.alibaba.fastjson2.JSON;
import com.alibaba.fastjson2.JSONWriter;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.CharUtils;
import org.apache.commons.lang3.StringUtils;
@ -411,6 +411,15 @@ public class Configuration {
return list;
}
public <T> List<T> getListWithJson(final String path, Class<T> t) {
Object object = this.get(path, List.class);
if (null == object) {
return null;
}
return JSON.parseArray(JSON.toJSONString(object),t);
}
/**
* 根据用户提供的json path寻址List对象如果对象不存在返回null
*/
@ -577,7 +586,7 @@ public class Configuration {
*/
public String beautify() {
return JSON.toJSONString(this.getInternal(),
SerializerFeature.PrettyFormat);
JSONWriter.Feature.PrettyFormat);
}
/**


@ -1,62 +0,0 @@
package com.alibaba.datax.common.util;
import java.util.Map;
import org.apache.commons.lang3.StringUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.alibaba.datax.common.exception.DataXException;
public class IdAndKeyRollingUtil {
private static Logger LOGGER = LoggerFactory.getLogger(IdAndKeyRollingUtil.class);
public static final String SKYNET_ACCESSID = "SKYNET_ACCESSID";
public static final String SKYNET_ACCESSKEY = "SKYNET_ACCESSKEY";
public final static String ACCESS_ID = "accessId";
public final static String ACCESS_KEY = "accessKey";
public static String parseAkFromSkynetAccessKey() {
Map<String, String> envProp = System.getenv();
String skynetAccessID = envProp.get(IdAndKeyRollingUtil.SKYNET_ACCESSID);
String skynetAccessKey = envProp.get(IdAndKeyRollingUtil.SKYNET_ACCESSKEY);
String accessKey = null;
// follow 原有的判断条件
// 环境变量中如果存在SKYNET_ACCESSID/SKYNET_ACCESSKEy只要有其中一个变量则认为一定是两个都存在的
// if (StringUtils.isNotBlank(skynetAccessID) ||
// StringUtils.isNotBlank(skynetAccessKey)) {
// 检查严格只有加密串不为空的时候才进去不过 之前能跑的加密串都不应该为空
if (StringUtils.isNotBlank(skynetAccessKey)) {
LOGGER.info("Try to get accessId/accessKey from environment SKYNET_ACCESSKEY.");
accessKey = DESCipher.decrypt(skynetAccessKey);
if (StringUtils.isBlank(accessKey)) {
// 环境变量里面有但是解析不到
throw DataXException.asDataXException(String.format(
"Failed to get the [accessId]/[accessKey] from the environment variable. The [accessId]=[%s]",
skynetAccessID));
}
}
if (StringUtils.isNotBlank(accessKey)) {
LOGGER.info("Get accessId/accessKey from environment variables SKYNET_ACCESSKEY successfully.");
}
return accessKey;
}
public static String getAccessIdAndKeyFromEnv(Configuration originalConfig) {
String accessId = null;
Map<String, String> envProp = System.getenv();
accessId = envProp.get(IdAndKeyRollingUtil.SKYNET_ACCESSID);
String accessKey = null;
if (StringUtils.isBlank(accessKey)) {
// 老的没有出异常只是获取不到ak
accessKey = IdAndKeyRollingUtil.parseAkFromSkynetAccessKey();
}
if (StringUtils.isNotBlank(accessKey)) {
// 确认使用这个的都是 accessIdaccessKey的命名习惯
originalConfig.set(IdAndKeyRollingUtil.ACCESS_ID, accessId);
originalConfig.set(IdAndKeyRollingUtil.ACCESS_KEY, accessKey);
}
return accessKey;
}
}

View File

@ -0,0 +1,34 @@
package com.alibaba.datax.common.util;
import org.apache.commons.lang3.StringUtils;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
/**
* @author jitongchen
* @date 2023/9/7 9:47 AM
*/
public class LimitLogger {
// shared across task threads, so use a thread-safe map
private static final Map<String, Long> lastPrintTime = new ConcurrentHashMap<>();
public static void limit(String name, long limit, LoggerFunction function) {
if (StringUtils.isBlank(name)) {
name = "__all__";
}
if (limit <= 0) {
function.apply();
} else {
if (!lastPrintTime.containsKey(name)) {
lastPrintTime.put(name, System.currentTimeMillis());
function.apply();
} else {
if (System.currentTimeMillis() > lastPrintTime.get(name) + limit) {
lastPrintTime.put(name, System.currentTimeMillis());
function.apply();
}
}
}
}
}

View File

@ -0,0 +1,10 @@
package com.alibaba.datax.common.util;
/**
* @author molin.lxd
* @date 2021-05-09
*/
public interface LoggerFunction {
void apply();
}
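Taken together, the two new classes above form a simple rate-limited logger: `LimitLogger.limit` runs the supplied `LoggerFunction` at most once per `limit` milliseconds for a given key. A usage sketch follows; the key name, interval and log message are made up for illustration.

```java
import com.alibaba.datax.common.util.LimitLogger;
import com.alibaba.datax.common.util.LoggerFunction;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LimitLoggerDemo {
    private static final Logger LOG = LoggerFactory.getLogger(LimitLoggerDemo.class);

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 100; i++) {
            // "dirtyRecord" is an arbitrary key; at most one warning is emitted per 5 seconds for it.
            LimitLogger.limit("dirtyRecord", 5000L, new LoggerFunction() {
                @Override
                public void apply() {
                    LOG.warn("too many dirty records, sampling this warning");
                }
            });
            Thread.sleep(100);
        }
    }
}
```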

View File

@ -3,6 +3,8 @@ package com.alibaba.datax.common.util;
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.lang3.Validate;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.text.DecimalFormat;
import java.util.HashMap;
import java.util.Map;
@ -82,4 +84,20 @@ public class StrUtil {
return s.substring(0, headLength) + "..." + s.substring(s.length() - tailLength);
}
public static String getMd5(String plainText) {
try {
StringBuilder builder = new StringBuilder();
for (byte b : MessageDigest.getInstance("MD5").digest(plainText.getBytes())) {
int i = b & 0xff;
if (i < 0x10) {
builder.append('0');
}
builder.append(Integer.toHexString(i));
}
return builder.toString();
} catch (NoSuchAlgorithmException e) {
throw new RuntimeException(e);
}
}
}
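A quick sanity check of the new `getMd5` helper, using the standard RFC 1321 test vector for "abc"; the digest comes back as lower-case hex. Note that `plainText.getBytes()` uses the platform default charset, which only matters for non-ASCII input.

```java
import com.alibaba.datax.common.util.StrUtil;

public class Md5Demo {
    public static void main(String[] args) {
        // Prints 900150983cd24fb0d6963f7d28e17f72, the RFC 1321 test vector for "abc".
        System.out.println(StrUtil.getMd5("abc"));
    }
}
```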

View File

@ -41,7 +41,7 @@
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.5</version>
<version>4.5.13</version>
</dependency>
<dependency>
<groupId>org.apache.httpcomponents</groupId>

View File

@ -79,16 +79,9 @@ public class Engine {
perfReportEnable = false;
}
int priority = 0;
try {
priority = Integer.parseInt(System.getenv("SKYNET_PRIORITY"));
}catch (NumberFormatException e){
LOG.warn("prioriy set to 0, because NumberFormatException, the value is: "+System.getProperty("PROIORY"));
}
Configuration jobInfoConfig = allConf.getConfiguration(CoreConstant.DATAX_JOB_JOBINFO);
//初始化PerfTrace
PerfTrace perfTrace = PerfTrace.getInstance(isJob, instanceId, taskGroupId, priority, traceEnable);
PerfTrace perfTrace = PerfTrace.getInstance(isJob, instanceId, taskGroupId, traceEnable);
perfTrace.setJobInfo(jobInfoConfig,perfReportEnable,channelNumber);
container.start();

View File

@ -114,7 +114,7 @@ public final class JobAssignUtil {
* 需要实现的效果通过例子来说是
* <pre>
* a 库上有表0, 1, 2
* a 库上有表3, 4
* b 库上有表3, 4
* c 库上有表5, 6, 7
*
* 如果有 4个 taskGroup

View File

@ -27,7 +27,7 @@ import com.alibaba.datax.core.util.container.ClassLoaderSwapper;
import com.alibaba.datax.core.util.container.CoreConstant;
import com.alibaba.datax.core.util.container.LoadUtil;
import com.alibaba.datax.dataxservice.face.domain.enums.ExecuteMode;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson2.JSON;
import org.apache.commons.lang.StringUtils;
import org.apache.commons.lang.Validate;
import org.slf4j.Logger;

View File

@ -2,7 +2,7 @@ package com.alibaba.datax.core.statistics.communication;
import com.alibaba.datax.common.statistics.PerfTrace;
import com.alibaba.datax.common.util.StrUtil;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson2.JSON;
import org.apache.commons.lang.Validate;
import java.text.DecimalFormat;

View File

@ -6,7 +6,7 @@ import com.alibaba.datax.common.util.Configuration;
import com.alibaba.datax.core.statistics.communication.Communication;
import com.alibaba.datax.core.util.container.CoreConstant;
import com.alibaba.datax.core.statistics.plugin.task.util.DirtyRecord;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson2.JSON;
import org.apache.commons.lang3.StringUtils;
import org.slf4j.Logger;

View File

@ -4,7 +4,7 @@ import com.alibaba.datax.common.element.Column;
import com.alibaba.datax.common.element.Record;
import com.alibaba.datax.common.exception.DataXException;
import com.alibaba.datax.core.util.FrameworkErrorCode;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson2.JSON;
import java.math.BigDecimal;
import java.math.BigInteger;

View File

@ -27,7 +27,7 @@ import com.alibaba.datax.core.util.TransformerUtil;
import com.alibaba.datax.core.util.container.CoreConstant;
import com.alibaba.datax.core.util.container.LoadUtil;
import com.alibaba.datax.dataxservice.face.domain.enums.State;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson2.JSON;
import org.apache.commons.lang3.Validate;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

View File

@ -29,7 +29,7 @@ public class MemoryChannel extends Channel {
private ReentrantLock lock;
private Condition notInsufficient, notEmpty;
private Condition notSufficient, notEmpty;
public MemoryChannel(final Configuration configuration) {
super(configuration);
@ -37,7 +37,7 @@ public class MemoryChannel extends Channel {
this.bufferSize = configuration.getInt(CoreConstant.DATAX_CORE_TRANSPORT_EXCHANGER_BUFFERSIZE);
lock = new ReentrantLock();
notInsufficient = lock.newCondition();
notSufficient = lock.newCondition();
notEmpty = lock.newCondition();
}
@ -75,7 +75,7 @@ public class MemoryChannel extends Channel {
lock.lockInterruptibly();
int bytes = getRecordBytes(rs);
while (memoryBytes.get() + bytes > this.byteCapacity || rs.size() > this.queue.remainingCapacity()) {
notInsufficient.await(200L, TimeUnit.MILLISECONDS);
notSufficient.await(200L, TimeUnit.MILLISECONDS);
}
this.queue.addAll(rs);
waitWriterTime += System.nanoTime() - startTime;
@ -116,7 +116,7 @@ public class MemoryChannel extends Channel {
waitReaderTime += System.nanoTime() - startTime;
int bytes = getRecordBytes(rs);
memoryBytes.addAndGet(-bytes);
notInsufficient.signalAll();
notSufficient.signalAll();
} catch (InterruptedException e) {
throw DataXException.asDataXException(
FrameworkErrorCode.RUNTIME_ERROR, e);

View File

@ -5,7 +5,7 @@ import com.alibaba.datax.common.element.Record;
import com.alibaba.datax.common.exception.DataXException;
import com.alibaba.datax.core.util.ClassSize;
import com.alibaba.datax.core.util.FrameworkErrorCode;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson2.JSON;
import java.util.ArrayList;
import java.util.HashMap;

View File

@ -0,0 +1,87 @@
package com.alibaba.datax.core.transport.transformer;
import com.alibaba.datax.common.element.Column;
import com.alibaba.datax.common.element.Record;
import com.alibaba.datax.common.element.StringColumn;
import com.alibaba.datax.common.exception.DataXException;
import com.alibaba.datax.transformer.Transformer;
import org.apache.commons.codec.digest.DigestUtils;
import org.apache.commons.lang.StringUtils;
import java.util.Arrays;
/**
* no comments.
*
* @author XuDaojie
* @since 2021-08-16
*/
public class DigestTransformer extends Transformer {
private static final String MD5 = "md5";
private static final String SHA1 = "sha1";
private static final String TO_UPPER_CASE = "toUpperCase";
private static final String TO_LOWER_CASE = "toLowerCase";
public DigestTransformer() {
setTransformerName("dx_digest");
}
@Override
public Record evaluate(Record record, Object... paras) {
int columnIndex;
String type;
String charType;
try {
if (paras.length != 3) {
throw new RuntimeException("dx_digest paras length must be 3");
}
columnIndex = (Integer) paras[0];
type = (String) paras[1];
charType = (String) paras[2];
if (!StringUtils.equalsIgnoreCase(MD5, type) && !StringUtils.equalsIgnoreCase(SHA1, type)) {
throw new RuntimeException("dx_digest paras index 1 must be md5 or sha1");
}
if (!StringUtils.equalsIgnoreCase(TO_UPPER_CASE, charType) && !StringUtils.equalsIgnoreCase(TO_LOWER_CASE, charType)) {
throw new RuntimeException("dx_digest paras index 2 must be toUpperCase or toLowerCase");
}
} catch (Exception e) {
throw DataXException.asDataXException(TransformerErrorCode.TRANSFORMER_ILLEGAL_PARAMETER, "paras:" + Arrays.asList(paras) + " => " + e.getMessage());
}
Column column = record.getColumn(columnIndex);
try {
String oriValue = column.asString();
// 如果字段为空作为空字符串处理
if (oriValue == null) {
oriValue = "";
}
String newValue;
if (MD5.equals(type)) {
newValue = DigestUtils.md5Hex(oriValue);
} else {
newValue = DigestUtils.sha1Hex(oriValue);
}
if (TO_UPPER_CASE.equals(charType)) {
newValue = newValue.toUpperCase();
} else {
newValue = newValue.toLowerCase();
}
record.setColumn(columnIndex, new StringColumn(newValue));
} catch (Exception e) {
throw DataXException.asDataXException(TransformerErrorCode.TRANSFORMER_RUN_EXCEPTION, e.getMessage(), e);
}
return record;
}
}
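A sketch of driving the new `dx_digest` transformer directly from Java, assuming datax-core's `DefaultRecord` for building a record; the column index and parameters are illustrative. In a real job the same three parameters would be supplied through the job file's `transformer` section rather than by calling `evaluate` by hand.

```java
import com.alibaba.datax.common.element.Record;
import com.alibaba.datax.common.element.StringColumn;
import com.alibaba.datax.core.transport.record.DefaultRecord;
import com.alibaba.datax.core.transport.transformer.DigestTransformer;

public class DigestTransformerDemo {
    public static void main(String[] args) {
        Record record = new DefaultRecord();
        record.addColumn(new StringColumn("hello"));

        // paras: column index 0, digest algorithm "md5", output case "toLowerCase"
        Record out = new DigestTransformer().evaluate(record, 0, "md5", "toLowerCase");
        System.out.println(out.getColumn(0).asString());
    }
}
```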

View File

@ -61,7 +61,7 @@ public class FilterTransformer extends Transformer {
} else if (code.equalsIgnoreCase("<=")) {
return doLess(record, value, column, true);
} else {
throw new RuntimeException("dx_filter can't suport code:" + code);
throw new RuntimeException("dx_filter can't support code:" + code);
}
} catch (Exception e) {
throw DataXException.asDataXException(TransformerErrorCode.TRANSFORMER_RUN_EXCEPTION, e.getMessage(), e);

View File

@ -1,10 +1,18 @@
package com.alibaba.datax.core.transport.transformer;
import org.apache.commons.codec.digest.DigestUtils;
/**
* GroovyTransformer的帮助类供groovy代码使用必须全是static的方法
* Created by liqiang on 16/3/4.
*/
public class GroovyTransformerStaticUtil {
public static String md5(final String data) {
return DigestUtils.md5Hex(data);
}
public static String sha1(final String data) {
return DigestUtils.sha1Hex(data);
}
}

View File

@ -36,6 +36,7 @@ public class TransformerRegistry {
registTransformer(new ReplaceTransformer());
registTransformer(new FilterTransformer());
registTransformer(new GroovyTransformer());
registTransformer(new DigestTransformer());
}
public static void loadTransformerFromLocalStorage() {

View File

@ -168,6 +168,7 @@ public final class ConfigParser {
boolean isDefaultPath = StringUtils.isBlank(pluginPath);
if (isDefaultPath) {
configuration.set("path", path);
configuration.set("loadType","jarLoader");
}
Configuration result = Configuration.newDefault();

View File

@ -105,7 +105,7 @@ public class CoreConstant {
public static final String DATAX_JOB_POSTHANDLER_PLUGINNAME = "job.postHandler.pluginName";
// ----------------------------- 局部使用的变量
public static final String JOB_WRITER = "reader";
public static final String JOB_WRITER = "writer";
public static final String JOB_READER = "reader";

View File

@ -15,7 +15,7 @@ import java.util.List;
/**
* 提供Jar隔离的加载机制会把传入的路径及其子路径以及路径中的jar文件加入到class path
*/
public class JarLoader extends URLClassLoader {
public class JarLoader extends URLClassLoader{
public JarLoader(String[] paths) {
this(paths, JarLoader.class.getClassLoader());
}

View File

@ -49,7 +49,7 @@ public class LoadUtil {
/**
* jarLoader的缓冲
*/
private static Map<String, JarLoader> jarLoaderCenter = new HashMap<String, JarLoader>();
private static Map<String, JarLoader> jarLoaderCenter = new HashMap<>();
/**
* 设置pluginConfigs方便后面插件来获取

View File

@ -2,7 +2,7 @@
"job": {
"setting": {
"speed": {
"byte":10485760
"channel":1
},
"errorLimit": {
"record": 0,

View File

@ -0,0 +1,183 @@
# DataX DatabendWriter
[简体中文](./databendwriter-CN.md) | [English](./databendwriter.md)
## 1 快速介绍
Databend Writer 是一个 DataX 的插件,用于从 DataX 中写入数据到 Databend 表中。
该插件基于[databend JDBC driver](https://github.com/databendcloud/databend-jdbc) ,它使用 [RESTful http protocol](https://databend.rs/doc/integrations/api/rest)
在开源的 databend 和 [databend cloud](https://app.databend.com/) 上执行查询。
在每个写入批次中databend writer 将批量数据上传到内部的 S3 stage然后执行相应的 insert SQL 将数据上传到 databend 表中。
为了最佳的用户体验,如果您使用的是 databend 社区版本,您应该尝试采用 [S3](https://aws.amazon.com/s3/)/[minio](https://min.io/)/[OSS](https://www.alibabacloud.com/product/object-storage-service) 作为其底层存储层,因为
它们支持预签名上传操作,否则您可能会在数据传输上浪费不必要的成本。
您可以在[文档](https://databend.rs/doc/deploy/deploying-databend)中了解更多详细信息
## 2 实现原理
Databend Writer 将使用 DataX 从 DataX Reader 中获取生成的记录,并将记录批量插入到 databend 表中指定的列中。
## 3 功能说明
### 3.1 配置样例
* 以下配置将从内存中读取一些生成的数据并将数据上传到databend表中
#### 准备工作
```sql
--- create table in databend
drop table if exists datax.sample1;
drop database if exists datax;
create database if not exists datax;
create table if not exists datax.sample1(a string, b int64, c date, d timestamp, e bool, f string, g variant);
```
#### 配置样例
```json
{
"job": {
"content": [
{
"reader": {
"name": "streamreader",
"parameter": {
"column" : [
{
"value": "DataX",
"type": "string"
},
{
"value": 19880808,
"type": "long"
},
{
"value": "1926-08-08 08:08:08",
"type": "date"
},
{
"value": "1988-08-08 08:08:08",
"type": "date"
},
{
"value": true,
"type": "bool"
},
{
"value": "test",
"type": "bytes"
},
{
"value": "{\"type\": \"variant\", \"value\": \"test\"}",
"type": "string"
}
],
"sliceRecordCount": 10000
}
},
"writer": {
"name": "databendwriter",
"parameter": {
"writeMode": "replace",
"onConflictColumn": ["id"],
"username": "databend",
"password": "databend",
"column": ["a", "b", "c", "d", "e", "f", "g"],
"batchSize": 1000,
"preSql": [
],
"postSql": [
],
"connection": [
{
"jdbcUrl": "jdbc:databend://localhost:8000/datax",
"table": [
"sample1"
]
}
]
}
}
}
],
"setting": {
"speed": {
"channel": 1
}
}
}
}
```
### 3.2 参数说明
* jdbcUrl
* 描述: JDBC 数据源 url。请参阅仓库中的详细[文档](https://github.com/databendcloud/databend-jdbc)
* 必选: 是
* 默认值: 无
* 示例: jdbc:databend://localhost:8000/datax
* username
* 描述: JDBC 数据源用户名
* 必选: 是
* 默认值: 无
* 示例: databend
* password
* 描述: JDBC 数据源密码
* 必选: 是
* 默认值: 无
* 示例: databend
* table
* 描述: 表名的集合table应该包含column参数中的所有列。
* 必选: 是
* 默认值: 无
* 示例: ["sample1"]
* column
* 描述: 表中的列名集合字段顺序应该与reader的record中的column类型对应
* 必选: 是
* 默认值: 无
* 示例: ["a", "b", "c", "d", "e", "f", "g"]
* batchSize
* 描述: 每个批次的记录数
* 必选: 否
* 默认值: 1000
* 示例: 1000
* preSql
* 描述: 在写入数据之前执行的SQL语句
* 必选: 否
* 默认值: 无
* 示例: ["delete from datax.sample1"]
* postSql
* 描述: 在写入数据之后执行的SQL语句
* 必选: 否
* 默认值: 无
* 示例: ["select count(*) from datax.sample1"]
* writeMode
* 描述:写入模式,支持 insert 和 replace 两种模式,默认为 insert。若为 replace,务必填写 onConflictColumn 参数
* 必选:否
* 默认值insert
* 示例:"replace"
* onConflictColumn
* 描述:on conflict 字段,指定 writeMode 为 replace 后,需要此参数
* 必选:否
* 默认值:无
* 示例:["id","user"]
### 3.3 类型转化
DataX中的数据类型可以转换为databend中的相应数据类型。下表显示了两种类型之间的对应关系。
| DataX 内部类型 | Databend 数据类型 |
|------------|-----------------------------------------------------------|
| INT | TINYINT, INT8, SMALLINT, INT16, INT, INT32, BIGINT, INT64 |
| LONG | TINYINT, INT8, SMALLINT, INT16, INT, INT32, BIGINT, INT64 |
| STRING | STRING, VARCHAR |
| DOUBLE | FLOAT, DOUBLE |
| BOOL | BOOLEAN, BOOL |
| DATE | DATE, TIMESTAMP |
| BYTES | STRING, VARCHAR |
## 4 性能测试
## 5 约束限制
目前复杂数据类型支持不稳定,如果您想使用复杂数据类型,例如元组、数组,请检查databend和jdbc驱动程序的进一步版本。
## FAQ

View File

@ -0,0 +1,176 @@
# DataX DatabendWriter
[简体中文](./databendwriter-CN.md) | [English](./databendwriter.md)
## 1 Introduction
Databend Writer is a DataX plugin that writes DataX records into a Databend table.
The plugin is based on the [databend JDBC driver](https://github.com/databendcloud/databend-jdbc), which uses the [RESTful http protocol](https://databend.rs/doc/integrations/api/rest)
to execute queries on open source databend and [databend cloud](https://app.databend.com/).
During each write batch, databend writer uploads the batch data into an internal S3 stage and then executes the corresponding insert SQL to load the data into the databend table.
For the best experience, if you are using the databend community distribution, you should adopt [S3](https://aws.amazon.com/s3/)/[minio](https://min.io/)/[OSS](https://www.alibabacloud.com/product/object-storage-service) as its underlying storage layer, since
they support presigned upload operations; otherwise you may incur unnecessary data-transfer costs.
You can find more details in the [doc](https://databend.rs/doc/deploy/deploying-databend)
## 2 Detailed Implementation
Databend Writer uses DataX to fetch records generated by a DataX Reader, and then batch-inserts the records into the designated columns of your databend table.
## 3 Features
### 3.1 Example Configurations
* The following configuration reads some generated data in memory and uploads it into a databend table
#### Preparation
```sql
--- create table in databend
drop table if exists datax.sample1;
drop database if exists datax;
create database if not exists datax;
create table if not exists datax.sample1(a string, b int64, c date, d timestamp, e bool, f string, g variant);
```
#### Configurations
```json
{
"job": {
"content": [
{
"reader": {
"name": "streamreader",
"parameter": {
"column" : [
{
"value": "DataX",
"type": "string"
},
{
"value": 19880808,
"type": "long"
},
{
"value": "1926-08-08 08:08:08",
"type": "date"
},
{
"value": "1988-08-08 08:08:08",
"type": "date"
},
{
"value": true,
"type": "bool"
},
{
"value": "test",
"type": "bytes"
},
{
"value": "{\"type\": \"variant\", \"value\": \"test\"}",
"type": "string"
}
],
"sliceRecordCount": 10000
}
},
"writer": {
"name": "databendwriter",
"parameter": {
"username": "databend",
"password": "databend",
"column": ["a", "b", "c", "d", "e", "f", "g"],
"batchSize": 1000,
"preSql": [
],
"postSql": [
],
"connection": [
{
"jdbcUrl": "jdbc:databend://localhost:8000/datax",
"table": [
"sample1"
]
}
]
}
}
}
],
"setting": {
"speed": {
"channel": 1
}
}
}
}
```
### 3.2 Configuration Description
* jdbcUrl
* Description: JDBC Data source url in Databend. Please take a look at repository for detailed [doc](https://github.com/databendcloud/databend-jdbc)
* Required: yes
* Default: none
* Example: jdbc:databend://localhost:8000/datax
* username
* Description: Databend user name
* Required: yes
* Default: none
* Example: databend
* password
* Description: Databend user password
* Required: yes
* Default: none
* Example: databend
* table
* Description: A list of table names that should contain all of the columns in the column parameter.
* Required: yes
* Default: none
* Example: ["sample1"]
* column
* Description: A list of column field names that should be inserted into the table. if you want to insert all column fields use `["*"]` instead.
* Required: yes
* Default: none
* Example: ["a", "b", "c", "d", "e", "f", "g"]
* batchSize
* Description: The number of records to be inserted in each batch.
* Required: no
* Default: 1024
* preSql
* Description: A list of SQL statements that will be executed before the write operation.
* Required: no
* Default: none
* postSql
* Description: A list of SQL statements that will be executed after the write operation.
* Required: no
* Default: none
* writeMode
* Description: The write mode; supports the `insert` and `replace` modes. When using `replace`, the `onConflictColumn` parameter is required.
* Required: no
* Default: insert
* Example: "replace"
* onConflictColumn
* Description: The list of conflict columns used by `replace` mode.
* Required: no
* Default: none
* Example: ["id","user"]
### 3.3 Type Convert
Data types in datax can be converted to the corresponding data types in databend. The following table shows the correspondence between the two types.
| DataX Type | Databend Type |
|------------|-----------------------------------------------------------|
| INT | TINYINT, INT8, SMALLINT, INT16, INT, INT32, BIGINT, INT64 |
| LONG | TINYINT, INT8, SMALLINT, INT16, INT, INT32, BIGINT, INT64 |
| STRING | STRING, VARCHAR |
| DOUBLE | FLOAT, DOUBLE |
| BOOL | BOOLEAN, BOOL |
| DATE | DATE, TIMESTAMP |
| BYTES | STRING, VARCHAR |
## 4 Performance Test
## 5 Restrictions
Currently, complex data type support is not stable. If you want to use complex data types such as tuple or array, please check later releases of databend and the jdbc driver.
## FAQ

101
databendwriter/pom.xml Normal file
View File

@ -0,0 +1,101 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<parent>
<artifactId>datax-all</artifactId>
<groupId>com.alibaba.datax</groupId>
<version>0.0.1-SNAPSHOT</version>
</parent>
<modelVersion>4.0.0</modelVersion>
<artifactId>databendwriter</artifactId>
<name>databendwriter</name>
<packaging>jar</packaging>
<dependencies>
<dependency>
<groupId>com.databend</groupId>
<artifactId>databend-jdbc</artifactId>
<version>0.1.0</version>
</dependency>
<dependency>
<groupId>com.alibaba.datax</groupId>
<artifactId>datax-core</artifactId>
<version>${datax-project-version}</version>
</dependency>
<dependency>
<groupId>com.alibaba.datax</groupId>
<artifactId>datax-common</artifactId>
<version>${datax-project-version}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
</dependency>
<dependency>
<groupId>ch.qos.logback</groupId>
<artifactId>logback-classic</artifactId>
</dependency>
<dependency>
<groupId>com.alibaba.datax</groupId>
<artifactId>plugin-rdbms-util</artifactId>
<version>${datax-project-version}</version>
<exclusions>
<exclusion>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<resources>
<resource>
<directory>src/main/java</directory>
<includes>
<include>**/*.properties</include>
</includes>
</resource>
</resources>
<plugins>
<!-- compiler plugin -->
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>${jdk-version}</source>
<target>${jdk-version}</target>
<encoding>${project-sourceEncoding}</encoding>
</configuration>
</plugin>
<!-- assembly plugin -->
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<descriptors>
<descriptor>src/main/assembly/package.xml</descriptor>
</descriptors>
<finalName>datax</finalName>
</configuration>
<executions>
<execution>
<id>dwzip</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>

View File

@ -0,0 +1,34 @@
<assembly
xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0 http://maven.apache.org/xsd/assembly-1.1.0.xsd">
<id></id>
<formats>
<format>dir</format>
</formats>
<includeBaseDirectory>false</includeBaseDirectory>
<fileSets>
<fileSet>
<directory>src/main/resources</directory>
<includes>
<include>plugin.json</include>
<include>plugin_job_template.json</include>
</includes>
<outputDirectory>plugin/writer/databendwriter</outputDirectory>
</fileSet>
<fileSet>
<directory>target/</directory>
<includes>
<include>databendwriter-0.0.1-SNAPSHOT.jar</include>
</includes>
<outputDirectory>plugin/writer/databendwriter</outputDirectory>
</fileSet>
</fileSets>
<dependencySets>
<dependencySet>
<useProjectArtifact>false</useProjectArtifact>
<outputDirectory>plugin/writer/databendwriter/libs</outputDirectory>
</dependencySet>
</dependencySets>
</assembly>

View File

@ -0,0 +1,241 @@
package com.alibaba.datax.plugin.writer.databendwriter;
import com.alibaba.datax.common.element.Column;
import com.alibaba.datax.common.element.StringColumn;
import com.alibaba.datax.common.exception.CommonErrorCode;
import com.alibaba.datax.common.exception.DataXException;
import com.alibaba.datax.common.plugin.RecordReceiver;
import com.alibaba.datax.common.spi.Writer;
import com.alibaba.datax.common.util.Configuration;
import com.alibaba.datax.plugin.rdbms.util.DataBaseType;
import com.alibaba.datax.plugin.rdbms.writer.CommonRdbmsWriter;
import com.alibaba.datax.plugin.writer.databendwriter.util.DatabendWriterUtil;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.sql.*;
import java.util.List;
import java.util.regex.Pattern;
public class DatabendWriter extends Writer {
private static final DataBaseType DATABASE_TYPE = DataBaseType.Databend;
public static class Job
extends Writer.Job {
private static final Logger LOG = LoggerFactory.getLogger(Job.class);
private Configuration originalConfig;
private CommonRdbmsWriter.Job commonRdbmsWriterMaster;
@Override
public void init() throws DataXException {
this.originalConfig = super.getPluginJobConf();
this.commonRdbmsWriterMaster = new CommonRdbmsWriter.Job(DATABASE_TYPE);
this.commonRdbmsWriterMaster.init(this.originalConfig);
// placeholder currently not supported by databend driver, needs special treatment
DatabendWriterUtil.dealWriteMode(this.originalConfig);
}
@Override
public void preCheck() {
this.init();
this.commonRdbmsWriterMaster.writerPreCheck(this.originalConfig, DATABASE_TYPE);
}
@Override
public void prepare() {
this.commonRdbmsWriterMaster.prepare(this.originalConfig);
}
@Override
public List<Configuration> split(int mandatoryNumber) {
return this.commonRdbmsWriterMaster.split(this.originalConfig, mandatoryNumber);
}
@Override
public void post() {
this.commonRdbmsWriterMaster.post(this.originalConfig);
}
@Override
public void destroy() {
this.commonRdbmsWriterMaster.destroy(this.originalConfig);
}
}
public static class Task extends Writer.Task {
private static final Logger LOG = LoggerFactory.getLogger(Task.class);
private Configuration writerSliceConfig;
private CommonRdbmsWriter.Task commonRdbmsWriterSlave;
@Override
public void init() {
this.writerSliceConfig = super.getPluginJobConf();
this.commonRdbmsWriterSlave = new CommonRdbmsWriter.Task(DataBaseType.Databend) {
@Override
protected PreparedStatement fillPreparedStatementColumnType(PreparedStatement preparedStatement, int columnIndex, int columnSqltype, String typeName, Column column) throws SQLException {
try {
if (column.getRawData() == null) {
preparedStatement.setNull(columnIndex + 1, columnSqltype);
return preparedStatement;
}
java.util.Date utilDate;
switch (columnSqltype) {
case Types.TINYINT:
case Types.SMALLINT:
case Types.INTEGER:
preparedStatement.setInt(columnIndex + 1, column.asBigInteger().intValue());
break;
case Types.BIGINT:
preparedStatement.setLong(columnIndex + 1, column.asLong());
break;
case Types.DECIMAL:
preparedStatement.setBigDecimal(columnIndex + 1, column.asBigDecimal());
break;
case Types.FLOAT:
case Types.REAL:
preparedStatement.setFloat(columnIndex + 1, column.asDouble().floatValue());
break;
case Types.DOUBLE:
preparedStatement.setDouble(columnIndex + 1, column.asDouble());
break;
case Types.DATE:
java.sql.Date sqlDate = null;
try {
utilDate = column.asDate();
} catch (DataXException e) {
throw new SQLException(String.format(
"Date type conversion error: [%s]", column));
}
if (null != utilDate) {
sqlDate = new java.sql.Date(utilDate.getTime());
}
preparedStatement.setDate(columnIndex + 1, sqlDate);
break;
case Types.TIME:
java.sql.Time sqlTime = null;
try {
utilDate = column.asDate();
} catch (DataXException e) {
throw new SQLException(String.format(
"Date type conversion error: [%s]", column));
}
if (null != utilDate) {
sqlTime = new java.sql.Time(utilDate.getTime());
}
preparedStatement.setTime(columnIndex + 1, sqlTime);
break;
case Types.TIMESTAMP:
Timestamp sqlTimestamp = null;
if (column instanceof StringColumn && column.asString() != null) {
String timeStampStr = column.asString();
// JAVA TIMESTAMP 类型入参必须是 "2017-07-12 14:39:00.123566" 格式
String pattern = "^\\d+-\\d+-\\d+ \\d+:\\d+:\\d+.\\d+";
boolean isMatch = Pattern.matches(pattern, timeStampStr);
if (isMatch) {
sqlTimestamp = Timestamp.valueOf(timeStampStr);
preparedStatement.setTimestamp(columnIndex + 1, sqlTimestamp);
break;
}
}
try {
utilDate = column.asDate();
} catch (DataXException e) {
throw new SQLException(String.format(
"Date type conversion error: [%s]", column));
}
if (null != utilDate) {
sqlTimestamp = new Timestamp(
utilDate.getTime());
}
preparedStatement.setTimestamp(columnIndex + 1, sqlTimestamp);
break;
case Types.BINARY:
case Types.VARBINARY:
case Types.BLOB:
case Types.LONGVARBINARY:
preparedStatement.setBytes(columnIndex + 1, column
.asBytes());
break;
case Types.BOOLEAN:
// warn: bit(1) -> Types.BIT 可使用setBoolean
// warn: bit(>1) -> Types.VARBINARY 可使用setBytes
case Types.BIT:
if (this.dataBaseType == DataBaseType.MySql) {
Boolean asBoolean = column.asBoolean();
if (asBoolean != null) {
preparedStatement.setBoolean(columnIndex + 1, asBoolean);
} else {
preparedStatement.setNull(columnIndex + 1, Types.BIT);
}
} else {
preparedStatement.setString(columnIndex + 1, column.asString());
}
break;
default:
// cast variant / array into string is fine.
preparedStatement.setString(columnIndex + 1, column.asString());
break;
}
return preparedStatement;
} catch (DataXException e) {
// fix类型转换或者溢出失败时将具体哪一列打印出来
if (e.getErrorCode() == CommonErrorCode.CONVERT_NOT_SUPPORT ||
e.getErrorCode() == CommonErrorCode.CONVERT_OVER_FLOW) {
throw DataXException
.asDataXException(
e.getErrorCode(),
String.format(
"type conversion error. columnName: [%s], columnType:[%d], columnJavaType: [%s]. please change the data type in given column field or do not sync on the column.",
this.resultSetMetaData.getLeft()
.get(columnIndex),
this.resultSetMetaData.getMiddle()
.get(columnIndex),
this.resultSetMetaData.getRight()
.get(columnIndex)));
} else {
throw e;
}
}
}
};
this.commonRdbmsWriterSlave.init(this.writerSliceConfig);
}
@Override
public void destroy() {
this.commonRdbmsWriterSlave.destroy(this.writerSliceConfig);
}
@Override
public void prepare() {
this.commonRdbmsWriterSlave.prepare(this.writerSliceConfig);
}
@Override
public void post() {
this.commonRdbmsWriterSlave.post(this.writerSliceConfig);
}
@Override
public void startWrite(RecordReceiver lineReceiver) {
this.commonRdbmsWriterSlave.startWrite(lineReceiver, this.writerSliceConfig, this.getTaskPluginCollector());
}
}
}

View File

@ -0,0 +1,33 @@
package com.alibaba.datax.plugin.writer.databendwriter;
import com.alibaba.datax.common.spi.ErrorCode;
public enum DatabendWriterErrorCode implements ErrorCode {
CONF_ERROR("DatabendWriter-00", "配置错误."),
WRITE_DATA_ERROR("DatabendWriter-01", "写入数据时失败."),
;
private final String code;
private final String description;
private DatabendWriterErrorCode(String code, String description) {
this.code = code;
this.description = description;
}
@Override
public String getCode() {
return this.code;
}
@Override
public String getDescription() {
return this.description;
}
@Override
public String toString() {
return String.format("Code:[%s], Description:[%s].", this.code, this.description);
}
}

View File

@ -0,0 +1,72 @@
package com.alibaba.datax.plugin.writer.databendwriter.util;
import com.alibaba.datax.common.exception.DataXException;
import com.alibaba.datax.common.util.Configuration;
import com.alibaba.datax.plugin.rdbms.writer.Constant;
import com.alibaba.datax.plugin.rdbms.writer.Key;
import com.alibaba.datax.plugin.writer.databendwriter.DatabendWriterErrorCode;
import org.apache.commons.lang3.StringUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.List;
import java.util.StringJoiner;
public final class DatabendWriterUtil {
private static final Logger LOG = LoggerFactory.getLogger(DatabendWriterUtil.class);
private DatabendWriterUtil() {
}
public static void dealWriteMode(Configuration originalConfig) throws DataXException {
List<String> columns = originalConfig.getList(Key.COLUMN, String.class);
List<String> onConflictColumns = originalConfig.getList(Key.ONCONFLICT_COLUMN, String.class);
StringBuilder writeDataSqlTemplate = new StringBuilder();
String jdbcUrl = originalConfig.getString(String.format("%s[0].%s",
Constant.CONN_MARK, Key.JDBC_URL));
String writeMode = originalConfig.getString(Key.WRITE_MODE, "INSERT");
LOG.info("write mode is {}", writeMode);
if (writeMode.toLowerCase().contains("replace")) {
if (onConflictColumns == null || onConflictColumns.isEmpty()) {
throw DataXException
.asDataXException(
DatabendWriterErrorCode.CONF_ERROR,
"Replace mode requires the onConflictColumn config.");
}
// for databend if you want to use replace mode, the writeMode should be: "writeMode": "replace"
writeDataSqlTemplate.append("REPLACE INTO %s (")
.append(StringUtils.join(columns, ",")).append(") ").append(onConFlictDoString(onConflictColumns))
.append(" VALUES");
LOG.info("Replace data [\n{}\n], which jdbcUrl like:[{}]", writeDataSqlTemplate, jdbcUrl);
originalConfig.set(Constant.INSERT_OR_REPLACE_TEMPLATE_MARK, writeDataSqlTemplate);
} else {
writeDataSqlTemplate.append("INSERT INTO %s");
StringJoiner columnString = new StringJoiner(",");
for (String column : columns) {
columnString.add(column);
}
writeDataSqlTemplate.append(String.format("(%s)", columnString));
writeDataSqlTemplate.append(" VALUES");
LOG.info("Insert data [\n{}\n], which jdbcUrl like:[{}]", writeDataSqlTemplate, jdbcUrl);
originalConfig.set(Constant.INSERT_OR_REPLACE_TEMPLATE_MARK, writeDataSqlTemplate);
}
}
public static String onConFlictDoString(List<String> conflictColumns) {
return " ON " +
"(" +
StringUtils.join(conflictColumns, ",") + ") ";
}
}
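To make the two branches above concrete, here is a small sketch of calling `dealWriteMode` on a hand-built configuration and printing the resulting SQL template. The column list, conflict column and jdbcUrl are illustrative; it assumes the rdbms-writer `Key.ONCONFLICT_COLUMN` maps to the documented `onConflictColumn` key, and the expected template in the comment simply follows the string building shown above (exact spacing may differ).

```java
import com.alibaba.datax.common.util.Configuration;
import com.alibaba.datax.plugin.rdbms.writer.Constant;
import com.alibaba.datax.plugin.writer.databendwriter.util.DatabendWriterUtil;

public class DealWriteModeDemo {
    public static void main(String[] args) {
        // Illustrative config fragment; only the keys read by dealWriteMode are set.
        Configuration conf = Configuration.from(
                "{\"column\":[\"a\",\"b\",\"c\"],"
                        + "\"onConflictColumn\":[\"a\"],"
                        + "\"writeMode\":\"replace\","
                        + "\"connection\":[{\"jdbcUrl\":\"jdbc:databend://localhost:8000/datax\"}]}");

        DatabendWriterUtil.dealWriteMode(conf);

        // Expected template built by the replace branch above:
        // REPLACE INTO %s (a,b,c)  ON (a)  VALUES
        System.out.println(conf.getString(Constant.INSERT_OR_REPLACE_TEMPLATE_MARK));
    }
}
```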

View File

@ -0,0 +1,6 @@
{
"name": "databendwriter",
"class": "com.alibaba.datax.plugin.writer.databendwriter.DatabendWriter",
"description": "execute batch insert sql to write dataX data into databend",
"developer": "databend"
}

View File

@ -0,0 +1,19 @@
{
"name": "databendwriter",
"parameter": {
"username": "username",
"password": "password",
"column": ["col1", "col2", "col3"],
"connection": [
{
"jdbcUrl": "jdbc:databend://<host>:<port>[/<database>]",
"table": "table1"
}
],
"preSql": [],
"postSql": [],
"maxBatchRows": 65536,
"maxBatchSize": 134217728
}
}

79
datahubreader/pom.xml Normal file
View File

@ -0,0 +1,79 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<parent>
<artifactId>datax-all</artifactId>
<groupId>com.alibaba.datax</groupId>
<version>0.0.1-SNAPSHOT</version>
</parent>
<modelVersion>4.0.0</modelVersion>
<artifactId>datahubreader</artifactId>
<version>0.0.1-SNAPSHOT</version>
<dependencies>
<dependency>
<groupId>com.alibaba.datax</groupId>
<artifactId>datax-common</artifactId>
<version>${datax-project-version}</version>
<exclusions>
<exclusion>
<artifactId>slf4j-log4j12</artifactId>
<groupId>org.slf4j</groupId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
</dependency>
<dependency>
<groupId>ch.qos.logback</groupId>
<artifactId>logback-classic</artifactId>
</dependency>
<dependency>
<groupId>com.aliyun.datahub</groupId>
<artifactId>aliyun-sdk-datahub</artifactId>
<version>2.21.6-public</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<plugins>
<!-- compiler plugin -->
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>${jdk-version}</source>
<target>${jdk-version}</target>
<encoding>${project-sourceEncoding}</encoding>
</configuration>
</plugin>
<!-- assembly plugin -->
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<descriptors>
<descriptor>src/main/assembly/package.xml</descriptor>
</descriptors>
<finalName>datax</finalName>
</configuration>
<executions>
<execution>
<id>dwzip</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>

View File

@ -0,0 +1,34 @@
<assembly
xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0 http://maven.apache.org/xsd/assembly-1.1.0.xsd">
<id></id>
<formats>
<format>dir</format>
</formats>
<includeBaseDirectory>false</includeBaseDirectory>
<fileSets>
<fileSet>
<directory>src/main/resources</directory>
<includes>
<include>plugin.json</include>
</includes>
<outputDirectory>plugin/reader/datahubreader</outputDirectory>
</fileSet>
<fileSet>
<directory>target/</directory>
<includes>
<include>datahubreader-0.0.1-SNAPSHOT.jar</include>
</includes>
<outputDirectory>plugin/reader/datahubreader</outputDirectory>
</fileSet>
</fileSets>
<dependencySets>
<dependencySet>
<useProjectArtifact>false</useProjectArtifact>
<outputDirectory>plugin/reader/datahubreader/libs</outputDirectory>
<scope>runtime</scope>
</dependencySet>
</dependencySets>
</assembly>

View File

@ -0,0 +1,8 @@
package com.alibaba.datax.plugin.reader.datahubreader;
public class Constant {
public static String DATETIME_FORMAT = "yyyyMMddHHmmss";
public static String DATE_FORMAT = "yyyyMMdd";
}

View File

@ -0,0 +1,42 @@
package com.alibaba.datax.plugin.reader.datahubreader;
import com.alibaba.datax.common.util.Configuration;
import com.alibaba.fastjson2.JSON;
import com.alibaba.fastjson2.TypeReference;
import com.aliyun.datahub.client.DatahubClient;
import com.aliyun.datahub.client.DatahubClientBuilder;
import com.aliyun.datahub.client.auth.Account;
import com.aliyun.datahub.client.auth.AliyunAccount;
import com.aliyun.datahub.client.common.DatahubConfig;
import com.aliyun.datahub.client.http.HttpConfig;
import org.apache.commons.lang3.StringUtils;
public class DatahubClientHelper {
public static DatahubClient getDatahubClient(Configuration jobConfig) {
String accessId = jobConfig.getNecessaryValue(Key.CONFIG_KEY_ACCESS_ID,
DatahubWriterErrorCode.MISSING_REQUIRED_VALUE);
String accessKey = jobConfig.getNecessaryValue(Key.CONFIG_KEY_ACCESS_KEY,
DatahubWriterErrorCode.MISSING_REQUIRED_VALUE);
String endpoint = jobConfig.getNecessaryValue(Key.CONFIG_KEY_ENDPOINT,
DatahubWriterErrorCode.MISSING_REQUIRED_VALUE);
Account account = new AliyunAccount(accessId, accessKey);
// 是否开启二进制传输服务端2.12版本开始支持
boolean enableBinary = jobConfig.getBool("enableBinary", false);
DatahubConfig datahubConfig = new DatahubConfig(endpoint, account, enableBinary);
// HttpConfig可不设置不设置时采用默认值
// 读写数据推荐打开网络传输 LZ4压缩
HttpConfig httpConfig = null;
String httpConfigStr = jobConfig.getString("httpConfig");
if (StringUtils.isNotBlank(httpConfigStr)) {
httpConfig = JSON.parseObject(httpConfigStr, new TypeReference<HttpConfig>() {
});
}
DatahubClientBuilder builder = DatahubClientBuilder.newBuilder().setDatahubConfig(datahubConfig);
if (null != httpConfig) {
builder.setHttpConfig(httpConfig);
}
DatahubClient datahubClient = builder.build();
return datahubClient;
}
}
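A minimal sketch of building a client through the helper above; the endpoint and credentials are placeholders, and `enableBinary`/`httpConfig` are optional as shown in the code.

```java
import com.alibaba.datax.common.util.Configuration;
import com.alibaba.datax.plugin.reader.datahubreader.DatahubClientHelper;
import com.aliyun.datahub.client.DatahubClient;

public class DatahubClientDemo {
    public static void main(String[] args) {
        // Placeholder endpoint and credentials.
        Configuration jobConfig = Configuration.from(
                "{\"endpoint\":\"https://dh-cn-hangzhou.aliyuncs.com\","
                        + "\"accessId\":\"<yourAccessId>\","
                        + "\"accessKey\":\"<yourAccessKey>\","
                        + "\"enableBinary\":false}");
        DatahubClient client = DatahubClientHelper.getDatahubClient(jobConfig);
        // The client is then used for listShard/getCursor/getRecords as in DatahubReader.
        System.out.println(client != null ? "client created" : "client is null");
    }
}
```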

View File

@ -0,0 +1,292 @@
package com.alibaba.datax.plugin.reader.datahubreader;
import java.text.ParseException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import com.aliyun.datahub.client.model.*;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.alibaba.datax.common.element.Column;
import com.alibaba.datax.common.element.Record;
import com.alibaba.datax.common.element.StringColumn;
import com.alibaba.datax.common.exception.DataXException;
import com.alibaba.datax.common.plugin.RecordSender;
import com.alibaba.datax.common.spi.Reader;
import com.alibaba.datax.common.util.Configuration;
import com.aliyun.datahub.client.DatahubClient;
public class DatahubReader extends Reader {
public static class Job extends Reader.Job {
private static final Logger LOG = LoggerFactory.getLogger(Job.class);
private Configuration originalConfig;
private Long beginTimestampMillis;
private Long endTimestampMillis;
DatahubClient datahubClient;
@Override
public void init() {
LOG.info("datahub reader job init begin ...");
this.originalConfig = super.getPluginJobConf();
validateParameter(originalConfig);
this.datahubClient = DatahubClientHelper.getDatahubClient(this.originalConfig);
LOG.info("datahub reader job init end.");
}
private void validateParameter(Configuration conf){
conf.getNecessaryValue(Key.ENDPOINT,DatahubReaderErrorCode.REQUIRE_VALUE);
conf.getNecessaryValue(Key.ACCESSKEYID,DatahubReaderErrorCode.REQUIRE_VALUE);
conf.getNecessaryValue(Key.ACCESSKEYSECRET,DatahubReaderErrorCode.REQUIRE_VALUE);
conf.getNecessaryValue(Key.PROJECT,DatahubReaderErrorCode.REQUIRE_VALUE);
conf.getNecessaryValue(Key.TOPIC,DatahubReaderErrorCode.REQUIRE_VALUE);
conf.getNecessaryValue(Key.COLUMN,DatahubReaderErrorCode.REQUIRE_VALUE);
conf.getNecessaryValue(Key.BEGINDATETIME,DatahubReaderErrorCode.REQUIRE_VALUE);
conf.getNecessaryValue(Key.ENDDATETIME,DatahubReaderErrorCode.REQUIRE_VALUE);
int batchSize = this.originalConfig.getInt(Key.BATCHSIZE, 1024);
if (batchSize > 10000) {
throw DataXException.asDataXException(DatahubReaderErrorCode.BAD_CONFIG_VALUE,
"Invalid batchSize[" + batchSize + "] value (0,10000]!");
}
String beginDateTime = this.originalConfig.getString(Key.BEGINDATETIME);
if (beginDateTime != null) {
try {
beginTimestampMillis = DatahubReaderUtils.getUnixTimeFromDateTime(beginDateTime);
} catch (ParseException e) {
throw DataXException.asDataXException(DatahubReaderErrorCode.BAD_CONFIG_VALUE,
"Invalid beginDateTime[" + beginDateTime + "], format [yyyyMMddHHmmss]!");
}
}
if (beginTimestampMillis != null && beginTimestampMillis <= 0) {
throw DataXException.asDataXException(DatahubReaderErrorCode.BAD_CONFIG_VALUE,
"Invalid beginTimestampMillis[" + beginTimestampMillis + "]!");
}
String endDateTime = this.originalConfig.getString(Key.ENDDATETIME);
if (endDateTime != null) {
try {
endTimestampMillis = DatahubReaderUtils.getUnixTimeFromDateTime(endDateTime);
} catch (ParseException e) {
throw DataXException.asDataXException(DatahubReaderErrorCode.BAD_CONFIG_VALUE,
"Invalid beginDateTime[" + endDateTime + "], format [yyyyMMddHHmmss]!");
}
}
if (endTimestampMillis != null && endTimestampMillis <= 0) {
throw DataXException.asDataXException(DatahubReaderErrorCode.BAD_CONFIG_VALUE,
"Invalid endTimestampMillis[" + endTimestampMillis + "]!");
}
if (beginTimestampMillis != null && endTimestampMillis != null
&& endTimestampMillis <= beginTimestampMillis) {
throw DataXException.asDataXException(DatahubReaderErrorCode.BAD_CONFIG_VALUE,
"endTimestampMillis[" + endTimestampMillis + "] must bigger than beginTimestampMillis[" + beginTimestampMillis + "]!");
}
}
@Override
public void prepare() {
// create datahub client
String project = originalConfig.getNecessaryValue(Key.PROJECT, DatahubReaderErrorCode.REQUIRE_VALUE);
String topic = originalConfig.getNecessaryValue(Key.TOPIC, DatahubReaderErrorCode.REQUIRE_VALUE);
RecordType recordType = null;
try {
DatahubClient client = DatahubClientHelper.getDatahubClient(this.originalConfig);
GetTopicResult getTopicResult = client.getTopic(project, topic);
recordType = getTopicResult.getRecordType();
} catch (Exception e) {
LOG.warn("get topic type error: {}", e.getMessage());
}
if (null != recordType) {
if (recordType == RecordType.BLOB) {
throw DataXException.asDataXException(DatahubReaderErrorCode.BAD_CONFIG_VALUE,
"DatahubReader only support 'Tuple' RecordType now, but your RecordType is 'BLOB'");
}
}
}
@Override
public void destroy() {
}
@Override
public List<Configuration> split(int adviceNumber) {
LOG.info("split() begin...");
List<Configuration> readerSplitConfigs = new ArrayList<Configuration>();
String project = this.originalConfig.getString(Key.PROJECT);
String topic = this.originalConfig.getString(Key.TOPIC);
List<ShardEntry> shardEntrys = DatahubReaderUtils.getShardsWithRetry(this.datahubClient, project, topic);
if (shardEntrys == null || shardEntrys.isEmpty()) {
throw DataXException.asDataXException(DatahubReaderErrorCode.BAD_CONFIG_VALUE,
"Project [" + project + "] Topic [" + topic + "] has no shards, please check !");
}
for (ShardEntry shardEntry : shardEntrys) {
Configuration splitedConfig = this.originalConfig.clone();
splitedConfig.set(Key.SHARDID, shardEntry.getShardId());
readerSplitConfigs.add(splitedConfig);
}
LOG.info("split() ok and end...");
return readerSplitConfigs;
}
}
public static class Task extends Reader.Task {
private static final Logger LOG = LoggerFactory.getLogger(Task.class);
private Configuration taskConfig;
private String accessId;
private String accessKey;
private String endpoint;
private String project;
private String topic;
private String shardId;
private Long beginTimestampMillis;
private Long endTimestampMillis;
private int batchSize;
private List<String> columns;
private RecordSchema schema;
private String timeStampUnit;
DatahubClient datahubClient;
@Override
public void init() {
this.taskConfig = super.getPluginJobConf();
this.accessId = this.taskConfig.getString(Key.ACCESSKEYID);
this.accessKey = this.taskConfig.getString(Key.ACCESSKEYSECRET);
this.endpoint = this.taskConfig.getString(Key.ENDPOINT);
this.project = this.taskConfig.getString(Key.PROJECT);
this.topic = this.taskConfig.getString(Key.TOPIC);
this.shardId = this.taskConfig.getString(Key.SHARDID);
this.batchSize = this.taskConfig.getInt(Key.BATCHSIZE, 1024);
this.timeStampUnit = this.taskConfig.getString(Key.TIMESTAMP_UNIT, "MICROSECOND");
try {
this.beginTimestampMillis = DatahubReaderUtils.getUnixTimeFromDateTime(this.taskConfig.getString(Key.BEGINDATETIME));
} catch (ParseException e) {
// beginDateTime was already validated in Job.init(), and the helper converts parse failures to DataXException, so there is nothing to handle here
}
try {
this.endTimestampMillis = DatahubReaderUtils.getUnixTimeFromDateTime(this.taskConfig.getString(Key.ENDDATETIME));
} catch (ParseException e) {
// endDateTime was already validated in Job.init(), and the helper converts parse failures to DataXException, so there is nothing to handle here
}
this.columns = this.taskConfig.getList(Key.COLUMN, String.class);
this.datahubClient = DatahubClientHelper.getDatahubClient(this.taskConfig);
this.schema = DatahubReaderUtils.getDatahubSchemaWithRetry(this.datahubClient, this.project, topic);
LOG.info("init datahub reader task finished.project:{} topic:{} batchSize:{}", project, topic, batchSize);
}
@Override
public void destroy() {
}
@Override
public void startRead(RecordSender recordSender) {
LOG.info("read start");
String beginCursor = DatahubReaderUtils.getCursorWithRetry(this.datahubClient, this.project,
this.topic, this.shardId, this.beginTimestampMillis);
String endCursor = DatahubReaderUtils.getCursorWithRetry(this.datahubClient, this.project,
this.topic, this.shardId, this.endTimestampMillis);
if (beginCursor == null) {
LOG.info("Shard:{} has no data!", this.shardId);
return;
} else if (endCursor == null) {
endCursor = DatahubReaderUtils.getLatestCursorWithRetry(this.datahubClient, this.project,
this.topic, this.shardId);
}
String curCursor = beginCursor;
boolean exit = false;
while (true) {
GetRecordsResult result = DatahubReaderUtils.getRecordsResultWithRetry(this.datahubClient, this.project, this.topic,
this.shardId, this.batchSize, curCursor, this.schema);
List<RecordEntry> records = result.getRecords();
if (records.size() > 0) {
for (RecordEntry record : records) {
if (record.getSystemTime() >= this.endTimestampMillis) {
exit = true;
break;
}
HashMap<String, Column> dataMap = new HashMap<String, Column>();
List<Field> fields = ((TupleRecordData) record.getRecordData()).getRecordSchema().getFields();
for (int i = 0; i < fields.size(); i++) {
Field field = fields.get(i);
Column column = DatahubReaderUtils.getColumnFromField(record, field, this.timeStampUnit);
dataMap.put(field.getName(), column);
}
Record dataxRecord = recordSender.createRecord();
if (null != this.columns && 1 == this.columns.size()) {
String columnsInStr = columns.get(0).toString();
if ("\"*\"".equals(columnsInStr) || "*".equals(columnsInStr)) {
for (int i = 0; i < fields.size(); i++) {
dataxRecord.addColumn(dataMap.get(fields.get(i).getName()));
}
} else {
if (dataMap.containsKey(columnsInStr)) {
dataxRecord.addColumn(dataMap.get(columnsInStr));
} else {
dataxRecord.addColumn(new StringColumn(null));
}
}
} else {
for (String col : this.columns) {
if (dataMap.containsKey(col)) {
dataxRecord.addColumn(dataMap.get(col));
} else {
dataxRecord.addColumn(new StringColumn(null));
}
}
}
recordSender.sendToWriter(dataxRecord);
}
} else {
break;
}
if (exit) {
break;
}
curCursor = result.getNextCursor();
}
LOG.info("end read datahub shard...");
}
}
}

View File

@ -0,0 +1,35 @@
package com.alibaba.datax.plugin.reader.datahubreader;
import com.alibaba.datax.common.spi.ErrorCode;
public enum DatahubReaderErrorCode implements ErrorCode {
BAD_CONFIG_VALUE("DatahubReader-00", "The value you configured is invalid."),
LOG_HUB_ERROR("DatahubReader-01","Datahub exception"),
REQUIRE_VALUE("DatahubReader-02","Missing parameters"),
EMPTY_LOGSTORE_VALUE("DatahubReader-03","There is no shard under this LogStore");
private final String code;
private final String description;
private DatahubReaderErrorCode(String code, String description) {
this.code = code;
this.description = description;
}
@Override
public String getCode() {
return this.code;
}
@Override
public String getDescription() {
return this.description;
}
@Override
public String toString() {
return String.format("Code:[%s], Description:[%s]. ", this.code,
this.description);
}
}

View File

@ -0,0 +1,200 @@
package com.alibaba.datax.plugin.reader.datahubreader;
import java.math.BigDecimal;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.List;
import java.util.concurrent.Callable;
import com.alibaba.datax.common.element.*;
import com.alibaba.datax.common.exception.DataXException;
import com.alibaba.datax.common.util.DataXCaseEnvUtil;
import com.alibaba.datax.common.util.RetryUtil;
import com.aliyun.datahub.client.DatahubClient;
import com.aliyun.datahub.client.exception.InvalidParameterException;
import com.aliyun.datahub.client.model.*;
public class DatahubReaderUtils {
public static long getUnixTimeFromDateTime(String dateTime) throws ParseException {
try {
String format = Constant.DATETIME_FORMAT;
SimpleDateFormat simpleDateFormat = new SimpleDateFormat(format);
return simpleDateFormat.parse(dateTime).getTime();
} catch (ParseException ignored) {
throw DataXException.asDataXException(DatahubReaderErrorCode.BAD_CONFIG_VALUE,
"Invalid DateTime[" + dateTime + "]!");
}
}
public static List<ShardEntry> getShardsWithRetry(final DatahubClient datahubClient, final String project, final String topic) {
List<ShardEntry> shards = null;
try {
shards = RetryUtil.executeWithRetry(new Callable<List<ShardEntry>>() {
@Override
public List<ShardEntry> call() throws Exception {
ListShardResult listShardResult = datahubClient.listShard(project, topic);
return listShardResult.getShards();
}
}, DataXCaseEnvUtil.getRetryTimes(7), DataXCaseEnvUtil.getRetryInterval(1000L), DataXCaseEnvUtil.getRetryExponential(true));
} catch (Exception e) {
throw DataXException.asDataXException(DatahubReaderErrorCode.BAD_CONFIG_VALUE,
"get Shards error, please check ! detail error messsage: " + e.toString());
}
return shards;
}
public static String getCursorWithRetry(final DatahubClient datahubClient, final String project, final String topic,
final String shardId, final long timestamp) {
String cursor;
try {
cursor = RetryUtil.executeWithRetry(new Callable<String>() {
@Override
public String call() throws Exception {
try {
return datahubClient.getCursor(project, topic, shardId, CursorType.SYSTEM_TIME, timestamp).getCursor();
} catch (InvalidParameterException e) {
if (e.getErrorMessage().indexOf("Time in seek request is out of range") >= 0) {
return null;
} else {
throw e;
}
}
}
}, DataXCaseEnvUtil.getRetryTimes(7), DataXCaseEnvUtil.getRetryInterval(1000L), DataXCaseEnvUtil.getRetryExponential(true));
} catch (Exception e) {
throw DataXException.asDataXException(DatahubReaderErrorCode.BAD_CONFIG_VALUE,
"get Cursor error, please check ! detail error messsage: " + e.toString());
}
return cursor;
}
public static String getLatestCursorWithRetry(final DatahubClient datahubClient, final String project, final String topic,
final String shardId) {
String cursor;
try {
cursor = RetryUtil.executeWithRetry(new Callable<String>() {
@Override
public String call() throws Exception {
return datahubClient.getCursor(project, topic, shardId, CursorType.LATEST).getCursor();
}
}, DataXCaseEnvUtil.getRetryTimes(7), DataXCaseEnvUtil.getRetryInterval(1000L), DataXCaseEnvUtil.getRetryExponential(true));
} catch (Exception e) {
throw DataXException.asDataXException(DatahubReaderErrorCode.BAD_CONFIG_VALUE,
"get Cursor error, please check ! detail error messsage: " + e.toString());
}
return cursor;
}
public static RecordSchema getDatahubSchemaWithRetry(final DatahubClient datahubClient, final String project, final String topic) {
RecordSchema schema;
try {
schema = RetryUtil.executeWithRetry(new Callable<RecordSchema>() {
@Override
public RecordSchema call() throws Exception {
return datahubClient.getTopic(project, topic).getRecordSchema();
}
}, DataXCaseEnvUtil.getRetryTimes(7), DataXCaseEnvUtil.getRetryInterval(1000L), DataXCaseEnvUtil.getRetryExponential(true));
} catch (Exception e) {
throw DataXException.asDataXException(DatahubReaderErrorCode.BAD_CONFIG_VALUE,
"get Topic Schema error, please check ! detail error messsage: " + e.toString());
}
return schema;
}
public static GetRecordsResult getRecordsResultWithRetry(final DatahubClient datahubClient, final String project,
final String topic, final String shardId, final int batchSize, final String cursor, final RecordSchema schema) {
GetRecordsResult result;
try {
result = RetryUtil.executeWithRetry(new Callable<GetRecordsResult>() {
@Override
public GetRecordsResult call() throws Exception {
return datahubClient.getRecords(project, topic, shardId, schema, cursor, batchSize);
}
}, DataXCaseEnvUtil.getRetryTimes(7), DataXCaseEnvUtil.getRetryInterval(1000L), DataXCaseEnvUtil.getRetryExponential(true));
} catch (Exception e) {
throw DataXException.asDataXException(DatahubReaderErrorCode.BAD_CONFIG_VALUE,
"get Record Result error, please check ! detail error messsage: " + e.toString());
}
return result;
}
public static Column getColumnFromField(RecordEntry record, Field field, String timeStampUnit) {
Column col = null;
TupleRecordData o = (TupleRecordData) record.getRecordData();
switch (field.getType()) {
case SMALLINT:
Short shortValue = ((Short) o.getField(field.getName()));
col = new LongColumn(shortValue == null ? null: shortValue.longValue());
break;
case INTEGER:
col = new LongColumn((Integer) o.getField(field.getName()));
break;
case BIGINT: {
col = new LongColumn((Long) o.getField(field.getName()));
break;
}
case TINYINT: {
Byte byteValue = ((Byte) o.getField(field.getName()));
col = new LongColumn(byteValue == null ? null : byteValue.longValue());
break;
}
case BOOLEAN: {
col = new BoolColumn((Boolean) o.getField(field.getName()));
break;
}
case FLOAT:
col = new DoubleColumn((Float) o.getField(field.getName()));
break;
case DOUBLE: {
col = new DoubleColumn((Double) o.getField(field.getName()));
break;
}
case STRING: {
col = new StringColumn((String) o.getField(field.getName()));
break;
}
case DECIMAL: {
BigDecimal value = (BigDecimal) o.getField(field.getName());
col = new DoubleColumn(value == null ? null : value.doubleValue());
break;
}
case TIMESTAMP: {
Long value = (Long) o.getField(field.getName());
if ("MILLISECOND".equals(timeStampUnit)) {
// MILLISECOND, 13位精度直接 new Date()
col = new DateColumn(value == null ? null : new Date(value));
}
else if ("SECOND".equals(timeStampUnit)){
col = new DateColumn(value == null ? null : new Date(value * 1000));
}
else {
// 默认都是 MICROSECOND, 16位精度 和之前的逻辑保持一致
col = new DateColumn(value == null ? null : new Date(value / 1000));
}
break;
}
default:
throw new RuntimeException("Unknown column type: " + field.getType());
}
return col;
}
}
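
Side note on the TIMESTAMP branch above: the unit handling can be summarized as a small standalone helper. The sketch below is illustrative only (class and method names are hypothetical, not part of the plugin) and assumes the field value is an epoch number in the configured timeStampUnit, defaulting to microseconds as in the code above.

// Hypothetical, standalone helper mirroring the TIMESTAMP branch above.
import java.util.Date;

public class TimestampUnitSketch {
    static Date toDate(Long value, String timeStampUnit) {
        if (value == null) {
            return null;
        }
        if ("MILLISECOND".equals(timeStampUnit)) {
            return new Date(value);              // 13-digit epoch milliseconds
        }
        if ("SECOND".equals(timeStampUnit)) {
            return new Date(value * 1000);       // 10-digit epoch seconds
        }
        return new Date(value / 1000);           // default: 16-digit epoch microseconds
    }

    public static void main(String[] args) {
        // The same instant expressed in seconds, milliseconds and microseconds.
        System.out.println(toDate(1536840619L, "SECOND"));
        System.out.println(toDate(1536840619000L, "MILLISECOND"));
        System.out.println(toDate(1536840619000000L, "MICROSECOND"));
    }
}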

View File

@ -0,0 +1,37 @@
package com.alibaba.datax.plugin.reader.datahubreader;
import com.alibaba.datax.common.spi.ErrorCode;
import com.alibaba.datax.common.util.MessageSource;
public enum DatahubWriterErrorCode implements ErrorCode {
MISSING_REQUIRED_VALUE("DatahubWriter-01", MessageSource.loadResourceBundle(DatahubWriterErrorCode.class).message("errorcode.missing_required_value")),
INVALID_CONFIG_VALUE("DatahubWriter-02", MessageSource.loadResourceBundle(DatahubWriterErrorCode.class).message("errorcode.invalid_config_value")),
GET_TOPOIC_INFO_FAIL("DatahubWriter-03", MessageSource.loadResourceBundle(DatahubWriterErrorCode.class).message("errorcode.get_topic_info_fail")),
WRITE_DATAHUB_FAIL("DatahubWriter-04", MessageSource.loadResourceBundle(DatahubWriterErrorCode.class).message("errorcode.write_datahub_fail")),
SCHEMA_NOT_MATCH("DatahubWriter-05", MessageSource.loadResourceBundle(DatahubWriterErrorCode.class).message("errorcode.schema_not_match")),
;
private final String code;
private final String description;
private DatahubWriterErrorCode(String code, String description) {
this.code = code;
this.description = description;
}
@Override
public String getCode() {
return this.code;
}
@Override
public String getDescription() {
return this.description;
}
@Override
public String toString() {
return String.format("Code:[%s], Description:[%s]. ", this.code,
this.description);
}
}

View File

@ -0,0 +1,35 @@
package com.alibaba.datax.plugin.reader.datahubreader;
public final class Key {
/**
* Configuration keys used by this plugin that must be provided by the plugin user.
*/
public static final String ENDPOINT = "endpoint";
public static final String ACCESSKEYID = "accessId";
public static final String ACCESSKEYSECRET = "accessKey";
public static final String PROJECT = "project";
public static final String TOPIC = "topic";
public static final String BEGINDATETIME = "beginDateTime";
public static final String ENDDATETIME = "endDateTime";
public static final String BATCHSIZE = "batchSize";
public static final String COLUMN = "column";
public static final String SHARDID = "shardId";
public static final String CONFIG_KEY_ENDPOINT = "endpoint";
public static final String CONFIG_KEY_ACCESS_ID = "accessId";
public static final String CONFIG_KEY_ACCESS_KEY = "accessKey";
public static final String TIMESTAMP_UNIT = "timeStampUnit";
}
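
For orientation, the sketch below shows how the keys above line up with the reader's job parameter block, using only Configuration methods that appear elsewhere in this diff. It is a hypothetical, minimal example; all literal values are placeholders, and how the actual DatahubReader consumes these keys is not shown in this excerpt.

// Hypothetical, standalone illustration of how the keys above map to the job "parameter" block.
import com.alibaba.datax.common.util.Configuration;
import com.alibaba.datax.plugin.reader.datahubreader.Key;

public class DatahubReaderKeySketch {
    public static void main(String[] args) {
        Configuration conf = Configuration.from(
                "{\"endpoint\":\"<datahub endpoint>\", \"accessId\":\"<id>\", \"accessKey\":\"<key>\","
                + " \"project\":\"test_project\", \"topic\":\"test_topic\","
                + " \"beginDateTime\":\"20180913121019\", \"endDateTime\":\"20180913121119\","
                + " \"batchSize\":1024, \"timeStampUnit\":\"MICROSECOND\"}");
        String endpoint = conf.getString(Key.ENDPOINT);
        String begin = conf.getString(Key.BEGINDATETIME);
        int batchSize = conf.getInt(Key.BATCHSIZE, 1024);
        String unit = conf.getString(Key.TIMESTAMP_UNIT, "MICROSECOND");
        System.out.println(endpoint + " " + begin + " " + batchSize + " " + unit);
    }
}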

View File

@ -0,0 +1,5 @@
errorcode.missing_required_value=\u60A8\u7F3A\u5931\u4E86\u5FC5\u987B\u586B\u5199\u7684\u53C2\u6570\u503C.
errorcode.invalid_config_value=\u60A8\u7684\u53C2\u6570\u914D\u7F6E\u9519\u8BEF.
errorcode.get_topic_info_fail=\u83B7\u53D6shard\u5217\u8868\u5931\u8D25.
errorcode.write_datahub_fail=\u5199\u6570\u636E\u5931\u8D25.
errorcode.schema_not_match=\u6570\u636E\u683C\u5F0F\u9519\u8BEF.

View File

@ -0,0 +1,5 @@
errorcode.missing_required_value=\u60A8\u7F3A\u5931\u4E86\u5FC5\u987B\u586B\u5199\u7684\u53C2\u6570\u503C.
errorcode.invalid_config_value=\u60A8\u7684\u53C2\u6570\u914D\u7F6E\u9519\u8BEF.
errorcode.get_topic_info_fail=\u83B7\u53D6shard\u5217\u8868\u5931\u8D25.
errorcode.write_datahub_fail=\u5199\u6570\u636E\u5931\u8D25.
errorcode.schema_not_match=\u6570\u636E\u683C\u5F0F\u9519\u8BEF.

View File

@ -0,0 +1,5 @@
errorcode.missing_required_value=\u60A8\u7F3A\u5931\u4E86\u5FC5\u987B\u586B\u5199\u7684\u53C2\u6570\u503C.
errorcode.invalid_config_value=\u60A8\u7684\u53C2\u6570\u914D\u7F6E\u9519\u8BEF.
errorcode.get_topic_info_fail=\u83B7\u53D6shard\u5217\u8868\u5931\u8D25.
errorcode.write_datahub_fail=\u5199\u6570\u636E\u5931\u8D25.
errorcode.schema_not_match=\u6570\u636E\u683C\u5F0F\u9519\u8BEF.

View File

@ -0,0 +1,5 @@
errorcode.missing_required_value=\u60A8\u7F3A\u5931\u4E86\u5FC5\u987B\u586B\u5199\u7684\u53C2\u6570\u503C.
errorcode.invalid_config_value=\u60A8\u7684\u53C2\u6570\u914D\u7F6E\u9519\u8BEF.
errorcode.get_topic_info_fail=\u83B7\u53D6shard\u5217\u8868\u5931\u8D25.
errorcode.write_datahub_fail=\u5199\u6570\u636E\u5931\u8D25.
errorcode.schema_not_match=\u6570\u636E\u683C\u5F0F\u9519\u8BEF.

View File

@ -0,0 +1,9 @@
errorcode.missing_required_value=\u60A8\u7F3A\u5931\u4E86\u5FC5\u987B\u586B\u5199\u7684\u53C2\u6570\u503C.
errorcode.invalid_config_value=\u60A8\u7684\u53C2\u6570\u914D\u7F6E\u9519\u8BEF.
errorcode.get_topic_info_fail=\u83B7\u53D6shard\u5217\u8868\u5931\u8D25.
errorcode.write_datahub_fail=\u5199\u6570\u636E\u5931\u8D25.
errorcode.schema_not_match=\u6570\u636E\u683C\u5F0F\u9519\u8BEF.
errorcode.missing_required_value=您缺失了必須填寫的參數值.
errorcode.invalid_config_value=您的參數配寘錯誤.
errorcode.get_topic_info_fail=獲取shard清單失敗.
errorcode.write_datahub_fail=寫數據失敗.
errorcode.schema_not_match=數據格式錯誤.

View File

@ -0,0 +1,9 @@
errorcode.missing_required_value=\u60A8\u7F3A\u5931\u4E86\u5FC5\u987B\u586B\u5199\u7684\u53C2\u6570\u503C.
errorcode.invalid_config_value=\u60A8\u7684\u53C2\u6570\u914D\u7F6E\u9519\u8BEF.
errorcode.get_topic_info_fail=\u83B7\u53D6shard\u5217\u8868\u5931\u8D25.
errorcode.write_datahub_fail=\u5199\u6570\u636E\u5931\u8D25.
errorcode.schema_not_match=\u6570\u636E\u683C\u5F0F\u9519\u8BEF.
errorcode.missing_required_value=您缺失了必須填寫的參數值.
errorcode.invalid_config_value=您的參數配寘錯誤.
errorcode.get_topic_info_fail=獲取shard清單失敗.
errorcode.write_datahub_fail=寫數據失敗.
errorcode.schema_not_match=數據格式錯誤.

View File

@ -0,0 +1,14 @@
{
"name": "datahubreader",
"parameter": {
"endpoint":"",
"accessId": "",
"accessKey": "",
"project": "",
"topic": "",
"beginDateTime": "20180913121019",
"endDateTime": "20180913121119",
"batchSize": 1024,
"column": []
}
}

View File

@ -0,0 +1,6 @@
{
"name": "datahubreader",
"class": "com.alibaba.datax.plugin.reader.datahubreader.DatahubReader",
"description": "datahub reader",
"developer": "alibaba"
}

79
datahubwriter/pom.xml Normal file
View File

@ -0,0 +1,79 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<parent>
<artifactId>datax-all</artifactId>
<groupId>com.alibaba.datax</groupId>
<version>0.0.1-SNAPSHOT</version>
</parent>
<modelVersion>4.0.0</modelVersion>
<artifactId>datahubwriter</artifactId>
<version>0.0.1-SNAPSHOT</version>
<dependencies>
<dependency>
<groupId>com.alibaba.datax</groupId>
<artifactId>datax-common</artifactId>
<version>${datax-project-version}</version>
<exclusions>
<exclusion>
<artifactId>slf4j-log4j12</artifactId>
<groupId>org.slf4j</groupId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
</dependency>
<dependency>
<groupId>ch.qos.logback</groupId>
<artifactId>logback-classic</artifactId>
</dependency>
<dependency>
<groupId>com.aliyun.datahub</groupId>
<artifactId>aliyun-sdk-datahub</artifactId>
<version>2.21.6-public</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<plugins>
<!-- compiler plugin -->
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>${jdk-version}</source>
<target>${jdk-version}</target>
<encoding>${project-sourceEncoding}</encoding>
</configuration>
</plugin>
<!-- assembly plugin -->
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<descriptors>
<descriptor>src/main/assembly/package.xml</descriptor>
</descriptors>
<finalName>datax</finalName>
</configuration>
<executions>
<execution>
<id>dwzip</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>

View File

@ -0,0 +1,34 @@
<assembly
xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0 http://maven.apache.org/xsd/assembly-1.1.0.xsd">
<id></id>
<formats>
<format>dir</format>
</formats>
<includeBaseDirectory>false</includeBaseDirectory>
<fileSets>
<fileSet>
<directory>src/main/resources</directory>
<includes>
<include>plugin.json</include>
</includes>
<outputDirectory>plugin/writer/datahubwriter</outputDirectory>
</fileSet>
<fileSet>
<directory>target/</directory>
<includes>
<include>datahubwriter-0.0.1-SNAPSHOT.jar</include>
</includes>
<outputDirectory>plugin/writer/datahubwriter</outputDirectory>
</fileSet>
</fileSets>
<dependencySets>
<dependencySet>
<useProjectArtifact>false</useProjectArtifact>
<outputDirectory>plugin/writer/datahubwriter/libs</outputDirectory>
<scope>runtime</scope>
</dependencySet>
</dependencySets>
</assembly>

View File

@ -0,0 +1,43 @@
package com.alibaba.datax.plugin.writer.datahubwriter;
import org.apache.commons.lang3.StringUtils;
import com.alibaba.datax.common.util.Configuration;
import com.alibaba.fastjson2.JSON;
import com.alibaba.fastjson2.TypeReference;
import com.aliyun.datahub.client.DatahubClient;
import com.aliyun.datahub.client.DatahubClientBuilder;
import com.aliyun.datahub.client.auth.Account;
import com.aliyun.datahub.client.auth.AliyunAccount;
import com.aliyun.datahub.client.common.DatahubConfig;
import com.aliyun.datahub.client.http.HttpConfig;
public class DatahubClientHelper {
public static DatahubClient getDatahubClient(Configuration jobConfig) {
String accessId = jobConfig.getNecessaryValue(Key.CONFIG_KEY_ACCESS_ID,
DatahubWriterErrorCode.MISSING_REQUIRED_VALUE);
String accessKey = jobConfig.getNecessaryValue(Key.CONFIG_KEY_ACCESS_KEY,
DatahubWriterErrorCode.MISSING_REQUIRED_VALUE);
String endpoint = jobConfig.getNecessaryValue(Key.CONFIG_KEY_ENDPOINT,
DatahubWriterErrorCode.MISSING_REQUIRED_VALUE);
Account account = new AliyunAccount(accessId, accessKey);
// Whether to enable binary transfer (supported by the DataHub server since version 2.12)
boolean enableBinary = jobConfig.getBool("enableBinary", false);
DatahubConfig datahubConfig = new DatahubConfig(endpoint, account, enableBinary);
// HttpConfig is optional; default values are used when it is not set.
// Enabling LZ4 compression for network transfer is recommended when reading and writing data.
HttpConfig httpConfig = null;
String httpConfigStr = jobConfig.getString("httpConfig");
if (StringUtils.isNotBlank(httpConfigStr)) {
httpConfig = JSON.parseObject(httpConfigStr, new TypeReference<HttpConfig>() {
});
}
DatahubClientBuilder builder = DatahubClientBuilder.newBuilder().setDatahubConfig(datahubConfig);
if (null != httpConfig) {
builder.setHttpConfig(httpConfig);
}
DatahubClient datahubClient = builder.build();
return datahubClient;
}
}
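
A minimal usage sketch of the helper above, assuming the writer's parameter block is handed in as a Configuration. The class name and all literal values below are placeholders, not part of the plugin; the only calls used are ones that appear elsewhere in this diff.

// Hypothetical, standalone usage of DatahubClientHelper; all literal values are placeholders.
import com.alibaba.datax.common.util.Configuration;
import com.alibaba.datax.plugin.writer.datahubwriter.DatahubClientHelper;
import com.aliyun.datahub.client.DatahubClient;
import com.aliyun.datahub.client.model.RecordSchema;

public class DatahubClientHelperSketch {
    public static void main(String[] args) {
        Configuration jobConfig = Configuration.from(
                "{\"endpoint\":\"<datahub endpoint>\","
                + " \"accessId\":\"<access id>\", \"accessKey\":\"<access key>\","
                + " \"project\":\"test_project\", \"topic\":\"test_topic\","
                + " \"enableBinary\":false}");
        DatahubClient client = DatahubClientHelper.getDatahubClient(jobConfig);
        // For example, inspect the topic schema before writing, as DatahubWriter.Task.prepare() does.
        RecordSchema schema = client.getTopic("test_project", "test_topic").getRecordSchema();
        System.out.println(schema.getFields());
    }
}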

View File

@ -0,0 +1,355 @@
package com.alibaba.datax.plugin.writer.datahubwriter;
import com.alibaba.datax.common.element.Column;
import com.alibaba.datax.common.element.Record;
import com.alibaba.datax.common.exception.DataXException;
import com.alibaba.datax.common.plugin.RecordReceiver;
import com.alibaba.datax.common.spi.Writer;
import com.alibaba.datax.common.util.Configuration;
import com.alibaba.datax.common.util.DataXCaseEnvUtil;
import com.alibaba.datax.common.util.RetryUtil;
import com.alibaba.fastjson2.JSON;
import com.aliyun.datahub.client.DatahubClient;
import com.aliyun.datahub.client.model.FieldType;
import com.aliyun.datahub.client.model.GetTopicResult;
import com.aliyun.datahub.client.model.ListShardResult;
import com.aliyun.datahub.client.model.PutErrorEntry;
import com.aliyun.datahub.client.model.PutRecordsResult;
import com.aliyun.datahub.client.model.RecordEntry;
import com.aliyun.datahub.client.model.RecordSchema;
import com.aliyun.datahub.client.model.RecordType;
import com.aliyun.datahub.client.model.ShardEntry;
import com.aliyun.datahub.client.model.ShardState;
import com.aliyun.datahub.client.model.TupleRecordData;
import org.apache.commons.lang3.StringUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.concurrent.Callable;
public class DatahubWriter extends Writer {
/**
* Methods of the Job class run only once; methods of the Task class are started by the framework as multiple Task threads and run in parallel.
* <p/>
* The overall Writer execution flow is:
* <pre>
* Job: init --> prepare --> split
*
* Task: init --> prepare --> startWrite --> post --> destroy
* Task: init --> prepare --> startWrite --> post --> destroy
*
* Job: post --> destroy
* </pre>
*/
public static class Job extends Writer.Job {
private static final Logger LOG = LoggerFactory
.getLogger(Job.class);
private Configuration jobConfig = null;
@Override
public void init() {
this.jobConfig = super.getPluginJobConf();
jobConfig.getNecessaryValue(Key.CONFIG_KEY_ENDPOINT, DatahubWriterErrorCode.MISSING_REQUIRED_VALUE);
jobConfig.getNecessaryValue(Key.CONFIG_KEY_ACCESS_ID, DatahubWriterErrorCode.MISSING_REQUIRED_VALUE);
jobConfig.getNecessaryValue(Key.CONFIG_KEY_ACCESS_KEY, DatahubWriterErrorCode.MISSING_REQUIRED_VALUE);
jobConfig.getNecessaryValue(Key.CONFIG_KEY_PROJECT, DatahubWriterErrorCode.MISSING_REQUIRED_VALUE);
jobConfig.getNecessaryValue(Key.CONFIG_KEY_TOPIC, DatahubWriterErrorCode.MISSING_REQUIRED_VALUE);
}
@Override
public void prepare() {
String project = jobConfig.getNecessaryValue(Key.CONFIG_KEY_PROJECT,
DatahubWriterErrorCode.MISSING_REQUIRED_VALUE);
String topic = jobConfig.getNecessaryValue(Key.CONFIG_KEY_TOPIC,
DatahubWriterErrorCode.MISSING_REQUIRED_VALUE);
RecordType recordType = null;
DatahubClient client = DatahubClientHelper.getDatahubClient(this.jobConfig);
try {
GetTopicResult getTopicResult = client.getTopic(project, topic);
recordType = getTopicResult.getRecordType();
} catch (Exception e) {
LOG.warn("get topic type error: {}", e.getMessage());
}
if (null != recordType) {
if (recordType == RecordType.BLOB) {
throw DataXException.asDataXException(DatahubWriterErrorCode.WRITE_DATAHUB_FAIL,
"DatahubWriter only support 'Tuple' RecordType now, but your RecordType is 'BLOB'");
}
}
}
@Override
public List<Configuration> split(int mandatoryNumber) {
List<Configuration> configs = new ArrayList<Configuration>();
for (int i = 0; i < mandatoryNumber; ++i) {
configs.add(jobConfig.clone());
}
return configs;
}
@Override
public void post() {}
@Override
public void destroy() {}
}
public static class Task extends Writer.Task {
private static final Logger LOG = LoggerFactory
.getLogger(Task.class);
private static final List<String> FATAL_ERRORS_DEFAULT = Arrays.asList(
"InvalidParameterM",
"MalformedRecord",
"INVALID_SHARDID",
"NoSuchTopic",
"NoSuchShard"
);
private Configuration taskConfig;
private DatahubClient client;
private String project;
private String topic;
private List<String> shards;
private int maxCommitSize;
private int maxRetryCount;
private RecordSchema schema;
private long retryInterval;
private Random random;
private List<String> column;
private List<Integer> columnIndex;
private boolean enableColumnConfig;
private List<String> fatalErrors;
@Override
public void init() {
this.taskConfig = super.getPluginJobConf();
project = taskConfig.getNecessaryValue(Key.CONFIG_KEY_PROJECT, DatahubWriterErrorCode.MISSING_REQUIRED_VALUE);
topic = taskConfig.getNecessaryValue(Key.CONFIG_KEY_TOPIC, DatahubWriterErrorCode.MISSING_REQUIRED_VALUE);
maxCommitSize = taskConfig.getInt(Key.CONFIG_KEY_MAX_COMMIT_SIZE, 1024*1024);
maxRetryCount = taskConfig.getInt(Key.CONFIG_KEY_MAX_RETRY_COUNT, 500);
this.retryInterval = taskConfig.getInt(Key.RETRY_INTERVAL, 650);
this.random = new Random();
this.column = this.taskConfig.getList(Key.CONFIG_KEY_COLUMN, String.class);
// ["*"]
if (null != this.column && 1 == this.column.size()) {
if (StringUtils.equals("*", this.column.get(0))) {
this.column = null;
}
}
this.columnIndex = new ArrayList<Integer>();
// Keep a config switch here as a safeguard
this.enableColumnConfig = this.taskConfig.getBool("enableColumnConfig", true);
this.fatalErrors = this.taskConfig.getList("fatalErrors", Task.FATAL_ERRORS_DEFAULT, String.class);
this.client = DatahubClientHelper.getDatahubClient(this.taskConfig);
}
@Override
public void prepare() {
final String shardIdConfig = this.taskConfig.getString(Key.CONFIG_KEY_SHARD_ID);
this.shards = new ArrayList<String>();
try {
RetryUtil.executeWithRetry(new Callable<Void>() {
@Override
public Void call() throws Exception {
ListShardResult result = client.listShard(project, topic);
if (StringUtils.isNotBlank(shardIdConfig)) {
shards.add(shardIdConfig);
} else {
for (ShardEntry shard : result.getShards()) {
if (shard.getState() == ShardState.ACTIVE || shard.getState() == ShardState.OPENING) {
shards.add(shard.getShardId());
}
}
}
schema = client.getTopic(project, topic).getRecordSchema();
return null;
}
}, DataXCaseEnvUtil.getRetryTimes(5), DataXCaseEnvUtil.getRetryInterval(10000L), DataXCaseEnvUtil.getRetryExponential(false));
} catch (Exception e) {
throw DataXException.asDataXException(DatahubWriterErrorCode.GET_TOPOIC_INFO_FAIL,
"get topic info failed", e);
}
LOG.info("datahub topic {} shard to write: {}", this.topic, JSON.toJSONString(this.shards));
LOG.info("datahub topic {} has schema: {}", this.topic, JSON.toJSONString(this.schema));
// Compute the write order for DataHub from the schema order and the user-configured columns, so that column reordering is supported.
// From here on, columnIndex determines the order in which fields are written to DataHub.
int totalSize = this.schema.getFields().size();
if (null != this.column && !this.column.isEmpty() && this.enableColumnConfig) {
for (String eachCol : this.column) {
int indexFound = -1;
for (int i = 0; i < totalSize; i++) {
// warn: case-insensitive comparison
if (StringUtils.equalsIgnoreCase(eachCol, this.schema.getField(i).getName())) {
indexFound = i;
break;
}
}
if (indexFound >= 0) {
this.columnIndex.add(indexFound);
} else {
throw DataXException.asDataXException(DatahubWriterErrorCode.SCHEMA_NOT_MATCH,
String.format("can not find column %s in datahub topic %s", eachCol, this.topic));
}
}
} else {
for (int i = 0; i < totalSize; i++) {
this.columnIndex.add(i);
}
}
}
@Override
public void startWrite(RecordReceiver recordReceiver) {
Record record;
List<RecordEntry> records = new ArrayList<RecordEntry>();
String shardId = null;
if (1 == this.shards.size()) {
shardId = shards.get(0);
} else {
shardId = shards.get(this.random.nextInt(shards.size()));
}
int commitSize = 0;
try {
while ((record = recordReceiver.getFromReader()) != null) {
RecordEntry dhRecord = convertRecord(record, shardId);
if (dhRecord != null) {
records.add(dhRecord);
}
commitSize += record.getByteSize();
if (commitSize >= maxCommitSize) {
commit(records);
records.clear();
commitSize = 0;
if (1 == this.shards.size()) {
shardId = shards.get(0);
} else {
shardId = shards.get(this.random.nextInt(shards.size()));
}
}
}
if (commitSize > 0) {
commit(records);
}
} catch (Exception e) {
throw DataXException.asDataXException(
DatahubWriterErrorCode.WRITE_DATAHUB_FAIL, e);
}
}
@Override
public void post() {}
@Override
public void destroy() {}
private void commit(List<RecordEntry> records) throws InterruptedException {
PutRecordsResult result = client.putRecords(project, topic, records);
if (result.getFailedRecordCount() > 0) {
for (int i = 0; i < maxRetryCount; ++i) {
boolean limitExceededMessagePrinted = false;
for (PutErrorEntry error : result.getPutErrorEntries()) {
// For LimitExceeded errors, do not log the message once for every record
if (StringUtils.equalsIgnoreCase("LimitExceeded", error.getErrorcode())) {
if (!limitExceededMessagePrinted) {
LOG.warn("write record error, request id: {}, error code: {}, error message: {}",
result.getRequestId(), error.getErrorcode(), error.getMessage());
limitExceededMessagePrinted = true;
}
} else {
LOG.error("write record error, request id: {}, error code: {}, error message: {}",
result.getRequestId(), error.getErrorcode(), error.getMessage());
}
if (this.fatalErrors.contains(error.getErrorcode())) {
throw DataXException.asDataXException(
DatahubWriterErrorCode.WRITE_DATAHUB_FAIL,
error.getMessage());
}
}
if (this.retryInterval >= 0) {
Thread.sleep(this.retryInterval);
} else {
Thread.sleep(new Random().nextInt(700) + 300);
}
result = client.putRecords(project, topic, result.getFailedRecords());
if (result.getFailedRecordCount() == 0) {
return;
}
}
throw DataXException.asDataXException(
DatahubWriterErrorCode.WRITE_DATAHUB_FAIL,
"write datahub failed");
}
}
private RecordEntry convertRecord(Record dxRecord, String shardId) {
try {
RecordEntry dhRecord = new RecordEntry();
dhRecord.setShardId(shardId);
TupleRecordData data = new TupleRecordData(this.schema);
for (int i = 0; i < this.columnIndex.size(); ++i) {
int orderInSchema = this.columnIndex.get(i);
FieldType type = this.schema.getField(orderInSchema).getType();
Column column = dxRecord.getColumn(i);
switch (type) {
case BIGINT:
data.setField(orderInSchema, column.asLong());
break;
case DOUBLE:
data.setField(orderInSchema, column.asDouble());
break;
case STRING:
data.setField(orderInSchema, column.asString());
break;
case BOOLEAN:
data.setField(orderInSchema, column.asBoolean());
break;
case TIMESTAMP:
if (null == column.asDate()) {
data.setField(orderInSchema, null);
} else {
data.setField(orderInSchema, column.asDate().getTime() * 1000);
}
break;
case DECIMAL:
// warn
data.setField(orderInSchema, column.asBigDecimal());
break;
case INTEGER:
data.setField(orderInSchema, column.asLong());
break;
case FLOAT:
data.setField(orderInSchema, column.asDouble());
break;
case TINYINT:
data.setField(orderInSchema, column.asLong());
break;
case SMALLINT:
data.setField(orderInSchema, column.asLong());
break;
default:
throw DataXException.asDataXException(
DatahubWriterErrorCode.SCHEMA_NOT_MATCH,
String.format("does not support type: %s", type));
}
}
dhRecord.setRecordData(data);
return dhRecord;
} catch (Exception e) {
super.getTaskPluginCollector().collectDirtyRecord(dxRecord, e, "convert record failed");
}
return null;
}
}
}
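
Worth noting: the TIMESTAMP branch in convertRecord writes epoch microseconds (getTime() * 1000), which is exactly what the reader's default MICROSECOND handling earlier in this diff undoes. A standalone sketch of that round trip, with a hypothetical class name:

// Standalone sketch (hypothetical class name) of the writer/reader timestamp contract.
import java.util.Date;

public class TimestampRoundTripSketch {
    public static void main(String[] args) {
        Date source = new Date();                    // what Column.asDate() would return
        long micros = source.getTime() * 1000L;      // DatahubWriter stores TIMESTAMP as epoch microseconds
        Date restored = new Date(micros / 1000);     // the reader's MICROSECOND default converts back
        System.out.println(source.equals(restored)); // true: millisecond precision is preserved
    }
}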

View File

@ -0,0 +1,37 @@
package com.alibaba.datax.plugin.writer.datahubwriter;
import com.alibaba.datax.common.spi.ErrorCode;
import com.alibaba.datax.common.util.MessageSource;
public enum DatahubWriterErrorCode implements ErrorCode {
MISSING_REQUIRED_VALUE("DatahubWriter-01", MessageSource.loadResourceBundle(DatahubWriterErrorCode.class).message("errorcode.missing_required_value")),
INVALID_CONFIG_VALUE("DatahubWriter-02", MessageSource.loadResourceBundle(DatahubWriterErrorCode.class).message("errorcode.invalid_config_value")),
GET_TOPOIC_INFO_FAIL("DatahubWriter-03", MessageSource.loadResourceBundle(DatahubWriterErrorCode.class).message("errorcode.get_topic_info_fail")),
WRITE_DATAHUB_FAIL("DatahubWriter-04", MessageSource.loadResourceBundle(DatahubWriterErrorCode.class).message("errorcode.write_datahub_fail")),
SCHEMA_NOT_MATCH("DatahubWriter-05", MessageSource.loadResourceBundle(DatahubWriterErrorCode.class).message("errorcode.schema_not_match")),
;
private final String code;
private final String description;
private DatahubWriterErrorCode(String code, String description) {
this.code = code;
this.description = description;
}
@Override
public String getCode() {
return this.code;
}
@Override
public String getDescription() {
return this.description;
}
@Override
public String toString() {
return String.format("Code:[%s], Description:[%s]. ", this.code,
this.description);
}
}

View File

@ -0,0 +1,26 @@
package com.alibaba.datax.plugin.writer.datahubwriter;
public final class Key {
/**
* Configuration keys used by this plugin that must be provided by the plugin user.
*/
public static final String CONFIG_KEY_ENDPOINT = "endpoint";
public static final String CONFIG_KEY_ACCESS_ID = "accessId";
public static final String CONFIG_KEY_ACCESS_KEY = "accessKey";
public static final String CONFIG_KEY_PROJECT = "project";
public static final String CONFIG_KEY_TOPIC = "topic";
public static final String CONFIG_KEY_WRITE_MODE = "mode";
public static final String CONFIG_KEY_SHARD_ID = "shardId";
public static final String CONFIG_KEY_MAX_COMMIT_SIZE = "maxCommitSize";
public static final String CONFIG_KEY_MAX_RETRY_COUNT = "maxRetryCount";
public static final String CONFIG_VALUE_SEQUENCE_MODE = "sequence";
public static final String CONFIG_VALUE_RANDOM_MODE = "random";
public final static String MAX_RETRY_TIME = "maxRetryTime";
public final static String RETRY_INTERVAL = "retryInterval";
public final static String CONFIG_KEY_COLUMN = "column";
}

View File

@ -0,0 +1,5 @@
errorcode.missing_required_value=\u60A8\u7F3A\u5931\u4E86\u5FC5\u987B\u586B\u5199\u7684\u53C2\u6570\u503C.
errorcode.invalid_config_value=\u60A8\u7684\u53C2\u6570\u914D\u7F6E\u9519\u8BEF.
errorcode.get_topic_info_fail=\u83B7\u53D6shard\u5217\u8868\u5931\u8D25.
errorcode.write_datahub_fail=\u5199\u6570\u636E\u5931\u8D25.
errorcode.schema_not_match=\u6570\u636E\u683C\u5F0F\u9519\u8BEF.

View File

@ -0,0 +1,5 @@
errorcode.missing_required_value=\u60A8\u7F3A\u5931\u4E86\u5FC5\u987B\u586B\u5199\u7684\u53C2\u6570\u503C.
errorcode.invalid_config_value=\u60A8\u7684\u53C2\u6570\u914D\u7F6E\u9519\u8BEF.
errorcode.get_topic_info_fail=\u83B7\u53D6shard\u5217\u8868\u5931\u8D25.
errorcode.write_datahub_fail=\u5199\u6570\u636E\u5931\u8D25.
errorcode.schema_not_match=\u6570\u636E\u683C\u5F0F\u9519\u8BEF.

View File

@ -0,0 +1,5 @@
errorcode.missing_required_value=\u60A8\u7F3A\u5931\u4E86\u5FC5\u987B\u586B\u5199\u7684\u53C2\u6570\u503C.
errorcode.invalid_config_value=\u60A8\u7684\u53C2\u6570\u914D\u7F6E\u9519\u8BEF.
errorcode.get_topic_info_fail=\u83B7\u53D6shard\u5217\u8868\u5931\u8D25.
errorcode.write_datahub_fail=\u5199\u6570\u636E\u5931\u8D25.
errorcode.schema_not_match=\u6570\u636E\u683C\u5F0F\u9519\u8BEF.

View File

@ -0,0 +1,5 @@
errorcode.missing_required_value=\u60A8\u7F3A\u5931\u4E86\u5FC5\u987B\u586B\u5199\u7684\u53C2\u6570\u503C.
errorcode.invalid_config_value=\u60A8\u7684\u53C2\u6570\u914D\u7F6E\u9519\u8BEF.
errorcode.get_topic_info_fail=\u83B7\u53D6shard\u5217\u8868\u5931\u8D25.
errorcode.write_datahub_fail=\u5199\u6570\u636E\u5931\u8D25.
errorcode.schema_not_match=\u6570\u636E\u683C\u5F0F\u9519\u8BEF.

View File

@ -0,0 +1,9 @@
errorcode.missing_required_value=\u60A8\u7F3A\u5931\u4E86\u5FC5\u987B\u586B\u5199\u7684\u53C2\u6570\u503C.
errorcode.invalid_config_value=\u60A8\u7684\u53C2\u6570\u914D\u7F6E\u9519\u8BEF.
errorcode.get_topic_info_fail=\u83B7\u53D6shard\u5217\u8868\u5931\u8D25.
errorcode.write_datahub_fail=\u5199\u6570\u636E\u5931\u8D25.
errorcode.schema_not_match=\u6570\u636E\u683C\u5F0F\u9519\u8BEF.
errorcode.missing_required_value=您缺失了必須填寫的參數值.
errorcode.invalid_config_value=您的參數配寘錯誤.
errorcode.get_topic_info_fail=獲取shard清單失敗.
errorcode.write_datahub_fail=寫數據失敗.
errorcode.schema_not_match=數據格式錯誤.

View File

@ -0,0 +1,9 @@
errorcode.missing_required_value=\u60A8\u7F3A\u5931\u4E86\u5FC5\u987B\u586B\u5199\u7684\u53C2\u6570\u503C.
errorcode.invalid_config_value=\u60A8\u7684\u53C2\u6570\u914D\u7F6E\u9519\u8BEF.
errorcode.get_topic_info_fail=\u83B7\u53D6shard\u5217\u8868\u5931\u8D25.
errorcode.write_datahub_fail=\u5199\u6570\u636E\u5931\u8D25.
errorcode.schema_not_match=\u6570\u636E\u683C\u5F0F\u9519\u8BEF.
errorcode.missing_required_value=您缺失了必須填寫的參數值.
errorcode.invalid_config_value=您的參數配寘錯誤.
errorcode.get_topic_info_fail=獲取shard清單失敗.
errorcode.write_datahub_fail=寫數據失敗.
errorcode.schema_not_match=數據格式錯誤.

View File

@ -0,0 +1,14 @@
{
"name": "datahubwriter",
"parameter": {
"endpoint":"",
"accessId": "",
"accessKey": "",
"project": "",
"topic": "",
"mode": "random",
"shardId": "",
"maxCommitSize": 524288,
"maxRetryCount": 500
}
}

View File

@ -0,0 +1,6 @@
{
"name": "datahubwriter",
"class": "com.alibaba.datax.plugin.writer.datahubwriter.DatahubWriter",
"description": "datahub writer",
"developer": "alibaba"
}

View File

@ -0,0 +1,20 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>com.alibaba.datax</groupId>
<artifactId>datax-example</artifactId>
<version>0.0.1-SNAPSHOT</version>
</parent>
<artifactId>datax-example-core</artifactId>
<properties>
<maven.compiler.source>8</maven.compiler.source>
<maven.compiler.target>8</maven.compiler.target>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
</project>

View File

@ -0,0 +1,26 @@
package com.alibaba.datax.example;
import com.alibaba.datax.common.util.Configuration;
import com.alibaba.datax.core.Engine;
import com.alibaba.datax.example.util.ExampleConfigParser;
/**
* {@code Date} 2023/8/6 11:22
*
* @author fuyouj
*/
public class ExampleContainer {
/**
* The start entry point exposed by the example module.
* It is recommended to read datax-example/doc/README.MD before use.
* @param jobPath absolute path of the job json file
*/
public static void start(String jobPath) {
Configuration configuration = ExampleConfigParser.parse(jobPath);
Engine engine = new Engine();
engine.start(configuration);
}
}

View File

@ -0,0 +1,23 @@
package com.alibaba.datax.example;
import com.alibaba.datax.example.util.PathUtil;
/**
* @author fuyouj
*/
public class Main {
/**
* 1. Add the plugin you want to debug as a dependency in this example module's pom file.
* You can open this module's pom file directly to see how streamreader and streamwriter are introduced.
* 2. Specify your job file here.
*/
public static void main(String[] args) {
String classPathJobPath = "/job/stream2stream.json";
String absJobPath = PathUtil.getAbsolutePathFromClassPath(classPathJobPath);
ExampleContainer.start(absJobPath);
}
}

View File

@ -0,0 +1,154 @@
package com.alibaba.datax.example.util;
import com.alibaba.datax.common.exception.DataXException;
import com.alibaba.datax.common.util.Configuration;
import com.alibaba.datax.core.util.ConfigParser;
import com.alibaba.datax.core.util.FrameworkErrorCode;
import com.alibaba.datax.core.util.container.CoreConstant;
import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.nio.file.Paths;
import java.util.*;
/**
* @author fuyouj
*/
public class ExampleConfigParser {
private static final String CORE_CONF = "/example/conf/core.json";
private static final String PLUGIN_DESC_FILE = "plugin.json";
/**
* Given the path of a job configuration, ConfigParser parses the complete Job, Plugin and Core information and returns it as a Configuration.
* Unlike the Core ConfigParser, the core and plugin configs here do not depend on the packaged datax.home; instead, the target directories produced by compilation are scanned.
*/
public static Configuration parse(final String jobPath) {
Configuration configuration = ConfigParser.parseJobConfig(jobPath);
configuration.merge(coreConfig(),
false);
Map<String, String> pluginTypeMap = new HashMap<>();
String readerName = configuration.getString(CoreConstant.DATAX_JOB_CONTENT_READER_NAME);
String writerName = configuration.getString(CoreConstant.DATAX_JOB_CONTENT_WRITER_NAME);
pluginTypeMap.put(readerName, "reader");
pluginTypeMap.put(writerName, "writer");
Configuration pluginsDescConfig = parsePluginsConfig(pluginTypeMap);
configuration.merge(pluginsDescConfig, false);
return configuration;
}
private static Configuration parsePluginsConfig(Map<String, String> pluginTypeMap) {
Configuration configuration = Configuration.newDefault();
//The original plan was to obtain the working directory via user.dir to scan for plugins,
//but user.dir is not fully deterministic across environments, so that approach was abandoned.
for (File basePackage : runtimeBasePackages()) {
if (pluginTypeMap.isEmpty()) {
break;
}
scanPluginByPackage(basePackage, configuration, basePackage.listFiles(), pluginTypeMap);
}
if (!pluginTypeMap.isEmpty()) {
String failedPlugin = pluginTypeMap.keySet().toString();
String message = "\nplugin %s load failed ry to analyze the reasons from the following aspects.。\n" +
"1: Check if the name of the plugin is spelled correctly, and verify whether DataX supports this plugin\n" +
"2Verify if the <resource></resource> tag has been added under <build></build> section in the pom file of the relevant plugin.\n<resource>" +
" <directory>src/main/resources</directory>\n" +
" <includes>\n" +
" <include>**/*.*</include>\n" +
" </includes>\n" +
" <filtering>true</filtering>\n" +
" </resource>\n [Refer to the streamreader pom file] \n" +
"3: Check that the datax-yourPlugin-example module imported your test plugin";
message = String.format(message, failedPlugin);
throw DataXException.asDataXException(FrameworkErrorCode.PLUGIN_INIT_ERROR, message);
}
return configuration;
}
/**
* Obtain the compiled output directories of the program via the classLoader.
*
* @return File[/datax-example/target/classes,xxReader/target/classes,xxWriter/target/classes]
*/
private static File[] runtimeBasePackages() {
List<File> basePackages = new ArrayList<>();
ClassLoader classLoader = Thread.currentThread().getContextClassLoader();
Enumeration<URL> resources = null;
try {
resources = classLoader.getResources("");
} catch (IOException e) {
throw DataXException.asDataXException(e.getMessage());
}
while (resources.hasMoreElements()) {
URL resource = resources.nextElement();
File file = new File(resource.getFile());
if (file.isDirectory()) {
basePackages.add(file);
}
}
return basePackages.toArray(new File[0]);
}
/**
* @param packageFile the compiled target/classes root directory; setting the plugin URL to this root once a plugin is found is the safest approach
* @param configuration pluginConfig
* @param files files to be scanned
* @param needPluginTypeMap plugins that are still required
*/
private static void scanPluginByPackage(File packageFile,
Configuration configuration,
File[] files,
Map<String, String> needPluginTypeMap) {
if (files == null) {
return;
}
for (File file : files) {
if (file.isFile() && PLUGIN_DESC_FILE.equals(file.getName())) {
Configuration pluginDesc = Configuration.from(file);
String descPluginName = pluginDesc.getString("name", "");
if (needPluginTypeMap.containsKey(descPluginName)) {
String type = needPluginTypeMap.get(descPluginName);
configuration.merge(parseOnePlugin(packageFile.getAbsolutePath(), type, descPluginName, pluginDesc), false);
needPluginTypeMap.remove(descPluginName);
}
} else {
scanPluginByPackage(packageFile, configuration, file.listFiles(), needPluginTypeMap);
}
}
}
private static Configuration parseOnePlugin(String packagePath,
String pluginType,
String pluginName,
Configuration pluginDesc) {
//Set "path" for compatibility with the jarLoader (URLClassLoader) loading mechanism
pluginDesc.set("path", packagePath);
Configuration pluginConfInJob = Configuration.newDefault();
pluginConfInJob.set(
String.format("plugin.%s.%s", pluginType, pluginName),
pluginDesc.getInternal());
return pluginConfInJob;
}
private static Configuration coreConfig() {
try {
URL resource = ExampleConfigParser.class.getResource(CORE_CONF);
return Configuration.from(Paths.get(resource.toURI()).toFile());
} catch (Exception ignore) {
throw DataXException.asDataXException("Failed to load the configuration file core.json. " +
"Please check whether /example/conf/core.json exists!");
}
}
}

View File

@ -0,0 +1,26 @@
package com.alibaba.datax.example.util;
import com.alibaba.datax.common.exception.DataXException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.file.Paths;
/**
* @author fuyouj
*/
public class PathUtil {
public static String getAbsolutePathFromClassPath(String path) {
URL resource = PathUtil.class.getResource(path);
try {
assert resource != null;
URI uri = resource.toURI();
return Paths.get(uri).toString();
} catch (NullPointerException | URISyntaxException e) {
throw DataXException.asDataXException("path error,please check whether the path is correct");
}
}
}

View File

@ -0,0 +1,60 @@
{
"entry": {
"jvm": "-Xms1G -Xmx1G",
"environment": {}
},
"common": {
"column": {
"datetimeFormat": "yyyy-MM-dd HH:mm:ss",
"timeFormat": "HH:mm:ss",
"dateFormat": "yyyy-MM-dd",
"extraFormats":["yyyyMMdd"],
"timeZone": "GMT+8",
"encoding": "utf-8"
}
},
"core": {
"dataXServer": {
"address": "http://localhost:7001/api",
"timeout": 10000,
"reportDataxLog": false,
"reportPerfLog": false
},
"transport": {
"channel": {
"class": "com.alibaba.datax.core.transport.channel.memory.MemoryChannel",
"speed": {
"byte": -1,
"record": -1
},
"flowControlInterval": 20,
"capacity": 512,
"byteCapacity": 67108864
},
"exchanger": {
"class": "com.alibaba.datax.core.plugin.BufferedRecordExchanger",
"bufferSize": 32
}
},
"container": {
"job": {
"reportInterval": 10000
},
"taskGroup": {
"channel": 5
},
"trace": {
"enable": "false"
}
},
"statistics": {
"collector": {
"plugin": {
"taskClass": "com.alibaba.datax.core.statistics.plugin.task.StdoutPluginCollector",
"maxDirtyNumber": 10
}
}
}
}
}

View File

@ -0,0 +1,19 @@
package com.alibaba.datax.example.util;
import org.junit.Assert;
import org.junit.Test;
/**
* {@code Author} FuYouJ
* {@code Date} 2023/8/19 21:38
*/
public class PathUtilTest {
@Test
public void testParseClassPathFile() {
String path = "/pathTest.json";
String absolutePathFromClassPath = PathUtil.getAbsolutePathFromClassPath(path);
Assert.assertNotNull(absolutePathFromClassPath);
}
}

Some files were not shown because too many files have changed in this diff.