DataX/neo4jwriter/doc/neo4jwriter.md
2023-07-07 21:01:38 +08:00

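# DataX neo4jWriter Plugin Documentation
## Overview
The main approaches to bulk import into Neo4j today are Cypher CREATE, LOAD CSV, and the Batch Import tools provided officially or by third parties. LOAD CSV is suited to node counts below roughly 100,000, and Batch Import requires taking the database offline. To write data without downtime, Cypher is the best option.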
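## Supported Versions
Neo4j 4 and Neo4j 5 are supported. For Neo4j 3, you need to downgrade the driver to the corresponding version and compile the plugin yourself.
## How It Works
DataX records are converted into objects that the Neo4j driver recognizes, and are then inserted in batches using the `unwind` syntax.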
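
The batching idea can be sketched in a few lines. This Python sketch is illustrative only (the plugin itself is written in Java); `to_batches` and its arguments are hypothetical names, and the actual driver call is left out. It groups positional rows into chunks of maps, each of which would be bound to the `$batch` parameter of one `unwind` statement.

```python
from typing import Dict, Iterator, List, Sequence

# Example statement; $batch would be bound to one chunk per execution.
CYPHER = "unwind $batch as row create(p) set p.name = row.name"


def to_batches(rows: Sequence[Sequence[object]],
               names: Sequence[str],
               batch_size: int) -> Iterator[List[Dict[str, object]]]:
    """Turn positional rows into maps keyed by property name, grouped into chunks."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    chunk: List[Dict[str, object]] = []
    for row in rows:
        chunk.append(dict(zip(names, row)))
        if len(chunk) == batch_size:
            yield chunk
            chunk = []
    if chunk:  # flush the final, possibly smaller, chunk
        yield chunk


# Three rows with batch_size=2 give one full chunk and one partial chunk.
batches = list(to_batches([["Ann"], ["Bob"], ["Cid"]], ["name"], batch_size=2))
```

Sending one statement per chunk rather than one per record is what keeps the number of round trips and transactions low.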
## Configuration
### Configuration Options
| Option                         | Description                                      | Required | Default | Example                                              |
|:-------------------------------|--------------------------------------------------|----------|---------|------------------------------------------------------|
| database                       | Database name                                    | Yes      | -       | neo4j                                                |
| uri                            | Database connection URI                          | Yes      | -       | bolt://localhost:7687                                |
| username                       | User name                                        | Yes      | -       | neo4j                                                |
| password                       | Password                                         | Yes      | -       | neo4j                                                |
| bearerToken                    | Authentication-related                           | No       | -       | -                                                    |
| kerberosTicket                 | Authentication-related                           | No       | -       | -                                                    |
| cypher                         | Cypher statement used for the sync               | Yes      | -       | unwind $batch as row create(p) set p.name = row.name |
| batchDataVariableName          | Name of the data variable consumed by `unwind`   |          |         | batch                                                |
| properties                     | Names and types of the properties written to Neo4j | Yes    | -       | See the examples below                               |
| batchSize                      | Number of records written per batch              | No       | 1000    |                                                      |
| maxTransactionRetryTimeSeconds | Maximum transaction run time                     | No       | 30 s    | 30                                                   |
| maxConnectionTimeoutSeconds    | Maximum driver connection time                   | No       | 30 s    | 30                                                   |
| retryTimes                     | Number of retries after an error                 | No       | 3       | 3                                                    |
| retrySleepMills                | Wait time after a failed retry                   | No       | 3 s     | 3                                                    |
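
A quick sanity check of a `parameter` block against the table above can catch missing or misspelled keys before a job is submitted. The sketch below is a hypothetical helper, not part of the plugin; the required and known key sets are copied from the table, and `check_parameter` is an illustrative name.

```python
# Keys marked as required in the table above.
REQUIRED = ("database", "uri", "username", "password", "cypher", "properties")

# Every key listed in the table above.
KNOWN = set(REQUIRED) | {
    "bearerToken", "kerberosTicket", "batchDataVariableName", "batchSize",
    "maxTransactionRetryTimeSeconds", "maxConnectionTimeoutSeconds",
    "retryTimes", "retrySleepMills",
}


def check_parameter(parameter: dict) -> tuple:
    """Return (missing required keys, keys not listed in the table)."""
    missing = [key for key in REQUIRED if key not in parameter]
    unknown = sorted(key for key in parameter if key not in KNOWN)
    return missing, unknown


# A misspelled key shows up as unknown; absent required keys show up as missing.
missing, unknown = check_parameter({
    "uri": "bolt://localhost:7687",
    "username": "neo4j",
    "password": "neo4j",
    "database": "neo4j",
    "batch_size": "33",
})
```

Here `missing` lists `cypher` and `properties`, and `unknown` lists `batch_size`.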
### Supported Data Types
> Type names are case-insensitive in the configuration.
```
BOOLEAN,
STRING,
LONG,
SHORT,
INTEGER,
DOUBLE,
FLOAT,
LOCAL_DATE,
LOCAL_TIME,
LOCAL_DATE_TIME,
LIST,
// MAP values can be read with the . property-access expression
MAP,
CHAR_ARRAY,
BYTE_ARRAY,
BOOLEAN_ARRAY,
STRING_ARRAY,
LONG_ARRAY,
INT_ARRAY,
SHORT_ARRAY,
DOUBLE_ARRAY,
FLOAT_ARRAY,
Object_ARRAY
```
### Writing Nodes
Here is an example that writes nodes with properties of many different types. You can run it from my test methods.
```json
"writer": {
    "name": "neo4jWriter",
    "parameter": {
        "uri": "neo4j://localhost:7687",
        "username": "neo4j",
        "password": "Test@12343",
        "database": "neo4j",
        "cypher": "unwind $batch as row create(p:Person) set p.pbool = row.pbool,p.pstring = row.pstring,p.plong = row.plong,p.pshort = row.pshort,p.pdouble=row.pdouble,p.pstringarr=row.pstringarr,p.plocaldate=row.plocaldate",
        "batchDataVariableName": "batch",
        "batchSize": "33",
        "properties": [
            {
                "name": "pbool",
                "type": "BOOLEAN"
            },
            {
                "name": "pstring",
                "type": "STRING"
            },
            {
                "name": "plong",
                "type": "LONG"
            },
            {
                "name": "pshort",
                "type": "SHORT"
            },
            {
                "name": "pdouble",
                "type": "DOUBLE"
            },
            {
                "name": "pstringarr",
                "type": "STRING_ARRAY",
                "split": ","
            },
            {
                "name": "plocaldate",
                "type": "LOCAL_DATE",
                "dateFormat": "yyyy-MM-dd"
            }
        ]
    }
}
```
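
To illustrate how a `properties` list like the one above might be applied to one reader row, here is a minimal Python sketch. It is not the plugin's code: `convert`, `row_to_map`, and `java_to_strptime` are hypothetical helpers, only a handful of types are covered, the date pattern translation handles only `yyyy`, `MM`, and `dd`, and the fallback values for `split` and `dateFormat` are assumptions.

```python
from datetime import datetime


def java_to_strptime(pattern: str) -> str:
    """Translate a small subset of Java date patterns (assumption: yyyy, MM, dd only)."""
    return pattern.replace("yyyy", "%Y").replace("MM", "%m").replace("dd", "%d")


def convert(value: str, prop: dict):
    """Convert one column value according to its property definition."""
    kind = prop["type"].upper()  # type names are case-insensitive
    if kind == "STRING":
        return value
    if kind == "BOOLEAN":
        return value.strip().lower() == "true"
    if kind in ("LONG", "SHORT", "INTEGER"):
        return int(value)
    if kind in ("DOUBLE", "FLOAT"):
        return float(value)
    if kind == "STRING_ARRAY":
        return value.split(prop.get("split", ","))  # default separator is an assumption
    if kind == "LOCAL_DATE":
        fmt = java_to_strptime(prop.get("dateFormat", "yyyy-MM-dd"))
        return datetime.strptime(value, fmt).date()
    raise ValueError(f"type not covered by this sketch: {prop['type']}")


def row_to_map(row: list, properties: list) -> dict:
    """Pair columns with property definitions by position."""
    if len(row) != len(properties):
        raise ValueError("row and properties must have the same length")
    return {prop["name"]: convert(col, prop) for col, prop in zip(row, properties)}


properties = [
    {"name": "pbool", "type": "BOOLEAN"},
    {"name": "pstringarr", "type": "string_array", "split": ","},
    {"name": "plocaldate", "type": "LOCAL_DATE", "dateFormat": "yyyy-MM-dd"},
]
record = row_to_map(["true", "a,b,c", "2023-07-07"], properties)
```

Because columns are paired with definitions by position, the order of `properties` has to follow the reader's column order.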
### Writing Relationships
```json
"writer": {
    "name": "neo4jWriter",
    "parameter": {
        "uri": "neo4j://localhost:7687",
        "username": "neo4j",
        "password": "Test@12343",
        "database": "neo4j",
        "cypher": "unwind $batch as row match(p1:Person) where p1.id = row.startNodeId match(p2:Person) where p2.id = row.endNodeId create (p1)-[:LINK]->(p2)",
        "batchDataVariableName": "batch",
        "batchSize": "33",
        "properties": [
            {
                "name": "startNodeId",
                "type": "STRING"
            },
            {
                "name": "endNodeId",
                "type": "STRING"
            }
        ]
    }
}
```
### Writing Node/Relationship Types Dynamically
> This requires the APOC extension library. If your database does not have it, install APOC first.
```json
"writer": {
    "name": "neo4jWriter",
    "parameter": {
        "uri": "bolt://localhost:7687",
        "username": "yourUserName",
        "password": "yourPassword",
        "database": "yourDataBase",
        "cypher": "unwind $batch as row CALL apoc.cypher.doIt( 'create (n:`' + row.Label + '`{id:$id})' ,{id: row.id} ) YIELD value RETURN 1 ",
        "batchDataVariableName": "batch",
        "batchSize": "1",
        "properties": [
            {
                "name": "Label",
                "type": "STRING"
            },
            {
                "name": "id",
                "type": "STRING"
            }
        ]
    }
}
```
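## Notes
* The order of the entries in `properties` must match the column order on the reader side one-to-one.
* Using the MAP type flexibly can save a lot of data-wrangling effort. In Cypher, you can keep drilling into values with the `.` property accessor. For example, in `unwind $batch as row create (p) set p.name = row.prop.name, p.age = row.prop.age`, `prop` is a MAP-type property that contains the two fields `name` and `age`.
* If you see transaction timeout errors, increase the transaction run time or reduce `batchSize`.
* If you use the plugin for updates and deadlocks disrupt writes, consider modifying the source code to detect deadlock exceptions and retry.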
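
The retry idea from the last note can be sketched as follows. This is a language-neutral illustration in Python, not the plugin's implementation: `TransientWriteError` is a hypothetical stand-in for whatever exception signals a deadlock, and `write_with_retry`, `write_batch`, and the default values are illustrative.

```python
import time


class TransientWriteError(Exception):
    """Hypothetical stand-in for a retryable failure such as a deadlock."""


def write_with_retry(write_batch, batch, retry_times=3,
                     retry_sleep_seconds=3.0, sleep=time.sleep):
    """Call write_batch(batch), retrying only on TransientWriteError.

    Makes at most retry_times retries after the first attempt, then re-raises.
    """
    attempt = 0
    while True:
        try:
            return write_batch(batch)
        except TransientWriteError:
            attempt += 1
            if attempt > retry_times:
                raise  # give up: retries exhausted
            sleep(retry_sleep_seconds)  # back off before the next attempt
```

Only the exception that marks a transient conflict is retried, so genuine errors such as a malformed statement still fail immediately.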
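## Performance Report
**JVM parameters**
16 GB, G1 garbage collector, 8 cores
**Neo4j database configuration**
32 cores, 256 GB
**DataX configuration**
* 20 channels, batchSize = 1000
* Average task throughput: 15.23 MB/s
* Record write speed: 44440 rec/s
* Total records read: 2222013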
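
As a rough cross-check of these figures, the run time and average record size can be derived from the reported numbers. The calculation below assumes 1 MB = 1024 × 1024 bytes, which the report does not state.

```python
total_records = 2222013      # total records read
records_per_second = 44440   # reported write speed
mb_per_second = 15.23        # reported average throughput

# Implied duration of the run, in seconds.
elapsed_seconds = total_records / records_per_second

# Implied average size of one record, assuming 1 MB = 1024 * 1024 bytes.
bytes_per_record = mb_per_second * 1024 * 1024 / records_per_second

print(f"elapsed: {elapsed_seconds:.1f} s, average record: {bytes_per_record:.0f} bytes")
```

Under that assumption the run lasted about 50 seconds, with an average record of roughly 359 bytes.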