I am writing a DataX plugin that consumes data from Kafka and writes it into Hive. While testing I used TxtFileWriter as the writer so I could inspect the intermediate results.
My reader consumes data in a while (true) loop, as shown below. The logs show that records are read and passed to sendToWriter, yet the output file stays at 0 bytes.
public void startRead(RecordSender recordSender) {
    LOG.info("[RowInKafkaTask] start to read here.");
    Record record = recordSender.createRecord();
    while (true) {
        ConsumerRecords messages = consumer.poll(Constant.TIME_OUT);
        for (ConsumerRecord message : messages) {
            byte[] row = parseRowKeyFromKafkaMsg(message.value(), this.kafkaColumn);
            try {
                boolean result = putDataToRecord(record, row);
                if (result) {
                    LOG.info("[RowInKafkaTask] result is {}", result);
                    recordSender.sendToWriter(record);
                    recordSender.flush();
                } else {
                    LOG.error("[RowInKafkaTask] putDataToRecord false");
                }
            } catch (Exception e) {
                LOG.error("[RowInKafkaTask] exception found.", e);
            }
            record = recordSender.createRecord();
        }
        recordSender.flush();
    }
}
1. First I suspected the writer never received the data, so I walked through TxtFileWriter's code path and found that it writes the output file through a FileOutputStream.
@Override
public void startWrite(RecordReceiver lineReceiver) {
    LOG.info("begin do write...");
    String fileFullPath = this.buildFilePath();
    LOG.info(String.format("write to file : [%s]", fileFullPath));
    OutputStream outputStream = null;
    try {
        // the file is written through a FileOutputStream here
        File newFile = new File(fileFullPath);
        newFile.createNewFile();
        outputStream = new FileOutputStream(newFile);
        UnstructuredStorageWriterUtil.writeToStream(lineReceiver,
                outputStream, this.writerSliceConfig, this.fileName,
                this.getTaskPluginCollector());
    } catch (SecurityException se) {
    ...
Then I suspected the Record was never arriving, so I added some logging to the doWriteToStream method of UnstructuredStorageWriterUtil.java:
private static void doWriteToStream(RecordReceiver lineReceiver,
        BufferedWriter writer, String contex, Configuration config,
        TaskPluginCollector taskPluginCollector) throws IOException {
    // ... omitted
    Record record = null;
    while ((record = lineReceiver.getFromReader()) != null) {
        LOG.info("[Unstrctured..Util] write one record.");
        UnstructuredStorageWriterUtil.transportOneRecord(record,
                nullFormat, dateParse, taskPluginCollector,
                unstructuredWriter);
    }
    // warn: the caller is responsible for closing the stream
    // IOUtils.closeQuietly(unstructuredWriter);
}
2. I recompiled the plugin and replaced the original plugin-unstructured-storage-util-0.0.1-SNAPSHOT.jar under /opt/datax/plugin/writer/txtfilewriter/libs. The new log line did get printed, so the writer was in fact receiving the records.
3. I kept pushing more data into Kafka, and the file finally started to grow, but always in roughly 4 KB increments. Talking it over with a colleague, the suspicion became: is the output buffer never being flushed? (A minimal demo of that behavior follows below.)
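To make the suspicion concrete, here is a minimal standalone sketch (plain JDK code, not DataX; the file and class names are made up) showing that a BufferedWriter only touches the file once its internal buffer fills up or flush()/close() is called, which matches the "file only grows in chunks" symptom:

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class BufferDemo {
    public static void main(String[] args) throws IOException {
        File f = new File("buffer-demo.txt");
        BufferedWriter writer = new BufferedWriter(new FileWriter(f));

        writer.write("one record\n");
        // Nothing has reached the file yet: the bytes are still in the in-memory buffer.
        System.out.println("before flush, file length = " + f.length()); // typically 0

        writer.flush();
        // Now the record is actually on disk.
        System.out.println("after flush, file length = " + f.length());

        writer.close();
    }
}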
4. To verify it, I added a flush() call inside transportOneRecord, as shown below:
public static void transportOneRecord(Record record, String nullFormat,
        DateFormat dateParse, TaskPluginCollector taskPluginCollector,
        UnstructuredWriter unstructuredWriter) {
    // warn: default is null
    if (null == nullFormat) {
        nullFormat = "null";
    }
    try {
        // ... omitted (splitedRows is built from the record's columns here)
        unstructuredWriter.writeOneRecord(splitedRows);
        // newly added line
        unstructuredWriter.flush();
    } catch (Exception e) {
        // warn: dirty data
        taskPluginCollector.collectDirtyRecord(record, e);
    }
}
I recompiled, replaced the jar, and ran the job again.
This time it clearly worked.
So this is a DataX pitfall that only shows up when the reader thread stays resident: in a normal batch job the output stream gets closed once the reader finishes, and closing it flushes the buffer, but with an always-on Kafka reader that never happens, so records can sit in the BufferedWriter indefinitely. It also conveniently gives me an easy post to write.
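One afterthought on the fix: flushing after every single record pushes a write to the OS per record, which can hurt throughput on a busy topic. A gentler option is to flush every N records (or at each poll boundary in the reader). Below is a hypothetical sketch of that idea; PeriodicFlusher is not a DataX class, just an illustration of the trade-off:

import java.io.Flushable;
import java.io.IOException;

public class PeriodicFlusher {
    private final Flushable target;   // whatever writer needs flushing (assumed to implement java.io.Flushable)
    private final int flushEvery;     // flush after this many records
    private int pending = 0;

    public PeriodicFlusher(Flushable target, int flushEvery) {
        this.target = target;
        this.flushEvery = flushEvery;
    }

    // Call once after each record is written; flushes every flushEvery records.
    public void recordWritten() throws IOException {
        if (++pending >= flushEvery) {
            target.flush();
            pending = 0;
        }
    }
}

With flushEvery = 1 this degrades to the per-record flush above; larger values keep at most flushEvery records buffered while doing far fewer writes.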