DataX -- TxtFileWriter Not Writing Data

Preface

I was developing a DataX sync plugin that consumes data from Kafka and writes it into Hive. While testing the tool I used TxtFileWriter as the writer so I could inspect the intermediate results.

 

The Problem

My reader uses a while(true) loop to consume data, as shown below. The logs showed that records were being read and passed to sendToWriter, yet the output file stayed at 0 bytes.

    public void startRead(RecordSender recordSender) {
        LOG.info("[RowInKafkaTask] start to read here.");
        Record record = recordSender.createRecord();
        while (true) {
            ConsumerRecords messages = consumer.poll(Constant.TIME_OUT);
            for (ConsumerRecord message : messages) {
                byte[] row = parseRowKeyFromKafkaMsg(message.value(), this.kafkaColumn);
                try {
                    boolean result = putDataToRecord(record, row);
                    if (result) {
                        LOG.info("[RowInKafkaTask] result is {}", result);
                        recordSender.sendToWriter(record);
                        recordSender.flush();
                    } else {
                        LOG.error("[RowInKafkaTask] putDataToRecord false");
                    }
                } catch (Exception e) {
                    LOG.error("[RowInKafkaTask] exception found.", e);
                }
                // create a fresh Record for the next message
                record = recordSender.createRecord();
            }
            recordSender.flush();
        }
    }

Tracking it down:

1. I suspected the Writer wasn't receiving any data, so I walked through TxtFileWriter's code and found that it writes the file through a FileOutputStream.

    @Override
    public void startWrite(RecordReceiver lineReceiver) {
        LOG.info("begin do write...");
        String fileFullPath = this.buildFilePath();
        LOG.info(String.format("write to file : [%s]", fileFullPath));

        OutputStream outputStream = null;
        try {
            // the file is written through a FileOutputStream here
            File newFile = new File(fileFullPath);
            newFile.createNewFile();
            outputStream = new FileOutputStream(newFile);
            UnstructuredStorageWriterUtil.writeToStream(lineReceiver,
                    outputStream, this.writerSliceConfig, this.fileName,
                    this.getTaskPluginCollector());
        } catch (SecurityException se) {
...

Next I suspected the Records weren't arriving at the writer, so I added some logging to the doWriteToStream method of UnstructuredStorageWriterUtil.java:

    private static void doWriteToStream(RecordReceiver lineReceiver,
            BufferedWriter writer, String contex, Configuration config,
            TaskPluginCollector taskPluginCollector) throws IOException {

        // ... omitted

        Record record = null;
        while ((record = lineReceiver.getFromReader()) != null) {
            // log added for debugging: printed once per record received
            LOG.info("[Unstrctured..Util] write one record.");
            UnstructuredStorageWriterUtil.transportOneRecord(record,
                    nullFormat, dateParse, taskPluginCollector,
                    unstructuredWriter);
        }

        // warn: the caller is responsible for closing the stream
        // IOUtils.closeQuietly(unstructuredWriter);
    }

2. I recompiled the plugin and replaced the original plugin-unstructured-storage-util-0.0.1-SNAPSHOT.jar under /opt/datax/plugin/writer/txtfilewriter/libs. The new log lines showed up, which confirmed the writer was in fact receiving the records.

3. I kept pushing more data into Kafka, and the file finally started being written, but its size only ever grew in 4K increments. After discussing it with a colleague, the suspicion became: "is the write buffer not being flushed?" (the sketch below illustrates that behavior).

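That suspicion fits how Java's BufferedWriter works: characters written to it sit in an in-memory buffer (8192 chars by default) until the buffer fills or flush()/close() is called. Here is a minimal standalone sketch, not DataX code, showing the file staying at 0 bytes until an explicit flush (the class and file name are made up):

    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;

    public class BufferFlushDemo {
        public static void main(String[] args) throws IOException {
            File target = new File("buffer-demo.txt");   // hypothetical scratch file
            BufferedWriter writer = new BufferedWriter(new FileWriter(target));

            writer.write("one record\n");
            // the record is still in BufferedWriter's internal buffer,
            // so the file on disk is still 0 bytes at this point
            System.out.println("before flush: " + target.length() + " bytes");

            writer.flush();   // push the buffered characters down to the file
            System.out.println("after flush:  " + target.length() + " bytes");

            writer.close();
        }
    }
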
4. To verify it quickly, I added a flush() call inside transportOneRecord, as follows:

    public static void transportOneRecord(Record record, String nullFormat,
            DateFormat dateParse, TaskPluginCollector taskPluginCollector,
            UnstructuredWriter unstructuredWriter) {
        // warn: default is null
        if (null == nullFormat) {
            nullFormat = "null";
        }
        try {
            // ... omitted
            }
            unstructuredWriter.writeOneRecord(splitedRows);
            // newly added line: flush the buffer after every record
            unstructuredWriter.flush();
        } catch (Exception e) {
            // warn: dirty data
            taskPluginCollector.collectDirtyRecord(record, e);
        }
    }

Recompile, replace, run.

Wrapping Up

And sure enough, the fix worked.

This is a DataX pitfall that only shows up when the reader thread stays resident: getFromReader never returns null, so the while loop in doWriteToStream never exits, the writer is never closed, and buffered data only reaches the file when the buffer happens to fill up.
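
A rough self-contained sketch of that failure mode (not DataX source; pollNextRow and the file name are made up): the flush that close() would trigger sits after the read loop, so a loop that never ends never reaches it, and rows only hit disk when the internal buffer fills.

    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.io.IOException;

    public class ResidentReaderSketch {
        public static void main(String[] args) throws IOException {
            try (BufferedWriter writer = new BufferedWriter(new FileWriter("out.txt"))) {
                while (true) {                    // resident reader: the loop never ends
                    String row = pollNextRow();   // hypothetical blocking poll (e.g. Kafka)
                    if (row == null) {
                        break;                    // never reached when the source is endless
                    }
                    writer.write(row);
                    writer.newLine();
                    // without writer.flush() here, rows only reach the file once the
                    // internal buffer fills; the close()/flush() below never runs
                }
            } // try-with-resources close() -> flush() happens here, only after the loop exits
        }

        // stand-in for a blocking consumer; in the real plugin this is consumer.poll(...)
        private static String pollNextRow() {
            return "some,row,data";
        }
    }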

And it conveniently gave me something to post about.
