The source code is as follows:
/**
 * The deserialization schema describes how to turn the byte messages delivered by certain
 * data sources (for example Apache Kafka) into data types (Java/Scala objects) that are
 * processed by Flink.
 *
 * In addition, the DeserializationSchema describes the produced type ({@link #getProducedType()}),
 * which lets Flink create internal serializers and structures to handle the type.
 *
 * Note: In most cases, one should start from {@link AbstractDeserializationSchema}, which
 * takes care of producing the return type information automatically.
 *
 * A DeserializationSchema must be {@link Serializable} because its instances are often part of
 * an operator or transformation function.
 *
 * @param <T> The type created by the deserialization schema.
 */
@Public
public interface DeserializationSchema<T> extends Serializable, ResultTypeQueryable<T> {
    /**
     * Deserializes the byte message.
     *
     * @param message The message, as a byte array.
     *
     * @return The deserialized message as an object (null if the message cannot be deserialized).
     */
    T deserialize(byte[] message) throws IOException;

    /**
     * Method to decide whether the element signals the end of the stream. If
     * true is returned the element won't be emitted.
     *
     * @param nextElement The element to test for the end-of-stream signal.
     * @return True, if the element signals end of stream, false otherwise.
     */
    boolean isEndOfStream(T nextElement);
}
DeserializationSchema is the top-level interface; it defines how to turn the byte messages delivered by a data source into data types that Flink can process.
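As the Javadoc above notes, in most cases one should start from AbstractDeserializationSchema rather than implement the interface directly, because it produces the return type information automatically. A minimal sketch of what that looks like (the class name and the trimming logic are made up for illustration, and the import path of AbstractDeserializationSchema may differ slightly between Flink versions):

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.flink.api.common.serialization.AbstractDeserializationSchema;

// Illustrative schema: only deserialize(byte[]) has to be implemented;
// AbstractDeserializationSchema derives TypeInformation<String> from the class signature
// and already returns false from isEndOfStream.
public class TrimmedStringDeserializationSchema extends AbstractDeserializationSchema<String> {

    private static final long serialVersionUID = 1L;

    @Override
    public String deserialize(byte[] message) throws IOException {
        // Assumption for this sketch: messages are UTF-8 encoded text.
        return new String(message, StandardCharsets.UTF_8).trim();
    }
}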
/**
 * The deserialization schema describes how to turn the Kafka ConsumerRecords
 * into data types (Java/Scala objects) that are processed by Flink.
 *
 * @param <T> The type created by the keyed deserialization schema.
 */
@PublicEvolving
public interface KafkaDeserializationSchema<T> extends Serializable, ResultTypeQueryable<T> {
    /**
     * Method to decide whether the element signals the end of the stream. If
     * true is returned the element won't be emitted.
     *
     * @param nextElement The element to test for the end-of-stream signal.
     *
     * @return True, if the element signals end of stream, false otherwise.
     */
    boolean isEndOfStream(T nextElement);

    /**
     * Deserializes the Kafka record.
     *
     * @param record Kafka record to be deserialized.
     *
     * @return The deserialized message as an object (null if the message cannot be deserialized).
     */
    T deserialize(ConsumerRecord<byte[], byte[]> record) throws Exception;
}
As the source shows, DeserializationSchema and KafkaDeserializationSchema are sibling interfaces: both extend the Serializable and ResultTypeQueryable interfaces. The difference lies in the parameter of the deserialize method: KafkaDeserializationSchema was created specifically for deserializing Kafka data and receives the whole ConsumerRecord, while DeserializationSchema can deserialize arbitrary binary data and is therefore more general.
/**
* Very simple serialization schema for strings.
*
* By default, the serializer uses "UTF-8" for string/byte conversion.
*/
public class SimpleStringSchema implements DeserializationSchema<String>, SerializationSchema<String>
SimpleStringSchema implements both the DeserializationSchema and SerializationSchema interfaces, so it can be used for serialization as well as deserialization, i.e. both when consuming data from Kafka and when producing data to Kafka.
As the official comment says, it is a "Very simple serialization schema for strings": it serializes and deserializes data as String, with UTF-8 as the default character encoding.
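A minimal usage sketch; the topic names, broker address and consumer properties are placeholders, and the connector/schema package names depend on the Flink and Kafka connector version in use:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class SimpleStringSchemaExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "demo-group");

        // Deserialization: Kafka byte[] value -> String (UTF-8 by default).
        DataStream<String> lines = env.addSource(
                new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props));

        // Serialization: String -> byte[] when writing back to Kafka.
        lines.addSink(
                new FlinkKafkaProducer<>("localhost:9092", "output-topic", new SimpleStringSchema()));

        env.execute("SimpleStringSchema demo");
    }
}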
/**
 * DeserializationSchema that deserializes a JSON String into an ObjectNode.
 *
 * Key fields can be accessed by calling objectNode.get("key").get(<name>).as(<type>)
 *
 * Value fields can be accessed by calling objectNode.get("value").get(<name>).as(<type>)
 *
 * Metadata fields can be accessed by calling objectNode.get("metadata").get(<name>).as(<type>) and include
 * the "offset" (long), "topic" (String) and "partition" (int).
 */
@PublicEvolving
public class JSONKeyValueDeserializationSchema implements KafkaDeserializationSchema<ObjectNode> {
    private static final long serialVersionUID = 1509391548173891955L;

    private final boolean includeMetadata;
    private ObjectMapper mapper;

    public JSONKeyValueDeserializationSchema(boolean includeMetadata) {
        this.includeMetadata = includeMetadata;
    }

    @Override
    public ObjectNode deserialize(ConsumerRecord<byte[], byte[]> record) throws Exception {
        if (mapper == null) {
            mapper = new ObjectMapper();
        }
        ObjectNode node = mapper.createObjectNode();
        if (record.key() != null) {
            node.set("key", mapper.readValue(record.key(), JsonNode.class));
        }
        if (record.value() != null) {
            node.set("value", mapper.readValue(record.value(), JsonNode.class));
        }
        if (includeMetadata) {
            node.putObject("metadata")
                .put("offset", record.offset())
                .put("topic", record.topic())
                .put("partition", record.partition());
        }
        return node;
    }

    @Override
    public boolean isEndOfStream(ObjectNode nextElement) {
        return false;
    }

    @Override
    public TypeInformation<ObjectNode> getProducedType() {
        return getForClass(ObjectNode.class);
    }
}
The constructor of JSONKeyValueDeserializationSchema takes a boolean. When the argument is true, the deserialized record carries the Kafka metadata (offset, topic and partition); when it is false, only the key and value from Kafka are included.
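A short usage sketch with metadata enabled; the topic name, properties and the field access in the map function are illustrative, and the import paths (shaded vs. unshaded Jackson ObjectNode, schema package) vary between Flink versions:

import java.util.Properties;

import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.node.ObjectNode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.util.serialization.JSONKeyValueDeserializationSchema;

public class JsonKeyValueExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "demo-group");

        // true -> each ObjectNode also carries a "metadata" object with offset/topic/partition.
        DataStream<ObjectNode> records = env.addSource(
                new FlinkKafkaConsumer<>("json-topic", new JSONKeyValueDeserializationSchema(true), props));

        records
            .map(node -> node.get("metadata").get("topic").asText()
                    + "@" + node.get("metadata").get("offset").asLong()
                    + " -> " + node.get("value").toString())
            .print();

        env.execute("JSONKeyValueDeserializationSchema demo");
    }
}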
The deserialize method returns an ObjectNode. As the following excerpt from its source shows, an ObjectNode keeps its fields in a Map<String, JsonNode>:
public class ObjectNode extends ContainerNode<ObjectNode> {
    protected final Map<String, JsonNode> _children;
Note: when this class is used for deserialization, the data transported in Kafka must be JSON strings; otherwise it cannot be deserialized.
In the author's project, both the key and the topic of the Kafka records are needed, and the Kafka data also contains non-JSON payloads, so a custom deserialization schema is implemented (here via KeyedDeserializationSchema, the older predecessor of KafkaDeserializationSchema) in order to obtain the key and topic information.
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.util.serialization.KeyedDeserializationSchema;

/**
 * @program:
 * @description: Custom deserialization schema that extracts the key, value and topic of each record
 * @author:
 * @create:
 **/
public class CustomKeyValueDeserializationSchema implements KeyedDeserializationSchema<String> {

    @Override
    public String deserialize(byte[] messageKey, byte[] message, String topic, int partition, long offset) throws IOException {
        // Kafka records may have a null key or value, so guard against NullPointerException before decoding.
        String msKey = messageKey == null ? "" : new String(messageKey, StandardCharsets.UTF_8);
        String ms = message == null ? "" : new String(message, StandardCharsets.UTF_8);
        // Join value, key and topic with "\t" so that downstream logic can split the fields again. The result is a String.
        return new StringBuilder()
                .append(ms).append("\t")
                .append(msKey).append("\t")
                .append(topic)
                .toString();
    }

    @Override
    public boolean isEndOfStream(String nextElement) {
        return false;
    }

    // Declare the produced type so that Flink can create the matching serializers.
    @Override
    public TypeInformation<String> getProducedType() {
        return TypeInformation.of(String.class);
    }
}
With a custom class, the deserialized data can be processed flexibly and exactly the information you need can be extracted.
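A usage sketch for the custom schema; the topic name and properties are placeholders, and depending on the connector version the KeyedDeserializationSchema is either accepted by its own FlinkKafkaConsumer constructor or treated as a KafkaDeserializationSchema:

import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class CustomSchemaExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "demo-group");

        // Each element is "value \t key \t topic", as produced by the schema above.
        DataStream<String> lines = env.addSource(
                new FlinkKafkaConsumer<>("raw-topic", new CustomKeyValueDeserializationSchema(), props));

        lines
            .map(line -> {
                String[] parts = line.split("\t", 3);
                // parts[0] = value, parts[1] = key, parts[2] = topic
                return "topic=" + parts[2] + ", key=" + parts[1];
            })
            .print();

        env.execute("CustomKeyValueDeserializationSchema demo");
    }
}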