Big-Data Batch Processing Compared: Spring Batch vs Flink vs Parallel Streams
Abstract: Through a hands-on, like-for-like benchmark, this article compares batch-processing options so you can pick the one that best fits your workload.
Scenario
For bulk writes, a traditional single-threaded loop cannot meet the response-time requirement. The obvious next step is hand-rolled multi-threading, which works, but writing correct parallel code is demanding and error-prone. Is there an off-the-shelf open-source alternative? Yes.
This article benchmarks the candidate approaches and summarizes the results.
Scenario: insert 10,000 records into the student table in a MySQL database.
Table definition
CREATE TABLE `student` (
`id` int(20) NOT NULL AUTO_INCREMENT,
`name` varchar(20) DEFAULT NULL,
`age` int(11) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=220000 DEFAULT CHARSET=utf8;
Performance comparison: Spring Batch vs Flink vs parallel streams
Approach | Execution time (ms) | Notes |
---|---|---|
Plain for loop | 34391 | |
parallelStream | 8384 | |
Spring Boot Batch | 1035 | |
Flink | 4310 | |
Spring Boot project pom.xml configuration
Dependencies
<parent>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-parent</artifactId>
    <version>2.0.1.RELEASE</version>
    <relativePath />
</parent>
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.40</version>
    <scope>runtime</scope>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-test</artifactId>
    <scope>test</scope>
</dependency>
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>druid</artifactId>
    <version>1.1.0</version>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter</artifactId>
    <exclusions>
        <exclusion>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-logging</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-log4j2</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
    <scope>compile</scope>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
    <scope>compile</scope>
</dependency>
<dependency>
    <groupId>org.projectlombok</groupId>
    <artifactId>lombok</artifactId>
    <optional>true</optional>
    <scope>compile</scope>
</dependency>
<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <scope>test</scope>
</dependency>
<dependency>
    <groupId>org.mybatis.spring.boot</groupId>
    <artifactId>mybatis-spring-boot-starter</artifactId>
    <version>2.0.1</version>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-batch</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-jpa</artifactId>
</dependency>
<dependency>
    <groupId>javax.xml.bind</groupId>
    <artifactId>jaxb-api</artifactId>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-java</artifactId>
    <version>${flink.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_2.11</artifactId>
    <version>${flink.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-jdbc_2.12</artifactId>
    <version>${flink.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-api-java-bridge_2.11</artifactId>
    <version>1.9.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-planner-blink_2.11</artifactId>
    <version>1.9.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-scala_2.11</artifactId>
    <version>1.9.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-common</artifactId>
    <version>1.9.0</version>
</dependency>
Tip: pin Spring Boot to version 2.0.1.RELEASE; other versions have compatibility problems with the dependencies above.
application.properties
server.port=8070
spring.application.name=sea-spring-boot-batch
spring.batch.initialize-schema=always
spring.jpa.generate-ddl=true
mybatis.config-location=classpath:mybatis-config.xml
mybatis.mapper-locations=classpath:mybatis/*.xml
mybatis.type-aliases-package=org.sea.spring.cloud.nacos.model
spring.datasource.driverClassName=com.mysql.jdbc.Driver
spring.datasource.url=jdbc:mysql://localhost:3306/test?useUnicode=true&characterEncoding=utf-8&useSSL=false
spring.datasource.username=root
spring.datasource.password=mysql
spring.datasource.type=com.alibaba.druid.pool.DruidDataSource
mybatis-config.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE configuration PUBLIC "-//mybatis.org//DTD Config 3.0//EN" "http://mybatis.org/dtd/mybatis-3-config.dtd">
<configuration>
    <settings>
        <setting name="cacheEnabled" value="true"/>
        <setting name="lazyLoadingEnabled" value="true"/>
        <setting name="aggressiveLazyLoading" value="true"/>
        <setting name="multipleResultSetsEnabled" value="true"/>
        <setting name="useColumnLabel" value="true"/>
        <setting name="useGeneratedKeys" value="true"/>
        <setting name="autoMappingBehavior" value="PARTIAL"/>
        <setting name="defaultExecutorType" value="SIMPLE"/>
        <setting name="mapUnderscoreToCamelCase" value="true"/>
        <setting name="localCacheScope" value="SESSION"/>
        <setting name="jdbcTypeForNull" value="NULL"/>
    </settings>
</configuration>
The for-loop and parallelStream test cases are straightforward, so they are left to the reader; the focus here is Spring Boot Batch and Flink.
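For completeness, the two cases left to the reader can be sketched as follows. This is a minimal, hypothetical illustration (the class and method names are my own, and a thread-safe Consumer stands in for the real MyBatis mapper the repository uses): the loop version saves one record at a time sequentially, while the parallelStream version fans the same list out across the common ForkJoinPool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;

public class LoopVsParallelSketch {

    // Minimal stand-in for the article's StudentEntity.
    static class StudentEntity {
        final String name;
        final int age;
        StudentEntity(String name, int age) { this.name = name; this.age = age; }
    }

    static List<StudentEntity> buildStudents(int n) {
        List<StudentEntity> list = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            list.add(new StudentEntity("name" + i, i));
        }
        return list;
    }

    // Plain loop: one "insert" per element, strictly sequential.
    static void saveWithLoop(List<StudentEntity> students, Consumer<StudentEntity> saver) {
        for (StudentEntity s : students) {
            saver.accept(s);
        }
    }

    // parallelStream: the same work spread over the common ForkJoinPool.
    // The saver must be thread-safe (a real mapper call normally is).
    static void saveWithParallelStream(List<StudentEntity> students, Consumer<StudentEntity> saver) {
        students.parallelStream().forEach(saver);
    }

    public static void main(String[] args) {
        List<StudentEntity> students = buildStudents(10_000);
        ConcurrentLinkedQueue<StudentEntity> sink = new ConcurrentLinkedQueue<>();
        saveWithLoop(students, sink::add);
        System.out.println("loop saved: " + sink.size());
        sink.clear();
        saveWithParallelStream(students, sink::add);
        System.out.println("parallel saved: " + sink.size());
    }
}
```

Note that parallelStream only parallelizes the per-row inserts; it does not batch them, which is why it still trails Spring Batch's chunked writes in the table above.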
package org.sea.spring.boot.batch.job;
import java.util.Iterator;
import java.util.List;
import org.sea.spring.boot.batch.model.StudentEntity;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.NonTransientResourceException;
import org.springframework.batch.item.ParseException;
import org.springframework.batch.item.UnexpectedInputException;
public class InputStudentItemReader implements ItemReader<StudentEntity>{
private final Iterator<StudentEntity> iterator;
public InputStudentItemReader(List<StudentEntity> data) {
this.iterator = data.iterator();
}
@Override
public StudentEntity read()
throws Exception, UnexpectedInputException, ParseException, NonTransientResourceException {
if (iterator.hasNext()) {
return this.iterator.next();
} else {
return null;
}
}
}
package org.sea.spring.boot.batch.job;
import java.util.Set;
import javax.validation.ConstraintViolation;
import javax.validation.Validation;
import javax.validation.ValidatorFactory;
import org.springframework.batch.item.validator.ValidationException;
import org.springframework.batch.item.validator.Validator;
import org.springframework.beans.factory.InitializingBean;
public class StudentBeanValidator <T> implements Validator<T>, InitializingBean{
private javax.validation.Validator validator;
@Override
public void afterPropertiesSet() throws Exception {
ValidatorFactory validatorFactory = Validation.buildDefaultValidatorFactory();
validator = validatorFactory.usingContext().getValidator();
}
@Override
public void validate(T value) throws ValidationException {
Set<ConstraintViolation<T>> constraintViolations = validator.validate(value);
if (constraintViolations.size() > 0) {
StringBuilder message = new StringBuilder();
for (ConstraintViolation<T> constraintViolation : constraintViolations) {
message.append(constraintViolation.getMessage()).append("\n");
}
throw new ValidationException(message.toString());
}
}
}
package org.sea.spring.boot.batch.job;
import org.sea.spring.boot.batch.model.StudentEntity;
import org.springframework.batch.item.validator.ValidatingItemProcessor;
import org.springframework.batch.item.validator.ValidationException;
public class StudentItemProcessor extends ValidatingItemProcessor<StudentEntity> {
@Override
public StudentEntity process(StudentEntity item) throws ValidationException {
super.process(item);
return item;
}
}
package org.sea.spring.boot.batch.job;
import java.util.ArrayList;
import java.util.List;
import org.sea.spring.boot.batch.model.StudentEntity;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.core.launch.support.RunIdIncrementer;
import org.springframework.batch.core.launch.support.SimpleJobLauncher;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.repository.support.JobRepositoryFactoryBean;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.database.BeanPropertyItemSqlParameterSourceProvider;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.validator.Validator;
import org.springframework.batch.support.DatabaseType;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.transaction.PlatformTransactionManager;
import com.alibaba.druid.pool.DruidDataSource;
@Configuration
@EnableBatchProcessing
public class StudentBatchConfig {
/**
 * ItemReader definition, used to read the input data
 * @return InputStudentItemReader
 */
@Bean
@StepScope
public InputStudentItemReader reader() {
List<StudentEntity> list =new ArrayList<StudentEntity>(10000);
for(int i=0;i<10000;i++) {
list.add(init(i));
}
return new InputStudentItemReader(list);
}
private StudentEntity init(int i) {
StudentEntity student=new StudentEntity();
student.setName("name"+i);
student.setAge(i);
return student;
}
/**
 * ItemProcessor definition, used to validate and process each item
 *
 * @return
 */
@Bean
public ItemProcessor<StudentEntity, StudentEntity> processor() {
StudentItemProcessor processor = new StudentItemProcessor();
processor.setValidator(studentBeanValidator());
return processor;
}
@Bean
public Validator<StudentEntity> studentBeanValidator() {
return new StudentBeanValidator<>();
}
/**
 * ItemWriter definition, used to write the output data.
 * Spring injects beans already present in the container as parameters;
 * Spring Boot has already defined a dataSource bean for us.
 *
 * @param dataSource
 * @return
 */
@Bean
public ItemWriter<StudentEntity> writer(DruidDataSource dataSource) {
JdbcBatchItemWriter<StudentEntity> writer = new JdbcBatchItemWriter<>();
// Use JdbcBatchItemWriter, which writes to the database with JDBC batching
writer.setItemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider<>());
String sql="INSERT INTO student (name,age) values(:name,:age)" ;
// Set the SQL statement the batch will execute
writer.setSql(sql);
writer.setDataSource(dataSource);
return writer;
}
/**
*
* @param dataSource
* @param transactionManager
* @return
* @throws Exception
*/
@Bean
public JobRepository jobRepository(DruidDataSource dataSource, PlatformTransactionManager transactionManager) throws Exception {
JobRepositoryFactoryBean jobRepositoryFactoryBean = new JobRepositoryFactoryBean();
jobRepositoryFactoryBean.setDataSource(dataSource);
jobRepositoryFactoryBean.setTransactionManager(transactionManager);
jobRepositoryFactoryBean.setDatabaseType(String.valueOf(DatabaseType.MYSQL));
// The isolation-level-for-create setting below is only needed for Oracle
// jobRepositoryFactoryBean.setIsolationLevelForCreate(isolationLevelForCreate);
jobRepositoryFactoryBean.afterPropertiesSet();
return jobRepositoryFactoryBean.getObject();
}
/**
 * JobLauncher definition, the interface used to launch a Job
 *
 * @param dataSource
 * @param transactionManager
 * @return
 * @throws Exception
 */
@Bean
public SimpleJobLauncher jobLauncher(DruidDataSource dataSource, PlatformTransactionManager transactionManager) throws Exception {
SimpleJobLauncher jobLauncher = new SimpleJobLauncher();
jobLauncher.setJobRepository(jobRepository(dataSource, transactionManager));
return jobLauncher;
}
/**
 * Job definition: the task we actually execute, made up of one or more Steps
 *
 * @param jobBuilderFactory
 * @param s1
 * @return
 */
@Bean
public Job importJob(JobBuilderFactory jobBuilderFactory, Step s1) {
return jobBuilderFactory.get("importJob")
.incrementer(new RunIdIncrementer())
.flow(s1)// assign the Step to this Job
.end()
.build();
}
/**
 * Step definition, wiring together the ItemReader, ItemProcessor and ItemWriter
 *
 * @param stepBuilderFactory
 * @param reader
 * @param writer
 * @param processor
 * @return
 */
@Bean
public Step step1(StepBuilderFactory stepBuilderFactory, ItemReader<StudentEntity> reader, ItemWriter<StudentEntity> writer,
ItemProcessor<StudentEntity, StudentEntity> processor) {
return stepBuilderFactory
.get("step1")
.<StudentEntity, StudentEntity>chunk(1000)// commit 1,000 items per chunk
.reader(reader)// bind the reader to the step
.processor(processor)// bind the processor to the step
.writer(writer)// bind the writer to the step
.build();
}
}
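Under the hood, a chunk-oriented step behaves roughly like the loop below. This is a simplified sketch of the idea, with illustrative names, not Spring Batch's actual implementation: read until the reader returns null, buffer up to the chunk size, process each item, then hand the whole chunk to the writer in one call (one transaction per chunk in real Spring Batch).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

public class ChunkLoopSketch {

    // Runs one chunk-oriented step: read until null, buffer chunkSize items,
    // process each, then write the whole chunk in a single writer call.
    // Returns the number of chunks written.
    static <I, O> int run(Supplier<I> reader, Function<I, O> processor,
                          Consumer<List<O>> writer, int chunkSize) {
        int chunks = 0;
        List<O> buffer = new ArrayList<>(chunkSize);
        I item;
        while ((item = reader.get()) != null) {
            buffer.add(processor.apply(item));
            if (buffer.size() == chunkSize) {
                writer.accept(new ArrayList<>(buffer));
                buffer.clear();
                chunks++;
            }
        }
        if (!buffer.isEmpty()) { // flush the final partial chunk
            writer.accept(new ArrayList<>(buffer));
            chunks++;
        }
        return chunks;
    }

    public static void main(String[] args) {
        // 10,000 items read one at a time, written 1,000 at a time,
        // mirroring chunk(1000) in step1 above.
        java.util.Iterator<Integer> it =
                java.util.stream.IntStream.range(0, 10_000).iterator();
        int chunks = run(() -> it.hasNext() ? it.next() : null,
                         i -> i, batch -> {}, 1_000);
        System.out.println("chunks written: " + chunks);
    }
}
```

With chunk(1000) and 10,000 items this amounts to 10 writer calls instead of 10,000 single inserts, which is where most of the speed-up over the plain loop comes from.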
package org.sea.spring.boot.batch.test;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.sea.spring.boot.batch.SpringBootBathApplication;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.context.junit4.SpringRunner;
import org.springframework.util.StopWatch;
import lombok.extern.slf4j.Slf4j;
@RunWith(SpringRunner.class)
@SpringBootTest(classes=SpringBootBathApplication.class)
@Slf4j
public class TestBatchService {
@Autowired
private JobLauncher jobLauncher;
@Autowired
private Job importJob;
@Test
public void testBatch1() throws Exception {
StopWatch watch = new StopWatch("testAdd1");
watch.start("save");
JobParameters jobParameters = new JobParametersBuilder()
.addLong("time", System.currentTimeMillis())
.toJobParameters();
jobLauncher.run(importJob, jobParameters);
watch.stop();
log.info(watch.prettyPrint());
}
}
package org.sea.spring.boot.batch;
import java.util.List;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.io.jdbc.JDBCAppendTableSink;
import org.apache.flink.api.java.io.jdbc.JDBCInputFormat;
import org.apache.flink.api.java.typeutils.RowTypeInfo;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.types.Row;
import org.sea.spring.boot.batch.model.StudentEntity;
public class FLink2Mysql {
private static String driverClass = "com.mysql.jdbc.Driver";
private static String dbUrl = "jdbc:mysql://localhost:3306/test?useUnicode=true&characterEncoding=utf-8&useSSL=false";
private static String userName = "root";
private static String passWord = "mysql";
public static void add(List<StudentEntity> students) {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<StudentEntity> input = env.fromCollection(students);
DataStream<Row> ds = input.map(new RichMapFunction<StudentEntity, Row>() {
private static final long serialVersionUID = 1L;
@Override
public Row map(StudentEntity student) throws Exception {
return Row.of(student.getId(), student.getName(), student.getAge());
}
});
TypeInformation<?>[] fieldTypes = new TypeInformation<?>[] { BasicTypeInfo.INT_TYPE_INFO ,BasicTypeInfo.STRING_TYPE_INFO,
BasicTypeInfo.INT_TYPE_INFO };
JDBCAppendTableSink sink = JDBCAppendTableSink.builder().setDrivername(driverClass).setDBUrl(dbUrl)
.setUsername(userName).setPassword(passWord).setParameterTypes(fieldTypes)
.setQuery("insert into student values(?,?,?)").build();
sink.emitDataStream(ds);
try {
env.execute();
} catch (Exception e) {
e.printStackTrace();
}
}
public static void query() {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Must match the three columns of "select *": id, name, age
TypeInformation<?>[] fieldTypes = new TypeInformation<?>[] { BasicTypeInfo.INT_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO,
BasicTypeInfo.INT_TYPE_INFO };
RowTypeInfo rowTypeInfo = new RowTypeInfo(fieldTypes);
// query MySQL
JDBCInputFormat jdbcInputFormat = JDBCInputFormat.buildJDBCInputFormat().setDrivername(driverClass)
.setDBUrl(dbUrl).setUsername(userName).setPassword(passWord).setQuery("select * from student")
.setRowTypeInfo(rowTypeInfo).finish();
DataStreamSource<Row> input1 = env.createInput(jdbcInputFormat);
input1.print();
try {
env.execute();
} catch (Exception e) {
e.printStackTrace();
}
}
}
package org.sea.spring.boot.batch.test;
import java.util.ArrayList;
import java.util.List;
import org.junit.Test;
import org.sea.spring.boot.batch.FLink2Mysql;
import org.sea.spring.boot.batch.model.StudentEntity;
import org.springframework.util.StopWatch;
import lombok.extern.log4j.Log4j2;
@Log4j2
public class TestFlink {
@Test
public void test() {
// build the test data
StopWatch watch = new StopWatch("testAdd1");
watch.start("build");
List<StudentEntity> list =new ArrayList<StudentEntity>(10000);
for(int i=0;i<10000;i++) {
list.add(init(i+210000));
}
watch.stop();
// save
watch.start("save");
FLink2Mysql.add(list);
watch.stop();
log.info(watch.prettyPrint());
}
private StudentEntity init(int i) {
StudentEntity student=new StudentEntity();
student.setId(i);
student.setName("name"+i);
student.setAge(i);
return student;
}
}
Source code on GitHub: https://github.com/loveseaone/sea-spring-boot-batch.git
WeChat official account: TalkNewClass