This article covers advanced Spring Batch material, namely its scaling options (Multithreaded Step: running a single Step on multiple threads; Parallel Step: running multiple Steps in parallel on multiple threads; Remote Chunking: executing distributed chunks on remote nodes; Partitioning Step: partitioning the data and processing the partitions separately), with a focus on the Partitioning Step. The example built here can run in master, slave, or mixed master/slave mode, which can greatly improve throughput compared with single-machine Spring Batch processing.
@Bean("step2")
public Step step2(StepBuilderFactory stepBuilderFactory, @Qualifier("listResourceEq") MultiResourceItemReader listResourceEq, ItemWriter writer,
ItemProcessor processor){
return stepBuilderFactory
.get(step1)
.chunk(65000)//批处理每次提交65000条数据
.reader(listResourceEq)//给step绑定reader
.faultTolerant()
.skip(NullPointerException.class)
.skip(Exception.class)
.skipLimit(10000)
.processor(processor)//给step绑定processor
.writer(writer)//给step绑定writer
.taskExecutor(new SimpleAsyncTaskExecutor())//可以自定义线程池
.throttleLimit(16)//限制线程数
.build();
}
A multithreaded Step parallelizes a single Step. With a multithreaded Step you must make sure the Reader, Processor and Writer are thread-safe; otherwise concurrency problems are likely.
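FlatFileItemReader, for instance, is stateful and not thread-safe. A common remedy is to wrap the stateful reader in Spring Batch's SynchronizedItemStreamReader, which serializes access to the delegate. A minimal sketch, assuming the User item type used elsewhere in this article:
@Bean
@StepScope
public SynchronizedItemStreamReader<User> synchronizedReader(FlatFileItemReader<User> delegate) {
    SynchronizedItemStreamReader<User> reader = new SynchronizedItemStreamReader<>();
    reader.setDelegate(delegate);//read() calls on the non-thread-safe delegate are synchronized
    return reader;
}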
@Bean("importJob1")
public Job importJob1(JobBuilderFactory jobBuilderFactory, @Qualifier("step2") Step s1,@Qualifier("step3") Step s2){
return jobBuilderFactory.get("importJob1")
.incrementer(new RunIdIncrementer())
.flow(s1)//assign the first Step to the Job
.next(s2)//next() runs s2 after s1 finishes; for true parallelism use a split, as sketched below
.end()
.listener(csvJobListener())//bind the csvJobListener listener
.build();
}
Parallel Step: this mode runs several Steps in parallel. Note that flow().next() as used above is sequential; real parallelism requires flows combined with split().
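A minimal sketch of genuinely parallel Steps, reusing the step beans above (the job name parallelJob and the flow names are made up for illustration):
@Bean("parallelJob")
public Job parallelJob(JobBuilderFactory jobBuilderFactory, @Qualifier("step2") Step s1, @Qualifier("step3") Step s2){
    Flow flow1 = new FlowBuilder<Flow>("flow1").start(s1).build();
    Flow flow2 = new FlowBuilder<Flow>("flow2").start(s2).build();
    return jobBuilderFactory.get("parallelJob")
            .incrementer(new RunIdIncrementer())
            .start(flow1)
            .split(new SimpleAsyncTaskExecutor())//each flow executes on its own thread
            .add(flow2)
            .end()
            .build();
}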
In Spring Batch, partitioning means slicing the data and processing each slice separately. Suppose a single thread needs 10 minutes to process 100 records; if we split those 100 records into ten slices and process each slice independently, the whole run may take only about 1 minute. The worker (slave) processors can be services on remote servers or locally executed threads. The messages the master sends to the slaves do not need to be durable or carry the strict delivery guarantees of JMS: the Spring Batch Job Repository metadata ensures that each slave executes exactly once per Job execution.
@Configuration
@EnableBatchProcessing
public class PartitionBatch {
private String step1 ="THREE";
@Bean
public ThreadPoolTaskExecutor threadPoolExecutor(){
ThreadPoolTaskExecutor threadPoolExecutor = new ThreadPoolTaskExecutor();
threadPoolExecutor.setMaxPoolSize(10);
threadPoolExecutor.setCorePoolSize(10);
return threadPoolExecutor;
}
@Bean()
@StepScope
//Step scope ties beans to the individual Steps of a batch. It allows a Spring bean to be instantiated only when
//its Step starts, and lets you supply step-specific configuration and parameters to it.
//Note that the bean's scope must be set to "step". This is Spring Batch's late-binding technique: the bean is
//created only when the Step is generated, because only then have the job parameters been passed in. If the bean
//were created while the configuration was being loaded, the job parameter values would not exist yet and an
//exception would be thrown.
public FlatFileItemReader<User> reader(@Value("#{stepExecutionContext[filename]}") Resource resource) throws Exception {
    System.out.println("%%%%%%%%%" + resource);
    FlatFileItemReader<User> reader = new FlatFileItemReader<>();
    reader.setResource(resource);
    reader.setEncoding("GBK");
    reader.setLineMapper(new LineMapper<User>() {
        @Override
        public User mapLine(String s, int i) throws Exception {
            System.out.println("read " + s);
            System.out.println(Thread.currentThread().getName());
            if (s == null || "".equals(s)) {
                return new User();
            }
            List<String> collect = Arrays.stream(s.split(" ")).filter(a -> !a.trim().equals("")).collect(Collectors.toList());
            String s1 = collect.size() >= 1 ? collect.get(0) : " ";
            String s2 = collect.size() >= 2 ? collect.get(1) : " ";
            String s3 = collect.size() >= 3 ? collect.get(2) : " ";
            User user = new User() {{
                setName(s1);
                setPassword(s2);
                setEmail(s3);
            }};
            return user;
        }
    });
    return reader;
}
@Bean
public ItemWriter writer(@Qualifier("dataSource") DataSource dataSource){
JdbcBatchItemWriter writer = new JdbcBatchItemWriter<>();
//use JdbcBatchItemWriter, which writes to the database with JDBC batching
writer.setItemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider<>());
String sql = "insert into user2 "+" (gamename,name,email) "
+" values(:name,:password,:email)";
//set the SQL statement to be executed in batches
writer.setSql(sql);
writer.setDataSource(dataSource);
return writer;
}
@Bean
public JobRepository jobRepository(@Qualifier("dataSource") DataSource dataSource, PlatformTransactionManager transactionManager)throws Exception{
//a JobRepositoryFactoryBean, as commented out below, would persist the job execution state in the database
/* JobRepositoryFactoryBean jobRepositoryFactoryBean = new JobRepositoryFactoryBean();
jobRepositoryFactoryBean.setDataSource(dataSource);
jobRepositoryFactoryBean.setTransactionManager(transactionManager);
jobRepositoryFactoryBean.setDatabaseType(DatabaseType.MYSQL.name());
return jobRepositoryFactoryBean.getObject();*/
//the MapJobRepositoryFactoryBean used here keeps the execution state in memory only
MapJobRepositoryFactoryBean jobRepositoryFactoryBean =new MapJobRepositoryFactoryBean();
jobRepositoryFactoryBean.setTransactionManager(transactionManager);
return jobRepositoryFactoryBean.getObject();
}
@Bean
public SimpleJobLauncher jobLauncher(JobRepository jobRepository, ThreadPoolTaskExecutor threadPoolExecutor)throws Exception{
SimpleJobLauncher jobLauncher = new SimpleJobLauncher();
jobLauncher.setJobRepository(jobRepository);
// jobLauncher.setTaskExecutor(threadPoolExecutor );
return jobLauncher;
}
@Bean
public Job importJob(JobBuilderFactory jobBuilderFactory,@Qualifier("step3") Step s1){
return jobBuilderFactory.get("importJob")
.incrementer(new RunIdIncrementer())
.start(s1)
.listener(new CsvJobListener())
// .flow(s1)//assign the Step to the Job
// .end()
// .listener(csvJobListener())//bind the csvJobListener listener
.build();
}
@Bean("step1")
public Step step1(StepBuilderFactory stepBuilderFactory, FlatFileItemReader reader, ItemWriter writer,
ItemProcessor processor){
return stepBuilderFactory
.get(step1)
.chunk(100000)//批处理每次提交65000条数据
.reader(reader)//给step绑定reader
// .faultTolerant()
/* .skip(NullPointerException.class)
.skipLimit(1000)*/
.processor(processor)//给step绑定processor
.writer(writer)//给step绑定writer
.build();
}
@Bean
public ItemProcessor processor(){
//use our custom ItemProcessor implementation, MyPartionProcess
return new MyPartionProcess();
}
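The source of MyPartionProcess is not included in the article; a minimal pass-through sketch of what such a processor could look like:
public class MyPartionProcess implements ItemProcessor<User, User> {
    @Override
    public User process(User user) throws Exception {
        //a real implementation would validate or transform each record here
        return user;
    }
}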
@Bean("step3")
public Step step3(StepBuilderFactory stepBuilderFactory, @Qualifier("partitionHandler") PartitionHandler partitionHandler,@Qualifier("step1") Step step1){
List list =new ArrayList<>();
for (int a =1;a<51;a++) {
ClassPathResource classPathResource = new ClassPathResource("tianya_" + a + ".txt");
if(classPathResource.exists()){
list.add(classPathResource);
}
}
int start=list.size();
Object[] obj= list.toArray();
Resource[] qiye=new Resource[start];
for(int i=0;i
Local partitioning has two parts: 1. data partitioning (MultiResourcePartitioner); 2. partition handling (PartitionHandler). You can implement your own partitioning logic. The principle is one master handler driving multiple slave handlers; if the partition data is sent to remote nodes over messaging such as JMS, this becomes remote partitioning.
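The partitionHandler bean that step3 expects is not shown in the article. For local partitioning it is typically a TaskExecutorPartitionHandler; a minimal sketch, with an assumed grid size of 10:
@Bean("partitionHandler")
public PartitionHandler partitionHandler(@Qualifier("step1") Step step1, ThreadPoolTaskExecutor threadPoolExecutor) throws Exception {
    TaskExecutorPartitionHandler handler = new TaskExecutorPartitionHandler();
    handler.setStep(step1);//the worker step each partition executes
    handler.setTaskExecutor(threadPoolExecutor);//partitions run concurrently on this pool
    handler.setGridSize(10);//assumed number of partitions
    handler.afterPropertiesSet();
    return handler;
}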
@Profile({"master", "mixed"})
@Bean
public Job job(@Qualifier("masterStep") Step masterStep) {
return jobBuilderFactory.get("endOfDayjob")
.start(masterStep)
.incrementer(new BatchIncrementer())
.listener(new JobListener())
.build();
}
@Bean("masterStep")
public Step masterStep(@Qualifier("slaveStep") Step slaveStep,
PartitionHandler partitionHandler,
DataSource dataSource) {
return stepBuilderFactory.get("masterStep")
.partitioner(slaveStep.getName(), new ColumnRangePartitioner(dataSource))
.step(slaveStep)
.partitionHandler(partitionHandler)
.build();
}
The key part of the master node is its Step: it needs the name of the slave Step and a data partitioner. The partitioner implements the Partitioner interface and returns a Map<String, ExecutionContext> that fully describes the partition slice each slave node has to process. The ExecutionContext holds the data boundaries for a slave; which parameters it contains depends on your business case. Here I split the partitions by data ID. The concrete Partitioner implementation is as follows:
/**
* Created by kl on 2018/3/1.
* Content: partition by data ID
*/
public class ColumnRangePartitioner implements Partitioner {
private JdbcOperations jdbcTemplate;
ColumnRangePartitioner(DataSource dataSource){
this.jdbcTemplate = new JdbcTemplate(dataSource);
}
@Override
public Map<String, ExecutionContext> partition(int gridSize) {
int min = jdbcTemplate.queryForObject("SELECT MIN(arcid) from kl_article", Integer.class);
int max = jdbcTemplate.queryForObject("SELECT MAX(arcid) from kl_article", Integer.class);
int targetSize = (max - min) / gridSize + 1;
Map<String, ExecutionContext> result = new HashMap<>();
int number = 0;
int start = min;
int end = start + targetSize - 1;
while (start <= max) {
ExecutionContext value = new ExecutionContext();
result.put("partition" + number, value);
if (end >= max) {
end = max;
}
value.putInt("minValue", start);
value.putInt("maxValue", end);
start += targetSize;
end += targetSize;
number++;
}
return result;
}
}
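On the master side, the PartitionHandler injected into masterStep must ship each partition's StepExecutionRequest over the messaging middleware. Its definition is not shown in the article; a hedged sketch using spring-batch-integration's MessageChannelPartitionHandler (the step name and grid size are assumptions):
@Bean
@Profile({"master", "mixed"})
public PartitionHandler partitionHandler(MessagingTemplate messagingTemplate) throws Exception {
    MessageChannelPartitionHandler handler = new MessageChannelPartitionHandler();
    handler.setStepName("slaveStep");//must match the worker step name registered on the slave nodes
    handler.setGridSize(10);//assumed number of partitions
    handler.setMessagingOperations(messagingTemplate);//sends requests through the outboundRequests channel
    handler.afterPropertiesSet();
    return handler;
}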
@Configuration
//bind properties from the configuration file
@ConfigurationProperties(prefix = "spring.rabbit")
public class IntegrationConfiguration {
private String host;
private Integer port=5672;
private String username;
private String password;
private String virtualHost;
private int connRecvThreads=5;
private int channelCacheSize=10;
//configure the connection to the RabbitMQ server, using a caching (pooled) connection factory
@Bean
public ConnectionFactory connectionFactory() {
CachingConnectionFactory connectionFactory = new CachingConnectionFactory(host, port);
connectionFactory.setUsername(username);
connectionFactory.setPassword(password);
connectionFactory.setVirtualHost(virtualHost);
ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
executor.setCorePoolSize(connRecvThreads);
executor.initialize();
connectionFactory.setExecutor(executor);
connectionFactory.setPublisherConfirms(true);
connectionFactory.setChannelCacheSize(channelCacheSize);
return connectionFactory;
}
//Spring Integration is used to send and receive messages; Spring simplifies the otherwise verbose native messaging API. A default channel is set here, so after partitioning the requests are sent into the DirectChannel automatically
@Bean
public MessagingTemplate messageTemplate() {
MessagingTemplate messagingTemplate = new MessagingTemplate(outboundRequests());
messagingTemplate.setReceiveTimeout(60000000L);
return messagingTemplate;
}
//Spring Integration supports both point-to-point and publish/subscribe; point-to-point is used here. This is the channel requests are sent on
@Bean
public DirectChannel outboundRequests() {
return new DirectChannel();
}
// TODO: 2019/1/15 the partitioned requests are sent into this channel by the PartitionHandler
//every message sent to outboundRequests is intercepted by this endpoint; the AmqpTemplate interface defines the basic send/receive operations and plays the key role here (a RabbitTemplate is injected)
//after being processed by this endpoint, messages are handed to the QueueChannel inboundRequests
@Bean
@ServiceActivator(inputChannel = "outboundRequests")
//setExpectReply enables request/reply: when true the endpoint sends and then waits for a response. Because a reply is expected, setRoutingKey is set to partition.requests, which by default is bound to the queue named partition.requests
public AmqpOutboundEndpoint amqpOutboundEndpoint(AmqpTemplate template) {
AmqpOutboundEndpoint endpoint = new AmqpOutboundEndpoint(template);
endpoint.setExpectReply(true);
endpoint.setOutputChannel(inboundRequests());
endpoint.setRoutingKey("partition.requests");
return endpoint;
}
/*
A channel is only a handle for interacting with queues; it cannot operate on a queue directly.
Receiving means subscribing, on a channel, to messages from a given queue.
Sending normally goes through a channel with a routingKey to a given exchange; the exchange uses the routingKey and its queue bindings to decide which queue gets the message.
If no exchange is declared explicitly and none is given when sending, the default direct exchange is used. If a queue is declared
without being explicitly bound to an exchange, it is bound to the default direct exchange with the queue name as its routing key.
*/
@Bean
public Queue requestQueue() {
return new Queue("partition.requests", false);
}
//a QueueChannel lets receivers poll for messages; it buffers them in a queue whose capacity can be configured
@Bean
public QueueChannel inboundRequests() {
return new QueueChannel();
}
public String getHost() {
return host;
}
public void setHost(String host) {
this.host = host;
}
public Integer getPort() {
return port;
}
public void setPort(Integer port) {
this.port = port;
}
public String getUsername() {
return username;
}
public void setUsername(String username) {
this.username = username;
}
public String getPassword() {
return password;
}
public void setPassword(String password) {
this.password = password;
}
public String getVirtualHost() {
return virtualHost;
}
public void setVirtualHost(String virtualHost) {
this.virtualHost = virtualHost;
}
public int getConnRecvThreads() {
return connRecvThreads;
}
public void setConnRecvThreads(int connRecvThreads) {
this.connRecvThreads = connRecvThreads;
}
public int getChannelCacheSize() {
return channelCacheSize;
}
public void setChannelCacheSize(int channelCacheSize) {
this.channelCacheSize = channelCacheSize;
}
}
@Bean
@Profile({"slave","mixed"})
@ServiceActivator(inputChannel = "inboundRequests", outputChannel = "outboundStaging")
public StepExecutionRequestHandler stepExecutionRequestHandler() {
StepExecutionRequestHandler stepExecutionRequestHandler = new StepExecutionRequestHandler();
BeanFactoryStepLocator stepLocator = new BeanFactoryStepLocator();
stepLocator.setBeanFactory(this.applicationContext);
stepExecutionRequestHandler.setStepLocator(stepLocator);
stepExecutionRequestHandler.setJobExplorer(this.jobExplorer);
return stepExecutionRequestHandler;
}
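The outboundStaging reply channel referenced by the @ServiceActivator above is not defined in the listing either; a minimal definition is a plain channel on which the slave's StepExecution replies travel back toward the master (how the replies are relayed over AMQP is left out here):
@Bean
public DirectChannel outboundStaging() {
    return new DirectChannel();
}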
@Bean("slaveStep")
public Step slaveStep(MyProcessorItem processorItem,
JpaPagingItemReader reader) {
CompositeItemProcessor itemProcessor = new CompositeItemProcessor();
List processorList = new ArrayList<>();
processorList.add(processorItem);
itemProcessor.setDelegates(processorList);
return stepBuilderFactory.get("slaveStep")
.chunk(1000)//commit interval per transaction
.reader(reader)
.processor(itemProcessor)
.writer(new PrintWriterItem())
.build();
}
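PrintWriterItem is the author's class and its source is not included; a trivial sketch of what such an ItemWriter could look like, assuming the Article entity:
public class PrintWriterItem implements ItemWriter<Article> {
    @Override
    public void write(List<? extends Article> items) throws Exception {
        items.forEach(item -> System.out.println("write " + item));//log each processed item
    }
}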
The most critical piece on the slave side is the StepExecutionRequestHandler: it receives messages from the MQ middleware and obtains the data boundaries it has to process from the partition information, as in the ItemReader below.
@Bean(destroyMethod = "")
@StepScope
public JpaPagingItemReader jpaPagingItemReader(
@Value("#{stepExecutionContext['minValue']}") Long minValue,
@Value("#{stepExecutionContext['maxValue']}") Long maxValue) {
System.err.println("接收到分片参数["+minValue+"->"+maxValue+"]");
JpaPagingItemReader reader = new JpaPagingItemReader<>();
JpaNativeQueryProvider queryProvider = new JpaNativeQueryProvider<>();
String sql = "select * from kl_article where arcid >= :minValue and arcid <= :maxValue";
queryProvider.setSqlQuery(sql);
queryProvider.setEntityClass(Article.class);
reader.setQueryProvider(queryProvider);
Map queryParames= new HashMap();
queryParames.put("minValue",minValue);
queryParames.put("maxValue",maxValue);
reader.setParameterValues(queryParames);
reader.setEntityManagerFactory(entityManagerFactory);
return reader;
}
The minValue and maxValue injected here are exactly the values the master node's partitioner stored in each partition's ExecutionContext.
Summary: the above completes a full Spring Batch remote-partitioning example. Note that a single instance can act as master, slave, or both; this is controlled by Spring profiles. Attentive readers will have noticed annotations such as @Profile({"master", "mixed"}), so when testing, don't forget to set spring.profiles.active=slave (or master/mixed) in your Spring Boot configuration.
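For example, a hypothetical application.properties for a worker instance (the property names follow the spring.rabbit prefix defined in IntegrationConfiguration; all values are placeholders):
spring.profiles.active=slave
spring.rabbit.host=localhost
spring.rabbit.port=5672
spring.rabbit.username=guest
spring.rabbit.password=guest
spring.rabbit.virtual-host=/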
Source code downloads:
Remote partitioning:
Link: https://pan.baidu.com/s/13kgxnLmzxaZo-1AEJhdgVg
Extraction code: 143v
Local partitioning:
Link: https://pan.baidu.com/s/1vDfY0Iep-H_8ANk3lV7FfA
Extraction code: sg9e