For reference, see Elasticsearch's Bulk Processor.
BulkProcessor provides a simple interface for submitting requests in batches (mixed request types such as IndexRequest and DeleteRequest), and it can flush based on the number of requests, their accumulated size, or a fixed interval. Flushes can be submitted synchronously or asynchronously. Here is the official example:
import org.elasticsearch.action.bulk.BackoffPolicy;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.delete.DeleteRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;
import org.elasticsearch.common.unit.TimeValue;

import java.util.concurrent.TimeUnit;

// create a BulkProcessor
BulkProcessor bulkProcessor = BulkProcessor.builder(
        client,
        new BulkProcessor.Listener() {
            @Override
            public void beforeBulk(long executionId, BulkRequest request) {
                request.numberOfActions();
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
                response.hasFailures();
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                failure.getMessage();
            }
        })
        // flush once every 10000 requests
        .setBulkActions(10000)
        // flush once the accumulated bulk data reaches 5MB
        .setBulkSize(new ByteSizeValue(5, ByteSizeUnit.MB))
        // flush every 5 seconds
        .setFlushInterval(TimeValue.timeValueSeconds(5))
        // 0 means synchronous submission, i.e. only a single request at a time;
        // 1 means one concurrent request may execute while a new bulk is being accumulated
        .setConcurrentRequests(1)
        // retry policy for when requests in a bulk fail with EsRejectedExecutionException,
        // which ES throws when the cluster has too few resources left to process the request:
        // wait 100ms initially, back off exponentially, and retry up to 3 times.
        // Use BackoffPolicy.noBackoff() to disable retries.
        .setBackoffPolicy(
                BackoffPolicy.exponentialBackoff(TimeValue.timeValueMillis(100), 3))
        .build();

// add an IndexRequest to the bulk processor
bulkProcessor.add(new IndexRequest("twitter", "_doc", "1").source(/* your doc here */));
// add a DeleteRequest to the bulk processor
bulkProcessor.add(new DeleteRequest("twitter", "_doc", "2"));
// close the bulk processor, waiting up to 10 minutes for in-flight bulks to finish
bulkProcessor.awaitClose(10, TimeUnit.MINUTES);
The build() call at the end of the example above follows the builder pattern.
BulkProcessor.builder(client, listener)
takes a TransportClient instance as client and a listener that receives callbacks around each bulk.
The various .setXxx methods set properties, and the final .build() call creates a BulkProcessor instance:
BulkProcessor(Client client, BackoffPolicy backoffPolicy, Listener listener, @Nullable String name,
              int concurrentRequests, int bulkActions, ByteSizeValue bulkSize, @Nullable TimeValue flushInterval) {
    this.bulkActions = bulkActions;
    this.bulkSize = bulkSize.bytes();
    this.bulkRequest = new BulkRequest();
    // choose a synchronous or asynchronous handler depending on concurrentRequests
    this.bulkRequestHandler = (concurrentRequests == 0) ?
            BulkRequestHandler.syncHandler(client, backoffPolicy, listener) :
            BulkRequestHandler.asyncHandler(client, backoffPolicy, listener, concurrentRequests);
    if (flushInterval != null) {
        // a flushInterval was configured, so start a single-threaded scheduler
        this.scheduler = (ScheduledThreadPoolExecutor) Executors.newScheduledThreadPool(1,
                EsExecutors.daemonThreadFactory(client.settings(), (name != null ? "[" + name + "]" : "") + "bulk_processor"));
        this.scheduler.setExecuteExistingDelayedTasksAfterShutdownPolicy(false);
        this.scheduler.setContinueExistingPeriodicTasksAfterShutdownPolicy(false);
        // submit the accumulated bulk periodically at the configured flush interval
        this.scheduledFuture = this.scheduler.scheduleWithFixedDelay(new Flush(), flushInterval.millis(), flushInterval.millis(), TimeUnit.MILLISECONDS);
    } else {
        this.scheduler = null;
        this.scheduledFuture = null;
    }
}
Now let's look at what the Flush task does:
class Flush implements Runnable {
    @Override
    public void run() {
        synchronized (BulkProcessor.this) {
            if (closed) {
                // the processor has been closed, do nothing
                return;
            }
            if (bulkRequest.numberOfActions() == 0) {
                // nothing has been accumulated, do nothing
                return;
            }
            // otherwise submit the accumulated bulk
            execute();
        }
    }
}
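Besides the periodic Flush task, the accumulated bulk can also be flushed on demand; BulkProcessor exposes a public flush() method for that. A usage sketch, assuming the bulkProcessor instance from the example above:
// manually submit whatever is currently buffered, without waiting for the flush interval
bulkProcessor.flush();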
The synchronous handler, SyncBulkRequestHandler, has the following execute method:
public void execute(BulkRequest bulkRequest, long executionId) {
    boolean afterCalled = false;
    try {
        // invoke the registered listener's beforeBulk
        listener.beforeBulk(executionId, bulkRequest);
        // write the data to ES synchronously, following the configured retry policy
        BulkResponse bulkResponse = Retry
                .on(EsRejectedExecutionException.class)
                .policy(backoffPolicy)
                .withSyncBackoff(client, bulkRequest);
        afterCalled = true;
        // invoke the registered listener's afterBulk
        listener.afterBulk(executionId, bulkRequest, bulkResponse);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        if (!afterCalled) {
            // on failure, invoke afterBulk with the exception
            listener.afterBulk(executionId, bulkRequest, e);
        }
    } catch (Throwable t) {
        logger.warn("Failed to execute bulk request {}.", t, executionId);
        if (!afterCalled) {
            listener.afterBulk(executionId, bulkRequest, t);
        }
    }
}
The AsyncBulkRequestHandler, by contrast, submits its bulks asynchronously:
private AsyncBulkRequestHandler(Client client, BackoffPolicy backoffPolicy, BulkProcessor.Listener listener, int concurrentRequests) {
    super(client);
    this.backoffPolicy = backoffPolicy;
    assert concurrentRequests > 0;
    this.listener = listener;
    // this is the value passed to setConcurrentRequests
    this.concurrentRequests = concurrentRequests;
    // a Semaphore with concurrentRequests permits bounds the number of in-flight bulks
    this.semaphore = new Semaphore(concurrentRequests);
}
public void execute(final BulkRequest bulkRequest, final long executionId) {
    boolean bulkRequestSetupSuccessful = false;
    boolean acquired = false;
    try {
        listener.beforeBulk(executionId, bulkRequest);
        // acquire a permit, blocking if none is available
        semaphore.acquire();
        acquired = true;
        Retry.on(EsRejectedExecutionException.class)
                .policy(backoffPolicy)
                // submit the bulk asynchronously
                .withAsyncBackoff(client, bulkRequest, new ActionListener<BulkResponse>() {
                    @Override
                    public void onResponse(BulkResponse response) {
                        // called once ES has responded
                        try {
                            listener.afterBulk(executionId, bulkRequest, response);
                        } finally {
                            // the permit is only released once this bulk has completed
                            semaphore.release();
                        }
                    }

                    @Override
                    public void onFailure(Throwable e) {
                        try {
                            listener.afterBulk(executionId, bulkRequest, e);
                        } finally {
                            // the permit is also released when the bulk fails
                            semaphore.release();
                        }
                    }
                });
        // the asynchronous submission was set up successfully
        bulkRequestSetupSuccessful = true;
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        logger.info("Bulk request {} has been cancelled.", e, executionId);
        listener.afterBulk(executionId, bulkRequest, e);
    } catch (Throwable t) {
        logger.warn("Failed to execute bulk request {}.", t, executionId);
        listener.afterBulk(executionId, bulkRequest, t);
    } finally {
        if (!bulkRequestSetupSuccessful && acquired) {
            // if we fail on client.bulk(), i.e. the submission itself fails, release the permit as well
            semaphore.release();
        }
    }
}
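The Semaphore is the entire concurrency-control mechanism here: every in-flight bulk holds one permit, acquire() blocks the submitting thread once concurrentRequests bulks are outstanding, and the permit is released only in the response or failure callback. A minimal standalone sketch of that pattern (BulkSubmitter and sendAsync are made-up names for illustration, not part of the Elasticsearch API):
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;

// hypothetical distillation of the permit-per-in-flight-bulk pattern used by AsyncBulkRequestHandler
class BulkSubmitter {
    private final Semaphore semaphore;

    BulkSubmitter(int concurrentRequests) {
        // one permit per allowed in-flight bulk
        this.semaphore = new Semaphore(concurrentRequests);
    }

    void submit(Runnable sendAsync) throws InterruptedException {
        // blocks here once 'concurrentRequests' bulks are already in flight
        semaphore.acquire();
        boolean started = false;
        try {
            // release the permit only when the asynchronous "send" completes, successfully or not
            CompletableFuture.runAsync(sendAsync).whenComplete((v, e) -> semaphore.release());
            started = true;
        } finally {
            if (!started) {
                // if we failed to even start the send, release the permit immediately
                semaphore.release();
            }
        }
    }
}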
Finally, let's look at bulkProcessor.add(IndexRequest); what actually runs is the following method:
// note that this method holds the lock on the whole BulkProcessor instance
private synchronized void internalAdd(ActionRequest request, @Nullable Object payload) {
    // make sure the processor has not been closed
    ensureOpen();
    // bulkRequest keeps an internal list of ActionRequests; the request is appended to it,
    // and the accumulated size grows by the request's source plus roughly 50 bytes of overhead
    bulkRequest.add(request, payload);
    // if the accumulated number of requests or bytes exceeds the configured thresholds, this calls execute();
    // note that execute() still has to acquire a permit, and it is the same execute() the Flush task uses;
    // otherwise nothing happens
    executeIfNeeded();
}
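executeIfNeeded itself is just a threshold check over the two limits configured via setBulkActions and setBulkSize. A simplified approximation of that logic (paraphrased for illustration, not the verbatim Elasticsearch source):
private void executeIfNeeded() {
    ensureOpen();
    if (!isOverTheLimit()) {
        return;
    }
    // one of the thresholds was crossed, so submit the accumulated bulk now
    execute();
}

private boolean isOverTheLimit() {
    // -1 means the corresponding limit is disabled
    if (bulkActions != -1 && bulkRequest.numberOfActions() >= bulkActions) {
        return true;
    }
    if (bulkSize != -1 && bulkRequest.estimatedSizeInBytes() >= bulkSize) {
        return true;
    }
    return false;
}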
As the code above shows, setConcurrentRequests is the key setting in the asynchronous bulk scenario.
For example, with setConcurrentRequests(2), if a large bulk has been submitted and its response has not come back within one flush interval, the next flush can still acquire a permit and submit a second bulk.
Likewise, when the data volume is high and the action/size thresholds are reached quickly, higher parallelism is needed for throughput.
With setConcurrentRequests(0), bulk submission is fully synchronous: each submission must wait for the previous bulk to complete, and then for the next flush trigger, before the next bulk can go out.
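As a quick illustration of the two modes (a sketch reusing the client from the official example above and assuming its anonymous Listener is bound to a listener variable; other settings omitted):
// fully synchronous: the submitting thread blocks until each bulk has been answered by ES
BulkProcessor syncProcessor = BulkProcessor.builder(client, listener)
        .setConcurrentRequests(0)
        .setFlushInterval(TimeValue.timeValueSeconds(5))
        .build();

// up to 2 bulks in flight at once; a further submission blocks on the semaphore until a permit is freed
BulkProcessor asyncProcessor = BulkProcessor.builder(client, listener)
        .setConcurrentRequests(2)
        .setFlushInterval(TimeValue.timeValueSeconds(5))
        .build();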