This post covers how to use Apache Flume to gather customer product-search clicks and store the information using Hadoop and ElasticSearch sinks. The data may consist of different product-search events, like filtering based on different facets, sorting information, pagination information, and further the products viewed and the products marked as favourite by customers. In later posts we will analyze the data further, to use the same information for display and analytics.
Product Search Functionality
Any eCommerce platform offers different products to customers, and search functionality is one of its foundations. Allowing the user guided navigation using different facets/filters, or a free-text search of the content, is a standard part of any existing search functionality.
SearchQueryInstruction
Consider a scenario where a customer can search for a product, and we capture the product-search behavior with the following information:
public class SearchQueryInstruction implements Serializable {

    private final String _eventIdSuffix;
    private String eventId;
    private String hostedMachineName;
    private String pageUrl;
    private Long customerId;
    private String sessionId;
    private String queryString;
    private String sortOrder;
    private Long pageNumber;
    private Long totalHits;
    private Long hitsShown;
    private final Long createdTimeStampInMillis;
    private String clickedDocId;
    private Boolean favourite;

    private Map<String, Set<String>> filters;

    @JsonProperty(value = "filters")
    private List<FacetFilter> _filters;

    public SearchQueryInstruction() {
        _eventIdSuffix = UUID.randomUUID().toString();
        createdTimeStampInMillis = new Date().getTime();
    }

    // ...

    private static class FacetFilter implements Serializable {

        // ...

        public FacetFilter(String code, String value) {
            // ...
        }
    }
}
Further source information is available at SearchQueryInstruction. The data is serialized in JSON format so it can be used directly with ElasticSearch for display purposes.
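As a rough illustration of that serialization step, here is a minimal sketch using Jackson (the toJson helper and the ObjectMapper setup are assumptions, not taken from the original source; the all-lowercase field names in the samples below suggest the project also applies a custom naming strategy, which is omitted here):

import java.io.IOException;

import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical helper: serialize a SearchQueryInstruction to JSON before
// handing it to the embedded Flume agent. The original project may use a
// different Jackson package and a naming strategy for the lowercase fields.
public final class SearchQueryInstructionJson {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static String toJson(final SearchQueryInstruction instruction) throws IOException {
        return MAPPER.writeValueAsString(instruction);
    }
}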
Sample data showing how the click information looks, based on user clicks. The data is converted to JSON format before being sent to the embedded Flume agent.
{"eventid":"629e9b5f-ff4a-4168-8664-6c8df8214aa7-1399386809805-24","hostedmachinename":"192.168.182.1330","pageurl":"http://jaibigdata.com/5","customerid":24,"sessionid":"648a011d-570e-48ef-bccc-84129c9fa400","querystring":null,"sortorder":"desc","pagenumber":3,"totalhits":28,"hitsshown":7,"createdtimestampinmillis":1399386809805,"clickeddocid":"41","favourite":null,"eventidsuffix":"629e9b5f-ff4a-4168-8664-6c8df8214aa7","filters":[{"code":"searchfacettype_color_level_2","value":"Blue"},{"code":"searchfacettype_age_level_2","value":"12-18 years"}]}
{"eventid":"648b5cf7-7ca9-4664-915d-23b0d45facc4-1399386809782-298","hostedmachinename":"192.168.182.1333","pageurl":"http://jaibigdata.com/4","customerid":298,"sessionid":"7bf042ea-526a-4633-84cd-55e0984ea2cb","querystring":"queryString48","sortorder":"desc","pagenumber":0,"totalhits":29,"hitsshown":19,"createdtimestampinmillis":1399386809782,"clickeddocid":"9","favourite":null,"eventidsuffix":"648b5cf7-7ca9-4664-915d-23b0d45facc4","filters":[{"code":"searchfacettype_color_level_2","value":"Green"}]}
{"eventid":"74bb7cfe-5f8c-4996-9700-0c387249a134-1399386809799-440","hostedmachinename":"192.168.182.1330","pageurl":"http://jaibigdata.com/1","customerid":440,"sessionid":"940c9a0f-a9b2-4f1d-b114-511ac11bf2bb","querystring":"queryString16","sortorder":"asc","pagenumber":3,"totalhits":5,"hitsshown":32,"createdtimestampinmillis":1399386809799,"clickeddocid":null,"favourite":null,"eventidsuffix":"74bb7cfe-5f8c-4996-9700-0c387249a134","filters":[{"code":"searchfacettype_brand_level_2","value":"Apple"}]}
{"eventid":"9da05913-84b1-4a74-89ed-5b6ec6389cce-1399386809828-143","hostedmachinename":"192.168.182.1332","pageurl":"http://jaibigdata.com/1","customerid":143,"sessionid":"08a4a36f-2535-4b0e-b86a-cf180202829b","querystring":null,"sortorder":"desc","pagenumber":0,"totalhits":21,"hitsshown":34,"createdtimestampinmillis":1399386809828,"clickeddocid":"38","favourite":true,"eventidsuffix":"9da05913-84b1-4a74-89ed-5b6ec6389cce","filters":[{"code":"searchfacettype_color_level_2","value":"Blue"},{"code":"product_price_range","value":"10.0 - 20.0"}]}
Apache Flume
Apache Flume is used to gather and aggregate the data. Here an embedded Flume agent is used to capture the SearchQueryInstruction events. In a real scenario, based on the usage,
- either you can use an embedded agent to collect the data,
- or push the data from the page through a REST API to a backend API service dedicated to event collection,
- or you can use the application's logging functionality to log all search events and tail the log file to collect the data.
Consider a scenario where, depending on the application, multiple web/app servers send event data to a collector Flume agent. As depicted in the diagram below, the search-click events are collected from multiple web/app servers by a collector/consolidator agent that gathers data from all agents. The data is further divided, based on a selector using a multiplexing strategy, so that it is stored in Hadoop HDFS while the relevant data, e.g. recently viewed items, is also directed to ElasticSearch. A sketch of such a multiplexing configuration follows.
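For illustration only, a minimal sketch of what the collector agent's configuration could look like with a multiplexing channel selector. The agent, channel, and sink names below are assumptions; the selector keys off the State header that shows up in the ElasticSearch samples later in this post:

# Collector agent: one Avro source fanning out to two channels.
collector.sources = avroSrc
collector.channels = hdfsChannel esChannel
collector.sinks = hdfsSink esSink

collector.sources.avroSrc.type = avro
collector.sources.avroSrc.bind = 0.0.0.0
collector.sources.avroSrc.port = 44444
collector.sources.avroSrc.channels = hdfsChannel esChannel

# Multiplexing: viewed events also go to ElasticSearch,
# everything goes to HDFS.
collector.sources.avroSrc.selector.type = multiplexing
collector.sources.avroSrc.selector.header = State
collector.sources.avroSrc.selector.mapping.VIEWED = hdfsChannel esChannel
collector.sources.avroSrc.selector.default = hdfsChannel

collector.channels.hdfsChannel.type = memory
collector.channels.esChannel.type = memory

collector.sinks.hdfsSink.type = hdfs
collector.sinks.hdfsSink.channel = hdfsChannel
collector.sinks.esSink.type = org.apache.flume.sink.elasticsearch.ElasticSearchSink
collector.sinks.esSink.channel = esChannel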
Embedded Flume Agent
The embedded Flume agent allows us to include the Flume agent within the application itself, collect the data there, and send it further on to the collector agent.
private static EmbeddedAgent agent;

private void createAgent() {
    final Map<String, String> properties = new HashMap<String, String>();
    properties.put("channel.type", "memory");
    properties.put("channel.capacity", "100000");
    properties.put("channel.transactionCapacity", "1000");
    properties.put("sinks", "sink1");
    properties.put("sink1.type", "avro");
    properties.put("sink1.hostname", "localhost");
    properties.put("sink1.port", "44444");
    properties.put("processor.type", "default");

    try {
        agent = new EmbeddedAgent("searchqueryagent");
        agent.configure(properties);
        agent.start();
    } catch (final Exception ex) {
        LOG.error("Error creating agent!", ex);
    }
}
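Once the agent is configured and started, the application hands it each serialized click event. A minimal usage sketch (the submitEvent method name is illustrative; EmbeddedAgent.put and EventBuilder are standard Flume APIs):

import java.nio.charset.Charset;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.event.EventBuilder;

// Wrap the JSON click event and hand it to the embedded agent, which
// forwards it to the collector agent over Avro.
public void submitEvent(final String jsonEvent) {
    try {
        final Event event = EventBuilder.withBody(jsonEvent, Charset.forName("UTF-8"));
        agent.put(event);
    } catch (final EventDeliveryException ex) {
        LOG.error("Error submitting event!", ex);
    }
}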
Store Search Events Data
Flume provides multiple sink options to store the data for future analysis. As shown in the diagram, we will take the scenario of storing the data in Apache Hadoop and also in ElasticSearch for the recently-viewed-items functionality.
Hadoop Sink
The Hadoop sink allows the data to be stored permanently in HDFS so that it can be analyzed later.
Based on the incoming event data, let's say we want to store it on an hourly basis. The "/searchevents/2014/05/15/16" directory will store all incoming events for hour 16.
private HDFSEventSink sink;
private Channel channel;

sink = new HDFSEventSink();
sink.setName("HDFSEventSink-" + UUID.randomUUID());
channel = new MemoryChannel();
Map<String, String> channelParameters = new HashMap<>();
channelParameters.put("capacity", "100000");
channelParameters.put("transactionCapacity", "1000");
Context channelContext = new Context(channelParameters);
Configurables.configure(channel, channelContext);
channel.setName("HDFSEventSinkChannel-" + UUID.randomUUID());

Map<String, String> parameters = new HashMap<>();
parameters.put("hdfs.type", "hdfs");
String hdfsBasePath = hadoopClusterService.getHDFSUri()
        + "/searchevents";
parameters.put("hdfs.path", hdfsBasePath + "/%Y/%m/%d/%H");
parameters.put("hdfs.filePrefix", "searchevents");
parameters.put("hdfs.fileType", "DataStream");
parameters.put("hdfs.rollInterval", "0");
parameters.put("hdfs.rollSize", "0");
parameters.put("hdfs.idleTimeout", "1");
parameters.put("hdfs.rollCount", "0");
parameters.put("hdfs.batchSize", "1000");
parameters.put("hdfs.useLocalTimeStamp", "true");

Context sinkContext = new Context(parameters);
sink.configure(sinkContext);
sink.setChannel(channel);
Check FlumeHDFSSinkServiceImpl.java for the detailed start/stop handling of the HDFS sink.
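As a rough sketch of what such a service does (the method names here are illustrative, not copied from FlumeHDFSSinkServiceImpl.java; start(), stop(), and process() are the standard Flume lifecycle and sink calls):

// Bring the channel up before the sink, and tear down in reverse order.
public void startSink() {
    channel.start();
    sink.start();
}

// Drain pending events from the channel into HDFS; a real service would
// drive this from a runner thread rather than calling it directly.
public void processEvents() throws EventDeliveryException {
    sink.process();
}

public void stopSink() {
    sink.stop();
    channel.stop();
}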
The sample data below is stored in Hadoop as:
Check:hdfs://localhost.localdomain:54321/searchevents/2014/05/06/16/searchevents.1399386809864
body is:{"eventid":"e8470a00-c869-4a90-89f2-f550522f8f52-1399386809212-72","hostedmachinename":"192.168.182.1334","pageurl":"http://jaibigdata.com/0","customerid":72,"sessionid":"7871a55c-a950-4394-bf5f-d2179a553575","querystring":null,"sortorder":"desc","pagenumber":0,"totalhits":8,"hitsshown":44,"createdtimestampinmillis":1399386809212,"clickeddocid":"23","favourite":null,"eventidsuffix":"e8470a00-c869-4a90-89f2-f550522f8f52","filters":[{"code":"searchfacettype_brand_level_2","value":"Apple"},{"code":"searchfacettype_color_level_2","value":"Blue"}]}
body is:{"eventid":"2a4c1e1b-d2c9-4fe2-b38d-9b7d32feb4e0-1399386809743-61","hostedmachinename":"192.168.182.1330","pageurl":"http://jaibigdata.com/0","customerid":61,"sessionid":"78286f6d-cc1e-489c-85ce-a7de8419d628","querystring":"queryString59","sortorder":"asc","pagenumber":3,"totalhits":32,"hitsshown":9,"createdtimestampinmillis":1399386809743,"clickeddocid":null,"favourite":null,"eventidsuffix":"2a4c1e1b-d2c9-4fe2-b38d-9b7d32feb4e0","filters":[{"code":"searchfacettype_age_level_2","value":"0-12 years"}]}
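To verify what landed in HDFS, you can read an hour's directory back; a minimal sketch with the Hadoop FileSystem API (the URI and path come from the sample output above, everything else is illustrative):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Read back all events stored for hour 16 and print each JSON line.
FileSystem fs = FileSystem.get(URI.create("hdfs://localhost.localdomain:54321"), new Configuration());
for (FileStatus status : fs.listStatus(new Path("/searchevents/2014/05/06/16"))) {
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(status.getPath())))) {
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println("body is:" + line);
        }
    }
}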
ElasticSearch Sink
The ElasticSearch sink serves the display side: showing recently viewed items to the end user. It automatically creates daily indices of recently viewed items, and this functionality can be used to display each customer's recently viewed items.
Let's say you already have an ES instance running at localhost:9310.
private ElasticSearchSink sink;
private Channel channel;

sink = new ElasticSearchSink();
sink.setName("ElasticSearchSink-" + UUID.randomUUID());
channel = new MemoryChannel();
Map<String, String> channelParameters = new HashMap<>();
channelParameters.put("capacity", "100000");
channelParameters.put("transactionCapacity", "1000");
Context channelContext = new Context(channelParameters);
Configurables.configure(channel, channelContext);
channel.setName("ElasticSearchSinkChannel-" + UUID.randomUUID());

Map<String, String> parameters = new HashMap<>();
parameters.put(ElasticSearchSinkConstants.HOSTNAMES, "127.0.0.1:9310");
String indexNamePrefix = "recentlyviewed";
parameters.put(ElasticSearchSinkConstants.INDEX_NAME, indexNamePrefix);
parameters.put(ElasticSearchSinkConstants.INDEX_TYPE, "clickevent");
parameters.put(ElasticSearchSinkConstants.CLUSTER_NAME, "jai-testclusterName");
parameters.put(ElasticSearchSinkConstants.BATCH_SIZE, "10");
parameters.put(ElasticSearchSinkConstants.SERIALIZER,
        ElasticSearchJsonBodyEventSerializer.class.getName());

Context sinkContext = new Context(parameters);
sink.configure(sinkContext);
sink.setChannel(channel);
Check FlumeESSinkServiceImpl.java for details on starting/stopping the ElasticSearch sink.
Sample data in ElasticSearch is stored as:
{timestamp=1399386809743, body={pageurl=http://jaibigdata.com/0, querystring=queryString59, pagenumber=3, hitsshown=9, hostedmachinename=192.168.182.1330, createdtimestampinmillis=1399386809743, sessionid=78286f6d-cc1e-489c-85ce-a7de8419d628, eventid=2a4c1e1b-d2c9-4fe2-b38d-9b7d32feb4e0-1399386809743-61, totalhits=32, clickeddocid=null, customerid=61, sortorder=asc, favourite=null, eventidsuffix=2a4c1e1b-d2c9-4fe2-b38d-9b7d32feb4e0, filters=[{value=0-12 years, code=searchfacettype_age_level_2}]}, eventId=2a4c1e1b-d2c9-4fe2-b38d-9b7d32feb4e0}
{timestamp=1399386809757, body={pageurl=http://jaibigdata.com/1, querystring=null, pagenumber=1, hitsshown=34, hostedmachinename=192.168.182.1330, createdtimestampinmillis=1399386809757, sessionid=e6a3fd51-fe07-4e21-8574-ce5ab8bfbd68, eventid=fe5279b7-0bce-4e2b-ad15-8b94107aa792-1399386809757-134, totalhits=9, clickeddocid=22, customerid=134, sortorder=desc, favourite=null, eventidsuffix=fe5279b7-0bce-4e2b-ad15-8b94107aa792, filters=[{value=Blue, code=searchfacettype_color_level_2}]}, State=VIEWED, eventId=fe5279b7-0bce-4e2b-ad15-8b94107aa792}
{timestamp=1399386809765, body={pageurl=http://jaibigdata.com/0, querystring=null, pagenumber=4, hitsshown=2, hostedmachinename=192.168.182.1331, createdtimestampinmillis=1399386809765, sessionid=29864de8-5708-40ab-a78b-4fae55698b01, eventid=886e9a28-4c8c-4e8c-a866-e86f685ecc54-1399386809765-317, totalhits=2, clickeddocid=null, customerid=317, sortorder=asc, favourite=null, eventidsuffix=886e9a28-4c8c-4e8c-a866-e86f685ecc54, filters=[{value=0-12 years, code=searchfacettype_age_level_2}, {value=0.0 - 10.0, code=product_price_range}]}, eventId=886e9a28-4c8c-4e8c-a866-e86f685ecc54}
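To display the data, you would query the index for a given customer. A rough sketch against the ElasticSearch 1.x Java API (the field paths follow the stored samples above; the Flume ES sink writes day-stamped indices, hence the wildcard; the exact query is an assumption, not part of the original project):

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.sort.SortOrder;

// Latest recently-viewed events for one customer, newest first.
Client client = new TransportClient(ImmutableSettings.settingsBuilder()
        .put("cluster.name", "jai-testclusterName").build())
        .addTransportAddress(new InetSocketTransportAddress("127.0.0.1", 9310));

SearchResponse response = client.prepareSearch("recentlyviewed-*")
        .setTypes("clickevent")
        .setQuery(QueryBuilders.termQuery("body.customerid", 61))
        .addSort("body.createdtimestampinmillis", SortOrder.DESC)
        .setSize(10)
        .execute().actionGet();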
ElasticSearchJsonBodyEventSerializer
The serializer controls how the data is indexed in ElasticSearch. Update the event serializer according to your own strategy for how the data should be indexed.
public class ElasticSearchJsonBodyEventSerializer implements ElasticSearchEventSerializer {

    @Override
    public BytesStream getContentBuilder(final Event event) throws IOException {
        final XContentBuilder builder = jsonBuilder().startObject();
        appendBody(builder, event);
        appendHeaders(builder, event);
        // ...
    }

    // ...
}
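The two helpers referenced above are not shown in full here. A rough sketch of what they could look like, consistent with the stored samples (body embedded as raw JSON, headers copied as top-level fields; the real implementations may differ):

// Assumed implementation: the event body is already JSON, so embed it
// as a raw "body" field on the indexed document.
private void appendBody(final XContentBuilder builder, final Event event) throws IOException {
    builder.rawField("body", event.getBody());
}

// Assumed implementation: copy each Flume event header (e.g. timestamp,
// eventId, State) onto the document as a top-level field.
private void appendHeaders(final XContentBuilder builder, final Event event) throws IOException {
    for (final Map.Entry<String, String> header : event.getHeaders().entrySet()) {
        builder.field(header.getKey(), header.getValue());
    }
}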