The code is as follows:
/** Partition by host / domain or IP. */
public int getPartition(FloatWritable key, Writable value,
    int numReduceTasks) {
  return partitioner.getPartition(((SelectorEntry) value).url, key,
      numReduceTasks);
}
As before, the partitioner here is an instance of URLPartitioner.
Let's walk through the body of its getPartition function:
---
String urlString = key.toString(); // take the URL string from the key
URL url = null;
int hashCode = urlString.hashCode(); // fallback: hash of the whole URL string
---
try {
  urlString = normalizers.normalize(urlString, URLNormalizers.SCOPE_PARTITION); // normalize the URL
  url = new URL(urlString); // build a URL object
  hashCode = url.getHost().hashCode(); // hash on the host
} catch (MalformedURLException e) {
  LOG.warn("Malformed URL: '" + urlString + "'");
}
-------------
if (mode.equals(PARTITION_MODE_DOMAIN) && url != null)
  hashCode = URLUtil.getDomainName(url).hashCode(); // get the domain name via the utility class
else if (mode.equals(PARTITION_MODE_IP)) {
  try {
    // in IP mode, resolve the host to its IP address (a DNS lookup) and hash that;
    // note that url is still null here if parsing failed above
    InetAddress address = InetAddress.getByName(url.getHost());
    hashCode = address.getHostAddress().hashCode();
  } catch (UnknownHostException e) {
    Generator.LOG.info("Couldn't find IP for host: "
        + url.getHost());
  }
}
A quick note first: mode defaults to byHost, and it can be set through the following property in the configuration file:
<property>
<name>partition.url.mode</name>
<value>byHost</value>
<description>Determines how to partition URLs. Default value is 'byHost',
also takes 'byDomain' or 'byIP'.
</description>
</property>
With that in mind, the code above is not hard to follow.
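To make the default-and-override behaviour concrete, here is a minimal sketch of how a mode value could be resolved. It is not the Nutch code: java.util.Properties stands in for the job configuration, the class PartitionModeSketch and the constants PARTITION_MODE_KEY and PARTITION_MODE_HOST are names I assume for illustration, and falling back to byHost on an unrecognized value is my assumption as well.

```java
import java.util.Properties;

/** Hypothetical sketch: resolves the partition mode from a key/value configuration. */
final class PartitionModeSketch {
    static final String PARTITION_MODE_KEY = "partition.url.mode"; // property name from the config above
    static final String PARTITION_MODE_HOST = "byHost";            // assumed constant name
    static final String PARTITION_MODE_DOMAIN = "byDomain";
    static final String PARTITION_MODE_IP = "byIP";

    /** Returns the configured mode, or byHost when the property is missing or unrecognized. */
    static String resolveMode(Properties conf) {
        String mode = conf.getProperty(PARTITION_MODE_KEY, PARTITION_MODE_HOST);
        if (!mode.equals(PARTITION_MODE_HOST)
                && !mode.equals(PARTITION_MODE_DOMAIN)
                && !mode.equals(PARTITION_MODE_IP)) {
            // Assumption: an unknown value falls back to the default mode.
            return PARTITION_MODE_HOST;
        }
        return mode;
    }
}
```

With an empty Properties object resolveMode returns byHost; setting partition.url.mode to byIP or byDomain selects the corresponding branch in getPartition.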
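Personally, though, I think there is room for optimization here: in IP mode it would be better to cache the host-to-IP mapping.
That way the lookup does not have to be repeated for every URL, which costs time.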
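The caching idea could look roughly like the sketch below. Everything in it is hypothetical and not part of Nutch: HostIpCache, its Resolver interface and the lookup method are names I made up, and the map is unbounded, which a real crawl over many hosts would probably need to limit.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper (not part of Nutch): memoizes host -> IP string lookups. */
final class HostIpCache {
    /** Resolver abstraction so the DNS call can be swapped out, e.g. in tests. */
    interface Resolver {
        String resolve(String host) throws UnknownHostException;
    }

    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Resolver resolver;

    HostIpCache(Resolver resolver) {
        this.resolver = resolver;
    }

    /** Default resolver: the same InetAddress.getByName call used in the quoted code. */
    static HostIpCache withDns() {
        return new HostIpCache(host -> InetAddress.getByName(host).getHostAddress());
    }

    /** Returns the IP for host, resolving it on first use; null if resolution fails. */
    String lookup(String host) {
        String ip = cache.get(host);
        if (ip == null) {
            try {
                ip = resolver.resolve(host);
            } catch (UnknownHostException e) {
                return null; // failures are not cached, so a later call retries
            }
            cache.put(host, ip);
        }
        return ip;
    }
}
```

In getPartition the IP branch would then call something like cache.lookup(url.getHost()) and hash the result when it is not null. Keep in mind that the JVM's InetAddress already caches lookups for a configurable period, so it is worth measuring before assuming a large gain.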
========================================================
hashCode ^= seed;
return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
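The last step XORs the hash with seed, masks off the sign bit so the value cannot be negative, and takes the result modulo numReduceTasks. Very simple!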
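The two lines above can be tried in isolation with a small sketch. PartitionIndex and its method of are names I chose for illustration; seed is a field of the partitioner that the excerpt does not show, so here it is simply passed in as a parameter.

```java
/** Illustrative stand-in for the last two lines of getPartition. */
final class PartitionIndex {
    /** Mixes the hash with seed, clears the sign bit, and reduces it modulo numReduceTasks. */
    static int of(int hashCode, int seed, int numReduceTasks) {
        hashCode ^= seed;
        // Without the mask, Java's % can return a negative value for a
        // negative hash (for example -7 % 4 is -3), which is not a valid partition.
        return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

For any hash code and seed the result lies between 0 and numReduceTasks - 1, and equal hash codes (for instance two URLs on the same host in byHost mode) always land in the same partition for a given seed.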