Although the Index.db component of a Cassandra SSTable already records the position of each rowkey inside the Data.db file, Index.db lives on disk: if every read had to go through that file, each query would pay one extra disk I/O. This is what the key cache is for. It records, per SSTable, the position in Data.db that a rowkey maps to, and keeps that mapping in memory, saving an I/O and speeding up lookups by rowkey.
While Cassandra is running, the overall management and control of caches is handled by CacheService, which holds the key cache as:
public final AutoSavingCache<KeyCacheKey, RowIndexEntry> keyCache;
Each SSTable is mapped in memory to an SSTableReader object that controls access to that SSTable, so SSTableReader keeps a corresponding reference to the key cache:
private InstrumentingCache<KeyCacheKey, RowIndexEntry> keyCache;
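To make the idea concrete, here is a minimal, self-contained sketch (an illustration only, not Cassandra's classes; the real key and value types are KeyCacheKey and RowIndexEntry): look up the (sstable, rowkey) pair in an in-memory map first, fall back to scanning Index.db only on a miss, then populate the cache for the next read.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in used only to illustrate the key cache idea.
public class KeyCacheIdeaSketch
{
    // (sstable generation + rowkey) -> position of the row in Data.db
    private final Map<String, Long> keyCache = new ConcurrentHashMap<String, Long>();

    public long findPosition(int sstableGeneration, String rowKey)
    {
        String cacheKey = sstableGeneration + ":" + rowKey;
        Long cached = keyCache.get(cacheKey);
        if (cached != null)
            return cached;                    // hit: no Index.db I/O needed

        long position = scanIndexFile(sstableGeneration, rowKey); // the extra disk I/O we want to avoid
        if (position >= 0)
            keyCache.put(cacheKey, position); // warm the cache for subsequent reads
        return position;
    }

    // Placeholder for the real on-disk Index.db lookup.
    private long scanIndexFile(int sstableGeneration, String rowKey)
    {
        return -1;
    }
}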
1. Key cache initialization. The key cache is enabled, according to the cache configuration parameters, when a CF's in-memory object ColumnFamilyStore is created.
The relevant key cache properties are as follows:
Key cache capacity: the key_cache_size_in_mb setting in cassandra.yaml. If it is set, the in-memory capacity of the key cache is exactly that value; otherwise, if 5% of the total (heap) memory is larger than 100 MB, the capacity is 100 MB, else it is 5% of the total memory, i.e. min(5% of heap, 100 MB).
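A small sketch of that sizing rule (my own illustration of the rule as described, not the CacheService code): use key_cache_size_in_mb when set, otherwise min(5% of the heap, 100 MB).

// Illustration of the capacity rule described above; not Cassandra code.
public class KeyCacheSizingSketch
{
    // configuredMb stands for key_cache_size_in_mb (null when the option is left empty)
    static long keyCacheSizeInMb(Long configuredMb)
    {
        if (configuredMb != null)
            return configuredMb;                     // explicit setting wins
        long heapMb = Runtime.getRuntime().totalMemory() / 1024 / 1024;
        return Math.min(heapMb * 5 / 100, 100);      // min(5% of heap, 100 MB)
    }

    public static void main(String[] args)
    {
        System.out.println(keyCacheSizeInMb(null));  // auto: a 1024 MB heap gives min(51, 100) = 51
        System.out.println(keyCacheSizeInMb(200L));  // explicitly configured: 200
    }
}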
The in-memory structure of the key cache: public final AutoSavingCache<KeyCacheKey, RowIndexEntry> keyCache.
First, its parent class InstrumentingCache<K, V> has four fields:
private volatile boolean capacitySetManually; // whether the cache capacity has been changed manually
private final ICache<K, V> map; // the underlying cache implementation being wrapped
private final String type; // the cache type; Cassandra currently has two, the key cache and the row cache
private CacheMetrics metrics; // the cache metrics, mainly recording things such as the hit rate
Then AutoSavingCache itself has four fields:
public static final Set<CacheService.CacheType> flushInProgress; // cache types for which a flush (save) is currently executing; only one may run at a time
protected volatile ScheduledFuture<?> saveTask; // the scheduled task that saves the cache, described in detail below
protected final CacheService.CacheType cacheType; // the cache type
private CacheSerializer<K, V> cacheLoader; // the serializer used to persist and load cache entries; for the key cache each saved entry has the layout:
rowkey length | rowkey | sstable identifier (ssid / generation) | flag recording whether the index format was promoted (upgraded) | if the index was promoted, the RowIndexEntry is serialized next, otherwise the entry ends here
RowIndexEntry serialization was already covered in the indexinfo section.
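As a concrete (and deliberately simplified) picture of that layout, the sketch below writes the fields in the order described using java.io.DataOutput; the method name and the single long standing in for RowIndexEntry are illustrative assumptions, this is not Cassandra's KeyCacheSerializer.

import java.io.DataOutput;
import java.io.IOException;

// Simplified write-out of the saved key cache entry layout described above.
public class KeyCacheEntryWriteSketch
{
    static void writeEntry(DataOutput out,
                           byte[] rowKey,
                           int sstableGeneration,      // the "ssid"
                           boolean hasPromotedIndexes, // the promoted-index version flag
                           long dataFilePosition) throws IOException
    {
        out.writeInt(rowKey.length);          // rowkey length
        out.write(rowKey);                    // rowkey bytes
        out.writeInt(sstableGeneration);      // which sstable the entry belongs to
        out.writeBoolean(hasPromotedIndexes); // was the index format promoted?
        if (!hasPromotedIndexes)
            return;                           // nothing more to write for old-format sstables
        out.writeLong(dataFilePosition);      // stand-in for serializing the RowIndexEntry
    }
}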
Next, the fields of KeyCacheKey:
public final Descriptor desc; // the descriptor of the owning sstable (its per-sstable attributes)
public final byte[] key; // the rowkey bytes
/**
* We can use Weighers.singleton() because Long can't be leaking memory
* @return auto saving cache object
*/
private AutoSavingCache<KeyCacheKey, RowIndexEntry> initKeyCache()
{
logger.info("Initializing key cache with capacity of {} MBs.", DatabaseDescriptor.getKeyCacheSizeInMB());
long keyCacheInMemoryCapacity = DatabaseDescriptor.getKeyCacheSizeInMB() * 1024 * 1024;
// as values are constant size we can use singleton weigher
// where 48 = 40 bytes (average size of the key) + 8 bytes (size of value)
ICache<KeyCacheKey, RowIndexEntry> kc;
if (MemoryMeter.isInitialized())
{
kc = ConcurrentLinkedHashCache.create(keyCacheInMemoryCapacity);
}
else
{
logger.warn("MemoryMeter uninitialized (jamm not specified as java agent); KeyCache size in JVM Heap will not be calculated accurately. " +
"Usually this means cassandra-env.sh disabled jamm because you are using a buggy JRE; upgrade to the Sun JRE instead");
/* We don't know the overhead size because memory meter is not enabled. */
EntryWeigher<KeyCacheKey, RowIndexEntry> weigher = new EntryWeigher<KeyCacheKey, RowIndexEntry>()
{
public int weightOf(KeyCacheKey key, RowIndexEntry entry)
{
return key.key.length + entry.serializedSize();
}
};
kc = ConcurrentLinkedHashCache.create(keyCacheInMemoryCapacity, weigher);
}
AutoSavingCache<KeyCacheKey, RowIndexEntry> keyCache = new AutoSavingCache<KeyCacheKey, RowIndexEntry>(kc, CacheType.KEY_CACHE, new KeyCacheSerializer());
int keyCacheKeysToSave = DatabaseDescriptor.getKeyCacheKeysToSave();
logger.info("Scheduling key cache save to each {} seconds (going to save {} keys).",
DatabaseDescriptor.getKeyCacheSavePeriod(),
keyCacheKeysToSave == Integer.MAX_VALUE ? "all" : keyCacheKeysToSave);
keyCache.scheduleSaving(DatabaseDescriptor.getKeyCacheSavePeriod(), keyCacheKeysToSave);
return keyCache;
}
/**
* Wraps an ICache in requests + hits tracking.
*/
public class InstrumentingCache<K, V>
{
private volatile boolean capacitySetManually;
private final ICache<K, V> map;
private final String type;
private CacheMetrics metrics;
public InstrumentingCache(String type, ICache<K, V> map)
{
this.map = map;
this.type = type;
this.metrics = new CacheMetrics(type, map);
}
public void put(K key, V value)
{
map.put(key, value);
}
public boolean putIfAbsent(K key, V value)
{
return map.putIfAbsent(key, value);
}
public boolean replace(K key, V old, V value)
{
return map.replace(key, old, value);
}
public V get(K key)
{
V v = map.get(key);
metrics.requests.mark();
if (v != null)
metrics.hits.mark();
return v;
}
public V getInternal(K key)
{
return map.get(key);
}
public void remove(K key)
{
map.remove(key);
}
public long getCapacity()
{
return map.capacity();
}
public boolean isCapacitySetManually()
{
return capacitySetManually;
}
public void updateCapacity(long capacity)
{
map.setCapacity(capacity);
}
public void setCapacity(long capacity)
{
updateCapacity(capacity);
capacitySetManually = true;
}
public int size()
{
return map.size();
}
public long weightedSize()
{
return map.weightedSize();
}
public void clear()
{
map.clear();
metrics = new CacheMetrics(type, map);
}
public Set<K> getKeySet()
{
return map.keySet();
}
public Set<K> hotKeySet(int n)
{
return map.hotKeySet(n);
}
public boolean containsKey(K key)
{
return map.containsKey(key);
}
public boolean isPutCopying()
{
return map.isPutCopying();
}
public CacheMetrics getMetrics()
{
return metrics;
}
}
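The wrapper does little beyond delegation plus bookkeeping: get() marks a request on every call and a hit only when a value is found, which is what CacheMetrics turns into the hit rate. A tiny stand-alone illustration of that accounting (not Cassandra's classes):

import java.util.HashMap;
import java.util.Map;

// Stand-alone illustration of the hits/requests accounting performed in InstrumentingCache.get().
public class HitRateSketch
{
    private final Map<String, String> map = new HashMap<String, String>();
    private long requests;
    private long hits;

    public void put(String key, String value)
    {
        map.put(key, value);
    }

    public String get(String key)
    {
        requests++;              // corresponds to metrics.requests.mark()
        String v = map.get(key);
        if (v != null)
            hits++;              // corresponds to metrics.hits.mark()
        return v;
    }

    public double hitRate()
    {
        return requests == 0 ? 0.0 : (double) hits / requests;
    }

    public static void main(String[] args)
    {
        HitRateSketch cache = new HitRateSketch();
        cache.put("k1", "v1");
        cache.get("k1");                     // hit
        cache.get("k2");                     // miss
        System.out.println(cache.hitRate()); // prints 0.5
    }
}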
package org.apache.cassandra.cache;
import java.io.*;
import java.nio.ByteBuffer;
import java.util.*;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import org.cliffc.high_scale_lib.NonBlockingHashSet;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.apache.cassandra.config.CFMetaData;
import org.apache.cassandra.config.DatabaseDescriptor;
import org.apache.cassandra.db.Table;
import org.apache.cassandra.db.compaction.CompactionInfo;
import org.apache.cassandra.db.compaction.CompactionManager;
import org.apache.cassandra.db.compaction.OperationType;
import org.apache.cassandra.db.ColumnFamilyStore;
import org.apache.cassandra.io.FSWriteError;
import org.apache.cassandra.io.util.FileUtils;
import org.apache.cassandra.io.util.LengthAvailableInputStream;
import org.apache.cassandra.io.util.SequentialWriter;
import org.apache.cassandra.service.CacheService;
import org.apache.cassandra.service.StorageService;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.cassandra.utils.Pair;
public class AutoSavingCache<K extends CacheKey, V> extends InstrumentingCache<K, V>
{
private static final Logger logger = LoggerFactory.getLogger(AutoSavingCache.class);
/** True if a cache flush is currently executing: only one may execute at a time. */
public static final Set<CacheService.CacheType> flushInProgress = new NonBlockingHashSet<CacheService.CacheType>();
protected volatile ScheduledFuture<?> saveTask;
protected final CacheService.CacheType cacheType;
private CacheSerializer<K, V> cacheLoader;
private static final String CURRENT_VERSION = "b";
public AutoSavingCache(ICache<K, V> cache, CacheService.CacheType cacheType, CacheSerializer<K, V> cacheloader)
{
super(cacheType.toString(), cache);
this.cacheType = cacheType;
this.cacheLoader = cacheloader;
}
public File getCachePath(String ksName, String cfName, String version)
{
return DatabaseDescriptor.getSerializedCachePath(ksName, cfName, cacheType, version);
}
public Writer getWriter(int keysToSave)
{
return new Writer(keysToSave);
}
public void scheduleSaving(int savePeriodInSeconds, final int keysToSave)
{
if (saveTask != null)
{
saveTask.cancel(false); // Do not interrupt an in-progress save
saveTask = null;
}
if (savePeriodInSeconds > 0)
{
Runnable runnable = new Runnable()
{
public void run()
{
submitWrite(keysToSave);
}
};
saveTask = StorageService.optionalTasks.scheduleWithFixedDelay(runnable,
savePeriodInSeconds,
savePeriodInSeconds,
TimeUnit.SECONDS);
}
}
public int loadSaved(ColumnFamilyStore cfs)
{
int count = 0;
long start = System.currentTimeMillis();
// old cache format that only saves keys
File path = getCachePath(cfs.table.name, cfs.columnFamily, null);
if (path.exists())
{
DataInputStream in = null;
try
{
logger.info(String.format("reading saved cache %s", path));
in = new DataInputStream(new LengthAvailableInputStream(new BufferedInputStream(new FileInputStream(path)), path.length()));
Set<ByteBuffer> keys = new HashSet<ByteBuffer>();
while (in.available() > 0)
{
keys.add(ByteBufferUtil.readWithLength(in));
count++;
}
cacheLoader.load(keys, cfs);
}
catch (Exception e)
{
logger.debug(String.format("harmless error reading saved cache %s fully, keys loaded so far: %d", path.getAbsolutePath(), count), e);
return count;
}
finally
{
FileUtils.closeQuietly(in);
}
}
// modern format, allows both key and value (so key cache load can be purely sequential)
path = getCachePath(cfs.table.name, cfs.columnFamily, CURRENT_VERSION);
if (path.exists())
{
DataInputStream in = null;
try
{
logger.info(String.format("reading saved cache %s", path));
in = new DataInputStream(new LengthAvailableInputStream(new BufferedInputStream(new FileInputStream(path)), path.length()));
List<Future<Pair<K, V>>> futures = new ArrayList<Future<Pair<K, V>>>();
while (in.available() > 0)
{
Future<Pair<K, V>> entry = cacheLoader.deserialize(in, cfs);
// Key cache entry can return null, if the SSTable doesn't exist.
if (entry == null)
continue;
futures.add(entry);
count++;
}
for (Future<Pair<K, V>> future : futures)
{
Pair<K, V> entry = future.get();
put(entry.left, entry.right);
}
}
catch (Exception e)
{
logger.debug(String.format("harmless error reading saved cache %s", path.getAbsolutePath()), e);
}
finally
{
FileUtils.closeQuietly(in);
}
}
if (logger.isDebugEnabled())
logger.debug(String.format("completed reading (%d ms; %d keys) saved cache %s",
System.currentTimeMillis() - start, count, path));
return count;
}
public Future<?> submitWrite(int keysToSave)
{
return CompactionManager.instance.submitCacheWrite(getWriter(keysToSave));
}
public void reduceCacheSize()
{
if (getCapacity() > 0)
{
int newCapacity = (int) (DatabaseDescriptor.getReduceCacheCapacityTo() * weightedSize());
logger.warn(String.format("Reducing %s capacity from %d to %s to reduce memory pressure",
cacheType, getCapacity(), newCapacity));
setCapacity(newCapacity);
}
}
public class Writer extends CompactionInfo.Holder
{
private final Set<K> keys;
private final CompactionInfo info;
private long keysWritten;
protected Writer(int keysToSave)
{
if (keysToSave >= getKeySet().size())
keys = getKeySet();
else
keys = hotKeySet(keysToSave);
OperationType type;
if (cacheType == CacheService.CacheType.KEY_CACHE)
type = OperationType.KEY_CACHE_SAVE;
else if (cacheType == CacheService.CacheType.ROW_CACHE)
type = OperationType.ROW_CACHE_SAVE;
else
type = OperationType.UNKNOWN;
info = new CompactionInfo(new CFMetaData(Table.SYSTEM_KS, cacheType.toString(), null, null, null),
type,
0,
keys.size(),
"keys");
}
public CacheService.CacheType cacheType()
{
return cacheType;
}
public CompactionInfo getCompactionInfo()
{
// keyset can change in size, thus total can too
return info.forProgress(keysWritten, Math.max(keysWritten, keys.size()));
}
public void saveCache()
{
logger.debug("Deleting old {} files.", cacheType);
deleteOldCacheFiles();
if (keys.isEmpty())
{
logger.debug("Skipping {} save, cache is empty.", cacheType);
return;
}
long start = System.currentTimeMillis();
HashMap<Pair<String, String>, SequentialWriter> writers = new HashMap<Pair<String, String>, SequentialWriter>();
try
{
for (K key : keys)
{
Pair<String, String> path = key.getPathInfo();
SequentialWriter writer = writers.get(path);
if (writer == null)
{
writer = tempCacheFile(path);
writers.put(path, writer);
}
try
{
cacheLoader.serialize(key, writer.stream);
}
catch (IOException e)
{
throw new FSWriteError(e, writer.getPath());
}
keysWritten++;
}
}
finally
{
for (SequentialWriter writer : writers.values())
FileUtils.closeQuietly(writer);
}
for (Map.Entry<Pair<String, String>, SequentialWriter> info : writers.entrySet())
{
Pair<String, String> path = info.getKey();
SequentialWriter writer = info.getValue();
File tmpFile = new File(writer.getPath());
File cacheFile = getCachePath(path.left, path.right, CURRENT_VERSION);
cacheFile.delete(); // ignore error if it didn't exist
if (!tmpFile.renameTo(cacheFile))
logger.error("Unable to rename " + tmpFile + " to " + cacheFile);
}
logger.info(String.format("Saved %s (%d items) in %d ms", cacheType, keys.size(), System.currentTimeMillis() - start));
}
private SequentialWriter tempCacheFile(Pair<String, String> pathInfo)
{
File path = getCachePath(pathInfo.left, pathInfo.right, CURRENT_VERSION);
File tmpFile = FileUtils.createTempFile(path.getName(), null, path.getParentFile());
return SequentialWriter.open(tmpFile, true);
}
private void deleteOldCacheFiles()
{
File savedCachesDir = new File(DatabaseDescriptor.getSavedCachesLocation());
if (savedCachesDir.exists() && savedCachesDir.isDirectory())
{
for (File file : savedCachesDir.listFiles())
{
if (file.isFile() && file.getName().endsWith(cacheType.toString()))
{
if (!file.delete())
logger.warn("Failed to delete {}", file.getAbsolutePath());
}
if (file.isFile() && file.getName().endsWith(CURRENT_VERSION + ".db"))
{
if (!file.delete())
logger.warn("Failed to delete {}", file.getAbsolutePath());
}
}
}
}
}
public interface CacheSerializer<K extends CacheKey, V>
{
void serialize(K key, DataOutput out) throws IOException;
Future<Pair<K, V>> deserialize(DataInputStream in, ColumnFamilyStore cfs) throws IOException;
@Deprecated
void load(Set<ByteBuffer> buffer, ColumnFamilyStore cfs);
}
}
2. How key cache entries are produced and used
The data in the key cache comes mainly from reads, or from loading the persisted key cache data when the system starts.
On a read that filters by rowkey, SSTableReader is asked for the position of that rowkey. If the key cache is not null, the position is looked up in the key cache; otherwise the regular indexinfo lookup is performed and the result is written back into the key cache. There is no need to worry that keys absent from this sstable would trigger large numbers of repeated index lookups: the bloom filter already filters out well over ninety percent of such cases.
public void cacheKey(DecoratedKey key, RowIndexEntry info)
{
CFMetaData.Caching caching = metadata.getCaching();
if (caching == CFMetaData.Caching.NONE
|| caching == CFMetaData.Caching.ROWS_ONLY
|| keyCache == null
|| keyCache.getCapacity() == 0)
{
return;
}
KeyCacheKey cacheKey = new KeyCacheKey(descriptor, key.key);
logger.trace("Adding cache entry for {} -> {}", cacheKey, info);
keyCache.put(cacheKey, info);
}
public RowIndexEntry getCachedPosition(DecoratedKey key, boolean updateStats)
{
return getCachedPosition(new KeyCacheKey(descriptor, key.key), updateStats);
}
private RowIndexEntry getCachedPosition(KeyCacheKey unifiedKey, boolean updateStats)
{
if (keyCache != null && keyCache.getCapacity() > 0)
return updateStats ? keyCache.get(unifiedKey) : keyCache.getInternal(unifiedKey);
return null;
}
/**
* Get position updating key cache and stats.
* @see #getPosition(org.apache.cassandra.db.RowPosition, org.apache.cassandra.io.sstable.SSTableReader.Operator, boolean)
*/
public RowIndexEntry getPosition(RowPosition key, Operator op)
{
return getPosition(key, op, true);
}
/**
* @param key The key to apply as the rhs to the given Operator. A 'fake' key is allowed to
* allow key selection by token bounds but only if op != EQ
* @param op The Operator defining matching keys: the nearest key to the target matching the operator wins.
* @param updateCacheAndStats true if updating stats and cache
* @return The index entry corresponding to the key, or null if the key is not present
*/
public RowIndexEntry getPosition(RowPosition key, Operator op, boolean updateCacheAndStats)
{
// first, check bloom filter
if (op == Operator.EQ)
{
assert key instanceof DecoratedKey; // EQ only make sense if the key is a valid row key
if (!bf.isPresent(((DecoratedKey)key).key))
{
logger.debug("Bloom filter allows skipping sstable {}", descriptor.generation);
return null;
}
}
// next, the key cache (only make sense for valid row key)
if ((op == Operator.EQ || op == Operator.GE) && (key instanceof DecoratedKey))
{
DecoratedKey decoratedKey = (DecoratedKey)key;
KeyCacheKey cacheKey = new KeyCacheKey(descriptor, decoratedKey.key);
RowIndexEntry cachedPosition = getCachedPosition(cacheKey, updateCacheAndStats);
if (cachedPosition != null)
{
logger.trace("Cache hit for {} -> {}", cacheKey, cachedPosition);
Tracing.trace("Key cache hit for sstable {}", descriptor.generation);
return cachedPosition;
}
}
// next, see if the sampled index says it's impossible for the key to be present
long sampledPosition = getIndexScanPosition(key);
if (sampledPosition == -1)
{
if (op == Operator.EQ && updateCacheAndStats)
bloomFilterTracker.addFalsePositive();
// we matched the -1th position: if the operator might match forward, we'll start at the first
// position. We however need to return the correct index entry for that first position.
if (op.apply(1) >= 0)
{
sampledPosition = 0;
}
else
{
Tracing.trace("Index sample allows skipping sstable {}", descriptor.generation);
return null;
}
}
// scan the on-disk index, starting at the nearest sampled position.
// The check against IndexInterval is to be able to exit the loop in the EQ case when the key looked for is not present
// (bloom filter false positive). But note that for non-EQ cases, we might need to check the first key of the
// next index position because the searched key can be greater than the last key of the index interval checked if it
// is lesser than the first key of next interval (and in that case we must return the position of the first key
// of the next interval).
int i = 0;
Iterator<FileDataInput> segments = ifile.iterator(sampledPosition, INDEX_FILE_BUFFER_BYTES);
while (segments.hasNext() && i <= DatabaseDescriptor.getIndexInterval())
{
FileDataInput in = segments.next();
try
{
while (!in.isEOF() && i <= DatabaseDescriptor.getIndexInterval())
{
i++;
ByteBuffer indexKey = ByteBufferUtil.readWithShortLength(in);
boolean opSatisfied; // did we find an appropriate position for the op requested
boolean exactMatch; // is the current position an exact match for the key, suitable for caching
// Compare raw keys if possible for performance, otherwise compare decorated keys.
if (op == Operator.EQ)
{
opSatisfied = exactMatch = indexKey.equals(((DecoratedKey) key).key);
}
else
{
DecoratedKey indexDecoratedKey = decodeKey(partitioner, descriptor, indexKey);
int comparison = indexDecoratedKey.compareTo(key);
int v = op.apply(comparison);
opSatisfied = (v == 0);
exactMatch = (comparison == 0);
if (v < 0)
{
Tracing.trace("Partition index lookup allows skipping sstable {}", descriptor.generation);
return null;
}
}
if (opSatisfied)
{
// read data position from index entry
RowIndexEntry indexEntry = RowIndexEntry.serializer.deserialize(in, descriptor.version);
if (exactMatch && updateCacheAndStats)
{
assert key instanceof DecoratedKey; // key can be == to the index key only if it's a true row key
DecoratedKey decoratedKey = (DecoratedKey)key;
if (logger.isTraceEnabled())
{
// expensive sanity check! see CASSANDRA-4687
FileDataInput fdi = dfile.getSegment(indexEntry.position);
DecoratedKey keyInDisk = SSTableReader.decodeKey(partitioner, descriptor, ByteBufferUtil.readWithShortLength(fdi));
if (!keyInDisk.equals(key))
throw new AssertionError(String.format("%s != %s in %s", keyInDisk, key, fdi.getPath()));
fdi.close();
}
// store exact match for the key
cacheKey(decoratedKey, indexEntry);
}
if (op == Operator.EQ && updateCacheAndStats)
bloomFilterTracker.addTruePositive();
Tracing.trace("Partition index lookup complete for sstable {}", descriptor.generation);
return indexEntry;
}
RowIndexEntry.serializer.skip(in, descriptor.version);
}
}
catch (IOException e)
{
markSuspect();
throw new CorruptSSTableException(e, in.getPath());
}
finally
{
FileUtils.closeQuietly(in);
}
}
if (op == Operator.EQ && updateCacheAndStats)
bloomFilterTracker.addFalsePositive();
Tracing.trace("Partition index lookup complete (bloom filter false positive) for sstable {}", descriptor.generation);
return null;
}
3. Key cache persistence
This is handled mainly by the Writer inner class of AutoSavingCache.
Persisting the key cache is not meant to enlarge its capacity by spilling entries that no longer fit in memory to disk; it exists so that the key cache data can be loaded quickly at startup, which improves query performance right after the node comes up.
The implementation lives in AutoSavingCache: when the key cache is created, a background task is submitted that persists the cache every key_cache_save_period (4 hours).
If the key cache holds fewer entries than key_cache_keys_to_save (100), all of them are persisted; if it holds more, the hottest key_cache_keys_to_save entries, from newest to oldest, are selected and persisted.
Every time the key cache is saved, the old cache files are deleted first and new ones are generated. The files are stored in the directory given by the saved_caches_directory setting in cassandra.yaml and are named "ksName-cfName-cacheType-version.db"; the entries selected from the key cache are then serialized into the corresponding files.
Each CF has its own cache file, so a separate file is generated per CF; the key cache is then traversed entry by entry and each entry is written into the cache file of the CF (sstable) it belongs to, as sketched below.
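A simplified sketch of that per-CF save step, assuming a hypothetical file layout following the naming above (this illustrates the temp-file-then-rename pattern, it is not the Writer code):

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Illustration only: write the selected entries of one CF to a temp file, then rename it
// over the previous cache file so a half-written snapshot is never left in place.
public class CacheSaveSketch
{
    static void saveCacheFile(File savedCachesDir, String ksName, String cfName,
                              String cacheType, String version,
                              byte[] serializedEntries) throws IOException
    {
        File cacheFile = new File(savedCachesDir,
                                  ksName + "-" + cfName + "-" + cacheType + "-" + version + ".db");
        // 1. serialize the selected (hottest) entries into a temporary file
        File tmpFile = File.createTempFile(cacheFile.getName(), null, savedCachesDir);
        Files.write(tmpFile.toPath(), serializedEntries);
        // 2. replace the previous snapshot
        cacheFile.delete(); // ignore the error if it did not exist
        if (!tmpFile.renameTo(cacheFile))
            throw new IOException("Unable to rename " + tmpFile + " to " + cacheFile);
    }
}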
4. Loading the persisted key cache files
This is handled mainly by the public int loadSaved(ColumnFamilyStore cfs) method of AutoSavingCache.
When each CF builds its in-memory ColumnFamilyStore, it loads the persisted key cache files. If only the unversioned cache file exists (a ksName-cfName-cacheType.db file), that file contains just the rowkeys, so the load method takes each rowkey one by one, runs a normal lookup to obtain its RowIndexEntry, and puts the result into the key cache collection.
If a versioned cache file exists (ksName-cfName-cacheType-version.db), the corresponding KeyCacheKey and RowIndexEntry objects are deserialized from it directly (a simplified read-back of the illustrative layout from section 1 is sketched after the load() code below).
public void load(Set<ByteBuffer> buffers, ColumnFamilyStore cfs)
{
for (ByteBuffer key : buffers)
{
DecoratedKey dk = cfs.partitioner.decorateKey(key);
for (SSTableReader sstable : cfs.getSSTables())
{
RowIndexEntry entry = sstable.getPosition(dk, Operator.EQ);
if (entry != null)
keyCache.put(new KeyCacheKey(sstable.descriptor, key), entry);
}
}
}
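For the versioned format, loading is the mirror image of the write-out sketched in section 1: read the rowkey, the sstable generation and the promoted-index flag, and then, if the flag is set, the entry's position data. Again a simplified illustration with a hypothetical SavedEntry holder, not Cassandra's Pair<KeyCacheKey, RowIndexEntry> deserialization:

import java.io.DataInputStream;
import java.io.IOException;

// Read-back counterpart of the simplified entry layout sketched in section 1.
public class KeyCacheEntryReadSketch
{
    static class SavedEntry
    {
        final byte[] rowKey;
        final int sstableGeneration;
        final long dataFilePosition; // -1 when the sstable had no promoted index data

        SavedEntry(byte[] rowKey, int sstableGeneration, long dataFilePosition)
        {
            this.rowKey = rowKey;
            this.sstableGeneration = sstableGeneration;
            this.dataFilePosition = dataFilePosition;
        }
    }

    static SavedEntry readEntry(DataInputStream in) throws IOException
    {
        int keyLength = in.readInt();                             // rowkey length
        byte[] rowKey = new byte[keyLength];
        in.readFully(rowKey);                                     // rowkey bytes
        int generation = in.readInt();                            // which sstable the entry belongs to
        boolean hasPromotedIndexes = in.readBoolean();            // promoted-index flag
        long position = hasPromotedIndexes ? in.readLong() : -1L; // stand-in for RowIndexEntry deserialization
        return new SavedEntry(rowKey, generation, position);
    }
}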