Related classes
- ElementPath
- PathMatchExpression
While reading MatchExpression I kept running into uses of FieldRef, so I decided to understand how FieldRef is implemented first.
Overview
In the mongo source code, the implementation of FieldRef carries the following comment:
A FieldPath represents a path in a document, starting from the root. The path
is made of "field parts" separated by dots. The class provides an efficient means to
"split" the dotted fields in its parts, but no validation is done. Any field part may be
replaced, after the "original" field reference was parsed. Any part can be accessed
through a StringData object. The class is not thread safe.
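Here FieldRef represents a path in a document (the comment says FieldPath, which is probably a typo). For example, in {"a": {"b": 3}} the path "a.b" has two parts, a and b. FieldRef does the following efficiently:
- splits a path on the '.' separator;
- replaces individual parts of the path.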
The unit tests confirm these uses, for example:
TEST(Normal, MulitplePartsVariable) {
    const char* parts[] = {"a", "b", "c", "d", "e"};
    size_t size = sizeof(parts) / sizeof(char*);
    std::string field = "a.b.c.d.e";
    FieldRef fieldRef(field);
    ASSERT_EQUALS(fieldRef.numParts(), size);
    for (size_t i = 0; i < size; i++) {  // the field is split on dots
        ASSERT_EQUALS(fieldRef.getPart(i), parts[i]);
    }
    ASSERT_EQUALS(fieldRef.dottedField(), field);
}

TEST(Replacement, InMultipleField) {
    std::string field = "a.b.c.$.e";
    FieldRef fieldRef(field);
    ASSERT_EQUALS(fieldRef.numParts(), 5U);
    ASSERT_EQUALS(fieldRef.getPart(3), "$");
    std::string newField = "d";
    fieldRef.setPart(3, newField);
    ASSERT_EQUALS(fieldRef.numParts(), 5U);
    ASSERT_EQUALS(fieldRef.getPart(3), newField);
    ASSERT_EQUALS(fieldRef.dottedField(), "a.b.c.d.e");
}
Member variables
Now that we have a rough idea of what FieldRef does, let's look at which member variables it uses and what information they store.
class FieldRef {
private:
    // Number of field parts in the cached dotted name (_dotted).
    mutable FieldIndex _cachedSize = 0u;

    // Field components. Each component is either a StringView backed by the
    // _dotted string or boost::none to indicate that getPart() should read the string from the
    // _replacements list.
    mutable boost::container::small_vector<boost::optional<StringView>, kFewDottedFieldParts>
        _parts;

    /**
     * Cached copy of the complete dotted name string. The StringView objects in "_parts" reference
     * this string.
     */
    mutable std::string _dotted;

    /**
     * String storage for path parts that have been replaced with setPart() or added with
     * appendPart() since the lasted time "_dotted" was materialized.
     */
    mutable std::vector<std::string> _replacements;
};
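- _cachedSize stores the number of parts in the FieldRef. Why it is named "cached" is something we will come back to later. FieldIndex is defined as: using FieldIndex = BSONDepthIndex; where BSONDepthIndex is an alias for std::uint8_t. Since a std::uint8_t can hold values up to 255, a document path can have at most 255 parts;
- _parts is a small_vector that stores the individual parts of the path. Each element is either a StringView or boost::none; when an element is boost::none, the part that belongs at that position is stored at the same index in _replacements;
  - small_vector is specially optimized for vectors that hold only a few elements. This is similar to the small string optimization and is known as the small buffer optimization;
  - kFewDottedFieldParts is 4, an empirically chosen value that reflects the expectation that most document paths have at most 4 parts;
- _dotted is a string that stores the dotted string form of the path this object represents;
- _replacements is a vector of strings that stores modifications made to the path.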
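As a quick check of the limit implied by the index type, the following minimal sketch shows what a std::uint8_t based index can represent. FieldIndexSketch, fitsInFieldIndex and toFieldIndex are hypothetical names introduced only for this illustration; the sketch says nothing about which limit MongoDB itself enforces, only what the type allows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for the FieldIndex alias quoted above.
using FieldIndexSketch = std::uint8_t;

// True when a part count can be stored in the index type without wrapping.
bool fitsInFieldIndex(std::size_t numParts) {
    return numParts <= std::numeric_limits<FieldIndexSketch>::max();
}

// Narrow a part count to the index type; the caller must stay within range.
FieldIndexSketch toFieldIndex(std::size_t numParts) {
    assert(fitsInFieldIndex(numParts));
    return static_cast<FieldIndexSketch>(numParts);
}
```

A count of 255 fits, while 256 would wrap around to 0 if it were narrowed without the check.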
These fields store redundant information, so let's look at how they relate to each other and at the reasoning behind this design.
Relationships between the fields
To understand how these fields relate, we need to look at the functions that touch them. These functions are:
size_t appendParsedPart(StringView part)
void appendPart(StringData part)
void setPart(FieldIndex i, StringData part)
void reserialize()
appendParsedPart
How is appendParsedPart implemented?
size_t FieldRef::appendParsedPart(FieldRef::StringView part) {
    _parts.push_back(part);
    _cachedSize++;
    return _parts.size();
}
The function takes a StringView, pushes it onto _parts with push_back, and updates _cachedSize at the same time. It is private and is only called by parse, which is implemented as follows:
void FieldRef::parse(StringData path) {
    clear();  // resets the state, including _cachedSize = 0

    if (path.size() == 0) {
        return;
    }

    // We guarantee that accesses through getPart() will be valid while 'this' is. So we
    // keep a copy in a local sting.
    _dotted = path.toString();

    // Separate the field parts using '.' as a delimiter.
    std::string::iterator beg = _dotted.begin();
    std::string::iterator cur = beg;
    const std::string::iterator end = _dotted.end();
    while (true) {
        if (cur != end && *cur != '.') {
            cur++;
            continue;
        }

        // At this point the range [beg, cur) contains no '.'.
        // If cur != beg then we advanced cur in the loop above, so we have a real sequence
        // of characters to add as a new part. Otherwise, we may be parsing something odd,
        // like "..", and we need to add an empty StringData piece to represent the "part"
        // in-between the dots. This also handles the case where 'beg' and 'cur' are both
        // at 'end', which can happen if we are parsing anything with a terminal "."
        // character. In that case, we still need to add an empty part, but we will break
        // out of the loop below since we will not execute the guarded 'continue' and will
        // instead reach the break statement.
        if (cur != beg) {
            size_t offset = beg - _dotted.begin();
            size_t len = cur - beg;
            appendParsedPart(StringView{offset, len});
        } else {
            appendParsedPart(StringView{});
        }

        if (cur != end) {
            beg = ++cur;
            continue;
        }

        break;
    }
}
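parse splits the _dotted string on '.' and stores each piece as a StringView, whose implementation is shown below. A StringView does not hold the address of the string, only offset and len, which saves sizeof(char*) per part compared with a view that also carried a pointer. Because the view is relative to _dotted rather than an absolute address, it also remains meaningful if the string is copied or moved.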
struct StringView {
    // Constructs an empty StringView.
    StringView() = default;

    StringView(std::size_t offset, std::size_t len) : offset(offset), len(len){};

    StringData toStringData(const std::string& viewInto) const {
        return {viewInto.c_str() + offset, len};
    };

    std::size_t offset = 0;
    std::size_t len = 0;
};
Consequently, after parse is called the following relationships hold:
- _parts and _cachedSize are consistent;
- _dotted and _parts are consistent;
- _replacements has length 0.
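The splitting behaviour described above can be reproduced outside MongoDB with a short standalone sketch. PartView and splitDotted are hypothetical names, and std::string_view stands in for StringData; the sketch only assumes the rules stated in the quoted comments: empty parts are kept, a trailing dot yields a trailing empty part, and an empty path yields no parts.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical counterpart of FieldRef::StringView: a range inside some string.
struct PartView {
    std::size_t offset = 0;
    std::size_t len = 0;

    // Resolve the range against a string with the same contents it was computed from.
    std::string_view resolve(const std::string& dotted) const {
        assert(offset + len <= dotted.size());
        return std::string_view(dotted).substr(offset, len);
    }
};

// Split 'dotted' on '.', keeping empty parts; an empty input yields no parts.
std::vector<PartView> splitDotted(const std::string& dotted) {
    std::vector<PartView> parts;
    if (dotted.empty()) {
        return parts;
    }
    std::size_t beg = 0;
    while (true) {
        const std::size_t dot = dotted.find('.', beg);
        const std::size_t end = (dot == std::string::npos) ? dotted.size() : dot;
        parts.push_back(PartView{beg, end - beg});
        if (dot == std::string::npos) {
            break;  // the last part has been recorded
        }
        beg = dot + 1;  // continue after the separator
    }
    return parts;
}
```

Because each PartView is only an offset and a length, the same views can be resolved against a copy of the original string, which is the property that makes this representation convenient when the owning object is copied.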
appendPart
To understand appendPart, let's look at its code first:
void FieldRef::appendPart(StringData part) {
    if (_replacements.empty()) {
        _replacements.resize(_parts.size());
    }

    _replacements.push_back(part.toString());
    _parts.push_back(boost::none);
}
_replacements may be empty (for example, right after parse). In that case resize is called on _replacements so that its size matches that of _parts, but no values are assigned to its elements, so they are all empty strings.
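So after appendPart is called, the following hold:
- _replacements and _parts have the same length;
- when the value at some index of _parts is boost::none, the element at the same index of _replacements holds a valid value;
- _cachedSize no longer matches the length of _parts, but it still matches the number of parts in _dotted, which is probably where the word "cached" comes from.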
setPart
The code of setPart is as follows:
void FieldRef::setPart(FieldIndex i, StringData part) {
    dassert(i < _parts.size());

    if (_replacements.empty()) {
        _replacements.resize(_parts.size());
    }

    _replacements[i] = part.toString();
    _parts[i] = boost::none;
}
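As in appendPart, _replacements is first resized to match _parts. In addition, modifying a part of a FieldRef assigns the new value to the corresponding index of _replacements and sets the corresponding index of _parts to boost::none, which indicates that the value in _replacements now takes precedence.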
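So after setPart is called, the same properties hold:
- _replacements and _parts have the same length;
- when the value at some index of _parts is boost::none, the element at the same index of _replacements holds a valid value.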
reserialize
reserialize is implemented as follows:
void FieldRef::reserialize() const {
    auto parts = _parts.size();
    std::string nextDotted;
    // Reserve some space in the string. We know we will have, at minimum, a character for
    // each component we are writing, and a dot for each component, less one. We don't want
    // to reserve more, since we don't want to forfeit the SSO if it is applicable.
    nextDotted.reserve((parts > 0) ? (parts * 2) - 1 : 0);

    // Concatenate the fields to a new string
    for (size_t i = 0; i != _parts.size(); ++i) {
        if (i > 0)
            nextDotted.append(1, '.');
        const StringData part = getPart(i);
        nextDotted.append(part.rawData(), part.size());
    }

    // Make the new string our contents
    _dotted.swap(nextDotted);

    // Before we reserialize, it's possible that _cachedSize != _size because parts were added or
    // removed. This reserialization process reconciles the components in our cached string
    // (_dotted) with the modified path.
    _cachedSize = parts;

    // Fixup the parts to refer to the new string
    std::string::const_iterator where = _dotted.begin();
    const std::string::const_iterator end = _dotted.end();
    for (size_t i = 0; i != parts; ++i) {
        boost::optional<StringView>& part = _parts[i];
        const size_t size = part ? part->len : _replacements[i].size();

        // There is one case where we expect to see the "where" iterator to be at "end" here: we
        // are at the last part of the FieldRef and that part is the empty string. In that case, we
        // need to make sure we do not dereference the "where" iterator.
        invariant(where != end || (size == 0 && i == parts - 1));
        if (!size) {
            part = StringView{};
        } else {
            std::size_t offset = where - _dotted.begin();
            part = StringView{offset, size};
        }
        where += size;
        // skip over '.' unless we are at the end.
        if (where != end) {
            dassert(*where == '.');
            ++where;
        }
    }

    // Drop any replacements
    _replacements.clear();
}
This function recomputes _parts and _dotted, sets _cachedSize to the matching value, and then clears _replacements.
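Overall, we arrive at the following pattern:
- parse parses a StringData and produces _parts, _dotted and _cachedSize;
- modifying the path object through appendPart/setPart updates _parts (with boost::none) and _replacements together;
- reserialize applies the modifications back to _parts, _dotted and _cachedSize, and clears _replacements.
So _replacements acts like a buffer that holds pending modifications.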
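The whole life cycle can be sketched as a small standalone model. This is a simplified illustration, not MongoDB's class: MiniFieldRef, View, cachedSize() and pendingReplacements() are hypothetical names, std::optional and std::pair stand in for boost::optional and StringView, and the rule that dottedField() reserializes whenever _replacements is non-empty is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified model of the bookkeeping described above.
class MiniFieldRef {
    using View = std::pair<std::size_t, std::size_t>;  // (offset, length) into _dotted

public:
    // Plays the role of parse(): split on '.' and record one view per part.
    explicit MiniFieldRef(const std::string& path) : _dotted(path) {
        if (_dotted.empty()) return;
        std::size_t beg = 0;
        while (true) {
            const std::size_t dot = _dotted.find('.', beg);
            const std::size_t end = (dot == std::string::npos) ? _dotted.size() : dot;
            _parts.push_back(View{beg, end - beg});
            ++_cachedSize;
            if (dot == std::string::npos) break;
            beg = dot + 1;
        }
    }

    std::size_t numParts() const { return _parts.size(); }
    std::size_t cachedSize() const { return _cachedSize; }
    std::size_t pendingReplacements() const { return _replacements.size(); }

    // A part is read from _dotted when its view exists, otherwise from _replacements.
    std::string getPart(std::size_t i) const {
        assert(i < _parts.size());
        if (_parts[i]) return _dotted.substr(_parts[i]->first, _parts[i]->second);
        return _replacements[i];
    }

    void setPart(std::size_t i, const std::string& part) {
        assert(i < _parts.size());
        if (_replacements.empty()) _replacements.resize(_parts.size());
        _replacements[i] = part;
        _parts[i] = std::nullopt;  // the value in _replacements now takes precedence
    }

    void appendPart(const std::string& part) {
        if (_replacements.empty()) _replacements.resize(_parts.size());
        _replacements.push_back(part);
        _parts.push_back(std::nullopt);
    }

    // Rebuild _dotted and the views, then drop the buffered edits.
    void reserialize() {
        std::string next;
        std::vector<std::optional<View>> views;
        for (std::size_t i = 0; i < _parts.size(); ++i) {
            if (i > 0) next += '.';
            const std::string part = getPart(i);
            views.push_back(View{next.size(), part.size()});
            next += part;
        }
        _dotted = std::move(next);
        _parts = std::move(views);
        _cachedSize = _parts.size();
        _replacements.clear();
    }

    // Assumption of this sketch: flush pending edits before returning the string.
    const std::string& dottedField() {
        if (!_replacements.empty()) reserialize();
        return _dotted;
    }

private:
    std::size_t _cachedSize = 0;
    std::vector<std::optional<View>> _parts;
    std::string _dotted;
    std::vector<std::string> _replacements;
};
```

Starting from "a.b.c.$.e", calling setPart(3, "d") and then appendPart("f") leaves cachedSize() at 5 while numParts() is 6; asking for dottedField() rebuilds the string and brings the two back in line.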
Summary
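This design has the following advantages:
- the inner class StringView reduces the extra space used;
  - using StringView together with _dotted to save space means that _parts cannot be modified in place (its elements are StringViews, which hold no character pointer and therefore cannot refer to new data), so _replacements is used to support modification; modifications are probably uncommon, so the cost of the extra vector is not significant;
- most FieldRefs have no more than 4 parts, so _parts uses a small_vector;
- caching _dotted is a large efficiency gain whenever the whole path is needed in string form; without the cache, there would be a lot of string construction overhead.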