MongoDB源码:FieldRef

相关的类

  • ElementPath
  • PathMatchExpression

在阅读 MatchExpression 时,出现很多的 FieldRef 的使用,因此,决定先了解 FieldRef 的实现。

简要概要

在 mongo 的源码中,FieldRef 的实现中有如下的注释:

A FieldPath represents a path in a document, starting from the root. The path
is made of "field parts" separated by dots. The class provides an efficient means to
"split" the dotted fields in its parts, but no validation is done.

Any field part may be replaced, after the "original" field reference was parsed. Any
part can be accessed through a StringData object.

The class is not thread safe.

其中,FieldRef 用于表示文档中的路径(注释中写的是 FieldPath,可能是写错了),如 {"a": {"b": 3}} 中的 "a.b",它有两个part,分别是 abFieldRef 支持高效的完成如下工作:

  • 根据 . 拆分路径;
  • 支持替换路径的part的功能。

在单元测试中能够佐证这些用途,如:

TEST(Normal, MulitplePartsVariable) {
    const char* parts[] = {"a", "b", "c", "d", "e"};
    size_t size = sizeof(parts) / sizeof(char*);
    std::string field = "a.b.c.d.e";

    FieldRef fieldRef(field);
    ASSERT_EQUALS(fieldRef.numParts(), size);
    for (size_t i = 0; i < size; i++) { // the field is splitted by dot
        ASSERT_EQUALS(fieldRef.getPart(i), parts[i]);
    }
    ASSERT_EQUALS(fieldRef.dottedField(), field);
}
TEST(Replacement, InMultipleField) {
    std::string field = "a.b.c.$.e";
    FieldRef fieldRef(field);
    ASSERT_EQUALS(fieldRef.numParts(), 5U);
    ASSERT_EQUALS(fieldRef.getPart(3), "$");

    std::string newField = "d";
    fieldRef.setPart(3, newField);
    ASSERT_EQUALS(fieldRef.numParts(), 5U);
    ASSERT_EQUALS(fieldRef.getPart(3), newField);
    ASSERT_EQUALS(fieldRef.dottedField(), "a.b.c.d.e");
}

成员变量

在了解了 FieldRef 的大致功能后,我们看一下 FieldRef 使用了那些成员变量,存储哪些信息。

class FieldRef {
private:
    // Number of field parts in the cached dotted name (_dotted).
    mutable FieldIndex _cachedSize = 0u;

    // Field components. Each component is either a StringView backed by the
    // _dotted string or boost::none to indicate that getPart() should read the string from the
    // _replacements list.
    mutable boost::container::small_vector, kFewDottedFieldParts>
        _parts;

    /**
     * Cached copy of the complete dotted name string. The StringView objects in "_parts" reference
     * this string.
     */
    mutable std::string _dotted;

    /**
     * String storage for path parts that have been replaced with setPart() or added with
     * appendPart() since the lasted time "_dotted" was materialized.
     */
    mutable std::vector _replacements;
};
  • _cachedSize 存储 Field 中有多少个 part,至于为什么以 cached 命名,我们留到后面去观察。FieldIndex 相关的定义为:using FieldIndex = BSONDepthIndex; 其中,BSONDepthIndexstd::uint8_t 的别名。这意味着文档的路径part数量是小于等于 256 的;

  • _parts 是一个small_vector,存储文档路径具体的part,small_vector的元素为 StringView 或者 boost::none,当元素为 boost::none 时,本应该存储的该 part 在 _replacements 对应的下标下;

    • small_vector 针对仅包含少量元素的 vector 做了特殊的优化,与 small string optimization 类似,可以参考这里 和 这里,被称之为 small buffer optimization
    • kFewDottedFieldParts 为 4,是一个基于经验配置的值,意味着大多数文档的路径长度小于等于 4
  • _dotted 是一个字符串,存储该对象表示的路径的字符串表示;

  • _replacements 是一个元素为字符串的 vector,用于存储对路径的修改。

上述的几个字段是存在冗余信息的,那么我们针对这些字段之间的关系,分析这样设计背后的逻辑。

字段之间的关系

为了了解这些字段之间的关系,因此需要了解与这些字段相关的函数。这些函数有:

  • size_t appendParsedPart(StringView part)
  • void appendPart(StringData part)
  • void setPart(FieldIndex i, StringData part)
  • void reserialize()

appendParsedPart

appendParsedPart 的实现是怎样的?

size_t FieldRef::appendParsedPart(FieldRef::StringView part) {
    _parts.push_back(part);
    _cachedSize++;
    return _parts.size();
}

该函数输入一个 StringView,将其 push_back_parts 内,并同时更新 _cachedSize。该函数是私有的,仅由 parse 函数调用,parse 函数的实现如下:

void FieldRef::parse(StringData path) {
    clear(); // ...; _cachedSize = 0; ...;

    if (path.size() == 0) {
        return;
    }

    // We guarantee that accesses through getPart() will be valid while 'this' is. So we
    // keep a copy in a local sting.

    _dotted = path.toString();

    // Separate the field parts using '.' as a delimiter.
    std::string::iterator beg = _dotted.begin();
    std::string::iterator cur = beg;
    const std::string::iterator end = _dotted.end();
    while (true) {
        if (cur != end && *cur != '.') {
            cur++;
            continue;
        }
        // 此时 beg 和 cur 表示的区间对应的字符串不包含 '.'

        // If cur != beg then we advanced cur in the loop above, so we have a real sequence
        // of characters to add as a new part. Otherwise, we may be parsing something odd,
        // like "..", and we need to add an empty StringData piece to represent the "part"
        // in-between the dots. This also handles the case where 'beg' and 'cur' are both
        // at 'end', which can happen if we are parsing anything with a terminal "."
        // character. In that case, we still need to add an empty part, but we will break
        // out of the loop below since we will not execute the guarded 'continue' and will
        // instead reach the break statement.

        if (cur != beg) {
            size_t offset = beg - _dotted.begin();
            size_t len = cur - beg;
            appendParsedPart(StringView{offset, len});
        } else {
            appendParsedPart(StringView{});
        }

        if (cur != end) {
            beg = ++cur;
            continue;
        }

        break;
    }
}

parse 函数基于 . 拆分 _dotted 字符串,拆分的结果使用 StringView 存储,StringView 的实现如下。StringView 不会保存字符串的地址,仅包含 offsetlen,这样可以减少一个 sizeof(char*) 的存储空间。

struct StringView {
    // Constructs an empty StringView.
    StringView() = default;

    StringView(std::size_t offset, std::size_t len) : offset(offset), len(len){};

    StringData toStringData(const std::string& viewInto) const {
        return {viewInto.c_str() + offset, len};
    };

    std::size_t offset = 0;
    std::size_t len = 0;
};

由此,在调用 parse 函数后,满足一些关系:

  • _parts_cachedSize 是一致的;
  • _dottedparts 是一致的;
  • _replacements 长度为 0;

appendPart

为了了解 appendPart 的实现,我们先看该函数的代码:

void FieldRef::appendPart(StringData part) {
    if (_replacements.empty()) {
        _replacements.resize(_parts.size());
    }

    _replacements.push_back(part.toString());
    _parts.push_back(boost::none);
}

_replacements 为空是可能的(在 parse 调用后),此时对 _replacements 调用 resize 函数,使其大小与 _parts 的大小一致,但是并没有为 _replacements 内的各个元素赋值,因此其内部的这些元素皆为空字符串。

因此,通过调用 appendPart,满足:

  • _replacements_parts 的长度一致;
  • _parts 某下标对应的值为 boost::none 时,_replacements 对应下标的元素为有效的值;
  • _cachedSize_parts 的长度不一致,但是与 _dotted 对应的 Part 数量是一致的,这可能是 cached 这个词的来源;

setPart

setPart 的代码如下:

void FieldRef::setPart(FieldIndex i, StringData part) {
    dassert(i < _parts.size());

    if (_replacements.empty()) {
        _replacements.resize(_parts.size());
    }

    _replacements[i] = part.toString();
    _parts[i] = boost::none;
}

appendPart 类似的,同步 resize _replacements 的大小。此外,修改 FieldRef 中的某一个 Part,会将 _replacemnets 对应的下标赋值成修改后的值,而将原 _parts 对应下标的值设置成 boost::none,表示此时以 _replacements 中的值为准。

因此,通过调用 setPart,同样满足:

  • _replacements_parts 的长度一致;
  • _parts 某下标对应的值为 boost::none 时,_replacements 对应下标的元素为有效的值;

reserialize

reserialize 的实现如下:

void FieldRef::reserialize() const {
    auto parts = _parts.size();
    std::string nextDotted;
    // Reserve some space in the string. We know we will have, at minimum, a character for
    // each component we are writing, and a dot for each component, less one. We don't want
    // to reserve more, since we don't want to forfeit the SSO if it is applicable.
    nextDotted.reserve((parts > 0) ? (parts * 2) - 1 : 0);

    // Concatenate the fields to a new string
    for (size_t i = 0; i != _parts.size(); ++i) {
        if (i > 0)
            nextDotted.append(1, '.');
        const StringData part = getPart(i);
        nextDotted.append(part.rawData(), part.size());
    }

    // Make the new string our contents
    _dotted.swap(nextDotted);

    // Before we reserialize, it's possible that _cachedSize != _size because parts were added or
    // removed. This reserialization process reconciles the components in our cached string
    // (_dotted) with the modified path.
    _cachedSize = parts;

    // Fixup the parts to refer to the new string
    std::string::const_iterator where = _dotted.begin();
    const std::string::const_iterator end = _dotted.end();
    for (size_t i = 0; i != parts; ++i) {
        boost::optional& part = _parts[i];
        const size_t size = part ? part->len : _replacements[i].size();

        // There is one case where we expect to see the "where" iterator to be at "end" here: we
        // are at the last part of the FieldRef and that part is the empty string. In that case, we
        // need to make sure we do not dereference the "where" iterator.
        invariant(where != end || (size == 0 && i == parts - 1));
        if (!size) {
            part = StringView{};
        } else {
            std::size_t offset = where - _dotted.begin();
            part = StringView{offset, size};
        }
        where += size;
        // skip over '.' unless we are at the end.
        if (where != end) {
            dassert(*where == '.');
            ++where;
        }
    }

    // Drop any replacements
    _replacements.clear();
}

该函数重新计算了 _parts_dotted,并将 _cachedSize 修改成对应的数值,然后将 _replacements 的元素清空。

因此,总的来说,我们可以得到如下规律:

  • parseStringData 中解析,生成 _parts_dotted_cachedSize
  • 通过 appendPart/setPart 修改该路径对象,会同步影响 _parts (boost::none)和 _replacements
  • reserialize 将修改重新应用到 _parts_dotted_cachedSize 上,并清空 _replacements

所以,_replacements 类似于一个缓冲区,用于缓存修改。

总结

这样设计有如下优点:

  • 利用内部类 StringView 减少额外空间的占用;

    • 考虑到如果使用 StringView 配合 _dotted 来优化存储空间,无法支持 in-place 对 _parts 做修改(因为其元素类型为 StringView,不包含字符指针,无法获取数据),所以使用 _replacemnets 辅助修改;而且修改的场景可能不是很多,因此引入 vector 的开销没有那么显著
  • 考虑到大部分的 FieldRef 的长度不超过 4,因此 _parts 使用 small_vector;

  • 缓存 _dotted 对于需要返回整个字符串形式的路径,效率的提升很大,如果没有缓存会带来大量的构造字符串的开销;

你可能感兴趣的:(MongoDB源码:FieldRef)