使用现代C++如何避免bugs(上)
How to avoid bugs using modern C++
C++的主要问题之一是拥有大量的构造,它们的行为是未定义的,或者程序员只是意想不到的。在各种项目中使用静态分析器时,我们经常会遇到这些问题。但是,众所周知,最好的方法是在编译阶段检测错误。让我们看看现代C++中的哪些技术不仅帮助编写简单明了的代码,而且使它更安全、更可靠。
什么是现代C++?
在C++ 11发布后,现代C++这个术语变得非常流行。这是什么意思?首先,现代C++是一组模式和习语,旨在消除好的旧的“C类”,很多C++程序员都习惯了,特别是如果他们开始在C++编程11看起来更简洁易懂,这是非常重要的。
当人们谈论现代C++时,人们通常会想到什么?并行性、编译时计算、RAII、lambdas、ranges、concepts、modules和标准库的其他同等重要的组件(例如,用于文件系统的API)。这些都是非常酷的现代化,我们期待着在下一套标准中看到它们。不过,我想提醒大家注意新标准允许编写更安全的代码的方式。当开发静态分析器时,我们看到了大量的不同的错误,有时我们会忍不住想:“但是在现代C++中,这是可以避免的”。因此,我建议我们检查PVS Studio在各种开源项目中发现的几个错误。另外,我们还将看看如何修复它们。
Automatic type inference
In C++, the keywords auto and decltype were added. Of course, you already know how they work.
在C++中,增加了关键字auto和decltype。当然,你已经知道它们是怎么工作的了。
std::map
It’s very convenient to shorten long types, without losing the
readability of the code. However, these keywords become quite expansive, together with templates: there is no need to specify the type of the returning value with auto and decltype.
在不损失代码可读性的情况下,缩短长类型非常方便。但是,这些关键字和模板扩展性很强:不需要使用auto和decltype指定返回值的类型。
But let’s go back to our topic. Here is an example of a 64-bit error:
string str = …;
unsigned n = str.find(“ABC”);
if (n != string::npos)
In a 64-bit application, the value of string::npos is greater than the maximum value of UINT_MAX, which can be represented by a variable of unsigned type. It could seem that this is a case where auto can save us from this kind of problem: the type of the n variable isn’t important to us, the main thing is that it can accommodate all possible values of string::find. And indeed, if we rewrite this example with auto,
the error is gone:
在64位应用程序中,string::npos的值大于UINT_MAX的最大值,该值可以由无符号类型的变量表示。这似乎是一个auto可以将我们从这类问题中拯救出来的例子:n变量的类型对我们来说并不重要,主要的是它可以容纳string::find的所有可能值。实际上,如果我们用auto重写这个示例,错误就消失了:
str = …;auto n = str.find(“ABC”);if (n != string::npos)
But not everything is as simple. Using auto is not a panacea, and there are many pitfalls related to its use. For example, you can write the code like this:
但并不是所有事情都那么简单。使用auto并不是万能的,它的使用也存在许多陷阱。例如,您可以编写如下代码:
auto n = 1024 * 1024 * 1024 * 5;
char* buf = new char[n];
Auto won’t save us from the integer overflow and there
will be less memory allocated for the buffer than 5GiB.
Auto also isn’t of any great help when it comes to
a very common error: an incorrectly written loop. Let’s look at an example:
Auto不会将我们从整数溢出中拯救出来,而且为缓冲区分配的内存将少于5GiB。
当遇到一个非常常见的错误:一个写得不正确的循环时,Auto也没有任何帮助。让我们看一个例子:
std::vector bigVector;
for (unsigned i = 0; i < bigVector.size(); ++i)
{ … }
For large size arrays, this loop becomes an infinity loop. It’s no surprise that there are such errors in the code: they reveal themselves in very rare cases, for which there were no tests.
Can we rewrite this fragment with auto?
对于大型数组,此循环变为无限循环。代码中存在这样的错误并不奇怪:它们在非常罕见的情况下会暴露自己,因为没有测试。
我们能用auto重写这个片段吗?
std::vector bigVector;
for (auto i = 0; i < bigVector.size(); ++i)
{ … }
No. Not only is the error is still here. It has become even worse.
With simple types auto behaves very badly. Yes, in the simplest cases (auto x = y) it works, but as soon as there are additional constructions, the behavior can become more unpredictable. What’s worse, the error will be more difficult to notice, because the types of variables aren’t that obvious at first glance. Fortunately it is not a problem for static analyzers: they don’t get tired, and don’t lose attention. But for us, as simple mortals it’s better to specify the types explicitly. We can also
get rid of the narrowing casting using other methods, but we’ll speak about that later.
不,不仅错误还在这里。情况变得更糟了。
对于简单类型,auto的行为非常糟糕。是的,在最简单的情况下(auto x=y),它可以工作,但是一旦有了额外的构造,行为就会变得更加不可预测。更糟糕的是,这个错误更难被注意到,因为变量的类型乍一看就不那么明显。幸运的是,对于静态分析器来说这不是一个问题:它们不会感到疲倦,也不会失去注意力。但对我们来说,简单来说,最好显式地指定类型。我们也可以使用其他方法来消除缩小范围的情况,但稍后我们将讨论这个问题。
Dangerous count of
One of the “dangerous” types in C++ is an array. Often when passing it to the function, programmers forget that it is passed as a pointer, and try to calculate the number of elements with sizeof.
危险计数
C++中的一种“危险”类型是数组。通常在将它传递给函数时,程序员会忘记它是作为指针传递的,并尝试使用sizeof计算元素的数量。
NUMBER_OF_V1(A) (sizeof(A)/sizeof((A)[0]))
#define _ARRAYSIZE(A) RTL_NUMBER_OF_V1(A)
int GetAllNeighbors( const CCoreDispInfo *pDisp, int iNeighbors[512] )
{
…
if ( nNeighbors < _ARRAYSIZE( iNeighbors ) )
iNeighbors[nNeighbors++] = pCorner->m_Neighbors[i];
…
}
Note: This code is taken from the Source Engine SDK.
PVS-Studio warning: V511 The sizeof() operator returns size of the pointer, and not of the array, in ‘sizeof (iNeighbors)’ expression. Vrad_dll disp_vrad.cpp 60
Such confusion can arise because of specifying the size of an array in the argument: this number means nothing to the compiler, and is just a
hint to the programmer.
The trouble is that this code gets compiled, and the programmer is
unaware that something is not right. The obvious solution would be to use metaprogramming:
注意:此代码取自源引擎SDK。
VS Studio警告:V511 size of()运算符返回“sizeof(iNeighbors)”表达式中指针的大小,而不是数组的大小。虚拟现实动态链接库Vrad_dll disp_vrad.cpp 60个
这种混淆可能是由于在参数中指定数组的大小而引起的:这个数字对编译器来说没有任何意义,对程序员来说只是一个警示。
问题是这段代码被编译了,而程序员却不知道有些事情不对。显而易见的解决方案是使用元编程:
template < class T, size_t N >
constexpr size_t countof( const T (&array)[N] )
{ return N;}
countof(iNeighbors); //compile-time error
If we pass to this function, not an array, we get a compilation
error. In C ++17 you can use std::size.
In C++11, the function std::extent was added, but it isn’t suitable as countof, because it returns 0 for inappropriate types.
如果我们传递给这个函数,而不是一个数组,就会得到一个编译错误。在C++17中,可以使用std::size。
在C++ 11中,函数std::extent被添加,但是它不匹配作为countof,因为它返回不匹配类型的0。
std::extent
You can make an error not only with countof, but
with sizeof as well.
VisitedLinkMaster::TableBuilder::TableBuilder( VisitedLinkMaster* master, const uint8 salt[LINK_SALT_LENGTH]) : master_(master), success_(true)
{
fingerprints_.reserve(4096);
memcpy(salt_, salt, sizeof(salt));
}
Note: This code is taken from Chromium.
PVS-Studio warnings:
V511 The sizeof() operator returns size of the pointer, and not of the array, in ‘sizeof (salt)’ expression. browser visitedlink_master.cc 968
V512 A call of the ‘memcpy’ function will lead to underflow of the buffer ‘salt_’. browser visitedlink_master.cc 968
As you can see, the standard C++ arrays have a lot of problems.
This is why you should use std::array: in the modern C++ its API is
similar to std::vector and other containers, and it’s harder
to make an error when using it.
注:此代码取自Chromium。
PVS-Studio警告:
•V511 sizeof()运算符返回“sizeof(salt)”表达式中指针的大小,而不是数组的大小。浏览器visitedlink_master.cc 968。
•V512调用“memcpy”函数将导致缓冲区“salt_”下溢。浏览器visitedlink_master.cc 968。正如你所看到的,标准的C++阵列有很多问题。这就是为什么你应该使用STD::SART:在现代C++中,它的API类似于STD::vector和其他容器,使用它时出错更难。
Foo(std::array
How to make a mistake in a simple for
One more source of errors is a simple for loop.
You may think, “Where can you make a mistake there? Is it something
connected with the complex exit condition or saving on the lines of code?” No, programmers make error in the simplest loops. Let’s take a look at the fragments from the projects:
lWindow::kBaudrates[] = { 50, 75, 110, … }; SerialWindow::SerialWindow() :
…{ … for(int i = sizeof(kBaudrates) / sizeof(char*); --i >= 0;)
{ message->AddInt32(“baudrate”, kBaudrateConstants[i]); … }
}
Note: This code is taken from Haiku Operation System.
PVS-Studio warning: V706 Suspicious division: sizeof (kBaudrates) / sizeof (char *). Size of every element in ‘kBaudrates’ array does not equal to divisor. SerialWindow.cpp 162
We have examined such errors in detail in the previous chapter:
the array size wasn’t evaluated correctly again. We can easily fix it by
using std::size:
const int SerialWindow::kBaudrates[] = { 50, 75, 110, … }; SerialWindow::SerialWindow() : …{ … for(int i = std::size(kBaudrates); --i >= 0;) { message->AddInt32(“baudrate”, kBaudrateConstants[i]); … }}
But there is a better way. Let’s take a look at one more fragment.
inline void CXmlReader::CXmlInputStream::UnsafePutCharsBack( const TCHAR* pChars, size_t nNumChars){ if (nNumChars > 0) { for (size_t nCharPos = nNumChars - 1; nCharPos >= 0; --nCharPos) UnsafePutCharBack(pChars[nCharPos]); }}
Note: This code is taken from Shareaza.
PVS-Studio warning: V547 Expression
‘nCharPos >= 0’ is always true. Unsigned type value is always >= 0.
BugTrap xmlreader.h 946
It’s a typical error when writing a reverse loop: the programmer
forgot that the iterator of an unsigned type and the check always return true.
You might think, “How come? Only novices and students make such mistakes.
We, professionals, don’t.” Unfortunately, this is not completely true. Of
course, everyone understands that (unsigned >= 0)- true.
Where do such errors come from? They often occur as a result of refactoring.
Imagine this situation: the project migrates from the 32-bit platform to
64-bit. Previously, int/unsigned was used for indexing and a
decision was made to replace them with size_t/ptrdiff_t. But in one
fragment they accidentally used an unsigned type instead of a signed one.
What shall we do to avoid this situation in your code? Some people
advise the use of signed types, as in C# or Qt. Perhaps, it could be a way out,
but if we want to work with large amounts of data, then there is no way to
avoid size_t.Is there any more secure way to iterate through array
in C++? Of course there is. Let’s start with the simplest one: non-member
functions. There are standard functions to work with collections, arrays
and initializer_list; their principle should be familiar to
you.
buf[4] = { ‘a’, ‘b’, ‘c’, ‘d’ };for (auto it = rbegin(buf); it != rend(buf); ++it) { std::cout << *it;}
Great, now we do not need to remember the difference between a
direct and reverse cycle. There is also no need to think about whether we use a simple array or an array - the loop will work in any case. Using iterators is a great way to avoid headaches, but even that is not always good enough. It is best to use the range-based for loop:
buf[4] = { ‘a’, ‘b’, ‘c’, ‘d’ };for (auto it : buf) { std::cout << it;}
Of course, there are some flaws in the range-based for: it doesn’t allow flexible management of the loop, and if there is more complex work with indexes required, then for won’t be of much help to us. But such situations should be examined separately. We have quite a simple situation: we have to move along the items in the reverse order.
However, at this stage, there are already difficulties. There are no additional classes in the standard library for range-based for. Let’s see how it could be implemented:
template reversed_wrapper { const T& _v; reversed_wrapper (const T& v) : _v(v) {} auto begin() -> decltype(rbegin(_v)) { return rbegin(_v); } auto end() -> decltype(rend(_v)) { return rend(_v); }}; template reversed_wrapper reversed(const T& v){ return reversed_wrapper(v);}
In C++14 you can simplify the code by removing the decltype.
You can see how auto helps you write template functions
Now we can rewrite the fragment as follows:
char buf[4] = { ‘a’, ‘b’, ‘c’, ‘d’ };for (auto it : reversed(buf)) { std::cout << it;}
What’s great about this code? Firstly, it is very easy to read. We
immediately see that the array of the elements is in the reverse order.
Secondly, it’s harder to make an error. And thirdly, it works with any type.
This is much better than what it was.
You can use boost::adaptors::reverse(arr) in boost.
But let’s go back to the original example. There, the array is passed by a pair pointer-size. It is obvious that our idea with reversed will not work for it. What shall we do? Use classes like span/array_view.
In C++17 we have string_view, and I suggest using that:
void Foo(std::string_view s);std::string str = “abc”;Foo(std::string_view(“abc”, 3));Foo(“abc”);Foo(str);
string_view does not own the string, in fact it’s a wrapper around the const char* and the length. That’s why in the code example, the string is passed by value, not by the reference. A key feature of the string_view is compatibility with strings in various string presentations: const char*, std::string and non-null terminated const char*.
As a result, the function takes the following form:
inline void CXmlReader::CXmlInputStream::UnsafePutCharsBack( std::wstring_view chars){ for (wchar_t ch : reversed(chars)) UnsafePutCharBack(ch);}
Passing to the function, it’s important to remember that the constructor string_view(const char*) is implicit, that’s why we can write like this:
Foo(pChars);
Not this way:
Foo(wstring_view(pChars, nNumChars));
A string that the string_view points to, does not need to be null- terminated, the very name string_view::data gives us a hint about this, and it is necessary to keep that in mind when using it. When passing its value to a function from cstdlib, which is
waiting for a C string, you can get undefined behavior. You can easily miss it, if in most cases that you are testing, there is std::string or
null-terminated strings used.
Enum
Let’s leave C++ for a second and think about good old C. How is
security there? After all, there are no problems with implicit constructor
calls and operators, or type conversion, and there are no problems with various types of the strings. In practice, errors often occur in the simplest constructions: the most complicated ones are thoroughly reviewed and debugged, because they cause some doubts. At the same time programmers forget to check simple constructions. Here is an example of a dangerous structure, which came to us from C:
enum iscsi_param { … ISCSI_PARAM_CONN_PORT, ISCSI_PARAM_CONN_ADDRESS, …}; enum iscsi_host_param { … ISCSI_HOST_PARAM_IPADDRESS, …};int iscsi_conn_get_addr_param(…, enum iscsi_param param, …){ … switch (param) { case ISCSI_PARAM_CONN_ADDRESS: case ISCSI_HOST_PARAM_IPADDRESS: … } return len;}
An example of the Linux kernel. PVS-Studio warning: V556 The values of different enum types are compared: switch(ENUM_TYPE_A) { case ENUM_TYPE_B: … }. libiscsi.c 3501
Pay attention to the values in the switch-case: one of the named constants is taken from a different enumeration. In the original, of course, there is much more code and more possible values and the error isn’t so obvious. The reason for that is lax typing of enum - they may be implicitly casting to int, and this leaves a lot of room for errors.
In C++11 you can, and should, use enum class: such a trick won’t work there, and the error will show up at the compilation stage. As a result, the following code does not compile, which is exactly what we need:
enum class ISCSI_PARAM { … CONN_PORT, CONN_ADDRESS, …}; enum class ISCSI_HOST { … PARAM_IPADDRESS, …};int iscsi_conn_get_addr_param(…, ISCSI_PARAM param, …){ … switch (param) { case ISCSI_PARAM::CONN_ADDRESS: case ISCSI_HOST::PARAM_IPADDRESS: … } return len;}
The following fragment is not quite connected with the enum, but
has similar symptoms:
void adns__querysend_tcp(…) { … if (!(errno == EAGAIN || EWOULDBLOCK || errno == EINTR || errno == ENOSPC || errno == ENOBUFS || errno == ENOMEM)) { …}
Note: This code is taken from ReactOS.
Yes, the values of errno are declared as macros,
which is bad practice in C++ (in C as well), but even if the programmer
used enum, it wouldn’t make life easier. The lost comparison will
not reveal itself in case of enum (and especially in case of a
macro). At the same time enum class would not allow this, as
there will were no implicit casting to bool.
Initialization in the constructor
But back to the native C++ problems. One of them reveals when
there is a need to initialize the object in the same way in several
constructors. A simple situation: there is a class, two constructors, one of
them calls another. It all looks pretty logical: the common code is put into a
separate method - nobody likes to duplicate the code. What’s the pitfall?
Guess::Guess() { language_str = DEFAULT_LANGUAGE; country_str = DEFAULT_COUNTRY; encoding_str = DEFAULT_ENCODING;}Guess::Guess(const char * guess_str) { Guess(); …}
Note: This code is taken from LibreOffice.
PVS-Studio warning: V603 The
object was created but it is not being used. If you wish to call constructor,
‘this->Guess::Guess(…)’ should be used. guess.cxx 56
The pitfall is in the syntax of the constructor call. Quite often
it gets forgotten, and the programmer creates one more class instance, which
then gets immediately destroyed. That is, the initialization of the original
instance isn’t happening. Of course, there are 1001 ways to fix this. For
example, we can explicitly call the constructor via this, or
put everything into a separate function:
Guess::Guess(const char * guess_str){ this->Guess(); …} Guess::Guess(const char * guess_str){ Init(); …}
By the way, an explicit repeated call of the constructor, for
example, via this is a dangerous game, and we need to
understand what’s going on. The variant with the Init() is much better and clearer. For those who want to understand the details of these “pitfalls” better, I suggest looking at chapter 19, “How to properly call one constructor from another”, from this book.
But it is best to use the delegation of the constructors here. So we can explicitly call one constructor from another in the following way:
Guess::Guess(const char * guess_str) : Guess(){ …}
Such constructors have several limitations. First: delegated
constructors take full responsibility for the initialization of an object. That
is, it won’t be possible to initialize another class field with it in the
initialization list:
Guess::Guess(const char * guess_str) : Guess(), m_member(42){ …}
And of course, we have to make sure that the delegation doesn’t
create a loop, as it will be impossible to exit it. Unfortunately, this code
gets compiled:
Guess::Guess(const char * guess_str) : Guess(std::string(guess_str)){ …} Guess::Guess(std::string guess_str) : Guess(guess_str.c_str()){ …}