Cracking C++(13): Reading at most n characters

Contents

    • 1. Purpose
    • 2. Correct usage example
    • 3. Correcting the wrong usage
      • 3.1 Wrong usage
      • 3.2 Let AddressSanitizer report the error
      • 3.3 Explanation
    • 4. Summary

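1. Purpose

While reading the meta information of a pgm image, I used the format string %2s. I did not know it well before, and after trying it I found that a small slip easily leads to an out-of-bounds write. An out-of-bounds write does not always crash the running program immediately, which makes the bug more expensive to track down. These are my notes.

2. Correct usage example

When reading the meta information of a .pgm image file, the first line consists of the two characters P5. The code to read it is: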

char magic[3];
fscanf(fp, "%2s", magic);

This reads at most 2 characters from the file handle fp and stores them in the memory buffer magic.

The input can also come from the console (stdin):

char buf[3];
scanf("%2s", buf);

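Q: It reads at most 2 characters, so why allocate 3? Is that necessary?
A: Yes, it is necessary.
Q: What if I insist on allocating only 2 characters? The program runs and does not crash.
A: An out-of-bounds write does not always crash immediately; it typically crashes only when it touches memory the process is not allowed to access (for example, an unmapped page). Enable Address Sanitizer and it will tell you that you wrote out of bounds.
Q: I don't get it. Why do fscanf and scanf "meddle"? What does the extra character have to do with fscanf and scanf?
A: Print the bytes of the resulting buffer and you will see.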

3. Correcting the wrong usage

3.1 Wrong usage

// test.c
#include <stdio.h>

int ex1()
{
    char buf[2];  // only two chars allocated
    scanf("%2s", buf); // read at most 2 characters
    printf("buf is %s\n", buf); // print the result
    return 0;
}

int main()
{
    ex1();
    return 0;
}

Compile and run:

zz@Legion-R7000P% gcc test.c 
zz@Legion-R7000P% ./a.out 
he
buf is he

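At first glance the program runs fine, and it feels like time to "call it a day and go home for dinner".

3.2 Let AddressSanitizer report the error

AddressSanitizer is typically used with the compiler options -fsanitize=address -fno-omit-frame-pointer -g. At the first out-of-bounds access at run time, the program terminates and prints the error type and its cause.

GCC is used here; Visual Studio or Xcode also work. For Visual Studio, Address Sanitizer requires VS2019 version 16.7 or later, or VS2022.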

zz@Legion-R7000P% gcc test1.c -fsanitize=address -fno-omit-frame-pointer -g
zz@Legion-R7000P% ./a.out 
he
=================================================================
==162004==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7ffc96f1fd32 at pc 0x7fe14c1d786c bp 0x7ffc96f1fbb0 sp 0x7ffc96f1f338
WRITE of size 3 at 0x7ffc96f1fd32 thread T0
    #0 0x7fe14c1d786b in scanf_common ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors_format.inc:342
    #1 0x7fe14c1d84d3 in __interceptor___isoc99_vscanf ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:1530
    #2 0x7fe14c1d85e6 in __interceptor___isoc99_scanf ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:1551
    #3 0x55bca1d712ca in ex1 /home/zz/work/lenet_c/test1.c:7
    #4 0x55bca1d71354 in main /home/zz/work/lenet_c/test1.c:14
    #5 0x7fe14bf7cd8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
    #6 0x7fe14bf7ce3f in __libc_start_main_impl ../csu/libc-start.c:392
    #7 0x55bca1d71164 in _start (/home/zz/work/lenet_c/a.out+0x1164)

Address 0x7ffc96f1fd32 is located in stack of thread T0 at offset 34 in frame
    #0 0x55bca1d71238 in ex1 /home/zz/work/lenet_c/test1.c:5

  This frame has 1 object(s):
    [32, 34) 'buf' (line 6) <== Memory access at offset 34 overflows this variable
HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork
      (longjmp and C++ exceptions *are* supported)
SUMMARY: AddressSanitizer: stack-buffer-overflow ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors_format.inc:342 in scanf_common
Shadow bytes around the buggy address:
  0x100012ddbf50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x100012ddbf60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x100012ddbf70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x100012ddbf80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x100012ddbf90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x100012ddbfa0: 00 00 f1 f1 f1 f1[02]f3 f3 f3 00 00 00 00 00 00
  0x100012ddbfb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x100012ddbfc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x100012ddbfd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x100012ddbfe0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x100012ddbff0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==162004==ABORTING


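3.3 Explanation

scanf/fscanf are not "meddling". A C string always needs one extra byte for the terminating \0. If scanf/fscanf simply wrote the n characters and nothing else, later calls such as strlen() could not find the end of the string, because there would be no \0 to mark it.

Printing the contents of buf before and after the scanf()/fscanf() call shows that once the 2 characters are read, the 3rd character is automatically set to 0.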

// test.c
#include <stdio.h>

int ex1()
{
    char buf[2];
    scanf("%2s", buf);
    printf("buf is %s\n", buf);
    return 0;
}

void print_buf(const char* buf, int len)
{
    for (int i = 0; i < len; i++)  // use len instead of a hard-coded 3
    {
        printf("buf[%d] = %d\n", i, buf[i]);
    }
}

int ex2()
{
    char buf[3] = {1, 1, 1};
    print_buf(buf, 3);
    int n = scanf("%2s", buf);
    printf("n = %d\n", n);
    printf("-----\n");
    print_buf(buf, 3);
    return 0;
}

int main()
{
    //ex1();
    ex2();
    return 0;
}

Running it gives:

zz@Legion-R7000P% gcc test.c -fsanitize=address -fno-omit-frame-pointer -g 
zz@Legion-R7000P% ./a.out 
buf[0] = 1
buf[1] = 1
buf[2] = 1
he
n = 1
-----
buf[0] = 104
buf[1] = 101
buf[2] = 0

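4. Summary

Both scanf() and fscanf() support reading at most n characters. Usually the goal is to read exactly n characters, and this saves the trouble of writing a loop by hand.

Reading at most n characters requires n+1 bytes of memory for the string that stores them; the last byte holds the \0.

  • If the buffer is only n bytes, the program may not crash, but there is no guarantee that it never will. In other words, keep Address Sanitizer enabled during development to catch this kind of overflow early.
  • If the buffer is at least n+1 bytes and n characters are read, scanf()/fscanf() sets the character at index n to \0.
  • Correct usage, once more: to read (at most) 2 characters, the code is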
    char buf[3];
    int n = scanf("%2s", buf);
    printf("buf = %s\n", buf);
