Contents
1. Dumping the file contents with the od command
2. Parsing the file contents
(1) file magic
(2) version
(3) time stamp
(4) FUNCTION tag
(5) COUNTER tag
(6) OBJECT SUMMARY tag
(7) PROGRAM SUMMARY tag
(8) file end
3. File reading functions and their call sequence
3.1 Read / write related calls
3.2 Program exit point
Appendix: gcov file format definition
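This article again uses the example from the earlier article "Linux 平台代码覆盖率测试工具 GCOV 简介" (a brief introduction to GCOV, the code coverage testing tool on Linux) to analyze the gcda/gcno file format and how these files are read and written.
1. Dumping the file contents with the od command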
# od -t x4 -w16 test.gcda
0000000 67636461 34303170 4e8eb3f0 01000000 //magic version stamp tag
0000020 00000002 00000003 eb65a768 01a10000 //length func_ident checksum tag
0000040 0000000a 0000000a 00000000 00000000 //length
0000060 00000000 00000001 00000000 00000000
0000100 00000000 00000001 00000000 a1000000 //0xa1000000 is GCOV_TAG_OBJECT_SUMMARY
0000120 00000009 00000000 00000005 00000001 //length
0000140 0000000c 00000000 0000000a 00000000
0000160 0000000a 00000000 a3000000 00000009 //0xa3000000 is GCOV_TAG_PROGRAM_SUMMARY
0000200 51924f98 00000005 00000001 0000000c //checksum number runs
0000220 00000000 0000000a 00000000 0000000a
0000240 00000000 00000000
0000250
See the od manual page for how to use the od command.
2. Parsing the file contents
(1) file magic
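0x67636461 is the file magic, that is, "gcda".
Where does 0x67636461 come from? It is simply the ASCII codes of the characters 'g', 'c', 'd', 'a' (0x67, 0x63, 0x64, 0x61) packed into one word. In other words, the ASCII codes of the characters serve as the file magic. See the explanation in the appendix.
The magic values are defined as follows.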
/* File suffixes. */
#define GCOV_DATA_SUFFIX ".gcda"
#define GCOV_NOTE_SUFFIX ".gcno"
/* File magic. Must not be palindromes. */
#define GCOV_DATA_MAGIC ((gcov_unsigned_t)0x67636461 ) /* "gcda" */
#define GCOV_NOTE_MAGIC ((gcov_unsigned_t)0x67636e6f ) /* "gcno" */
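As a quick check of the ASCII reading of the magic, the following sketch packs four characters into a 32-bit word, first character in the most significant byte. The helper name pack_magic is hypothetical and is not part of GCC.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of GCC): pack a 4-character string
   into a 32-bit word, first character in the most significant byte. */
uint32_t pack_magic(const char *text)
{
    uint32_t word = 0;

    assert(text != NULL && strlen(text) == 4);
    for (int i = 0; i < 4; i++)
        word = (word << 8) | (unsigned char)text[i];
    return word;
}
```

Under this packing, pack_magic("gcda") gives 0x67636461 and pack_magic("gcno") gives 0x67636e6f, the two values in the macros above.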
(2) version
0x34303170 is the GCOV_VERSION, that is, "401p", which corresponds to GCC 4.1.2 prerelease.
This version constant is defined in the file gcov-iov.h, as follows.
/* Generated automatically by the program `./gcov-iov'
from `4.1.2 (4 1) and p (p)'. */
#define GCOV_VERSION ((gcov_unsigned_t)0x34303170 ) /* 401p */
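However, this file is generated automatically when GCC is built. Its content is produced by the gcov-iov program, which is compiled from gcov-iov.c. We can compile that file directly in the gcc directory of the GCC source tree, for example: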
# cd /home/abo/gcc-4.1.2/gcc
# gcc -g -o gcov-iov gcov-iov.c
# ./gcov-iov 4.1.2 p
/* Generated automatically by the program `./gcov-iov'
from `4.1.2 (4 1) and p (p)'. */
#define GCOV_VERSION ((gcov_unsigned_t)0x34303170) /* 401p */
# ./gcov-iov 4.1 p
/* Generated automatically by the program `./gcov-iov'
from `4.1 (4 1) and p (p)'. */
#define GCOV_VERSION ((gcov_unsigned_t)0x34303170) /* 401p */
By the same reasoning, 0x34303170 is made up of the ASCII codes of the characters '4', '0', '1', 'p'. The 'p' stands for prerelease; see the appendix.
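The composition of the version word can be sketched as follows. This is a reconstruction from the appendix's description (one character for the major version, two for the minor version, one status character), not the gcov-iov source; the function name make_gcov_version is hypothetical, and the A-Z encoding for major versions of 10 and above is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper modelled on the appendix's description of the
   version word; a sketch, not the gcov-iov source. */
uint32_t make_gcov_version(unsigned major, unsigned minor, char status)
{
    /* Letters A-Z for major >= 10 are not handled in this sketch. */
    assert(major < 10 && minor < 100);

    uint32_t v = 0;
    v = (v << 8) | (uint32_t)('0' + major);          /* one-character major */
    v = (v << 8) | (uint32_t)('0' + minor / 10);     /* two-character minor */
    v = (v << 8) | (uint32_t)('0' + minor % 10);
    v = (v << 8) | (uint32_t)(unsigned char)status;  /* 'e', 'p' or 'r' */
    return v;
}
```

With these assumptions, make_gcov_version(4, 1, 'p') reproduces 0x34303170. The patch level (the 2 in 4.1.2) plays no part, which is consistent with both gcov-iov runs above printing the same value.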
(3) time stamp
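0x4e8eb3f0 = 1317975024 is the time stamp, read here as seconds since the Unix epoch (GMT); it will be read and discarded.
The date command can be used to check this value, as shown below. The values do not match exactly, though. This article does not investigate the reason, but note that according to the appendix the stamp need not be an absolute time stamp.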
# date -d @1317975024 +"%F %T %z"
2011-10-07 16:10:24 +0800
# date --date='2011-04-13 11:13:07' +%s
1302664387
(4) FUNCTION tag
0x01000000 is a FUNCTION tag, defined as follows.
/* The record tags. Values [1..3f] are for tags which may be in either
file. Values [41..9f] for those in the note file and [a1..ff] for
the data file. The tag value zero is used as an explicit end of
file marker -- it is not required to be present. */
# define GCOV_TAG_FUNCTION ((gcov_unsigned_t) 0x01000000 )
# define GCOV_TAG_FUNCTION_LENGTH ( 2 )
# define GCOV_TAG_BLOCKS ((gcov_unsigned_t) 0x01410000 )
# define GCOV_TAG_BLOCKS_LENGTH(NUM) (NUM)
# define GCOV_TAG_BLOCKS_NUM(LENGTH) (LENGTH)
# define GCOV_TAG_ARCS ((gcov_unsigned_t) 0x01430000 )
# define GCOV_TAG_ARCS_LENGTH(NUM) ( 1 + (NUM) * 2 )
# define GCOV_TAG_ARCS_NUM(LENGTH) (((LENGTH) - 1 ) / 2 )
# define GCOV_TAG_LINES ((gcov_unsigned_t) 0x01450000 )
# define GCOV_TAG_COUNTER_BASE ((gcov_unsigned_t) 0x01a10000 )
# define GCOV_TAG_COUNTER_LENGTH(NUM) ((NUM) * 2 )
# define GCOV_TAG_COUNTER_NUM(LENGTH) ((LENGTH) / 2 )
# define GCOV_TAG_OBJECT_SUMMARY ((gcov_unsigned_t) 0xa1000000 )
# define GCOV_TAG_PROGRAM_SUMMARY ((gcov_unsigned_t) 0xa3000000 )
# define GCOV_TAG_SUMMARY_LENGTH ( 1 + GCOV_COUNTERS_SUMMABLE * ( 2 + 3 * 2 ))
Then 0x00000002 is its length, 0x00000003 is the function identifier, and 0xeb65a768 is the checksum.
Note 1: every record header has a tag and a length; only a FUNCTION record is then followed by a function identifier and a checksum.
The FUNCTION data structure is as follows.
/* Information about a single function. This uses the trailing array
idiom. The number of counters is determined from the counter_mask
in gcov_info. We hold an array of function info, so have to
explicitly calculate the correct array stride. */
struct gcov_fn_info
{
gcov_unsigned_t ident; /* unique ident of function */
gcov_unsigned_t checksum; /* function checksum */
unsigned n_ctrs[0]; /* instrumented counters */
};
(5) COUNTER tag
0x01a10000 is a COUNTER tag, defined as above. Then 0x0000000a is its length, so the number of counters, computed by the macro GCOV_TAG_COUNTER_NUM, is 5.
Next come the 5 counters. Each counter is 2 words and each word is 4 bytes, so each counter is a 64-bit integer and the counters take 40 bytes in total. They are read by gcov_read_counter(); the calls to that function show that it reads 2 words (8 bytes) each time.
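The arithmetic in this step can be illustrated with a small sketch. The helper names are hypothetical; the low-word-first order is taken from the appendix, and counter_num simply mirrors the GCOV_TAG_COUNTER_NUM macro.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: combine two 32-bit words (low word first, as the
   appendix specifies) into one 64-bit counter value. */
uint64_t counter_from_words(const uint32_t *words)
{
    assert(words != NULL);
    return (uint64_t)words[0] | ((uint64_t)words[1] << 32);
}

/* Number of 64-bit counters in a COUNTER record whose length is given
   in words; mirrors GCOV_TAG_COUNTER_NUM. */
unsigned counter_num(unsigned length)
{
    return length / 2;
}
```

Applied to the ten data words of the dump, this gives the five counter values 10, 0, 1, 0 and 1.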
The counter structure is defined as follows.
/* Type of function used to merge counters. */
typedef void (*gcov_merge_fn) (gcov_type *, gcov_unsigned_t);
/* Information about counters. */
struct gcov_ctr_info
{
gcov_unsigned_t num; /* number of counters. */
gcov_type *values; /* their values. */
gcov_merge_fn merge; /* The function used to merge them. */
};
The structure also shows that num is followed by the values, a pointer to gcov_type, and then by the merge function pointer.
(6) OBJECT SUMMARY tag
0xa1000000 is GCOV_TAG_OBJECT_SUMMARY, and the following 0x00000009 is the length; 9 words of data follow it.
The object summary structure is as follows.
/* Cumulative counter data. */
struct gcov_ctr_summary
{
gcov_unsigned_t num; /* number of counters. */
gcov_unsigned_t runs; /* number of program runs */
gcov_type sum_all; /* sum of all counters accumulated. */
gcov_type run_max; /* maximum value on a single run. */
gcov_type sum_max; /* sum of individual run max values. */
};
/* Object & program summary record. */
struct gcov_summary
{
gcov_unsigned_t checksum; /* checksum of program */
struct gcov_ctr_summary ctrs[GCOV_COUNTERS_SUMMABLE];
};
(7) PROGRAM SUMMARY tag
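0xa3000000 is GCOV_TAG_PROGRAM_SUMMARY, and the following 0x00000009 is the length; then 0x51924f98 is the checksum. Its structure definition is the same as that of the object summary shown above.
After num and runs come 3 counters: the sum of all counters accumulated, the maximum value on a single run, and the sum of individual run max values, that is, sum_all, run_max and sum_max.
Each gcov_ctr_summary in a program summary is 32 bytes; together with the 4-byte checksum this makes up the 9 words given by the length.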
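To make the layout of the 9 summary words concrete, here is a minimal decoding sketch. The struct and function names are simplified stand-ins rather than the GCC definitions, and it assumes GCOV_COUNTERS_SUMMABLE is 1, which is what a length of 9 = 1 + 1 * (2 + 3 * 2) implies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for gcov_ctr_summary / gcov_summary, assuming a
   single summable counter (GCOV_COUNTERS_SUMMABLE == 1). */
struct ctr_summary {
    uint32_t num, runs;
    uint64_t sum_all, run_max, sum_max;
};
struct summary {
    uint32_t checksum;
    struct ctr_summary ctr;
};

/* Decode the 9 data words of a summary record; each 64-bit value is
   stored low word first. */
struct summary parse_summary(const uint32_t w[9])
{
    struct summary s;

    assert(w != NULL);
    s.checksum    = w[0];
    s.ctr.num     = w[1];
    s.ctr.runs    = w[2];
    s.ctr.sum_all = (uint64_t)w[3] | ((uint64_t)w[4] << 32);
    s.ctr.run_max = (uint64_t)w[5] | ((uint64_t)w[6] << 32);
    s.ctr.sum_max = (uint64_t)w[7] | ((uint64_t)w[8] << 32);
    return s;
}
```

Fed with the program summary words from the dump, this yields checksum 0x51924f98, num 5, runs 1, sum_all 12, run_max 10 and sum_max 10, which agrees with the five counter values 10, 0, 1, 0, 1 seen in the COUNTER record.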
(8) file end
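The last unsigned value is 0x00000000. On reading it, the reader leaves its loop and closes the file. This design follows from the comment on the tag definitions shown in (4).
This completes the analysis of the file. Nearly all of the definitions above are in gcov-io.h.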
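The whole walk through the file can be summarized in a short sketch: skip the three header words, then repeatedly read a tag and a length and jump over that many data words until the zero tag or the end of the data. The function count_records is hypothetical and works on an in-memory array of words in host byte order; it is not how gcov-io.c is structured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical walker over an in-memory .gcda image given as 32-bit
   words in host byte order. Returns the number of records seen, or -1
   if the image is too short or a record runs past the end. */
int count_records(const uint32_t *w, size_t n)
{
    assert(w != NULL);
    if (n < 3)
        return -1;

    size_t pos = 3;                     /* skip magic, version, stamp */
    int records = 0;
    while (pos < n && w[pos] != 0) {    /* tag 0 is the end marker */
        if (pos + 1 >= n)
            return -1;                  /* tag without a length word */
        uint32_t length = w[pos + 1];
        if (length > n - (pos + 2))
            return -1;                  /* data would overrun the buffer */
        pos += 2 + (size_t)length;      /* header (tag, length) plus data */
        records++;
    }
    return records;
}
```

For the 42 words of the dump in section 1 this visits four records (FUNCTION, COUNTER, OBJECT SUMMARY, PROGRAM SUMMARY) and stops at the final zero word.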
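Note 2: this analysis only encountered the tags 0x01000000, 0x01a10000, 0xa1000000 and 0xa3000000. The other tags, such as 0x01430000 and 0x01450000, can be seen in the .gcno file and are not covered here. The appendix explains the ranges clearly, as follows.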
Level values [1..3f] are used for common tags, values [41..9f] for the notes file and [a1..ff] for the data file.
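Note 3: the explanations in the appendix are excerpted from gcov-io.h. That official documentation explains the file format in great detail and is provided for reference.
(9) Summary of the .gcda/.gcno file format
The file format is shown in the figure below.
Here,
magic is the marker that distinguishes the notes file from the data file;
version records the GCC version information;
stamp is a time stamp, used mainly to tell different compile/run/recompile cycles apart;
a record consists of two parts, a header and data. Records cannot be nested; they are organized into a hierarchy through the tag in the header. Tags are unique across the notes and data files. The length in the header gives the number of 4-byte words of data that follow.
a summary gives information about the whole object file and the whole program. There are two kinds, object summary and program summary, which share the same data structure.
The data of a record is held mainly in its data part. Among the data items, unit is used to distinguish different data items under the same record.
The function_data and summary items hold the profiling information we actually care about.
function_data contains 2 parts: announce_function and arc_counts;
The fields of announce_function are as follows.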
tag (32 bits): marker (arc_counts)
length (32 bits): number of data items
ident (32 bits): unique identifier of the function
checksum (32 bits): checksum
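The tag field in arc_counts indicates the type of information recorded by that arc_counts item. GCC currently supports mainly 7 kinds of value information, which can be seen in gcov-io.h.
3. File reading functions and their call sequence
3.1 Read / write related calls
The functions that read this file, mentioned above, are all implemented in gcov-io.c. The call sequence is as follows.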
main
->toplev_main
->do_compile
->compile_file
->coverage_init
->read_counts_file // the gcda file is read in this function
read_counts_file() calls gcov_read_words(), gcov_read_unsigned(), gcov_read_counter(), gcov_read_string() and gcov_read_summary() to do the reading.
Conversely, gcov_write_words(), gcov_write_unsigned(), gcov_write_counter() and others do the writing, and most write operations are performed in gcov_exit().
gcov_exit() calls 3 file operations, gcov_open, gcov_close and gcov_write_block, and also the intermediate-level functions such as gcov_write_tag_length, gcov_write_counter, gcov_write_summary, gcov_write_unsigned and gcov_write_words.
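3.2 Program exit point
When the program exits, gcov_exit() is called as a handler registered with atexit(); gcov_exit() then performs the write operations analyzed above. atexit is a glibc function.
So who registers gcov_exit()?
__gcov_init() calls atexit() to register gcov_exit(). As a result, gcov_exit() runs at program exit and writes the file. __gcov_init is in libgcov.c.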
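The registration mechanism itself is plain atexit() usage and can be sketched as below. Both function names are hypothetical stand-ins for illustration: dump_on_exit plays the role of gcov_exit() and register_dump_handler the role of the registration step in __gcov_init(); neither is libgcov code.

```c
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for gcov_exit(): in libgcov this is where the .gcda files
   are written; here it only prints a message. */
static void dump_on_exit(void)
{
    puts("exit handler: coverage data would be written here");
}

/* Stand-in for the registration step performed by __gcov_init().
   Returns 0 on success, like atexit() itself. */
int register_dump_handler(void)
{
    static int registered = 0;

    if (registered)
        return 0;                /* register the handler only once */
    int rc = atexit(dump_on_exit);
    if (rc == 0)
        registered = 1;
    return rc;
}
```

A program that calls register_dump_handler() early on will run dump_on_exit() when main returns or exit() is called, which is the point at which an instrumented program writes its counters.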
Reference
man date
info date
man od
date source code
gcov-iov.c
gcov-io.h
gcov-io.c
libgcov.c
coverage.c
coverage.h
Appendix: gcov file format definition
// This passage describes the coverage information files: .gcno (notes file) and .gcda (data file).
Coverage information is held in two files. A notes file , which is generated by the compiler, and a data file , which is generated by the program under test. Both files use a similar structure. We do not attempt to make these files backwards compatible with previous versions, as you only need coverage information when developing a program. We do hold version information, so that mismatches can be detected, and we use a format that allows tools to skip information they do not understand or are not interested in.
Numbers are recorded in the 32 bit unsigned binary form of the endianness of the machine generating the file. 64 bit numbers are stored as two 32 bit numbers, the low part first. Strings are padded with 1 to 4 NUL bytes, to bring the length up to a multiple of 4. The number of 4 bytes is stored, followed by the padded string. Zero length and NULL strings are simply stored as a length of zero (they have no trailing NUL or padding).
int32: byte3 byte2 byte1 byte0 | byte0 byte1 byte2 byte3 // layout of 32-bit data
int64: int32:low int32:high // layout of 64-bit data
string: int32:0 | int32:length char* char:0 padding
padding: | char:0 | char:0 char:0 | char:0 char:0 char:0
item: int32 | int64 | string // an item is one of: a 32-bit value, a 64-bit value, or a string
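The string padding rule (1 to 4 NUL bytes, rounding up to a multiple of 4) determines the value stored in the length word. A minimal sketch of that calculation follows; the function name string_length_words is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: number of 32-bit words occupied by the padded
   string body, which is the value stored in the length word. NULL and
   empty strings are stored as length 0 with no body. */
uint32_t string_length_words(const char *s)
{
    if (s == NULL || s[0] == '\0')
        return 0;
    /* 1 to 4 NUL bytes are appended to reach a multiple of 4. */
    return (uint32_t)((strlen(s) + 4) / 4);
}
```

For example, a 3-character string takes one word (3 characters plus 1 NUL), while a 4-character string takes two words (4 characters plus 4 NULs).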
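// The file format is as follows
The basic format of the files is
// This confirms the analysis in this article: the first 12 bytes of the file are magic, version and stamp (4 bytes each)
file : int32:magic int32:version int32:stamp record*
// This passage describes the role of each field in the file header.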
The magic ident is different for the notes and the data files. The magic ident is used to determine the endianness of the file, when reading. The version is the same for both files and is derived from gcc's version number. The stamp value is used to synchronize note and data files and to synchronize merging within a data file. It need not be an absolute time stamp , merely a ticker that increments fast enough and cycles slow enough to distinguish different compile/run/compile cycles.
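How a magic word can reveal the endianness of a file is sketched below: if the word only matches after its bytes are reversed, the file was written on a machine of the opposite byte order. This also suggests why the magic must not be a palindrome, since a palindromic value would look the same either way. The function names are hypothetical and the code is an illustration, not the gcov-io.c implementation.

```c
#include <stdint.h>

/* Reverse the byte order of a 32-bit word. */
uint32_t swap32(uint32_t v)
{
    v = (v >> 16) | (v << 16);
    return ((v & 0x00ff00ffu) << 8) | ((v >> 8) & 0x00ff00ffu);
}

/* Sketch of a magic check: returns 1 if the word matches as read,
   -1 if it matches only after byte swapping (opposite endianness),
   and 0 if it is not the expected magic at all. */
int check_magic(uint32_t magic, uint32_t expected)
{
    if (magic == expected)
        return 1;
    if (swap32(magic) == expected)
        return -1;
    return 0;
}
```

A reader that gets -1 from such a check would then need to byte-swap every subsequent word it reads from the file.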
// This passage explains where the header fields come from, and in particular describes in detail how version is composed.
Although the ident and version are formally 32 bit numbers, they are derived from 4 character ASCII strings . The version number consists of the single character major version number , a two character minor version number (leading zero for versions less than 10), and a single character indicating the status of the release.
That will be 'e' experimental, 'p' prerelease and 'r' for release . Because, by good fortune, these are in alphabetical order, string collating can be used to compare version strings. Be aware that the 'e' designation will (naturally) be unstable and might be incompatible with itself. For gcc 3.4 experimental, it would be '304e' (0x33303465). When the major version reaches 10, the letters A-Z will be used. Assuming minor increments releases every 6 months, we have to make a major increment every 50 years. Assuming major increments releases every 5 years, we're ok for the next 155 years -- good enough for me.
A record has a tag, length and variable amount of data.
record: header data // a record consists of a header and data
header: int32:tag int32:length // a header consists of a 32-bit tag and a 32-bit length
data: item* // followed by some data
// This passage describes how tags are composed: 4 levels of numbers, one byte per level, reflecting the record hierarchy.
Records are not nested, but there is a record hierarchy. Tag numbers reflect this hierarchy. Tags are unique across note and data files. Some record types have a varying amount of data. The LENGTH is the number of 4bytes that follow and is usually used to determine how much data. The tag value is split into 4 8-bit fields, one for each of four possible levels. The most significant is allocated first. Unused levels are zero. Active levels are odd-valued, so that the LSB of the level is one. A sub-level incorporates the values of its superlevels. This formatting allows you to determine the tag hierarchy, without understanding the tags themselves, and is similar to the standard section numbering used in technical documents. Level values [1..3f] are used for common tags, values [41..9f] for the notes file and [a1..ff] for the data file.
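The level rules in this paragraph (four 8-bit fields, most significant allocated first, unused levels zero, active levels odd) can be checked mechanically. The helper tag_depth below is a hypothetical illustration of those rules and does not correspond to a GCC macro.

```c
#include <stdint.h>

/* Hypothetical helper: number of active levels in a tag, reading the
   four 8-bit fields from the most significant byte down. Returns 0 for
   the end-of-file tag and -1 if an active level follows an unused one
   or an active level is not odd. */
int tag_depth(uint32_t tag)
{
    int depth = 0;
    int seen_unused = 0;

    for (int shift = 24; shift >= 0; shift -= 8) {
        uint32_t level = (tag >> shift) & 0xffu;
        if (level == 0) {
            seen_unused = 1;
        } else {
            if (seen_unused || (level & 1u) == 0)
                return -1;
            depth++;
        }
    }
    return depth;
}
```

Under these rules 0x01000000 (FUNCTION) and 0xa1000000 (OBJECT SUMMARY) are level-1 tags, while 0x01a10000 (COUNTER) and 0x01430000 (ARCS) are level-2 tags nested under FUNCTION.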
The basic block graph file contains the following records // note the indentation, which indicates the nesting of the structure
note: unit function-graph* // the note file consists of unit and function-graph data (the lines below are read the same way)
unit: header int32:checksum string:source // a unit consists of a header, a 32-bit checksum and a source string
string:name string:source int32:lineno
function-graph: announce_function basic_blocks {arcs | lines}* // * means zero or more
announce_function: header int32:ident int32:checksum
basic_block: header int32:flags* // a basic block consists of a header and zero or more 32-bit flags
arcs: header int32:block_no arc* // arcs is the jump table: a header, a 32-bit block number and zero or more arcs
arc: int32:dest_block int32:flags // an arc consists of a 32-bit destination block and 32-bit flags
lines: header int32:block_no line*
int32:0 string:NULL
line: int32:line_no | int32:0 string:filename
// This passage describes the composition of basic blocks
The BASIC_BLOCK record holds per-bb flags. The number of blocks can be inferred from its data length . There is one ARCS record per basic block. The number of arcs from a bb is implicit from the data length. It enumerates the destination bb and per-arc flags. There is one LINES record per basic block, it enumerates the source lines which belong to that basic block. Source file names are introduced by a line number of 0, following lines are from the new source file. The initial source file for the function is NULL, but the current source file should be remembered from one LINES record to the next. The end of a block is indicated by an empty filename this does not reset the current source file. Note there is no ordering of the ARCS and LINES records: they may be in any order, interleaved in any manner. The current filename follows the order the LINES records are stored in the file, *not* the ordering of the blocks they are for.
// Composition of the data file
The data file contains the following records.
data: {unit function-data* summary:object summary:program*}* // defined in the same way as above
unit: header int32:checksum
function-data: announce_function arc_counts // composition of function-data
announce_function: header int32:ident int32:checksum
arc_counts: header int64:count*
summary: int32:checksum {count-summary}GCOV_COUNTERS
count-summary: int32:num int32:runs int64:sum int64:max int64:sum_max // 32 bytes
// The count-summary described here corresponds exactly to the gcov_ctr_summary structure, shown below.
/* Cumulative counter data. */
struct gcov_ctr_summary
{
gcov_unsigned_t num; /* number of counters. */
gcov_unsigned_t runs; /* number of program runs */
gcov_type sum_all; /* sum of all counters accumulated. */
gcov_type run_max; /* maximum value on a single run. */
gcov_type sum_max; /* sum of individual run max values. */
};
// This passage describes the role of each field
The ANNOUNCE_FUNCTION record is the same as that in the note file, but without the source location. The ARC_COUNTS gives the counter values for those arcs that are instrumented. The SUMMARY records give information about the whole object file and about the whole program. The checksum is used for whole program summaries, and disambiguates different programs which include the same instrumented object file. There may be several program summaries, each with a unique checksum. The object summary's checksum is zero. Note that the data file might contain information from several runs concatenated, or the data might be merged.
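// This passage says that this file is included by the GCC source code, the gcov tools and the runtime library, and that the macros IN_LIBGCOV and IN_GCOV distinguish these cases.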
This file is included by both the compiler, gcov tools and the runtime support library libgcov. IN_LIBGCOV and IN_GCOV are used to distinguish which case is which. If IN_LIBGCOV is nonzero, libgcov is being built. If IN_GCOV is nonzero, the gcov tools are being built. Otherwise the compiler is being built. IN_GCOV may be positive or negative. If positive, we are compiling a tool that requires additional functions (see the code for knowledge of what those functions are).
// The macros are summarized as follows
Build libgcov : IN_LIBGCOV=1
Build gcov tool: IN_GCOV=1 (a positive value requires additional functions)
Build gcc : otherwise
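Note: once the file format is understood, writing programs to read and write it becomes much easier. Because the reading and writing is done in binary mode, the operations used most often are fread and fwrite.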
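A minimal sketch of such word-level I/O with fread and fwrite is shown below. The helper names read_word and write_word are hypothetical, and the sketch assumes the file is in the byte order of the machine running it, so no byte swapping is done.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: read one 32-bit word in host byte order.
   Returns 1 on success and 0 on end of file or error. */
int read_word(FILE *fp, uint32_t *out)
{
    return fread(out, sizeof *out, 1, fp) == 1;
}

/* Hypothetical helper: write one 32-bit word in host byte order.
   Returns 1 on success and 0 on error. */
int write_word(FILE *fp, uint32_t value)
{
    return fwrite(&value, sizeof value, 1, fp) == 1;
}
```

A reader built on read_word would open the file with fopen(path, "rb"), read the magic, version and stamp, and then loop over tag and length pairs as described in section 2.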