Study notes on the GCC 3.4.6 source (1)

About four years ago I joined GDNT, Nortel's joint venture in China, and worked on the radio access network of 3G UMTS; that was where I first used GCC. At that time Nortel used GCC widely as the official compiler for its giant UMTS and CDMA projects, its self-developed tools and so on (later GCC was also used in the 4G projects, WiMAX and LTE). Every week I needed to build several project loads to verify my bug fixes; each load build compiled thousands of files, and the final object file was larger than 100 megabytes. GCC worked with surprising stability. In all these years I have seen GCC crash only twice. One of the crashes happened when I tried separate compilation of templates, which GCC does not support yet, and so fed GCC some weird code (as far as I can tell, so far only the EDG front-end can do that; before the C++ standard, ISO/IEC 14882:1998, became stable, the EDG front-end was the practical standard, and http://www.edg.com/ has more about EDG). This produced wrong intermediate tree code, which triggered an assertion inside GCC (GCC could not detect the semantic error). Though GCC exited immediately, it gave a detailed diagnostic dump: an elegant way to give up.

This fantastic tool attracts me deeply! Though I had studied the principles of compilation before, facing GCC I found that I knew little about it. Thanks to its being open source, I can peep into its body, which is mysterious at least to me. After digging into its source code over these years I have learnt something about this compiler, but its veil is still not fully lifted. I am glad to share my study notes on the GCC source with all of you (the GCC concerned is v3.4.6, C++ front-end, host: x86/Linux, target: x86/Linux); the notes are far from complete and still growing.

Reference

[1] Programming language pragmatics, 2nd edition

[2] gccint, version 3.4.6

[3] ISO-IEC-14882-2003

[4] The C Preprocessor April 2001, for GCC V3

[5] cppinternals

[6] Using the GNU Compiler Collection

[7] Inside The C++ Object Model, Stanley B.Lippman

[8] GCC Complete Reference

[9] The design and evolution of C++, by Bjarne Stroustrup

[10] Linkers & Loaders, by John R. Levine

[11] Efficient Instruction Scheduling Using Finite State Automata, by Vasanth Bala, Norman Rubin

[12] Compilers: Principles, Techniques, and Tools, 2nd edition

[2] and [5] can be found in YOUR-GCC-SOURCE-DIR/gcc/doc. [1], [7], [9], [10] and [12] give useful background knowledge.

Preparation before going ahead

Some non-trivial parts of GCC's code are generated by GCC's own tools. Before really diving into the source code, we first need to build the GCC project so that these files are generated. I skip the steps for downloading the source (http://gcc.gnu.org/mirrors.html is the official download site), configuring it and compiling it; plenty of pages on the net cover them. (Note: if GCC is already installed, g++ -### shows the configure command that was used.)

Software architecture of GCC

GCC as a whole can be divided into two parts: the front-end and the back-end. The preprocessor (if there is one), the lexer (for lexical analysis) and the parser (for syntax analysis, together with semantic analysis) are implemented in the front-end, whose purpose is to transform the source language into an intermediate form that is independent of the source language. In theory, introducing a new language only requires implementing a preprocessor, a lexer and a parser; in practice we still need to write code to set up the environment they need.

In GCC 3.4.6 the common intermediate language is RTL (register transfer language). RTL is a simple language that can easily be lowered to assembly language, so it is not a suitable form to translate source code into directly. In fact, the front-end first reduces the code to an intermediate tree and does a lot of manipulation on it before turning it into RTL form, which is then fed to the back-end.

The aim of the back-end is to generate assembly code. As a widely used compiler, GCC can be built to support various target platforms. To achieve this, GCC uses files called machine descriptions, which describe not only the instruction set but also the characteristics of the pipeline and the optimization opportunities the architecture offers. For a specified target, several tools process these files to generate source files that are then used in compiling GCC itself. To introduce a new machine, the major work is to provide a correct machine description file (usually some helper functions must also be defined to implement the necessary processing logic).

Information about the newest version of GCC can be found at http://www.ibm.com/developerworks/linux/library/l-gcc4/index.html?S_TACT=105AGX52&S_CMP=content.

The front end

When we run "gcc -o xxx xxx.c", we are only calling a shell program (the driver). It parses the command line, does some preparation on the host platform according to the options it recognizes (pay attention to the difference between the host machine and the target machine), then calls the appropriate compiler according to the suffix of the file and passes the options it does not recognize on to that compiler. We will not look at this shell here; our focus is the real compiler.
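The suffix-based dispatch can be pictured with a small sketch. This is not code from the GCC driver, which is controlled by spec strings and handles many more suffixes; the table and the function name compiler_for_file are made up for illustration. Only cc1 and cc1plus are real names, those of the C and C++ compilers proper.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table, not taken from the GCC driver: it maps a file
   suffix to the name of the compiler proper that should handle it. */
struct suffix_rule {
  const char *suffix;
  const char *compiler;
};

static const struct suffix_rule rules[] = {
  { ".c",   "cc1" },      /* C source */
  { ".cc",  "cc1plus" },  /* C++ source */
  { ".cpp", "cc1plus" },
  { ".cxx", "cc1plus" },
  { ".C",   "cc1plus" },
};

/* Return the compiler for FILENAME, or NULL if the suffix is missing
   or unknown to this simplified table. */
const char *compiler_for_file(const char *filename)
{
  const char *dot = strrchr(filename, '.');
  size_t i;

  if (dot == NULL)
    return NULL;
  for (i = 0; i < sizeof rules / sizeof rules[0]; i++)
    if (strcmp(dot, rules[i].suffix) == 0)
      return rules[i].compiler;
  return NULL;
}
```

With this table, compiler_for_file("xxx.c") yields "cc1", while a name without a known suffix yields NULL.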

1. Overview

The front-end reads in and parses a program written in the specified language and transforms it into a language-independent form. In theory every front-end could generate its own distinctive form; in GCC, however, to reuse as much code as possible, the intermediate trees built by the different front-ends all use the same set of tree nodes. Of course, the nodes must be diverse enough to cover all the languages. This tree form is so important in handling C/C++ programs that it deserves our attention and effort first.

1.1. Tree representation in front-ends

To accommodate both the existing front-ends and those yet to be added, there are dozens of kinds of nodes (not all of them can be leaves). All nodes begin with the common part below, which occupies the beginning of every node structure.

 

129  struct tree_common GTY(())                                  in tree.h
130  {
131    tree chain;
132    tree type;
133
134    ENUM_BITFIELD(tree_code) code : 8;
135
136    unsigned side_effects_flag : 1;
137    unsigned constant_flag : 1;
138    unsigned addressable_flag : 1;
139    unsigned volatile_flag : 1;
140    unsigned readonly_flag : 1;
141    unsigned unsigned_flag : 1;
142    unsigned asm_written_flag: 1;
143    unsigned unused_0 : 1;
144
145    unsigned used_flag : 1;
146    unsigned nothrow_flag : 1;
147    unsigned static_flag : 1;
148    unsigned public_flag : 1;
149    unsigned private_flag : 1;
150    unsigned protected_flag : 1;
151    unsigned deprecated_flag : 1;
152    unsigned unused_1 : 1;
153
154    unsigned lang_flag_0 : 1;
155    unsigned lang_flag_1 : 1;
156    unsigned lang_flag_2 : 1;
157    unsigned lang_flag_3 : 1;
158    unsigned lang_flag_4 : 1;
159    unsigned lang_flag_5 : 1;
160    unsigned lang_flag_6 : 1;
161    unsigned unused_2 : 1;
162  };

 

At line 134 above, ENUM_BITFIELD expands to __extension__ enum in version 3.4.6, and chain at line 131 links the node into the tree when needed.
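To make the role of the common part concrete, here is a minimal sketch with toy types. Every name in it (toy_code, toy_common, toy_node, TOY_CODE, TOY_CHAIN, toy_chain_length) is invented for illustration and is not GCC's. It uses a plain unsigned bit-field for the code, because a bit-field of enum type is not guaranteed by ISO C; that portability concern is presumably why GCC hides the declaration behind ENUM_BITFIELD.

```c
#include <stddef.h>

/* Toy node codes; invented for this sketch. */
enum toy_code { TOY_INTEGER_CST, TOY_VAR_DECL, TOY_PLUS_EXPR };

/* Common header shared by every kind of node. */
struct toy_common {
  union toy_node *chain;          /* next node on a list, or NULL */
  union toy_node *type;
  unsigned code : 8;              /* holds an enum toy_code value */
  unsigned side_effects_flag : 1;
  unsigned constant_flag : 1;
};

/* Each concrete node starts with the common header. */
struct toy_int_cst {
  struct toy_common common;
  long value;
};

struct toy_decl {
  struct toy_common common;
  const char *name;
};

union toy_node {
  struct toy_common common;
  struct toy_int_cst int_cst;
  struct toy_decl decl;
};

/* Because the header always comes first, these accessors work on a
   node of any kind without knowing which member is in use. */
#define TOY_CODE(NODE)  ((enum toy_code) (NODE)->common.code)
#define TOY_CHAIN(NODE) ((NODE)->common.chain)

/* Count the nodes linked through the chain field. */
size_t toy_chain_length(const union toy_node *node)
{
  size_t n = 0;

  for (; node != NULL; node = TOY_CHAIN(node))
    n++;
  return n;
}
```

The point of the design is that generic code, such as the list walker, touches only the header, while code that knows the node kind goes on to the kind-specific fields.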

The meanings of some of the flags defined in the structure, and the macros defined to access them, are listed below.

* TREE_TYPE ((NODE)->common.type)
  - In all nodes that are expressions, this is the data type of the expression.
  - In POINTER_TYPE nodes, this is the type that the pointer points to.
  - In ARRAY_TYPE nodes, this is the type of the elements.
  - In VECTOR_TYPE nodes, this is the type of the elements.

* TREE_ADDRESSABLE ((NODE)->common.addressable_flag)
  - In VAR_DECL nodes, nonzero means address of this is needed. So it cannot be in a register.
  - In a FUNCTION_DECL, nonzero means its address is needed. So it must be compiled even if it is an inline function.
  - In a FIELD_DECL node, it means that the programmer is permitted to construct the address of this field. This is used for aliasing purposes: see record_component_aliases.
  - In CONSTRUCTOR nodes, it means object constructed must be in memory.
  - In LABEL_DECL nodes, it means a goto for this label has been seen from a place outside all binding contours that restore stack levels.
  - In *_TYPE nodes, it means that objects of this type must be fully addressable. This means that pieces of this object cannot go into register parameters, for example.
  - In IDENTIFIER_NODEs, this means that some extern decl for this name had its address taken. That matters for inline functions.

* TREE_STATIC ((NODE)->common.static_flag)
  - In a VAR_DECL, nonzero means allocate static storage.
  - In a FUNCTION_DECL, nonzero if function has been defined.
  - In a CONSTRUCTOR, nonzero means allocate static storage.

* TREE_VIA_VIRTUAL ((NODE)->common.static_flag)
  - Nonzero for a TREE_LIST or TREE_VEC node means that the derivation chain is via a `virtual' declaration.

* TREE_CONSTANT_OVERFLOW ((NODE)->common.static_flag)
  - In an INTEGER_CST, REAL_CST, COMPLEX_CST, or VECTOR_CST this means there was an overflow in folding. This is distinct from TREE_OVERFLOW because ANSI C requires a diagnostic when overflows occur in constant expressions.

* TREE_SYMBOL_REFERENCED (IDENTIFIER_NODE_CHECK (NODE)->common.static_flag)
  - In an IDENTIFIER_NODE, this means that assemble_name was called with this string as an argument.

* CLEANUP_EH_ONLY ((NODE)->common.static_flag)
  - In a TARGET_EXPR, WITH_CLEANUP_EXPR, CLEANUP_STMT, or element of a block's cleanup list, means that the pertinent cleanup should only be executed if an exception is thrown, not on normal exit of its scope.

* TREE_OVERFLOW ((NODE)->common.public_flag)
  - In an INTEGER_CST, REAL_CST, COMPLEX_CST, or VECTOR_CST, this means there was an overflow in folding, and no warning has been issued for this subexpression. TREE_OVERFLOW implies TREE_CONSTANT_OVERFLOW, but not vice versa.

* TREE_PUBLIC ((NODE)->common.public_flag)
  - In a VAR_DECL or FUNCTION_DECL, nonzero means name is to be accessible from outside this module. In an IDENTIFIER_NODE, nonzero means an external declaration accessible from outside this module was previously seen for this name in an inner scope.

* TREE_PRIVATE ((NODE)->common.private_flag)
  - Used in classes in C++.

* CALL_EXPR_HAS_RETURN_SLOT_ADDR ((NODE)->common.private_flag)
  - In a CALL_EXPR, means that the address of the return slot is part of the argument list.

* TREE_PROTECTED ((NODE)->common.protected_flag)
  - Used in classes in C++. In a BLOCK node, this is BLOCK_HANDLER_BLOCK.

* CALL_FROM_THUNK_P ((NODE)->common.protected_flag)
  - In a CALL_EXPR, means that the call is the jump from a thunk to the thunked-to function.

* TREE_SIDE_EFFECTS ((NODE)->common.side_effects_flag)
  - In any expression, nonzero means it has side effects or reevaluation of the whole expression could produce a different value. This is set if any subexpression is a function call, a side effect or a reference to a volatile variable.
  - In a *_DECL, this is set only if the declaration said `volatile'.

* TREE_THIS_VOLATILE ((NODE)->common.volatile_flag)
  - Nonzero means this expression is volatile in the C sense: its address should be of type `volatile WHATEVER *'. In other words, the declared item is volatile qualified. This is used in *_DECL nodes and *_REF nodes.
  - In a *_TYPE node, means this type is volatile-qualified. But use TYPE_VOLATILE instead of this macro when the node is a type, because eventually we may make that a different bit. If this bit is set in an expression, so is TREE_SIDE_EFFECTS.

* TYPE_VOLATILE (TYPE_CHECK (NODE)->common.volatile_flag)
  - Nonzero in a type considered volatile as a whole.

* TREE_READONLY ((NODE)->common.readonly_flag)
  - In a VAR_DECL, PARM_DECL or FIELD_DECL, or any kind of *_REF node, nonzero means it may not be the lhs of an assignment.
  - In a *_TYPE node, means this type is const-qualified (but the macro TYPE_READONLY should be used instead of this macro when the node is a type).

* TYPE_READONLY (TYPE_CHECK (NODE)->common.readonly_flag)
  - Means this type is const-qualified.

* TREE_CONSTANT ((NODE)->common.constant_flag)
  - Value of expression is constant. Always appears in all *_CST nodes. May also appear in an arithmetic expression, an ADDR_EXPR or a CONSTRUCTOR if the value is constant.

* TREE_UNSIGNED ((NODE)->common.unsigned_flag)
  - In INTEGER_TYPE or ENUMERAL_TYPE nodes, means an unsigned type. In FIELD_DECL nodes, means an unsigned bit field.

* TREE_ASM_WRITTEN ((NODE)->common.asm_written_flag)
  - Nonzero in a VAR_DECL means assembler code has been written.
  - Nonzero in a FUNCTION_DECL means that the function has been compiled. This is interesting in an inline function, since it might not need to be compiled separately.
  - Nonzero in a RECORD_TYPE, UNION_TYPE, QUAL_UNION_TYPE or ENUMERAL_TYPE if the sdb debugging info for the type has been written.
  - In a BLOCK node, nonzero if reorder_blocks has already seen this block.

* TREE_USED ((NODE)->common.used_flag)
  - Nonzero in a *_DECL if the name is used in its scope.
  - Nonzero in an expr node means inhibit warning if value is unused.
  - In IDENTIFIER_NODEs, this means that some extern decl for this name was used.

* TREE_NOTHROW ((NODE)->common.nothrow_flag)
  - In a FUNCTION_DECL, nonzero means a call to the function cannot throw an exception. In a CALL_EXPR, nonzero means the call cannot throw.

* TYPE_ALIGN_OK (TYPE_CHECK (NODE)->common.nothrow_flag)
  - In a type, nonzero means that all objects of the type are guaranteed by the language or front-end to be properly aligned, so we can indicate that a MEM of this type is aligned at least to the alignment of the type, even if it doesn't appear that it is. We see this, for example, in object-oriented languages where a tag field may show this is an object of a more-aligned variant of the more generic type.

* TREE_DEPRECATED ((NODE)->common.deprecated_flag)
  - Nonzero in an IDENTIFIER_NODE if the use of the name is defined as a deprecated feature by __attribute__((deprecated)).

List 1: flags in tree_common
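List 1 shows that one bit often serves several macros: static_flag, for instance, is read as TREE_STATIC on a VAR_DECL and as TREE_VIA_VIRTUAL on a TREE_LIST. The sketch below illustrates this reuse with toy names (demo_code, demo_node, demo_check, DEMO_STATIC and DEMO_VIA_VIRTUAL are all invented). It also adds a run-time check of the node code, in the spirit of the IDENTIFIER_NODE_CHECK and TYPE_CHECK wrappers that appear in the list; how GCC itself implements those checks is not shown here.

```c
#include <stdio.h>
#include <stdlib.h>

/* Toy node codes; invented for this sketch. */
enum demo_code { DEMO_VAR_DECL, DEMO_TREE_LIST, DEMO_INTEGER_CST };

struct demo_node {
  unsigned code : 8;         /* holds an enum demo_code value */
  unsigned static_flag : 1;  /* meaning depends on the node code */
  unsigned public_flag : 1;
};

/* Hypothetical checker: return NODE if it has the expected code,
   otherwise report the mismatch and abort. */
static struct demo_node *demo_check(struct demo_node *node,
                                    enum demo_code expected)
{
  if (node->code != (unsigned) expected) {
    fprintf(stderr, "demo_check: unexpected node code %u\n",
            (unsigned) node->code);
    abort();
  }
  return node;
}

/* Two macros, one bit: which name is valid depends on the node kind. */
#define DEMO_STATIC(NODE) \
  (demo_check((NODE), DEMO_VAR_DECL)->static_flag)
#define DEMO_VIA_VIRTUAL(NODE) \
  (demo_check((NODE), DEMO_TREE_LIST)->static_flag)
```

Both macros expand to an lvalue, so DEMO_STATIC (decl) = 1 sets the bit, while applying DEMO_STATIC to a list node aborts instead of silently reading a bit that means something else.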

1.1.1. Tree node definition

The definition of tree node is given below.

 

45    typedef union tree_node *tree;                              in coretypes.h

1772 union tree_node GTY ((ptr_alias (union lang_tree_node),      in tree.h
1773                   desc ("tree_node_structure (&%h)")))
1774 {
1775   struct tree_common GTY ((tag ("TS_COMMON"))) common;
1776   struct tree_int_cst GTY ((tag ("TS_INT_CST"))) int_cst;
1777   struct tree_real_cst GTY ((tag ("TS_REAL_CST"))) real_cst;
1778   struct tree_vector GTY ((tag ("TS_VECTOR"))) vector;
1779   struct tree_string GTY ((tag ("TS_STRING"))) string;
1780   struct tree_complex GTY ((tag ("TS_COMPLEX"))) complex;
1781   struct tree_identifier GTY ((tag ("TS_IDENTIFIER"))) identifier;
1782   struct tree_decl GTY ((tag ("TS_DECL"))) decl;
1783   struct tree_type GTY ((tag ("TS_TYPE"))) type;
1784   struct tree_list GTY ((tag ("TS_LIST"))) list;
1785   struct tree_vec GTY ((tag ("TS_VEC"))) vec;
1786   struct tree_exp GTY ((tag ("TS_EXP"))) exp;
1787   struct tree_block GTY ((tag ("TS_BLOCK"))) block;
1788 };

 

As expected, the node is defined as a union. Notice ptr_alias at line 1772: it tells GTY (the marker for GCC's garbage collection service, which we ignore for the moment) that a pointer to tree_node in fact points to lang_tree_node, which is specific to each front-end and builds on tree_node with extra fields for the characteristics of the language (so it can still be pointed to by a tree_node pointer). In the C++ front-end, lang_tree_node has the following definition.

 

472  union lang_tree_node GTY((desc ("cp_tree_node_structure (&%h)"),      in cp-tree.h
473         chain_next ("(union lang_tree_node *)TREE_CHAIN (&%h.generic)")))
474  {
475    union tree_node GTY ((tag ("TS_CP_GENERIC"),
476                      desc ("tree_node_structure (&%h)"))) generic;
477    struct template_parm_index_s GTY ((tag ("TS_CP_TPI"))) tpi;
478    struct ptrmem_cst GTY ((tag ("TS_CP_PTRMEM"))) ptrmem;
479    struct tree_overload GTY ((tag ("TS_CP_OVERLOAD"))) overload;
480    struct tree_baselink GTY ((tag ("TS_CP_BASELINK"))) baselink;
481    struct tree_wrapper GTY ((tag ("TS_CP_WRAPPER"))) wrapper;
482    struct tree_default_arg GTY ((tag ("TS_CP_DEFAULT_ARG"))) default_arg;
483    struct lang_identifier GTY ((tag ("TS_CP_IDENTIFIER"))) identifier;
484  };
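In both unions the desc option names an expression (tree_node_structure or cp_tree_node_structure applied to the node), and each member carries a tag; as I understand it, the value of the expression is compared with the tags to decide which member of the union is in use. The following sketch imitates that idea with toy types. All names in it (mini_code, mini_structure, mini_node, mini_node_structure, mini_eval) are invented, and the mapping from codes to members is an assumption made for the example, not GCC's.

```c
#include <stddef.h>

/* Toy node codes and toy structure tags; invented for this sketch. */
enum mini_code { MINI_INTEGER_CST, MINI_IDENTIFIER, MINI_PLUS_EXPR };
enum mini_structure { MINI_TS_INT_CST, MINI_TS_IDENTIFIER, MINI_TS_EXP };

struct mini_common { unsigned code : 8; };

struct mini_int_cst {
  struct mini_common common;
  long value;
};

struct mini_identifier {
  struct mini_common common;
  const char *name;
};

struct mini_exp {
  struct mini_common common;
  union mini_node *operands[2];
};

union mini_node {
  struct mini_common common;
  struct mini_int_cst int_cst;
  struct mini_identifier identifier;
  struct mini_exp exp;
};

/* Plays the role of the desc expression: map the node code, which is
   readable through the shared header, to the member that is in use. */
enum mini_structure mini_node_structure(const union mini_node *node)
{
  switch ((enum mini_code) node->common.code) {
  case MINI_INTEGER_CST: return MINI_TS_INT_CST;
  case MINI_IDENTIFIER:  return MINI_TS_IDENTIFIER;
  case MINI_PLUS_EXPR:   return MINI_TS_EXP;
  }
  return MINI_TS_EXP;  /* not reached for valid codes */
}

/* A consumer that dispatches on the structure tag. Identifiers
   evaluate to 0 in this toy. */
long mini_eval(const union mini_node *node)
{
  switch (mini_node_structure(node)) {
  case MINI_TS_INT_CST:
    return node->int_cst.value;
  case MINI_TS_EXP:
    return mini_eval(node->exp.operands[0])
           + mini_eval(node->exp.operands[1]);
  case MINI_TS_IDENTIFIER:
    break;
  }
  return 0;
}
```

A walker written this way never needs to guess which member is valid; it asks the discriminating function first, which is presumably what the generated garbage-collection routines do with the desc expression.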
