FreeBSD Developers' Handbook (Part 1)

The FreeBSD Documentation Project

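FreeBSD Chinese Project

  Welcome to the FreeBSD Developers' Handbook. This manual is a work in progress and is the work of many individuals. Many chapters are still empty, and some of those that exist need to be updated. If you are interested in helping with this project, send email to the FreeBSD documentation project mailing list.

   The latest English version of this document is available from the FreeBSD Web site, and the latest Chinese translation from the FreeBSD Chinese Project Web site. It may also be downloaded in a variety of formats and compression options from the FreeBSD FTP server or one of the numerous mirror sites.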

Redistribution and use in source (SGML DocBook) and 'compiled' forms (SGML, HTML, PDF, PostScript, RTF and so forth) with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code (SGML DocBook) must retain the above copyright notice, this list of conditions and the following disclaimer as the first lines of this file unmodified.

  2. Redistributions in compiled form (transformed to other DTDs, converted to PDF, PostScript, RTF and other formats) must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

Important: THIS DOCUMENTATION IS PROVIDED BY THE FREEBSD DOCUMENTATION PROJECT "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE FREEBSD DOCUMENTATION PROJECT BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS DOCUMENTATION, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

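FreeBSD is a registered trademark of the FreeBSD Foundation.

Apple, FireWire, Mac, Macintosh, Mac OS, Quicktime, and TrueType are registered trademarks of Apple Computer, Inc. in the United States and other countries.

IBM, AIX, EtherJet, Netfinity, OS/2, PowerPC, PS/2, S/390, and ThinkPad are registered trademarks or trademarks of International Business Machines Corporation in the United States and other countries.

IEEE, POSIX, and 802 are registered trademarks of Institute of Electrical and Electronics Engineers, Inc. in the United States.

Intel, Celeron, EtherExpress, i386, i486, Itanium, Pentium, and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Linux is a registered trademark of Linus Torvalds.

Microsoft, IntelliMouse, MS-DOS, Outlook, Windows, Windows Media, and Windows NT are trademarks or registered trademarks of Microsoft Corporation in the United States and/or other countries.

Motif, OSF/1, and UNIX are registered trademarks of The Open Group in the United States and other countries; IT DialTone and The Open Group are its trademarks.

Sun, Sun Microsystems, Java, Java Virtual Machine, JavaServer Pages, JDK, JSP, JVM, Netra, Solaris, StarOffice, Sun Blade, Sun Enterprise, Sun Fire, SunOS, and Ultra are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.

Many manufacturers and sellers use designations claimed as trademarks to distinguish their products. Where such trademarks appear in this document and are known to the FreeBSD Project, they are followed by the '™' or the '®' symbol.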


Table of Contents
Part I. Basics
Chapter 1  Introduction
1.1 Developing on FreeBSD
1.2 The BSD Vision
1.3 Architectural Guidelines
1.4 The Layout of /usr/src
Chapter 2  Programming Tools
2.1 Synopsis
2.2 Introduction
2.3 Introduction to Programming
2.4 Compiling with cc
2.5 Make
2.6 Debugging
2.7 Using Emacs as a Development Environment
2.8 Further Reading
Chapter 3  Secure Programming
3.1 Synopsis
3.2 Secure Design Methodology
3.3 Buffer Overflows
3.4 SetUID Issues
3.5 Limiting Your Program's Environment
3.6 Trust
3.7 Race Conditions
Chapter 4  Localization and Internationalization - L10N and I18N
4.1 Programming I18N Compliant Applications
Chapter 5  Source Tree Guidelines and Policies
5.1 MAINTAINER on Makefiles
5.2 Contributed Software
5.3 Encumbered Files
5.4 Shared Libraries
Chapter 6  Regression and Performance Testing
6.1 Micro Benchmark Checklist
Part II. Interprocess Communication
Chapter 7  Sockets
7.1 Synopsis
7.2 Networking and Diversity
7.3 Protocols
7.4 The Sockets Model
7.5 Essential Socket Functions
7.6 Helper Functions
7.7 Concurrent Servers
Chapter 8  IPv6 Internals
8.1 IPv6/IPsec Implementation
Part III. Kernel
Chapter 9  DMA
9.1 DMA: What it is and How it Works
Chapter 10  Kernel Debugging
10.1 How to Save a Kernel Crash Dump to a File
10.2 Debugging a Kernel Crash Dump with kgdb
10.3 Debugging a Crash Dump with DDD
10.4 Post-Mortem Analysis of a Dump
10.5 On-Line Kernel Debugging Using DDB
10.6 On-Line Kernel Debugging Using Remote GDB
10.7 Debugging Loadable Modules Using GDB
10.8 Debugging a Console Driver
Part IV. Architectures
Chapter 11  x86 Assembly Language Programming
11.1 Synopsis
11.2 The Tools
11.3 System Calls
11.4 Return Values
11.5 Creating Portable Code
11.6 Our First Program
11.7 Writing UNIX® Filters
11.8 Buffered Input and Output
11.9 Command Line Arguments
11.10 UNIX Environment
11.11 Working with Files
11.12 One-Pointed Mind
11.13 Using the FPU
11.14 Caveats
11.15 Acknowledgements
Part V. Appendices
Bibliography
Index
List of Examples
Example 2-1. A sample .emacs file

Part I. Basics

Table of Contents
Chapter 1  Introduction
Chapter 2  Programming Tools
Chapter 3  Secure Programming
Chapter 4  Localization and Internationalization - L10N and I18N
Chapter 5  Source Tree Guidelines and Policies
Chapter 6  Regression and Performance Testing

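Chapter 1  Introduction

Contributed by Murray Stokely and Jeroen Ruigrok van der Werven. Translated by Shi Jerome.

1.1 Developing on FreeBSD

  By the time you read this chapter, the operating system should be installed and you should be ready to start programming. What does FreeBSD do for programmers? What tools does it provide?

  This chapter tries to answer some of these questions. Of course, levels of programming proficiency vary: for some people programming is a hobby, while for others it is a profession. The material in this chapter is aimed mainly at beginners, but it is also very useful to professional programmers who are new to the FreeBSD platform.


1.2 The BSD Vision

  To produce the best possible programs for UNIX®-like operating systems, one must respect the usability, performance, and stability of the original software tools, as well as the ideas of their creators.


1.3 Architectural Guidelines

  The following guidelines describe our point of view:

  • Do not add a new feature unless users cannot get their work done without it.

  • Deciding what a system should not do is as important as deciding what it should do. A system cannot serve everyone's needs; rather, it should be extensible enough to keep up with users' growing needs.

  • The only thing worse than generalizing from a single example is generalizing from no examples at all.

  • If a problem is not fully understood, it is best not to rush into solving it.

  • If 10 percent of the code can do 90 percent of the work, use it.

  • The simpler, the better.

  • Provide mechanism rather than policy. In particular, leave user interface policy in the client's hands.

  From Scheifler & Gettys: "X Window System"


1.4 The Layout of /usr/src

  The complete source code for FreeBSD is available from the public CVS repository. The source code is normally installed in /usr/src, which contains the following subdirectories: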

  

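Directory     Description
bin/          Source for files in /bin
contrib/      Source for files from third-party software
crypto/       Cryptographic sources
etc/          Source for files in /etc
games/        Source for files in /usr/games
gnu/          Utilities covered by the GNU Public License
include/      Source for files in /usr/include
kerberos5/    Source for Kerberos version 5
lib/          Source for files in /usr/lib
libexec/      Source for files in /usr/libexec
release/      Files required to produce a FreeBSD release
rescue/       Build system for the /rescue utilities
sbin/         Source for files in /sbin
secure/       FreeSec sources
share/        Source for files in /usr/share
sys/          Kernel source files
tools/        Tools used for maintenance and testing of FreeBSD
usr.bin/      Source for files in /usr/bin
usr.sbin/     Source for files in /usr/sbin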



第2章  Programming Tools

Contributed by James Raynard and Murray Stokely.

2.1 Synopsis

  This chapter is an introduction to using some of the programming tools supplied with FreeBSD, although much of it will be applicable to many other versions of UNIX. It does not attempt to describe coding in any detail. Most of the chapter assumes little or no previous programming knowledge, although it is hoped that most programmers will find something of value in it.


2.2 Introduction

  FreeBSD offers an excellent development environment. Compilers for C, C++, and Fortran and an assembler come with the basic system, not to mention a Perl interpreter and classic UNIX tools such as sed and awk. If that is not enough, there are many more compilers and interpreters in the Ports collection. FreeBSD is very compatible with standards such as POSIX® and ANSI C, as well as with its own BSD heritage, so it is possible to write applications that will compile and run with little or no modification on a wide range of platforms.

  However, all this power can be rather overwhelming at first if you have never written programs on a UNIX platform before. This document aims to help you get up and running, without getting too deeply into more advanced topics. The intention is that this document should give you enough of the basics to be able to make some sense of the documentation.

  Most of the document requires little or no knowledge of programming, although it does assume a basic competence with using UNIX and a willingness to learn!


2.3 Introduction to Programming

  A program is a set of instructions that tell the computer to do various things; sometimes the instruction it has to perform depends on what happened when it performed a previous instruction. This section gives an overview of the two main ways in which you can give these instructions, or “commands” as they are usually called. One way uses an interpreter, the other a compiler. As human languages are too difficult for a computer to understand in an unambiguous way, commands are usually written in one of several languages specially designed for the purpose.


2.3.1 Interpreters

  With an interpreter, the language comes as an environment, where you type in commands at a prompt and the environment executes them for you. For more complicated programs, you can type the commands into a file and get the interpreter to load the file and execute the commands in it. If anything goes wrong, many interpreters will drop you into a debugger to help you track down the problem.

  The advantage of this is that you can see the results of your commands immediately, and mistakes can be corrected readily. The biggest disadvantage comes when you want to share your programs with someone. They must have the same interpreter, or you must have some way of giving it to them, and they need to understand how to use it. Also users may not appreciate being thrown into a debugger if they press the wrong key! From a performance point of view, interpreters can use up a lot of memory, and generally do not generate code as efficiently as compilers.

  In my opinion, interpreted languages are the best way to start if you have not done any programming before. This kind of environment is typically found with languages like Lisp, Smalltalk, Perl and Basic. It could also be argued that the UNIX shell (sh, csh) is itself an interpreter, and many people do in fact write shell “scripts” to help with various “housekeeping” tasks on their machine. Indeed, part of the original UNIX philosophy was to provide lots of small utility programs that could be linked together in shell scripts to perform useful tasks.


2.3.2 Interpreters available with FreeBSD

  Here is a list of interpreters that are available from the FreeBSD Ports Collection, with a brief discussion of some of the more popular interpreted languages.

  Instructions on how to get and install applications from the Ports Collection can be found in the Ports section of the handbook.

BASIC

Short for Beginner's All-purpose Symbolic Instruction Code. Developed in the 1960s for teaching University students to program and provided with every self-respecting personal computer in the 1980s, BASIC has been the first programming language for many programmers. It is also the foundation for Visual Basic.

The Bywater Basic Interpreter can be found in the Ports Collection as lang/bwbasic and Phil Cockroft's Basic Interpreter (formerly Rabbit Basic) is available as lang/pbasic.

Lisp

A language that was developed in the late 1950s as an alternative to the “number-crunching” languages that were popular at the time. Instead of being based on numbers, Lisp is based on lists; in fact the name is short for “List Processing”. Very popular in AI (Artificial Intelligence) circles.

Lisp is an extremely powerful and sophisticated language, but can be rather large and unwieldy.

Various implementations of Lisp that can run on UNIX systems are available in the Ports Collection for FreeBSD. GNU Common Lisp can be found as lang/gcl. CLISP by Bruno Haible and Michael Stoll is available as lang/clisp. For CMUCL, which includes a highly-optimizing compiler too, or simpler Lisp implementations like SLisp, which implements most of the Common Lisp constructs in a few hundred lines of C code, lang/cmucl and lang/slisp are available respectively.

Perl

Very popular with system administrators for writing scripts; also often used on World Wide Web servers for writing CGI scripts.

Perl is available in the Ports Collection as lang/perl5 for all FreeBSD releases, and is installed as /usr/bin/perl in the base system 4.X releases.

Scheme

A dialect of Lisp that is rather more compact and cleaner than Common Lisp. Popular in Universities as it is simple enough to teach to undergraduates as a first language, while it has a high enough level of abstraction to be used in research work.

Scheme is available from the Ports Collection as lang/elk for the Elk Scheme Interpreter. The MIT Scheme Interpreter can be found in lang/mit-scheme and the SCM Scheme Interpreter in lang/scm.

Icon

Icon is a high-level language with extensive facilities for processing strings and structures. The version of Icon for FreeBSD can be found in the Ports Collection as lang/icon.

Logo

Logo is a language that is easy to learn, and has been used as an introductory programming language in various courses. It is an excellent tool for teaching programming to young children, as it makes the creation of elaborate geometric shapes an easy task even for very small children.

The latest version of Logo for FreeBSD is available from the Ports Collection in lang/logo.

Python

Python is an Object-Oriented, interpreted language. Its advocates argue that it is one of the best languages to start programming with, since it is relatively easy to start with, but is not limited in comparison to other popular interpreted languages that are used for the development of large, complex applications (Perl and Tcl are two other languages that are popular for such tasks).

The latest version of Python is available from the Ports Collection in lang/python.

Ruby

Ruby is an interpreted, pure object-oriented programming language. It has become widely popular because of its easy-to-understand syntax, flexibility when writing code, and the ability to easily develop and maintain large, complex programs.

Ruby is available from the Ports Collection as lang/ruby18.

Tcl and Tk

Tcl is an embeddable, interpreted language that has become widely used, mostly because of its portability to many platforms. It can be used both for quickly writing small prototype applications and (when combined with Tk, a GUI toolkit) for fully-fledged, featureful programs.

Various versions of Tcl are available as ports for FreeBSD. The latest version, Tcl 8.4, can be found in lang/tcl84.




2.3.3 Compilers

  Compilers are rather different. First of all, you write your code in a file (or files) using an editor. You then run the compiler and see if it accepts your program. If it did not compile, grit your teeth and go back to the editor; if it did compile and gave you a program, you can run it either at a shell command prompt or in a debugger to see if it works properly. [1]

  Obviously, this is not quite as direct as using an interpreter. However it allows you to do a lot of things which are very difficult or even impossible with an interpreter, such as writing code which interacts closely with the operating system──or even writing your own operating system! It is also useful if you need to write very efficient code, as the compiler can take its time and optimize the code, which would not be acceptable in an interpreter. Moreover, distributing a program written for a compiler is usually more straightforward than one written for an interpreter──you can just give them a copy of the executable, assuming they have the same operating system as you.

  Compiled languages include Pascal, C and C++. C and C++ are rather unforgiving languages, and best suited to more experienced programmers; Pascal, on the other hand, was designed as an educational language, and is quite a good language to start with. FreeBSD does not include Pascal support in the base system, but both GNU Pascal Compiler (GPC) and the Free Pascal Compiler are available in the ports collection as lang/gpc and lang/fpc.

  As the edit-compile-run-debug cycle is rather tedious when using separate programs, many commercial compiler makers have produced Integrated Development Environments (IDEs for short). FreeBSD does not include an IDE in the base system, but devel/kdevelop is available in the ports tree and many use Emacs for this purpose. Using Emacs as an IDE is discussed in Section 2.7.


2.4 Compiling with cc

  This section deals only with the GNU compiler for C and C++, since that comes with the base FreeBSD system. It can be invoked by either cc or gcc. The details of producing a program with an interpreter vary considerably between interpreters, and are usually well covered in the documentation and on-line help for the interpreter.

  Once you have written your masterpiece, the next step is to convert it into something that will (hopefully!) run on FreeBSD. This usually involves several steps, each of which is done by a separate program.

  1. Pre-process your source code to remove comments and do other tricks like expanding macros in C.

  2. Check the syntax of your code to see if you have obeyed the rules of the language. If you have not, it will complain!

  3. Convert the source code into assembly language──this is very close to machine code, but still understandable by humans. Allegedly. [2]

  4. Convert the assembly language into machine code──yep, we are talking bits and bytes, ones and zeros here.

  5. Check that you have used things like functions and global variables in a consistent way. For example, if you have called a non-existent function, it will complain.

  6. If you are trying to produce an executable from several source code files, work out how to fit them all together.

  7. Work out how to produce something that the system's run-time loader will be able to load into memory and run.

  8. Finally, write the executable on the filesystem.

  The word compiling is often used to refer to just steps 1 to 4──the others are referred to as linking. Sometimes step 1 is referred to as pre-processing and steps 3-4 as assembling.

  Fortunately, almost all this detail is hidden from you, as cc is a front end that manages calling all these programs with the right arguments for you; simply typing

% cc foobar.c

  will cause foobar.c to be compiled by all the steps above. If you have more than one file to compile, just do something like

% cc foo.c bar.c

  Note that the syntax checking is just that──checking the syntax. It will not check for any logical mistakes you may have made, like putting the program into an infinite loop, or using a bubble sort when you meant to use a binary sort. [3]

  There are lots and lots of options for cc, which are all in the manual page. Here are a few of the most important ones, with examples of how to use them.

-o filename

The output name of the file. If you do not use this option, cc will produce an executable called a.out. [4]

% cc foobar.c               executable is a.out
% cc -o foobar foobar.c     executable is foobar
       
-c

Just compile the file, do not link it. Useful for toy programs where you just want to check the syntax, or if you are using a Makefile.

% cc -c foobar.c
       

This will produce an object file (not an executable) called foobar.o. This can be linked together with other object files into an executable.

-g

Create a debug version of the executable. This makes the compiler put information into the executable about which line of which source file corresponds to which function call. A debugger can use this information to show the source code as you step through the program, which is very useful; the disadvantage is that all this extra information makes the program much bigger. Normally, you compile with -g while you are developing a program and then compile a “release version” without -g when you are satisfied it works properly.

% cc -g foobar.c
       

This will produce a debug version of the program. [5]

-O

Create an optimized version of the executable. The compiler performs various clever tricks to try to produce an executable that runs faster than normal. You can add a number after the -O to specify a higher level of optimization, but this often exposes bugs in the compiler's optimizer. For instance, the version of cc that comes with the 2.1.0 release of FreeBSD is known to produce bad code with the -O2 option in some circumstances.

Optimization is usually only turned on when compiling a release version.

% cc -O -o foobar foobar.c
       

This will produce an optimized version of foobar.

  The following three flags will force cc to check that your code complies with the relevant international standard, often referred to as the ANSI standard, though strictly speaking it is an ISO standard.

-Wall

Enable all the warnings which the authors of cc believe are worthwhile. Despite the name, it will not enable all the warnings cc is capable of.

-ansi

Turn off most, but not all, of the non-ANSI C features provided by cc. Despite the name, it does not guarantee strictly that your code will comply with the standard.

-pedantic

Turn off all cc's non-ANSI C features.

  Without these flags, cc will allow you to use some of its non-standard extensions to the standard. Some of these are very useful, but will not work with other compilers──in fact, one of the main aims of the standard is to allow people to write code that will work with any compiler on any system. This is known as portable code.

  Generally, you should try to make your code as portable as possible, as otherwise you may have to completely rewrite the program later to get it to work somewhere else──and who knows what you may be using in a few years time?

% cc -Wall -ansi -pedantic -o foobar foobar.c

  This will produce an executable foobar after checking foobar.c for standard compliance.

-llibrary

Specify a function library to be used at link time.

The most common example of this is when compiling a program that uses some of the mathematical functions in C. Unlike most other platforms, these are in a separate library from the standard C one and you have to tell the compiler to add it.

The rule is that if the library is called libsomething.a, you give cc the argument -lsomething. For example, the math library is libm.a, so you give cc the argument -lm. A common “gotcha” with the math library is that it has to be the last library on the command line.

% cc -o foobar foobar.c -lm
       

This will link the math library functions into foobar.

If you are compiling C++ code, you need to add -lg++, or -lstdc++ if you are using FreeBSD 2.2 or later, to the command line argument to link the C++ library functions. Alternatively, you can run c++ instead of cc, which does this for you. c++ can also be invoked as g++ on FreeBSD.

% cc -o foobar foobar.cc -lg++     For FreeBSD 2.1.6 and earlier
% cc -o foobar foobar.cc -lstdc++  For FreeBSD 2.2 and later
% c++ -o foobar foobar.cc
       

Each of these will produce an executable foobar from the C++ source file foobar.cc. Note that, on UNIX systems, C++ source files traditionally end in .C, .cxx or .cc, rather than the MS-DOS® style .cpp (which was already used for something else). gcc used to rely on this to work out what kind of compiler to use on the source file; however, this restriction no longer applies, so you may now call your C++ files .cpp with impunity!


2.4.1 Common cc Queries and Problems

2.4.1.1. I am trying to write a program which uses the sin() function and I get an error like this. What does it mean?
2.4.1.2. All right, I wrote this simple program to practice using -lm. All it does is raise 2.1 to the power of 6.
2.4.1.3. So how do I fix this?
2.4.1.4. I compiled a file called foobar.c and I cannot find an executable called foobar. Where has it gone?
2.4.1.5. OK, I have an executable called foobar, I can see it when I run ls, but when I type in foobar at the command prompt it tells me there is no such file. Why can it not find it?
2.4.1.6. I called my executable test, but nothing happens when I run it. What is going on?
2.4.1.7. I compiled my program and it seemed to run all right at first, then there was an error and it said something about “core dumped”. What does that mean?
2.4.1.8. Fascinating stuff, but what am I supposed to do now?
2.4.1.9. When my program dumped core, it said something about a “segmentation fault”. What is that?
2.4.1.10. Sometimes when I get a core dump it says “bus error”. It says in my UNIX book that this means a hardware problem, but the computer still seems to be working. Is this true?
2.4.1.11. This dumping core business sounds as though it could be quite useful, if I can make it happen when I want to. Can I do this, or do I have to wait until there is an error?

2.4.1.1. I am trying to write a program which uses the sin() function and I get an error like this. What does it mean?

/var/tmp/cc0143941.o: Undefined symbol `_sin' referenced from text segment
         

When using mathematical functions like sin(), you have to tell cc to link in the math library, like so:

% cc -o foobar foobar.c -lm
         

2.4.1.2. All right, I wrote this simple program to practice using -lm. All it does is raise 2.1 to the power of 6.

#include <stdio.h>

int main() {
    float f;

    f = pow(2.1, 6);
    printf("2.1 ^ 6 = %f\n", f);
    return 0;
}
         

and I compiled it as:

% cc temp.c -lm
         

like you said I should, but I get this when I run it:

% ./a.out
2.1 ^ 6 = 1023.000000
         

This is not the right answer! What is going on?

When the compiler sees you call a function, it checks if it has already seen a prototype for it. If it has not, it assumes the function returns an int, which is definitely not what you want here.

2.4.1.3. So how do I fix this?

The prototypes for the mathematical functions are in math.h. If you include this file, the compiler will be able to find the prototype and it will stop doing strange things to your calculation!

#include <math.h>
#include <stdio.h>

int main() {
...
         

After recompiling it as you did before, run it:

% ./a.out
2.1 ^ 6 = 85.766121
         

If you are using any of the mathematical functions, always include math.h and remember to link in the math library.

2.4.1.4. I compiled a file called foobar.c and I cannot find an executable called foobar. Where has it gone?

Remember, cc will call the executable a.out unless you tell it differently. Use the -o filename option:

% cc -o foobar foobar.c
         

2.4.1.5. OK, I have an executable called foobar, I can see it when I run ls, but when I type in foobar at the command prompt it tells me there is no such file. Why can it not find it?

Unlike MS-DOS, UNIX does not look in the current directory when it is trying to find out which executable you want it to run, unless you tell it to. Either type ./foobar, which means “run the file called foobar in the current directory”, or change your PATH environment variable so that it looks something like

/bin:/usr/bin:/usr/local/bin:.
         

The dot at the end means “look in the current directory if it is not in any of the others”.

2.4.1.6. I called my executable test, but nothing happens when I run it. What is going on?

Most UNIX systems have a program called test in /usr/bin and the shell is picking that one up before it gets to checking the current directory. Either type:

% ./test
         

or choose a better name for your program!

2.4.1.7. I compiled my program and it seemed to run all right at first, then there was an error and it said something about “core dumped”. What does that mean?

The name core dump dates back to the very early days of UNIX, when the machines used core memory for storing data. Basically, if the program failed under certain conditions, the system would write the contents of core memory to disk in a file called core, which the programmer could then pore over to find out what went wrong.

2.4.1.8. Fascinating stuff, but what am I supposed to do now?

Use gdb to analyze the core (see Section 2.6).

2.4.1.9. When my program dumped core, it said something about a “segmentation fault”. What is that?

This basically means that your program tried to perform some sort of illegal operation on memory; UNIX is designed to protect the operating system and other programs from rogue programs.

Common causes for this are:

  • Trying to write to a NULL pointer, eg

    char *foo = NULL;
    strcpy(foo, "bang!");
           
    
  • Using a pointer that has not been initialized, eg

    char *foo;
    strcpy(foo, "bang!");
           
    

    The pointer will have some random value that, with luck, will point into an area of memory that is not available to your program and the kernel will kill your program before it can do any damage. If you are unlucky, it will point somewhere inside your own program and corrupt one of your data structures, causing the program to fail mysteriously.

  • Trying to access past the end of an array, eg

    int bar[20];
    bar[27] = 6;
           
    
  • Trying to store something in read-only memory, eg

    char *foo = "My string";
    strcpy(foo, "bang!");
           
    

    UNIX compilers often put string literals like "My string" into read-only areas of memory.

  • Doing naughty things with malloc() and free(), eg

    char bar[80];
    free(bar);
           
    

    or

    char *foo = malloc(27);
    free(foo);
    free(foo);
           
    

Making one of these mistakes will not always lead to an error, but it is always bad practice. Some systems and compilers are more tolerant than others, which is why programs that run well on one system can crash when you try them on another.

2.4.1.10. Sometimes when I get a core dump it says “bus error”. It says in my UNIX book that this means a hardware problem, but the computer still seems to be working. Is this true?

No, fortunately not (unless of course you really do have a hardware problem...). This is usually another way of saying that you accessed memory in a way you should not have.

2.4.1.11. This dumping core business sounds as though it could be quite useful, if I can make it happen when I want to. Can I do this, or do I have to wait until there is an error?

Yes, just go to another console or xterm, do

% ps
       

to find out the process ID of your program, and do

% kill -ABRT pid
       

where pid is the process ID you looked up.

This is useful if your program has got stuck in an infinite loop, for instance. If your program happens to trap SIGABRT, there are several other signals which have a similar effect.

Alternatively, you can create a core dump from inside your program, by calling the abort() function. See the manual page of abort(3) to learn more.

If you want to create a core dump from outside your program, but do not want the process to terminate, you can use the gcore program. See the manual page of gcore(1) for more information.


2.5 Make

2.5.1 What is make?

  When you are working on a simple program with only one or two source files, typing in

% cc file1.c file2.c

  is not too bad, but it quickly becomes very tedious when there are several files──and it can take a while to compile, too.

  One way to get around this is to use object files and only recompile the source file if the source code has changed. So we could have something like:

% cc file1.o file2.o ... file37.c ...

  if we had changed file37.c, but not any of the others, since the last time we compiled. This may speed up the compilation quite a bit, but does not solve the typing problem.

  Or we could write a shell script to solve the typing problem, but it would have to re-compile everything, making it very inefficient on a large project.

  What happens if we have hundreds of source files lying about? What if we are working in a team with other people who forget to tell us when they have changed one of their source files that we use?

  Perhaps we could put the two solutions together and write something like a shell script that would contain some kind of magic rule saying when a source file needs compiling. All we need now is a program that can understand these rules, as it is a bit too complicated for the shell.

  This program is called make. It reads in a file, called a makefile, that tells it how different files depend on each other, and works out which files need to be re-compiled and which ones do not. For example, a rule could say something like “if fromboz.o is older than fromboz.c, that means someone must have changed fromboz.c, so it needs to be re-compiled.” The makefile also has rules telling make how to re-compile the source file, making it a much more powerful tool.

  Makefiles are typically kept in the same directory as the source they apply to, and can be called makefile, Makefile or MAKEFILE. Most programmers use the name Makefile, as this puts it near the top of a directory listing, where it can easily be seen. [6]

2.5.2 Example of using make

  Here is a very simple make file:

foo: foo.c
    cc -o foo foo.c

  It consists of two lines, a dependency line and a creation line.

  The dependency line here consists of the name of the program (known as the target), followed by a colon, then whitespace, then the name of the source file. When make reads this line, it looks to see if foo exists; if it exists, it compares the time foo was last modified to the time foo.c was last modified. If foo does not exist, or is older than foo.c, it then looks at the creation line to find out what to do. In other words, this is the rule for working out when foo.c needs to be re-compiled.

  The creation line starts with a tab (press the tab key) and then the command you would type to create foo if you were doing it at a command prompt. If foo is out of date, or does not exist, make then executes this command to create it. In other words, this is the rule which tells make how to re-compile foo.c.

  So, when you type make, it will make sure that foo is up to date with respect to your latest changes to foo.c. This principle can be extended to Makefiles with hundreds of targets──in fact, on FreeBSD, it is possible to compile the entire operating system just by typing make world in the appropriate directory!

  Another useful property of makefiles is that the targets do not have to be programs. For instance, we could have a make file that looks like this:

foo: foo.c
    cc -o foo foo.c

install:
    cp foo /home/me

  We can tell make which target we want to make by typing:

% make target

  make will then only look at that target and ignore any others. For example, if we type make foo with the makefile above, make will ignore the install target.

  If we just type make on its own, make will always look at the first target and then stop without looking at any others. So if we type make here, it will just go to the foo target, re-compile foo if necessary, and then stop without going on to the install target.

  Notice that the install target does not actually depend on anything! This means that the command on the following line is always executed when we try to make that target by typing make install. In this case, it will copy foo into the user's home directory. This is often used by application makefiles, so that the application can be installed in the correct directory when it has been correctly compiled.

  This is a slightly confusing subject to try to explain. If you do not quite understand how make works, the best thing to do is to write a simple program like “hello world” and a make file like the one above and experiment. Then progress to using more than one source file, or having the source file include a header file. The touch command is very useful here──it changes the date on a file without you having to edit it.


2.5.3 Make and include-files

  C code often starts with a list of files to include, for example stdio.h. Some of these files are system-include files, some of them are from the project you are now working on:

#include <stdio.h>
#include "foo.h"

int main(....

  To make sure that this file is recompiled the moment foo.h is changed, you have to add it in your Makefile:

foo: foo.c foo.h

  As your project grows and you have more and more include-files of your own to maintain, it becomes a pain to keep track of all the include-files and the files that depend on them. If you change an include-file but forget to recompile all the files that depend on it, the results can be devastating. gcc has an option to analyze your files and to produce a list of include-files and their dependencies: -MM.

  If you add this to your Makefile:

depend:
    gcc -E -MM *.c > .depend

  and run make depend, the file .depend will appear with a list of object-files, C-files and the include-files:

foo.o: foo.c foo.h

  If you change foo.h, next time you run make all files depending on foo.h will be recompiled.

  Do not forget to run make depend each time you add an include-file to one of your files.


2.5.4 FreeBSD Makefiles

  Makefiles can be rather complicated to write. Fortunately, BSD-based systems like FreeBSD come with some very powerful ones as part of the system. One very good example of this is the FreeBSD ports system. Here is the essential part of a typical ports Makefile:

MASTER_SITES=   ftp://freefall.cdrom.com/pub/FreeBSD/LOCAL_PORTS/
DISTFILES=      scheme-microcode+dist-7.3-freebsd.tgz

.include <bsd.port.mk>

  Now, if we go to the directory for this port and type make, the following happens:

  1. A check is made to see if the source code for this port is already on the system.

  2. If it is not, an FTP connection to the URL in MASTER_SITES is set up to download the source.

  3. The checksum for the source is calculated and compared with one for a known, good copy of the source. This is to make sure that the source was not corrupted while in transit.

  4. Any changes required to make the source work on FreeBSD are applied──this is known as patching.

  5. Any special configuration needed for the source is done. (Many UNIX program distributions try to work out which version of UNIX they are being compiled on and which optional UNIX features are present──this is where they are given the information in the FreeBSD ports scenario).

  6. The source code for the program is compiled. In effect, we change to the directory where the source was unpacked and do make──the program's own make file has the necessary information to build the program.

  7. We now have a compiled version of the program. If we wish, we can test it now; when we feel confident about the program, we can type make install. This will cause the program and any supporting files it needs to be copied into the correct location; an entry is also made into a package database, so that the port can easily be uninstalled later if we change our mind about it.

  Now I think you will agree that is rather impressive for a four line script!

  The secret lies in the last line, which tells make to look in the system makefile called bsd.port.mk. It is easy to overlook this line, but this is where all the clever stuff comes from──someone has written a makefile that tells make to do all the things above (plus a couple of other things I did not mention, including handling any errors that may occur) and anyone can get access to that just by putting a single line in their own make file!

  If you want to have a look at these system makefiles, they are in /usr/share/mk, but it is probably best to wait until you have had a bit of practice with makefiles, as they are very complicated (and if you do look at them, make sure you have a flask of strong coffee handy!)


2.5.5 More advanced uses of make

  Make is a very powerful tool, and can do much more than the simple example above shows. Unfortunately, there are several different versions of make, and they all differ considerably. The best way to learn what they can do is probably to read the documentation──hopefully this introduction will have given you a base from which you can do this.

  The version of make that comes with FreeBSD is the Berkeley make; there is a tutorial for it in /usr/share/doc/psd/12.make. To view it, do

% zmore paper.ascii.gz

  in that directory.

  Many applications in the ports use GNU make, which has a very good set of “info” pages. If you have installed any of these ports, GNU make will automatically have been installed as gmake. It is also available as a port and package in its own right.

  To view the info pages for GNU make, you will have to edit the dir file in the /usr/local/info directory to add an entry for it. This involves adding a line like

 * Make: (make).                 The GNU Make utility.

  to the file. Once you have done this, you can type info and then select make from the menu (or in Emacs, do C-h i).


2.6 Debugging

2.6.1 The Debugger

  The debugger that comes with FreeBSD is called gdb (GNU debugger). You start it up by typing

% gdb progname

  although most people prefer to run it inside Emacs. You can do this by:

M-x gdb RET progname RET

  Using a debugger allows you to run the program under more controlled circumstances. Typically, you can step through the program a line at a time, inspect the value of variables, change them, tell the debugger to run up to a certain point and then stop, and so on. You can even attach to a program that is already running, or load a core file to investigate why the program crashed. It is even possible to debug the kernel, though that is a little trickier than the user applications we will be discussing in this section.

  gdb has quite good on-line help, as well as a set of info pages, so this section will concentrate on a few of the basic commands.

  Finally, if you find its text-based command-prompt style off-putting, there is a graphical front-end for it (xxgdb) in the ports collection.

  This section is intended to be an introduction to using gdb and does not cover specialized topics such as debugging the kernel.


2.6.2 Running a program in the debugger

  You will need to have compiled the program with the -g option to get the most out of using gdb. It will work without, but you will only see the name of the function you are in, instead of the source code. If you see a line like:

... (no debugging symbols found) ...

  when gdb starts up, you will know that the program was not compiled with the -g option.

  At the gdb prompt, type break main. This will tell the debugger to skip over the preliminary set-up code in the program and start at the beginning of your code. Now type run to start the program──it will start at the beginning of the set-up code and then get stopped by the debugger when it calls main(). (If you have ever wondered where main() gets called from, now you know!).

  You can now step through the program, a line at a time, by pressing n. If you get to a function call, you can step into it by pressing s. Once you are in a function call, you can run to the end of it and return to the caller by typing finish (fin for short; f on its own is short for frame). You can also use up and down to take a quick look at the caller.

  Here is a simple example of how to spot a mistake in a program with gdb. This is our program (with a deliberate mistake):

#include <stdio.h>

int bazz(int anint);

int main(void) {
    int i;

    printf("This is my program\n");
    bazz(i);
    return 0;
}

int bazz(int anint) {
    printf("You gave me %d\n", anint);
    return anint;
}

  This program is meant to set i to 5 and pass it to a function bazz(), which prints out the number we gave it.

  When we compile and run the program we get

% cc -g -o temp temp.c
% ./temp
This is my program
You gave me 4231

  That was not what we expected! Time to see what is going on!

% gdb temp
GDB is free software and you are welcome to distribute copies of it
 under certain conditions; type "show copying" to see the conditions.
There is absolutely no warranty for GDB; type "show warranty" for details.
GDB 4.13 (i386-unknown-freebsd), Copyright 1994 Free Software Foundation, Inc.
(gdb) break main               Skip the set-up code
Breakpoint 1 at 0x160f: file temp.c, line 9.    gdb puts breakpoint at main()
(gdb) run                   Run as far as main()
Starting program: /home/james/tmp/temp      Program starts running

Breakpoint 1, main () at temp.c:9       gdb stops at main()
(gdb) n                       Go to next line
This is my program              Program prints out
(gdb) s                       step into bazz()
bazz (anint=4231) at temp.c:17          gdb displays stack frame
(gdb)

  Hang on a minute! How did anint get to be 4231? Did we not set it to be 5 in main()? Let's move up to main() and have a look.

(gdb) up                   Move up call stack
#1  0x1625 in main () at temp.c:11      gdb displays stack frame
(gdb) p i                   Show us the value of i
$1 = 4231                   gdb displays 4231

  Oh dear! Looking at the code, we forgot to initialize i. We meant to put

...
int main(void) {
    int i;

    i = 5;
    printf("This is my program\n");
...

  but we left the i=5; line out. As we did not initialize i, it had whatever number happened to be in that area of memory when the program ran, which in this case happened to be 4231.

Note: gdb displays the stack frame every time we go into or out of a function, even if we are using up and down to move around the call stack. This shows the name of the function and the values of its arguments, which helps us keep track of where we are and what is going on. (The stack is a storage area where the program stores information about the arguments passed to functions and where to go when it returns from a function call).


2.6.3 Examining a core file

  A core file is basically a file which contains the complete state of the process when it crashed. In “the good old days”, programmers had to print out hex listings of core files and sweat over machine code manuals, but now life is a bit easier. Incidentally, under FreeBSD and other 4.4BSD systems, a core file is called progname.core instead of just core, to make it clearer which program a core file belongs to.

  To examine a core file, start up gdb in the usual way. Instead of typing break or run, type

(gdb) core progname.core

  If you are not in the same directory as the core file, you will have to do dir /path/to/core/file first.

  You should see something like this:

% gdb a.out
GDB is free software and you are welcome to distribute copies of it
 under certain conditions; type "show copying" to see the conditions.
There is absolutely no warranty for GDB; type "show warranty" for details.
GDB 4.13 (i386-unknown-freebsd), Copyright 1994 Free Software Foundation, Inc.
(gdb) core a.out.core
Core was generated by `a.out'.
Program terminated with signal 11, Segmentation fault.
Cannot access memory at address 0x7020796d.
#0  0x164a in bazz (anint=0x5) at temp.c:17
(gdb)

  In this case, the program was called a.out, so the core file is called a.out.core. We can see that the program crashed due to trying to access an area in memory that was not available to it in a function called bazz.

  Sometimes it is useful to be able to see how a function was called, as the problem could have occurred a long way up the call stack in a complex program. The bt command causes gdb to print out a back-trace of the call stack:

(gdb) bt
#0  0x164a in bazz (anint=0x5) at temp.c:17
#1  0xefbfd888 in end ()
#2  0x162c in main () at temp.c:11
(gdb)

  The end() function is called when a program crashes; in this case, the bazz() function was called from main().


2.6.4 Attaching to a running program

  One of the neatest features about gdb is that it can attach to a program that is already running. Of course, that assumes you have sufficient permissions to do so. A common problem is when you are stepping through a program that forks, and you want to trace the child, but the debugger will only let you trace the parent.

  What you do is start up another gdb, use ps to find the process ID for the child, and do

(gdb) attach pid

  in gdb, and then debug as usual.

  “That is all very well,” you are probably thinking, “but by the time I have done that, the child process will be over the hill and far away”. Fear not, gentle reader, here is how to do it (courtesy of the gdb info pages):

...
if ((pid = fork()) < 0)     /* _Always_ check this */
    error();
else if (pid == 0) {        /* child */
    int PauseMode = 1;

    while (PauseMode)
        sleep(10);  /* Wait until someone attaches to us */
    ...
} else {            /* parent */
    ...

  Now all you have to do is attach to the child, set PauseMode to 0, and wait for the sleep() call to return!


2.7 Using Emacs as a Development Environment

2.7.1 Emacs

  Unfortunately, UNIX systems do not come with the kind of everything-you-ever-wanted-and-lots-more-you-did-not-in-one-gigantic-package integrated development environments that other systems have. [7] However, it is possible to set up your own environment. It may not be as pretty, and it may not be quite as integrated, but you can set it up the way you want it. And it is free. And you have the source to it.

  The key to it all is Emacs. Now there are some people who loathe it, but many who love it. If you are one of the former, I am afraid this section will hold little of interest to you. Also, you will need a fair amount of memory to run it──I would recommend 8MB in text mode and 16MB in X as the bare minimum to get reasonable performance.

  Emacs is basically a highly customizable editor──indeed, it has been customized to the point where it is more like an operating system than an editor! Many developers and sysadmins do in fact spend practically all their time working inside Emacs, leaving it only to log out.

  It is impossible even to summarize everything Emacs can do here, but here are some of the features of interest to developers:

  • Very powerful editor, allowing search-and-replace on both strings and regular expressions (patterns), jumping to start/end of block expression, etc, etc.

  • Pull-down menus and online help.

  • Language-dependent syntax highlighting and indentation.

  • Completely customizable.

  • You can compile and debug programs within Emacs.

  • On a compilation error, you can jump to the offending line of source code.

  • Friendly-ish front-end to the info program used for reading GNU hypertext documentation, including the documentation on Emacs itself.

  • Friendly front-end to gdb, allowing you to look at the source code as you step through your program.

  • You can read Usenet news and mail while your program is compiling.

  And doubtless many more that I have overlooked.

  Emacs can be installed on FreeBSD using the Emacs port.

  Once it is installed, start it up and do C-h t to read an Emacs tutorial (that means hold down the control key, press h, let go of the control key, and then press t). Alternatively, you can use the mouse to select Emacs Tutorial from the Help menu.

  Although Emacs does have menus, it is well worth learning the key bindings, as it is much quicker when you are editing something to press a couple of keys than to try to find the mouse and then click on the right place. And, when you are talking to seasoned Emacs users, you will find they often casually throw around expressions like “M-x replace-s RET foo RET bar RET” so it is useful to know what they mean. And in any case, Emacs has far too many useful functions for them to all fit on the menu bars.

  Fortunately, it is quite easy to pick up the key-bindings, as they are displayed next to the menu item. My advice is to use the menu item for, say, opening a file until you understand how it works and feel confident with it, then try doing C-x C-f. When you are happy with that, move on to another menu command.

  If you can not remember what a particular combination of keys does, select Describe Key from the Help menu and type it in──Emacs will tell you what it does. You can also use the Command Apropos menu item to find out all the commands which contain a particular word in them, with the key binding next to it.

  By the way, the expression above means hold down the Meta key, press x, release the Meta key, type replace-s (short for replace-string──another feature of Emacs is that you can abbreviate commands), press the return key, type foo (the string you want replaced), press the return key, type bar (the string you want to replace foo with) and press return again. Emacs will then do the search-and-replace operation you have just requested.

  If you are wondering what on earth the Meta key is, it is a special key that many UNIX workstations have. Unfortunately, PCs do not have one, so it is usually the alt key (or if you are unlucky, the escape key).

  Oh, and to get out of Emacs, do C-x C-c (that means hold down the control key, press x, press c and release the control key). If you have any unsaved files open, Emacs will ask you if you want to save them. (Ignore the bit in the documentation where it says C-z is the usual way to leave Emacs──that leaves Emacs hanging around in the background, and is only really useful if you are on a system which does not have virtual terminals).


2.7.2 Configuring Emacs

  Emacs does many wonderful things; some of them are built in, some of them need to be configured.

  Instead of using a proprietary macro language for configuration, Emacs uses a version of Lisp specially adapted for editors, known as Emacs Lisp. Working with Emacs Lisp can be quite helpful if you want to go on and learn something like Common Lisp. Emacs Lisp has many features of Common Lisp, although it is considerably smaller (and thus easier to master).

  The best way to learn Emacs Lisp is to download the Emacs Tutorial.

  However, there is no need to actually know any Lisp to get started with configuring Emacs, as I have included a sample .emacs file, which should be enough to get you started. Just copy it into your home directory and restart Emacs if it is already running; it will read the commands from the file and (hopefully) give you a useful basic setup.


2.7.3 A sample .emacs file

  Unfortunately, there is far too much here to explain it in detail; however there are one or two points worth mentioning.

  • Everything beginning with a ; is a comment and is ignored by Emacs.

  • In the first line, the -*-Emacs-Lisp-*- is so that we can edit the .emacs file itself within Emacs and get all the fancy features for editing Emacs Lisp. Emacs usually tries to guess this based on the filename, and may not get it right for .emacs.

  • The tab key is bound to an indentation function in some modes, so when you press the tab key, it will indent the current line of code. If you want to put a tab character in whatever you are writing, hold the control key down while you are pressing the tab key.

  • This file supports syntax highlighting for C, C++, Perl, Lisp and Scheme, by guessing the language from the filename.

  • Emacs already has a pre-defined function called next-error. In a compilation output window, this allows you to move from one compilation error to the next by doing M-n; we define a complementary function, previous-error, that allows you to go to a previous error by doing M-p. The nicest feature of all is that C-c C-c will open up the source file in which the error occurred and jump to the appropriate line.

  • We enable Emacs's ability to act as a server, so that if you are doing something outside Emacs and you want to edit a file, you can just type in

    % emacsclient filename

    and then you can edit the file in your Emacs! [8]

Example 2-1. A sample .emacs file

;; -*-Emacs-Lisp-*-

;; This file is designed to be re-evaled; use the variable first-time
;; to avoid any problems with this.
(defvar first-time t
  "Flag signifying this is the first time that .emacs has been evaled")

;; Meta
(global-set-key "\M- " 'set-mark-command)
(global-set-key "\M-\C-h" 'backward-kill-word)
(global-set-key "\M-\C-r" 'query-replace)
(global-set-key "\M-r" 'replace-string)
(global-set-key "\M-g" 'goto-line)
(global-set-key "\M-h" 'help-command)

;; Function keys
(global-set-key [f1] 'manual-entry)
(global-set-key [f2] 'info)
(global-set-key [f3] 'repeat-complex-command)
(global-set-key [f4] 'advertised-undo)
(global-set-key [f5] 'eval-current-buffer)
(global-set-key [f6] 'buffer-menu)
(global-set-key [f7] 'other-window)
(global-set-key [f8] 'find-file)
(global-set-key [f9] 'save-buffer)
(global-set-key [f10] 'next-error)
(global-set-key [f11] 'compile)
(global-set-key [f12] 'grep)
(global-set-key [C-f1] 'compile)
(global-set-key [C-f2] 'grep)
(global-set-key [C-f3] 'next-error)
(global-set-key [C-f4] 'previous-error)
(global-set-key [C-f5] 'display-faces)
(global-set-key [C-f8] 'dired)
(global-set-key [C-f10] 'kill-compilation)

;; Keypad bindings
(global-set-key [up] "\C-p")
(global-set-key [down] "\C-n")
(global-set-key [left] "\C-b")
(global-set-key [right] "\C-f")
(global-set-key [home] "\C-a")
(global-set-key [end] "\C-e")
(global-set-key [prior] "\M-v")
(global-set-key [next] "\C-v")
(global-set-key [C-up] "\M-\C-b")
(global-set-key [C-down] "\M-\C-f")
(global-set-key [C-left] "\M-b")
(global-set-key [C-right] "\M-f")
(global-set-key [C-home] "\M-<")
(global-set-key [C-end] "\M->")
(global-set-key [C-prior] "\M-<")
(global-set-key [C-next] "\M->")

;; Mouse
(global-set-key [mouse-3] 'imenu)

;; Misc
(global-set-key [C-tab] "\C-q\t")   ; Control tab quotes a tab.
(setq backup-by-copying-when-mismatch t)

;; Treat 'y' or <CR> as yes, 'n' as no.
(fset 'yes-or-no-p 'y-or-n-p)
(define-key query-replace-map [return] 'act)
(define-key query-replace-map [?\C-m] 'act)

;; Load packages
(require 'desktop)
(require 'tar-mode)

;; Pretty diff mode
(autoload 'ediff-buffers "ediff" "Intelligent Emacs interface to diff" t)
(autoload 'ediff-files "ediff" "Intelligent Emacs interface to diff" t)
(autoload 'ediff-files-remote "ediff"
  "Intelligent Emacs interface to diff")

(if first-time
    (setq auto-mode-alist
      (append '(("\\.cpp$" . c++-mode)
            ("\\.hpp$" . c++-mode)
            ("\\.lsp$" . lisp-mode)
            ("\\.scm$" . scheme-mode)
            ("\\.pl$" . perl-mode)
            ) auto-mode-alist)))

;; Auto font lock mode
(defvar font-lock-auto-mode-list
  (list 'c-mode 'c++-mode 'c++-c-mode 'emacs-lisp-mode 'lisp-mode 'perl-mode 'scheme-mode)
  "List of modes to always start in font-lock-mode")

(defvar font-lock-mode-keyword-alist
  '((c++-c-mode . c-font-lock-keywords)
    (perl-mode . perl-font-lock-keywords))
  "Associations between modes and keywords")

(defun font-lock-auto-mode-select ()
  "Automatically select font-lock-mode if the current major mode is in font-lock-auto-mode-list"
  (if (memq major-mode font-lock-auto-mode-list)
      (progn
    (font-lock-mode t))
    )
  )

(global-set-key [M-f1] 'font-lock-fontify-buffer)

;; New dabbrev stuff
;(require 'new-dabbrev)
(setq dabbrev-always-check-other-buffers t)
(setq dabbrev-abbrev-char-regexp "\\sw\\|\\s_")
(add-hook 'emacs-lisp-mode-hook
      '(lambda ()
         (set (make-local-variable 'dabbrev-case-fold-search) nil)
         (set (make-local-variable 'dabbrev-case-replace) nil)))
(add-hook 'c-mode-hook
      '(lambda ()
         (set (make-local-variable 'dabbrev-case-fold-search) nil)
         (set (make-local-variable 'dabbrev-case-replace) nil)))
(add-hook 'text-mode-hook
      '(lambda ()
         (set (make-local-variable 'dabbrev-case-fold-search) t)
         (set (make-local-variable 'dabbrev-case-replace) t)))

;; C++ and C mode...
(defun my-c++-mode-hook ()
  (setq tab-width 4)
  (define-key c++-mode-map "\C-m" 'reindent-then-newline-and-indent)
  (define-key c++-mode-map "\C-ce" 'c-comment-edit)
  (setq c++-auto-hungry-initial-state 'none)
  (setq c++-delete-function 'backward-delete-char)
  (setq c++-tab-always-indent t)
  (setq c-indent-level 4)
  (setq c-continued-statement-offset 4)
  (setq c++-empty-arglist-indent 4))

(defun my-c-mode-hook ()
  (setq tab-width 4)
  (define-key c-mode-map "\C-m" 'reindent-then-newline-and-indent)
  (define-key c-mode-map "\C-ce" 'c-comment-edit)
  (setq c-auto-hungry-initial-state 'none)
  (setq c-delete-function 'backward-delete-char)
  (setq c-tab-always-indent t)
;; BSD-ish indentation style
  (setq c-indent-level 4)
  (setq c-continued-statement-offset 4)
  (setq c-brace-offset -4)
  (setq c-argdecl-indent 0)
  (setq c-label-offset -4))

;; Perl mode
(defun my-perl-mode-hook ()
  (setq tab-width 4)
  (define-key perl-mode-map "\C-m" 'reindent-then-newline-and-indent)
  (setq perl-indent-level 4)
  (setq perl-continued-statement-offset 4))

;; Scheme mode...
(defun my-scheme-mode-hook ()
  (define-key scheme-mode-map "\C-m" 'reindent-then-newline-and-indent))

;; Emacs-Lisp mode...
(defun my-lisp-mode-hook ()
  (define-key lisp-mode-map "\C-m" 'reindent-then-newline-and-indent)
  (define-key lisp-mode-map "\C-i" 'lisp-indent-line)
  (define-key lisp-mode-map "\C-j" 'eval-print-last-sexp))

;; Add all of the hooks...
(add-hook 'c++-mode-hook 'my-c++-mode-hook)
(add-hook 'c-mode-hook 'my-c-mode-hook)
(add-hook 'scheme-mode-hook 'my-scheme-mode-hook)
(add-hook 'emacs-lisp-mode-hook 'my-lisp-mode-hook)
(add-hook 'lisp-mode-hook 'my-lisp-mode-hook)
(add-hook 'perl-mode-hook 'my-perl-mode-hook)

;; Complement to next-error
(defun previous-error (n)
  "Visit previous compilation error message and corresponding source code."
  (interactive "p")
  (next-error (- n)))

;; Misc...
(transient-mark-mode 1)
(setq mark-even-if-inactive t)
(setq visible-bell nil)
(setq next-line-add-newlines nil)
(setq compile-command "make")
(setq suggest-key-bindings nil)
(put 'eval-expression 'disabled nil)
(put 'narrow-to-region 'disabled nil)
(put 'set-goal-column 'disabled nil)
(if (>= emacs-major-version 21)
    (setq show-trailing-whitespace t))

;; Elisp archive searching
(autoload 'format-lisp-code-directory "lispdir" nil t)
(autoload 'lisp-dir-apropos "lispdir" nil t)
(autoload 'lisp-dir-retrieve "lispdir" nil t)
(autoload 'lisp-dir-verify "lispdir" nil t)

;; Font lock mode
(defun my-make-face (face color &optional bold)
  "Create a face from a color and optionally make it bold"
  (make-face face)
  (copy-face 'default face)
  (set-face-foreground face color)
  (if bold (make-face-bold face))
  )

(if (eq window-system 'x)
    (progn
      (my-make-face 'blue "blue")
      (my-make-face 'red "red")
      (my-make-face 'green "dark green")
      (setq font-lock-comment-face 'blue)
      (setq font-lock-string-face 'bold)
      (setq font-lock-type-face 'bold)
      (setq font-lock-keyword-face 'bold)
      (setq font-lock-function-name-face 'red)
      (setq font-lock-doc-string-face 'green)
      (add-hook 'find-file-hooks 'font-lock-auto-mode-select)

      (setq baud-rate 1000000)
      (global-set-key "\C-cmm" 'menu-bar-mode)
      (global-set-key "\C-cms" 'scroll-bar-mode)
      (global-set-key [backspace] 'backward-delete-char)
                    ;      (global-set-key [delete] 'delete-char)
      (standard-display-european t)
      (load-library "iso-transl")))

;; X11 or PC using direct screen writes
(if window-system
    (progn
      ;;      (global-set-key [M-f1] 'hilit-repaint-command)
      ;;      (global-set-key [M-f2] [?\C-u M-f1])
      (setq hilit-mode-enable-list
        '(not text-mode c-mode c++-mode emacs-lisp-mode lisp-mode
          scheme-mode)
        hilit-auto-highlight nil
        hilit-auto-rehighlight 'visible
        hilit-inhibit-hooks nil
        hilit-inhibit-rebinding t)
      (require 'hilit19)
      (require 'paren))
  (setq baud-rate 2400)         ; For slow serial connections
  )

;; TTY type terminal
(if (and (not window-system)
     (not (equal system-type 'ms-dos)))
    (progn
      (if first-time
      (progn
        (keyboard-translate ?\C-h ?\C-?)
        (keyboard-translate ?\C-? ?\C-h)))))

;; Under UNIX
(if (not (equal system-type 'ms-dos))
    (progn
      (if first-time
      (server-start))))

;; Add any face changes here
(add-hook 'term-setup-hook 'my-term-setup-hook)
(defun my-term-setup-hook ()
  (if (eq window-system 'pc)
      (progn
;;  (set-face-background 'default "red")
    )))

;; Restore the "desktop" - do this as late as possible
(if first-time
    (progn
      (desktop-load-default)
      (desktop-read)))

;; Indicate that this file has been read at least once
(setq first-time nil)

;; No need to debug anything now

(setq debug-on-error nil)

;; All done
(message "All done, %s%s" (user-login-name) ".")
   

2.7.4 Extending the Range of Languages Emacs Understands

  Now, this is all very well if you only want to program in the languages already catered for in the .emacs file (C, C++, Perl, Lisp and Scheme), but what happens if a new language called “whizbang” comes out, full of exciting features?

  The first thing to do is find out if whizbang comes with any files that tell Emacs about the language. These usually end in .el, short for “Emacs Lisp”. For example, if whizbang is a FreeBSD port, we can locate these files by doing

% find /usr/ports/lang/whizbang -name "*.el" -print

  and install them by copying them into the Emacs site Lisp directory. On FreeBSD 2.1.0-RELEASE, this is /usr/local/share/emacs/site-lisp.

  So for example, if the output from the find command was

/usr/ports/lang/whizbang/work/misc/whizbang.el

  we would do

# cp /usr/ports/lang/whizbang/work/misc/whizbang.el /usr/local/share/emacs/site-lisp

  Next, we need to decide what extension whizbang source files have. Let's say for the sake of argument that they all end in .wiz. We need to add an entry to our .emacs file to make sure Emacs will be able to use the information in whizbang.el.

  Find the auto-mode-alist entry in .emacs and add a line for whizbang, such as:

...
("\\.lsp$" . lisp-mode)
("\\.wiz$" . whizbang-mode)
("\\.scm$" . scheme-mode)
...

  This means that Emacs will automatically go into whizbang-mode when you edit a file ending in .wiz.

  Just below this, you will find the font-lock-auto-mode-list entry. Add whizbang-mode to it like so:

;; Auto font lock mode
(defvar font-lock-auto-mode-list
  (list 'c-mode 'c++-mode 'c++-c-mode 'emacs-lisp-mode 'whizbang-mode 'lisp-mode 'perl-mode 'scheme-mode)
  "List of modes to always start in font-lock-mode")

  This means that Emacs will always enable font-lock-mode (i.e., syntax highlighting) when editing a .wiz file.

  And that is all that is needed. If there is anything else you want done automatically when you open up a .wiz file, you can add a whizbang-mode hook (see my-scheme-mode-hook for a simple example that adds auto-indent).


2.8 Further Reading

  For information about setting up a development environment for contributing fixes to FreeBSD itself, please see development(7).

  • Brian Harvey and Matthew Wright Simply Scheme MIT 1994. ISBN 0-262-08226-8

  • Randall Schwartz Learning Perl O'Reilly 1993 ISBN 1-56592-042-2

  • Patrick Henry Winston and Berthold Klaus Paul Horn Lisp (3rd Edition) Addison-Wesley 1989 ISBN 0-201-08319-1

  • Brian W. Kernighan and Rob Pike The Unix Programming Environment Prentice-Hall 1984 ISBN 0-13-937681-X

  • Brian W. Kernighan and Dennis M. Ritchie The C Programming Language (2nd Edition) Prentice-Hall 1988 ISBN 0-13-110362-8

  • Bjarne Stroustrup The C++ Programming Language Addison-Wesley 1991 ISBN 0-201-53992-6

  • W. Richard Stevens Advanced Programming in the Unix Environment Addison-Wesley 1992 ISBN 0-201-56317-7

  • W. Richard Stevens Unix Network Programming Prentice-Hall 1990 ISBN 0-13-949876-1


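Chapter 3  Secure Programming

Contributed by Murray Stokely. Chinese translation by susn @NewSMTH.

3.1 Synopsis

  This chapter describes some of the security issues that have plagued UNIX programmers for decades, and some of the newer tools available to help programmers avoid writing exploitable code.


3.2 Secure Design Methodology

  Writing secure applications takes a careful and somewhat pessimistic outlook on life. Applications should run under the principle of “least privilege”, so that no process ever runs with more access than the bare minimum it needs to do its job. Previously tested code should be reused whenever possible, to avoid common mistakes that others may already have fixed.

  One of the pitfalls of the UNIX environment is how easy it is to make assumptions about the sanity of the environment. Applications should never trust user input (in any of its forms), system resources, inter-process communication, or the timing of events. UNIX processes do not execute synchronously, so logical operations are rarely atomic.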


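3.3 Buffer Overflows

  Buffer overflows have been around since the very beginnings of the von Neumann [1] architecture. They first gained widespread notoriety in 1988 with the Morris Internet worm. Unfortunately, the same basic attack remains effective today. Of the 17 CERT (Computer Emergency Response Team, Carnegie Mellon University) security advisories of 1999, 10 were directly caused by buffer-overflow bugs in software. By far the most common type of buffer overflow attack is based on corrupting the stack.

  Most modern computer systems use a stack to pass arguments to procedures and to store local variables. A stack is a last-in, first-out (LIFO) buffer in the high memory area of a process image. When a program invokes a function, a new “stack frame” is created. This stack frame consists of the arguments passed to the function as well as a dynamic amount of local variable space. The “stack pointer” holds the current location of the top of the stack. Since this value changes constantly as new values are pushed onto the stack, many implementations also provide a “frame pointer” located near the beginning of the stack frame, so that local variables can be addressed more easily relative to it. [1] The return address of a function call is also stored on the stack. This is what makes stack-overflow exploits possible: overflowing a local variable in a function can overwrite that function's return address, potentially allowing a malicious user to execute any code he or she wants.

  Although stack-based attacks are by far the most common, heap-based (malloc/free) attacks are also possible.

  The C programming language does not perform automatic bounds checking on arrays or pointers as many other languages do. In addition, the standard C library contains a handful of very dangerous functions.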

strcpy(char *dest, const char *src)

May overflow the dest buffer

strcat(char *dest, const char *src)

May overflow the dest buffer

getwd(char *buf)

May overflow the buf buffer

gets(char *s)

May overflow the s buffer

[vf]scanf(const char *format, ...)

May overflow its arguments

realpath(char *path, char resolved_path[])

May overflow the resolved_path buffer

[v]sprintf(char *str, const char *format, ...)

May overflow the str buffer
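
  As an illustration of avoiding one entry in the list above, the following minimal sketch replaces an unbounded sprintf call with snprintf and treats truncation as an error. The helper name format_greeting and the message text are hypothetical, chosen only for this example.

```c
#include <stdio.h>

/*
 * Hypothetical helper: formats "Hello, <name>!" into dst without ever
 * writing more than dstsize bytes.  Returns 0 on success, or -1 if the
 * arguments are invalid or the result did not fit (in the latter case
 * dst holds a truncated, NUL-terminated string).
 */
int format_greeting(char *dst, size_t dstsize, const char *name)
{
    int n;

    if (dst == NULL || dstsize == 0 || name == NULL)
        return -1;
    n = snprintf(dst, dstsize, "Hello, %s!", name);
    if (n < 0 || (size_t)n >= dstsize)
        return -1;              /* output error or truncation */
    return 0;
}
```

  The important part is checking the return value: snprintf reports the length the full output would have had, so a value greater than or equal to the buffer size means the text was cut short.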


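3.3.1 Example Buffer Overflow

  The following example code contains a buffer overflow designed to overwrite the return address and skip the instruction immediately following the function call. (Inspired by [5])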

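#include <stdio.h>
#include <string.h>

/* Deliberately unsafe: copies without checking the size of newbuffer. */
void manipulate(char *buffer) {
  char newbuffer[80];
  strcpy(newbuffer,buffer);   /* overflows when the input exceeds 79 bytes */
}

int main() {
  char ch,buffer[4096];
  int i=0;

  /* Read one line from standard input (also unbounded, part of the demo). */
  while ((buffer[i++] = getchar()) != '\n') {};

  i=1;
  manipulate(buffer);
  i=2;                        /* the statement the overflow tries to skip */
  printf("The value of i is : %d\n",i);
  return 0;
}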

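  Let us examine what the memory image of this process would look like if we typed 160 spaces into our little program before hitting return.

  [XXX figure here!]

  Obviously, more malicious input can be devised to execute actual compiled instructions (such as exec(/bin/sh)).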


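3.3.2 Avoiding Buffer Overflows

  The most straightforward solution to stack overflows is to always use length-restricted memory and string copy functions. strncpy and strncat are part of the standard C library. These functions take a length parameter, which should be no larger than the space available in the destination buffer, and copy at most that many bytes from the source to the destination. However, these functions have a number of problems. strncpy does not NUL-terminate the destination if the source is at least as long as the given length. The length parameter is also used inconsistently between strncpy and strncat (for strncat it is the number of bytes to append, not counting the terminating NUL), so it is easy for programmers to get confused about proper usage. There is also a significant performance loss compared to strcpy when copying a short string into a large buffer, because strncpy NUL-fills the destination up to the specified length.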
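
  The pitfalls just described can be made concrete with a small sketch. The two wrappers below (hypothetical names copy_truncating and append_truncating) show one way to call strncpy and strncat so that the destination always ends up NUL-terminated; they silently truncate, which may or may not be acceptable in a given program.

```c
#include <string.h>

/* Copy src into dst (capacity dstsize bytes), always NUL-terminating. */
void copy_truncating(char *dst, size_t dstsize, const char *src)
{
    if (dstsize == 0)
        return;
    strncpy(dst, src, dstsize - 1);     /* may leave dst unterminated... */
    dst[dstsize - 1] = '\0';            /* ...so terminate explicitly */
}

/*
 * Append src to the NUL-terminated string already in dst
 * (capacity dstsize bytes).
 */
void append_truncating(char *dst, size_t dstsize, const char *src)
{
    size_t used = strlen(dst);

    if (used + 1 >= dstsize)
        return;                         /* no room left */
    /* strncat takes the room that is left, not the total buffer size. */
    strncat(dst, src, dstsize - used - 1);
}
```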

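  In OpenBSD, another memory copy implementation was created to get around these problems. The strlcpy and strlcat functions guarantee that the destination string is always NUL-terminated when given a non-zero length argument. For more information about these functions see [7]. The OpenBSD strlcpy and strlcat functions have been in FreeBSD since 3.3.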
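
  To show the calling convention, here is a portable sketch. bounded_copy is a hypothetical stand-in written for this example so that it compiles on systems without strlcpy; it is meant to mimic strlcpy's semantics (always terminate, return the length of the source), not to reproduce the OpenBSD implementation. On FreeBSD one would simply call strlcpy from <string.h>.

```c
#include <string.h>

/*
 * Hypothetical stand-in with strlcpy-like semantics: copies at most
 * dstsize - 1 bytes, always NUL-terminates when dstsize is non-zero,
 * and returns strlen(src).
 */
size_t bounded_copy(char *dst, const char *src, size_t dstsize)
{
    size_t srclen = strlen(src);

    if (dstsize != 0) {
        size_t n = (srclen < dstsize - 1) ? srclen : dstsize - 1;

        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return srclen;
}

/* Returns 0 if src fit completely, -1 if it had to be truncated. */
int copy_or_fail(char *dst, const char *src, size_t dstsize)
{
    return bounded_copy(dst, src, dstsize) < dstsize ? 0 : -1;
}
```

  Because the return value is the length the copy needed, truncation is detected with a single comparison against the buffer size.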


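3.3.2.1 Compiler-Based Run-Time Bounds Checking

  Unfortunately, there is still a very large body of code in public use that blindly copies memory around without using any of the bounded copy routines just discussed. Fortunately, there is another solution. Several compiler add-ons and libraries exist to do run-time bounds checking in C/C++.

  StackGuard is one such add-on, implemented as a small patch to the gcc code generator. From the StackGuard website:

“StackGuard detects and defeats stack smashing attacks by protecting the return address on the stack from being altered. StackGuard places a ‘canary’ word next to the return address when a function is called. If the canary word has been altered when the function returns, then a stack smashing attack has been attempted, and the program responds by emitting an intruder alert into syslog, and then halts.”

“StackGuard is implemented as a small patch to the gcc code generator, specifically the function_prolog() and function_epilog() routines. function_prolog() has been enhanced to lay down canaries on the stack when functions start, and function_epilog() checks canary integrity when the function exits. Any attempt at corrupting the return address is thus detected before the function returns.”



  Recompiling your application with StackGuard is an effective means of stopping most buffer-overflow attacks, but it can still be compromised.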
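
  The canary idea can be imitated in ordinary C to see why it works. The sketch below is only a toy model under stated assumptions: struct toy_frame, the CANARY value, and the function names are all invented for this example, and the "frame" is a plain byte array rather than a real stack frame. It is not how StackGuard itself is implemented.

```c
#include <stdint.h>
#include <string.h>

#define BUF_SIZE 16
#define CANARY   0xDEADC0DEu    /* arbitrary value chosen for this sketch */

/* Toy "stack frame": a 16-byte buffer followed by a canary word. */
struct toy_frame {
    unsigned char bytes[BUF_SIZE + sizeof(uint32_t)];
};

/* Plays the role of the function prologue: lay down the canary. */
void frame_enter(struct toy_frame *f)
{
    uint32_t canary = CANARY;

    memset(f->bytes, 0, BUF_SIZE);
    memcpy(f->bytes + BUF_SIZE, &canary, sizeof(canary));
}

/* Plays the role of the epilogue: 1 if the canary is intact, else 0. */
int frame_leave(const struct toy_frame *f)
{
    uint32_t canary;

    memcpy(&canary, f->bytes + BUF_SIZE, sizeof(canary));
    return canary == CANARY;
}

/*
 * An unchecked copy into the buffer part.  The length is clamped to the
 * whole toy frame so that the sketch itself stays within defined behavior.
 */
void careless_write(struct toy_frame *f, const void *src, size_t len)
{
    if (len > sizeof(f->bytes))
        len = sizeof(f->bytes);
    memcpy(f->bytes, src, len);
}
```

  A write of up to 16 bytes leaves the canary untouched; anything longer runs into it, and frame_leave reports the damage before the (imaginary) return address would be used.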


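3.3.2.2 Library-Based Run-Time Bounds Checking

  Compiler-based mechanisms are completely useless for binary-only software that you cannot recompile. For these situations there are a number of libraries which re-implement the unsafe functions of the C library (strcpy, fscanf, getwd, etc.) and ensure that these functions can never write past the stack pointer.

  • libsafe

  • libverify

  • libparanoia

  Unfortunately, these library-based defenses have a number of shortcomings. They protect against only a very small set of security-related issues, and they do not fix the actual problem. These defenses may fail if the application was compiled with -fomit-frame-pointer. Also, the LD_PRELOAD and LD_LIBRARY_PATH environment variables can be unset or overwritten by the user.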


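3.4 SetUID Issues

  There are at least six different IDs associated with any given process. Because of this, you have to be very careful about the access your process has at any given time. In particular, all set-user-ID applications should give up their privileges as soon as they are no longer required.

  The real user ID can only be changed by a superuser process. The login program sets it when a user first logs in, and it is seldom changed afterwards.

  The effective user ID is set by the exec() functions if a program has its set-user-ID (setuid) bit set. An application can call seteuid() at any time to set the effective user ID to either the real user ID or the saved set-user-ID. When the effective user ID is set by the exec() functions, the previous value is saved in the saved set-user-ID.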
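
  A minimal sketch of this pattern, assuming a set-user-ID program that needs its extra privileges only briefly: the function names are hypothetical, and error handling is reduced to passing on the return value of seteuid().

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * Switch the effective user ID to the real user ID, remembering the
 * previous effective ID so it can be restored later.
 * Returns 0 on success, -1 on failure.
 */
int drop_privileges_temporarily(uid_t *saved_euid)
{
    *saved_euid = geteuid();
    return seteuid(getuid());
}

/*
 * Switch back to the remembered effective user ID.  This works because
 * the kernel keeps that value in the saved set-user-ID.
 */
int restore_privileges(uid_t saved_euid)
{
    return seteuid(saved_euid);
}
```

  A program would typically call drop_privileges_temporarily() first thing in main(), and wrap only the few operations that need elevated rights between restore_privileges() and another drop.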


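3.5 Limiting Your Program's Environment

  The traditional method of restricting a process is the chroot() system call. This system call changes the root directory from which all other paths are resolved, for the process and any of its child processes. For the call to succeed, the process must have execute (search) permission on the directory being referenced. The new environment does not actually take effect until you chdir() into it. Note also that a process can easily break out of a chroot environment if it has root privilege. This can be done by creating device nodes to read kernel memory, by attaching a debugger to a process outside of the jail, or in many other creative ways.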
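
  The chroot()/chdir() sequence described above might be sketched as follows. confine_to is a hypothetical name; the call requires superuser privilege, and a real program would give up that privilege right after confining itself, for the reasons given above.

```c
#include <unistd.h>

/*
 * Confine the calling process to the directory tree under dir.
 * Returns 0 on success, -1 on failure (errno is set by the failing call).
 */
int confine_to(const char *dir)
{
    if (chroot(dir) != 0)
        return -1;
    /* Move into the new root so no reference to the old tree remains. */
    if (chdir("/") != 0)
        return -1;
    return 0;
}
```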

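  The behavior of the chroot() system call can be controlled to some extent with the kern.chroot_allow_open_directories sysctl variable. When this value is set to 0, chroot() fails with EPERM if any directories are open. When set to the default value of 1, chroot() fails with EPERM if any directories are open and the process is already subject to a chroot() call. For any other value, the check for open directories is bypassed completely.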


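3.5.1 FreeBSD's Jail Functionality

  The concept of a jail extends chroot() by limiting the powers of the superuser, in order to create a true "virtual server". Once a prison is set up, all network communication must take place through the specified IP address, and the power of "root privilege" inside the jail is severely constrained.

  While in a prison, any test of superuser power within the kernel that uses the suser() call will fail. However, some calls to suser() have been changed to a new interface, suser_xxx(). This function is responsible for granting or denying access to superuser power for imprisoned processes.

  A superuser process within a jailed environment has the power to:

  • Manipulate credentials with setuid, seteuid, setgid, setegid, setgroups, setreuid, setregid, setlogin

  • Set resource limits with setrlimit

  • Modify some sysctl nodes (kern.hostname)

  • chroot()

  • Set flags on a vnode: chflags, fchflags

  • Set attributes of a vnode such as file permission, owner, group, size, access time, and modification time

  • Bind to privileged ports in the Internet domain (ports < 1024)

  Jail is a very useful tool for running applications in a secure environment, but it does have some shortcomings. Currently, the IPC mechanisms have not been converted to suser_xxx, so applications such as MySQL cannot be run within a jail. Superuser access may have a very limited meaning within a jail, but there is no way to specify exactly what "very limited" means.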


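3.5.2 POSIX®.1e Process Capabilities

  POSIX has released a working draft that adds event auditing, access control lists, fine-grained privileges, information labeling, and mandatory access control.

  This is a work in progress and is the focus of the TrustedBSD project. Some of the initial work has been committed to FreeBSD-CURRENT (cap_set_proc(3)).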


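3.6 Trust

  An application should never assume that anything about the user's environment is sane. This includes (but is certainly not limited to): user input, signals, environment variables, resources, IPC, mmaps, the file system working directory, file descriptors, the number of open files, and so on.

  You should never assume that you can catch every form of invalid input a user might supply. Instead, your application should use positive filtering and allow only a specific subset of inputs that you deem safe. Improper data validation has been the cause of many exploits, especially with CGI scripts on the World Wide Web. For file names you need to be extra careful about paths ("../", "/"), symbolic links, and shell escape characters.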
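
  Positive filtering of file names can be sketched as below. The accepted character set, the 255-byte limit, and the name is_safe_filename are assumptions made for this example; an application should choose the subset that fits its own needs. Note that character filtering alone does nothing about symbolic links, which have to be handled when the file is opened.

```c
#include <string.h>

/*
 * Returns 1 if name is a single path component made up only of
 * characters from an allow-list, 0 otherwise.  Anything not explicitly
 * allowed ('/', spaces, shell metacharacters, ...) is rejected.
 */
int is_safe_filename(const char *name)
{
    size_t i, len;

    if (name == NULL)
        return 0;
    len = strlen(name);
    if (len == 0 || len > 255)
        return 0;
    if (name[0] == '.' || name[0] == '-')
        return 0;       /* rejects ".", "..", hidden and option-like names */
    for (i = 0; i < len; i++) {
        char c = name[i];

        if (!((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
              (c >= '0' && c <= '9') || c == '.' || c == '_' || c == '-'))
            return 0;
    }
    return 1;
}
```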

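  Perl has a very useful feature called “Taint” mode, which can be used to prevent scripts from using data derived from outside the program in an unsafe way. This mode checks command-line arguments, environment variables, locale information, the results of certain system calls (readdir(), readlink(), getpwxxx()), and all file input.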


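3.7 Race Conditions

  A race condition is anomalous behavior caused by an unexpected dependence on the relative timing of events. In other words, a programmer incorrectly assumed that a particular event would always happen before another.

  Some of the common causes of race conditions are signals, access checks, and file opens. Signals are asynchronous events by nature, so special care must be taken in dealing with them. Checking access with access(2) and then calling open(2) is clearly non-atomic: users can move files in between the two calls. Instead, privileged applications should call seteuid() and then call open() directly. Along the same lines, an application should always set a proper umask before open(), to avoid the need for spurious chmod() calls.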
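
  Both recommendations can be sketched in a few lines. The function names are hypothetical, and the sketch assumes a set-user-ID program whose real user ID is the user on whose behalf the file is being opened.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Open an existing file with the rights of the real user instead of
 * asking access(2) first.  flags must not include O_CREAT.
 * Returns a file descriptor, or -1 on failure.
 */
int open_as_real_user(const char *path, int flags)
{
    uid_t saved = geteuid();
    int fd;

    if (seteuid(getuid()) != 0)
        return -1;
    fd = open(path, flags);     /* the kernel does the permission check */
    if (seteuid(saved) != 0) {  /* could not regain privileges: give up */
        if (fd >= 0)
            close(fd);
        return -1;
    }
    return fd;
}

/*
 * Create a new file that is private from the moment it exists, so no
 * chmod() is needed afterwards.  Fails if the file already exists.
 */
int create_private_file(const char *path)
{
    mode_t old = umask(077);
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);

    umask(old);
    return fd;
}
```

  In both functions the check and the action happen inside a single open() call, which leaves no window between them for another process to exploit.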


Chapter 4  Localization and Internationalization - L10N and I18N

4.1 Programming I18N Compliant Applications

  To make your application more useful for speakers of other languages, we hope that you will write I18N-compliant programs. The GNU gcc compiler and GUI libraries like Qt and GTK support I18N through special handling of strings. Making a program I18N compliant is very easy, and it allows contributors to port your application to other languages quickly. Refer to the library-specific I18N documentation for more details.

  In contrast with common perception, I18N compliant code is easy to write. Usually, it only involves wrapping your strings with library specific functions. In addition, please be sure to allow for wide or multibyte character support.
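
  As a small illustration of allowing for multibyte characters, the sketch below counts characters rather than bytes. It uses only setlocale() and mbstowcs(); the function names are hypothetical, and calling mbstowcs() with a null destination to obtain the character count is assumed to behave as POSIX describes.

```c
#include <locale.h>
#include <stdlib.h>

/*
 * Adopt the locale from the user's environment (LANG, LC_ALL, ...).
 * Call once at program start.  Returns 1 on success, 0 on failure.
 */
int use_user_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}

/*
 * Number of characters (not bytes) in the multibyte string s under the
 * current locale, or (size_t)-1 if s contains an invalid sequence.
 */
size_t count_characters(const char *s)
{
    return mbstowcs(NULL, s, 0);
}
```

  Code that assumes one byte per character (for example, when padding columns or truncating strings) is a common source of I18N bugs; counting characters this way avoids the assumption.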


4.1.1 A Call to Unify the I18N Effort

  It has come to our attention that the individual I18N/L10N efforts for each country have been duplicating each other's work. Many of us have been reinventing the wheel repeatedly and inefficiently. We hope that the various major groups in I18N could congregate into a group effort similar to the Core Team's responsibility.

  Currently, we hope that, when you write or port I18N programs, you will send them out to each country's related FreeBSD mailing list for testing. In the future, we hope to create applications that work in all languages out of the box, without dirty hacks.

  The FreeBSD internationalization mailing list has been established. If you are an I18N/L10N developer, please send your comments, ideas, questions, and anything you deem related to it.

  Michael C. Wu will be maintaining an I18N works in progress homepage at http://www.FreeBSD.org/~keichii/i18n/index.html. Please also read the BSDCon2000 I18N paper and presentations by Clive Lin, Chia-Liang Kao, and Michael C. Wu at http://www.FreeBSD.org/~keichii/papers/

4.1.2 Perl and Python

  Perl and Python have I18N and wide character handling libraries. Please use them for I18N compliance.

  In older FreeBSD versions, Perl may give warnings about not having a wide character locale installed on your system. You can set the environment variable LD_PRELOAD to /usr/lib/libxpg4.so in your shell.

  In sh-based shells:

LD_PRELOAD=/usr/lib/libxpg4.so; export LD_PRELOAD

  In C-based shells:

setenv LD_PRELOAD /usr/lib/libxpg4.so

Chapter 5  Source Tree Guidelines and Policies

Contributed by Poul-Henning Kamp.

  This chapter documents various guidelines and policies in force for the FreeBSD source tree.


5.1 MAINTAINER on Makefiles

  If a particular portion of the FreeBSD distribution is being maintained by a person or group of persons, they can communicate this fact to the world by adding a

MAINTAINER= email-addresses
line to the Makefiles covering this portion of the source tree.

  The semantics of this are as follows:

  The maintainer owns and is responsible for that code. This means that he is responsible for fixing bugs and answering problem reports pertaining to that piece of the code, and in the case of contributed software, for tracking new versions, as appropriate.

  Changes to directories which have a maintainer defined shall be sent to the maintainer for review before being committed. Only if the maintainer does not respond for an unacceptable period of time, to several emails, will it be acceptable to commit changes without review by the maintainer. However, it is suggested that you try to have the changes reviewed by someone else if at all possible.

  It is of course not acceptable to add a person or group as maintainer unless they agree to assume this duty. On the other hand it does not have to be a committer and it can easily be a group of people.


5.2 Contributed Software

Contributed by Poul-Henning Kamp and David O'Brien.

  Some parts of the FreeBSD distribution consist of software that is actively being maintained outside the FreeBSD project. For historical reasons, we call this contributed software. Some examples are sendmail, gcc and patch.

  Over the last couple of years, various methods have been used in dealing with this type of software and all have some number of advantages and drawbacks. No clear winner has emerged.

  Since this is the case, after some debate one of these methods has been selected as the “official” method and will be required for future imports of software of this kind. Furthermore, it is strongly suggested that existing contributed software converge on this model over time, as it has significant advantages over the old method, including the ability to easily obtain diffs relative to the “official” versions of the source by everyone (even without cvs access). This will make it significantly easier to return changes to the primary developers of the contributed software.

  Ultimately, however, it comes down to the people actually doing the work. If using this model is particularly unsuited to the package being dealt with, exceptions to these rules may be granted only with the approval of the core team and with the general consensus of the other developers. The ability to maintain the package in the future will be a key issue in the decisions.

Note: Because of some unfortunate design limitations with the RCS file format and CVS's use of vendor branches, minor, trivial and/or cosmetic changes are strongly discouraged on files that are still tracking the vendor branch. “Spelling fixes” are explicitly included here under the “cosmetic” category and are to be avoided for files with revision 1.1.x.x. The repository bloat impact from a single character change can be rather dramatic.

  The Tcl embedded programming language will be used as example of how this model works:

  src/contrib/tcl contains the source as distributed by the maintainers of this package. Parts that are entirely not applicable for FreeBSD can be removed. In the case of Tcl, the mac, win and compat subdirectories were eliminated before the import.

  src/lib/libtcl contains only a bmake style Makefile that uses the standard bsd.lib.mk makefile rules to produce the library and install the documentation.

  src/usr.bin/tclsh contains only a bmake style Makefile which will produce and install the tclsh program and its associated man-pages using the standard bsd.prog.mk rules.

  src/tools/tools/tcl_bmake contains a couple of shell-scripts that can be of help when the tcl software needs updating. These are not part of the built or installed software.

  The important thing here is that the src/contrib/tcl directory is created according to the rules: it is supposed to contain the sources as distributed (on a proper CVS vendor-branch and without RCS keyword expansion) with as few FreeBSD-specific changes as possible. The 'easy-import' tool on freefall will assist in doing the import, but if there are any doubts on how to go about it, it is imperative that you ask first and not blunder ahead and hope it “works out”. CVS is not forgiving of import accidents and a fair amount of effort is required to back out major mistakes.

  Because of the previously mentioned design limitations with CVS's vendor branches, it is required that “official” patches from the vendor be applied to the original distributed sources and the result re-imported onto the vendor branch again. Official patches should never be patched into the FreeBSD checked out version and “committed”, as this destroys the vendor branch coherency and makes importing future versions rather difficult as there will be conflicts.

  Since many packages contain files that are meant for compatibility with architectures and environments other than FreeBSD, it is permissible to remove parts of the distribution tree that are of no interest to FreeBSD in order to save space. Files containing copyright notices and release-note kind of information applicable to the remaining files shall not be removed.

  If it seems easier, the bmake Makefiles can be produced from the dist tree automatically by some utility, something which would hopefully make it even easier to upgrade to a new version. If this is done, be sure to check in such utilities (as necessary) in the src/tools directory along with the port itself so that it is available to future maintainers.

  In the src/contrib/tcl level directory, a file called FREEBSD-upgrade should be added and it should state things like:

  • Which files have been left out.

  • Where the original distribution was obtained from and/or the official master site.

  • Where to send patches back to the original authors.

  • Perhaps an overview of the FreeBSD-specific changes that have been made.

  However, please do not import FREEBSD-upgrade with the contributed source. Rather you should cvs add FREEBSD-upgrade ; cvs ci after the initial import. Example wording from src/contrib/cpio is below:

This directory contains virgin sources of the original distribution files
on a "vendor" branch.  Do not, under any circumstances, attempt to upgrade
the files in this directory via patches and a cvs commit.  New versions or
official-patch versions must be imported.  Please remember to import with
"-ko" to prevent CVS from corrupting any vendor RCS Ids.

For the import of GNU cpio 2.4.2, the following files were removed:

        INSTALL         cpio.info       mkdir.c             
        Makefile.in     cpio.texi       mkinstalldirs

To upgrade to a newer version of cpio, when it is available:
        1. Unpack the new version into an empty directory.
           [Do not make ANY changes to the files.]

        2. Remove the files listed above and any others that don't apply to
           FreeBSD.

        3. Use the command:
                cvs import -ko -m 'Virgin import of GNU cpio v<version>' \
                        src/contrib/cpio GNU cpio_<version>

           For example, to do the import of version 2.4.2, I typed:
                cvs import -ko -m 'Virgin import of GNU v2.4.2' \
                        src/contrib/cpio GNU cpio_2_4_2

        4. Follow the instructions printed out in step 3 to resolve any
           conflicts between local FreeBSD changes and the newer version.

Do not, under any circumstances, deviate from this procedure.

To make local changes to cpio, simply patch and commit to the main
branch (aka HEAD).  Never make local changes on the GNU branch.

All local changes should be submitted to "[email protected]" for
inclusion in the next vendor release.

[email protected] - 30 March 1997

5.3 Encumbered Files

  It might occasionally be necessary to include an encumbered file in the FreeBSD source tree. For example, if a device requires a small piece of binary code to be loaded to it before the device will operate, and we do not have the source to that code, then the binary file is said to be encumbered. The following policies apply to including encumbered files in the FreeBSD source tree.

  1. Any file which is interpreted or executed by the system CPU(s) and not in source format is encumbered.

  2. Any file with a license more restrictive than BSD or GNU is encumbered.

  3. A file which contains downloadable binary data for use by the hardware is not encumbered, unless (1) or (2) apply to it. It must be stored in an architecture neutral ASCII format (file2c or uuencoding is recommended).

  4. Any encumbered file requires specific approval from the Core team before it is added to the CVS repository.

  5. Encumbered files go in src/contrib or src/sys/contrib.

  6. The entire module should be kept together. There is no point in splitting it, unless there is code-sharing with non-encumbered code.

  7. Object files are named arch/filename.o.uu.

  8. Kernel files:

    1. Should always be referenced in conf/files.* (for build simplicity).

    2. Should always be in LINT, but the Core team decides per case if it should be commented out or not. The Core team can, of course, change their minds later on.

    3. The Release Engineer decides whether or not it goes into the release.

  9. User-land files:

    1. The Core team decides if the code should be part of make world.

    2. The Release Engineer decides if it goes into the release.


5.4 Shared Libraries

Contributed by Satoshi Asami, Peter Wemm and David O'Brien.

  If you are adding shared library support to a port or other piece of software that does not have one, the version numbers should follow these rules. Generally, the resulting numbers will have nothing to do with the release version of the software.

  The three principles of shared library building are:

  • Start from 1.0

  • If there is a change that is backwards compatible, bump minor number (note that ELF systems ignore the minor number)

  • If there is an incompatible change, bump major number

  For instance, added functions and bugfixes result in the minor version number being bumped, while deleted functions, changed function call syntax, etc. will force the major version number to change.

  Stick to version numbers of the form major.minor (x.y). Our a.out dynamic linker does not handle version numbers of the form x.y.z well. Any version number after the y (i.e. the third digit) is totally ignored when comparing shared lib version numbers to decide which library to link with. Given two shared libraries that differ only in the “micro” revision, ld.so will link with the higher one. That is, if you link with libfoo.so.3.3.3, the linker only records 3.3 in the headers, and will link with anything starting with libfoo.so.3.(anything >= 3).(highest available).

Note: ld.so will always use the highest “minor” revision. For instance, it will use libc.so.2.2 in preference to libc.so.2.0, even if the program was initially linked with libc.so.2.0.
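
  The matching rule for the a.out linker described above can be written down as a small model. This is only an illustration of the rule as stated in the text, with hypothetical names; it is not code taken from ld.so.

```c
#include <stdio.h>

struct shlib_version {
    int major;
    int minor;
};

/*
 * Parse "x.y" or "x.y.z" into major and minor; anything after the
 * minor number is ignored, as the text describes.
 * Returns 0 on success, -1 if the string does not start with "x.y".
 */
int shlib_parse(const char *s, struct shlib_version *out)
{
    return sscanf(s, "%d.%d", &out->major, &out->minor) == 2 ? 0 : -1;
}

/*
 * A candidate library can stand in for the recorded one when the major
 * numbers match and its minor number is at least the recorded minor.
 */
int shlib_satisfies(struct shlib_version recorded,
                    struct shlib_version candidate)
{
    return candidate.major == recorded.major &&
           candidate.minor >= recorded.minor;
}
```

  For a program linked against libfoo.so.3.3.3, the recorded version is 3.3, so libfoo.so.3.5 satisfies it while libfoo.so.3.2 and libfoo.so.4.0 do not.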

  In addition, our ELF dynamic linker does not handle minor version numbers at all. However, one should still specify a major and minor version number as our Makefiles “do the right thing” based on the type of system.

  For non-port libraries, it is also our policy to change the shared library version number only once between releases. In addition, it is our policy to change the major shared library version number only once between major OS releases (i.e. from 3.0 to 4.0). When you make a change to a system library that requires the version number to be bumped, check the Makefile's commit logs. It is the responsibility of the committer to ensure that the first such change since the release will result in the shared library version number in the Makefile to be updated, and any subsequent changes will not.


Chapter 6  Regression and Performance Testing

  Regression tests are used to exercise a particular bit of the system to check that it works as expected, and to make sure that old bugs are not reintroduced.

  The FreeBSD regression testing tools can be found in the FreeBSD source tree in the directory src/tools/regression.


6.1 Micro Benchmark Checklist

This section contains hints for doing proper micro-benchmarking on FreeBSD or of FreeBSD itself.

It is not possible to use all of the suggestions below every single time, but the more used, the better the benchmark's ability to test small differences will be.

  • Disable APM and any other kind of clock fiddling (ACPI ?).

  • Run in single user mode. Daemons such as cron(8) only add noise. The sshd(8) daemon can also cause problems. If ssh access is required during the test, either disable the SSHv1 key regeneration, or kill the parent sshd daemon during the tests.

  • Do not run ntpd(8).

  • If syslog(3) events are generated, run syslogd(8) with an empty /etc/syslogd.conf, otherwise, do not run it.

  • Minimize disk-I/O, avoid it entirely if possible.

  • Do not mount file systems that are not needed.

  • Mount /, /usr, and any other file system as read-only if possible. This removes atime updates to disk (etc.) from the I/O picture.

  • Reinitialize the read/write test file system with newfs(8) and populate it from a tar(1) or dump(8) file before every run. Unmount and mount it before starting the test. This results in a consistent file system layout. For a worldstone test this would apply to /usr/obj (just reinitialize with newfs and mount). To get 100% reproducibility, populate the file system from a dd(1) file (i.e.: dd if=myimage of=/dev/ad0s1h bs=1m)

  • Use malloc backed or preloaded md(4) partitions.

  • Reboot between individual iterations of the test, this gives a more consistent state.

  • Remove all non-essential device drivers from the kernel. For instance if USB is not needed for the test, do not put USB in the kernel. Drivers which attach often have timeouts ticking away.

  • Unconfigure hardware that is not in use. Detach disks with atacontrol(8) and camcontrol(8) if the disks are not used for the test.

  • Do not configure the network unless it is being tested, or wait until after the test has been performed to ship the results off to another computer.

    If the system must be connected to a public network, watch out for spikes of broadcast traffic. Even though it is hardly noticeable, it will take up CPU cycles. Multicast has similar caveats.

  • Put each file system on its own disk. This minimizes jitter from head-seek optimizations.

  • Minimize output to serial or VGA consoles. Running output into files gives less jitter. (Serial consoles easily become a bottleneck.) Do not touch the keyboard while the test is running; even space or back-space shows up in the numbers.

  • Make sure the test is long enough, but not too long. If the test is too short, timestamping is a problem. If it is too long, temperature changes and drift will affect the frequency of the quartz crystals in the computer. Rule of thumb: more than a minute, less than an hour.

  • Try to keep the temperature as stable as possible around the machine. This affects both quartz crystals and disk drive algorithms. To get a really stable clock, consider stabilized clock injection. E.g., get an OCXO + PLL and inject its output into the clock circuits instead of the motherboard xtal. Contact Poul-Henning Kamp for more information about this.

  • Run the test at least 3 times but it is better to run more than 20 times both for “before” and “after” code. Try to interleave if possible (i.e.: do not run 20 times before then 20 times after), this makes it possible to spot environmental effects. Do not interleave 1:1, but 3:3, this makes it possible to spot interaction effects.

    A good pattern is: bababa{bbbaaa}*. This gives hint after the first 1+1 runs (so it is possible to stop the test if it goes entirely the wrong way), a standard deviation after the first 3+3 (gives a good indication if it is going to be worth a long run) and trending and interaction numbers later on.

  • Use /usr/src/tools/tools/ministat to see if the numbers are significant. Consider buying “Cartoon guide to statistics” ISBN: 0062731025, highly recommended, if you have forgotten or never learned about standard deviation and Student's T.

  • Do not use background fsck(8) unless the test is a benchmark of background fsck. Also, disable background_fsck in /etc/rc.conf unless the benchmark is started at least 60+“fsck runtime” seconds after boot, as rc(8) wakes up and checks if fsck needs to run on any file systems when background fsck is enabled. Likewise, make sure there are no snapshots lying around unless the benchmark is a test with snapshots.

  • If the benchmark shows unexpectedly bad performance, check for things like high interrupt volume from an unexpected source. Some versions of ACPI have been reported to “misbehave” and generate excess interrupts. To help diagnose odd test results, take a few snapshots of vmstat -i and look for anything unusual.

  • Make sure to be careful about optimization parameters for kernel and userspace, likewise debugging. It is easy to let something slip through and realize later the test was not comparing the same thing.

  • Do not ever benchmark with the WITNESS and INVARIANTS kernel options enabled unless the test is meant to benchmark those features. WITNESS can cause 400%+ drops in performance. Likewise, userspace malloc(3) parameters default differently in -CURRENT from the way they ship in production releases.
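
  The interleaving pattern bababa{bbbaaa}* from the checklist can be generated mechanically, which helps when the runs are driven by a script or harness. The sketch below is one possible way to do it; the function name and the use of 'b' and 'a' characters for “before” and “after” runs are assumptions of this example.

```c
#include <stddef.h>

/*
 * Fill order[0..runs-1] with 'b' ("before" code) and 'a' ("after" code)
 * following the pattern bababa{bbbaaa}*, and NUL-terminate the result.
 * The caller must provide room for runs + 1 bytes.
 */
void benchmark_order(char *order, size_t runs)
{
    static const char head[] = "bababa";   /* first 3+3 runs, 1:1 */
    static const char block[] = "bbbaaa";  /* then repeating 3:3 blocks */
    size_t i;

    for (i = 0; i < runs; i++) {
        if (i < 6)
            order[i] = head[i];
        else
            order[i] = block[(i - 6) % 6];
    }
    order[runs] = '\0';
}
```

  A driver would walk the resulting string and boot or run the corresponding kernel or binary for each character, collecting one measurement per run for later analysis with ministat.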

Part II. Interprocess Communication

Table of Contents
Chapter 7  Sockets
Chapter 8  IPv6 Internals

Chapter 7  Sockets

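Contributed by G. Adam Stanislav. Chinese translation by intron @NewSMTH.

7.1 Synopsis

  BSD sockets take interprocess communication to a new level. It is no longer necessary for the communicating processes to run on the same machine. They still can, but they do not have to.

  Not only do these processes not have to run on the same machine, they do not have to run under the same operating system. Thanks to BSD sockets, your FreeBSD software can smoothly cooperate with a program running on a Macintosh®, another one running on a Sun™ workstation, and yet another one running under Windows® 2000, all connected with an Ethernet-based local area network.

  But your software can equally well cooperate with processes running in another building, on another continent, inside a submarine, or on a space shuttle.

  It can also cooperate with processes that are not part of a computer (at least not in the strict sense of the word), but of such devices as printers, digital cameras, and medical equipment: just about anything capable of digital communication.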


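7.2 Networking and Diversity

  We have already hinted at the diversity of networking. Many different systems have to talk to each other, and they have to speak the same language. They also have to understand the same language the same way.

  People often think that body language is universal. It is not. Back in my early teens, my father took me to Bulgaria. We were sitting at a table in a park in Sofia when a vendor approached us, trying to sell us some roasted almonds.

  I had not learned much Bulgarian by then, so instead of saying no, I shook my head from side to side, the “universal” body language for no. The vendor quickly started serving us some almonds.

  I then remembered I had been told that in Bulgaria shaking your head sideways means yes. Quickly, I started nodding my head up and down. The vendor noticed, took his almonds, and walked away. To an uninformed observer, I did not change the body language: I continued using the language of shaking and nodding my head. What changed was the meaning of the body language. At first, the vendor and I interpreted the same language as having completely different meanings. I had to adjust my own interpretation of that language so the vendor would understand.

  It is the same with computers: the same symbols may have different, even outright opposite, meanings. Therefore, for two computers to understand each other, they must agree not only on the same language, but also on the same interpretation of that language.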


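7.3 Protocols

  While programming languages tend to have complex syntax and use a number of multi-letter reserved words (which makes them easy for the human programmer to understand), the languages of data communications tend to be very terse. Instead of multi-byte words, they often use individual bits. There is a very convincing reason for this: while data travels inside your computer at speeds approaching the speed of light, it often travels considerably slower between two computers.

  Because the languages used in data communications are so terse, we usually refer to them as protocols rather than languages.

  As data travels from one computer to another, it generally uses more than one protocol. These protocols are layered. The data can be compared to the inside of an onion: you have to peel off several layers of “skin” to get to the data. This is best illustrated with a picture:

  In this example, we are trying to get an image from a web page we are connected to via an Ethernet.

  The image consists of raw data, which is simply a sequence of RGB values that our software can process, i.e., convert into an image and display on our monitor.

  Alas, our software has no way of knowing how the raw data is organized: is it a sequence of RGB values, or a sequence of grayscale intensities, or perhaps of CMYK-encoded colors? Is the data represented by 8-bit quanta, or are they 16 bits in size, or perhaps 4 bits? How many rows and columns does the image consist of? Should certain pixels be transparent?

  I think you get the picture...

  To inform our software how to handle the raw data, it is encoded as a PNG file. It could be a GIF or a JPEG, but it is a PNG.

  And PNG is a protocol.

  At this point, I can hear some of you yelling, “No, it is not! It is a file format!”

  Well, of course it is a file format. But from the perspective of data communications, a file format is a protocol: the file structure is a language, a terse one at that, communicating to our process how the data is organized. Ergo, it is a protocol.

  Alas, if all we received was the PNG file, our software would face a serious problem: how is it supposed to know the data represents an image, as opposed to some text, or perhaps a sound, or what not? Secondly, how is it supposed to know the image is in the PNG format as opposed to GIF, JPEG, or some other image format?

  To obtain that information, we use another protocol: HTTP. This protocol can tell us exactly that the data represents an image, and that it uses the PNG protocol. It can also tell us some other things, but let us stay focused on protocol layers here.

  So, now we have some data wrapped in the PNG protocol, wrapped in the HTTP protocol. How did we get it from the server?

  By using TCP/IP over Ethernet, that is how. Indeed, that is three more protocols. Instead of continuing from the inside out, I am now going to talk about Ethernet, simply because it is easier to explain the rest that way.

  Ethernet is an interesting system of connecting computers in a local area network (LAN). Each computer has a network interface card (NIC), which has a unique 48-bit ID called its address. No two Ethernet NICs in the world have the same address.

  These NICs are all connected with each other. Whenever one computer wants to communicate with another in the same Ethernet LAN, it sends a message over the network. Every NIC sees the message. But as part of the Ethernet protocol, the data contains the address of the destination NIC (among other things). So only one of all the network interface cards will pay attention to it; the rest will ignore it.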
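
  The address check just described amounts to a 6-byte comparison. The toy sketch below models it with hypothetical names; it deliberately leaves out everything else a real NIC does, including the broadcast and multicast addresses that every station also listens to.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_ADDR_LEN 6   /* 48-bit Ethernet address = 6 bytes */

/*
 * Returns 1 if a frame addressed to dest should be picked up by the
 * NIC whose address is own, 0 if the NIC should ignore it.
 */
int nic_accepts(const uint8_t own[SKETCH_ADDR_LEN],
                const uint8_t dest[SKETCH_ADDR_LEN])
{
    return memcmp(own, dest, SKETCH_ADDR_LEN) == 0;
}
```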

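  But not all computers are connected to the same network. Just because we have received the data over our Ethernet does not mean it originated in our own local area network. It could have come to us from some other network connected with our own network via the Internet.

  All data is transferred over the Internet using IP, which stands for Internet Protocol. Its basic role is to let us know where in the world the data has come from, and where it is supposed to go. It does not guarantee we will receive the data, only that we will know where it came from if we do receive it.

  Even if we do receive the data, IP does not guarantee we will receive the various chunks of data in the same order in which the other computer sent them. So we can, for example, receive the center of our image before we receive the upper left corner and after the lower right.

  It is TCP (Transmission Control Protocol) that asks the sender to resend any lost data and that places it all into the proper order.

  All in all, it took five different protocols for one computer to communicate to another what an image looks like. We received the data wrapped in the PNG protocol, which was wrapped in the HTTP protocol, which was wrapped in the TCP protocol, which was wrapped in the IP protocol, which was wrapped in the Ethernet protocol.

  Oh, and by the way, there probably were several other protocols involved somewhere along the way. For example, if our LAN was connected to the Internet through a dial-up call, it used the PPP protocol over the modem, which used one (or several) of the various modem protocols, et cetera, et cetera, et cetera...

  As a developer you should be asking by now, “How am I supposed to handle it all?”

  Luckily for you, you are not supposed to handle it all. You are supposed to handle some of it, but not all of it. Specifically, you need not worry about the physical connection (in our case Ethernet and possibly PPP, etc.). Nor do you need to handle the Internet Protocol or the Transmission Control Protocol.

  In other words, you do not have to do anything to receive the data from the other computer. Well, you do have to ask for it, but that is almost as simple as opening a file.

  Once you have received the data, it is up to you to figure out what to do with it. In our case, you would need to understand the HTTP protocol and the PNG file structure.

  To use an analogy, all the internetworking protocols become a gray area: not so much because we do not understand how they work, but because we are no longer concerned about them. The sockets interface takes care of this gray area for us:

  We only need to understand the protocols that tell us how to interpret the data, not how to receive it from another process, nor how to send it to another process.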


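7.4 The Sockets Model

  BSD sockets are built on the basic UNIX model: everything is a file. In our example, then, sockets would let us receive an HTTP file, so to speak. It would then be up to us to extract the PNG file from it.

  Because of the complexity of internetworking, we cannot just use the open system call, or the open() C function. Instead, we need to take several steps to “open” a socket.

  Once we have done so, however, we can treat the socket the same way we treat any file descriptor: we can read from it, write to it, pipe it, and, eventually, close it.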


7.5 Essential Socket Functions

  While FreeBSD offers different functions to work with sockets, we only need four to “open” a socket. And in some cases we only need two.


7.5.1 The Client-Server Difference

  Typically, one of the ends of a socket-based data communication is a server, the other is a client.


7.5.1.1 The Common Elements

7.5.1.1.1 socket

  The one function used by both clients and servers is socket(2). It is declared this way:

int socket(int domain, int type, int protocol);

  The return value is of the same type as that of open, an integer. FreeBSD allocates its value from the same pool as that of file handles. That is what allows sockets to be treated the same way as files.

  The domain argument tells the system what protocol family you want it to use. Many of them exist, some are vendor specific, others are very common. They are declared in sys/socket.h.

  Use PF_INET for UDP, TCP and other Internet protocols (IPv4).

  Five values are defined for the type argument, again, in sys/socket.h. All of them start with “SOCK_”. The most common one is SOCK_STREAM, which tells the system you are asking for a reliable stream delivery service (which is TCP when used with PF_INET).

  If you asked for SOCK_DGRAM, you would be requesting a connectionless datagram delivery service (in our case, UDP).

  If you wanted to be in charge of the low-level protocols (such as IP), or even network interfaces (e.g., the Ethernet), you would need to specify SOCK_RAW.

  Finally, the protocol argument depends on the previous two arguments, and is not always meaningful. In that case, use 0 for its value.

The Unconnected Socket: Nowhere in the socket function have we specified which other system we should be connected to. Our newly created socket remains unconnected.

This is on purpose: To use a telephone analogy, we have just attached a modem to the phone line. We have neither told the modem to make a call, nor to answer if the phone rings.


7.5.1.1.2 sockaddr

  Various functions of the sockets family expect the address of (or pointer to, to use C terminology) a small area of the memory. The various C declarations in the sys/socket.h refer to it as struct sockaddr. This structure is declared in the same file:

/*
 * Structure used by kernel to store most
 * addresses.
 */
struct sockaddr {
    unsigned char   sa_len;     /* total length */
    sa_family_t sa_family;  /* address family */
    char        sa_data[14];    /* actually longer; address value */
};
#define SOCK_MAXADDRLEN 255     /* longest possible addresses */

  Please note the vagueness with which the sa_data field is declared, just as an array of 14 bytes, with the comment hinting there can be more than 14 of them.

  This vagueness is quite deliberate. Sockets is a very powerful interface. While most people perhaps think of it as nothing more than the Internet interface──and most applications probably use it for that nowadays──sockets can be used for just about any kind of interprocess communications, of which the Internet (or, more precisely, IP) is only one.

  The sys/socket.h refers to the various types of protocols sockets will handle as address families, and lists them right before the definition of sockaddr:

/*
 * Address families.
 */
#define AF_UNSPEC   0       /* unspecified */
#define AF_LOCAL    1       /* local to host (pipes, portals) */
#define AF_UNIX     AF_LOCAL    /* backward compatibility */
#define AF_INET     2       /* internetwork: UDP, TCP, etc. */
#define AF_IMPLINK  3       /* arpanet imp addresses */
#define AF_PUP      4       /* pup protocols: e.g. BSP */
#define AF_CHAOS    5       /* mit CHAOS protocols */
#define AF_NS       6       /* XEROX NS protocols */
#define AF_ISO      7       /* ISO protocols */
#define AF_OSI      AF_ISO
#define AF_ECMA     8       /* European computer manufacturers */
#define AF_DATAKIT  9       /* datakit protocols */
#define AF_CCITT    10      /* CCITT protocols, X.25 etc */
#define AF_SNA      11      /* IBM SNA */
#define AF_DECnet   12      /* DECnet */
#define AF_DLI      13      /* DEC Direct data link interface */
#define AF_LAT      14      /* LAT */
#define AF_HYLINK   15      /* NSC Hyperchannel */
#define AF_APPLETALK    16      /* Apple Talk */
#define AF_ROUTE    17      /* Internal Routing Protocol */
#define AF_LINK     18      /* Link layer interface */
#define pseudo_AF_XTP   19      /* eXpress Transfer Protocol (no AF) */
#define AF_COIP     20      /* connection-oriented IP, aka ST II */
#define AF_CNT      21      /* Computer Network Technology */
#define pseudo_AF_RTIP  22      /* Help Identify RTIP packets */
#define AF_IPX      23      /* Novell Internet Protocol */
#define AF_SIP      24      /* Simple Internet Protocol */
#define pseudo_AF_PIP   25      /* Help Identify PIP packets */
#define AF_ISDN     26      /* Integrated Services Digital Network*/
#define AF_E164     AF_ISDN     /* CCITT E.164 recommendation */
#define pseudo_AF_KEY   27      /* Internal key-management function */
#define AF_INET6    28      /* IPv6 */
#define AF_NATM     29      /* native ATM access */
#define AF_ATM      30      /* ATM */
#define pseudo_AF_HDRCMPLT 31       /* Used by BPF to not rewrite headers
                     * in interface output routine
                     */
#define AF_NETGRAPH 32      /* Netgraph sockets */
#define AF_SLOW     33      /* 802.3ad slow protocol */
#define AF_SCLUSTER 34      /* Sitara cluster protocol */
#define AF_ARP      35
#define AF_BLUETOOTH    36      /* Bluetooth sockets */
#define AF_MAX      37

  The one used for IP is AF_INET. It is a symbol for the constant 2.

  It is the address family listed in the sa_family field of sockaddr that decides how exactly the vaguely named bytes of sa_data will be used.

  Specifically, whenever the address family is AF_INET, we can use struct sockaddr_in found in netinet/in.h, wherever sockaddr is expected:

/*
 * Socket address, internet style.
 */
struct sockaddr_in {
    uint8_t     sin_len;
    sa_family_t sin_family;
    in_port_t   sin_port;
    struct  in_addr sin_addr;
    char    sin_zero[8];
};

  We can visualize its organization this way:

  The three important fields are sin_family, which is byte 1 of the structure, sin_port, a 16-bit value found in bytes 2 and 3, and sin_addr, a 32-bit integer representation of the IP address, stored in bytes 4-7.

  Now, let us try to fill it out. Let us assume we are trying to write a client for the daytime protocol, which simply states that its server will write a text string representing the current date and time to port 13. We want to use TCP/IP, so we need to specify AF_INET in the address family field. AF_INET is defined as 2. Let us use the IP address of 192.43.244.18, which is the time server of US federal government (time.nist.gov).

  By the way the sin_addr field is declared as being of the struct in_addr type, which is defined in netinet/in.h:

/*
 * Internet address (a structure for historical reasons)
 */
struct in_addr {
    in_addr_t s_addr;
};

  In addition, in_addr_t is a 32-bit integer.

  The 192.43.244.18 is just a convenient notation of expressing a 32-bit integer by listing all of its 8-bit bytes, starting with the most significant one.

  So far, we have viewed sockaddr as an abstraction. Our computer does not store short integers as a single 16-bit entity, but as a sequence of 2 bytes. Similarly, it stores 32-bit integers as a sequence of 4 bytes.

  Suppose we coded something like this:

    sa.sin_family      = AF_INET;
    sa.sin_port        = 13;
    sa.sin_addr.s_addr = (((((192U << 8) | 43) << 8) | 244) << 8) | 18;

  What would the result look like?

  Well, that depends, of course. On a Pentium®, or other x86, based computer, it would look like this:

  On a different system, it might look like this:

  And on a PDP it might look different yet. But the above two are the most common ways in use today.

  Ordinarily, wanting to write portable code, programmers pretend that these differences do not exist. And they get away with it (except when they code in assembly language). Alas, you cannot get away with it that easily when coding for sockets.

  Why?

  Because when communicating with another computer, you usually do not know whether it stores data most significant byte (MSB) or least significant byte (LSB) first.

  You might be wondering, “So, will sockets not handle it for me?”

  It will not.

  While that answer may surprise you at first, remember that the general sockets interface only understands the sa_len and sa_family fields of the sockaddr structure. You do not have to worry about the byte order there (of course, on FreeBSD sa_family is only 1 byte anyway, but many other UNIX systems do not have sa_len and use 2 bytes for sa_family, and expect the data in whatever order is native to the computer).

  But the rest of the data is just sa_data[14] as far as sockets goes. Depending on the address family, sockets just forwards that data to its destination.

  Indeed, when we enter a port number, it is because we want the other computer to know what service we are asking for. And, when we are the server, we read the port number so we know what service the other computer is expecting from us. Either way, sockets only has to forward the port number as data. It does not interpret it in any way.

  Similarly, we enter the IP address to tell everyone on the way where to send our data to. Sockets, again, only forwards it as data.

  That is why we (the programmers, not the sockets) have to distinguish between the byte order used by our computer and the conventional byte order in which the data is sent to the other computer.

  We will call the byte order our computer uses the host byte order, or just the host order.

  There is a convention of sending the multi-byte data over IP MSB first. This, we will refer to as the network byte order, or simply the network order.

  Now, if we compiled the above code for an Intel based computer, our host byte order would produce:

  But the network byte order requires that we store the data MSB first:

  Unfortunately, our host order is the exact opposite of the network order.

  We have several ways of dealing with it. One would be to reverse the values in our code:

    sa.sin_family      = AF_INET;
    sa.sin_port        = 13 << 8;
    sa.sin_addr.s_addr = (((((18 << 8) | 244) << 8) | 43) << 8) | 192;

  This will trick our compiler into storing the data in the network byte order. In some cases, this is exactly the way to do it (e.g., when programming in assembly language). In most cases, however, it can cause a problem.

  Suppose, you wrote a sockets-based program in C. You know it is going to run on a Pentium, so you enter all your constants in reverse and force them to the network byte order. It works well.

  Then, some day, your trusted old Pentium becomes a rusty old Pentium. You replace it with a system whose host order is the same as the network order. You need to recompile all your software. All of your software continues to perform well, except the one program you wrote.

  You have since forgotten that you had forced all of your constants to the opposite of the host order. You spend some quality time tearing out your hair, calling the names of all gods you ever heard of (and some you made up), hitting your monitor with a nerf bat, and performing all the other traditional ceremonies of trying to figure out why something that has worked so well is suddenly not working at all.

  Eventually, you figure it out, say a couple of swear words, and start rewriting your code.

  Luckily, you are not the first one to face the problem. Someone else has created the htons(3) and htonl(3) C functions to convert a short and long respectively from the host byte order to the network byte order, and the ntohs(3) and ntohl(3) C functions to go the other way.

  On MSB-first systems these functions do nothing. On LSB-first systems they convert values to the proper order.

  So, regardless of what system your software is compiled on, your data will end up in the correct order if you use these functions.


7.5.1.2 Client Functions

  Typically, the client initiates the connection to the server. The client knows which server it is about to call: It knows its IP address, and it knows the port the server resides at. It is akin to you picking up the phone and dialing the number (the address), then, after someone answers, asking for the person in charge of wingdings (the port).


7.5.1.2.1 connect

  Once a client has created a socket, it needs to connect it to a specific port on a remote system. It uses connect(2):

int connect(int s, const struct sockaddr *name, socklen_t namelen);

  The s argument is the socket, i.e., the value returned by the socket function. The name is a pointer to sockaddr, the structure we have talked about extensively. Finally, namelen informs the system how many bytes are in our sockaddr structure.

  If connect is successful, it returns 0. Otherwise it returns -1 and stores the error code in errno.

  There are many reasons why connect may fail. For example, when attempting an Internet connection, the IP address may not exist, or the host may be down or just too busy, or it may not have a server listening at the specified port. Or it may outright refuse the connection.


7.5.1.2.2 Our First Client

  We now know enough to write a very simple client, one that will get current time from 192.43.244.18 and print it to stdout.

/*
 * daytime.c
 *
 * Programmed by G. Adam Stanislav
 */
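#include <stdio.h>          /* perror, BUFSIZ */
#include <strings.h>        /* bzero */
#include <unistd.h>         /* read, write, close */
#include <sys/types.h>
#include <sys/socket.h>     /* socket, connect */
#include <netinet/in.h>     /* struct sockaddr_in */
#include <arpa/inet.h>      /* htons, htonl */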

int main() {
  register int s;
  register int bytes;
  struct sockaddr_in sa;
  char buffer[BUFSIZ+1];

  if ((s = socket(PF_INET, SOCK_STREAM, 0)) < 0) {
    perror("socket");
    return 1;
  }

  bzero(&sa, sizeof sa);

  sa.sin_family = AF_INET;
  sa.sin_port = htons(13);
  sa.sin_addr.s_addr = htonl((((((192U << 8) | 43) << 8) | 244) << 8) | 18);
  if (connect(s, (struct sockaddr *)&sa, sizeof sa) < 0) {
    perror("connect");
    close(s);
    return 2;
  }

  while ((bytes = read(s, buffer, BUFSIZ)) > 0)
    write(1, buffer, bytes);

  close(s);
  return 0;
}

  Go ahead, enter it in your editor, save it as daytime.c, then compile and run it:

% cc -O3 -o daytime daytime.c
% ./daytime

52079 01-06-19 02:29:25 50 0 1 543.9 UTC(NIST) * 
%

  In this case, the date was June 19, 2001, the time was 02:29:25 UTC. Naturally, your results will vary.


7.5.1.3 Server Functions

  The typical server does not initiate the connection. Instead, it waits for a client to call it and request services. It does not know when the client will call, nor how many clients will call. It may be just sitting there, waiting patiently, one moment. The next moment, it can find itself swamped with requests from a number of clients, all calling in at the same time.

  The sockets interface offers three basic functions to handle this.


7.5.1.3.1 bind

  Ports are like extensions to a phone line: After you dial a number, you dial the extension to get to a specific person or department.

  There are 65535 IP ports, but a server usually processes requests that come in on only one of them. It is like telling the phone room operator that we are now at work and available to answer the phone at a specific extension. We use bind(2) to tell sockets which port we want to serve.

int bind(int s, const struct sockaddr *addr, socklen_t addrlen);

  Besides specifying the port in addr, the server may include its IP address. However, it can just use the symbolic constant INADDR_ANY to indicate it will serve all requests to the specified port regardless of what its IP address is. This symbol, along with several similar ones, is declared in netinet/in.h:

#define    INADDR_ANY      (u_int32_t)0x00000000

  Suppose we were writing a server for the daytime protocol over TCP/IP. Recall that it uses port 13. Our sockaddr_in structure would look like this:


7.5.1.3.2 listen

  To continue our office phone analogy, after you have told the phone central operator what extension you will be at, you now walk into your office, and make sure your own phone is plugged in and the ringer is turned on. Plus, you make sure your call waiting is activated, so you can hear the phone ring even while you are talking to someone.

  The server ensures all of that with the listen(2) function.

int listen(int s, int backlog);

  In here, the backlog variable tells sockets how many incoming requests to accept while you are busy processing the last request. In other words, it determines the maximum size of the queue of pending connections.


7.5.1.3.3 accept

  After you hear the phone ringing, you accept the call by answering the call. You have now established a connection with your client. This connection remains active until either you or your client hang up.

  The server accepts the connection by using the accept(2) function.

int accept(int s, struct sockaddr *addr, socklen_t *addrlen);

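  Note that this time addrlen is a pointer. This is necessary because in this case it is the sockets interface that fills out addr, the sockaddr_in structure: we pass in the size of our structure, and accept stores the number of bytes it actually used.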

  The return value is an integer. Indeed, the accept returns a new socket. You will use this new socket to communicate with the client.

  What happens to the old socket? It continues to listen for more requests (remember the backlog variable we passed to listen?) until we close it.

  Now, the new socket is meant only for communications. It is fully connected. We cannot pass it to listen again, trying to accept additional connections.


7.5.1.3.4 Our First Server

  Our first server will be somewhat more complex than our first client was: Not only do we have more sockets functions to use, but we need to write it as a daemon.

  This is best achieved by creating a child process after binding the port. The main process then exits and returns control to the shell (or whatever program invoked it).

  The child calls listen, then starts an endless loop, which accepts a connection, serves it, and eventually closes its socket.

/*
 * daytimed - a port 13 server
 *
 * Programmed by G. Adam Stanislav
 * June 19, 2001
 */
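#include <stdio.h>          /* perror, fdopen, fprintf, fclose */
#include <strings.h>        /* bzero */
#include <time.h>           /* time, gmtime */
#include <unistd.h>         /* fork, close */
#include <sys/types.h>
#include <sys/socket.h>     /* socket, bind, listen, accept */
#include <netinet/in.h>     /* struct sockaddr_in, INADDR_ANY */
#include <arpa/inet.h>      /* htons, htonl */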

#define BACKLOG 4

int main() {
    register int s, c;
    socklen_t b;    /* accept() expects a socklen_t pointer */
    struct sockaddr_in sa;
    time_t t;
    struct tm *tm;
    FILE *client;

    if ((s = socket(PF_INET, SOCK_STREAM, 0)) < 0) {
        perror("socket");
        return 1;
    }

    bzero(&sa, sizeof sa);

    sa.sin_family = AF_INET;
    sa.sin_port   = htons(13);

    if (INADDR_ANY)
        sa.sin_addr.s_addr = htonl(INADDR_ANY);

    if (bind(s, (struct sockaddr *)&sa, sizeof sa) < 0) {
        perror("bind");
        return 2;
    }

    switch (fork()) {
        case -1:
            perror("fork");
            return 3;
            break;
        default:
            close(s);
            return 0;
            break;
        case 0:
            break;
    }

    listen(s, BACKLOG);

    for (;;) {
        b = sizeof sa;

        if ((c = accept(s, (struct sockaddr *)&sa, &b)) < 0) {
            perror("daytimed accept");
            return 4;
        }

        if ((client = fdopen(c, "w")) == NULL) {
            perror("daytimed fdopen");
            return 5;
        }

        if ((t = time(NULL)) < 0) {
            perror("daytimed time");

            return 6;
        }

        tm = gmtime(&t);
        fprintf(client, "%.4i-%.2i-%.2iT%.2i:%.2i:%.2iZ\n",
            tm->tm_year + 1900,
            tm->tm_mon + 1,
            tm->tm_mday,
            tm->tm_hour,
            tm->tm_min,
            tm->tm_sec);

        fclose(client);
    }
}

  We start by creating a socket. Then we fill out the sockaddr_in structure in sa. Note the conditional use of INADDR_ANY:

    if (INADDR_ANY)
        sa.sin_addr.s_addr = htonl(INADDR_ANY);

  Its value is 0. Since we have just used bzero on the entire structure, it would be redundant to set it to 0 again. But if we port our code to some other system where INADDR_ANY is perhaps not a zero, we need to assign it to sa.sin_addr.s_addr. Most modern C compilers are clever enough to notice that INADDR_ANY is a constant. As long as it is a zero, they will optimize the entire conditional statement out of the code.

  After we have called bind successfully, we are ready to become a daemon: We use fork to create a child process. In both the parent and the child, the s variable is our socket. The parent process will not need it, so it calls close, then it returns 0 to inform its own parent it has terminated successfully.

  Meanwhile, the child process continues working in the background. It calls listen and sets its backlog to 4. It does not need a large value here because daytime is not a protocol many clients request all the time, and because it can process each request instantly anyway.

  Finally, the daemon starts an endless loop, which performs the following steps:

  1. Call accept. It waits here until a client contacts it. At that point, it receives a new socket, c, which it can use to communicate with this particular client.

  2. It uses the C function fdopen to turn the socket from a low-level file descriptor to a C-style FILE pointer. This will allow the use of fprintf later on.

  3. It checks the time, and prints it in the ISO 8601 format to the client “file”. It then uses fclose to close the file. That will automatically close the socket as well.

  We can generalize this, and use it as a model for many other servers:

  This flowchart is good for sequential servers, i.e., servers that can serve one client at a time, just as we were able to with our daytime server. This is only possible whenever there is no real “conversation” going on between the client and the server: As soon as the server detects a connection to the client, it sends out some data and closes the connection. The entire operation takes only a brief moment, and it is finished.

  The advantage of this flowchart is that, except for the brief moment after the parent forks and before it exits, there is always only one process active: Our server does not take up much memory and other system resources.

  Note that we have added initialize daemon in our flowchart. We did not need to initialize our own daemon, but this is a good place in the flow of the program to set up any signal handlers, open any files we may need, etc.

  Just about everything in the flow chart can be used literally on many different servers. The serve entry is the exception. We think of it as a “black box”, i.e., something you design specifically for your own server, and just “plug it into the rest.”

  Not all protocols are that simple. Many receive a request from the client, reply to it, then receive another request from the same client. Because of that, they do not know in advance how long they will be serving the client. Such servers usually start a new process for each client. While the new process is serving its client, the daemon can continue listening for more connections.

  Now, go ahead, save the above source code as daytimed.c (it is customary to end the names of daemons with the letter d). After you have compiled it, try running it:

% ./daytimed
bind: Permission denied
%

  What happened here? As you will recall, the daytime protocol uses port 13. But all ports below 1024 are reserved for the superuser (otherwise, anyone could start a daemon pretending to serve a commonly used port, while causing a security breach).

  Try again, this time as the superuser:

# ./daytimed
#

  What... Nothing? Let us try again:

# ./daytimed

bind: Address already in use
#

  Every port can only be bound by one program at a time. Our first attempt was indeed successful: It started the child daemon and returned quietly. It is still running and will continue to run until you kill it, one of its system calls fails, or you reboot the system.

  Fine, we know it is running in the background. But is it working? How do we know it is a proper daytime server? Simple:

% telnet localhost 13

Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
2001-06-19T21:04:42Z
Connection closed by foreign host.
%

  telnet tried the new IPv6, and failed. It retried with IPv4 and succeeded. The daemon works.

  If you have access to another UNIX system via telnet, you can use it to test accessing the server remotely. My computer does not have a static IP address, so this is what I did:

% who

whizkid          ttyp0   Jun 19 16:59   (216.127.220.143)
xxx              ttyp1   Jun 19 16:06   (xx.xx.xx.xx)
% telnet 216.127.220.143 13

Trying 216.127.220.143...
Connected to r47.bfm.org.
Escape character is '^]'.
2001-06-19T21:31:11Z
Connection closed by foreign host.
%

  Again, it worked. Will it work using the domain name?

% telnet r47.bfm.org 13

Trying 216.127.220.143...
Connected to r47.bfm.org.
Escape character is '^]'.
2001-06-19T21:31:40Z
Connection closed by foreign host.
%

  By the way, telnet prints the Connection closed by foreign host message after our daemon has closed the socket. This shows us that, indeed, using fclose(client); in our code works as advertised.


7.6 Helper Functions

  The FreeBSD C library contains many helper functions for sockets programming. For example, in our sample client we hard coded the time.nist.gov IP address. But we do not always know the IP address. Even if we do, our software is more flexible if it allows the user to enter the IP address, or even the domain name.


7.6.1 gethostbyname

  While there is no way to pass the domain name directly to any of the sockets functions, the FreeBSD C library comes with the gethostbyname(3) and gethostbyname2(3) functions, declared in netdb.h.

struct hostent * gethostbyname(const char *name);
struct hostent * gethostbyname2(const char *name, int af);

  Both return a pointer to the hostent structure, with much information about the domain. For our purposes, the h_addr_list[0] field of the structure points at h_length bytes of the correct address, already stored in the network byte order.

  This allows us to create a much more flexible──and much more useful──version of our daytime program:

/*
 * daytime.c
 *
 * Programmed by G. Adam Stanislav
 * 19 June 2001
 */
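#include <stdio.h>          /* perror, BUFSIZ */
#include <strings.h>        /* bzero, bcopy */
#include <unistd.h>         /* read, write, close */
#include <sys/types.h>
#include <sys/socket.h>     /* socket, connect */
#include <netinet/in.h>     /* struct sockaddr_in */
#include <arpa/inet.h>      /* htons */
#include <netdb.h>          /* gethostbyname, herror */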

int main(int argc, char *argv[]) {
  register int s;
  register int bytes;
  struct sockaddr_in sa;
  struct hostent *he;
  char buf[BUFSIZ+1];
  char *host;

  if ((s = socket(PF_INET, SOCK_STREAM, 0)) < 0) {
    perror("socket");
    return 1;
  }

  bzero(&sa, sizeof sa);

  sa.sin_family = AF_INET;
  sa.sin_port = htons(13);

  host = (argc > 1) ? (char *)argv[1] : "time.nist.gov";

  if ((he = gethostbyname(host)) == NULL) {
    herror(host);   /* gethostbyname reports errors in h_errno, not errno */
    return 2;
  }

  bcopy(he->h_addr_list[0],&sa.sin_addr, he->h_length);

  if (connect(s, (struct sockaddr *)&sa, sizeof sa) < 0) {
    perror("connect");
    return 3;
  }

  while ((bytes = read(s, buf, BUFSIZ)) > 0)
    write(1, buf, bytes);

  close(s);
  return 0;
}

  We now can type a domain name (or an IP address, it works both ways) on the command line, and the program will try to connect to its daytime server. Otherwise, it will still default to time.nist.gov. However, even in this case we will use gethostbyname rather than hard coding 192.43.244.18. That way, even if its IP address changes in the future, we will still find it.

  Since it takes virtually no time to get the time from your local server, you could run daytime twice in a row: First to get the time from time.nist.gov, the second time from your own system. You can then compare the results and see how exact your system clock is:

% daytime ; daytime localhost


52080 01-06-20 04:02:33 50 0 0 390.2 UTC(NIST) * 
2001-06-20T04:02:35Z
%

  As you can see, my system was two seconds ahead of the NIST time.


7.6.2 getservbyname

  Sometimes you may not be sure what port a certain service uses. The getservbyname(3) function, also declared in netdb.h, comes in very handy in those cases:

struct servent * getservbyname(const char *name, const char *proto);

  The servent structure contains the s_port, which contains the proper port, already in network byte order.

  Had we not known the correct port for the daytime service, we could have found it this way:

  struct servent *se;
  ...
  if ((se = getservbyname("daytime", "tcp")) == NULL) {
    fprintf(stderr, "Cannot determine which port to use.\n");
    return 7;
  }
  sa.sin_port = se->s_port;

  You usually do know the port. But if you are developing a new protocol, you may be testing it on an unofficial port. Some day, you will register the protocol and its port (if nowhere else, at least in your /etc/services, which is where getservbyname looks). Instead of returning an error in the above code, you just use the temporary port number. Once you have listed the protocol in /etc/services, your software will find its port without you having to rewrite the code.


7.7 Concurrent Servers

  Unlike a sequential server, a concurrent server has to be able to serve more than one client at a time. For example, a chat server may be serving a specific client for hours──it cannot wait till it stops serving a client before it serves the next one.

  This requires a significant change in our flowchart:

  We moved the serve from the daemon process to its own server process. However, because each child process inherits all open files (and a socket is treated just like a file), the new process inherits not only the “accepted handle,” i.e., the socket returned by the accept call, but also the top socket, i.e., the one opened by the top process right at the beginning.

  However, the server process does not need this socket and should close it immediately. Similarly, the daemon process no longer needs the accepted socket, and not only should, but must close it──otherwise, it will run out of available file descriptors sooner or later.

  After the server process is done serving, it should close the accepted socket. Instead of returning to accept, it now exits.

  Under UNIX, a process does not really exit. Instead, it returns to its parent. Typically, a parent process waits for its child process, and obtains a return value. However, our daemon process cannot simply stop and wait. That would defeat the whole purpose of creating additional processes. But if it never does wait, its children will become zombies──no longer functional but still roaming around.

  For that reason, the daemon process needs to set signal handlers in its initialize daemon phase. At least a SIGCHLD signal has to be processed, so the daemon can remove the zombie return values from the system and release the system resources they are taking up.

  That is why our flowchart now contains a process signals box, which is not connected to any other box. By the way, many servers also process SIGHUP, and typically interpret it as a signal from the superuser that they should reread their configuration files. This allows us to change settings without having to kill and restart these servers.


Chapter 8  IPv6 Internals

8.1 IPv6/IPsec Implementation

Contributed by Yoshinobu Inoue.

  This section explains the implementation internals related to IPv6 and IPsec. This functionality is derived from the KAME project.

8.1.1 IPv6

8.1.1.1 Conformance

  The IPv6-related functions conform, or try to conform, to the latest set of IPv6 specifications. For future reference we list some of the relevant documents below (NOTE: this is not a complete list, as a complete one would be too hard to maintain).

  For details please refer to specific chapter in the document, RFCs, manual pages, or comments in the source code.

  Conformance tests have been performed on the KAME STABLE kit at TAHI project. Results can be viewed at http://www.tahi.org/report/KAME/. We also attended Univ. of New Hampshire IOL tests (http://www.iol.unh.edu/) in the past, with our past snapshots.

  • RFC1639: FTP Operation Over Big Address Records (FOOBAR)

    • RFC2428 is preferred over RFC1639. FTP clients will first try RFC2428, then RFC1639 if that fails.

  • RFC1886: DNS Extensions to support IPv6

  • RFC1933: Transition Mechanisms for IPv6 Hosts and Routers

    • IPv4 compatible address is not supported.

    • automatic tunneling (described in 4.3 of this RFC) is not supported.

    • The gif(4) interface implements IPv[46]-over-IPv[46] tunnels in a generic way, and it covers the "configured tunnel" described in the spec. See 8.1.1.5 in this document for details.

  • RFC1981: Path MTU Discovery for IPv6

  • RFC2080: RIPng for IPv6

    • usr.sbin/route6d supports this.

  • RFC2292: Advanced Sockets API for IPv6

    • For supported library functions/kernel APIs, see sys/netinet6/ADVAPI.

  • RFC2362: Protocol Independent Multicast-Sparse Mode (PIM-SM)

    • RFC2362 defines packet formats for PIM-SM. draft-ietf-pim-ipv6-01.txt is written based on this.

  • RFC2373: IPv6 Addressing Architecture

    • supports node required addresses, and conforms to the scope requirement.

  • RFC2374: An IPv6 Aggregatable Global Unicast Address Format

    • supports 64-bit length of Interface ID.

  • RFC2375: IPv6 Multicast Address Assignments

    • Userland applications use the well-known addresses assigned in the RFC.

  • RFC2428: FTP Extensions for IPv6 and NATs

    • RFC2428 is preferred over RFC1639. FTP clients will first try RFC2428, then RFC1639 if that fails.

  • RFC2460: IPv6 specification

  • RFC2461: Neighbor discovery for IPv6

    • See 8.1.1.2 in this document for details.

  • RFC2462: IPv6 Stateless Address Autoconfiguration

    • See 8.1.1.4 in this document for details.

  • RFC2463: ICMPv6 for IPv6 specification

    • See 8.1.1.9 in this document for details.

  • RFC2464: Transmission of IPv6 Packets over Ethernet Networks

  • RFC2465: MIB for IPv6: Textual Conventions and General Group

    • Necessary statistics are gathered by the kernel. Actual IPv6 MIB support is provided as a patchkit for ucd-snmp.

  • RFC2466: MIB for IPv6: ICMPv6 group

    • Necessary statistics are gathered by the kernel. Actual IPv6 MIB support is provided as a patchkit for ucd-snmp.

  • RFC2467: Transmission of IPv6 Packets over FDDI Networks

  • RFC2497: Transmission of IPv6 Packets over ARCnet Networks

  • RFC2553: Basic Socket Interface Extensions for IPv6

    • IPv4 mapped address (3.7) and special behavior of IPv6 wildcard bind socket (3.8) are supported. See 8.1.1.12 in this document for details.

  • RFC2675: IPv6 Jumbograms

    • See 8.1.1.7 in this document for details.

  • RFC2710: Multicast Listener Discovery for IPv6

  • RFC2711: IPv6 router alert option

  • draft-ietf-ipngwg-router-renum-08: Router renumbering for IPv6

  • draft-ietf-ipngwg-icmp-namelookups-02: IPv6 Name Lookups Through ICMP

  • draft-ietf-ipngwg-icmp-name-lookups-03: IPv6 Name Lookups Through ICMP

  • draft-ietf-pim-ipv6-01.txt: PIM for IPv6

    • pim6dd(8) implements dense mode. pim6sd(8) implements sparse mode.

  • draft-itojun-ipv6-tcp-to-anycast-00: Disconnecting TCP connection toward IPv6 anycast address

  • draft-yamamoto-wideipv6-comm-model-00

    • See 8.1.1.6 in this document for details.

  • draft-ietf-ipngwg-scopedaddr-format-00.txt : An Extension of Format for IPv6 Scoped Addresses


8.1.1.2 Neighbor Discovery

  Neighbor Discovery is fairly stable. Currently Address Resolution, Duplicate Address Detection (DAD), and Neighbor Unreachability Detection are supported. In the near future we will be adding Proxy Neighbor Advertisement support in the kernel and an Unsolicited Neighbor Advertisement transmission command as an admin tool.

  If DAD fails, the address will be marked "duplicated" and a message will be generated to syslog (and usually to the console). The "duplicated" mark can be checked with ifconfig(8). It is the administrator's responsibility to check for and recover from DAD failures. The behavior should be improved in the near future.

  Some network drivers loop multicast packets back to themselves, even if instructed not to do so (especially in promiscuous mode). In such cases DAD may fail, because the DAD engine sees an inbound NS packet (actually from the node itself) and considers it a sign of a duplicate. You may want to look at the #if condition marked "heuristics" in sys/netinet6/nd6_nbr.c:nd6_dad_timer() as a workaround (note that the code fragment in the "heuristics" section is not spec conformant).

  The Neighbor Discovery specification (RFC2461) does not talk about neighbor cache handling in the following cases:

  1. when there was no neighbor cache entry, and the node received an unsolicited RS/NS/NA/redirect packet without a link-layer address

  2. neighbor cache handling on a medium without link-layer addresses (we need a neighbor cache entry for the IsRouter bit)

  For the first case, we implemented a workaround based on discussions on the IETF ipngwg mailing list. For more details, see the comments in the source code and the email thread started from (IPng 7155), dated Feb 6 1999.

  The IPv6 on-link determination rule (RFC2461) is quite different from the assumptions in the BSD network code. At this moment, no on-link determination rule is supported where the default router list is empty (RFC2461, section 5.2, last sentence in 2nd paragraph; note that the spec misuses the words "host" and "node" in several places in the section).

  To avoid possible DoS attacks and infinite loops, only 10 options on an ND packet are accepted now. Therefore, if you have 20 prefix options attached to an RA, only the first 10 prefixes will be recognized. If this troubles you, please ask about it on the FREEBSD-CURRENT mailing list and/or modify nd6_maxndopt in sys/netinet6/nd6.c. If there is high demand we may provide a sysctl knob for the variable.
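  The effect of such a cap can be sketched in userland as follows. This is not the kernel's code: count_nd_options and MAX_ND_OPTIONS are hypothetical names for this example. It relies only on the ND option layout from RFC2461 (one type octet, then one length octet counted in units of 8 octets, where a length of zero is invalid).

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ND_OPTIONS 10	/* mirrors the limit of 10 described above */

/*
 * Walk the options area of an ND packet.  Returns the number of options
 * recognized (at most MAX_ND_OPTIONS), or -1 if the area is malformed.
 */
int
count_nd_options(const uint8_t *opts, size_t len)
{
	size_t off = 0;
	int n = 0;

	while (off < len && n < MAX_ND_OPTIONS) {
		size_t optlen;

		if (len - off < 2)
			return (-1);		/* truncated option header */
		optlen = (size_t)opts[off + 1] * 8;	/* length is in 8-octet units */
		if (optlen == 0 || optlen > len - off)
			return (-1);		/* zero length would loop forever */
		off += optlen;
		n++;
	}
	return (n);
}
```

  With 20 prefix options in the buffer, the walk stops after the tenth, which matches the behavior described above; the zero-length check is what guards against the infinite loop.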


8.1.1.3 Scope Index

  IPv6 uses scoped addresses. Therefore, it is very important to specify a scope index (interface index for a link-local address, or site index for a site-local address) with an IPv6 address. Without a scope index, a scoped IPv6 address is ambiguous to the kernel, and the kernel will not be able to determine the outbound interface for a packet.

  Ordinary userland applications should use the advanced API (RFC2292) to specify the scope index, or interface index. For a similar purpose, the sin6_scope_id member of the sockaddr_in6 structure is defined in RFC2553. However, the semantics of sin6_scope_id are rather vague. If you care about the portability of your application, we suggest that you use the advanced API rather than sin6_scope_id.

  In the kernel, the interface index for a link-local scoped address is embedded into the 2nd 16-bit word (3rd and 4th bytes) of the IPv6 address. For example, you may see something like:

   fe80:1::200:f8ff:fe01:6317
   

  in the routing table and interface address structure (struct in6_ifaddr). The address above is a link-local unicast address which belongs to a network interface whose interface index is 1. The embedded index enables us to identify IPv6 link-local addresses over multiple interfaces effectively and with only a small code change.
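  The embedding can be illustrated with a small userland sketch. The helper names embed_scope and extract_scope are hypothetical and are not the kernel's functions; the sketch only shows which bytes of the address carry the index.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>

/* Store ifindex in the 2nd 16-bit word (bytes 2 and 3), network byte order. */
void
embed_scope(struct in6_addr *a, uint16_t ifindex)
{
	a->s6_addr[2] = (uint8_t)(ifindex >> 8);
	a->s6_addr[3] = (uint8_t)(ifindex & 0xff);
}

/* Return the embedded index and clear it, yielding the on-the-wire form. */
uint16_t
extract_scope(struct in6_addr *a)
{
	uint16_t ifindex = (uint16_t)((a->s6_addr[2] << 8) | a->s6_addr[3]);

	a->s6_addr[2] = 0;
	a->s6_addr[3] = 0;
	return (ifindex);
}
```

  Applying embed_scope() with index 1 to fe80::200:f8ff:fe01:6317 gives the bytes that print as fe80:1::200:f8ff:fe01:6317; extract_scope() reverses the step.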

  Routing daemons and configuration programs, like route6d(8) and ifconfig(8), will need to manipulate the "embedded" scope index. These programs use routing sockets and ioctls (like SIOCGIFADDR_IN6), and the kernel API will return IPv6 addresses with the 2nd 16-bit word filled in. The APIs are for manipulating kernel internal structures. Programs that use these APIs have to be prepared for differences between kernels anyway.

  When you specify a scoped address on the command line, NEVER write the embedded form (such as ff02:1::1 or fe80:2::fedc). This is not supposed to work. Always use the standard form, like ff02::1 or fe80::fedc, with a command line option for specifying the interface (like ping6 -I ne0 ff02::1). In general, if a command does not have a command line option to specify the outgoing interface, that command is not ready to accept scoped addresses. This may seem to contradict IPv6's premise of supporting the "dentist office" situation. We believe that the specifications need some improvements for this.

  Some of the userland tools support extended numeric IPv6 syntax, as documented in draft-ietf-ipngwg-scopedaddr-format-00.txt. You can specify the outgoing link by using the name of the outgoing interface, like "fe80::1%ne0". This way you will be able to specify a link-local scoped address without much trouble.

  To use this extension in your program, you will need to use getaddrinfo(3), and getnameinfo(3) with NI_WITHSCOPEID. The implementation currently assumes a 1-to-1 relationship between a link and an interface, which is stronger than what the specs say.


8.1.1.4 Plug and Play

  Most of the IPv6 stateless address autoconfiguration is implemented in the kernel. Neighbor Discovery functions are implemented in the kernel as a whole. Router Advertisement (RA) input for hosts is implemented in the kernel. Router Solicitation (RS) output for endhosts, RS input for routers, and RA output for routers are implemented in the userland.


8.1.1.4.1 Assignment of link-local, and special addresses

  An IPv6 link-local address is generated from the IEEE802 address (Ethernet MAC address). Each interface is assigned an IPv6 link-local address automatically when the interface comes up (IFF_UP). Also, a direct route for the link-local address is added to the routing table.

  Here is sample output of the netstat command:

Internet6:
Destination                   Gateway                   Flags      Netif Expire
fe80:1::%ed0/64               link#1                    UC          ed0
fe80:2::%ep0/64               link#2                    UC          ep0

  Interfaces that have no IEEE802 address (pseudo interfaces like tunnel interfaces, or ppp interfaces) will borrow an IEEE802 address from other interfaces, such as Ethernet interfaces, whenever possible. If there is no IEEE802 hardware attached, a last-resort pseudo-random value, MD5(hostname), will be used as the source of the link-local address. If it is not suitable for your usage, you will need to configure the link-local address manually.

  If an interface is not capable of handling IPv6 (for example, because it lacks multicast support), a link-local address will not be assigned to that interface. See 8.1.2 for details.

  Each interface joins the solicited-node multicast address and the link-local all-nodes multicast address (e.g. ff02::1:ff01:6317 and ff02::1, respectively, on the link the interface is attached to). In addition to a link-local address, the loopback address (::1) will be assigned to the loopback interface. Also, ::1/128 and ff01::/32 are automatically added to the routing table, and the loopback interface joins the node-local multicast group ff01::1.
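  The two address derivations in this section can be sketched as below. This is illustrative userland code, not the kernel's in6_ifattach() logic; mac_to_linklocal and solicited_node are hypothetical names. It assumes the standard modified EUI-64 mapping for Ethernet (RFC2464) and the solicited-node format ff02::1:ffXX:XXXX (RFC2373).

```c
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* Build fe80::/64 + modified EUI-64 from a 48-bit IEEE802 address. */
void
mac_to_linklocal(const uint8_t mac[6], struct in6_addr *out)
{
	memset(out, 0, sizeof(*out));
	out->s6_addr[0] = 0xfe;
	out->s6_addr[1] = 0x80;
	out->s6_addr[8] = mac[0] ^ 0x02;	/* invert the universal/local bit */
	out->s6_addr[9] = mac[1];
	out->s6_addr[10] = mac[2];
	out->s6_addr[11] = 0xff;		/* ff:fe is inserted in the middle */
	out->s6_addr[12] = 0xfe;
	out->s6_addr[13] = mac[3];
	out->s6_addr[14] = mac[4];
	out->s6_addr[15] = mac[5];
}

/* Build ff02::1:ffXX:XXXX from the low 24 bits of a unicast address. */
void
solicited_node(const struct in6_addr *addr, struct in6_addr *out)
{
	memset(out, 0, sizeof(*out));
	out->s6_addr[0] = 0xff;
	out->s6_addr[1] = 0x02;
	out->s6_addr[11] = 0x01;
	out->s6_addr[12] = 0xff;
	memcpy(&out->s6_addr[13], &addr->s6_addr[13], 3);
}
```

  For the MAC address 00:00:f8:01:63:17 this yields the link-local address fe80::200:f8ff:fe01:6317 and the solicited-node group ff02::1:ff01:6317.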


8.1.1.4.2 Stateless address autoconfiguration on hosts

  In the IPv6 specification, nodes are separated into two categories: routers and hosts. Routers forward packets addressed to others; hosts do not forward packets. net.inet6.ip6.forwarding defines whether this node is a router or a host (router if it is 1, host if it is 0).

  When a host hears a Router Advertisement from a router, the host may autoconfigure itself by stateless address autoconfiguration. This behavior can be controlled by net.inet6.ip6.accept_rtadv (the host autoconfigures itself if it is set to 1). By autoconfiguration, the network address prefix for the receiving interface (usually a global address prefix) is added. A default route is also configured.

  Routers periodically generate Router Advertisement packets. To request an adjacent router to generate an RA packet, a host can transmit a Router Solicitation. To generate an RS packet at any time, use the rtsol command. The rtsold(8) daemon is also available. rtsold(8) generates Router Solicitations whenever necessary, and it works well for nomadic usage (notebooks/laptops). If one wishes to ignore Router Advertisements, use sysctl to set net.inet6.ip6.accept_rtadv to 0.

  To generate Router Advertisement from a router, use the rtadvd(8) daemon.

  Note that the IPv6 specification assumes the following items, and nonconforming cases are left unspecified:

  • Only hosts will listen to router advertisements

  • Hosts have single network interface (except loopback)

  Therefore, it is unwise to enable net.inet6.ip6.accept_rtadv on routers or multi-interface hosts. A misconfigured node can behave strangely (nonconforming configurations are allowed for those who would like to do some experiments).

  To summarize the sysctl knobs:

   accept_rtadv  forwarding  role of the node
   ------------  ----------  ----------------
   0             0           host (to be manually configured)
   0             1           router
   1             0           autoconfigured host
                             (spec assumes that host has single
                             interface only, autoconfigured host
                             with multiple interface is
                             out-of-scope)
   1             1           invalid, or experimental
                             (out-of-scope of spec)
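  The table above can be expressed as a small decision function. This is only a sketch; the enum and the function name classify_node are hypothetical, and the two arguments stand for the values of net.inet6.ip6.accept_rtadv and net.inet6.ip6.forwarding.

```c
enum node_role {
	ROLE_MANUAL_HOST,	/* accept_rtadv=0, forwarding=0 */
	ROLE_ROUTER,		/* accept_rtadv=0, forwarding=1 */
	ROLE_AUTOCONF_HOST,	/* accept_rtadv=1, forwarding=0 */
	ROLE_INVALID		/* accept_rtadv=1, forwarding=1 */
};

/* Map the two sysctl values to the role listed in the table. */
enum node_role
classify_node(int accept_rtadv, int forwarding)
{
	if (accept_rtadv)
		return (forwarding ? ROLE_INVALID : ROLE_AUTOCONF_HOST);
	return (forwarding ? ROLE_ROUTER : ROLE_MANUAL_HOST);
}
```

  On FreeBSD the two values could be read with sysctlbyname(3) before calling such a function; that step is left out here so the sketch stays portable.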

  RFC2462 has a validation rule against the incoming RA prefix information option, in 5.5.3 (e). This is to protect hosts from malicious (or misconfigured) routers that advertise a very short prefix lifetime. There was an update from Jim Bound to the ipngwg mailing list (look for "(ipng 6712)" in the archive), and Jim's update is implemented.

  See 8.1.1.2 in this document for the relationship between DAD and autoconfiguration.


8.1.1.5 Generic tunnel interface

  GIF (Generic InterFace) is a pseudo interface for configured tunnels. Details are described in gif(4). Currently

  • v6 in v6

  • v6 in v4

  • v4 in v6

  • v4 in v4

  are available. Use gifconfig(8) to assign physical (outer) source and destination addresses to gif interfaces. A configuration that uses the same address family for the inner and outer IP headers (v4 in v4, or v6 in v6) is dangerous. It is very easy to configure interfaces and routing tables to perform an infinite level of tunneling. Please be warned.

  gif can be configured to be ECN-friendly. See 23.5.4.5 for ECN-friendliness of tunnels, and gif(4) for how to configure.

  If you would like to configure an IPv4-in-IPv6 tunnel with a gif interface, read gif(4) carefully. You will need to remove the IPv6 link-local address automatically assigned to the gif interface.


8.1.1.6 Source Address Selection

  The current source selection rule is scope oriented (there are some exceptions; see below). For a given destination, a source IPv6 address is selected by the following rules:

  1. If the source address is explicitly specified by the user (e.g. via the advanced API), the specified address is used.

  2. If there is an address assigned to the outgoing interface (which is usually determined by looking up the routing table) that has the same scope as the destination address, the address is used.

    This is the most typical case.

  3. If there is no address that satisfies the above condition, choose a global address assigned to one of the interfaces on the sending node.

  4. If there is no address that satisfies the above condition, and destination address is site local scope, choose a site local address assigned to one of the interfaces on the sending node.

  5. If there is no address that satisfies the above condition, choose the address associated with the routing table entry for the destination. This is the last resort, which may cause scope violation.

  For instance, ::1 is selected for ff01::1, and fe80:1::200:f8ff:fe01:6317 for fe80:1::2a0:24ff:feab:839b (note that the embedded interface index, described in 8.1.1.3, helps us choose the right source address; those embedded indices will not be on the wire). If the outgoing interface has multiple addresses for the scope, a source is selected on a longest-match basis (rule 3). Suppose 3ffe:501:808:1:200:f8ff:fe01:6317 and 3ffe:2001:9:124:200:f8ff:fe01:6317 are given to the outgoing interface. 3ffe:501:808:1:200:f8ff:fe01:6317 is chosen as the source for the destination 3ffe:501:800::1.
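  The longest-match comparison in this example can be sketched as follows. This is a simplified illustration and not the kernel's selection code: it ignores scope and deprecation, and the names common_prefix_len and pick_source are hypothetical.

```c
#include <arpa/inet.h>	/* inet_pton(), for callers that parse text addresses */
#include <netinet/in.h>
#include <stddef.h>

/* Number of leading bits that two IPv6 addresses have in common (0..128). */
int
common_prefix_len(const struct in6_addr *a, const struct in6_addr *b)
{
	int bits = 0;

	for (size_t i = 0; i < sizeof(a->s6_addr); i++) {
		unsigned char diff = a->s6_addr[i] ^ b->s6_addr[i];

		if (diff == 0) {
			bits += 8;
			continue;
		}
		while ((diff & 0x80) == 0) {	/* count matching high bits */
			bits++;
			diff <<= 1;
		}
		break;
	}
	return (bits);
}

/* Index of the candidate sharing the longest prefix with dst, or -1. */
int
pick_source(const struct in6_addr *cand, size_t ncand,
    const struct in6_addr *dst)
{
	int best = -1, best_len = -1;

	for (size_t i = 0; i < ncand; i++) {
		int len = common_prefix_len(&cand[i], dst);

		if (len > best_len) {
			best_len = len;
			best = (int)i;
		}
	}
	return (best);
}
```

  For the destination 3ffe:501:800::1, the first candidate above shares 44 leading bits and the second only 18, so the first one is picked.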

  Note that the above rule is not documented in the IPv6 spec. It is considered an "up to implementation" item. There are some cases where we do not use the above rule. One example is a connected TCP session, where we use the address kept in the tcb as the source. Another example is the source address for a Neighbor Advertisement. Under the spec (RFC2461 7.2.2) the NA's source should be the target address of the corresponding NS. In this case we follow the spec rather than the above longest-match rule.

  For new connections (when rule 1 does not apply), deprecated addresses (addresses with preferred lifetime = 0) will not be chosen as the source address if other choices are available. If no other choices are available, a deprecated address will be used as a last resort. If there are multiple deprecated addresses to choose from, the above scope rule will be used to choose among them. If you would like to prohibit the use of deprecated addresses for some reason, set net.inet6.ip6.use_deprecated to 0. The issue related to deprecated addresses is described in RFC2462 5.5.4 (NOTE: there is some debate underway in IETF ipngwg on how to use "deprecated" addresses).


8.1.1.7 Jumbo Payload

  The Jumbo Payload hop-by-hop option is implemented and can be used to send IPv6 packets with payloads longer than 65,535 octets. But currently no physical interface whose MTU is more than 65,535 is supported, so such payloads can be seen only on the loopback interface (i.e. lo0).

  If you want to try jumbo payloads, you first have to reconfigure the kernel so that the MTU of the loopback interface is more than 65,535 bytes; add the following to the kernel configuration file:

   options "LARGE_LOMTU" #To test jumbo payload

  and recompile the new kernel.

  Then you can test jumbo payloads by the ping6(8) command with -b and -s options. The -b option must be specified to enlarge the size of the socket buffer and the -s option specifies the length of the packet, which should be more than 65,535. For example, type as follows:

% ping6 -b 70000 -s 68000 ::1

  The IPv6 specification requires that the Jumbo Payload option must not be used in a packet that carries a fragment header. If this condition is broken, an ICMPv6 Parameter Problem message must be sent to the sender. The specification is followed, but you cannot usually see an ICMPv6 error caused by this requirement.

  When an IPv6 packet is received, the frame length is checked and compared to the length specified in the payload length field of the IPv6 header or in the value of the Jumbo Payload option, if any. If the former is shorter than the latter, the packet is discarded and statistics are incremented. You can see the statistics in the output of the netstat(8) command with the `-s -p ip6' option:

% netstat -s -p ip6
      ip6:
        (snip)
        1 with data size < data length

  So, the kernel does not send an ICMPv6 error unless the erroneous packet is an actual Jumbo Payload, that is, its packet size is more than 65,535 bytes. As described above, currently no physical interface with such a huge MTU is supported, so it rarely returns an ICMPv6 error.

  TCP/UDP over jumbogram is not supported at this moment. This is because we have no medium (other than loopback) to test this. Contact us if you need this.

  IPsec does not work on jumbograms. This is due to some specification twists in supporting AH with jumbograms (the AH header size influences the payload length, and this makes it really hard to authenticate an inbound packet with the jumbo payload option as well as AH).

  There are fundamental issues in *BSD support for jumbograms. We would like to address those, but we need more time to finalize these. To name a few:

  • The mbuf pkthdr.len field is typed as "int" in 4.4BSD, so it will not hold a jumbogram with len > 2G on 32-bit architecture CPUs. If we would like to support jumbograms properly, the field must be expanded to hold 4G + IPv6 header + link-layer header. Therefore, it must be expanded to at least int64_t (u_int32_t is NOT enough).

  • We mistakenly use "int" to hold the packet length in many places. We need to convert them to a larger integral type. This needs great care, as we may experience overflow during packet length computation.

  • We mistakenly check the ip6_plen field of the IPv6 header for the packet payload length in various places. We should be checking mbuf pkthdr.len instead. ip6_input() will perform a sanity check on the jumbo payload option on input, and we can safely use mbuf pkthdr.len afterwards.

  • TCP code needs a careful update in a bunch of places, of course.


8.1.1.8 Loop prevention in header processing

  The IPv6 specification allows an arbitrary number of extension headers to be placed onto packets. If we implemented IPv6 packet processing code in the way the BSD IPv4 code is implemented, the kernel stack might overflow due to a long function call chain. The sys/netinet6 code is carefully designed to avoid kernel stack overflow. Because of this, the sys/netinet6 code defines its own protocol switch structure, "struct ip6protosw" (see netinet6/ip6protosw.h). There is no such update to the IPv4 part (sys/netinet) for compatibility, but a small change is added to its pr_input() prototype. So "struct ipprotosw" is also defined. Because of this, if you receive an IPsec-over-IPv4 packet with a massive number of IPsec headers, the kernel stack may blow up. IPsec-over-IPv6 is okay. (Of course, for all those IPsec headers to be processed, each such IPsec header must pass each IPsec check. So an anonymous attacker will not be able to mount such an attack.)


8.1.1.9 ICMPv6

  After RFC2463 was published, IETF ipngwg has decided to disallow ICMPv6 error packet against ICMPv6 redirect, to prevent ICMPv6 storm on a network medium. This is already implemented into the kernel.


8.1.1.10 Applications

  For userland programming, we support IPv6 socket API as specified in RFC2553, RFC2292 and upcoming Internet drafts.

  TCP/UDP over IPv6 is available and quite stable. You can enjoy telnet(1), ftp(1), rlogin(1), rsh(1), ssh(1), etc. These applications are protocol independent. That is, they automatically choose IPv4 or IPv6 according to DNS.


8.1.1.11 Kernel Internals

  While ip_forward() calls ip_output(), ip6_forward() directly calls if_output() since routers must not divide IPv6 packets into fragments.

  An ICMPv6 error should contain as much of the original packet as possible, up to 1280 bytes. A UDP6/IP6 port unreachable message, for instance, should contain all extension headers and the *unchanged* UDP6 and IP6 headers. So, all IP6 functions except TCP never convert network byte order into host byte order, to preserve the original packet.

  tcp_input(), udp6_input() and icmp6_input() cannot assume that the IP6 header immediately precedes the transport header, due to extension headers. So, in6_cksum() was implemented to handle packets whose IP6 header and transport header are not contiguous. Neither TCP/IP6 nor UDP6/IP6 header structures exist for checksum calculation.

  To process the IP6 header, extension headers and transport headers easily, network drivers are now required to store packets in one internal mbuf or one or more external mbufs. A typical old driver prepares two internal mbufs for 96 - 204 bytes of data; now, however, such packet data is stored in one external mbuf.

  netstat -s -p ip6 tells you whether or not your driver conforms to this requirement. In the following example, "cce0" violates the requirement. (For more information, refer to 8.1.2.)

Mbuf statistics:
                317 one mbuf
                two or more mbuf::
                        lo0 = 8
                        cce0 = 10
                3282 one ext mbuf
                0 two or more ext mbuf
   

  Each input function calls IP6_EXTHDR_CHECK at the beginning to check if the region between IP6 and its header is contiguous. IP6_EXTHDR_CHECK calls m_pullup() only if the mbuf has the M_LOOP flag, that is, the packet comes from the loopback interface. m_pullup() is never called for packets coming from physical network interfaces.

  Both the IP and IP6 reassembly functions never call m_pullup().


8.1.1.12 IPv4 mapped address and IPv6 wildcard socket

  RFC2553 describes IPv4 mapped address (3.7) and special behavior of IPv6 wildcard bind socket (3.8). The spec allows you to:

  • Accept IPv4 connections by AF_INET6 wildcard bind socket.

  • Transmit IPv4 packet over AF_INET6 socket by using special form of the address like ::ffff:10.1.1.1.

  but the spec itself is very complicated and does not specify how the socket layer should behave. Here we call the former one "listening side" and the latter one "initiating side", for reference purposes.

  You can perform wildcard bind on both of the address families, on the same port.

  The following table shows the behavior of FreeBSD 4.x.

                listening side          initiating side
                (AF_INET6 wildcard      (connection to ::ffff:10.1.1.1)
                socket gets IPv4 conn.)
                ---                     ---
FreeBSD 4.x     configurable            supported
                default: enabled
   

  The following sections will give you more details, and how you can configure the behavior.

  Comments on listening side:

  It appears that RFC2553 says too little about the wildcard bind issue, especially about the port space issue, failure modes, and the relationship between AF_INET and AF_INET6 wildcard binds. There can be several separate interpretations of this RFC which conform to it but behave differently. So, to implement a portable application you should assume nothing about the behavior in the kernel. Using getaddrinfo(3) is the safest way. Port number space and wildcard bind issues were discussed in detail on the ipv6imp mailing list in mid March 1999, and it appears that there is no concrete consensus (meaning it is up to implementers). You may want to check the mailing list archives.

  If a server application would like to accept IPv4 and IPv6 connections, there are two alternatives.

  One is to use an AF_INET and an AF_INET6 socket (you will need two sockets). Use getaddrinfo(3) with AI_PASSIVE set in ai_flags, and call socket(2) and bind(2) for all the addresses returned. By opening multiple sockets, you can accept connections on the socket with the proper address family. IPv4 connections will be accepted by the AF_INET socket, and IPv6 connections will be accepted by the AF_INET6 socket.
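  A minimal sketch of this first alternative follows. The function name listen_all and its interface are hypothetical, and the backlog of 5 is arbitrary. On a system whose AF_INET6 wildcard socket also grabs IPv4 traffic, the second bind(2) on the same port may fail; the sketch simply skips such an address, which is only one of several possible policies.

```c
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Open one listening socket per address returned by getaddrinfo(3).
 * Stores up to maxfds descriptors in fds and returns how many were opened.
 */
int
listen_all(const char *service, int *fds, int maxfds)
{
	struct addrinfo hints, *res, *ai;
	int n = 0;

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_UNSPEC;		/* never hardcode the family */
	hints.ai_socktype = SOCK_STREAM;
	hints.ai_flags = AI_PASSIVE;		/* wildcard addresses for bind(2) */
	if (getaddrinfo(NULL, service, &hints, &res) != 0)
		return (0);

	for (ai = res; ai != NULL && n < maxfds; ai = ai->ai_next) {
		int s = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);

		if (s < 0)
			continue;		/* family not supported here */
		if (bind(s, ai->ai_addr, ai->ai_addrlen) < 0 ||
		    listen(s, 5) < 0) {
			close(s);
			continue;
		}
		fds[n++] = s;
	}
	freeaddrinfo(res);
	return (n);
}
```

  The caller would then wait on all returned descriptors, for example with select(2), and call accept(2) on whichever becomes readable.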

  Another way is to use one AF_INET6 wildcard bind socket. Use getaddrinfo(3) with AI_PASSIVE set in ai_flags and AF_INET6 set in ai_family, and set the 1st argument, hostname, to NULL. Then call socket(2) and bind(2) with the address returned (it should be the IPv6 unspecified address). You can accept both IPv4 and IPv6 packets via this one socket.

  To support only IPv6 traffic on an AF_INET6 wildcard bound socket portably, always check the peer address when a connection is made toward the AF_INET6 listening socket. If the address is an IPv4 mapped address, you may want to reject the connection. You can check the condition by using the IN6_IS_ADDR_V4MAPPED() macro.
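  The peer check can be sketched like this. The function name peer_is_v4mapped is hypothetical; the macro IN6_IS_ADDR_V4MAPPED() is the one named above.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Return 1 if the peer is an IPv4 client seen through an AF_INET6 socket. */
int
peer_is_v4mapped(const struct sockaddr *sa)
{
	const struct sockaddr_in6 *sin6;

	if (sa->sa_family != AF_INET6)
		return (0);
	sin6 = (const struct sockaddr_in6 *)sa;
	return (IN6_IS_ADDR_V4MAPPED(&sin6->sin6_addr) ? 1 : 0);
}
```

  A server would pass the address filled in by accept(2) to this function and close the new socket when it returns 1.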

  To resolve this issue more easily, there is a system dependent setsockopt(2) option, IPV6_BINDV6ONLY, used as shown below.

   int on = 1;    /* nonzero: this socket accepts IPv6 traffic only */

    if (setsockopt(s, IPPROTO_IPV6, IPV6_BINDV6ONLY,
           (char *)&on, sizeof (on)) < 0)
        perror("setsockopt");
   

  When this call succeeds, the socket receives only IPv6 packets.

  Comments on initiating side:

  Advice to application implementers: to implement a portable IPv6 application (which works on multiple IPv6 kernels), we believe that the following are the keys to success:

  • NEVER hardcode AF_INET or AF_INET6.

  • Use getaddrinfo(3) and getnameinfo(3) throughout the system. Never use gethostby*(), getaddrby*(), inet_*() or getipnodeby*(). (To update existing applications to be IPv6 aware easily, getipnodeby*() will sometimes be useful. But if possible, try to rewrite the code to use getaddrinfo(3) and getnameinfo(3).)

  • If you would like to connect to a destination, use getaddrinfo(3) and try all the destinations returned, like telnet(1) does.

  • Some IPv6 stacks are shipped with a buggy getaddrinfo(3). Ship a minimal working version with your application and use that as a last resort.
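  The "try all the destinations returned" advice can be sketched as follows. The function name connect_any is hypothetical, and error reporting is reduced to a -1 return value to keep the example short.

```c
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Try every address for host/service in turn; return a connected socket or -1. */
int
connect_any(const char *host, const char *service)
{
	struct addrinfo hints, *res, *ai;
	int s = -1;

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_UNSPEC;		/* let the resolver pick families */
	hints.ai_socktype = SOCK_STREAM;
	if (getaddrinfo(host, service, &hints, &res) != 0)
		return (-1);

	for (ai = res; ai != NULL; ai = ai->ai_next) {
		s = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
		if (s < 0)
			continue;
		if (connect(s, ai->ai_addr, ai->ai_addrlen) == 0)
			break;			/* success */
		close(s);
		s = -1;
	}
	freeaddrinfo(res);
	return (s);
}
```

  Because the address family is taken from each addrinfo entry, the same code reaches IPv4 and IPv6 destinations without mentioning AF_INET or AF_INET6.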

  If you would like to use an AF_INET6 socket for both IPv4 and IPv6 outgoing connections, you will need to use getipnodebyname(3). When you would like to update your existing application to be IPv6 aware with minimal effort, this approach might be chosen. But please note that it is a temporary solution, because getipnodebyname(3) itself is not recommended as it does not handle scoped IPv6 addresses at all. For IPv6 name resolution, getaddrinfo(3) is the preferred API. So you should rewrite your application to use getaddrinfo(3) when you get the time to do it.

  When writing applications that make outgoing connections, things become much simpler if you treat AF_INET and AF_INET6 as totally separate address families. The {set,get}sockopt issue becomes simpler, and the DNS issue becomes simpler. We do not recommend that you rely upon IPv4 mapped addresses.


8.1.1.12.1 unified tcp and inpcb code

  FreeBSD 4.x uses shared tcp code between IPv4 and IPv6 (from sys/netinet/tcp*) and separate udp4/6 code. It uses unified inpcb structure.

  The platform can be configured to support IPv4 mapped address. Kernel configuration is summarized as follows:

  • By default, an AF_INET6 socket will grab IPv4 connections under certain conditions, and can initiate a connection to an IPv4 destination embedded in an IPv4 mapped IPv6 address.

  • You can disable it on the entire system with sysctl as shown below.

    sysctl net.inet6.ip6.mapped_addr=0

8.1.1.12.1.1 listening side

  Each socket can be configured to support the special AF_INET6 wildcard bind (enabled by default). You can disable it on a per-socket basis with setsockopt(2) as shown below.

   int on = 1;    /* nonzero: do not accept IPv4 traffic on this socket */

    if (setsockopt(s, IPPROTO_IPV6, IPV6_BINDV6ONLY,
           (char *)&on, sizeof (on)) < 0)
        perror("setsockopt");
   

  Wildcard AF_INET6 socket grabs IPv4 connection if and only if the following conditions are satisfied:

  • there is no AF_INET socket that matches the IPv4 connection

  • the AF_INET6 socket is configured to accept IPv4 traffic, i.e. getsockopt(IPV6_BINDV6ONLY) returns 0.

  There is no problem with open/close ordering.


8.1.1.12.1.2 initiating side

  FreeBSD 4.x supports outgoing connection to IPv4 mapped address (::ffff:10.1.1.1), if the node is configured to support IPv4 mapped address.


8.1.1.13 sockaddr_storage

  When RFC2553 was about to be finalized, there was discussion on how the struct sockaddr_storage members should be named. One proposal was to prepend "__" to the members (like "__ss_len") as they should not be touched. The other proposal was not to prepend it (like "ss_len") as we need to touch those members directly. There was no clear consensus on it.

  As a result, RFC2553 defines struct sockaddr_storage as follows:

   struct sockaddr_storage {
        u_char  __ss_len;   /* address length */
        u_char  __ss_family;    /* address family */
        /* and bunch of padding */
    };
   

  In contrast, the XNET draft defines it as follows:

   struct sockaddr_storage {
        u_char  ss_len;     /* address length */
        u_char  ss_family;  /* address family */
        /* and bunch of padding */
    };
   

  In December 1999, it was agreed that RFC2553bis should pick the latter (XNET) definition.

  The current implementation conforms to the XNET definition, based on the RFC2553bis discussion.

  If you look at multiple IPv6 implementations, you will be able to see both definitions. As a userland programmer, the most portable way of dealing with it is to:

  1. ensure ss_family and/or ss_len are available on the platform, by using GNU autoconf,

  2. have -Dss_family=__ss_family to unify all occurrences (including header file) into __ss_family, or

  3. never touch __ss_family. Cast to sockaddr * and use sa_family like:

       struct sockaddr_storage ss;
        int family;

        family = ((struct sockaddr *)&ss)->sa_family;
           
    

8.1.2 Network Drivers

  The following two items are now required to be supported by standard drivers:

  1. mbuf clustering requirement. In this stable release, we changed MINCLSIZE into MHLEN+1 for all the operating systems in order to make all the drivers behave as we expect.

  2. multicast. If ifmcstat(8) yields no multicast group for an interface, that interface has to be patched.

  If any of the drivers do not support the requirements, then the drivers cannot be used for IPv6 and/or IPsec communication. If you find any problem with your card using IPv6/IPsec, please report it to the FreeBSD problem reports mailing list.

  (NOTE: In the past we required all PCMCIA drivers to have a call to in6_ifattach(). We have no such requirement any more)


8.1.3 Translator

  We categorize IPv4/IPv6 translator into 4 types:

  • Translator A --- It is used in the early stage of transition to make it possible to establish a connection from an IPv6 host in an IPv6 island to an IPv4 host in the IPv4 ocean.

  • Translator B --- It is used in the early stage of transition to make it possible to establish a connection from an IPv4 host in the IPv4 ocean to an IPv6 host in an IPv6 island.

  • Translator C --- It is used in the late stage of transition to make it possible to establish a connection from an IPv4 host in an IPv4 island to an IPv6 host in the IPv6 ocean.

  • Translator D --- It is used in the late stage of transition to make it possible to establish a connection from an IPv6 host in the IPv6 ocean to an IPv4 host in an IPv4 island.

  TCP relay translator for category A is supported. This is called "FAITH". We also provide IP header translator for category A. (The latter has not yet been put into FreeBSD 4.x.)


8.1.3.1 FAITH TCP relay translator

  The FAITH system uses a TCP relay daemon called faithd(8), assisted by the kernel. FAITH reserves an IPv6 address prefix and relays TCP connections made toward that prefix to an IPv4 destination.

  For example, if the reserved IPv6 prefix is 3ffe:0501:0200:ffff::, and the IPv6 destination for TCP connection is 3ffe:0501:0200:ffff::163.221.202.12, the connection will be relayed toward IPv4 destination 163.221.202.12.

   destination IPv4 node (163.221.202.12)
      ^
      | IPv4 tcp toward 163.221.202.12
    FAITH-relay dual stack node
      ^
      | IPv6 TCP toward 3ffe:0501:0200:ffff::163.221.202.12
    source IPv6 node
   

  faithd(8) must be invoked on FAITH-relay dual stack node.
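
  The address mapping in the example can be reproduced with ordinary library calls: the IPv4 destination is simply the low-order 32 bits of the IPv6 destination. The following is only a sketch of that arithmetic, not code taken from faithd(8); the function name faith_embedded_ipv4 is hypothetical.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not faithd code): write the IPv4 address carried
 * in the low-order 32 bits of an IPv6 address into "out" as text.
 * Returns 0 on success, -1 on a parse or buffer-size error.
 */
int
faith_embedded_ipv4(const char *v6text, char *out, size_t outlen)
{
	struct in6_addr v6;
	struct in_addr v4;

	if (inet_pton(AF_INET6, v6text, &v6) != 1)
		return (-1);
	/* Bytes 12..15 of the 16-byte address hold the IPv4 part. */
	memcpy(&v4, &v6.s6_addr[12], sizeof(v4));
	if (inet_ntop(AF_INET, &v4, out, (socklen_t)outlen) == NULL)
		return (-1);
	return (0);
}
```

  Given the destination from the example above, the helper yields the text 163.221.202.12.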

  For more details, consult src/usr.sbin/faithd/README.

8.1.4 IPsec

  IPsec consists mainly of three components:

  1. Policy Management

  2. Key Management

  3. AH and ESP handling


8.1.4.1 Policy Management

  The kernel implements experimental policy management code. There are two ways to manage security policy. One is to configure per-socket policy using setsockopt(2). In this case, policy configuration is described in ipsec_set_policy(3). The other is to configure kernel packet filter-based policy using the PF_KEY interface, via setkey(8).

  Policy entries are not re-ordered by index, so the order in which you add entries is very significant.


8.1.4.2 Key Management

  The key management code implemented in this kit (sys/netkey) is a home-brew PFKEY v2 implementation. This conforms to RFC2367.

  The home-brew IKE daemon, "racoon", is included in the kit (kame/kame/racoon). Basically you will need to run racoon as a daemon, then set up a policy to require keys (like ping -P 'out ipsec esp/transport//use'). The kernel will contact the racoon daemon as necessary to exchange keys.


8.1.4.3 AH and ESP handling

  The IPsec module is implemented as "hooks" to the standard IPv4/IPv6 processing. When sending a packet, ip{,6}_output() checks if ESP/AH processing is required by checking if a matching SPD (Security Policy Database) is found. If ESP/AH is needed, {esp,ah}{4,6}_output() will be called and the mbuf will be updated accordingly. When a packet is received, {esp,ah}4_input() will be called based on protocol number, i.e. (*inetsw[proto])(). {esp,ah}4_input() will decrypt/check authenticity of the packet, and strip off the daisy-chained header and padding for ESP/AH. It is safe to strip off the ESP/AH header on packet reception, since we will never use the received packet in "as is" form.

  By using ESP/AH, TCP4/6 effective data segment size will be affected by extra daisy-chained headers inserted by ESP/AH. Our code takes care of the case.

  Basic crypto functions can be found in directory "sys/crypto". ESP/AH transforms are listed in {esp,ah}_core.c with wrapper functions. If you wish to add an algorithm, add a wrapper function in {esp,ah}_core.c, and add your crypto algorithm code into sys/crypto.

  Tunnel mode is partially supported in this release, with the following restrictions:

  • IPsec tunnel is not combined with GIF generic tunneling interface. It needs great care because we may create an infinite loop between ip_output() and tunnelifp->if_output(). Opinion varies if it is better to unify them, or not.

  • MTU and Don't Fragment bit (IPv4) considerations need more checking, but basically work fine.

  • Authentication model for AH tunnel must be revisited. We will need to improve the policy management engine, eventually.


8.1.4.4 Conformance to RFCs and IDs

  The IPsec code in the kernel conforms (or, tries to conform) to the following standards:

  "old IPsec" specification documented in rfc182[5-9].txt

  "new IPsec" specification documented in rfc240[1-6].txt, rfc241[01].txt, rfc2451.txt and draft-mcdonald-simple-ipsec-api-01.txt (draft expired, but you can take from ftp://ftp.kame.net/pub/internet-drafts/). (NOTE: IKE specifications, rfc241[7-9].txt are implemented in userland, as "racoon" IKE daemon)

  Currently supported algorithms are:

  • old IPsec AH

    • null crypto checksum (no document, just for debugging)

    • keyed MD5 with 128bit crypto checksum (rfc1828.txt)

    • keyed SHA1 with 128bit crypto checksum (no document)

    • HMAC MD5 with 128bit crypto checksum (rfc2085.txt)

    • HMAC SHA1 with 128bit crypto checksum (no document)

  • old IPsec ESP

    • null encryption (no document, similar to rfc2410.txt)

    • DES-CBC mode (rfc1829.txt)

  • new IPsec AH

    • null crypto checksum (no document, just for debugging)

    • keyed MD5 with 96bit crypto checksum (no document)

    • keyed SHA1 with 96bit crypto checksum (no document)

    • HMAC MD5 with 96bit crypto checksum (rfc2403.txt)

    • HMAC SHA1 with 96bit crypto checksum (rfc2404.txt)

  • new IPsec ESP

    • null encryption (rfc2410.txt)

    • DES-CBC with derived IV (draft-ietf-ipsec-ciph-des-derived-01.txt, draft expired)

    • DES-CBC with explicit IV (rfc2405.txt)

    • 3DES-CBC with explicit IV (rfc2451.txt)

    • BLOWFISH CBC (rfc2451.txt)

    • CAST128 CBC (rfc2451.txt)

    • RC5 CBC (rfc2451.txt)

    • each of the above can be combined with:

      • ESP authentication with HMAC-MD5(96bit)

      • ESP authentication with HMAC-SHA1(96bit)

  The following algorithms are NOT supported:

  • old IPsec AH

    • HMAC MD5 with 128bit crypto checksum + 64bit replay prevention (rfc2085.txt)

    • keyed SHA1 with 160bit crypto checksum + 32bit padding (rfc1852.txt)

  IPsec (in kernel) and IKE (in userland as "racoon") have been tested at several interoperability test events, and they are known to interoperate well with many other implementations. Also, the current IPsec implementation has quite wide coverage of the IPsec crypto algorithms documented in RFCs (we cover algorithms without intellectual property issues only).


8.1.4.5 ECN consideration on IPsec tunnels

  ECN-friendly IPsec tunnel is supported as described in draft-ipsec-ecn-00.txt.

  Normal IPsec tunnel is described in RFC2401. On encapsulation, IPv4 TOS field (or, IPv6 traffic class field) will be copied from inner IP header to outer IP header. On decapsulation outer IP header will be simply dropped. The decapsulation rule is not compatible with ECN, since ECN bit on the outer IP TOS/traffic class field will be lost.

  To make IPsec tunnel ECN-friendly, we should modify encapsulation and decapsulation procedure. This is described in http://www.aciri.org/floyd/papers/draft-ipsec-ecn-00.txt, chapter 3.

  IPsec tunnel implementation can give you three behaviors, by setting net.inet.ipsec.ecn (or net.inet6.ipsec6.ecn) to some value:

  • RFC2401: no consideration for ECN (sysctl value -1)

  • ECN forbidden (sysctl value 0)

  • ECN allowed (sysctl value 1)

  Note that the behavior is configurable in per-node manner, not per-SA manner (draft-ipsec-ecn-00 wants per-SA configuration, but it looks like too much to me).

  The behavior is summarized as follows (see source code for more detail):

                encapsulate                     decapsulate
                ---                             ---
RFC2401         copy all TOS bits               drop TOS bits on outer
                from inner to outer.            (use inner TOS bits as is)

ECN forbidden   copy TOS bits except for ECN    drop TOS bits on outer
                (masked with 0xfc) from inner   (use inner TOS bits as is)
                to outer.  set ECN bits to 0.

ECN allowed     copy TOS bits except for ECN    use inner TOS bits with some
                CE (masked with 0xfe) from      change.  if outer ECN CE bit
                inner to outer.                 is 1, enable ECN CE bit on
                set ECN CE bit to 0.            the inner.
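
  The table can be read as two small functions over the TOS (or traffic class) byte. The sketch below only restates the table; it is not the kernel code in src/sys/netinet6, and the enum and function names are hypothetical. It assumes that the CE bit is the low-order bit of the byte (0x01), which is what the 0xfe mask in the table implies.

```c
#include <stdint.h>

/* Values mirror the net.inet.ipsec.ecn sysctl settings listed above. */
enum ecn_mode {
	ECN_RFC2401 = -1,	/* no consideration for ECN */
	ECN_FORBIDDEN = 0,
	ECN_ALLOWED = 1
};

#define TOS_ECN_CE	0x01	/* assumed position of the CE bit */

/* Encapsulation: derive the outer TOS byte from the inner one. */
uint8_t
tunnel_encap_tos(enum ecn_mode mode, uint8_t inner)
{
	switch (mode) {
	case ECN_FORBIDDEN:
		return (inner & 0xfc);	/* clear both ECN bits */
	case ECN_ALLOWED:
		return (inner & 0xfe);	/* clear only the CE bit */
	case ECN_RFC2401:
	default:
		return (inner);		/* copy all TOS bits */
	}
}

/* Decapsulation: derive the final inner TOS byte. */
uint8_t
tunnel_decap_tos(enum ecn_mode mode, uint8_t outer, uint8_t inner)
{
	if (mode == ECN_ALLOWED && (outer & TOS_ECN_CE) != 0)
		return (inner | TOS_ECN_CE);	/* propagate congestion mark */
	return (inner);				/* outer TOS bits are dropped */
}
```

  Only the "ECN allowed" setting lets a congestion mark set by a router inside the tunnel reach the inner packet.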

   

  General strategy for configuration is as follows:

  • if both IPsec tunnel endpoints are capable of ECN-friendly behavior, you should configure both ends to "ECN allowed" (sysctl value 1).

  • if the other end is very strict about TOS bit, use "RFC2401" (sysctl value -1).

  • in other cases, use "ECN forbidden" (sysctl value 0).

  The default behavior is "ECN forbidden" (sysctl value 0).

  For more information, please refer to:

   http://www.aciri.org/floyd/papers/draft-ipsec-ecn-00.txt, RFC2481 (Explicit Congestion Notification), src/sys/netinet6/{ah,esp}_input.c

  (Thanks go to Kenjiro Cho for detailed analysis.)


8.1.4.6 Interoperability

  Here are (some of) the platforms with which KAME code has tested IPsec/IKE interoperability in the past. Note that both ends may have modified their implementation, so use the following list just for reference purposes.

  Altiga, Ashley-laurent (vpcom.com), Data Fellows (F-Secure), Ericsson ACC, FreeS/WAN, HITACHI, IBM AIX®, IIJ, Intel, Microsoft® Windows NT®, NIST (linux IPsec + plutoplus), Netscreen, OpenBSD, RedCreek, Routerware, SSH, Secure Computing, Soliton, Toshiba, VPNet, Yamaha RT100i

Part III. Kernel

Table of Contents
Chapter 9  DMA
Chapter 10  Kernel Debugging

Chapter 9  DMA

9.1 DMA: What it is and How it Works

  Copyright © 1995,1997 Frank Durda IV , All Rights Reserved. 10 December 1996. Last Update 8 October 1997.

  Direct Memory Access (DMA) is a method of allowing data to be moved from one location to another in a computer without intervention from the central processor (CPU).

  The way that the DMA function is implemented varies between computer architectures, so this discussion will limit itself to the implementation and workings of the DMA subsystem on the IBM Personal Computer (PC), the IBM PC/AT and all of its successors and clones.

  The PC DMA subsystem is based on the Intel® 8237 DMA controller. The 8237 contains four DMA channels that can be programmed independently and any one of the channels may be active at any moment. These channels are numbered 0, 1, 2 and 3. Starting with the PC/AT, IBM added a second 8237 chip, and numbered those channels 4, 5, 6 and 7.

  The original DMA controller (0, 1, 2 and 3) moves one byte in each transfer. The second DMA controller (4, 5, 6, and 7) moves 16-bits from two adjacent memory locations in each transfer, with the first byte always coming from an even-numbered address. The two controllers are identical components and the difference in transfer size is caused by the way the second controller is wired into the system.

  The 8237 has two electrical signals for each channel, named DRQ and -DACK. There are additional signals with the names HRQ (Hold Request), HLDA (Hold Acknowledge), -EOP (End of Process), and the bus control signals -MEMR (Memory Read), -MEMW (Memory Write), -IOR (I/O Read), and -IOW (I/O Write).

  The 8237 DMA is known as a “fly-by” DMA controller. This means that the data being moved from one location to another does not pass through the DMA chip and is not stored in the DMA chip. Consequently, the DMA can only transfer data between an I/O port and a memory address, but not between two I/O ports or two memory locations.

Note: The 8237 does allow two channels to be connected together to allow memory-to-memory DMA operations in a non-“fly-by” mode, but nobody in the PC industry uses this scarce resource this way since it is faster to move data between memory locations using the CPU.

  In the PC architecture, each DMA channel is normally activated only when the hardware that uses a given DMA channel requests a transfer by asserting the DRQ line for that channel.


9.1.1 A Sample DMA transfer

  Here is an example of the steps that occur to cause and perform a DMA transfer. In this example, the floppy disk controller (FDC) has just read a byte from a diskette and wants the DMA to place it in memory at location 0x00123456. The process begins by the FDC asserting the DRQ2 signal (the DRQ line for DMA channel 2) to alert the DMA controller.

  The DMA controller will note that the DRQ2 signal is asserted. The DMA controller will then make sure that DMA channel 2 has been programmed and is unmasked (enabled). The DMA controller also makes sure that none of the other DMA channels are active or want to be active and have a higher priority. Once these checks are complete, the DMA asks the CPU to release the bus so that the DMA may use the bus. The DMA requests the bus by asserting the HRQ signal which goes to the CPU.

  The CPU detects the HRQ signal, and will complete executing the current instruction. Once the processor has reached a state where it can release the bus, it will. Now all of the signals normally generated by the CPU (-MEMR, -MEMW, -IOR, -IOW and a few others) are placed in a tri-stated condition (neither high nor low) and then the CPU asserts the HLDA signal which tells the DMA controller that it is now in charge of the bus.

  Depending on the processor, the CPU may be able to execute a few additional instructions now that it no longer has the bus, but the CPU will eventually have to wait when it reaches an instruction that must read something from memory that is not in the internal processor cache or pipeline.

  Now that the DMA “is in charge”, the DMA activates its -MEMR, -MEMW, -IOR, -IOW output signals, and the address outputs from the DMA are set to 0x3456, which will be used to direct the byte that is about to be transferred to a specific memory location.

  The DMA will then let the device that requested the DMA transfer know that the transfer is commencing. This is done by asserting the -DACK signal, or in the case of the floppy disk controller, -DACK2 is asserted.

  The floppy disk controller is now responsible for placing the byte to be transferred on the bus Data lines. Unless the floppy controller needs more time to get the data byte on the bus (and if the peripheral does need more time it alerts the DMA via the READY signal), the DMA will wait one DMA clock, and then de-assert the -MEMW and -IOR signals so that the memory will latch and store the byte that was on the bus, and the FDC will know that the byte has been transferred.

  Since the DMA cycle only transfers a single byte at a time, the FDC now drops the DRQ2 signal, so the DMA knows that it is no longer needed. The DMA will de-assert the -DACK2 signal, so that the FDC knows it must stop placing data on the bus.

  The DMA will now check to see if any of the other DMA channels have any work to do. If none of the channels have their DRQ lines asserted, the DMA controller has completed its work and will now tri-state the -MEMR, -MEMW, -IOR, -IOW and address signals.

  Finally, the DMA will de-assert the HRQ signal. The CPU sees this, and de-asserts the HLDA signal. Now the CPU activates its -MEMR, -MEMW, -IOR, -IOW and address lines, and it resumes executing instructions and accessing main memory and the peripherals.

  For a typical floppy disk sector, the above process is repeated 512 times, once for each byte. Each time a byte is transferred, the address register in the DMA is incremented and the counter in the DMA that shows how many bytes are to be transferred is decremented.

  When the counter reaches zero, the DMA asserts the EOP signal, which indicates that the counter has reached zero and no more data will be transferred until the DMA controller is reprogrammed by the CPU. This event is also called the Terminal Count (TC). There is only one EOP signal, and since only one DMA channel can be active at any instant, the DMA channel that is currently active must be the DMA channel that just completed its task.
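
  The bookkeeping described above can be modelled in a few lines. This is a simplified software model, not a description of the 8237 internals; the structure and function names are hypothetical. It assumes the counter was loaded with one less than the number of bytes to move (as noted in the section on programming the DMA), so the terminal count is signalled on the transfer that finds the counter at zero.

```c
#include <stdint.h>

/* Hypothetical model of the per-channel registers. */
struct dma_channel {
	uint16_t address;	/* current address register (low 16 bits) */
	uint16_t count;		/* remaining-transfer counter */
};

/*
 * Model one single-byte transfer: bump the address, drop the counter.
 * Returns 1 if this transfer was the last one (terminal count).
 */
int
dma_transfer_one(struct dma_channel *ch)
{
	int terminal = (ch->count == 0);

	ch->address++;
	ch->count--;		/* wraps to 0xffff after the last transfer */
	return (terminal);
}

/* Repeat transfers until terminal count; returns how many were done. */
unsigned
dma_run_to_tc(struct dma_channel *ch)
{
	unsigned n = 0;

	do {
		n++;
	} while (!dma_transfer_one(ch));
	return (n);
}
```

  For a 512-byte sector starting at offset 0x3456, the counter would be loaded with 511 and the model performs 512 transfers before reporting terminal count.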

  If a peripheral wants to generate an interrupt when the transfer of a buffer is complete, it can test for its -DACKn signal and the EOP signal both being asserted at the same time. When that happens, it means the DMA will not transfer any more information for that peripheral without intervention by the CPU. The peripheral can then assert one of the interrupt signals to get the processor's attention. In the PC architecture, the DMA chip itself is not capable of generating an interrupt. The peripheral and its associated hardware are responsible for generating any interrupt that occurs. Consequently, it is possible to have a peripheral that uses DMA but does not use interrupts.

  It is important to understand that although the CPU always releases the bus to the DMA when the DMA makes the request, this action is invisible to both applications and the operating system, except for slight changes in the amount of time the processor takes to execute instructions when the DMA is active. Consequently, the processor must poll the peripheral, poll the registers in the DMA chip, or receive an interrupt from the peripheral to know for certain when a DMA transfer has completed.


9.1.2 DMA Page Registers and 16Meg address space limitations

  You may have noticed earlier that instead of the DMA setting the address lines to 0x00123456 as we said earlier, the DMA only set 0x3456. The reason for this takes a bit of explaining.

  When the original IBM PC was designed, IBM elected to use both DMA and interrupt controller chips that were designed for use with the 8085, an 8-bit processor with an address space of 16 bits (64K). Since the IBM PC supported more than 64K of memory, something had to be done to allow the DMA to read or write memory locations above the 64K mark. What IBM did to solve this problem was to add an external data latch for each DMA channel that holds the upper bits of the address to be read from or written to. Whenever a DMA channel is active, the contents of that latch are written to the address bus and kept there until the DMA operation for the channel ends. IBM called these latches “Page Registers”.

  So for our example above, the DMA would put the 0x3456 part of the address on the bus, and the Page Register for DMA channel 2 would put 0x0012xxxx on the bus. Together, these two values form the complete address in memory that is to be accessed.
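
  Expressed as arithmetic, the Page Register supplies bits 23-16 and the DMA address register supplies bits 15-0. A minimal sketch follows; the function name dma_bus_address is hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical helper: combine an 8-bit page register value with the
 * 16-bit DMA address register to form the 24-bit address on the bus.
 */
uint32_t
dma_bus_address(uint8_t page, uint16_t offset)
{
	return (((uint32_t)page << 16) | offset);
}
```

  With a page value of 0x12 and an offset of 0x3456, this gives 0x00123456, the address used in the sample transfer.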

  Because the Page Register latch is independent of the DMA chip, the area of memory to be read or written must not span a 64K physical boundary. For example, if the DMA accesses memory location 0xffff, after that transfer the DMA will then increment the address register and the DMA will access the next byte at location 0x0000, not 0x10000. The results of letting this happen are probably not intended.
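
  The wrap-around can be demonstrated with a small model: the page bits stay fixed while only the low 16 bits advance. This is a sketch for 8-bit channels under that assumption; the function name dma_address_after is hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical model: the address that is actually driven on the bus
 * after "n" further byte transfers starting from "start". The page
 * register (bits 23-16) never changes; the 16-bit offset wraps.
 */
uint32_t
dma_address_after(uint32_t start, uint32_t n)
{
	uint32_t page = start & 0x00ff0000;
	uint32_t offset = (start + n) & 0x0000ffff;

	return (page | offset);
}
```

  Starting at 0x0012ffff, the next transfer lands at 0x00120000 rather than 0x00130000, which is why a buffer must not span a physical 64K boundary.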

Note: “Physical” 64K boundaries should not be confused with 8086-mode 64K “Segments”, which are created by mathematically adding a segment register with an offset register. Page Registers have no address overlap and are mathematically OR-ed together.

  To further complicate matters, the external DMA address latches on the PC/AT hold only eight bits, so that gives us 8+16=24 bits, which means that the DMA can only point at memory locations between 0 and 16Meg. For newer computers that allow more than 16Meg of memory, the standard PC-compatible DMA cannot access memory locations above 16Meg.

  To get around this restriction, operating systems will reserve a RAM buffer in an area below 16Meg that also does not span a physical 64K boundary. Then the DMA will be programmed to transfer data from the peripheral and into that buffer. Once the DMA has moved the data into this buffer, the operating system will then copy the data from the buffer to the address where the data is really supposed to be stored.

  When writing data from an address above 16Meg to a DMA-based peripheral, the data must be first copied from where it resides into a buffer located below 16Meg, and then the DMA can copy the data from the buffer to the hardware. In FreeBSD, these reserved buffers are called “Bounce Buffers”. In the MS-DOS world, they are sometimes called “Smart Buffers”.
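
  The decision whether a transfer has to go through such a buffer follows from the two restrictions above. The sketch below is an illustration only and is not taken from the FreeBSD ISA DMA code; the names dma_needs_bounce and ISA_DMA_LIMIT are hypothetical.

```c
#include <stdint.h>

#define ISA_DMA_LIMIT	0x01000000u	/* 16Meg: 24 address bits */

/*
 * Hypothetical check: returns 1 if a buffer of "len" bytes at physical
 * address "phys" cannot be handed to the standard PC DMA directly,
 * either because it reaches 16Meg or above, or because it spans a
 * physical 64K boundary. Returns 0 if it can be used as is.
 */
int
dma_needs_bounce(uint64_t phys, uint32_t len)
{
	uint64_t last;

	if (len == 0)
		return (0);
	last = phys + len - 1;
	if (last >= ISA_DMA_LIMIT)
		return (1);
	return ((phys >> 16) != (last >> 16));
}
```

  When the check returns 1, the data would be staged in a reserved buffer that satisfies both restrictions and copied by the CPU before or after the DMA transfer.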

Note: A new implementation of the 8237, called the 82374, allows 16 bits of page register to be specified and enables access to the entire 32 bit address space, without the use of bounce buffers.


9.1.3 DMA Operational Modes and Settings

  The 8237 DMA can be operated in several modes. The main ones are:

Single

A single byte (or word) is transferred. The DMA must release and re-acquire the bus for each additional byte. This is commonly used by devices that cannot transfer the entire block of data immediately. The peripheral will request the DMA each time it is ready for another transfer.

The standard PC-compatible floppy disk controller (NEC 765) only has a one-byte buffer, so it uses this mode.

Block/Demand

Once the DMA acquires the system bus, an entire block of data is transferred, up to a maximum of 64K. If the peripheral needs additional time, it can assert the READY signal to suspend the transfer briefly. READY should not be used excessively, and for slow peripheral transfers, the Single Transfer Mode should be used instead.

The difference between Block and Demand is that once a Block transfer is started, it runs until the transfer count reaches zero. DRQ only needs to be asserted until -DACK is asserted. Demand Mode will transfer one or more bytes until DRQ is de-asserted, at which point the DMA suspends the transfer and releases the bus back to the CPU. When DRQ is asserted later, the transfer resumes where it was suspended.

Older hard disk controllers used Demand Mode until CPU speeds increased to the point that it was more efficient to transfer the data using the CPU, particularly if the memory locations used in the transfer were above the 16Meg mark.

Cascade

This mechanism allows a DMA channel to request the bus, but then the attached peripheral device is responsible for placing the addressing information on the bus instead of the DMA. This is also used to implement a technique known as “Bus Mastering”.

When a DMA channel in Cascade Mode receives control of the bus, the DMA does not place addresses and I/O control signals on the bus like the DMA normally does when it is active. Instead, the DMA only asserts the -DACK signal for the active DMA channel.

At this point it is up to the peripheral connected to that DMA channel to provide address and bus control signals. The peripheral has complete control over the system bus, and can do reads and/or writes to any address below 16Meg. When the peripheral is finished with the bus, it de-asserts the DRQ line, and the DMA controller can then return control to the CPU or to some other DMA channel.

Cascade Mode can be used to chain multiple DMA controllers together, and this is exactly what DMA Channel 4 is used for in the PC architecture. When a peripheral requests the bus on DMA channels 0, 1, 2 or 3, the slave DMA controller asserts HRQ, but this wire is actually connected to DRQ4 on the primary DMA controller instead of to the CPU. The primary DMA controller, thinking it has work to do on Channel 4, requests the bus from the CPU using its own HRQ signal. Once the CPU grants the bus to the primary DMA controller, -DACK4 is asserted, and that wire is actually connected to the HLDA signal on the slave DMA controller. The slave DMA controller then transfers data for the DMA channel that requested it (0, 1, 2 or 3), or the slave DMA may grant the bus to a peripheral that wants to perform its own bus-mastering, such as a SCSI controller.

Because of this wiring arrangement, only DMA channels 0, 1, 2, 3, 5, 6 and 7 are usable with peripherals on PC/AT systems.

Note: DMA channel 0 was reserved for refresh operations in early IBM PC computers, but is generally available for use by peripherals in modern systems.

When a peripheral is performing Bus Mastering, it is important that the peripheral transmit data to or from memory constantly while it holds the system bus. If the peripheral cannot do this, it must release the bus frequently so that the system can perform refresh operations on main memory.

The Dynamic RAM used in all PCs for main memory must be accessed frequently to keep the bits stored in the components “charged”. Dynamic RAM essentially consists of millions of capacitors with each one holding one bit of data. These capacitors are charged with power to represent a 1 or drained to represent a 0. Because all capacitors leak, power must be added at regular intervals to keep the 1 values intact. The RAM chips actually handle the task of pumping power back into all of the appropriate locations in RAM, but they must be told when to do it by the rest of the computer so that the refresh activity will not interfere with the computer wanting to access RAM normally. If the computer is unable to refresh memory, the contents of memory will become corrupted in just a few milliseconds.

Since memory read and write cycles “count” as refresh cycles (a dynamic RAM refresh cycle is actually an incomplete memory read cycle), as long as the peripheral controller continues reading or writing data to sequential memory locations, that action will refresh all of memory.

Bus-mastering is found in some SCSI host interfaces and other high-performance peripheral controllers.

Autoinitialize

This mode causes the DMA to perform Single, Block or Demand transfers, but when the DMA transfer counter reaches zero, the counter and address are set back to where they were when the DMA channel was originally programmed. This means that as long as the peripheral requests transfers, they will be granted. It is up to the CPU to move new data into the fixed buffer ahead of where the DMA is about to transfer it when doing output operations, and to read new data out of the buffer behind where the DMA is writing when doing input operations.

This technique is frequently used on audio devices that have small or no hardware “sample” buffers. There is additional CPU overhead to manage this “circular” buffer, but in some cases this may be the only way to eliminate the latency that occurs when the DMA counter reaches zero and the DMA stops transfers until it is reprogrammed.
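
For output, the CPU-side bookkeeping amounts to working out how much of the circular buffer may be refilled without overtaking the DMA. The following is a generic ring-buffer sketch, not code from any particular driver; the function name ring_writable is hypothetical, and it assumes the driver can learn the DMA's current position (for example by reading the channel's current address register).

```c
#include <stddef.h>

/*
 * Hypothetical helper for an output (memory-to-I/O) circular buffer.
 * "size" is the buffer length, "dma_pos" the offset the DMA will read
 * next, "cpu_pos" the offset the CPU will write next; both offsets
 * must be smaller than "size". One byte is always left unused so that
 * equal positions unambiguously mean "nothing queued".
 */
size_t
ring_writable(size_t size, size_t dma_pos, size_t cpu_pos)
{
	return ((dma_pos + size - cpu_pos - 1) % size);
}
```

If the CPU writes no more than the returned number of bytes, it never overwrites data the DMA has not yet moved to the device.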


9.1.4 Programming the DMA

  The DMA channel that is to be programmed should always be “masked” before loading any settings. This is because the hardware might unexpectedly assert the DRQ for that channel, and the DMA might respond, even though not all of the parameters have been loaded or updated.

  Once masked, the host must specify the direction of the transfer (memory-to-I/O or I/O-to-memory), what mode of DMA operation is to be used for the transfer (Single, Block, Demand, Cascade, etc), and finally the address and length of the transfer are loaded. The length that is loaded is one less than the amount you expect the DMA to transfer. The LSB and MSB of the address and length are written to the same 8-bit I/O port, so another port must be written to first to guarantee that the DMA accepts the first byte as the LSB and the second byte as the MSB of the length and address.

  Then, be sure to update the Page Register, which is external to the DMA and is accessed through a different set of I/O ports.

  Once all the settings are ready, the DMA channel can be un-masked. That DMA channel is now considered to be “armed”, and will respond when the DRQ line for that channel is asserted.
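
  The sequence above can be written down as an ordered list of port writes. The sketch below builds that list for a Single-mode transfer on controller #1 instead of touching hardware, so it can be inspected or replayed with whatever port-output routine the platform provides. The port numbers come from the port map in this chapter. The mode and mask bit values are assumptions based on the 8237 data sheet and should be checked against a data book. All names are hypothetical, and the caller is expected to pass a buffer that does not span a 64K boundary.

```c
#include <stdint.h>

struct port_write {
	uint16_t port;
	uint8_t  value;
};

/* Controller #1 ports, from the port map in this chapter. */
#define DMA1_SINGLE_MASK	0x0a
#define DMA1_MODE		0x0b
#define DMA1_CLEAR_FF		0x0c

/* Bit values: assumptions based on the 8237 data sheet. */
#define MASK_SET		0x04	/* mask (disable) the channel */
#define MODE_SINGLE		0x40
#define MODE_TO_MEMORY		0x04	/* I/O-to-memory */
#define MODE_FROM_MEMORY	0x08	/* memory-to-I/O */

static const uint16_t dma1_page_port[4] = { 0x87, 0x83, 0x81, 0x82 };

static void
emit(struct port_write *out, int *n, uint16_t port, uint8_t value)
{
	out[*n].port = port;
	out[*n].value = value;
	(*n)++;
}

/*
 * Fill "out" (room for 9 entries) with the writes that program channel
 * 0-3 for "len" bytes (1..65536) at 24-bit physical address "phys".
 * Returns the number of entries, or -1 on invalid arguments.
 */
int
dma1_program(unsigned chan, uint32_t phys, uint32_t len, int to_memory,
    struct port_write *out)
{
	uint8_t dir = to_memory ? MODE_TO_MEMORY : MODE_FROM_MEMORY;
	uint16_t count;
	int n = 0;

	if (chan > 3 || len == 0 || len > 0x10000 || phys > 0x00ffffff)
		return (-1);
	count = (uint16_t)(len - 1);	/* register holds length - 1 */

	emit(out, &n, DMA1_SINGLE_MASK, (uint8_t)(MASK_SET | chan));
	emit(out, &n, DMA1_MODE, (uint8_t)(MODE_SINGLE | dir | chan));
	emit(out, &n, DMA1_CLEAR_FF, 0);	/* next byte is the LSB */
	emit(out, &n, (uint16_t)(chan * 2), (uint8_t)(phys & 0xff));
	emit(out, &n, (uint16_t)(chan * 2), (uint8_t)((phys >> 8) & 0xff));
	emit(out, &n, (uint16_t)(chan * 2 + 1), (uint8_t)(count & 0xff));
	emit(out, &n, (uint16_t)(chan * 2 + 1), (uint8_t)(count >> 8));
	emit(out, &n, dma1_page_port[chan], (uint8_t)(phys >> 16));
	emit(out, &n, DMA1_SINGLE_MASK, (uint8_t)chan);	/* unmask: armed */
	return (n);
}
```

  For the floppy example (channel 2, address 0x00123456, 512 bytes, I/O-to-memory) the list starts by masking the channel through port 0x0a and ends by unmasking it through the same port.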

  Refer to a hardware data book for precise programming details for the 8237. You will also need to refer to the I/O port map for the PC system, which describes where the DMA and Page Register ports are located. A complete port map table is located below.


9.1.5 DMA Port Map

  All systems based on the IBM-PC and PC/AT have the DMA hardware located at the same I/O ports. The complete list is provided below. Ports assigned to DMA Controller #2 are undefined on non-AT designs.


9.1.5.1 0x00-0x1f DMA Controller #1 (Channels 0, 1, 2 and 3)

  DMA Address and Count Registers

0x00 write Channel 0 starting address
0x00 read Channel 0 current address
0x01 write Channel 0 starting word count
0x01 read Channel 0 remaining word count
0x02 write Channel 1 starting address
0x02 read Channel 1 current address
0x03 write Channel 1 starting word count
0x03 read Channel 1 remaining word count
0x04 write Channel 2 starting address
0x04 read Channel 2 current address
0x05 write Channel 2 starting word count
0x05 read Channel 2 remaining word count
0x06 write Channel 3 starting address
0x06 read Channel 3 current address
0x07 write Channel 3 starting word count
0x07 read Channel 3 remaining word count

  DMA Command Registers

0x08 write Command Register
0x08 read Status Register
0x09 write Request Register
0x09 read -
0x0a write Single Mask Register Bit
0x0a read -
0x0b write Mode Register
0x0b read -
0x0c write Clear LSB/MSB Flip-Flop
0x0c read -
0x0d write Master Clear/Reset
0x0d read Temporary Register (not available on newer versions)
0x0e write Clear Mask Register
0x0e read -
0x0f write Write All Mask Register Bits
0x0f read Read All Mask Register Bits (only in Intel 82374)

9.1.5.2 0xc0-0xdf DMA Controller #2 (Channels 4, 5, 6 and 7)

  DMA Address and Count Registers

0xc0 write Channel 4 starting address
0xc0 read Channel 4 current address
0xc2 write Channel 4 starting word count
0xc2 read Channel 4 remaining word count
0xc4 write Channel 5 starting address
0xc4 read Channel 5 current address
0xc6 write Channel 5 starting word count
0xc6 read Channel 5 remaining word count
0xc8 write Channel 6 starting address
0xc8 read Channel 6 current address
0xca write Channel 6 starting word count
0xca read Channel 6 remaining word count
0xcc write Channel 7 starting address
0xcc read Channel 7 current address
0xce write Channel 7 starting word count
0xce read Channel 7 remaining word count
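
  The address and count ports in the two tables follow a regular pattern, so they can be computed from the channel number rather than looked up. The following sketch only restates the tables above; the function names are hypothetical.

```c
/*
 * Hypothetical helpers: I/O port of the address and word count
 * registers for DMA channel 0-7, or -1 for an invalid channel.
 * Controller #1 ports are one apart; controller #2 ports are two apart.
 */
int
dma_address_port(unsigned chan)
{
	if (chan > 7)
		return (-1);
	if (chan < 4)
		return ((int)(chan * 2));
	return ((int)(0xc0 + (chan - 4) * 4));
}

int
dma_count_port(unsigned chan)
{
	if (chan > 7)
		return (-1);
	if (chan < 4)
		return ((int)(chan * 2 + 1));
	return ((int)(0xc2 + (chan - 4) * 4));
}
```

  The Page Register ports do not follow such a pattern and are best kept in a lookup table.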

  DMA Command Registers

0xd0 write Command Register
0xd0 read Status Register
0xd2 write Request Register
0xd2 read -
0xd4 write Single Mask Register Bit
0xd4 read -
0xd6 write Mode Register
0xd6 read -
0xd8 write Clear LSB/MSB Flip-Flop
0xd8 read -
0xda write Master Clear/Reset
0xda read Temporary Register (not present in Intel 82374)
0xdc write Clear Mask Register
0xdc read -
0xde write Write All Mask Register Bits
0xde read Read All Mask Register Bits (only in Intel 82374)

9.1.5.3 0x80-0x9f DMA Page Registers

0x87 r/w Channel 0 Low byte (23-16) page Register
0x83 r/w Channel 1 Low byte (23-16) page Register
0x81 r/w Channel 2 Low byte (23-16) page Register
0x82 r/w Channel 3 Low byte (23-16) page Register
0x8b r/w Channel 5 Low byte (23-16) page Register
0x89 r/w Channel 6 Low byte (23-16) page Register
0x8a r/w Channel 7 Low byte (23-16) page Register
0x8f r/w Low byte page Refresh

9.1.5.4 0x400-0x4ff 82374 Enhanced DMA Registers

  The Intel 82374 EISA System Component (ESC) was introduced in early 1996 and includes a DMA controller that provides a superset of 8237 functionality, as well as other PC-compatible core peripheral components, in a single package. This chip is targeted at both EISA and PCI platforms, and provides modern DMA features such as scatter-gather and ring buffers, as well as direct access by the system DMA to all 32 bits of address space.

  If these features are used, code should also be included to provide similar functionality on the previous 16 years' worth of PC-compatible computers. For compatibility reasons, some of the 82374 registers must be programmed after programming the traditional 8237 registers for each transfer. Writing to a traditional 8237 register forces the contents of some of the 82374 enhanced registers to zero to provide backward software compatibility.

0x401 r/w Channel 0 High byte (bits 23-16) word count
0x403 r/w Channel 1 High byte (bits 23-16) word count
0x405 r/w Channel 2 High byte (bits 23-16) word count
0x407 r/w Channel 3 High byte (bits 23-16) word count
0x4c6 r/w Channel 5 High byte (bits 23-16) word count
0x4ca r/w Channel 6 High byte (bits 23-16) word count
0x4ce r/w Channel 7 High byte (bits 23-16) word count
0x487 r/w Channel 0 High byte (bits 31-24) page Register
0x483 r/w Channel 1 High byte (bits 31-24) page Register
0x481 r/w Channel 2 High byte (bits 31-24) page Register
0x482 r/w Channel 3 High byte (bits 31-24) page Register
0x48b r/w Channel 5 High byte (bits 31-24) page Register
0x489 r/w Channel 6 High byte (bits 31-24) page Register
0x48a r/w Channel 7 High byte (bits 31-24) page Register
0x48f r/w High byte page Refresh
0x4e0 r/w Channel 0 Stop Register (bits 7-2)
0x4e1 r/w Channel 0 Stop Register (bits 15-8)
0x4e2 r/w Channel 0 Stop Register (bits 23-16)
0x4e4 r/w Channel 1 Stop Register (bits 7-2)
0x4e5 r/w Channel 1 Stop Register (bits 15-8)
0x4e6 r/w Channel 1 Stop Register (bits 23-16)
0x4e8 r/w Channel 2 Stop Register (bits 7-2)
0x4e9 r/w Channel 2 Stop Register (bits 15-8)
0x4ea r/w Channel 2 Stop Register (bits 23-16)
0x4ec r/w Channel 3 Stop Register (bits 7-2)
0x4ed r/w Channel 3 Stop Register (bits 15-8)
0x4ee r/w Channel 3 Stop Register (bits 23-16)
0x4f4 r/w Channel 5 Stop Register (bits 7-2)
0x4f5 r/w Channel 5 Stop Register (bits 15-8)
0x4f6 r/w Channel 5 Stop Register (bits 23-16)
0x4f8 r/w Channel 6 Stop Register (bits 7-2)
0x4f9 r/w Channel 6 Stop Register (bits 15-8)
0x4fa r/w Channel 6 Stop Register (bits 23-16)
0x4fc r/w Channel 7 Stop Register (bits 7-2)
0x4fd r/w Channel 7 Stop Register (bits 15-8)
0x4fe r/w Channel 7 Stop Register (bits 23-16)
0x40a write Channels 0-3 Chaining Mode Register
0x40a read Channel Interrupt Status Register
0x4d4 write Channels 4-7 Chaining Mode Register
0x4d4 read Chaining Mode Status
0x40c read Chain Buffer Expiration Control Register
0x410 write Channel 0 Scatter-Gather Command Register
0x411 write Channel 1 Scatter-Gather Command Register
0x412 write Channel 2 Scatter-Gather Command Register
0x413 write Channel 3 Scatter-Gather Command Register
0x415 write Channel 5 Scatter-Gather Command Register
0x416 write Channel 6 Scatter-Gather Command Register
0x417 write Channel 7 Scatter-Gather Command Register
0x418 read Channel 0 Scatter-Gather Status Register
0x419 read Channel 1 Scatter-Gather Status Register
0x41a read Channel 2 Scatter-Gather Status Register
0x41b read Channel 3 Scatter-Gather Status Register
0x41d read Channel 5 Scatter-Gather Status Register
0x41e read Channel 6 Scatter-Gather Status Register
0x41f read Channel 7 Scatter-Gather Status Register
0x420-0x423 r/w Channel 0 Scatter-Gather Descriptor Table Pointer Register
0x424-0x427 r/w Channel 1 Scatter-Gather Descriptor Table Pointer Register
0x428-0x42b r/w Channel 2 Scatter-Gather Descriptor Table Pointer Register
0x42c-0x42f r/w Channel 3 Scatter-Gather Descriptor Table Pointer Register
0x434-0x437 r/w Channel 5 Scatter-Gather Descriptor Table Pointer Register
0x438-0x43b r/w Channel 6 Scatter-Gather Descriptor Table Pointer Register
0x43c-0x43f r/w Channel 7 Scatter-Gather Descriptor Table Pointer Register
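
  The page registers listed above each hold one byte of the physical address: the traditional page register carries bits 23-16 and the 82374 high page register carries bits 31-24. The sketch below only computes which values would go to which ports and in which order; the function name page_register_writes is made up for illustration, nothing here touches hardware, and the base address and word count registers are deliberately left out. It writes the traditional register first because, as noted above, writing a traditional 8237 register can force the enhanced registers to zero.

```python
# Low page register ports (address bits 23-16), copied from section 9.1.5.3.
LOW_PAGE_PORT = {0: 0x87, 1: 0x83, 2: 0x81, 3: 0x82, 5: 0x8B, 6: 0x89, 7: 0x8A}


def page_register_writes(phys_addr, channel):
    """Return the (port, value) page-register writes for one transfer.

    Hypothetical helper: the traditional low page register comes first,
    the 82374 high page register second.
    """
    if not 0 <= phys_addr <= 0xFFFFFFFF:
        raise ValueError("physical address must fit in 32 bits")
    if channel not in LOW_PAGE_PORT:
        raise ValueError("no page register is listed for this channel")
    low_port = LOW_PAGE_PORT[channel]
    high_port = 0x400 + low_port          # for example 0x81 -> 0x481
    low_page = (phys_addr >> 16) & 0xFF   # address bits 23-16
    high_page = (phys_addr >> 24) & 0xFF  # address bits 31-24
    return [(low_port, low_page), (high_port, high_page)]


if __name__ == "__main__":
    for port, value in page_register_writes(0x12345678, 2):
        print(f"outb(0x{port:03x}, 0x{value:02x})")
```

  The high page ports in the table all sit exactly 0x400 above their traditional counterparts, which is why the sketch derives one from the other instead of keeping a second table.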

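Chapter 10  Kernel Debugging

Contributed by Paul Richards and Jörg Wunsch.

10.1 Obtaining a Kernel Crash Dump

  Running a development kernel (for example FreeBSD-CURRENT) under extreme conditions (very high load, tens of thousands of connections, an exceedingly large number of concurrent users, hundreds of jail(8)s, and so on), or using a new feature or device driver on FreeBSD-STABLE (for example PAE), can make the kernel panic. This chapter explains how to extract useful information from such a crash.

  Once the kernel panics, a reboot is unavoidable. A reboot wipes the contents of physical memory (RAM) as well as any data on the swap device. To preserve the contents of memory, the kernel uses the swap device as temporary storage for whatever was in RAM at the time of the crash. After FreeBSD reboots, a kernel image can then be extracted from it and used for debugging.

Note: A swap device that has been configured as a dump device still acts as a swap device. Dumps to other devices (such as tapes or CDRWs) are not supported at this time. A “swap device” is synonymous with a “swap partition.”

  To obtain a usable crash dump, make sure that at least one swap partition is large enough to hold all of physical memory. When the kernel panics, it checks before the system reboots whether a swap device has been configured as a dump device. If so, the kernel writes the entire contents of memory to that device unchanged.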


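10.1.1 Configuring the Dump Device

  The kernel writes the contents of memory to a dump device only after one has been configured. The dumpon(8) command tells the kernel where to save kernel dumps. dumpon(8) can only be used on a swap partition that has already been configured with swapon(8). Normally it is enough to set the dumpdev variable in rc.conf(5) to the device path of the swap partition (this is the recommended way to obtain kernel dumps).

  Alternatively, the dump device can be hard-coded into the kernel with the dump clause of the config(5) line in the kernel configuration file. This approach is deprecated and should only be considered if the kernel crashes before dumpon(8) can be executed.

Tip: Check /etc/fstab or the output of swapinfo(8) for a list of the swap devices present on the system.

Important: Before analyzing a kernel crash, make sure the dumpdir specified in rc.conf(5) actually exists!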

# mkdir /var/crash
# chmod 700 /var/crash

Also remember that the contents of /var/crash are sensitive: a dump may well contain confidential information such as user passwords.
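
  The two requirements above (the directory exists, and nobody but its owner can read it) are easy to check in a script. The following is only a sketch: the function name dumpdir_problems is invented for this example, and the demonstration runs against a throwaway directory rather than the real /var/crash.

```python
import os
import stat
import tempfile


def dumpdir_problems(path):
    """Return a list of human-readable problems with a crash directory."""
    if not os.path.isdir(path):
        return [f"{path} does not exist or is not a directory"]
    problems = []
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        # Any group or other permission bit exposes the dumps to other users.
        problems.append(f"{path} has mode {mode:03o}; expected 700")
    return problems


if __name__ == "__main__":
    # Demonstrate on a temporary directory, not on the real /var/crash.
    with tempfile.TemporaryDirectory() as tmp:
        crash = os.path.join(tmp, "crash")
        os.mkdir(crash)
        os.chmod(crash, 0o700)
        print(dumpdir_problems(crash) or "looks fine")
```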


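10.1.2 Extracting a Kernel Dump

  Once a dump has been written to the dump device, it must be extracted and saved to a file before the swap device is next mounted. The savecore(8) program extracts a dump from the dump device. If dumpdev has been set in rc.conf(5), savecore(8) is run automatically on the first multi-user boot after the crash, before the swap device is configured. The extracted core is placed in the location given by the rc.conf(5) variable dumpdir, by default /var/crash, and the file is named vmcore.0.

  If a file named vmcore.0 already exists in /var/crash (or in whatever directory dumpdir is set to), the trailing number is incremented each time a dump is saved (for example vmcore.1) so that earlier dumps are not overwritten. When debugging, the vmcore with the highest number in /var/crash is therefore usually the one you want.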
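
  Picking the highest-numbered dump needs a numeric comparison, because a plain alphabetical sort places vmcore.10 before vmcore.9. A minimal sketch follows; the helper name newest_vmcore is hypothetical, and it works on a list of names such as the one os.listdir("/var/crash") would return.

```python
import re

# Matches only names of the form vmcore.N, where N is a decimal number.
_VMCORE = re.compile(r"^vmcore\.(\d+)$")


def newest_vmcore(filenames):
    """Return the vmcore.N name with the largest N, or None if there is none."""
    best_name = None
    best_number = -1
    for name in filenames:
        match = _VMCORE.match(name)
        if match and int(match.group(1)) > best_number:
            best_number = int(match.group(1))
            best_name = name
    return best_name


if __name__ == "__main__":
    print(newest_vmcore(["vmcore.0", "vmcore.9", "vmcore.10", "kernel.0"]))
```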

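Tip: If you are testing a new kernel but need to boot a different one in order to get the system up and running again, boot it into single-user mode using the -s flag at the boot prompt, and then perform the following steps: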

# fsck -p
# mount -a -t ufs       # make sure /var/crash is writable
# savecore /var/crash /dev/ad0s1b
# exit                  # exit to multi-user

This instructs savecore(8) to extract a kernel dump from /dev/ad0s1b and place the contents in /var/crash. Do not forget to make sure the destination directory /var/crash has enough space for the dump. Also, do not forget to specify the correct path to your swap device as it is likely different than /dev/ad0s1b!

  The recommended, and certainly the easiest way to automate obtaining crash dumps is to use the dumpdev variable in rc.conf(5).


10.2 Debugging a Kernel Crash Dump with kgdb

Note: This section covers kgdb(1) as found in FreeBSD 5.3 and later. In previous versions, one must use gdb -k to read a core dump file.

  Once a dump has been obtained, getting useful information out of the dump is relatively easy for simple problems. Before launching into the internals of kgdb(1) to debug the crash dump, locate the debug version of your kernel (normally called kernel.debug) and the path to the source files used to build your kernel (normally /usr/obj/usr/src/sys/KERNCONF, where KERNCONF is the ident specified in a kernel config(5)). With those two pieces of info, let the debugging commence!

  To enter into the debugger and begin getting information from the dump, the following steps are required at a minimum:

# cd /usr/obj/usr/src/sys/KERNCONF
# kgdb kernel.debug /var/crash/vmcore.0

  You can debug the crash dump using the kernel sources just like you can for any other program.

  This first dump is from a 5.2-BETA kernel and the crash comes from deep within the kernel. The output below has been modified to include line numbers on the left. This first trace inspects the instruction pointer and obtains a back trace. The address that is used on line 41 for the list command is the instruction pointer and can be found on line 17. Most developers will request having at least this information sent to them if you are unable to debug the problem yourself. If, however, you do solve the problem, make sure that your patch winds its way into the source tree via a problem report, mailing lists, or by being able to commit it!

 1:# cd /usr/obj/usr/src/sys/KERNCONF
 2:# kgdb kernel.debug /var/crash/vmcore.0
 3:GNU gdb 5.2.1 (FreeBSD)
 4:Copyright 2002 Free Software Foundation, Inc.
 5:GDB is free software, covered by the GNU General Public License, and you are
 6:welcome to change it and/or distribute copies of it under certain conditions.
 7:Type "show copying" to see the conditions.
 8:There is absolutely no warranty for GDB.  Type "show warranty" for details.
 9:This GDB was configured as "i386-undermydesk-freebsd"...
10:panic: page fault
11:panic messages:
12:---
13:Fatal trap 12: page fault while in kernel mode
14:cpuid = 0; apic id = 00
15:fault virtual address   = 0x300
16:fault code:             = supervisor read, page not present
17:instruction pointer     = 0x8:0xc0713860
18:stack pointer           = 0x10:0xdc1d0b70
19:frame pointer           = 0x10:0xdc1d0b7c
20:code segment            = base 0x0, limit 0xfffff, type 0x1b
21:                        = DPL 0, pres 1, def32 1, gran 1
22:processor eflags        = resume, IOPL = 0
23:current process         = 14394 (uname)
24:trap number             = 12
25:panic: page fault
26:cpuid = 0;
27:Stack backtrace:
28:
29:syncing disks, buffers remaining... 2199 2199 panic: mi_switch: switch in a critical section
30:cpuid = 0;
31:Uptime: 2h43m19s
32:Dumping 255 MB
33: 16 32 48 64 80 96 112 128 144 160 176 192 208 224 240
34:---
35:Reading symbols from /boot/kernel/snd_maestro3.ko...done.
36:Loaded symbols for /boot/kernel/snd_maestro3.ko
37:Reading symbols from /boot/kernel/snd_pcm.ko...done.
38:Loaded symbols for /boot/kernel/snd_pcm.ko
39:#0  doadump () at /usr/src/sys/kern/kern_shutdown.c:240
40:240             dumping++;
41:(kgdb) list *0xc0713860
42:0xc0713860 is in lapic_ipi_wait (/usr/src/sys/i386/i386/local_apic.c:663).
43:658                     incr = 0;
44:659                     delay = 1;
45:660             } else
46:661                     incr = 1;
47:662             for (x = 0; x < delay; x += incr) {
48:663                     if ((lapic->icr_lo & APIC_DELSTAT_MASK) == APIC_DELSTAT_IDLE)
49:664                             return (1);
50:665                     ia32_pause();
51:666             }
52:667             return (0);
53:(kgdb) backtrace
54:#0  doadump () at /usr/src/sys/kern/kern_shutdown.c:240
55:#1  0xc055fd9b in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:372
56:#2  0xc056019d in panic () at /usr/src/sys/kern/kern_shutdown.c:550
57:#3  0xc0567ef5 in mi_switch () at /usr/src/sys/kern/kern_synch.c:470
58:#4  0xc055fa87 in boot (howto=256) at /usr/src/sys/kern/kern_shutdown.c:312
59:#5  0xc056019d in panic () at /usr/src/sys/kern/kern_shutdown.c:550
60:#6  0xc0720c66 in trap_fatal (frame=0xdc1d0b30, eva=0)
61:    at /usr/src/sys/i386/i386/trap.c:821
62:#7  0xc07202b3 in trap (frame=
63:      {tf_fs = -1065484264, tf_es = -1065484272, tf_ds = -1065484272, tf_edi = 1, tf_esi = 0, tf_ebp = -602076292, tf_isp = -602076324, tf_ebx = 0, tf_edx = 0, tf_ecx = 1000000, tf_eax = 243, tf_trapno = 12, tf_err = 0, tf_eip = -1066321824, tf_cs = 8, tf_eflags = 65671, tf_esp = 243, tf_ss = 0})
64:    at /usr/src/sys/i386/i386/trap.c:250
65:#8  0xc070c9f8 in calltrap () at {standard input}:94
66:#9  0xc07139f3 in lapic_ipi_vectored (vector=0, dest=0)
67:    at /usr/src/sys/i386/i386/local_apic.c:733
68:#10 0xc0718b23 in ipi_selected (cpus=1, ipi=1)
69:    at /usr/src/sys/i386/i386/mp_machdep.c:1115
70:#11 0xc057473e in kseq_notify (ke=0xcc05e360, cpu=0)
71:    at /usr/src/sys/kern/sched_ule.c:520
72:#12 0xc0575cad in sched_add (td=0xcbcf5c80)
73:    at /usr/src/sys/kern/sched_ule.c:1366
74:#13 0xc05666c6 in setrunqueue (td=0xcc05e360)
75:    at /usr/src/sys/kern/kern_switch.c:422
76:#14 0xc05752f4 in sched_wakeup (td=0xcbcf5c80)
77:    at /usr/src/sys/kern/sched_ule.c:999
78:#15 0xc056816c in setrunnable (td=0xcbcf5c80)
79:    at /usr/src/sys/kern/kern_synch.c:570
80:#16 0xc0567d53 in wakeup (ident=0xcbcf5c80)
81:    at /usr/src/sys/kern/kern_synch.c:411
82:#17 0xc05490a8 in exit1 (td=0xcbcf5b40, rv=0)
83:    at /usr/src/sys/kern/kern_exit.c:509
84:#18 0xc0548011 in sys_exit () at /usr/src/sys/kern/kern_exit.c:102
85:#19 0xc0720fd0 in syscall (frame=
86:      {tf_fs = 47, tf_es = 47, tf_ds = 47, tf_edi = 0, tf_esi = -1, tf_ebp = -1077940712, tf_isp = -602075788, tf_ebx = 672411944, tf_edx = 10, tf_ecx = 672411600, tf_eax = 1, tf_trapno = 12, tf_err = 2, tf_eip = 671899563, tf_cs = 31, tf_eflags = 642, tf_esp = -1077940740, tf_ss = 47})
87:    at /usr/src/sys/i386/i386/trap.c:1010
88:#20 0xc070ca4d in Xint0x80_syscall () at {standard input}:136
89:---Can't read userspace from dump, or kernel process---
90:(kgdb) quit

  This next trace is an older dump from the FreeBSD 2 time frame, but is more involved and demonstrates more of the features of gdb. Long lines have been folded to improve readability, and the lines are numbered for reference. Despite this, it is a real-world error trace taken during the development of the pcvt console driver.

 1:Script started on Fri Dec 30 23:15:22 1994
 2:# cd /sys/compile/URIAH
 3:# gdb -k kernel /var/crash/vmcore.1
 4:Reading symbol data from /usr/src/sys/compile/URIAH/kernel
...done.
 5:IdlePTD 1f3000
 6:panic: because you said to!
 7:current pcb at 1e3f70
 8:Reading in symbols for ../../i386/i386/machdep.c...done.
 9:(kgdb) backtrace
10:#0  boot (arghowto=256) (../../i386/i386/machdep.c line 767)
11:#1  0xf0115159 in panic ()
12:#2  0xf01955bd in diediedie () (../../i386/i386/machdep.c line 698)
13:#3  0xf010185e in db_fncall ()
14:#4  0xf0101586 in db_command (-266509132, -266509516, -267381073)
15:#5  0xf0101711 in db_command_loop ()
16:#6  0xf01040a0 in db_trap ()
17:#7  0xf0192976 in kdb_trap (12, 0, -272630436, -266743723)
18:#8  0xf019d2eb in trap_fatal (...)
19:#9  0xf019ce60 in trap_pfault (...)
20:#10 0xf019cb2f in trap (...)
21:#11 0xf01932a1 in exception:calltrap ()
22:#12 0xf0191503 in cnopen (...)
23:#13 0xf0132c34 in spec_open ()
24:#14 0xf012d014 in vn_open ()
25:#15 0xf012a183 in open ()
26:#16 0xf019d4eb in syscall (...)
27:(kgdb) up 10
28:Reading in symbols for ../../i386/i386/trap.c...done.
29:#10 0xf019cb2f in trap (frame={tf_es = -260440048, tf_ds = 16, tf_/
30:edi = 3072, tf_esi = -266445372, tf_ebp = -272630356, tf_isp = -27/
31:2630396, tf_ebx = -266427884, tf_edx = 12, tf_ecx = -266427884, tf/
32:_eax = 64772224, tf_trapno = 12, tf_err = -272695296, tf_eip = -26/
33:6672343, tf_cs = -266469368, tf_eflags = 66066, tf_esp = 3072, tf_/
34:ss = -266427884}) (../../i386/i386/trap.c line 283)
35:283                             (void) trap_pfault(&frame, FALSE);
36:(kgdb) frame frame->tf_ebp frame->tf_eip
37:Reading in symbols for ../../i386/isa/pcvt/pcvt_drv.c...done.
38:#0  0xf01ae729 in pcopen (dev=3072, flag=3, mode=8192, p=(struct p/
39:roc *) 0xf07c0c00) (../../i386/isa/pcvt/pcvt_drv.c line 403)
40:403             return ((*linesw[tp->t_line].l_open)(dev, tp));
41:(kgdb) list
42:398
43:399             tp->t_state |= TS_CARR_ON;
44:400             tp->t_cflag |= CLOCAL;  /* cannot be a modem (:-) */
45:401
46:402     #if PCVT_NETBSD || (PCVT_FREEBSD >= 200)
47:403             return ((*linesw[tp->t_line].l_open)(dev, tp));
48:404     #else
49:405             return ((*linesw[tp->t_line].l_open)(dev, tp, flag));
50:406     #endif /* PCVT_NETBSD || (PCVT_FREEBSD >= 200) */
51:407     }
52:(kgdb) print tp
53:Reading in symbols for ../../i386/i386/cons.c...done.
54:$1 = (struct tty *) 0x1bae
55:(kgdb) print tp->t_line
56:$2 = 1767990816
57:(kgdb) up
58:#1  0xf0191503 in cnopen (dev=0x00000000, flag=3, mode=8192, p=(st/
59:ruct proc *) 0xf07c0c00) (../../i386/i386/cons.c line 126)
60:       return ((*cdevsw[major(dev)].d_open)(dev, flag, mode, p));
61:(kgdb) up
62:#2  0xf0132c34 in spec_open ()
63:(kgdb) up
64:#3  0xf012d014 in vn_open ()
65:(kgdb) up
66:#4  0xf012a183 in open ()
67:(kgdb) up
68:#5  0xf019d4eb in syscall (frame={tf_es = 39, tf_ds = 39, tf_edi =/
69: 2158592, tf_esi = 0, tf_ebp = -272638436, tf_isp = -272629788, tf/
70:_ebx = 7086, tf_edx = 1, tf_ecx = 0, tf_eax = 5, tf_trapno = 582, /
71:tf_err = 582, tf_eip = 75749, tf_cs = 31, tf_eflags = 582, tf_esp /
72:= -272638456, tf_ss = 39}) (../../i386/i386/trap.c line 673)
73:673             error = (*callp->sy_call)(p, args, rval);
74:(kgdb) up
75:Initial frame selected; you cannot go up.
76:(kgdb) quit

  Comments to the above script:

line 6:

This is a dump taken from within DDB (see below), hence the panic comment “because you said to!”, and a rather long stack trace; the initial reason for going into DDB has been a page fault trap though.

line 20:

This is the location of function trap() in the stack trace.

line 36:

Force usage of a new stack frame; this is no longer necessary. The stack frames are supposed to point to the right locations now, even in case of a trap. From looking at the code in source line 403, there is a high probability that either the pointer access for “tp” was messed up, or the array access was out of bounds.

line 52:

The pointer looks suspicious, but happens to be a valid address.

line 56:

However, it obviously points to garbage, so we have found our error! (For those unfamiliar with that particular piece of code: tp->t_line refers to the line discipline of the console device here, which must be a rather small integer number.)
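
One quick way to judge whether a value such as the one on line 56 is garbage is to look at the bytes it occupies in memory. This is a heuristic added here for illustration, not something done in the original session, and describe_word is a made-up helper name. On little-endian i386 the value turns out to be four printable ASCII characters, which suggests that tp points into string data rather than at a struct tty.

```python
import struct


def describe_word(value):
    """Return a 32-bit value as hex and as its little-endian bytes in ASCII."""
    word = value & 0xFFFFFFFF
    raw = struct.pack("<I", word)  # byte order as stored on i386
    text = "".join(chr(b) if 32 <= b < 127 else "." for b in raw)
    return f"{word:#010x}", text


if __name__ == "__main__":
    print(describe_word(1767990816))  # prints ('0x69616620', ' fai')
```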

Tip: If your system is crashing regularly and you are running out of disk space, deleting old vmcore files in /var/crash could save a considerable amount of disk space!


10.3 Debugging a Crash Dump with DDD

  Examining a kernel crash dump with a graphical debugger like ddd is also possible (you will need to install the devel/ddd port in order to use the ddd debugger). Add the -k option to the ddd command line you would use normally. For example:

# ddd -k /var/crash/kernel.0 /var/crash/vmcore.0

  You should then be able to go about looking at the crash dump using ddd's graphical interface.


10.4 Post-Mortem Analysis of a Dump

  What do you do if a kernel dumped core but you did not expect it, and it is therefore not compiled using config -g? Not everything is lost here. Do not panic!

  Of course, you still need to enable crash dumps. See above for the options you have to specify in order to do this.

  Go to your kernel config directory (/usr/src/sys/arch/conf) and edit your configuration file. Uncomment (or add, if it does not exist) the following line:

makeoptions    DEBUG=-g                #Build kernel with gdb(1) debug symbols

  Rebuild the kernel. Due to the time stamp change on the Makefile, some other object files will be rebuilt, for example trap.o. With a bit of luck, the added -g option will not change anything for the generated code, so you will finally get a new kernel with similar code to the faulting one but with some debugging symbols. You should at least verify the old and new sizes with the size(1) command. If there is a mismatch, you probably need to give up here.

  Go and examine the dump as described above. The debugging symbols might be incomplete for some places, as can be seen in the stack trace in the example above where some functions are displayed without line numbers and argument lists. If you need more debugging symbols, remove the appropriate object files, recompile the kernel again and repeat the gdb -k session until you know enough.

  None of this is guaranteed to work, but it works fine in most cases.


10.5 On-Line Kernel Debugging Using DDB

  While gdb -k as an off-line debugger provides a very high-level user interface, there are some things it cannot do. The most important of these are setting breakpoints and single-stepping kernel code.

  If you need to do low-level debugging on your kernel, there is an on-line debugger available called DDB. It allows setting of breakpoints, single-stepping kernel functions, examining and changing kernel variables, etc. However, it cannot access kernel source files, and only has access to the global and static symbols, not to the full debug information like gdb does.

  To configure your kernel to include DDB, add the option line

options DDB
to your config file, and rebuild. (See The FreeBSD Handbook for details on configuring the FreeBSD kernel).

Note: If you have an older version of the boot blocks, your debugger symbols might not be loaded at all. Update the boot blocks; the recent ones load the DDB symbols automatically.

  Once your DDB kernel is running, there are several ways to enter DDB. The first, and earliest way is to type the boot flag -d right at the boot prompt. The kernel will start up in debug mode and enter DDB prior to any device probing. Hence you can even debug the device probe/attach functions.

  The second scenario is to drop to the debugger once the system has booted. There are two simple ways to accomplish this. If you would like to break to the debugger from the command prompt, simply type the command:

# sysctl debug.enter_debugger=ddb

  Alternatively, if you are at the system console, you may use a hot-key on the keyboard. The default break-to-debugger sequence is Ctrl+Alt+ESC. For syscons, this sequence can be remapped and some of the distributed maps out there do this, so check to make sure you know the right sequence to use. There is an option available for serial consoles that allows the use of a serial line BREAK on the console line to enter DDB (options BREAK_TO_DEBUGGER in the kernel config file). It is not the default since there are a lot of serial adapters around that gratuitously generate a BREAK condition, for example when pulling the cable.

  The third way is that any panic condition will branch to DDB if the kernel is configured to use it. For this reason, it is not wise to configure a kernel with DDB for a machine running unattended.

  The DDB commands roughly resemble some gdb commands. The first thing you probably need to do is to set a breakpoint:

b function-name
b address

  Numbers are taken as hexadecimal by default. To keep them distinct from symbol names, hexadecimal numbers starting with the letters a-f must be preceded by 0x (this is optional for other numbers). Simple expressions are allowed, for example: function-name + 0x103.
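
  The rule for telling numbers from symbol names can be modelled in a few lines. The sketch below is a model of the rule as described in this section, not DDB's actual lexer, and ddb_number is a name invented for this example.

```python
import string


def ddb_number(token):
    """Return the value the rule above assigns to token, or None for a symbol."""
    t = token.lower()
    if t.startswith("0x"):
        digits = t[2:]
    elif t[:1].isdigit():
        digits = t
    else:
        return None  # starts with a letter, including a-f: read as a symbol
    if not digits or any(c not in string.hexdigits for c in digits):
        return None
    return int(digits, 16)  # hexadecimal is the default radix


if __name__ == "__main__":
    for token in ("10", "0xf0133fe0", "f0133fe0", "db_symtab_space"):
        print(token, "->", ddb_number(token))
```

  Under this rule the token 10 means sixteen, and f0133fe0 without the 0x prefix is looked up as a symbol name instead of being read as an address.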

  To continue the operation of an interrupted kernel, simply type:

c

  To get a stack trace, use:

trace

Note: When entering DDB via a hot-key, the kernel is currently servicing an interrupt, so the stack trace might not be of much use to you.

  If you want to remove a breakpoint, use

del
del address-expression

  The first form will be accepted immediately after a breakpoint hit, and deletes the current breakpoint. The second form can remove any breakpoint, but you need to specify the exact address; this can be obtained from:

show b

  To single-step the kernel, try:

s

  This will step into functions, but you can make DDB trace them until the matching return statement is reached by:

n

Note: This is different from gdb's next statement; it is like gdb's finish.

  To examine data from memory, use (for example):

x/wx 0xf0133fe0,40
x/hd db_symtab_space
x/bc termbuf,10
x/s stringbuf
for word/halfword/byte access, and hexadecimal/decimal/character/string display. The number after the comma is the object count. To display the next 0x10 items, simply use:

x ,10

  Similarly, use

x/ia foofunc,10
to disassemble the first 0x10 instructions of foofunc, and display them along with their offset from the beginning of foofunc.
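
  To make the structure of the examine command explicit, the sketch below splits one into its parts. It is a toy parser covering only the modifier letters mentioned in this section (w, h, b for size; x, d, c, s, i for format); the name parse_examine and the returned dictionary layout are assumptions of this example, and any other modifier letter is simply ignored. The count is read as hexadecimal, in line with DDB's default radix.

```python
SIZES = {"b": 1, "h": 2, "w": 4}  # byte, halfword, word on i386
FORMATS = {"x": "hexadecimal", "d": "decimal", "c": "character",
           "s": "string", "i": "instruction"}


def parse_examine(command):
    """Split a DDB examine command into size, format, address, and count."""
    head, _, rest = command.strip().partition(" ")
    name, _, modifiers = head.partition("/")
    if name != "x":
        raise ValueError("not an examine command")
    size = next((SIZES[c] for c in modifiers if c in SIZES), None)
    fmt = next((FORMATS[c] for c in modifiers if c in FORMATS), None)
    address, _, count = rest.strip().partition(",")
    return {
        "size_bytes": size,
        "format": fmt,
        "address": address.strip() or None,
        "count": int(count, 16) if count.strip() else None,
    }


if __name__ == "__main__":
    print(parse_examine("x/wx 0xf0133fe0,40"))
```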

  To modify memory, use the write command:

w/b termbuf 0xa 0xb 0
w/w 0xf0010030 0 0

  The command modifier (b/h/w) specifies the size of the data to be written, the first following expression is the address to write to and the remainder is interpreted as data to write to successive memory locations.

  If you need to know the current registers, use:

show reg

  Alternatively, you can display a single register value by e.g.

p $eax
and modify it by:

set $eax new-value

  Should you need to call some kernel functions from DDB, simply say:

call func(arg1, arg2, ...)

  The return value will be printed.

  For a ps(1) style summary of all running processes, use:

ps

  Now you have examined why your kernel failed, and you wish to reboot. Remember that, depending on the severity of previous malfunctioning, not all parts of the kernel might still be working as expected. Perform one of the following actions to shut down and reboot your system:

panic

  This will cause your kernel to dump core and reboot, so you can later analyze the core on a higher level with gdb. This command usually must be followed by another continue statement.

call boot(0)

  This might be a good way to cleanly shut down the running system, sync() all disks, and finally reboot. As long as the disk and filesystem interfaces of the kernel are not damaged, it gives an almost clean shutdown.

call cpu_reset()

  This is the final way out of disaster and almost the same as hitting the Big Red Button.

  If you need a short command summary, simply type:

help

  However, it is highly recommended to have a printed copy of the ddb(4) manual page ready for a debugging session. Remember that it is hard to read the on-line manual while single-stepping the kernel.


10.6 On-Line Kernel Debugging Using Remote GDB

  This feature has been supported since FreeBSD 2.2, and it is actually a very neat one.

  GDB has already supported remote debugging for a long time. This is done using a very simple protocol along a serial line. Unlike the other methods described above, you will need two machines for doing this. One is the host providing the debugging environment, including all the sources, and a copy of the kernel binary with all the symbols in it, and the other one is the target machine that simply runs a similar copy of the very same kernel (but stripped of the debugging information).

  You should configure the kernel in question with config -g, include DDB into the configuration, and compile it as usual. This gives a large binary, due to the debugging information. Copy this kernel to the target machine, strip the debugging symbols off with strip -x, and boot it using the -d boot option. Connect the serial line of the target machine that has "flags 080" set on its sio device to any serial line of the debugging host. Now, on the debugging machine, go to the compile directory of the target kernel, and start gdb:

% gdb -k kernel
GDB is free software and you are welcome to distribute copies of it
 under certain conditions; type "show copying" to see the conditions.
There is absolutely no warranty for GDB; type "show warranty" for details.
GDB 4.16 (i386-unknown-freebsd),
Copyright 1996 Free Software Foundation, Inc...
(kgdb)

  Initialize the remote debugging session (assuming the first serial port is being used) by:

(kgdb) target remote /dev/cuaa0

  Now, on the target host (the one that entered DDB right before even starting the device probe), type:

Debugger("Boot flags requested debugger")
Stopped at Debugger+0x35: movb  $0, edata+0x51bc
db> gdb

  DDB will respond with:

Next trap will enter GDB remote protocol mode

  Every time you type gdb, the mode will be toggled between remote GDB and local DDB. In order to force a next trap immediately, simply type s (step). Your hosting GDB will now gain control over the target kernel:

Remote debugging using /dev/cuaa0
Debugger (msg=0xf01b0383 "Boot flags requested debugger")
    at ../../i386/i386/db_interface.c:257
(kgdb)

  You can use this session almost as any other GDB session, including full access to the source, running it in gud-mode inside an Emacs window (which gives you an automatic source code display in another Emacs window), etc.


10.7 Debugging Loadable Modules Using GDB

  When debugging a panic that occurred within a module, or using remote GDB against a machine that uses dynamic modules, you need to tell GDB how to obtain symbol information for those modules.

  First, you need to build the module(s) with debugging information:

# cd /sys/modules/linux
# make clean; make COPTS=-g

  If you are using remote GDB, you can run kldstat on the target machine to find out where the module was loaded:

# kldstat
Id Refs Address    Size     Name
 1    4 0xc0100000 1c1678   kernel
 2    1 0xc0a9e000 6000     linprocfs.ko
 3    1 0xc0ad7000 2000     warp_saver.ko
 4    1 0xc0adc000 11000    linux.ko

  If you are debugging a crash dump, you will need to walk the linker_files list, starting at linker_files->tqh_first and following the link.tqe_next pointers until you find the entry with the filename you are looking for. The address member of that entry is the load address of the module.
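
  The list walk described above has the following shape. This is a toy model in plain Python, not kgdb scripting: the LinkerFile class is a stand-in that carries only the members named in the text, the link.tqe_next pointer is flattened into a single tqe_next attribute, and the sample list mirrors the kldstat output shown earlier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LinkerFile:
    """Toy stand-in for one list entry; only the fields named in the text."""
    filename: str
    address: int
    tqe_next: Optional["LinkerFile"] = None


def find_load_address(tqh_first, wanted):
    """Follow tqe_next from the list head until the filename matches."""
    entry = tqh_first
    while entry is not None:
        if entry.filename == wanted:
            return entry.address
        entry = entry.tqe_next
    return None  # reached the end of the list without a match


if __name__ == "__main__":
    linux = LinkerFile("linux.ko", 0xC0ADC000)
    warp = LinkerFile("warp_saver.ko", 0xC0AD7000, linux)
    linproc = LinkerFile("linprocfs.ko", 0xC0A9E000, warp)
    kernel = LinkerFile("kernel", 0xC0100000, linproc)
    print(hex(find_load_address(kernel, "linux.ko")))
```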

  Next, you need to find out the offset of the text section within the module:

# objdump --section-headers /sys/modules/linux/linux.ko | grep text
  3 .rel.text     000016e0  000038e0  000038e0  000038e0  2**2
 10 .text         00007f34  000062d0  000062d0  000062d0  2**2

  The one you want is the .text section, section 10 in the above example. The fourth hexadecimal field (sixth field overall) is the offset of the text section within the file. Add this offset to the load address of the module to obtain the relocation address for the module's code. In our example, we get 0xc0adc000 + 0x62d0 = 0xc0ae22d0. Use the add-symbol-file command in GDB to tell the debugger about the module:

(kgdb) add-symbol-file /sys/modules/linux/linux.ko 0xc0ae22d0
add symbol table from file "/sys/modules/linux/linux.ko" at text_addr = 0xc0ae22d0?
(y or n) y
Reading symbols from /sys/modules/linux/linux.ko...done.
(kgdb)

  You should now have access to all the symbols in the module.
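
  The address arithmetic from the steps above can be double-checked with a short script. The sketch takes a load address and one line of objdump --section-headers output and adds the file offset (the sixth whitespace-separated field) to the load address; text_relocation is a hypothetical helper name.

```python
def text_relocation(load_address, objdump_line):
    """Add the .text file offset from an objdump header line to a load address."""
    fields = objdump_line.split()
    # Fields: index, name, size, VMA, LMA, file offset, alignment.
    if len(fields) < 6 or fields[1] != ".text":
        raise ValueError("not a .text section header line")
    file_offset = int(fields[5], 16)
    return load_address + file_offset


if __name__ == "__main__":
    line = " 10 .text         00007f34  000062d0  000062d0  000062d0  2**2"
    print(hex(text_relocation(0xC0ADC000, line)))  # prints 0xc0ae22d0
```

  Checking the section name exactly, rather than searching for the substring "text", keeps the .rel.text line that grep also returns from being used by mistake.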


10.8 Debugging a Console Driver

  Since you need a console driver to run DDB on, things are more complicated if the console driver itself is failing. In that case, remember that you can use a serial console (either with modified boot blocks, or by specifying -h at the Boot: prompt) and hook up a standard terminal to your first serial port. DDB works on any configured console driver, including a serial console.

Part IV. Architectures
