
The Linux Kernel Module Programming Guide

Peter Jay Salzman
Michael Burian
Ori Pomerantz

The Linux Kernel Module Programming Guide is a free book; you may reproduce and/or modify it under the terms of the Open Software License, version 1.1. You can obtain a copy of this license at

This book is distributed in the hope it will be useful, but without any warranty, without even the implied warranty of merchantability or fitness for a particular purpose.

The author encourages wide distribution of this book for personal or commercial use, provided the above copyright notice remains intact and the method adheres to the provisions of the Open Software License. In summary, you may copy and distribute this book free of charge or for a profit. No explicit permission is required from the author for reproduction of this book in any medium, physical or electronic.

Derivative works and translations of this document must be placed under the Open Software License, and the original copyright notice must remain intact. If you have contributed new material to this book, you must make the material and source code available for your revisions. Please make revisions and updates available directly to the document maintainer, Peter Jay Salzman. This will allow for the merging of updates and provide consistent revisions to the Linux community.

If you publish or distribute this book commercially, donations, royalties, and/or printed copies are greatly appreciated by the author and the Linux Documentation Project (LDP). Contributing in this way shows your support for free software and the LDP. If you have questions or comments, please contact the address above.

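1. Authorship

The Linux Kernel Module Programming Guide was originally written for the 2.2 kernels by Ori Pomerantz. Ori later handed maintenance of the document over to Peter Jay Salzman, who updated it for the 2.4 kernels; the Linux kernel module is, after all, a fast-moving target. Peter can no longer find enough time to cover the 2.6 kernel, so the 2.6 version of the document is written by co-maintainer Michael Burian.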

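2. Versioning and Notes

The Linux kernel module is a moving target, and there has always been a debate about whether the LKMPG should drop outdated information or keep it for historical purposes. Michael and I decided to create a new branch of the document for each new stable kernel version. LKMPG 2.4.x therefore addresses Linux kernel 2.4, and LKMPG 2.6.x addresses Linux kernel 2.6. No attempt is made to support older kernels within a single document; readers interested in them should look for the appropriately versioned branch.

Most of the source code and discussion in this document should apply to other architectures, but I cannot promise anything. One exception is Chapter 12, Interrupt Handlers, whose source code and discussion apply only to the x86 architecture.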

3. Acknowledgements

The following people have contributed valuable corrections or suggestions to this document: Ignacio Martin, David Porter, Daniele Paolo Scarpazza and Dimo Velev.

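4. Translator's Note

Encouraged by the original author Ori, this translation builds on the earlier translation of the LKMPG 2.4, with slight changes and additions to the content. It should be the most recent version available. The translation method has also changed: the text was translated in the LDP-approved DocBook format and converted with docbook2html into the attached HTML document. Because I am not very familiar with DocBook, some headings remain untranslated and some original tags may have been broken, which causes a few display errors in the HTML, although these are rare overall. Many typos from the 2.4 translation have been corrected.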


Chapter 1. Introduction

1.1. What Is a Kernel Module?



1.2. How Do Modules Get Into the Kernel?

You can see which modules are already loaded into the kernel by running lsmod, which gets its information by reading the file /proc/modules.
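As a rough illustration of what reading /proc/modules involves, the sketch below parses one line of that file in user space. It assumes the 2.6-style layout "name size refcount dependencies state address"; the struct and function names are hypothetical and are not part of lsmod.

```c
#include <assert.h>
#include <stdio.h>

/* The first three fields of one /proc/modules entry (assumed 2.6 layout). */
struct mod_entry {
	char name[64];		/* module name, e.g. "fat" */
	unsigned long size;	/* module size in bytes */
	int refcount;		/* number of current users */
};

/*
 * Parse the leading fields of one /proc/modules line.
 * Returns 0 on success, -1 if the line does not match.
 */
int parse_modules_line(const char *line, struct mod_entry *out)
{
	assert(line != NULL && out != NULL);
	if (sscanf(line, "%63s %lu %d",
		   out->name, &out->size, &out->refcount) != 3)
		return -1;
	return 0;
}
```

A caller would typically read /proc/modules line by line with fgets() and pass each line to parse_modules_line().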


  • A module name like softdog or ppp.

  • A more generic identifier like char-major-10-30.


alias char-major-10-30 softdog


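Next, modprobe looks through the file /lib/modules/version/modules.dep to see whether other modules must be loaded before the requested module. This file is created by depmod -a and records module dependencies. For example, msdos.o requires the fat.o module to be loaded into the kernel already. The requested module depends on another module whenever that other module defines symbols (usually variables or functions) that the requested module uses.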


insmod /lib/modules/2.5.1/kernel/fs/fat/fat.o
insmod /lib/modules/2.5.1/kernel/fs/msdos/msdos.o

or simply run "modprobe -a msdos".

Linux distributions provide modprobe, insmod and depmod in a package called modutils or mod-utils.

Before finishing this chapter, let's take a quick look at an /etc/modules.conf file:

#This file is automatically generated by update-modules
options mydriver irq=10
alias eth0 eepro


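Lines beginning with path[misc] tell modprobe to replace the search path for misc modules with the directory /lib/modules/2.4.?/local. As you can see, shell metacharacters are honored.

Lines beginning with path[net] tell modprobe to look for net modules in the directory ~p/mymodules. However, the "keep" directive preceding the path[net] directive tells modprobe to add this directory to the standard search path rather than replace it, as was done for the misc modules.

Lines beginning with alias tell modprobe to load eepro.o whenever kmod requests the module for the generic identifier 'eth0'.

You won't find alias lines like "alias block-major-2 floppy" in /etc/modules.conf, because modprobe already knows about the standard drivers installed on most systems.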


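1.2.1. Before We Begin

Before we delve into code, there are a few issues to cover. Differences between systems cause many difficulties, and getting your first "hello world" module to compile and load correctly can be tricky. Once you are over that initial hurdle, things go much more smoothly.

Modversioning

A module compiled for one kernel version won't load into a different kernel if CONFIG_MODVERSIONS is enabled in the kernel. We won't go into module versioning until later in this guide. Until then, the examples in this guide may not work on a kernel with modversioning turned on, and most stock distribution kernels do have it turned on. If you run into versioning errors, the best course is to compile a kernel with modversioning turned off.

Using X


Modules can't print to the screen the way printf() can, but they can log information and warnings, which end up on your screen only if you are working on a console. If you insmod a module from an xterm, the messages are logged only to your log files, and you won't see them unless you look through those files. To see this information immediately, do all your work from the console.

Compiling Issues and Kernel Version



I strongly recommend downloading the kernel source from a Linux mirror site, compiling a new kernel and booting into it to avoid the problems above. See the "Linux Kernel HOWTO" for details.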


Chapter 2. Hello World

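2.1. Hello, World (part 1): The Simplest Module

When the first caveman programmer chiseled the first program on the wall of the first cave computer, it was a program to paint the string `Hello, world' in antelope pictures. Roman programming textbooks began with the `Salut, Mundi' program. I don't know what happens to people who break with this tradition, but I think it's safer not to find out. We'll start with a series of hello world modules that demonstrate the different aspects of the basics of writing a kernel module.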


Example 2-1. hello-1.c

/*
 *  hello-1.c - The simplest kernel module.
 */
#include <linux/module.h>	/* Needed by all modules */
#include <linux/kernel.h>	/* Needed for KERN_ALERT */

int init_module(void)
{
	printk("<1>Hello world 1.\n");

	/*
	 * A non 0 return means init_module failed; module can't be loaded.
	 */
	return 0;
}

void cleanup_module(void)
{
	printk(KERN_ALERT "Goodbye world 1.\n");
}

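Kernel modules must have at least two functions: a "start" (initialization) function called init_module(), and an "end" (cleanup) function called cleanup_module(), which runs when the module is unloaded with rmmod. Actually, things have changed starting with kernel 2.3.13. You can now use whatever names you like for the start and end functions of a module, and you'll learn how to do this in Section 2.3. In fact, the new method is the preferred method. However, many people still use init_module() and cleanup_module() for their start and end functions.


Lastly, every kernel module needs to include linux/module.h. We needed to include linux/kernel.h only for the macro expansion of the printk() log level, KERN_ALERT, which you'll learn about in Section 2.1.1.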

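2.1.1. Introducing printk()

Despite what you might think, printk() was not meant to communicate information to the user, even though we used it for exactly this purpose in hello-1! It is a logging mechanism for the kernel, used to log information or give warnings. Each printk() statement therefore comes with a priority, which is the <1> and KERN_ALERT you see. There are eight priorities, and the kernel has macros for them, so you don't have to use cryptic numbers; you can view the macros and their meanings in linux/kernel.h. If you don't specify a priority level, the default priority, DEFAULT_MESSAGE_LOGLEVEL, is used.


If the priority value is less than int console_loglevel (a lower number means a more urgent message), the message is printed on your current terminal. If both syslogd and klogd are running, the message is also appended to /var/log/messages, whether or not it is printed to the console. We use a high priority, like KERN_ALERT, to make sure the printk() messages get printed to your console rather than just logged to your logfile. When you write real modules, you'll want to use priorities that are meaningful for the situation at hand.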

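2.2. Compiling Kernel Modules

Kernel modules need to be compiled with certain gcc options, and with certain symbols defined. This is because the kernel header files behave differently depending on whether we are compiling a kernel module or an executable. In earlier kernel versions we had to take care of these settings ourselves in Makefiles. Although those Makefiles were organized hierarchically, they accumulated many redundant settings that made the tree large and hard to maintain. Fortunately, there is a new way of doing these things, called kbuild, and the build process for external loadable modules is now integrated into the standard kernel build mechanism. To learn more about compiling modules that are not part of the official kernel tree (such as the examples in this guide), see the file linux/Documentation/kbuild/modules.txt.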


Example 2-2. A basic Makefile

obj-m += hello-1.o

Now you can compile the module by issuing the command make -C /usr/src/linux-`uname -r` SUBDIRS=$PWD modules. You should obtain output resembling the following:

[root@pcsenonsrv test_module]# make -C /usr/src/linux-`uname -r` SUBDIRS=$PWD modules
make: Entering directory `/usr/src/linux-2.6.x'
  CC [M]  /root/test_module/hello-1.o
  Building modules, stage 2.
  CC      /root/test_module/hello-1.mod.o
  LD [M]  /root/test_module/hello-1.ko
make: Leaving directory `/usr/src/linux-2.6.x'

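Note that kernel 2.6 introduces a new file naming convention: kernel modules now have a .ko extension (in place of the old .o extension), which distinguishes them from conventional object files. Additional details are available in linux/Documentation/kbuild/makefiles.txt. Be sure to read these documents before you start studying Makefiles.

Now it is time to insert your module into the kernel with insmod ./hello-1.ko (ignore anything you see about tainted kernels; we'll cover that later).

All modules loaded into the kernel are listed in /proc/modules. Go ahead and cat that file to see that your module is really a part of the kernel. If it is, congratulations: you are now the author of a kernel module! When the novelty wears off, remove your module from the kernel with rmmod hello-1. Then take a look at /var/log/messages to see that it was logged to your system logfile.

Here's another exercise. See the comment above the return statement in init_module()? Change the return value to something non-zero, recompile and load the module again. What happens?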

2.3. Hello World (part 2)

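As of Linux 2.4, you can give the "start" and "end" functions of your module any names you like; they no longer have to be called init_module() and cleanup_module(). This is done with the module_init() and module_exit() macros, which are defined in linux/init.h. The only caveat is that your init and cleanup functions must be defined before the macros are used, otherwise you'll get compilation errors. Here's an example: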

Example 2-3. hello-2.c

/*
 *  hello-2.c - Demonstrating the module_init() and module_exit() macros.
 *  This is preferred over using init_module() and cleanup_module().
 */
#include <linux/module.h>	/* Needed by all modules */
#include <linux/kernel.h>	/* Needed for KERN_ALERT */
#include <linux/init.h>		/* Needed for the macros */

static int __init hello_2_init(void)
{
	printk(KERN_ALERT "Hello, world 2\n");
	return 0;
}

static void __exit hello_2_exit(void)
{
	printk(KERN_ALERT "Goodbye, world 2\n");
}

module_init(hello_2_init);
module_exit(hello_2_exit);


Example 2-4. Makefile for both our modules

obj-m += hello-1.o
obj-m += hello-2.o

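Now have a look at linux/drivers/char/Makefile for a real-world example. As you can see, some things get built into the kernel (obj-y), but where did all the obj-m entries go? Those familiar with shell scripts will spot them easily. The obj-$(CONFIG_FOO) entries you see everywhere in the Makefile expand into the familiar obj-y or obj-m, depending on whether CONFIG_FOO has been set to y or m. These are exactly the variables stored in linux/.config, which is generated when you configure the kernel with make menuconfig.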

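2.4. Hello World (part 3): The __init and __exit Macros


__initdata works similarly to __init, but for variables rather than functions.

The __exit macro causes the cleanup function to be omitted when the module is built into the kernel. Like __init, it has no effect for loadable modules. This makes sense once you consider when the cleanup function runs: built-in drivers never need a cleanup function, while loadable modules do.

These macros are defined in linux/init.h and serve to free up kernel memory. When you boot your kernel and see something like Freeing unused kernel memory: 236k freed, this is precisely what the kernel is freeing.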

Example 2-5. hello-3.c

/*
 *  hello-3.c - Illustrating the __init, __initdata and __exit macros.
 */
#include <linux/module.h>	/* Needed by all modules */
#include <linux/kernel.h>	/* Needed for KERN_ALERT */
#include <linux/init.h>		/* Needed for the macros */

static int hello3_data __initdata = 3;

static int __init hello_3_init(void)
{
	printk(KERN_ALERT "Hello, world %d\n", hello3_data);
	return 0;
}

static void __exit hello_3_exit(void)
{
	printk(KERN_ALERT "Goodbye, world 3\n");
}

module_init(hello_3_init);
module_exit(hello_3_exit);

2.5. Hello World (part 4): Licensing and Module Documentation


# insmod hello-3.o
Warning: loading hello-3.o will taint the kernel: no license
  See for information about tainted modules
Hello, world 3
Module hello-3 loaded, with warnings


/*
 * The following license idents are currently accepted as indicating free
 * software modules
 *
 *	"GPL"				[GNU Public License v2 or later]
 *	"GPL v2"			[GNU Public License v2]
 *	"GPL and additional rights"	[GNU Public License v2 rights and more]
 *	"Dual BSD/GPL"			[GNU Public License v2
 *					 or BSD license choice]
 *	"Dual MPL/GPL"			[GNU Public License v2
 *					 or Mozilla license choice]
 *
 * The following other idents are available
 *
 *	"Proprietary"			[Non free products]
 *
 * There are dual licensed components, but when running with Linux it is the
 * GPL that is relevant so this is a non issue. Similarly LGPL linked with GPL
 * is a GPL combined work.
 *
 * This exists for several reasons
 * 1.	So modinfo can show license info for users wanting to vet their setup
 *	is free
 * 2.	So the community can ignore bug reports including proprietary modules
 * 3.	So vendors can do likewise based on their own policies
 */



Example 2-6. hello-4.c

/*
 *  hello-4.c - Demonstrates module documentation.
 */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#define DRIVER_AUTHOR "Peter Jay Salzman <[email protected]>"
#define DRIVER_DESC   "A sample driver"

static int __init init_hello_4(void)
{
	printk(KERN_ALERT "Hello, world 4\n");
	return 0;
}

static void __exit cleanup_hello_4(void)
{
	printk(KERN_ALERT "Goodbye, world 4\n");
}

module_init(init_hello_4);
module_exit(cleanup_hello_4);

/*
 *  You can use strings, like this:
 */

/*
 * Get rid of taint message by declaring code as GPL.
 */
MODULE_LICENSE("GPL");

/*
 * Or with defines, like this:
 */
MODULE_AUTHOR(DRIVER_AUTHOR);	/* Who wrote this module? */
MODULE_DESCRIPTION(DRIVER_DESC);	/* What does this module do */

/*
 *  This module uses /dev/testdevice.  The MODULE_SUPPORTED_DEVICE macro might
 *  be used in the future to help automatic configuration of modules, but is
 *  currently unused other than for documentation purposes.
 */
MODULE_SUPPORTED_DEVICE("testdevice");

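2.6. Passing Command Line Arguments to a Module


To allow arguments to be passed to your module, declare the variables that will receive the command line values as global, and then use the MODULE_PARM() macro (defined in linux/module.h). At runtime, insmod fills the variables with any command line arguments that are given, as in ./insmod mymodule.o myvariable=5. For clarity, the variable declarations and macros should be placed at the beginning of the module. The example code should clear up my admittedly lousy explanation. Note that the 2.6 listing in Example 2-7 uses the newer module_param() macro from linux/moduleparam.h.

The MODULE_PARM() macro takes two arguments: the variable's name and its type. The supported types are "b": single byte, "h": short int, "i": integer, "l": long int and "s": string, and the integer types can be signed or unsigned. Strings should be declared as "char *" so that insmod can allocate memory for them. You should always give your variables an initial default value. This is kernel code, and you should program defensively. For example: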

int myint = 3;
char *mystr;

MODULE_PARM(myint, "i");
MODULE_PARM(mystr, "s");


int myintArray[9];	/* room for the maximum of 9 values allowed below */
MODULE_PARM (myintArray, "3-9i");



Example 2-7. hello-5.c

/*
 *  hello-5.c - Demonstrates command line argument passing to a module.
 */
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/stat.h>

MODULE_AUTHOR("Peter Jay Salzman");

static short int myshort = 1;
static int myint = 420;
static long int mylong = 9999;
static char *mystring = "blah";

/*
 * module_param(foo, int, 0000)
 * The first param is the parameter's name
 * The second param is its data type
 * The final argument is the permissions bits,
 * for exposing parameters in sysfs (if non-zero) at a later stage.
 */

module_param(myshort, short, S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP);
MODULE_PARM_DESC(myshort, "A short integer");
module_param(myint, int, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
MODULE_PARM_DESC(myint, "An integer");
module_param(mylong, long, S_IRUSR);
MODULE_PARM_DESC(mylong, "A long integer");
module_param(mystring, charp, 0000);
MODULE_PARM_DESC(mystring, "A character string");

static int __init hello_5_init(void)
{
	printk(KERN_ALERT "Hello, world 5\n=============\n");
	printk(KERN_ALERT "myshort is a short integer: %hd\n", myshort);
	printk(KERN_ALERT "myint is an integer: %d\n", myint);
	printk(KERN_ALERT "mylong is a long integer: %ld\n", mylong);
	printk(KERN_ALERT "mystring is a string: %s\n", mystring);
	return 0;
}

static void __exit hello_5_exit(void)
{
	printk(KERN_ALERT "Goodbye, world 5\n");
}

module_init(hello_5_init);
module_exit(hello_5_exit);


satan# insmod hello-5.o mystring="bebop" mybyte=255 myintArray=-1
mybyte is an 8 bit integer: 255
myshort is a short integer: 1
myint is an integer: 20
mylong is a long integer: 9999
mystring is a string: bebop
myintArray is -1 and 420

satan# rmmod hello-5
Goodbye, world 5

satan# insmod hello-5.o mystring="supercalifragilisticexpialidocious" \
> mybyte=256 myintArray=-1,-1
mybyte is an 8 bit integer: 0
myshort is a short integer: 1
myint is an integer: 20
mylong is a long integer: 9999
mystring is a string: supercalifragilisticexpialidocious
myintArray is -1 and -1

satan# rmmod hello-5
Goodbye, world 5

satan# insmod hello-5.o mylong=hello
hello-5.o: invalid argument syntax for mylong: 'h'

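2.7. Modules Spanning Multiple Files


  1. In all the source files but one, add the line #define __NO_VERSION__. This is important because module.h normally includes the definition of kernel_version, a global variable holding the kernel version the module is compiled for. If you need version.h, you have to include it yourself, because module.h won't do it for you when __NO_VERSION__ is defined.

  2. Compile all the source files as usual.

  3. Combine all the object files into a single one. Under x86, use ld -m elf_i386 -r -o <modulename.o> <1st src file.o> <2nd src file.o>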



Example 2-8. start.c

/*
 *  start.c - Illustration of multi filed modules
 */

#include <linux/kernel.h>	/* We're doing kernel work */
#include <linux/module.h>	/* Specifically, a module */

int init_module(void)
{
	printk("Hello, world - this is the kernel speaking\n");
	return 0;
}


Example 2-9. stop.c

/*
 *  stop.c - Illustration of multi filed modules
 */

#include <linux/kernel.h>	/* We're doing kernel work */
#include <linux/module.h>	/* Specifically, a module  */

void cleanup_module(void)
{
	printk("<1>Short is the life of a kernel module\n");
}


Example 2-10. Makefile

obj-m += hello-1.o
obj-m += hello-2.o
obj-m += hello-3.o
obj-m += hello-4.o
obj-m += hello-5.o
obj-m += startstop.o
startstop-objs := start.o stop.o

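2.8. Building Modules for a Precompiled Kernel

Obviously, we strongly suggest that you compile a new kernel, so that you can enable a number of useful debugging features, such as forced module unloading (MODULE_FORCE_UNLOAD): when this option is enabled, you can use rmmod -f module to force the kernel to unload a module even when it believes this is unsafe. This option can save you a lot of development time.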



insmod: error inserting 'poet_atkm.ko': -1 Invalid module format


Jun  4 22:07:54 localhost kernel: poet_atkm: version magic '2.6.5-1.358custom 686 
REGPARM 4KSTACKS gcc-3.3' should be '2.6.5-1.358 686 REGPARM 4KSTACKS gcc-3.3'

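In other words, your kernel refuses to accept your module because the version strings (more precisely, the version magics) do not match. The version magic is stored in the module object as a static string starting with vermagic:. The version data is inserted into your module when it is linked against the file init/vermagic.o. To inspect the version magic and other strings stored in a module, use the command modinfo module.ko: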

[root@pcsenonsrv 02-HelloWorld]# modinfo hello-4.ko 
license:        GPL
author:         Peter Jay Salzman <[email protected]>
description:    A sample driver
vermagic:       2.6.5-1.358 686 REGPARM 4KSTACKS gcc-3.3


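First, make sure a kernel source tree is available that has exactly the same version as your current kernel. Then find the configuration file that was used to compile your current kernel. It is usually available in /boot under a name like config-2.6.x. You can simply copy it into your kernel source tree: cp /boot/config-`uname -r` /usr/src/linux-`uname -r`/.config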


EXTRAVERSION = -1.358custom

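In this case, you need to restore the value of EXTRAVERSION to -1.358. We suggest keeping a backup copy of the original makefile in /lib/modules/2.6.5-1.358/build. A simple cp /lib/modules/`uname -r`/build/Makefile /usr/src/linux-`uname -r` should suffice. Additionally, if you are already running a kernel built with the previous (wrong) Makefile, you should rerun make, or directly modify UTS_RELEASE in the file /usr/src/linux-2.6.x/include/linux/version.h to match the contents of /lib/modules/2.6.x/build/include/linux/version.h, or overwrite the source-tree copy with the one from /lib/modules.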


[root@pcsenonsrv linux-2.6.x]# make
CHK     include/linux/version.h
UPD     include/linux/version.h
SYMLINK include/asm -> include/asm-i386
SPLIT   include/linux/autoconf.h -> include/config/*
HOSTCC  scripts/basic/fixdep
HOSTCC  scripts/basic/split-include
HOSTCC  scripts/basic/docproc
HOSTCC  scripts/conmakehash
HOSTCC  scripts/kallsyms
CC      scripts/empty.o


Chapter 3. Preliminaries

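3.1. Modules vs Programs

3.1.1. How modules begin and end



Every module must have an entry function and an exit function. Since there is more than one way to specify these two functions, I'll try my best to use the terms "entry function" and "exit function". If I slip and simply refer to them as init_module and cleanup_module, I think you'll know what I mean.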

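3.1.2. Functions available to modules

Programmers use functions they don't define all the time. A prime example is printf(). You use these library functions, which are provided by the standard C library, libc. The definitions of these functions (printf(), for example) don't actually enter your program until the linking stage, which points each call at the library code so that your program can finally run.

Kernel modules are different. In the hello world example, you might have noticed that we used a function, printk(), but didn't include a standard I/O library. That's because modules are object files whose symbols are resolved when they are loaded with insmod. The definitions of those symbols come from the kernel itself; the only external functions you can use in a module are the ones provided by the kernel. If you're curious about which symbols your kernel exports, take a look at /proc/kallsyms.

One point to keep in mind is the difference between library functions and system calls. Library functions are higher level, run completely in user space and give the programmer a more convenient interface to the functions that do the real work behind the scenes: system calls. System calls run in kernel mode and are provided by the kernel itself. The library function printf() may look like a very general printing function, but all it really does is format the data into strings and write them out using the low-level system call write().

Would you like to see which system calls printf() makes? It's easy. Compile the following program: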

#include <stdio.h>
int main(void)
{ printf("hello"); return 0; }

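Compile it with gcc -Wall -o hello hello.c and run the executable with strace ./hello. Are you impressed? Every line you see corresponds to a system call. strace[3] is a handy program that tells you which system calls a program makes, along with their arguments and return values. It is an invaluable tool for finding out what a program is doing. Towards the end of the output, you should see a line that looks like write(1, "hello", 5hello). There it is: the face behind the printf() mask. You may not be familiar with write, since most people use library functions for file I/O (like fopen, fputs, fclose). If that's the case, try man 2 write. The second man section is devoted to system calls (like kill() and read()). The third man section is devoted to library calls, which you are probably more familiar with (like cosh() and random()).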
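To make the relationship concrete, here is a minimal user-space sketch that skips printf() and calls write(2) directly. The helper name write_all is hypothetical; it simply retries until the whole string has been handed to the kernel.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

/*
 * Write a whole string to a file descriptor using the write() system
 * call, retrying after short writes. Returns the number of bytes
 * written, or -1 if write() reports an error.
 */
ssize_t write_all(int fd, const char *s)
{
	size_t len, done = 0;

	assert(s != NULL);
	len = strlen(s);
	while (done < len) {
		ssize_t n = write(fd, s + done, len - done);

		if (n < 0)
			return -1;
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```

Calling write_all(1, "hello") sends the string to standard output (file descriptor 1), which is the same write() that strace shows behind printf().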

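You can even write modules that replace the kernel's system calls, which we'll do shortly. Crackers often use this sort of thing to install backdoors or trojans, but you can write your own modules to do more benign things, like having the kernel write "Tee hee, that tickles!" every time someone deletes a file on your system.

3.1.3. User Space vs Kernel Space

A kernel is all about access to resources, whether the resource in question is a video card, a hard drive or memory. Programs often compete for the same resource. As I save this document, a local database is being updated: my vim editor process and the database update process both want to use the hard drive at the same time. The kernel needs to keep such requests orderly, rather than handing out resources whenever users feel like it. To this end, a CPU can run in different modes, and each mode gives a different level of freedom to do what you want on the system. The Intel 80386 architecture has four of these modes. Unix uses only two of them: the highest (ring 0, also known as "supervisor mode", where everything is allowed) and the lowest, which is called "user mode".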


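3.1.4. Name Space

When you write a small C program, you use variable names that are convenient and make sense to the reader. If, on the other hand, your code is only one part of a larger body of code written by many people, your global variables may clash with theirs. When a program has lots of global variables that aren't meaningful enough to be told apart, you get namespace pollution. In large projects, effort must be made to remember reserved names, or to adopt a consistent scheme for creating unique names.

When writing kernel code, even the smallest module is linked against the entire kernel, so this is definitely an issue. The best way to deal with it is to declare all your variables as static and to use a well-defined prefix for your symbols. By convention, kernel prefixes are lowercase. If you don't want to declare everything as static, another option is to declare a symbol table and register it with the kernel. We'll get to this later.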


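3.1.5. Code space

Memory management is a very complicated subject; the majority of O'Reilly's "Understanding The Linux Kernel" is about memory management alone! We're not setting out to be experts on memory management, but we do need to know a couple of facts.

If you haven't thought about what a segfault really means, you may be surprised to hear that pointers don't actually point to real memory locations. When a process is created, the kernel sets aside a portion of real physical memory and hands it to the process to use for its code, variables, stack, heap and other things that a computer scientist would know about[4]. This memory begins at 0 and extends up to whatever it needs to be. Since the memory spaces of different processes don't overlap, processes that access the same address, say 0xbffff978, are in fact accessing different locations in real physical memory. What each process accesses is an offset of 0xbffff978 into the region of memory set aside for that particular process. For the most part, a process like our "Hello, World" program can't access the space of another process, although there are ways to do so, which we'll talk about later.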
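The following user-space sketch illustrates the point about per-process address spaces. A child process created with fork() sees the variable at the same virtual address as its parent, yet writing to it does not affect the parent's copy. The variable and function names are made up for this illustration.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Lives at the same virtual address in the parent and in the child. */
static int shared_looking = 1;

/*
 * Fork a child that overwrites the variable, then report the value the
 * parent still sees. Returns that value, or -1 on failure.
 */
int value_after_child_write(void)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		shared_looking = 42;	/* changes only the child's copy */
		_exit(shared_looking == 42 ? 0 : 1);
	}

	assert(pid > 0);		/* only the parent gets here */
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
		return -1;
	return shared_looking;		/* the parent's copy is untouched */
}
```

A kernel module enjoys no such isolation from the rest of the kernel, since it runs inside the kernel's own address space.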

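The kernel has its own memory space as well. Since a module is code that can be dynamically inserted into and removed from the kernel, it shares the kernel's codespace rather than having its own. Therefore, if your module segfaults, the kernel segfaults. And if you start writing over data by mistake, you are trampling on kernel code. This is even worse than it sounds, so try your best to be careful.

By the way, the above discussion is true for any operating system that uses a monolithic kernel[5]. There are also modular microkernel operating systems, such as the GNU Hurd and QNX Neutrino.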

3.1.6. Device Drivers

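One class of module is the device driver, which provides functionality for hardware like a TV card or a serial port. On Unix, each piece of hardware is represented by a device file located in /dev, which provides the means to communicate with the hardware. The device driver provides that communication on behalf of a user program. For example, the es1370.o sound card driver might connect the /dev/sound device file to the Ensoniq IS1370 sound card. A userspace program like mp3blaster can then use /dev/sound without ever knowing what kind of sound card is installed.

Major and Minor Numbers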


# ls -l /dev/hda[1-3]
brw-rw----  1 root  disk  3, 1 Jul  5  2000 /dev/hda1
brw-rw----  1 root  disk  3, 2 Jul  5  2000 /dev/hda2
brw-rw----  1 root  disk  3, 3 Jul  5  2000 /dev/hda3

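Notice the two numbers separated by a comma. The first number is called the device's major number, and the second is the minor number. The major number tells you which driver is used to access the hardware. Each driver is assigned a unique major number; all device files with the same major number are controlled by the same driver. All the major numbers above are 3, because they are all controlled by the same driver.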


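Devices are divided into two types: character devices and block devices. The difference is that block devices have a buffer for requests, so they can choose the best order in which to respond to them. This is important for storage devices, where it is faster to read or write sectors that are close to each other than sectors that are far apart. Another difference is that block devices accept input and return output only in blocks, whereas character devices may use as many or as few bytes as they like. Most devices are character devices, because they don't need this kind of buffering and don't transfer data in blocks. You can tell whether a device file is for a block device or a character device by looking at the first character in the output of ls -l. If it is 'b' it is a block device, and if it is 'c' it is a character device. The devices you see above are block devices. Here are some character devices (the serial ports):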

crw-rw----  1 root  dial 4, 64 Feb 18 23:34 /dev/ttyS0
crw-r-----  1 root  dial 4, 65 Nov 17 10:26 /dev/ttyS1
crw-rw----  1 root  dial 4, 66 Jul  5  2000 /dev/ttyS2
crw-rw----  1 root  dial 4, 67 Jul  5  2000 /dev/ttyS3


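When the system was installed, all of these device files were created by the mknod command. To create a new char device named `coffee' with major number 12 and minor number 2, simply run mknod /dev/coffee c 12 2. You don't have to put your device files into /dev, but it is done by convention. Linus put his device files in /dev, and so should you. However, when you are testing a module, it is fine to create the device file in your working directory. Just be sure to put it where the driver can find it when you're done.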



% ls -l /dev/fd0 /dev/fd0u1680
brwxrwxrwx   1 root  floppy   2,  0 Jul  5  2000 /dev/fd0
brw-rw----   1 root  floppy   2, 44 Jul  5  2000 /dev/fd0u1680

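You can tell immediately that these are block device files and that they are handled by the same driver (both have major number 2). You might even be aware that both represent your floppy drive, even if you have only one floppy drive. Why two files? One represents the floppy drive with 1.44 MB of storage. The other is the same floppy drive with 1.68 MB of storage, which corresponds to what some people call a "superformatted" disk. So here is a case where two device files with different minor numbers represent the same piece of physical hardware. Be aware that the word "hardware" in our discussion can mean something very abstract.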

Chapter 4. Character Device Files

4.1. Character Device Files

4.1.1. The file_operations Structure



struct file_operations {
	struct module *owner;
	 loff_t(*llseek) (struct file *, loff_t, int);
	 ssize_t(*read) (struct file *, char __user *, size_t, loff_t *);
	 ssize_t(*aio_read) (struct kiocb *, char __user *, size_t, loff_t);
	 ssize_t(*write) (struct file *, const char __user *, size_t, loff_t *);
	 ssize_t(*aio_write) (struct kiocb *, const char __user *, size_t,
			      loff_t);
	int (*readdir) (struct file *, void *, filldir_t);
	unsigned int (*poll) (struct file *, struct poll_table_struct *);
	int (*ioctl) (struct inode *, struct file *, unsigned int,
		      unsigned long);
	int (*mmap) (struct file *, struct vm_area_struct *);
	int (*open) (struct inode *, struct file *);
	int (*flush) (struct file *);
	int (*release) (struct inode *, struct file *);
	int (*fsync) (struct file *, struct dentry *, int datasync);
	int (*aio_fsync) (struct kiocb *, int datasync);
	int (*fasync) (int, struct file *, int);
	int (*lock) (struct file *, int, struct file_lock *);
	 ssize_t(*readv) (struct file *, const struct iovec *, unsigned long,
			  loff_t *);
	 ssize_t(*writev) (struct file *, const struct iovec *, unsigned long,
			   loff_t *);
	 ssize_t(*sendfile) (struct file *, loff_t *, size_t, read_actor_t,
			     void __user *);
	 ssize_t(*sendpage) (struct file *, struct page *, int, size_t,
			     loff_t *, int);
	unsigned long (*get_unmapped_area) (struct file *, unsigned long,
					    unsigned long, unsigned long,
					    unsigned long);
};



struct file_operations fops = {
	read: device_read,
	write: device_write,
	open: device_open,
	release: device_release
};


struct file_operations fops = {
	.read = device_read,
	.write = device_write,
	.open = device_open,
	.release = device_release
};


A pointer to a struct file_operations is commonly named fops.
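The idea behind such a table of function pointers can be tried out in user space. The sketch below uses a cut-down, hypothetical struct toy_ops rather than the real struct file_operations; it shows that members left out of a designated initializer are NULL and that a caller can test for an unimplemented operation before dispatching.

```c
#include <assert.h>
#include <stddef.h>

/* A cut-down, user-space stand-in for struct file_operations. */
struct toy_ops {
	int (*open)(void);
	int (*release)(void);
	long (*read)(char *buf, size_t len);
	long (*write)(const char *buf, size_t len);
};

static int toy_open(void)
{
	return 0;
}

static long toy_read(char *buf, size_t len)
{
	if (len == 0)
		return 0;
	buf[0] = 'x';		/* pretend the device produced one byte */
	return 1;
}

/* Members not named here are implicitly initialised to NULL. */
static const struct toy_ops toy_fops = {
	.open = toy_open,
	.read = toy_read,
};

/*
 * Dispatch a write through the table. An operation the "driver" did
 * not provide is reported as an error instead of being called.
 */
long toy_do_write(const struct toy_ops *ops, const char *buf, size_t len)
{
	assert(ops != NULL);
	if (ops->write == NULL)
		return -1;	/* operation not supported */
	return ops->write(buf, len);
}
```

Here toy_do_write(&toy_fops, "hi", 2) returns -1 because no write function was supplied, which is the user-space analogue of leaving an entry of the table unset.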

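4.1.2. The file Structure


A pointer to a struct file is commonly named filp. You'll also see it written as struct file file. Resist the temptation.

Go ahead and look at the definition of file. Most of the entries you see, like struct dentry, aren't used by device drivers, and you can ignore them. This is because drivers don't fill in file directly; they only use the data in a file structure that was created elsewhere.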

4.1.3. Registering A Device



int register_chrdev(unsigned int major, const char *name, struct file_operations *fops);

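where unsigned int major is the major number you want to request, const char *name is the name of the device as it will appear in /proc/devices, and struct file_operations *fops is a pointer to the file_operations table for your driver. A negative return value means the registration failed. Note that registration does not require a minor number. The kernel itself doesn't care about the minor number.


If you pass a major number of 0 to register_chrdev, the return value is the dynamically allocated major number. The downside is that you can't create a device file in advance, since you don't know what the major number will be. There are a few ways around this. First, the driver itself can print the newly assigned number, and we can create the device file by hand. Second, the newly registered device has an entry in /proc/devices, and we can either create the device file by hand or write a script that reads the file and creates the device file. Third, we can have our driver create the device file with the mknod system call after a successful registration, and remove it with rm before the driver's cleanup_module is called.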
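The second approach can be sketched as a small user-space helper that searches text in the format of /proc/devices, assumed here to consist of header lines plus "<major> <name>" entries. The function name find_major is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Scan text in the format of /proc/devices ("<major> <name>" per line)
 * and return the major number registered under name, or -1 if absent.
 * Header lines such as "Character devices:" do not parse and are skipped.
 */
int find_major(const char *devices_text, const char *name)
{
	const char *line = devices_text;

	assert(devices_text != NULL && name != NULL);
	while (*line) {
		int major;
		char dev[64];
		const char *next = strchr(line, '\n');

		if (sscanf(line, "%d %63s", &major, dev) == 2 &&
		    strcmp(dev, name) == 0)
			return major;
		if (next == NULL)
			break;
		line = next + 1;
	}
	return -1;
}
```

A script or program could read /proc/devices into a buffer, call find_major() with the driver's name, and then create the device file with the number it returns.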

4.1.4. Unregistering A Device



  • try_module_get(THIS_MODULE): Increment the use count.

  • module_put(THIS_MODULE): Decrement the use count.


4.1.5. chardev.c

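The next code sample creates a char device named chardev. You can cat its device file (or open the file with a program), and the driver will report the number of times the device file has been read. We don't support writing to the file (like echo "hi" > /dev/hello), but we catch these attempts and tell the user that the operation isn't supported. Don't worry about what we do with the data read into the buffer; we don't do much with it. We simply read in the data and print a message acknowledging that we received it.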

Example 4-1. chardev.c

/*
 *  chardev.c: Creates a read-only char device that says how many times
 *  you've read from the dev file
 */

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/fs.h>
#include <asm/uaccess.h>	/* for put_user */

/*
 *  Prototypes - this would normally go in a .h file
 */
int init_module(void);
void cleanup_module(void);
static int device_open(struct inode *, struct file *);
static int device_release(struct inode *, struct file *);
static ssize_t device_read(struct file *, char *, size_t, loff_t *);
static ssize_t device_write(struct file *, const char *, size_t, loff_t *);

#define SUCCESS 0
#define DEVICE_NAME "chardev"	/* Dev name as it appears in /proc/devices   */
#define BUF_LEN 80		/* Max length of the message from the device */

/*
 * Global variables are declared as static, so are global within the file.
 */

static int Major;		/* Major number assigned to our device driver */
static int Device_Open = 0;	/* Is device open?  
				 * Used to prevent multiple access to device */
static char msg[BUF_LEN];	/* The msg the device will give when asked */
static char *msg_Ptr;

static struct file_operations fops = {
	.read = device_read,
	.write = device_write,
	.open = device_open,
	.release = device_release
};

/*
 * Functions
 */

int init_module(void)
{
	Major = register_chrdev(0, DEVICE_NAME, &fops);

	if (Major < 0) {
		printk("Registering the character device failed with %d\n",
		       Major);
		return Major;
	}

	printk("<1>I was assigned major number %d.  To talk to\n", Major);
	printk("<1>the driver, create a dev file with\n");
	printk("'mknod /dev/hello c %d 0'.\n", Major);
	printk("<1>Try various minor numbers.  Try to cat and echo to\n");
	printk("the device file.\n");
	printk("<1>Remove the device file and module when done.\n");

	return 0;
}

void cleanup_module(void)
{
	/*
	 * Unregister the device
	 */
	int ret = unregister_chrdev(Major, DEVICE_NAME);
	if (ret < 0)
		printk("Error in unregister_chrdev: %d\n", ret);
}

/*
 * Methods
 */

/*
 * Called when a process tries to open the device file, like
 * "cat /dev/mycharfile"
 */
static int device_open(struct inode *inode, struct file *file)
{
	static int counter = 0;

	if (Device_Open)
		return -EBUSY;

	Device_Open++;		/* mark the device as in use */
	sprintf(msg, "I already told you %d times Hello world!\n", counter++);
	msg_Ptr = msg;
	try_module_get(THIS_MODULE);	/* increment the usage count */

	return SUCCESS;
}

/*
 * Called when a process closes the device file.
 */
static int device_release(struct inode *inode, struct file *file)
{
	Device_Open--;		/* We're now ready for our next caller */

	/*
	 * Decrement the usage count, or else once you opened the file, you'll
	 * never get rid of the module.
	 */
	module_put(THIS_MODULE);

	return 0;
}

/*
 * Called when a process, which already opened the dev file, attempts to
 * read from it.
 */
static ssize_t device_read(struct file *filp,	/* see include/linux/fs.h   */
			   char *buffer,	/* buffer to fill with data */
			   size_t length,	/* length of the buffer     */
			   loff_t * offset)
{
	/*
	 * Number of bytes actually written to the buffer
	 */
	int bytes_read = 0;

	/*
	 * If we're at the end of the message,
	 * return 0 signifying end of file
	 */
	if (*msg_Ptr == 0)
		return 0;

	/*
	 * Actually put the data into the buffer
	 */
	while (length && *msg_Ptr) {

		/*
		 * The buffer is in the user data segment, not the kernel
		 * segment so "*" assignment won't work.  We have to use
		 * put_user which copies data from the kernel data segment to
		 * the user data segment.
		 */
		put_user(*(msg_Ptr++), buffer++);

		length--;
		bytes_read++;
	}

	/*
	 * Most read functions return the number of bytes put into the buffer
	 */
	return bytes_read;
}

/*
 * Called when a process writes to dev file: echo "hi" > /dev/hello
 */
static ssize_t
device_write(struct file *filp, const char *buff, size_t len, loff_t * off)
{
	printk("<1>Sorry, this operation isn't supported.\n");
	return -EINVAL;
}
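The read logic of chardev.c can be tried out without loading a module. The following user-space sketch mimics the cursor behaviour of device_read(): each call copies at most length bytes, advances the cursor, and returns 0 once the message is exhausted. The struct and function names are hypothetical, and a plain assignment stands in for put_user().

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-in for the driver's read cursor (msg_Ptr in chardev.c). */
struct msg_cursor {
	const char *pos;
};

/*
 * Copy up to length bytes from the cursor into buffer, advancing the
 * cursor. Returns the number of bytes copied; 0 means end of message.
 */
long cursor_read(struct msg_cursor *c, char *buffer, size_t length)
{
	long bytes_read = 0;

	assert(c != NULL && c->pos != NULL && buffer != NULL);
	while (length && *c->pos) {
		*buffer++ = *c->pos++;	/* the driver uses put_user() here */
		length--;
		bytes_read++;
	}
	return bytes_read;
}
```

A reader such as cat keeps calling read until it gets 0, which is why the driver must eventually return 0 to signal end of file.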

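4.1.6. Writing Modules for Multiple Kernel Versions



If you need to support multiple kernel versions, you'll have to write code that compiles conditionally for the different kernels. The way to do this is to compare the macro LINUX_VERSION_CODE with the macro KERNEL_VERSION. In version a.b.c of the kernel, the value of this macro is 2^16×a + 2^8×b + c.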
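The arithmetic can be checked with a few lines of user-space C. The macro below repeats the formula given above under a hypothetical SKETCH_ name so that it does not clash with the real kernel header; at_least() is likewise only an illustration.

```c
#include <assert.h>

/*
 * Same arithmetic as the formula above: 2^16*a + 2^8*b + c.
 * The SKETCH_ prefix marks this as an illustrative name.
 */
#define SKETCH_KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* Return nonzero if the given version code is at least a.b.c. */
int at_least(long code, int a, int b, int c)
{
	assert(a >= 0 && b >= 0 && c >= 0);
	return code >= SKETCH_KERNEL_VERSION(a, b, c);
}
```

For kernel 2.6.5 the formula gives 2×65536 + 6×256 + 5 = 132613. In a module, the corresponding test would be a preprocessor comparison of LINUX_VERSION_CODE against KERNEL_VERSION(a,b,c).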


Chapter 5. The /proc File System

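5.1. The /proc File System


Using the proc file system is very similar to using device files. You create a structure with all the information needed for the /proc file, including pointers to any handler functions (in our case there is only one, the function called when somebody reads from the /proc file). Then init_module registers the structure with the kernel and cleanup_module unregisters it.

The reason we use proc_register_dynamic[7] is that we don't want to choose the inode number for our file ourselves; we let the kernel assign it to prevent clashes. Normal file systems are located on a disk, whereas /proc files exist only in memory. In the former case, the inode number points to the disk location of the file's index-node (inode for short). The inode contains information about the file, such as its permissions, together with a pointer to where the file's data can be found.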


Example 5-1. procfs.c

/*
 *  procfs.c -  create a "file" in /proc
 */

#include <linux/module.h>	/* Specifically, a module */
#include <linux/kernel.h>	/* We're doing kernel work */
#include <linux/proc_fs.h>	/* Necessary because we use the proc fs */

struct proc_dir_entry *Our_Proc_File;

/* Put data into the proc fs file.
 * Arguments
 * =========
 * 1. The buffer where the data is to be inserted, if
 *    you decide to use it.
 * 2. A pointer to a pointer to characters. This is
 *    useful if you don't want to use the buffer
 *    allocated by the kernel.
 * 3. The current position in the file
 * 4. The size of the buffer in the first argument.
 * 5. Write a "1" here to indicate EOF.
 * 6. A pointer to data (useful in case one common 
 *    read for multiple /proc/... entries)
 * Usage and Return Value
 * ======================
 * A return value of zero means you have no further
 * information at this time (end of file). A negative
 * return value is an error condition.
 * For More Information
 * ====================
 * The way I discovered what to do with this function
 * wasn't by reading documentation, but by reading the
 * code which used it. I just looked to see what uses
 * the get_info field of proc_dir_entry struct (I used a
 * combination of find and grep, if you're interested),
 * and I saw that  it is used in <kernel source
 * directory>/fs/proc/array.c.
 * If something is unknown about the kernel, this is
 * usually the way to go. In Linux we have the great
 * advantage of having the kernel source code for
 * free - use it.
 */
int				/* return type restored; matches read_proc */
procfile_read(char *buffer,
	      char **buffer_location,
	      off_t offset, int buffer_length, int *eof, void *data)
{
	int len = 0;		/* The number of bytes actually used */
	static int count = 1;

	printk(KERN_INFO "inside /proc/test : procfile_read\n");

	/*
	 * We give all of our information in one go, so if the
	 * user asks us if we have more information the
	 * answer should always be no.
	 *
	 * This is important because the standard read
	 * function from the library would continue to issue
	 * the read system call until the kernel replies
	 * that it has no more information, or until its
	 * buffer is filled.
	 */
	if (offset > 0) {
		printk(KERN_INFO "offset %d : /proc/test : procfile_read, "
		       "wrote %d Bytes\n", (int)(offset), len);
		*eof = 1;
		return len;
	}

	/*
	 * Fill the buffer and get its length
	 */
	len = sprintf(buffer,
		      "For the %d%s time, go away!\n", count,
		      (count % 100 > 10 && count % 100 < 14) ? "th" :
		      (count % 10 == 1) ? "st" :
		      (count % 10 == 2) ? "nd" :
		      (count % 10 == 3) ? "rd" : "th");
	count++;		/* so the next read reports the next ordinal */

	/*
	 * Return the length
	 */
	printk(KERN_INFO
	       "leaving /proc/test : procfile_read, wrote %d Bytes\n", len);
	return len;
}

int init_module()
{
	printk(KERN_INFO "Trying to create /proc/test:\n");

	Our_Proc_File = create_proc_entry("test", 0644, NULL);

	/*
	 * Check the result before touching the entry: nothing was
	 * created on failure, so there is nothing to remove either.
	 */
	if (Our_Proc_File == NULL) {
		printk(KERN_INFO "Error: Could not initialize /proc/test\n");
		return -ENOMEM;
	}

	Our_Proc_File->read_proc = procfile_read;
	Our_Proc_File->owner = THIS_MODULE;
	Our_Proc_File->mode = S_IFREG | S_IRUGO;
	Our_Proc_File->uid = 0;
	Our_Proc_File->gid = 0;
	Our_Proc_File->size = 37;

	printk(KERN_INFO "Success!\n");
	return 0;
}

void cleanup_module()
{
	remove_proc_entry("test", &proc_root);
	printk(KERN_INFO "/proc/test removed\n");
}
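The chained conditional expression that picks "st", "nd", "rd" or "th" in procfile_read() is easy to misread, so here is the same rule as a stand-alone user-space function. The name ordinal_suffix is hypothetical.

```c
#include <assert.h>

/*
 * Return the English ordinal suffix for a positive count, using the
 * same rule as procfile_read(): 11, 12 and 13 (and 111, 112, ...)
 * take "th"; otherwise the last digit decides.
 */
const char *ordinal_suffix(int count)
{
	assert(count > 0);
	if (count % 100 > 10 && count % 100 < 14)
		return "th";
	switch (count % 10) {
	case 1:
		return "st";
	case 2:
		return "nd";
	case 3:
		return "rd";
	default:
		return "th";
	}
}
```

The check on count % 100 has to come first; without it, the 11th read would be announced as the "11st".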

Chapter 6. Using /proc For Input

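6.1. Using /proc For Input


Because the /proc file system was written mainly to let the kernel report its state, it makes no special provision for input. The struct proc_dir_entry doesn't include a pointer to an input function the way it includes a pointer to an output function. Instead, to write into a /proc file, we use the standard file system mechanism.

Linux has a standard mechanism for file system registration. Since every file system has to have its own functions to handle inode and file operations[8], there is a special structure to hold pointers to all those functions, struct inode_operations, which in turn includes a pointer to a struct file_operations. In /proc, whenever we register a new file, we are allowed to specify which struct inode_operations will be used to access it. This is the mechanism we use: a struct inode_operations that includes a pointer to a struct file_operations, which includes pointers to our module_input and module_output functions.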





Example 6-1. procfs.c

/*
 *  procfs.c -  create a "file" in /proc, which allows both input and output.
 */
#include <linux/kernel.h>	/* We're doing kernel work */
#include <linux/module.h>	/* Specifically, a module */
#include <linux/proc_fs.h>	/* Necessary because we use proc fs */
#include <asm/uaccess.h>	/* for get_user and put_user */

/*
 * Here we keep the last message received, to prove
 * that we can process our input
 */
#define MESSAGE_LENGTH 80	/* restored definition; the value is assumed */
static char Message[MESSAGE_LENGTH];
static struct proc_dir_entry *Our_Proc_File;

#define PROC_ENTRY_FILENAME "rw_test"

static ssize_t module_output(struct file *filp,	/* see include/linux/fs.h   */
			     char *buffer,	/* buffer to fill with data */
			     size_t length,	/* length of the buffer     */
			     loff_t * offset)
{
	static int finished = 0;
	int i;
	char message[MESSAGE_LENGTH + 30];

	/*
	 * We return 0 to indicate end of file, that we have
	 * no more information. Otherwise, processes will
	 * continue to read from us in an endless loop.
	 */
	if (finished) {
		finished = 0;
		return 0;
	}

	/*
	 * We use put_user to copy the string from the kernel's
	 * memory segment to the memory segment of the process
	 * that called us. get_user, BTW, is
	 * used for the reverse.
	 */
	sprintf(message, "Last input:%s", Message);
	for (i = 0; i < length && message[i]; i++)
		put_user(message[i], buffer + i);

	/*
	 * Notice, we assume here that the size of the message
	 * is below len, or it will be received cut. In a real
	 * life situation, if the size of the message is greater
	 * than len then we'd return len and on the second call
	 * start filling the buffer with the len+1'th byte of
	 * the message.
	 */
	finished = 1;

	return i;		/* Return the number of bytes "read" */
}

static ssize_t
module_input(struct file *filp, const char *buff, size_t len, loff_t * off)
{
	int i;

	/*
	 * Put the input into Message, where module_output
	 * will later be able to use it
	 */
	for (i = 0; i < MESSAGE_LENGTH - 1 && i < len; i++)
		get_user(Message[i], buff + i);

	Message[i] = '\0';	/* we want a standard, zero terminated string */
	return i;
}

/*
 * This function decides whether to allow an operation
 * (return zero) or not allow it (return a non-zero
 * which indicates why it is not allowed).
 *
 * The operation can be one of the following values:
 * 0 - Execute (run the "file" - meaningless in our case)
 * 2 - Write (input to the kernel module)
 * 4 - Read (output from the kernel module)
 *
 * This is the real function that checks file
 * permissions. The permissions returned by ls -l are
 * for reference only, and can be overridden here.
 */
static int module_permission(struct inode *inode, int op, struct nameidata *foo)
{
	/*
	 * We allow everybody to read from our module, but
	 * only root (uid 0) may write to it
	 */
	if (op == 4 || (op == 2 && current->euid == 0))
		return 0;

	/* If it's anything else, access is denied */
	return -EACCES;
}

/*
 * The file is opened - we don't really care about
 * that, but it does mean we need to increment the
 * module's reference count.
 */
int module_open(struct inode *inode, struct file *file)
{
	try_module_get(THIS_MODULE);
	return 0;
}

/*
 * The file is closed - again, interesting only because
 * of the reference count.
 */
int module_close(struct inode *inode, struct file *file)
{
	module_put(THIS_MODULE);
	return 0;		/* success */
}

static struct file_operations File_Ops_4_Our_Proc_File = {
	.read = module_output,
	.write = module_input,
	.open = module_open,
	.release = module_close,
};

/*
 * Inode operations for our proc file. We need it so
 * we'll have some place to specify the file operations
 * structure we want to use, and the function we use for
 * permissions. It's also possible to specify functions
 * to be called for anything else which could be done to
 * an inode (although we don't bother, we just put
 * NULL).
 */

static struct inode_operations Inode_Ops_4_Our_Proc_File = {
	.permission = module_permission,	/* check for permissions */
};

/* Module initialization and cleanup */
int init_module()
{
	Our_Proc_File = create_proc_entry(PROC_ENTRY_FILENAME, 0644, NULL);

	/* Check the result before touching any field of the entry */
	if (Our_Proc_File == NULL) {
		printk(KERN_INFO "Error: Could not initialize /proc/%s\n",
		       PROC_ENTRY_FILENAME);
		return -ENOMEM;
	}

	Our_Proc_File->owner = THIS_MODULE;
	Our_Proc_File->proc_iops = &Inode_Ops_4_Our_Proc_File;
	Our_Proc_File->proc_fops = &File_Ops_4_Our_Proc_File;
	Our_Proc_File->mode = S_IFREG | S_IRUGO | S_IWUSR;
	Our_Proc_File->uid = 0;
	Our_Proc_File->gid = 0;
	Our_Proc_File->size = 80;

	return 0;
}

void cleanup_module()
{
	remove_proc_entry(PROC_ENTRY_FILENAME, &proc_root);
}

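Still hungry for procfs examples? Bear two things in mind. First, there are rumors that procfs is on its way out and may be replaced by sysfs. Second, if you really want to learn more, look at the technical documentation under linux/Documentation/DocBook/. Run make help in the top-level kernel directory for instructions on converting those documents to your preferred format, for example make htmldocs. Consider this route too if you want to contribute documentation to the kernel.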
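The comment in module_output of Example 6-1 notes that a real read function should continue where the previous call stopped, rather than rely on a finished flag. The bookkeeping can be sketched in user space as follows. This is only an illustration under stated assumptions: demo_read_at and demo_count_reads are hypothetical names, and memcpy stands in for the copy to user space that a module would perform with put_user.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy up to len bytes of src (src_len bytes long)
 * into dst, starting at *offset, and advance *offset. Returns the number
 * of bytes copied; 0 means end of file.
 */
size_t demo_read_at(char *dst, size_t len, size_t *offset,
		    const char *src, size_t src_len)
{
	size_t remaining;

	if (*offset >= src_len)
		return 0;	/* nothing left: report end of file */

	remaining = src_len - *offset;
	if (len > remaining)
		len = remaining;

	memcpy(dst, src + *offset, len);
	*offset += len;
	return len;
}

/*
 * Read a whole message through a small buffer, the way a process calls
 * read repeatedly. Returns how many calls delivered data.
 */
int demo_count_reads(const char *msg, size_t chunk)
{
	char buf[16];
	size_t offset = 0;
	int calls = 0;

	if (chunk == 0 || chunk > sizeof(buf))
		chunk = sizeof(buf);
	while (demo_read_at(buf, chunk, &offset, msg, strlen(msg)) > 0)
		calls++;
	return calls;
}
```

With a 4-byte buffer, the 16-byte string "Last input:hello" is delivered in four calls, and the fifth call returns 0 to signal end of file. Because the position lives in the offset rather than in a static flag, two readers would not disturb each other.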

Chapter 7. Talking To Device Files

7.1. Talking to Device Files (writes and IOCTLs)



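The answer in Unix is a special function called ioctl (short for Input Output ConTroL). Every device can have its own ioctl commands, which can be read ioctls (to send information from a process to the kernel), write ioctls (to return information to a process)[9], both, or neither. The ioctl function is called with three parameters: the file descriptor of the appropriate device file, the ioctl number, and a parameter of type long, which a task can use to pass anything through a cast[10].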
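As a rough illustration of the three parameters, the following user-space sketch calls ioctl on a pipe instead of a device file. It assumes a Linux system, where the FIONREAD request reports how many bytes are waiting to be read; demo_pending_bytes is a hypothetical helper name.

```c
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Write len bytes into a fresh pipe, then ask the kernel how many bytes
 * are waiting on the read end. Returns that count, or -1 on any failure.
 */
int demo_pending_bytes(const char *data, size_t len)
{
	int fds[2];
	int pending = -1;

	if (pipe(fds) < 0)
		return -1;

	if (write(fds[1], data, len) == (ssize_t)len) {
		/* file descriptor, ioctl number, pointer-sized argument */
		if (ioctl(fds[0], FIONREAD, &pending) < 0)
			pending = -1;
	}

	close(fds[0]);
	close(fds[1]);
	return pending;
}
```

Here the third parameter is a pointer that the kernel fills in, which is the same pattern IOCTL_GET_MSG uses in the example that follows.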

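The ioctl number encodes the major device number, the type of the ioctl, the command, and the type of the parameter. It is usually created by a macro call (_IO, _IOR, _IOW or _IOWR, depending on the type) in a header file. This header file should be included both by the programs that will use ioctl (so they can generate the appropriate ioctls) and by the kernel module (so it can understand them). In the example below, the header file is chardev.h and the program that uses it is ioctl.c.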
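The packing of these fields can be inspected from user space. The sketch below assumes a Linux system, where sys/ioctl.h pulls in the _IO* macros and the matching _IOC_* decoding macros. DEMO_MAGIC and the DEMO_ commands are hypothetical names; the value 100 merely mirrors MAJOR_NUM from the example.

```c
#include <sys/ioctl.h>	/* _IOR, _IOWR and the _IOC_* decoders on Linux */

#define DEMO_MAGIC 100	/* hypothetical type field, like MAJOR_NUM */

#define DEMO_GET _IOR(DEMO_MAGIC, 1, char *)
#define DEMO_NTH _IOWR(DEMO_MAGIC, 2, int)

/* Each accessor pulls one field back out of the packed number. */
unsigned int demo_type(unsigned int cmd)
{
	return _IOC_TYPE(cmd);
}

unsigned int demo_nr(unsigned int cmd)
{
	return _IOC_NR(cmd);
}

unsigned int demo_size(unsigned int cmd)
{
	return _IOC_SIZE(cmd);
}

unsigned int demo_dir(unsigned int cmd)
{
	return _IOC_DIR(cmd);
}
```

Decoding DEMO_NTH yields type 100, command number 2, a size equal to sizeof(int), and both direction bits set. This is why the same header must be shared: the module and the program have to compute identical numbers.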


Example 7-1. chardev.c

/*
 *  chardev.c - Create an input/output character device
 */

#include <linux/kernel.h>	/* We're doing kernel work */
#include <linux/module.h>	/* Specifically, a module */
#include <linux/fs.h>
#include <asm/uaccess.h>	/* for get_user and put_user */

#include "chardev.h"
#define SUCCESS 0
#define DEVICE_NAME "char_dev"
#define BUF_LEN 80

/*
 * Is the device open right now? Used to prevent
 * concurrent access into the same device
 */
static int Device_Open = 0;

/* The message the device will give when asked */
static char Message[BUF_LEN];

/*
 * How far did the process reading the message get?
 * Useful if the message is larger than the size of the
 * buffer we get to fill in device_read.
 */
static char *Message_Ptr;

/* This is called whenever a process attempts to open the device file */
static int device_open(struct inode *inode, struct file *file)
{
#ifdef DEBUG
	printk("device_open(%p)\n", file);
#endif

	/* We don't want to talk to two processes at the same time */
	if (Device_Open)
		return -EBUSY;

	Device_Open++;

	/* Initialize the message */
	Message_Ptr = Message;
	try_module_get(THIS_MODULE);
	return SUCCESS;
}

static int device_release(struct inode *inode, struct file *file)
{
#ifdef DEBUG
	printk("device_release(%p,%p)\n", inode, file);
#endif

	/* We're now ready for our next caller */
	Device_Open--;

	module_put(THIS_MODULE);
	return SUCCESS;
}

/*
 * This function is called whenever a process which has already opened the
 * device file attempts to read from it.
 */
static ssize_t device_read(struct file *file,	/* see include/linux/fs.h   */
			   char __user * buffer,	/* buffer to be
							 * filled with data */
			   size_t length,	/* length of the buffer     */
			   loff_t * offset)
{
	/* Number of bytes actually written to the buffer */
	int bytes_read = 0;

#ifdef DEBUG
	printk("device_read(%p,%p,%zu)\n", file, buffer, length);
#endif

	/*
	 * If we're at the end of the message, return 0
	 * (which signifies end of file)
	 */
	if (*Message_Ptr == 0)
		return 0;

	/* Actually put the data into the buffer */
	while (length && *Message_Ptr) {
		/*
		 * Because the buffer is in the user data segment,
		 * not the kernel data segment, assignment wouldn't
		 * work. Instead, we have to use put_user which
		 * copies data from the kernel data segment to the
		 * user data segment.
		 */
		put_user(*(Message_Ptr++), buffer++);
		length--;
		bytes_read++;
	}

#ifdef DEBUG
	printk("Read %d bytes, %zu left\n", bytes_read, length);
#endif

	/*
	 * Read functions are supposed to return the number
	 * of bytes actually inserted into the buffer
	 */
	return bytes_read;
}

/*
 * This function is called when somebody tries to
 * write into our device file.
 */
static ssize_t
device_write(struct file *file,
	     const char __user * buffer, size_t length, loff_t * offset)
{
	int i;

#ifdef DEBUG
	printk("device_write(%p,%p,%zu)\n", file, buffer, length);
#endif

	/* Leave room for the terminating zero that device_read relies on */
	for (i = 0; i < length && i < BUF_LEN - 1; i++)
		get_user(Message[i], buffer + i);
	Message[i] = '\0';

	Message_Ptr = Message;

	/* Again, return the number of input characters used */
	return i;
}

/*
 * This function is called whenever a process tries to do an ioctl on our
 * device file. We get two extra parameters (additional to the inode and file
 * structures, which all device functions get): the number of the ioctl called
 * and the parameter given to the ioctl function.
 *
 * If the ioctl is write or read/write (meaning output is returned to the
 * calling process), the ioctl call returns the output of this function.
 */
int device_ioctl(struct inode *inode,	/* see include/linux/fs.h */
		 struct file *file,	/* ditto */
		 unsigned int ioctl_num,	/* number and param for ioctl */
		 unsigned long ioctl_param)
{
	int i;
	char *temp;
	char ch;

	/* Switch according to the ioctl called */
	switch (ioctl_num) {
	case IOCTL_SET_MSG:
		/*
		 * Receive a pointer to a message (in user space) and set that
		 * to be the device's message.  Get the parameter given to
		 * ioctl by the process.
		 */
		temp = (char *)ioctl_param;

		/* Find the length of the message */
		get_user(ch, temp);
		for (i = 0; ch && i < BUF_LEN; i++, temp++)
			get_user(ch, temp);

		device_write(file, (char *)ioctl_param, i, 0);
		break;

	case IOCTL_GET_MSG:
		/*
		 * Give the current message to the calling process -
		 * the parameter we got is a pointer, fill it.
		 */
		i = device_read(file, (char *)ioctl_param, 99, 0);

		/*
		 * Put a zero at the end of the buffer, so it will be
		 * properly terminated
		 */
		put_user('\0', (char *)ioctl_param + i);
		break;

	case IOCTL_GET_NTH_BYTE:
		/*
		 * This ioctl is both input (ioctl_param) and
		 * output (the return value of this function).
		 * Reject an index outside the message buffer.
		 */
		if (ioctl_param >= BUF_LEN)
			return -EINVAL;
		return Message[ioctl_param];
	}

	return SUCCESS;
}

/* Module Declarations */

/*
 * This structure will hold the functions to be called
 * when a process does something to the device we
 * created. Since a pointer to this structure is kept in
 * the devices table, it can't be local to
 * init_module. NULL is for unimplemented functions.
 */
struct file_operations Fops = {
	.read = device_read,
	.write = device_write,
	.ioctl = device_ioctl,
	.open = device_open,
	.release = device_release,	/* a.k.a. close */
};

/* Initialize the module - Register the character device */
int init_module()
{
	int ret_val;

	/* Register the character device (at least try) */
	ret_val = register_chrdev(MAJOR_NUM, DEVICE_NAME, &Fops);

	/* Negative values signify an error */
	if (ret_val < 0) {
		printk("%s failed with %d\n",
		       "Sorry, registering the character device ", ret_val);
		return ret_val;
	}

	printk("%s The major device number is %d.\n",
	       "Registration is a success", MAJOR_NUM);
	printk("If you want to talk to the device driver,\n");
	printk("you'll have to create a device file. \n");
	printk("We suggest you use:\n");
	printk("mknod %s c %d 0\n", DEVICE_FILE_NAME, MAJOR_NUM);
	printk("The device file name is important, because\n");
	printk("the ioctl program assumes that's the\n");
	printk("file you'll use.\n");

	return 0;
}

/* Cleanup - unregister the character device */
void cleanup_module()
{
	int ret;

	/* Unregister the device */
	ret = unregister_chrdev(MAJOR_NUM, DEVICE_NAME);

	/* If there's an error, report it */
	if (ret < 0)
		printk("Error in module_unregister_chrdev: %d\n", ret);
}

Example 7-2. chardev.h

/*
 *  chardev.h - the header file with the ioctl definitions.
 *
 *  The declarations here have to be in a header file, because
 *  they need to be known both to the kernel module
 *  (in chardev.c) and the process calling ioctl (ioctl.c)
 */

#ifndef CHARDEV_H
#define CHARDEV_H

#include <linux/ioctl.h>

/*
 * The major device number. We can't rely on dynamic
 * registration any more, because ioctls need to know
 * it.
 */
#define MAJOR_NUM 100

/* Set the message of the device driver */
#define IOCTL_SET_MSG _IOR(MAJOR_NUM, 0, char *)
/*
 * _IOR means that we're creating an ioctl command
 * number for passing information from a user process
 * to the kernel module.
 *
 * The first argument, MAJOR_NUM, is the major device
 * number we're using.
 *
 * The second argument is the number of the command
 * (there could be several with different meanings).
 *
 * The third argument is the type we want to get from
 * the process to the kernel.
 */

/* Get the message of the device driver */
#define IOCTL_GET_MSG _IOR(MAJOR_NUM, 1, char *)
/*
 * This IOCTL is used for output, to get the message
 * of the device driver. However, we still need the
 * buffer to place the message in to be input,
 * as it is allocated by the process.
 */

/* Get the n'th byte of the message */
#define IOCTL_GET_NTH_BYTE _IOWR(MAJOR_NUM, 2, int)
/*
 * The IOCTL is used for both input and output. It
 * receives from the user a number, n, and returns
 * Message[n].
 */

/* The name of the device file */
#define DEVICE_FILE_NAME "char_dev"

#endif


Example 7-3. ioctl.c

/*
 *  ioctl.c - the process to use ioctl's to control the kernel module
 *
 *  Until now we could have used cat for input and output.  But now
 *  we need to do ioctl's, which require writing our own process.
 */

/*
 * device specifics, such as ioctl numbers and the
 * major device file.
 */
#include "chardev.h"

#include <stdio.h>		/* printf, putchar */
#include <stdlib.h>		/* exit */
#include <fcntl.h>		/* open */
#include <unistd.h>		/* close */
#include <sys/ioctl.h>		/* ioctl */

/* Functions for the ioctl calls */

void ioctl_set_msg(int file_desc, char *message)
{
	int ret_val;

	ret_val = ioctl(file_desc, IOCTL_SET_MSG, message);

	if (ret_val < 0) {
		printf("ioctl_set_msg failed:%d\n", ret_val);
		exit(-1);
	}
}

void ioctl_get_msg(int file_desc)
{
	int ret_val;
	char message[100];

	/*
	 * Warning - this is dangerous because we don't tell
	 * the kernel how far it's allowed to write, so it
	 * might overflow the buffer. In a real production
	 * program, we would have used two ioctls - one to tell
	 * the kernel the buffer length and another to give
	 * it the buffer to fill
	 */
	ret_val = ioctl(file_desc, IOCTL_GET_MSG, message);

	if (ret_val < 0) {
		printf("ioctl_get_msg failed:%d\n", ret_val);
		exit(-1);
	}

	printf("get_msg message:%s\n", message);
}

void ioctl_get_nth_byte(int file_desc)
{
	int i;
	int c;		/* int, so that a negative error value is not lost */

	printf("get_nth_byte message:");

	i = 0;
	do {
		c = ioctl(file_desc, IOCTL_GET_NTH_BYTE, i++);

		if (c < 0) {
			printf
			    ("ioctl_get_nth_byte failed at the %d'th byte:\n",
			     i);
			exit(-1);
		}

		putchar(c);
	} while (c != 0);
	putchar('\n');
}

/* Main - Call the ioctl functions */
int main()
{
	int file_desc;
	char *msg = "Message passed by ioctl\n";

	file_desc = open(DEVICE_FILE_NAME, 0);
	if (file_desc < 0) {
		printf("Can't open device file: %s\n", DEVICE_FILE_NAME);
		exit(-1);
	}

	ioctl_get_nth_byte(file_desc);
	ioctl_get_msg(file_desc);
	ioctl_set_msg(file_desc, msg);

	close(file_desc);
	return 0;
}


Chapter 8. System Calls

8.1. System Calls



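Forget about /proc files and device files; they are minor details. The real mechanism by which every process communicates with the kernel is the system call. When a process requests a service from the kernel (such as opening a file, forking a new process, or requesting more memory), this is the mechanism used. If you want to change the behaviour of the kernel in interesting ways, this is the place to do it. By the way, if you want to see which system calls a program uses, run strace <arguments>.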


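System calls are the exception to the general rule that a process cannot access kernel memory or call kernel functions. What happens is that the process fills the registers with the appropriate values and then executes a special instruction that jumps to a previously defined location in the kernel (that location is readable by user processes, of course, but not writable by them). On Intel CPUs, this is done by means of interrupt 0x80. The hardware knows that once you jump to this location, you are no longer running in restricted user mode but as the operating system kernel, and are therefore allowed to do whatever you want.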

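The location in the kernel that a process can jump to is called system_call. The procedure at that location checks the system call number, which tells the kernel what service the process requested. It then looks at the table of system calls (sys_call_table) to find the address of the kernel function to call, and calls it. After the function returns, it does a few system checks and then returns to the process (or to a different process, if the process's time has run out). If you want to read this code, it is in the source file arch/<architecture>/kernel/entry.S, after the line ENTRY(system_call).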
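The role of the system call number can be observed from user space without touching the kernel. The sketch below assumes Linux with glibc, whose syscall() function loads the given number and arguments into registers and enters the kernel using whatever instruction the architecture provides. demo_raw_getpid is a hypothetical helper name.

```c
#include <sys/syscall.h>	/* SYS_getpid */
#include <sys/types.h>		/* pid_t */
#include <unistd.h>		/* syscall(), getpid() */

/* Invoke getpid by number instead of through its libc wrapper. */
pid_t demo_raw_getpid(void)
{
	return (pid_t)syscall(SYS_getpid);
}
```

The value returned matches what getpid() reports, because both paths end at the same entry in the kernel's table of system calls.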


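The source code below is an example of a kernel module that replaces a system call. We want to "spy" on a certain user and printk() a message whenever that user opens a file. To that end, we replace the system call that opens a file with our own function, our_sys_open. This function checks the uid (user's id) of the current process, and if it matches the uid we spy on, it calls printk() to display the name of the file to be opened. Then, either way, it calls the original open() function with the same parameters to actually open the file.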

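The init_module function replaces the appropriate location in sys_call_table and keeps the original pointer in a variable. The cleanup_module function uses that variable to restore everything back to normal. This approach is dangerous, because two kernel modules may change the same system call. Imagine we have two kernel modules, A and B. A replaces the open system call with A_open, and B replaces it with B_open. When A is inserted into the kernel, the system call is replaced with A_open, which calls the original sys_open when it is done. Next, B is inserted, and it replaces the system call with B_open, which calls what it believes is the original system call but is in fact A_open.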
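Why the order of removal matters in this scenario can be sketched with a one-slot table in user space. Everything here is a hypothetical model: demo_table stands in for the sys_call_table entry, demo_sys_open for the original call, and the load and unload helpers mimic what init_module and cleanup_module of modules A and B would do.

```c
#include <stddef.h>

typedef int (*demo_open_fn)(const char *path);

/* Stand-in for the original sys_open. */
static int demo_sys_open(const char *path)
{
	(void)path;
	return 0;
}

/* Each "module" remembers what it found in the table when it was loaded. */
static demo_open_fn demo_saved_a, demo_saved_b;

static int demo_a_open(const char *path) { return demo_saved_a(path); }
static int demo_b_open(const char *path) { return demo_saved_b(path); }

/* One-slot "system call table". */
static demo_open_fn demo_table = demo_sys_open;

static void demo_load_a(void)   { demo_saved_a = demo_table; demo_table = demo_a_open; }
static void demo_load_b(void)   { demo_saved_b = demo_table; demo_table = demo_b_open; }
static void demo_unload_a(void) { demo_table = demo_saved_a; }
static void demo_unload_b(void) { demo_table = demo_saved_b; }

/*
 * Load A, then B, then unload both in the requested order.
 * Returns 1 if the table points at the original function again,
 * 0 if it still points into a module that has been "removed".
 */
int demo_restored_after(int unload_a_first)
{
	demo_table = demo_sys_open;
	demo_load_a();
	demo_load_b();
	if (unload_a_first) {
		demo_unload_a();	/* table: sys_open, B still remembers A_open */
		demo_unload_b();	/* table: A_open, which is gone */
	} else {
		demo_unload_b();	/* table: A_open */
		demo_unload_a();	/* table: sys_open */
	}
	return demo_table == demo_sys_open;
}
```

Unloading in reverse order of loading restores the original pointer. Unloading A first leaves the table pointing at A_open after B restores what it saved, and in a real kernel that address would no longer contain valid code.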



Example 8-1. syscall.c

/*
 *  syscall.c
 *
 *  System call "stealing" sample.
 */

/*
 * Copyright (C) 2001 by Peter Jay Salzman
 */

/* The necessary header files */

/* Standard in kernel modules */
#include <linux/kernel.h>	/* We're doing kernel work */
#include <linux/module.h>	/* Specifically, a module, */
#include <linux/moduleparam.h>	/* which will have params */
#include <linux/unistd.h>	/* The list of system calls */

/*
 * For the current (process) structure, we need
 * this to know who the current user is.
 */
#include <linux/sched.h>
#include <asm/uaccess.h>

/*
 * The system call table (a table of functions). We
 * just define this as external, and the kernel will
 * fill it up for us when we are insmod'ed
 *
 * sys_call_table is no longer exported in 2.6.x kernels.
 * If you really want to try this DANGEROUS module you will
 * have to apply the supplied patch against your current kernel
 * and recompile it.
 */
extern void *sys_call_table[];

/*
 * UID we want to spy on - will be filled from the
 * command line
 */
static int uid;
module_param(uid, int, 0644);

/*
 * A pointer to the original system call. The reason
 * we keep this, rather than call the original function
 * (sys_open), is because somebody else might have
 * replaced the system call before us. Note that this
 * is not 100% safe, because if another module
 * replaced sys_open before us, then when we're inserted
 * we'll call the function in that module - and it
 * might be removed before we are.
 *
 * Another reason for this is that we can't get sys_open.
 * It's a static variable, so it is not exported.
 */
asmlinkage int (*original_call) (const char *, int, int);

/*
 * The function we'll replace sys_open (the function
 * called when you call the open system call) with. To
 * find the exact prototype, with the number and type
 * of arguments, we find the original function first
 * (it's at fs/open.c).
 *
 * In theory, this means that we're tied to the
 * current version of the kernel. In practice, the
 * system calls almost never change (it would wreak havoc
 * and require programs to be recompiled, since the system
 * calls are the interface between the kernel and the
 * processes).
 */
asmlinkage int our_sys_open(const char *filename, int flags, int mode)
{
	int i = 0;
	char ch;

	/* Check if this is the user we're spying on */
	if (uid == current->uid) {
		/* Report the file, if relevant */
		printk("Opened file by %d: ", uid);
		do {
			get_user(ch, filename + i);
			i++;	/* advance, or the loop never ends */
			printk("%c", ch);
		} while (ch != 0);
		printk("\n");
	}

	/*
	 * Call the original sys_open - otherwise, we lose
	 * the ability to open files
	 */
	return original_call(filename, flags, mode);
}

/* Initialize the module - replace the system call */
int init_module()
{
	/*
	 * Warning - too late for it now, but maybe for
	 * next time...
	 */
	printk("I'm dangerous. I hope you did a ");
	printk("sync before you insmod'ed me.\n");
	printk("My counterpart, cleanup_module(), is even ");
	printk("more dangerous. If\n");
	printk("you value your file system, it will ");
	printk("be \"sync; rmmod\" \n");
	printk("when you remove this module.\n");

	/*
	 * Keep a pointer to the original function in
	 * original_call, and then replace the system call
	 * in the system call table with our_sys_open
	 */
	original_call = sys_call_table[__NR_open];
	sys_call_table[__NR_open] = our_sys_open;

	/*
	 * To get the address of the function for system
	 * call foo, go to sys_call_table[__NR_foo].
	 */

	printk("Spying on UID:%d\n", uid);

	return 0;
}

/* Cleanup - restore the original system call */
void cleanup_module()
{
	/* Return the system call back to normal */
	if (sys_call_table[__NR_open] != our_sys_open) {
		printk("Somebody else also played with the ");
		printk("open system call\n");
		printk("The system may be left in ");
		printk("an unstable state.\n");
	}

	sys_call_table[__NR_open] = original_call;
}

Chapter 9. Blocking Processes

9.1. Blocking Processes

9.1.1. Enter Sandman




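To make our life more interesting, module_close does not have a monopoly on waking up the processes that wait for access to the file. A signal, such as Ctrl+c (SIGINT), can also wake up a process[13]. In that case, we want to return with -EINTR immediately. This matters to users, who can then, for example, kill the process before it receives the file.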


Example 9-1. sleep.c

/*
 *  sleep.c - create a /proc file, and if several processes try to open it at
 *  the same time, put all but one to sleep
 */

#include <linux/kernel.h>	/* We're doing kernel work */
#include <linux/module.h>	/* Specifically, a module */
#include <linux/proc_fs.h>	/* Necessary because we use proc fs */
#include <linux/sched.h>	/* For putting processes to sleep and 
				   waking them up */
#include <asm/uaccess.h>	/* for get_user and put_user */

/* The module's file functions */

/*
 * Here we keep the last message received, to prove that we can process our
 * input
 */
#define MESSAGE_LENGTH 80
static char Message[MESSAGE_LENGTH];

static struct proc_dir_entry *Our_Proc_File;
#define PROC_ENTRY_FILENAME "sleep"

/*
 * Since we use the file operations struct, we can't use the special proc
 * output provisions - we have to use a standard read function, which is this
 * function
 */
static ssize_t module_output(struct file *file,	/* see include/linux/fs.h   */
			     char *buf,	/* The buffer to put data to 
					   (in the user segment)    */
			     size_t len,	/* The length of the buffer */
			     loff_t * offset)
{
	static int finished = 0;
	int i;
	char message[MESSAGE_LENGTH + 30];

	/*
	 * Return 0 to signify end of file - that we have nothing
	 * more to say at this point.
	 */
	if (finished) {
		finished = 0;
		return 0;
	}

	/*
	 * If you don't understand this by now, you're hopeless as a kernel
	 * programmer.
	 */
	sprintf(message, "Last input:%s\n", Message);
	for (i = 0; i < len && message[i]; i++)
		put_user(message[i], buf + i);

	finished = 1;
	return i;		/* Return the number of bytes "read" */
}

/*
 * This function receives input from the user when the user writes to the /proc
 * file.
 */
static ssize_t module_input(struct file *file,	/* The file itself */
			    const char *buf,	/* The buffer with input */
			    size_t length,	/* The buffer's length */
			    loff_t * offset)
{				/* offset to file - ignore */
	int i;

	/*
	 * Put the input into Message, where module_output will later be
	 * able to use it
	 */
	for (i = 0; i < MESSAGE_LENGTH - 1 && i < length; i++)
		get_user(Message[i], buf + i);
	/* we want a standard, zero terminated string */
	Message[i] = '\0';

	/* We need to return the number of input characters used */
	return i;
}

/* 1 if the file is currently open by somebody */
int Already_Open = 0;

/* Queue of processes who want our file */
DECLARE_WAIT_QUEUE_HEAD(WaitQ);

/* Called when the /proc file is opened */
static int module_open(struct inode *inode, struct file *file)
{
	/*
	 * If the file's flags include O_NONBLOCK, it means the process doesn't
	 * want to wait for the file.  In this case, if the file is already
	 * open, we should fail with -EAGAIN, meaning "you'll have to try
	 * again", instead of blocking a process which would rather stay awake.
	 */
	if ((file->f_flags & O_NONBLOCK) && Already_Open)
		return -EAGAIN;

	/*
	 * This is the correct place for try_module_get(THIS_MODULE) because
	 * if a process is in the loop, which is within the kernel module,
	 * the kernel module must not be removed.
	 */
	try_module_get(THIS_MODULE);

	/* If the file is already open, wait until it isn't */

	while (Already_Open) {
		int i, is_sig = 0;

		/*
		 * This function puts the current process, including any system
		 * calls, such as us, to sleep.  Execution will be resumed right
		 * after the function call, either because somebody called
		 * wake_up(&WaitQ) (only module_close does that, when the file
		 * is closed) or when a signal, such as Ctrl-C, is sent
		 * to the process
		 */
		wait_event_interruptible(WaitQ, !Already_Open);

		/*
		 * If we woke up because we got a signal we're not blocking,
		 * return -EINTR (fail the system call).  This allows processes
		 * to be killed or stopped.
		 */

/*
 * Emmanuel Papirakis:
 *
 * This is a little update to work with 2.2.*.  Signals now are contained in
 * two words (64 bits) and are stored in a structure that contains an array of
 * two unsigned longs.  We now have to make 2 checks in our if.
 *
 * Ori Pomerantz:
 *
 * Nobody promised me they'll never use more than 64 bits, or that this book
 * won't be used for a version of Linux with a word size of 16 bits.  This code
 * would work in any case.
 */
		for (i = 0; i < _NSIG_WORDS && !is_sig; i++)
			is_sig =
			    current->pending.signal.sig[i] & ~current->
			    blocked.sig[i];

		if (is_sig) {
			/*
			 * It's important to put module_put(THIS_MODULE) here,
			 * because for processes where the open is interrupted
			 * there will never be a corresponding close. If we
			 * don't decrement the usage count here, we will be
			 * left with a positive usage count which we'll have no
			 * way to bring down to zero, giving us an immortal
			 * module, which can only be killed by rebooting
			 * the machine.
			 */
			module_put(THIS_MODULE);
			return -EINTR;
		}
	}

	/* If we got here, Already_Open must be zero */

	/* Open the file */
	Already_Open = 1;
	return 0;		/* Allow the access */
}

/* Called when the /proc file is closed */
int module_close(struct inode *inode, struct file *file)
{
	/*
	 * Set Already_Open to zero, so one of the processes in the WaitQ will
	 * be able to set Already_Open back to one and to open the file. All
	 * the other processes will be called when Already_Open is back to one,
	 * so they'll go back to sleep.
	 */
	Already_Open = 0;

	/*
	 * Wake up all the processes in WaitQ, so if anybody is waiting for the
	 * file, they can have it.
	 */
	wake_up(&WaitQ);

	module_put(THIS_MODULE);

	return 0;		/* success */
}

/*
 * This function decides whether to allow an operation (return zero) or not
 * allow it (return a non-zero which indicates why it is not allowed).
 *
 * The operation can be one of the following values:
 * 0 - Execute (run the "file" - meaningless in our case)
 * 2 - Write (input to the kernel module)
 * 4 - Read (output from the kernel module)
 *
 * This is the real function that checks file permissions. The permissions
 * returned by ls -l are for reference only, and can be overridden here.
 */
static int module_permission(struct inode *inode, int op, struct nameidata *nd)
{
	/*
	 * We allow everybody to read from our module, but only root (uid 0)
	 * may write to it
	 */
	if (op == 4 || (op == 2 && current->euid == 0))
		return 0;

	/* If it's anything else, access is denied */
	return -EACCES;
}

/*
 * Structures to register as the /proc file, with pointers to all the relevant
 * functions.
 */

/*
 * File operations for our proc file. This is where we place pointers to all
 * the functions called when somebody tries to do something to our file. NULL
 * means we don't want to deal with something.
 */
static struct file_operations File_Ops_4_Our_Proc_File = {
	.read = module_output,	/* "read" from the file */
	.write = module_input,	/* "write" to the file */
	.open = module_open,	/* called when the /proc file is opened */
	.release = module_close,	/* called when it's closed */
};

/*
 * Inode operations for our proc file.  We need it so we'll have somewhere to
 * specify the file operations structure we want to use, and the function we
 * use for permissions. It's also possible to specify functions to be called
 * for anything else which could be done to an inode (although we don't bother,
 * we just put NULL).
 */

static struct inode_operations Inode_Ops_4_Our_Proc_File = {
	.permission = module_permission,	/* check for permissions */
};

/* Module initialization and cleanup */

/* Initialize the module - register the proc file */

int init_module()
{
	Our_Proc_File = create_proc_entry(PROC_ENTRY_FILENAME, 0644, NULL);

	/* Check the result before touching any field of the entry */
	if (Our_Proc_File == NULL) {
		printk(KERN_INFO "Error: Could not initialize /proc/%s\n",
		       PROC_ENTRY_FILENAME);
		return -ENOMEM;
	}

	Our_Proc_File->owner = THIS_MODULE;
	Our_Proc_File->proc_iops = &Inode_Ops_4_Our_Proc_File;
	Our_Proc_File->proc_fops = &File_Ops_4_Our_Proc_File;
	Our_Proc_File->mode = S_IFREG | S_IRUGO | S_IWUSR;
	Our_Proc_File->uid = 0;
	Our_Proc_File->gid = 0;
	Our_Proc_File->size = 80;

	return 0;
}

/*
 * Cleanup - unregister our file from /proc.  This could get dangerous if
 * there are still processes waiting in WaitQ, because they are inside our
 * open function, which will get unloaded. I'll explain how to avoid removal
 * of a kernel module in such a case in chapter 10.
 */
void cleanup_module()
{
	remove_proc_entry(PROC_ENTRY_FILENAME, &proc_root);
}

Chapter 10. Replacing Printks

10.1. Replacing printk

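Earlier, I said that X and kernel module programming don't mix. That is true for developing kernel modules, but in actual use you want to be able to send messages to whichever tty[14] the command to load the module came from.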


Example 10-1. print_string.c

/*
 *  print_string.c - Send output to the tty we're running on, regardless if it's
 *  through X11, telnet, etc.  We do this by printing the string to the tty
 *  associated with the current task.
 */
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/init.h>
#include <linux/sched.h>	/* For current */
#include <linux/tty.h>		/* For the tty declarations */
#include <linux/version.h>	/* For LINUX_VERSION_CODE */

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Peter Jay Salzman");

static void print_string(char *str)
{
	struct tty_struct *my_tty;

	/* tty struct went into signal struct in 2.6.6 */
#if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,5) )
	/* The tty for the current task */
	my_tty = current->tty;
#else
	/* The tty for the current task, for 2.6.6+ kernels */
	my_tty = current->signal->tty;
#endif

	/*
	 * If my_tty is NULL, the current task has no tty you can print to
	 * (ie, if it's a daemon).  If so, there's nothing we can do.
	 */
	if (my_tty != NULL) {

		/*
		 * my_tty->driver is a struct which holds the tty's functions,
		 * one of which (write) is used to write strings to the tty.
		 * It can be used to take a string either from the user's or
		 * kernel's memory segment.
		 *
		 * The function's 1st parameter is the tty to write to,
		 * because the same function would normally be used for all
		 * tty's of a certain type.  The 2nd parameter controls whether
		 * the function receives a string from kernel memory (false, 0)
		 * or from user memory (true, non zero).  The 3rd parameter is
		 * a pointer to a string.  The 4th parameter is the length of
		 * the string.
		 */
		((my_tty->driver)->write) (my_tty,	/* The tty itself */
					   0,	/* Don't take the string 
						   from user space        */
					   str,	/* String                 */
					   strlen(str));	/* Length */

		/*
		 * ttys were originally hardware devices, which (usually)
		 * strictly followed the ASCII standard.  In ASCII, to move to
		 * a new line you need two characters, a carriage return and a
		 * line feed.  On Unix, the ASCII line feed is used for both
		 * purposes - so we can't just use \n, because it wouldn't have
		 * a carriage return and the next line will start at the
		 * column right after the line feed.
		 *
		 * This is why text files are different between Unix and
		 * MS Windows.  In CP/M and derivatives, like MS-DOS and
		 * MS Windows, the ASCII standard was strictly adhered to,
		 * and therefore a newline requires both a LF and a CR.
		 */
		((my_tty->driver)->write) (my_tty, 0, "\015\012", 2);
	}
}

static int __init print_string_init(void)
{
	print_string("The module has been inserted.  Hello world!");
	return 0;
}

static void __exit print_string_exit(void)
{
	print_string("The module has been removed.  Farewell world!");
}

module_init(print_string_init);
module_exit(print_string_exit);


10.2. Flashing keyboard LEDs



Example 10-2. kbleds.c

/*
 *  kbleds.c - Blink keyboard leds until the module is unloaded.
 */

#include <linux/module.h>
#include <linux/config.h>
#include <linux/init.h>
#include <linux/tty.h>		/* For fg_console, MAX_NR_CONSOLES */
#include <linux/kd.h>		/* For KDSETLED */
#include <linux/console_struct.h>	/* For vc_cons */

MODULE_DESCRIPTION("Example module illustrating the use of Keyboard LEDs.");
MODULE_AUTHOR("Daniele Paolo Scarpazza");
MODULE_LICENSE("GPL");

struct timer_list my_timer;
struct tty_driver *my_driver;
char kbledstatus = 0;

#define BLINK_DELAY   HZ/5
#define ALL_LEDS_ON   0x07
#define RESTORE_LEDS  0xFF

/*
 * Function my_timer_func blinks the keyboard LEDs periodically by invoking
 * command KDSETLED of ioctl() on the keyboard driver. To learn more on virtual
 * terminal ioctl operations, please see file:
 *     /usr/src/linux/drivers/char/vt_ioctl.c, function vt_ioctl().
 *
 * The argument to KDSETLED is alternatively set to 7 (thus causing the led
 * mode to be set to LED_SHOW_IOCTL, and all the leds are lit) and to 0xFF
 * (any value above 7 switches back the led mode to LED_SHOW_FLAGS, thus
 * the LEDs reflect the actual keyboard status).  To learn more on this,
 * please see file:
 *     /usr/src/linux/drivers/char/keyboard.c, function setledstate().
 */

static void my_timer_func(unsigned long ptr)
{
	/* ptr is the address of kbledstatus, which is a char */
	char *pstatus = (char *)ptr;

	if (*pstatus == ALL_LEDS_ON)
		*pstatus = RESTORE_LEDS;
	else
		*pstatus = ALL_LEDS_ON;

	(my_driver->ioctl) (vc_cons[fg_console].d->vc_tty, NULL, KDSETLED,
			    *pstatus);

	/* Re-arm the timer so that the LEDs keep blinking */
	my_timer.expires = jiffies + BLINK_DELAY;
	add_timer(&my_timer);
}

static int __init kbleds_init(void)
{
	int i;

	printk(KERN_INFO "kbleds: loading\n");
	printk(KERN_INFO "kbleds: fgconsole is %x\n", fg_console);
	for (i = 0; i < MAX_NR_CONSOLES; i++) {
		if (!vc_cons[i].d)
			break;
		printk(KERN_INFO "kbleds: console[%i/%i] #%i, tty %lx\n", i,
		       MAX_NR_CONSOLES, vc_cons[i].d->vc_num,
		       (unsigned long)vc_cons[i].d->vc_tty);
	}
	printk(KERN_INFO "kbleds: finished scanning consoles\n");

	my_driver = vc_cons[fg_console].d->vc_tty->driver;
	printk(KERN_INFO "kbleds: tty driver magic %x\n", my_driver->magic);

	/* Set up the LED blink timer the first time */
	init_timer(&my_timer);
	my_timer.function = my_timer_func;
	my_timer.data = (unsigned long)&kbledstatus;
	my_timer.expires = jiffies + BLINK_DELAY;
	add_timer(&my_timer);

	return 0;
}

static void __exit kbleds_cleanup(void)
{
	printk(KERN_INFO "kbleds: unloading...\n");
	del_timer(&my_timer);
	(my_driver->ioctl) (vc_cons[fg_console].d->vc_tty, NULL, KDSETLED,
			    RESTORE_LEDS);
}

module_init(kbleds_init);
module_exit(kbleds_cleanup);

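If none of the examples in this chapter fit your debugging needs, there are still other tricks to try. Remember the CONFIG_LL_DEBUG option in make menuconfig? If you activate it, you get low-level access to the serial port. If that is still not enough, you can patch kernel/printk.c or any other essential low-level call to use printascii, which makes it possible to trace every step the kernel takes over a serial line. If your architecture does not support the examples above but has a standard serial port, this is probably the first thing to consider. Debugging over a network console is also worth a try.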


Chapter 11. Scheduling Tasks

11.1. Scheduling Tasks





Example 11-1. sched.c

/*
 *  sched.c - schedule a function to be called periodically from a work queue.
 *
 *  Copyright (C) 2001 by Peter Jay Salzman
 */

/*
 * The necessary header files
 */

/*
 * Standard in kernel modules
 */
#include <linux/kernel.h>	/* We're doing kernel work */
#include <linux/module.h>	/* Specifically, a module */
#include <linux/proc_fs.h>	/* Necessary because we use the proc fs */
#include <linux/workqueue.h>	/* We schedule tasks here */
#include <linux/sched.h>	/* We need to put ourselves to sleep
				   and wake up later */
#include <linux/init.h>		/* For __init and __exit */
#include <linux/interrupt.h>	/* For irqreturn_t */

struct proc_dir_entry *Our_Proc_File;
#define PROC_ENTRY_FILENAME "sched"
#define MY_WORK_QUEUE_NAME "WQsched.c"

/*
 * The number of times the timer interrupt has been called so far
 */
static int TimerIntrpt = 0;

static void intrpt_routine(void *);

static int die = 0;		/* set this to 1 for shutdown */

/*
 * The work queue structure for this task, from workqueue.h
 */
static struct workqueue_struct *my_workqueue;

/* DECLARE_WORK defines Task and binds it to intrpt_routine */
static DECLARE_WORK(Task, intrpt_routine, NULL);

/*
 * This function is run from the work queue about every 100 jiffies, because
 * it queues itself again with that delay. Notice the void* pointer - task
 * functions can be used for more than one purpose, each time getting a
 * different parameter.
 */
static void intrpt_routine(void *irrelevant)
{
	/*
	 * Increment the counter
	 */
	TimerIntrpt++;

	/*
	 * Unless cleanup wants us to die, queue ourselves again
	 */
	if (die == 0)
		queue_delayed_work(my_workqueue, &Task, 100);
}

/*
 * Put data into the proc fs file.
 */
static int
procfile_read(char *buffer,
	      char **buffer_location,
	      off_t offset, int buffer_length, int *eof, void *data)
{
	int len;		/* The number of bytes actually used */

	/*
	 * It's static so it will still be in memory
	 * when we leave this function
	 */
	static char my_buffer[80];

	/*
	 * We give all of our information in one go, so if anybody asks us
	 * if we have more information the answer should always be no.
	 */
	if (offset > 0)
		return 0;

	/*
	 * Fill the buffer and get its length
	 */
	len = sprintf(my_buffer, "Timer called %d times so far\n", TimerIntrpt);

	/*
	 * Tell the function which called us where the buffer is
	 */
	*buffer_location = my_buffer;

	/*
	 * Return the length
	 */
	return len;
}

/*
 * Initialize the module - register the proc file
 */
int __init init_module()
{
	/*
	 * Create our /proc file. The result is checked before the entry is
	 * used, because create_proc_entry() returns NULL on failure.
	 */
	Our_Proc_File = create_proc_entry(PROC_ENTRY_FILENAME, 0644, NULL);
	if (Our_Proc_File == NULL) {
		printk(KERN_INFO "Error: Could not initialize /proc/%s\n",
		       PROC_ENTRY_FILENAME);
		return -ENOMEM;
	}

	Our_Proc_File->read_proc = procfile_read;
	Our_Proc_File->owner = THIS_MODULE;
	Our_Proc_File->mode = S_IFREG | S_IRUGO;
	Our_Proc_File->uid = 0;
	Our_Proc_File->gid = 0;
	Our_Proc_File->size = 80;

	/*
	 * Put the task in our work queue, so it will be executed after
	 * a delay of 100 jiffies
	 */
	my_workqueue = create_workqueue(MY_WORK_QUEUE_NAME);
	queue_delayed_work(my_workqueue, &Task, 100);

	return 0;
}

/*
 * Cleanup
 */
void __exit cleanup_module()
{
	/*
	 * Unregister our /proc file
	 */
	remove_proc_entry(PROC_ENTRY_FILENAME, &proc_root);
	printk(KERN_INFO "/proc/%s removed\n", PROC_ENTRY_FILENAME);

	/*
	 * Make sure intrpt_routine can no longer run before we return. This
	 * is necessary, because otherwise we'll deallocate the memory holding
	 * intrpt_routine and Task while the work queue still references them.
	 */
	die = 1;		/* keep intrpt_routine from queueing itself */
	cancel_delayed_work(&Task);	/* no "new ones" */
	flush_workqueue(my_workqueue);	/* wait till all "old ones" finished */
	destroy_workqueue(my_workqueue);
}

/*
 * some work_queue related functions
 * are just available to GPL licensed Modules
 */
MODULE_LICENSE("GPL");

Chapter 12. Interrupt Handlers

12.1. Interrupt Handlers

12.1.1. Interrupt Handlers




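When the CPU receives an interrupt, it stops whatever it is doing (unless it is processing a more important interrupt, in which case it deals with the new one only after the more important one is done), saves certain parameters on the stack and calls the interrupt handler. This means that certain things are not allowed in the interrupt handler itself, because the system is in an unknown state. The solution is for the interrupt handler to do what must be done immediately, usually reading something from the hardware or sending something to it, then schedule the handling of the new information for a later time (this is called the "bottom half"[17]) and return. The kernel then guarantees to call the bottom half as soon as possible, and when it does, everything allowed in kernel modules is allowed.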


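Then, from within the interrupt handler, we communicate with the hardware and use queue_task_irq() with tq_immediate() and mark_bh(BH_IMMEDIATE) to schedule the bottom half. The reason we cannot use the standard queue_task from version 2.0 is that the interrupt might happen right in the middle of somebody else's queue_task[18]. We need mark_bh because earlier versions of Linux had only an array of 32 bottom halves, and one of them (BH_IMMEDIATE) is now used for the linked list of bottom halves for drivers that did not get a bottom half entry of their own. These are the interfaces of older kernels; Example 12-1 itself defers its work with queue_work() on a work queue.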

12.1.2. Keyboards on the Intel Architecture


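I had some trouble writing the sample code for this chapter. On one hand, for an example to be useful it has to run on everybody's computer and give meaningful results. On the other hand, the kernel already includes device drivers for all the common devices, and those drivers will not coexist with what I am going to write. The solution I found was to write something for the keyboard interrupt and to disable the regular keyboard interrupt handler first. Since that handler is defined as a static symbol in the kernel source (see drivers/char/keyboard.c), there is no way to restore it. So before you insmod this code, if you value your machine, open another terminal and run sleep 120; reboot.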

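This code binds itself to IRQ 1, which is the keyboard IRQ on the Intel architecture. Then, when it receives a keyboard interrupt, it reads the keyboard's status (that is the purpose of inb(0x64)) and the scan code, which is the value returned by the keyboard. Then, as soon as the kernel thinks it is feasible, it runs got_char, which reports the key used (the low seven bits of the scan code) and whether it was pressed (the eighth bit is 0) or released (the eighth bit is 1).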

Example 12-1. intrpt.c

/*
 *  intrpt.c - An interrupt handler.
 *
 *  Copyright (C) 2001 by Peter Jay Salzman
 */

/*
 * The necessary header files
 */

/*
 * Standard in kernel modules
 */
#include <linux/kernel.h>	/* We're doing kernel work */
#include <linux/module.h>	/* Specifically, a module */
#include <linux/sched.h>
#include <linux/workqueue.h>
#include <linux/interrupt.h>	/* We want an interrupt */
#include <asm/io.h>		/* For inb() */

#define MY_WORK_QUEUE_NAME "WQsched.c"

static struct workqueue_struct *my_workqueue;

/*
 * This will get called by the kernel as soon as it's safe
 * to do everything normally allowed by kernel modules.
 */
static void got_char(void *scancode)
{
	unsigned char code = *(unsigned char *)scancode;

	/* Low seven bits: the key. Eighth bit: 1 = released, 0 = pressed. */
	printk(KERN_INFO "Scan Code %x %s.\n",
	       code & 0x7F, code & 0x80 ? "Released" : "Pressed");
}

/*
 * This function services keyboard interrupts. It reads the relevant
 * information from the keyboard and then puts the non time critical
 * part into the work queue. This will be run when the kernel considers it safe.
 */
irqreturn_t irq_handler(int irq, void *dev_id, struct pt_regs *regs)
{
	/*
	 * These variables are static because they need to be
	 * accessible (through pointers) to the bottom half routine.
	 */
	static int initialised = 0;
	static unsigned char scancode;
	static struct work_struct task;
	unsigned char status;

	/*
	 * Read keyboard status, then the scan code
	 */
	status = inb(0x64);
	scancode = inb(0x60);

	if (initialised == 0) {
		INIT_WORK(&task, got_char, &scancode);
		initialised = 1;
	} else {
		PREPARE_WORK(&task, got_char, &scancode);
	}

	queue_work(my_workqueue, &task);

	return IRQ_HANDLED;
}

/*
 * Initialize the module - register the IRQ handler
 */
int init_module()
{
	my_workqueue = create_workqueue(MY_WORK_QUEUE_NAME);

	/*
	 * Since the keyboard handler won't co-exist with another handler,
	 * such as us, we have to disable it (free its IRQ) before we do
	 * anything.  Since we don't know where it is, there's no way to
	 * reinstate it later - so the computer will have to be rebooted
	 * when we're done.
	 */
	free_irq(1, NULL);

	/*
	 * Request IRQ 1, the keyboard IRQ, to go to our irq_handler.
	 * SA_SHIRQ means we're willing to have other handlers on this IRQ.
	 * SA_INTERRUPT can be used to make the handler into a fast interrupt.
	 */
	return request_irq(1,	/* The number of the keyboard IRQ on PCs */
			   irq_handler,	/* our handler */
			   SA_SHIRQ, "test_keyboard_irq_handler",
			   (void *)(irq_handler));
}

/*
 * Cleanup
 */
void cleanup_module()
{
	/*
	 * This is only here for completeness. It's totally irrelevant, since
	 * we don't have a way to restore the normal keyboard interrupt so the
	 * computer is completely useless and has to be rebooted.
	 * The dev_id must be the one we passed to request_irq().
	 */
	free_irq(1, (void *)(irq_handler));
}

/*
 * some work_queue related functions are just available to GPL licensed Modules
 */
MODULE_LICENSE("GPL");

Chapter 13. Symmetric Multi Processing

13.1. Symmetrical Multi-Processing






Chapter 14. Common Pitfalls

14.1. Common Pitfalls








Appendix A. Changes: 2.0 To 2.2

A.1. Changes between 2.0 and 2.2

A.1.1. Changes between 2.0 and 2.2

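I do not know the entire kernel well enough to document all of the changes. While converting the examples (or, more precisely, adopting Emmanuel Papirakis's changes), I came across the following differences. I list them all here to help module programmers, especially those who learned from earlier versions of this book and are familiar with the techniques I use, convert to the new version.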

More reference material on this topic is available on Richard Gooch's site.







close in file_operations


read, write in file_operations

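The prototypes of these functions have changed. They now return ssize_t instead of int, and their parameter lists are different. The inode is no longer a parameter, whereas the offset into the file now is.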




The signals in the task structure are no longer a 32-bit integer, but an array of _NSIG_WORDS integers.



Module Parameters


Symmetrical Multi-Processing


Appendix B. Where To Go From Here

B.1. Why Was It Written This Way?


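However, I decided not to. My purpose in writing this book was to provide a basic, introductory understanding of the mysteries of kernel module programming and the common techniques used for it. For people seriously interested in kernel programming, I recommend Juan-Mariano de Goyeneche's list of kernel resources. Also, as Linus himself said, the best way to learn the kernel is to read the source code yourself.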

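If you are interested in more examples of short kernel modules, I recommend Phrack magazine. Even if you are not interested in security, and as a programmer you should be, the kernel modules there are short and take little effort to understand.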




The easiest way to keep a file open is to open it with the command tail -f.






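Teletype: originally a combined keyboard and printer used to communicate with a Unix system. Today it is simply an abstraction for the text stream exchanged with Unix or a similar system, whether that is a physical display, an xterm under X, or a network connection over telnet.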






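Here is a short explanation of "bottom half" supplied by the translator, based on English-language material found through a Google search:

The term "bottom half" is often mentioned in connection with device drivers that handle interrupts.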










