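Supplementary labs are numbered starting from 011, and the next one is 012 (this avoids clashing with lab10, lab11, ...).
Note: this lab mainly follows on from "6.S081-5: switching between user space and kernel space, the Trap mechanism".
Code path: write() -> ecall -> uservec (trampoline.S) -> usertrap() -> syscall() -> sys_write() -> syscall() -> usertrap() -> usertrapret() -> userret (trampoline.S) -> ret
What do we need to do on a trap? (We are currently in user mode and now need to execute a system call.)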
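This lab is mainly concerned with the state of the machine while a system call executes. That state can be read off the registers, and the ones we care about most are:
PC, the program counter (Program Counter Register)
the mode flag (supervisor mode or user mode)
SATP (Supervisor Address Translation and Protection), which holds the physical address of the page table
STVEC (Supervisor Trap Vector Base Address Register), which points to the first instruction of the kernel's trap-handling code
SEPC (Supervisor Exception Program Counter), which holds the saved program counter during a trap
SSCRATCH (Supervisor Scratch Register)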
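For the rest of the lab, the content mainly follows the Chinese lecture notes, and the procedure and tools follow "6.S081 Lab00: the xv6 boot process".
To summarize up front: system calls are deliberately designed to look like function calls, but the user/kernel transition behind them is far more complicated than a function call. Much of that complexity exists to preserve isolation between user and kernel: the kernel cannot trust anything that comes from user space.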
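Open sh.c. The code is shown below; note the call write(2, "$ ", 2);, which is the system call we are going to trace: the shell writes "$ " to file descriptor 2 through the write system call. Note that my copy originally had fprintf(2, "$ "); here, so I changed it to write, ran make clean, and then started debugging.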
int
getcmd(char *buf, int nbuf)
{
  // fprintf(2, "$ ");
  write(2, "$ ", 2);
  memset(buf, 0, nbuf);
  gets(buf, nbuf);
  if(buf[0] == 0) // EOF
    return -1;
  return 0;
}

int
main(void)
{
  static char buf[100];
  int fd;

  // Ensure that three file descriptors are open.
  while((fd = open("console", O_RDWR)) >= 0){
    if(fd >= 3){
      close(fd);
      break;
    }
  }

  // Read and run input commands.
  ...
}
Start debugging:
wc@r740:~/OS_experiment/xv6-riscv-fall19$ make CPUS=1 qemu-gdb
*** Now run 'gdb' in another window.
qemu-system-riscv64 -machine virt -bios none -kernel kernel/kernel -m 128M -smp 1 -nographic -drive file=fs.img,if=none,format=raw,id=x0 -device virtio-blk-device,drive=x0,bus=virtio-mmio-bus.0 -S -gdb tcp::26017
wc@r740:~/OS_experiment/xv6-riscv-fall19$ riscv64-unknown-elf-gdb kernel/kernel
# a lot of output ...
(gdb) target remote localhost:26017
Remote debugging using localhost:26017
0x0000000000001000 in ?? ()
#define SYS_write 16
.global write
write:
li a7, SYS_write
ecall
ret
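ecall transfers control into the kernel. When the kernel is done it returns, ret is executed, and we end up back in the shell.
In the disassembly below, the ecall instruction sits at address 0xd66: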
0000000000000d64 <write>:
.global write
write:
li a7, SYS_write
d64: 48c1 li a7,16
ecall
d66: 00000073 ecall
ret
d6a: 8082 ret
(display/i $pc makes gdb show the disassembly of the next instruction every time it stops.)
Remote debugging using localhost:26017
0x0000000000001000 in ?? ()
(gdb) display/i $pc
1: x/i $pc
=> 0x1000: auipc t0,0x0
(gdb) b *0xd66
Breakpoint 1 at 0xd66
(gdb) c
Continuing.
Breakpoint 1, 0x0000000000000d66 in ?? ()
1: x/i $pc
=> 0xd66: ecall
The call being traced is write(2, "$ ", 2), so a0 is 2 (the file descriptor), a1 is the pointer to the string, and a2 is 2 (the number of bytes to write).
(gdb) delete 1
(gdb) p $pc
$1 = (void (*)()) 0xd66
(gdb) info reg
ra 0x24 0x24
sp 0x3f80 0x3f80
gp 0x505050505050505 0x505050505050505
tp 0x505050505050505 0x505050505050505
t0 0x505050505050505 361700864190383365
t1 0x505050505050505 361700864190383365
t2 0x505050505050505 361700864190383365
fp 0x3fa0 0x3fa0
s1 0x13f8 5112
a0 0x3fffffe000 274877898752
a1 0x1280 4736
a2 0x2 2
a3 0x505050505050505 361700864190383365
a4 0x505050505050505 361700864190383365
a5 0x2 2
a6 0x505050505050505 361700864190383365
a7 0x10 16
s2 0x64 100
s3 0x20 32
s4 0x13fb 5115
s5 0x1380 4992
s6 0x505050505050505 361700864190383365
s7 0x505050505050505 361700864190383365
s8 0x505050505050505 361700864190383365
s9 0x505050505050505 361700864190383365
s10 0x505050505050505 361700864190383365
--Type <RET> for more, q to quit, c to continue without paging--c
s11 0x505050505050505 361700864190383365
t3 0x505050505050505 361700864190383365
t4 0x505050505050505 361700864190383365
t5 0x505050505050505 361700864190383365
t6 0x505050505050505 361700864190383365
pc 0x3ffffff004 0x3ffffff004
(gdb) p/c *($a1)@2
$7 = {36 '$', 0 '\000'}
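The registers also show that sp (0x3f80) and the pc printed above (0xd66) are both close to 0. That tells us we are running in user space and that these are virtual addresses, because physical RAM, where the OS is loaded, only starts at 0x80000000. (In xv6 the kernel's virtual addresses are the same as its physical addresses.)
Inspect the SATP register (the page table address).
Inspecting the page table: in the QEMU window, press ctrl + a and then c to enter the QEMU console. Typing info mem there should print the page table, but it does not work for me.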
$ QEMU 4.1.0 monitor - type 'help' for more information
(qemu) info mem
unknown command: 'info mem'
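Looking for the cause: the file at the path below contains the following code, which shows that info mem is only available when the target is I386. I tried changing it (see the commented-out block), but make then fails, so for now I have to leave out all the page table output in the rest of this write-up.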
// /home/wc/OS_experiment/qemu-4.1.0/riscv64-softmmu/hmp-commands-info.h
// changed by levi
#if defined(TARGET_I386)
{
.name = "mem",
.args_type = "",
.params = "",
.help = "show the active virtual memory mappings",
.cmd = hmp_info_mem,
},
#endif
// {
// .name = "mem",
// .args_type = "",
// .params = "",
// .help = "show the active virtual memory mappings",
// .cmd = hmp_info_mem,
// },
Breakpoint 1 at 0xd66
(gdb) c
Continuing.
Breakpoint 1, 0x0000000000000d66 in ?? ()
1: x/i $pc
=> 0xd66: ecall
(gdb) x/3i 0xd64
0xd64: li a7,16
=> 0xd66: ecall
0xd6a: ret
(gdb) stepi
0x0000003ffffff004 in ?? ()
1: x/i $pc
=> 0x3ffffff004: sd ra,40(a0)
(gdb) p $pc
$1 = (void (*)()) 0x3ffffff004
(gdb) x/6i 0x3ffffff000
0x3ffffff000: csrrw a0,sscratch,a0
=> 0x3ffffff004: sd ra,40(a0)
0x3ffffff008: sd sp,48(a0)
0x3ffffff00c: sd gp,56(a0)
0x3ffffff010: sd tp,64(a0)
0x3ffffff014: sd t0,72(a0)
(gdb) info reg
ra 0x24 0x24
sp 0x3f80 0x3f80
gp 0x505050505050505 0x505050505050505
tp 0x505050505050505 0x505050505050505
t0 0x505050505050505 361700864190383365
t1 0x505050505050505 361700864190383365
t2 0x505050505050505 361700864190383365
fp 0x3fa0 0x3fa0
s1 0x13f8 5112
a0 0x3fffffe000 274877898752
a1 0x1280 4736
a2 0x2 2
a3 0x505050505050505 361700864190383365
a4 0x505050505050505 361700864190383365
a5 0x2 2
a6 0x505050505050505 361700864190383365
a7 0x10 16
s2 0x64 100
s3 0x20 32
s4 0x13fb 5115
s5 0x1380 4992
s6 0x505050505050505 361700864190383365
s7 0x505050505050505 361700864190383365
s8 0x505050505050505 361700864190383365
s9 0x505050505050505 361700864190383365
s10 0x505050505050505 361700864190383365
--Type <RET> for more, q to quit, c to continue without paging--c
s11 0x505050505050505 361700864190383365
t3 0x505050505050505 361700864190383365
t4 0x505050505050505 361700864190383365
t5 0x505050505050505 361700864190383365
t6 0x505050505050505 361700864190383365
pc 0x3ffffff004 0x3ffffff004
(gdb) p/x $stvec
$2 = 0x3ffffff000
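The trampoline page that STVEC points to is mapped without the PTE_U flag, so user code cannot read or modify it. This is also why the trap mechanism is safe. So what exactly did the ecall instruction do?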
(gdb) p/x $sepc
$4 = 0xd66
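Measured against the requirements in "0. Some background", ecall has only completed the items marked in orange; the items marked in red still have to be done by other functions or instructions.
What do we need to do on a trap? (We are currently in user mode and now need to execute a system call.)
Why doesn't ecall switch the page table? Switching page tables is fairly expensive, and it should not be forced in situations where it is unnecessary.
So the instruction executed after ecall is the one at the address STVEC points to, which is the start of the trampoline page. (Note: ecall is a single CPU instruction, so naturally there is nothing inside it to step through in gdb.)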
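To make the effect of ecall concrete, here is a minimal C model of it. It is only a sketch under simplifying assumptions: struct hart, enum priv_mode and model_ecall are names invented for this illustration (they are not xv6 or QEMU code), and real hardware also updates sstatus bits that are left out here. It shows what was observed in gdb (the mode becomes supervisor, the old pc lands in SEPC, the pc jumps to STVEC) plus cause code 8, and it makes explicit what ecall leaves alone: SATP and the general-purpose registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum priv_mode { MODE_USER, MODE_SUPERVISOR };

/* Hypothetical, heavily simplified view of one RISC-V hart. */
struct hart {
  enum priv_mode mode;
  uint64_t pc;
  uint64_t sepc;
  uint64_t scause;
  uint64_t stvec;
  uint64_t satp;
};

/* Model of an ecall executed in user mode. */
void model_ecall(struct hart *h)
{
  assert(h != NULL && h->mode == MODE_USER);
  h->sepc = h->pc;            /* remember where the trap came from */
  h->scause = 8;              /* "environment call from U-mode" */
  h->mode = MODE_SUPERVISOR;  /* raise the privilege level */
  h->pc = h->stvec;           /* continue at the trap vector */
  /* satp is deliberately untouched: still the user page table */
}
```

Feeding it the values from the session (pc = 0xd66, stvec = 0x3ffffff000) leaves sepc = 0xd66 and pc = 0x3ffffff000, with satp unchanged. Everything else (saving registers, switching page tables, finding a kernel stack) is left to software.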
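Continuing from above: besides the trampoline page, the user page table also maps a trapframe page. Every process has its own trapframe page, and it is always mapped at virtual address 0x3fffffe000.
The trapframe holds the following (this is struct trapframe from proc.h): it starts with 5 values that the kernel stored there in advance, followed by the 31 user registers (x0 is hard-wired to zero and needs no slot).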
// per-process data for the trap handling code in trampoline.S.
// sits in a page by itself just under the trampoline page in the
// user page table. not specially mapped in the kernel page table.
// the sscratch register points here.
// uservec in trampoline.S saves user registers in the trapframe,
// then initializes registers from the trapframe's
// kernel_sp, kernel_hartid, kernel_satp, and jumps to kernel_trap.
// usertrapret() and userret in trampoline.S set up
// the trapframe's kernel_*, restore user registers from the
// trapframe, switch to the user page table, and enter user space.
// the trapframe includes callee-saved user registers like s0-s11 because the
// return-to-user path via usertrapret() doesn't return through
// the entire kernel call stack.
struct trapframe {
/* 0 */ uint64 kernel_satp; // kernel page table
/* 8 */ uint64 kernel_sp; // top of process's kernel stack
/* 16 */ uint64 kernel_trap; // usertrap()
/* 24 */ uint64 epc; // saved user program counter
/* 32 */ uint64 kernel_hartid; // saved kernel tp
/* 40 */ uint64 ra;
/* 48 */ uint64 sp;
/* 56 */ uint64 gp;
/* 64 */ uint64 tp;
/* 72 */ uint64 t0;
/* 80 */ uint64 t1;
/* 88 */ uint64 t2;
/* 96 */ uint64 s0;
/* 104 */ uint64 s1;
/* 112 */ uint64 a0;
/* 120 */ uint64 a1;
/* 128 */ uint64 a2;
/* 136 */ uint64 a3;
/* 144 */ uint64 a4;
/* 152 */ uint64 a5;
/* 160 */ uint64 a6;
/* 168 */ uint64 a7;
/* 176 */ uint64 s2;
/* 184 */ uint64 s3;
/* 192 */ uint64 s4;
/* 200 */ uint64 s5;
/* 208 */ uint64 s6;
/* 216 */ uint64 s7;
/* 224 */ uint64 s8;
/* 232 */ uint64 s9;
/* 240 */ uint64 s10;
/* 248 */ uint64 s11;
/* 256 */ uint64 t3;
/* 264 */ uint64 t4;
/* 272 */ uint64 t5;
/* 280 */ uint64 t6;
};
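The byte offsets in the comments of struct trapframe are exactly the displacements that trampoline.S uses in instructions such as sd ra, 40(a0). The following sketch checks that correspondence with offsetof. It assumes a standalone C11 program: the typedef, the compact restatement of the struct (same field order) and the helper tf_slot are written for this illustration and are not part of xv6.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t uint64;

/* Same field order as the struct above, written compactly. */
struct trapframe {
  uint64 kernel_satp, kernel_sp, kernel_trap, epc, kernel_hartid;
  uint64 ra, sp, gp, tp, t0, t1, t2, s0, s1;
  uint64 a0, a1, a2, a3, a4, a5, a6, a7;
  uint64 s2, s3, s4, s5, s6, s7, s8, s9, s10, s11;
  uint64 t3, t4, t5, t6;
};

/* Compile-time checks that the layout matches the assembly offsets. */
_Static_assert(offsetof(struct trapframe, kernel_sp) == 8, "kernel_sp");
_Static_assert(offsetof(struct trapframe, ra) == 40, "ra");
_Static_assert(offsetof(struct trapframe, a0) == 112, "a0");
_Static_assert(offsetof(struct trapframe, t6) == 280, "t6");

/* Hypothetical helper: the slot that "sd reg, off(a0)" would write
   when a0 holds the address of the trapframe. */
uint64 *tf_slot(struct trapframe *tf, size_t off)
{
  assert(tf != NULL && off % 8 == 0 && off < sizeof(*tf));
  return (uint64 *)((char *)tf + off);
}
```

Writing through tf_slot(&tf, 40) lands in tf.ra, and tf_slot(&tf, 112) lands in tf.a0, which is why the assembly and the C code can share one layout without any extra bookkeeping.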
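Looking at trampoline.S, the very first instruction is csrrw a0, sscratch, a0, which swaps the contents of the a0 and sscratch registers. (For clarity I reproduce the complete trampoline.S here; it will be used again later.)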
#this is trampoline.S
# code to switch between user and kernel space.
#
# this code is mapped at the same virtual address
# (TRAMPOLINE) in user and kernel space so that
# it continues to work when it switches page tables.
#
# kernel.ld causes this to be aligned
# to a page boundary.
#
.section trampsec
.globl trampoline
trampoline:
.align 4
.globl uservec
uservec:
#
# trap.c sets stvec to point here, so
# traps from user space start here,
# in supervisor mode, but with a
# user page table.
#
# sscratch points to where the process's p->tf is
# mapped into user space, at TRAPFRAME.
#
# swap a0 and sscratch
# so that a0 is TRAPFRAME
csrrw a0, sscratch, a0
# save the user registers in TRAPFRAME
sd ra, 40(a0)
sd sp, 48(a0)
sd gp, 56(a0)
sd tp, 64(a0)
sd t0, 72(a0)
sd t1, 80(a0)
sd t2, 88(a0)
sd s0, 96(a0)
sd s1, 104(a0)
sd a1, 120(a0)
sd a2, 128(a0)
sd a3, 136(a0)
sd a4, 144(a0)
sd a5, 152(a0)
sd a6, 160(a0)
sd a7, 168(a0)
sd s2, 176(a0)
sd s3, 184(a0)
sd s4, 192(a0)
sd s5, 200(a0)
sd s6, 208(a0)
sd s7, 216(a0)
sd s8, 224(a0)
sd s9, 232(a0)
sd s10, 240(a0)
sd s11, 248(a0)
sd t3, 256(a0)
sd t4, 264(a0)
sd t5, 272(a0)
sd t6, 280(a0)
# save the user a0 in p->tf->a0
csrr t0, sscratch
sd t0, 112(a0)
# restore kernel stack pointer from p->tf->kernel_sp
ld sp, 8(a0)
# make tp hold the current hartid, from p->tf->kernel_hartid
ld tp, 32(a0)
# load the address of usertrap(), p->tf->kernel_trap
ld t0, 16(a0)
# restore kernel page table from p->tf->kernel_satp
ld t1, 0(a0)
csrw satp, t1
sfence.vma zero, zero
# a0 is no longer valid, since the kernel page
# table does not specially map p->tf.
# jump to usertrap(), which does not return
jr t0
.globl userret
userret:
# userret(TRAPFRAME, pagetable)
# switch from kernel to user.
# usertrapret() calls here.
# a0: TRAPFRAME, in user page table.
# a1: user page table, for satp.
# switch to the user page table.
csrw satp, a1
sfence.vma zero, zero
# put the saved user a0 in sscratch, so we
# can swap it with our a0 (TRAPFRAME) in the last step.
ld t0, 112(a0)
csrw sscratch, t0
# restore all but a0 from TRAPFRAME
ld ra, 40(a0)
ld sp, 48(a0)
ld gp, 56(a0)
ld tp, 64(a0)
ld t0, 72(a0)
ld t1, 80(a0)
ld t2, 88(a0)
ld s0, 96(a0)
ld s1, 104(a0)
ld a1, 120(a0)
ld a2, 128(a0)
ld a3, 136(a0)
ld a4, 144(a0)
ld a5, 152(a0)
ld a6, 160(a0)
ld a7, 168(a0)
ld s2, 176(a0)
ld s3, 184(a0)
ld s4, 192(a0)
ld s5, 200(a0)
ld s6, 208(a0)
ld s7, 216(a0)
ld s8, 224(a0)
ld s9, 232(a0)
ld s10, 240(a0)
ld s11, 248(a0)
ld t3, 256(a0)
ld t4, 264(a0)
ld t5, 272(a0)
ld t6, 280(a0)
# restore user a0, and save TRAPFRAME in sscratch
csrrw a0, sscratch, a0
# return to user mode and user pc.
# usertrapret() set up sstatus and sepc.
sret
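Printing sscratch now gives 2, which is the old value of a0: as noted earlier, in the call write(2, "$ ", 2); a0 carries the first argument, file descriptor 2. In other words, before entering user space the kernel stored the address of the trapframe page, 0x3fffffe000, in sscratch. More importantly, RISC-V has an instruction, csrrw, that swaps a general-purpose register with a CSR such as SSCRATCH: SSCRATCH takes over the register's value and hands its own old value to the register. So a0 now holds the base address of the trapframe (0x3fffffe000).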
(gdb) p/x $sscratch
$5 = 0x2
(gdb) p/x $a0
$6 = 0x3fffffe000
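So the remaining instructions in trampoline.S (the second, third, fourth, ... instruction) now make sense: a0 holds the base address of the trapframe, so the second instruction, sd ra, 40(a0), saves ra at trapframe + 40, and so on. Note the end of the file: the second-to-last instruction swaps a0 and sscratch back again, and sret then returns to user space.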
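The swap performed by csrrw a0, sscratch, a0 can be modelled in a few lines of C. This is a sketch only: struct swap_regs and model_csrrw_a0_sscratch are hypothetical names for this illustration, and the model covers just the case where the source and destination register are both a0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Only the two registers that take part in the swap. */
struct swap_regs {
  uint64_t a0;
  uint64_t sscratch;
};

/* Model of "csrrw a0, sscratch, a0": the old CSR value goes to a0,
   and the old a0 goes into the CSR. */
void model_csrrw_a0_sscratch(struct swap_regs *r)
{
  assert(r != NULL);
  uint64_t old_csr = r->sscratch;
  r->sscratch = r->a0;
  r->a0 = old_csr;
}
```

Starting from a0 = 2 and sscratch = 0x3fffffe000, one call yields the values printed by gdb above (sscratch = 2, a0 = 0x3fffffe000). A second call undoes the swap, which is what the csrrw at the end of userret relies on.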
# restore kernel stack pointer from p->tf->kernel_sp
ld sp, 8(a0)
(gdb) p/x $sp
$9 = 0x3fffffc000
(gdb) p/x $tp
$10 = 0x0
(gdb) p/x $t0
$11 = 0x8000276a
(gdb) p/x $t1
$12 = 0x505050505050505
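The next instruction writes t1 into SATP. Once it has executed, the program has switched from the user page table to the kernel page table. Printing the page table in QEMU at this point shows one that is completely different from before: the switch succeeded, we are on the kernel page table, and with it we can read the kernel's data.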
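There is still one question: why didn't the code crash? After all, we are executing code at some location in memory and the program counter holds a virtual address. If we switch page tables, why doesn't the same virtual address get translated by the new page table to some unrelated page? Evidently we did not crash and are still executing these instructions. Does anyone want to guess why?
Student answer: because we are still in the trampoline code, and the trampoline code is mapped at the same address in both user space and kernel space.
It is called the trampoline page because in a sense you "bounce" on it on the way from user space into kernel space.
This reminds me of my undergraduate days, when Prof. Bai Jun (柏军) said that a trampoline can keep a program from "running away" (his trampoline is somewhat different from the one in this lab). He was talking about OS boot: a trampoline routine leads the bootloader to the entry address of main(), instead of the bootloader running main directly. The idea is that if a bug occurs inside main(), execution goes back to the trampoline, which calls main() again, so a single bug does not make the program panic.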
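The student's answer can be illustrated with a toy model. The sketch below assumes the xv6 memlayout definitions (MAXVA, TRAMPOLINE = MAXVA - PGSIZE, TRAPFRAME = TRAMPOLINE - PGSIZE), which reproduce the two addresses seen in gdb. The flat mapping arrays, the translate helper and all physical addresses are invented for the illustration; a real Sv39 page table is a three-level tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PGSIZE     4096ULL
#define MAXVA      (1ULL << (9 + 9 + 9 + 12 - 1))
#define TRAMPOLINE (MAXVA - PGSIZE)       /* 0x3ffffff000 */
#define TRAPFRAME  (TRAMPOLINE - PGSIZE)  /* 0x3fffffe000 */

struct mapping { uint64_t va; uint64_t pa; };

/* Toy page tables; the physical addresses are made up. */
const struct mapping user_pt[] = {
  { 0x0,        0x87f00000ULL },  /* user text/data */
  { TRAPFRAME,  0x87f01000ULL },  /* per-process trapframe */
  { TRAMPOLINE, 0x80007000ULL },  /* shared trampoline code */
};
const struct mapping kernel_pt[] = {
  { 0x80000000ULL, 0x80000000ULL },  /* kernel, direct-mapped */
  { TRAMPOLINE,    0x80007000ULL },  /* same page, same VA */
};

/* Translate va through a toy table; returns 1 on success, 0 if unmapped. */
int translate(const struct mapping *pt, size_t n, uint64_t va, uint64_t *pa)
{
  assert(pt != NULL && pa != NULL);
  uint64_t page = va & ~(PGSIZE - 1);
  for (size_t i = 0; i < n; i++) {
    if (pt[i].va == page) {
      *pa = pt[i].pa + (va - page);
      return 1;
    }
  }
  return 0;
}
```

Translating the pc inside the trampoline (for example TRAMPOLINE + 0x8e) gives the same physical address under both tables, so instruction fetch keeps working across csrw satp. Translating TRAPFRAME under kernel_pt fails, matching the comment in trampoline.S that a0 is no longer valid after the switch.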
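The last instruction is jr t0. Executing it takes us from the trampoline into the kernel's C code: it jumps to the function that t0 points to (as mentioned above, this is usertrap). We can of course print it to check: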
(gdb) stepi
0x0000003ffffff08e in ?? ()
1: x/i $pc
=> 0x3ffffff08e: jr t0
(gdb) x/3i $t0
0x8000276a <usertrap>: addi sp,sp,-32
0x8000276c <usertrap+2>: sd ra,24(sp)
0x8000276e <usertrap+4>: sd s0,16(sp)
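Next we jump into usertrap, running on the kernel stack and with the kernel page table.
usertrap also saves and restores hardware state to some extent, but in addition it has to examine what caused the trap in order to decide how to handle it. The code is as follows.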
//
// handle an interrupt, exception, or system call from user space.
// called from trampoline.S
//
void
usertrap(void)
{
int which_dev = 0;
if((r_sstatus() & SSTATUS_SPP) != 0)
panic("usertrap: not from user mode");
// send interrupts and exceptions to kerneltrap(),
// since we're now in the kernel.
w_stvec((uint64)kernelvec);
struct proc *p = myproc();
// save user program counter.
p->tf->epc = r_sepc();
if(r_scause() == 8){
// system call
if(p->killed)
exit(-1);
// sepc points to the ecall instruction,
// but we want to return to the next instruction.
p->tf->epc += 4;
// an interrupt will change sstatus &c registers,
// so don't enable until done with those registers.
intr_on();
syscall();
} else if((which_dev = devintr()) != 0){
// ok
}
else if(r_scause() == 13 || r_scause() == 15)
{
// printf("usertrap(): unexpected scause %p pid=%d\n", r_scause(), p->pid);
// printf(" sepc=%p stval=%p\n", r_sepc(), r_stval());
uvmalloc(p->pagetable, PGROUNDDOWN(r_stval()), PGROUNDDOWN(r_stval()) + 4096);
}
else {
printf("usertrap(): unexpected scause %p pid=%d\n", r_scause(), p->pid);
printf(" sepc=%p stval=%p\n", r_sepc(), r_stval());
printf("page down:%d\n",PGROUNDDOWN(r_stval()));
// printf("r :%d\n",r_scause());
// int sz = 0;
// while( !(r_stval()>=sz && r_stval()
// {
// sz = sz + 4096;
// }
// printf("sz:%d\n",sz);
p->killed = 1;
}
if(p->killed)
exit(-1);
// give up the CPU if this is a timer interrupt.
if(which_dev == 2)
yield();
usertrapret();
}
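After checking that the trap really came from user mode, the first thing it does is change the STVEC register. xv6 handles a trap differently depending on whether it comes from user space or from kernel space. So far we have only discussed what happens when the trap originates in user space. A trap that originates in the kernel follows a very different path, because the program is already using the kernel page table, so much of the work described above is unnecessary.
Before doing anything else in the kernel, usertrap points STVEC at kernelvec, which is where the trap-handling code for traps from kernel space lives, rather than at the code for traps from user space.
Now let's go through the code step by step; pay attention to the comments.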
// find the process that is currently running, via the hartid
// (uservec loaded it into tp before switching page tables)
struct proc *p = myproc();
// save the current process's pc into its trapframe
// (so it is not lost if another process runs in the meantime)
// save user program counter.
p->tf->epc = r_sepc();
// find out why we trapped; scause == 8 means a system call
if(r_scause() == 8){
  // system call
  if(p->killed)
    exit(-1);
  // sepc points to the ecall instruction,
  // but we want to return to the next instruction.
  p->tf->epc += 4;
  // an interrupt will change sstatus &c registers,
  // so don't enable until done with those registers.
  intr_on();
  syscall();
}
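The decision usertrap makes from scause, and the epc += 4 adjustment, can be condensed into a small sketch. The enum and the two functions are hypothetical names for this illustration. The cause codes are the ones used in the code above (8 for a system call, 13 and 15 for the page faults this modified usertrap handles); device interrupts, which usertrap checks through devintr(), are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

enum trap_kind { TRAP_SYSCALL, TRAP_PAGE_FAULT, TRAP_OTHER };

/* Classify a synchronous trap by its scause value. */
enum trap_kind classify_scause(uint64_t scause)
{
  switch (scause) {
  case 8:             /* environment call from U-mode */
    return TRAP_SYSCALL;
  case 13:            /* load page fault */
  case 15:            /* store page fault */
    return TRAP_PAGE_FAULT;
  default:
    return TRAP_OTHER;
  }
}

/* User pc to resume at: skip the 4-byte ecall after a system call,
   otherwise re-execute the instruction that trapped. */
uint64_t resume_pc(uint64_t sepc, uint64_t scause)
{
  assert((sepc & 1) == 0);  /* instruction addresses are 2-byte aligned */
  return classify_scause(scause) == TRAP_SYSCALL ? sepc + 4 : sepc;
}
```

With the values from this session, resume_pc(0xd66, 8) is 0xd6a, the address of the ret that follows ecall in the write stub.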
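syscall() reads the system call number that was saved in a7 and calls the matching handler, which here is sys_write: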
void
syscall(void)
{
  int num;
  struct proc *p = myproc();

  num = p->tf->a7;
  if(num > 0 && num < NELEM(syscalls) && syscalls[num]) {
    p->tf->a0 = syscalls[num]();
  } else {
    printf("%d %s: unknown sys call %d\n",
            p->pid, p->name, num);
    p->tf->a0 = -1;
  }
}
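The table dispatch in syscall() can be tried outside the kernel with a toy version. In this sketch struct toy_tf, toy_sys_write, toy_syscalls and toy_syscall are all invented for illustration. Unlike xv6, where handlers take no parameters and fetch their arguments from the trapframe with helper functions, the toy handler receives the trapframe directly, and it simply pretends that every requested byte was written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SYS_write 16
#define NELEM(x) (sizeof(x) / sizeof((x)[0]))

/* Only the saved registers this example needs. */
struct toy_tf { uint64_t a0, a1, a2, a7; };

typedef uint64_t (*toy_handler)(struct toy_tf *);

/* Toy handler: a0 = fd, a1 = buffer, a2 = byte count. */
uint64_t toy_sys_write(struct toy_tf *tf)
{
  return tf->a2;  /* pretend every requested byte was written */
}

/* Sparse table indexed by system call number; other slots are NULL. */
toy_handler toy_syscalls[] = {
  [SYS_write] = toy_sys_write,
};

/* Look up the handler by the number in a7; the result replaces a0. */
void toy_syscall(struct toy_tf *tf)
{
  assert(tf != NULL);
  uint64_t num = tf->a7;
  if (num > 0 && num < NELEM(toy_syscalls) && toy_syscalls[num]) {
    tf->a0 = toy_syscalls[num](tf);
  } else {
    tf->a0 = (uint64_t)-1;  /* unknown system call */
  }
}
```

The return value is stored in the saved a0 rather than in the live register, so it only becomes visible to user code when the trapframe is restored on the way out. That is also why a0 is the one register that differs after the call.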
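sys_write (whose arguments were saved in a0, a1 and a2) has now run, and it is time to return. usertrapret() handles the work the kernel has to do before going back to user space.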
//
// return to user space
//
void
usertrapret(void)
{
struct proc *p = myproc();
// turn off interrupts, since we're switching
// now from kerneltrap() to usertrap().
intr_off();
// send syscalls, interrupts, and exceptions to trampoline.S
w_stvec(TRAMPOLINE + (uservec - trampoline));
// set up trapframe values that uservec will need when
// the process next re-enters the kernel.
p->tf->kernel_satp = r_satp(); // kernel page table
p->tf->kernel_sp = p->kstack + PGSIZE; // process's kernel stack
p->tf->kernel_trap = (uint64)usertrap;
p->tf->kernel_hartid = r_tp(); // hartid for cpuid()
// set up the registers that trampoline.S's sret will use
// to get to user space.
// set S Previous Privilege mode to User.
unsigned long x = r_sstatus();
x &= ~SSTATUS_SPP; // clear SPP to 0 for user mode
x |= SSTATUS_SPIE; // enable interrupts in user mode
w_sstatus(x);
// set S Exception Program Counter to the saved user pc.
w_sepc(p->tf->epc);
// tell trampoline.S the user page table to switch to.
uint64 satp = MAKE_SATP(p->pagetable);
// jump to trampoline.S at the top of memory, which
// switches to the user page table, restores user registers,
// and switches to user mode with sret.
uint64 fn = TRAMPOLINE + (userret - trampoline);
((void (*)(uint64,uint64))fn)(TRAPFRAME, satp);
}
unsigned long x = r_sstatus();
// tell trampoline.S the user page table to switch to.
uint64 satp = MAKE_SATP(p->pagetable);
uint64 fn = TRAMPOLINE + (userret - trampoline);
((void (*)(uint64,uint64))fn)(TRAPFRAME, satp);
This function (userret in trampoline.S, reached through fn) contains all the instructions that take us back to user space.
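Two of the values that usertrapret() prepares are plain bit manipulation and can be checked in isolation. The sketch below assumes the constants as defined in xv6's riscv.h (SSTATUS_SPP is bit 8, SSTATUS_SPIE is bit 5, SATP_SV39 is 8 << 60). The two wrapper functions are hypothetical, and the page table address in the usage note is made up.

```c
#include <assert.h>
#include <stdint.h>

#define SSTATUS_SPP  (1ULL << 8)  /* previous mode: 1 = supervisor, 0 = user */
#define SSTATUS_SPIE (1ULL << 5)  /* interrupt enable to restore on sret */
#define SATP_SV39    (8ULL << 60) /* mode field selecting Sv39 paging */
#define MAKE_SATP(pagetable) (SATP_SV39 | (((uint64_t)(pagetable)) >> 12))

/* sstatus value such that sret returns to user mode with interrupts on. */
uint64_t sstatus_for_user_return(uint64_t sstatus)
{
  sstatus &= ~SSTATUS_SPP;   /* clear SPP: sret goes to user mode */
  sstatus |= SSTATUS_SPIE;   /* interrupts enabled once in user mode */
  return sstatus;
}

/* satp value for a page table at the given physical address. */
uint64_t make_satp(uint64_t pagetable_pa)
{
  assert((pagetable_pa & 0xfff) == 0);  /* must be page-aligned */
  return MAKE_SATP(pagetable_pa);
}
```

For example, a page table at the hypothetical physical address 0x87f6e000 gives a satp value of 0x8000000000087f6e: the Sv39 mode in the top bits and the physical page number in the low bits. This is the value passed to userret in a1.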
.globl userret
userret:
# userret(TRAPFRAME, pagetable)
# switch from kernel to user.
# usertrapret() calls here.
# a0: TRAPFRAME, in user page table.
# a1: user page table, for satp.
# switch to the user page table.
csrw satp, a1
sfence.vma zero, zero
# put the saved user a0 in sscratch, so we
# can swap it with our a0 (TRAPFRAME) in the last step.
ld t0, 112(a0)
csrw sscratch, t0
# restore all but a0 from TRAPFRAME
ld ra, 40(a0)
ld sp, 48(a0)
ld gp, 56(a0)
ld tp, 64(a0)
ld t0, 72(a0)
ld t1, 80(a0)
ld t2, 88(a0)
ld s0, 96(a0)
ld s1, 104(a0)
ld a1, 120(a0)
ld a2, 128(a0)
ld a3, 136(a0)
ld a4, 144(a0)
ld a5, 152(a0)
ld a6, 160(a0)
ld a7, 168(a0)
ld s2, 176(a0)
ld s3, 184(a0)
ld s4, 192(a0)
ld s5, 200(a0)
ld s6, 208(a0)
ld s7, 216(a0)
ld s8, 224(a0)
ld s9, 232(a0)
ld s10, 240(a0)
ld s11, 248(a0)
ld t3, 256(a0)
ld t4, 264(a0)
ld t5, 272(a0)
ld t6, 280(a0)
# restore user a0, and save TRAPFRAME in sscratch
csrrw a0, sscratch, a0
# return to user mode and user pc.
# usertrapret() set up sstatus and sepc.
sret
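First it switches page tables (from the kernel page table to the user page table).
Then it restores the registers that were saved in the trapframe. The user page table also maps the trampoline page, so the program keeps executing instead of crashing. (Here a0 holds the address of the trapframe.)
As for sfence.vma zero, zero: it flushes the TLB, so that no stale translations from the previous page table are used.
Printing all the registers again shows the same values as before the call (except a0, which has been overwritten by the return value of write).
Printing pc shows that we are back in user space, because the pc is small and is clearly a user virtual address.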
(gdb) p/x $pc
$44 = 0xd6a
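To sum up: system calls are deliberately designed to look like function calls, but the user/kernel transition behind them is far more complicated than a function call. Much of that complexity exists to preserve isolation between user and kernel: the kernel cannot trust anything that comes from user space.
The trampoline page is called that because in a sense you "bounce" on it on the way from user space into kernel space, and bounce on it again on the way back out. The trampoline code is mapped at the same address in both user space and kernel space, which keeps the program from crashing when it switches between the user and kernel page tables.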