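Video link
Slides link
This video course uses a 64-bit compiler!
The compiler used in these notes switches to 64-bit starting from Ch. 3.6, so addresses are 4 bytes before Ch. 3.6 and 8 bytes afterwards!
C compilation (cc1) and linking (ld)
Is switch always more efficient than if-else?
Is a while loop always more efficient than a for loop?
Is pointer access more efficient than array indexing?
Why are a function's local temporaries more efficient than references passed in as arguments?
Can the parentheses in an arithmetic expression affect the speed of evaluation?
Visualizing two's complement for negative numbers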
In fact, the sign bit of a signed (two's complement) number carries a weight as well, only that weight is negative. For example, -2 is represented as $1111\,1110=\red{-2^7}+\sum_{\red{w=1}}^{6}2^w+0\times2^{\red{0}}=-2$
“Unsigned and signed numbers have the same bit patterns; they are just a bunch of bits to the computer itself.”
// sizeof returns an unsigned type, so a is converted to unsigned, a-sizeof(a)>=0 is always true, and the loop runs forever
for(int a=1;a-sizeof(a)>=0;a--)
// so be careful when an unsigned "i" is used as an array index, as in a[i]:
// with i=0, i-- wraps around to UMAX, and a[i] may go out of bounds
#include <iostream>
#include <limits>
using namespace std;
int main()
{
unsigned int a=numeric_limits<unsigned int>::max();
int b=-1;
unsigned int c=-3;
cout<<(int)a<<" "<<a<<endl; //-1 4294967295
cout<<(b==a?"True":"False")<<endl; //True: b is converted to unsigned before the comparison
cout<<(b>a?"True":"False")<<endl; //False
cout<<std::hex<<c<<" "<<-c<<" "<<c+(-c)<<endl; //fffffffd 3 0
cout<<std::hex<<b<<"\n"<<numeric_limits<int>::max()<<endl; //ffffffff 7fffffff
cout<<b+numeric_limits<int>::max()<<endl; //7ffffffe
return 0;
}
$0110 = -0\times2^{3}+1\times2^{2}+1\times2^{1}+0\times2^{0}=6$
$1110 = -1\times2^{3}+1\times2^{2}+1\times2^{1}+0\times2^{0}=-2$
$11110 = \red{-1\times2^{4}+1\times2^{3}}+1\times2^{2}+1\times2^{1}+0\times2^{0}=-2$
Sign extension fills to the left with copies of the sign bit: $\red{-1\times2^{n+1}+1\times2^{n}}=-1\times2^{n}$, so the "negative weight" is unchanged.
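The weighted-sum view above can be checked with a few lines of C++. This is only a minimal sketch: the function names `twos_complement_value` and `sign_extend_one_bit` are made up for illustration, and the width is assumed to stay between 1 and 31 bits.

```cpp
#include <cstdint>

// Value of the low `width` bits of `bits`, read as a two's-complement number:
// the top bit has weight -2^(width-1), every other bit i has weight +2^i.
// Assumption: 1 <= width <= 32.
std::int64_t twos_complement_value(std::uint32_t bits, int width) {
    std::int64_t value = 0;
    for (int i = 0; i < width; ++i) {
        if ((bits >> i) & 1u) {
            std::int64_t weight = std::int64_t{1} << i;
            value += (i == width - 1) ? -weight : weight;
        }
    }
    return value;
}

// Sign extension by one bit: copy the current sign bit into position `width`.
// Assumption: 1 <= width <= 31.
std::uint32_t sign_extend_one_bit(std::uint32_t bits, int width) {
    std::uint32_t sign = (bits >> (width - 1)) & 1u;
    return bits | (sign << width);
}
```

Extending `1110` (4 bits) to `11110` (5 bits) with `sign_extend_one_bit` and evaluating both with `twos_complement_value` gives -2 in each case, matching the equations above.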
$(-1)^{s}M\times2^{E}$
precision | sign field | exp field | frac field |
---|---|---|---|
value | s | exp | frac |
single | 1 bit | k = 8 bit | 23 bit |
double | 1 bit | k = 11 bit | 52 bit |
Extended precision (Intel only) | 1 bit | 15 bit | 64 bit |
10 bytes in total, aligned to 16 bytes, so the trailing 6 bytes are unused padding.
$exp \neq 000\ldots0$ and $exp \neq 111\ldots1$
- Why $bias=2^{k-1}-1$, not $2^{k-1}$?
$exp = 000\ldots0$
kind | s | exp | frac | represented value |
---|---|---|---|---|
denorms | 0 | 0000,0000 | 11…1 | $2^{-126}\times(2^{-1}+\dots+2^{-23})=2^{-126}\times(1-2^{-23})$ |
norms | 0 | 0000,0001 | 00…0 | $1.0\times2^{-126}$ |

Shift "1.00…0" into $0.1_2\times2^{1}$ and "hide" the $2^1$ in E; this is why denormalized values use $E=1-bias\red\neq 0-bias$.
With the denormalized values scaled by $2^{-126}$ rather than $2^{-127}$, the transition from $DENORM_{max}$ to $NORM_{min}$ is smooth, so consecutive floats behave like an unsigned integer being incremented by 1!
Smallest positive denormalized value $=0,00000000,000\ldots01=1\times2^{\red{-126-23}}=2^{-149}$
Denormalized floats can represent values closer to 0. The closer to 0, the smaller E and the finer the resolution, i.e. the smaller the spacing between adjacent numbers; within the denormalized range the spacing is a constant $2^{-149}$.
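To make the field layout concrete, here is a hedged sketch that splits a single-precision float into its three fields and rebuilds the value with $E=exp-bias$ for normalized numbers and $E=1-bias$ for denormalized ones. The names `FloatFields`, `split_float`, `value_from_fields` and `float_from_bits` are hypothetical, and the code assumes `float` is a 32-bit IEEE 754 type.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

struct FloatFields {
    std::uint32_t sign;        // 1 bit
    std::uint32_t exp_field;   // 8 bits
    std::uint32_t frac_field;  // 23 bits
};

// Reinterpret a bit pattern as a float (memcpy avoids aliasing problems).
float float_from_bits(std::uint32_t bits) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes 32-bit float");
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

FloatFields split_float(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return { bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu };
}

// Rebuild the value from the fields; only handles exp_field != 255
// (infinities and NaN are deliberately left out of this sketch).
double value_from_fields(const FloatFields& p) {
    const int bias = 127;
    double frac = std::ldexp(static_cast<double>(p.frac_field), -23);  // 0.frac
    double m;
    int e;
    if (p.exp_field == 0) {   // denormalized: M = 0.frac, E = 1 - bias
        m = frac;
        e = 1 - bias;
    } else {                  // normalized: M = 1.frac, E = exp - bias
        m = 1.0 + frac;
        e = static_cast<int>(p.exp_field) - bias;
    }
    double v = std::ldexp(m, e);
    return p.sign ? -v : v;
}
```

Feeding the bit patterns `0x007FFFFF` (largest denormalized) and `0x00800000` (smallest normalized) through these helpers reproduces the two table rows above.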
value | exp | frac | meaning |
---|---|---|---|
$+\infty$ | 111…1 | 000…0 | overflows |
NaN | 111…1 | $\red\neq$ 000…0 | no feasible answer |

$1.0/-0.0=-\infty$
$\sqrt{-1}=\infty-\infty=\infty\times0=NaN$
Round to the nearest $2^{-2}$; watch the bit immediately to the right of the rounding position ($2^{-3}$ in this case).

value | binary | note | rounded | rounded value |
---|---|---|---|---|
$2\frac{3}{32}$ | $10.00\red{0}11_{2}$ | $0.00011<2^{-3}$: round down | $10.00_{2}$ | $2$ |
$2\frac{3}{16}$ | $10.00\red{1}10_{2}$ | $0.00110>2^{-3}$: round up | $10.01_{2}$ | $2\frac{1}{4}$ |
$2\frac{7}{8}$ | $10.11\red{1}00_{2}$ | $0.00100=2^{-3}$ (tie); dropping it would leave the odd $10.11_{2}$, so round up: $10.111_{2}+0.001_{2}=11.00_{2}$ | $11.00_{2}$ | $3$ |
$2\frac{5}{8}$ | $10.10\red{1}00_{2}$ | $0.00100=2^{-3}$ (tie); dropping it leaves the even $10.10_{2}$, so round down: $10.101_{2}-0.001_{2}=10.10_{2}$ | $10.10_{2}$ | $2\frac{1}{2}$ |
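The round-to-even rule can be tried out directly, since `std::nearbyint` follows the current floating-point rounding mode, which is round-to-nearest-even by default. A minimal sketch, assuming the rounding mode has not been changed; the helper name `round_to_quarter` is made up for this example.

```cpp
#include <cmath>

// Round x to the nearest multiple of 1/4 (two bits right of the binary point).
// Scaling by 4 is exact in binary, so the only rounding happens in nearbyint,
// which breaks ties toward the even neighbour under the default rounding mode.
double round_to_quarter(double x) {
    return std::nearbyint(x * 4.0) / 4.0;
}
```

For instance, `round_to_quarter(2.875)` is a tie between 2.75 and 3.0 and goes up to 3.0, while `round_to_quarter(2.625)` is a tie between 2.5 and 2.75 and goes down to 2.5.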
$(3.14+1e10)-1e10=1e10-1e10=0$
$3.14+(1e10-1e10)=3.14+0=3.14$
$(1e20*1e20)*1e-20=\infty*1e-20=\infty$
$1e20*(1e20*1e-20)=1e20*1=1e20$
$d_{min}<0$, $d_{min}*2=overflow<0$ # a negative overflow is still less than 0
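The loss of associativity is easy to reproduce with single-precision values. The sketch below is one possible way to do it; the function names are invented for illustration, and `volatile` is used only to force each intermediate result to be rounded to `float` (an assumption made to keep the behaviour the same on compilers that would otherwise keep extra precision).

```cpp
// (a + big) - big : a is absorbed when it is smaller than half an ulp of big.
float add_then_sub(float a, float big) {
    volatile float t = a + big;
    return t - big;
}

// a + (big - big) : the large values cancel first, so a survives.
float sub_then_add(float a, float big) {
    volatile float t = big - big;
    return a + t;
}

// (x * y) * z : the first product may overflow to infinity.
float mul_left(float x, float y, float z) {
    volatile float t = x * y;
    return t * z;
}

// x * (y * z) : the small factor is applied first, avoiding the overflow.
float mul_right(float x, float y, float z) {
    volatile float t = y * z;
    return x * t;
}
```

With `a = 3.14f` and `big = 1e10f` the first pair returns 0 and 3.14 respectively; with `1e20f, 1e20f, 1e-20f` the left-grouped product is infinite while the right-grouped one stays near 1e20.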
#include <iostream>
using namespace std;
int main()
{
int x=0x7FFFFFFF;
float f=0.0;
double d=0.0;
cout<<"int(x):"<<x<<endl<<"float(x):"<<(float)x<<endl;
cout<<((x==(int)(float)x)?"True":"False")<<endl;// prints True, possibly because of compiler optimization
f=(float)x;// float has only 23 fraction bits, so the low-order bits of x are rounded away
cout<<((x==(int)f)?"True":"False")<<endl; // prints False
return 0;
}
Data Lab
Intel x86 (read the "x" as the letter x, not as "cross 86")
processor | date | transistors | MHz | feature |
---|---|---|---|---|
8086 | 1978 | 29K | 5-10 | First 16-bit microprocessor, 1 MB address space. A slight variation was the basis for the IBM PC |
80286 | | | | |
80386 | 1985 | 275K | 16-33 | 32-bit + "flat addressing" => Unix capable. IA32 (Intel Architecture 32) |
Pentium 4E | 2004 | 125M | 2800-3800 | First Intel x86-64 processor. Power consumption 100 W: the power budget problem |
Core 2 | 2006 | 291M | 1060-3500 | First multi-core Intel processor |
Core i7 | 2008 | 731M | 1700-3900 | 4 cores (the shark machines) |

1980s: RISC (reduced instruction set computer) vs. CISC.
Desktop Model | Server Model |
---|---|
4 cores | 8 cores |
Integrated graphics | Integrated I/O |
3.3-3.8 GHz | 2~2.6 GHz |
65W | 45W |
year | AMD | Intel |
---|---|---|
2001 | A little bit slower, but a lot cheaper | Itanium (/aɪˈteɪniəm/), architecture IA64: too idealistic, disappointing |
2003 | Comes up with x86-64, also called "AMD64" | Insists on focusing on IA64 |
2004 | | EM64T (almost identical to x86-64); lots of code still runs in 32-bit mode |

A cross-license allows AMD to produce x86 processors.
Sufficiently simple, and can be customized.
Lower power requirements than x86 machines.
Sells companies the rights (intellectual property) to use its designs, not chips.
terminology | definition | examples |
---|---|---|
Architecture (ISA) | Instruction Set Architecture: the parts of a processor design that one needs to understand in order to write machine code. | Instruction set specification, registers. |
Microarchitecture | Implementation of the architecture. The ISA is the abstraction that helps hardware people design. | Cache sizes and core frequency. |
Machine Code | Byte-level programs that a processor executes | |
Assembly Code | Text version of machine code | |
There is no way (or instructions) you can directly access or manipulate cache.
PC:Program counter
Address of next instruction
Called “RIP”(x86-64)
Register file
Heavily used program data
Condition codes
Store status information about most recent arithmetic or logical operation
Used for conditional branching
Memory
Byte addressable array
Code and user data
Stack to support procedures
Taking the earlier floating-point experiment as an example:
Invoking gcc actually runs a sequence of programs behind the scenes.
Options starting with -g, -f, -m, -O, -W, or --param are automatically
【-O】Do optimization
【-Og】Use debug-level optimizations to keep the code readable
【-O2】The most common optimization level
Command | Function | output |
---|---|---|
g++ -E *.cpp | Preprocess only | *.i |
g++ -Og -S *.cpp | "Stop" after compiling | *.s |
g++ -c *.s | Assemble to get object code | *.o |
g++ *.o | Link and get an executable program | *.exe, a.out |
objdump -d *.exe | Disassemble a binary executable | *.s |

Lines starting with a period are not instructions; they carry information needed by the debugger, linker, and so on.
Disassemble with gdb
#include <iostream>
using namespace std;
int main()
{
cout<<"hello world\n";
return 0;
}
>gdb .\*.exe
>(gdb) disassemble main
Dump of assembler code for function main:
0x00401460 <+0>: push %ebp
0x00401461 <+1>: mov %esp,%ebp
0x00401463 <+3>: and $0xfffffff0,%esp
0x00401466 <+6>: sub $0x10,%esp
0x00401469 <+9>: call 0x401a30 <__main>
0x0040146e <+14>: movl $0x405065,0x4(%esp)
0x00401476 <+22>: movl $0x408254,(%esp)
End of assembler dump.
>(gdb) x/3xb 0x00401466
0x401466 <main+6>: 0x83 0xec 0x10
x86-64 Integer Registers
%r* means 64bits
%e*x is the low-order 32 bits of %r*x
Why "ax, bx, cx, …"? Historical legacy.
See the Intel SDM (download link) for details.
“q” for “quad word” (64 bits, Intel terminology)
“l” for “long word” (32bits)
“word” for 16 bits (8086)
Src Types | Example | Dest | C analog(treat reg as var) |
---|---|---|---|
Immediate | $0x400 | Reg,Mem | temp = 0x400; *p=0x400; |
Register | %rax,%r13 | Reg,Mem | temp2 = temp1;*p=temp; |
Memory | (%rax) | Reg | temp = *p; |

Memory Dereference
movq (Reg), Reg
Location in memory: address = register value. (A single mov cannot have memory as both source and destination.)
C code:
void swap(long *xp,long *yp)
{
long t0 = *xp;
long t1=*yp;
*xp=t1;
*yp=t0;
}
Machine level:
swap:
movq (%rdi), %rax
movq (%rsi), %rdx
movq %rdx, (%rdi)
movq %rax, (%rsi)
ret
- Arguments always come in (at most 6) specific registers, in order: %rdi, %rsi, ...
- Register allocation algorithm?
movq Disp(Reg), Reg
Location in memory: address = value in Reg + constant Disp
movq Disp(Rb,Ri,Scale), Reg
Location in memory: address = Rb + Scale*Ri + Disp (Scale is 1, 2, 4, or 8)
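The general addressing form can be written out as plain arithmetic. The sketch below is only an illustration: `effective_address` is a hypothetical helper, and the register values used in the usage note are example numbers, not taken from a real program.

```cpp
#include <cstdint>

// Address computed by the operand form Disp(Rb, Ri, Scale):
//   address = Rb + Scale * Ri + Disp
// Assumption: scale is one of 1, 2, 4, 8, as the hardware requires.
std::uint64_t effective_address(std::uint64_t rb, std::uint64_t ri,
                                unsigned scale, std::int64_t disp) {
    return rb + ri * scale + static_cast<std::uint64_t>(disp);
}
```

With `%rdx = 0xf000` and `%rcx = 0x0100`, `0x8(%rdx)` corresponds to `effective_address(0xf000, 0, 1, 8)` and `(%rdx,%rcx,4)` to `effective_address(0xf000, 0x100, 4, 0)`.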
Load Effective Address (lea) is like the ampersand (&) operation in C.
It is a pretty handy way to do arithmetic, and C compilers like to use it.
Src is a memory reference (an address expression).
Dest has to be a register; it receives the address computed from Src, not the value stored there.
long m12(long x)
{
return 12*x;
}
//g++ -S *.cpp
__Z3m12l:
movl %edx, %eax
addl %eax, %eax
addl %edx, %eax
sall $2, %eax
popl %ebp
ret
//g++ -Og -S *.cpp
__Z3m12l:
movl 4(%esp), %eax
leal (%eax,%eax,2), %edx //x+x*2 ==> dx
leal 0(,%edx,4), %eax //(x+x*2)*4 ==> ax
ret
//lecture
leal (%eax,%eax,2), %edx //x+x*2 ==> dx
sall $2, %edx //(x+x*2)<<2 ==> ax
ret
Format | Computation in C form |
---|---|
addq Src, Dest | Dest = Dest + Src |
subq Src, Dest | Dest = Dest - Src |
imulq Src, Dest | Dest = Dest * Src |
salq Src, Dest | Dest = Dest << Src (=shlq) |
sarq Src, Dest | Dest = Dest >> Src (Arithmetic) |
shrq Src, Dest | Dest = Dest >> Src (Logical) |
xorq Src, Dest | Dest = Dest ^ Src |
andq Src, Dest | Dest = Dest & Src |
orq Src, Dest | Dest = Dest \| Src |
Format | Computation in C form |
---|---|
incq Dest | Dest = Dest + 1 |
decq Dest | Dest = Dest - 1 |
negq Dest | Dest = -Dest (negate) |
notq Dest | Dest = ~ Dest (tilde “~” not exclamation “!”) |
So far the registers we should know
Each of them is a one-bit flag; they are not read or written directly, but set as a side effect of other operations.
Flag | Name | Set if |
---|---|---|
CF | Carry Flag | carry out from the most significant bit (unsigned overflow) |
ZF | Zero Flag | Dest == 0 |
SF | Sign Flag | Dest < 0 (as signed) |
OF | Overflow Flag | two's-complement (signed) overflow: a>0, b>0, a+b<0, or a<0, b<0, a+b>=0; if a and b have opposite signs, a+b cannot overflow |

Note: lea does not affect the flags!
Effect of each instruction on the flags
cmpq Src2, Src1 does the subtraction (Src1 - Src2) and sets the 4 flags above, but does nothing else with the result (it is not stored in a destination).
Src1-Src2 | CF | ZF | SF | OF |
---|---|---|---|---|
$>0$ | 0 | 0 | 0 | 0 |
$=0$ | 0 | 1 | 0 | 0 |
(unsigned) cmpq 2,1 | 1 | 0 | 1 | 0 |
(signed) cmpq 2,1 | 1 | 0 | 1 | 0 |
(signed) cmpq INT_MAX,INT_MIN | 0 | 0 | 0 | 1 |
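A small experiment
//test.cpp
#include <iostream>
using namespace std;
int main() {
unsigned int ua=1;
unsigned int ub=2;
unsigned int uc=0;
uc=ua-ub;
return 0;
}
g++ -g -DEBUG test.cpp # -g keeps line-number information
gdb a.exe
(gdb) list # print the source with line numbers
(gdb) break 9 # set a breakpoint just before return
(gdb) run # run and stop at the first breakpoint
(gdb) info registers eflags
eflags 0x297 [ CF PF AF SF IF ] # the condition codes listed in brackets are set to 1
My understanding: CF is set whenever there is a carry out of the sign (most significant) bit.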
Src1+Src2 | binary form | result | flags |
---|---|---|---|
$\frac{INT\_MIN}{2}+\frac{INT\_MIN}{2}$ | 1100…00 + 1100…00 | (1)10…00 | CF=1, SF=1, OF=0 |
$\frac{INT\_MIN}{2}+\frac{INT\_MIN}{2}-1$ | 1100…00 + 1100…00 + 1111…11 | (1)011…1 | CF=1, SF=0, OF=1 (negative + negative = positive: overflow) |
Like computing a & b without setting a destination.
testq Src1, Src2 computes (Src1 & Src2) and sets the condition codes in eflags.
Set the low-order byte of the destination to 0 or 1 based on a combination of condition codes, without changing the remaining 7 bytes.
SetX | Condition | Set to 1 if the last result was |
---|---|---|
sete | ZF | = 0 |
setne | ~ZF | $\neq 0$ |
sets | SF | < 0 |
setns | ~SF | >= 0 |
setg | ~(SF ^ OF) & ~ZF | > (signed) |
setge | ~(SF ^ OF) | >= (signed) |
setl | (SF ^ OF) | < (signed) |
setle | (SF ^ OF) \| ZF | <= (signed) |
seta | ~CF & ~ZF | above (unsigned) |
setb | CF | below (unsigned) |
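The flag combinations in the table can be sanity-checked by modelling a 32-bit compare in software. This is a sketch under stated assumptions, not how the hardware is specified: the `Flags` struct and the functions `flags_after_cmp`, `setg`, `setl`, `seta`, `setb` are hypothetical names that mirror the instructions.

```cpp
#include <cstdint>

struct Flags {
    bool cf, zf, sf, of;
};

// Flags produced by computing a - b on 32-bit operands,
// i.e. what `cmpl b, a` does in AT&T operand order.
Flags flags_after_cmp(std::uint32_t a, std::uint32_t b) {
    std::uint32_t r = a - b;                 // wraps modulo 2^32
    bool sa = (a >> 31) != 0;
    bool sb = (b >> 31) != 0;
    bool sr = (r >> 31) != 0;
    Flags f;
    f.cf = a < b;                            // borrow: unsigned a is below b
    f.zf = (r == 0);
    f.sf = sr;
    f.of = (sa != sb) && (sr != sa);         // signs differ and result sign flipped
    return f;
}

bool setg(const Flags& f) { return !(f.sf != f.of) && !f.zf; }  // ~(SF ^ OF) & ~ZF
bool setl(const Flags& f) { return f.sf != f.of; }              // SF ^ OF
bool seta(const Flags& f) { return !f.cf && !f.zf; }            // ~CF & ~ZF
bool setb(const Flags& f) { return f.cf; }                      // CF
```

Comparing `INT_MIN` with `INT_MAX` this way yields CF=0, ZF=0, SF=0, OF=1, and `setl` still reports "less than" because it looks at SF ^ OF rather than at SF alone.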
Example:
bool mycmp(long a,long b)
{
return a>b;
}
mycmp:
movl 8(%esp), %eax
cmpl %eax, 4(%esp)
setg %al
#movzbq %al, %eax #move with zero extension byte to quad
ret
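A weird quirk of x86-64 (AMD)
Any instruction that produces a 32-bit result zeroes the upper 32 bits of the destination register, whereas instructions producing 8- or 16-bit results leave the remaining bits unchanged.
jmp, je, jne, js, jns, jg, jge, jl, jle, ja, jb: apart from the unconditional jmp, the conditions are the same as for setX.
Example: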
long abs(long x,long y)
{
long result;
if(x>y)
result = x-y;
else
result = y-x;
return result;
}
>gcc -Og -S -fno-if-conversion test.cpp
abs: # labels exist only in assembly code; they become addresses in object code
movl 4(%esp), %edx #x
movl 8(%esp), %eax #y
cmpl %eax, %edx # y, x
jg L14
subl %edx, %eax # y-x
ret
L14:
subl %eax, %edx # x-y
movl %edx, %eax
ret
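Conditional moves (if-conversion): the results of both branches of the if-else are computed, and one of them is selected at the end.
>gcc -Og -S test.cpp # without -fno-if-conversion; gcc allows this transformation by default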
abs:
movq %rdi, %rax #x
subq %rsi, %rax #x=x-y
movq %rsi, %rdx #y
subq %rdi, %rdx #y=y-x
cmpq %rsi, %rdi #x-y
cmovle %rdx, %rax #if(x<=y)ret(y-x)
ret #result in %rax
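Why: branches are very disruptive to the instruction flow through pipelines. Computing both results is wasteful, but more efficient overall.
See: pipelining, branch prediction.
As long as branch prediction is accurate enough (about 98%), the pipeline runs very efficiently (working about 20 instructions ahead).
On a misprediction the speculative work is discarded and redone, costing up to 40 clock cycles.
Cases where gcc (deliberately) avoids conditional moves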
“Do-While” Loop
long popcount(unsigned long x)
{
long res=0;
do
{
res += x & 0x1;
x >>= 1;
}while(x);
return res;
}
popcount:
movl 4(%esp), %edx
movl $0, %eax
L12:
movl %edx, %ecx
andl $1, %ecx
addl %ecx, %eax
shrl %edx
jne L12
ret
“While” Loop
Test at the very beginning and skip the loop if condition doesn’t hold.
long popcount(unsigned long x)
{ ... while(x){...} ... }
popcount:
movl 4(%esp), %edx # x
movl $0, %eax
L13:
testl %edx, %edx
je L11
movl %edx, %ecx
andl $1, %ecx
addl %ecx, %eax
shrl %edx
jmp L13
L11:
ret
“For” Loop
for( Init; Test; Update)
body;
Semantics =
Init;
while(Test)
{ Body; Update; }
long popcount(unsigned long x)
{
size_t i=0;
long res=0;
for(i=0;i<32;i++)
{
res += x & 0x1;
x >>= 1;
}
return res;
}
>g++ -Og -S test.cpp
popcount:
pushl %ebx
movl 8(%esp), %ecx # x
movl $0, %eax # res=0
movl $0, %edx # i=0
L13:
cmpl $31, %edx # i>31
ja L11 # return
movl %ecx, %ebx
andl $1, %ebx # x & 1
addl %ebx, %eax # res += x & 1
shrl %ecx # x >>= 1
addl $1, %edx # i += 1
jmp L13
L11:
popl %ebx
ret
Raising the optimization level to -O1 removes the initial test and turns the loop into a "do-while" loop.
>g++ -O1 -S test.cpp
popcount:
...
movl $32, %edx
movl $0, %eax
L4:
movl %ecx, %ebx
...
shrl %ecx
subl $1, %edx
jne L4
...
If the initial test is false, there is no loop at all.
popcount:
movl $0, %eax
ret
long switch_try(unsigned long x)
{
long res=0;
switch (x)
{
case 1:
res += 1;
break;
case 2:
res += 2;
case 3:
res *= 3;
break;
case 5:
case 4:
res -=1;
break;
case -1:
res *= -1;
break;
default:
res = 100;
}
return res;
}
switch_try:
movl 4(%esp), %eax # x
leal 1(%eax), %edx # bias by +1 so that case -1 maps to the unsigned index 0
cmpl $6, %edx # the largest case value is 5, i.e. 6 after the bias
ja L12 # a small trick:
# ja is an unsigned comparison; values below -1 are still negative after the bias,
# and read as unsigned they exceed the whole positive signed range, so they go to default
jmp *L14(,%edx,4) # indirect jump: the value stored at L14+4*(x+bias) is the jump target
.section .rdata,"dr"
.align 4
L14: # jump table: the compiler lays out the structure, the assembler fills in the addresses
.long L13 # each entry is a long holding an address; x=-1
.long L12 # x=0
.long L11 # x=1
.long L16 # x=2
.long L17 # x=3
.long L18 # x=4
.long L18 # x=5
.text
L17:
movl $0, %eax # x=3, res=0*3=0
L16: # x=2: after res+=2, res==x, so the optimizer uses %eax (x) as res
leal (%eax,%eax,2), %eax # res=2*3=6
ret
L13: # x=-1
movl $0, %eax # res=0*(-1)=0, folded by the compiler into the constant 0
ret
L18: # x=4 or x=5
movl $-1, %eax
ret
L12: # default case
movl $100, %eax
L11: # x=1
rep ret # res=0+1 equals x, which is already in %eax, so return %eax directly
long switch_try(long x)
{
long res=0;
switch (x)
{
case 1:
res=0;
break;
case 100:
res=99;
break;
default:
res = -1;
}
return res;
}
switch_try:
movl 4(%esp), %eax
cmpl $1, %eax
je L13
cmpl $100, %eax
je L15
movl $-1, %eax
ret
L13:
movl $0, %eax
ret
L15:
movl $99, %eax
ret
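Is switch always more efficient than if-else? Based on the analysis above, the answer is no: with sparse case values the compiler falls back to a chain of comparisons, just like if-else.
We are never happy with a simple explanation. We want to understand how we could actually implement it as a program if we ever had to do so.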
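In that spirit, the jump-table dispatch of the first `switch_try` can be imitated by hand with an array of function pointers. This is a sketch of the idea only, not what the compiler literally emits: the handler names and `switch_try_table` are made up, `res` starts at 0 as in the original function, and the parameter is taken as a signed `long` for simplicity.

```cpp
namespace {

long on_minus1(long res)  { return res * -1; }
long on_default(long)     { return 100; }
long on_1(long res)       { return res + 1; }
long on_2(long res)       { return (res + 2) * 3; }  // case 2 falls through into case 3
long on_3(long res)       { return res * 3; }
long on_4_or_5(long res)  { return res - 1; }

using Handler = long (*)(long);

// Index = x + 1, so x = -1 maps to slot 0 and x = 5 maps to slot 6.
// x = 0 has no case label, so its slot points at the default handler.
const Handler kJumpTable[] = {
    on_minus1, on_default, on_1, on_2, on_3, on_4_or_5, on_4_or_5
};

}  // namespace

long switch_try_table(long x) {
    unsigned long index = static_cast<unsigned long>(x) + 1;  // bias by +1
    if (index > 6) {
        // One unsigned comparison (the role of `ja`) rejects both x < -1 and x > 5.
        return on_default(0);
    }
    return kJumpTable[index](0);  // indirect jump through the table
}
```

The cost is one comparison plus one indexed load regardless of how many cases there are, which is why dense case values favour a table while sparse ones (1 and 100 in the second example) do not.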
ABI (Application Binary Interface): a convention for how binary programs interface with each other at the machine-code level.
Address | Meaning |
---|---|
High address | (%rbp) stack bottom |
… | |
Low address | (%rsp) stack top |
pushq Src
step 1: Fetch the operand at Src (immediate or register).
step 2: Decrement %rsp by 8.
step 3: Write the operand at the address given by %rsp.
popq Dest
step 1: Read the value at the address given by %rsp.
step 2: Increment %rsp by 8.
step 3: Store the value at Dest (must be a register).
The data at the old top of the stack is still there in memory, but is no longer part of the stack.
call label
step 1: Push the return address on the stack, sp=sp-sizeof(address)
step 2: Jump to label // %rip cannot be manipulated explicitly
ret
step 1: Pop the address (of the instruction right after the call) from the stack, sp=sp+sizeof(address)
step 2: Jump to that address
long sub_try(long x)
{
return x+1;
}
long call_try(long x)
{
return sub_try(x+1);
}
sub_try:
movl 4(%esp), %eax
addl $1, %eax
ret
call_try:
subl $4, %esp
movl 8(%esp), %eax
addl $1, %eax
movl %eax, (%esp)
call sub_try # sp=sp-4; *sp=addr after call
addl $4, %esp
ret
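ABI rules
The first 6 integer arguments are passed in the registers { %rdi, %rsi, %rdx, %rcx, %r8, %r9 }; arguments beyond the sixth are passed on the stack; the return value is in %rax.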
long incr(long *p,long val)
{
long x=*p;
long y=x+val;
*p=y;
return x;
}
long call_incr()
{
long v1=15213;
long v2=incr(&v1,3000);
return v1+v2;
}
incr:
movl 4(%esp), %edx # dx=*(-28+4)=-4
movl (%edx), %eax # ax=*(-4)=15213
movl %eax, %ecx # cx=ax
addl 8(%esp), %ecx # cx=cx+*(-20)=15213+3000
movl %ecx, (%edx) # *(-4)=18213
ret # sp=sp+4=-24
call_incr: # assume sp=0 <--- start
subl $24, %esp # sp=-24, allocate 24 bytes
movl $15213, 20(%esp)# *(-4)=15213
movl $3000, 4(%esp) # *(-20)=3000
leal 20(%esp), %eax # ax=-4
movl %eax, (%esp) # *(-24)=-4
call incr # sp=sp-4=-28, the 4-byte return address is pushed
addl 20(%esp), %eax # ax=15213+*(-4)=15213+18213
addl $24, %esp # sp=0, release the stack space
ret
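Floating-point arguments are passed in a separate set of registers.
Why are a function's local temporaries more efficient than references passed in as arguments? Because temporaries live in registers, whereas a reference has to be dereferenced (indirect addressing), which is comparatively slow.
Stack frame: the block of stack used for one particular call.
When a call occurs:
When P calls Q, arguments beyond the sixth are stored in P's frame.
Most systems limit the maximum depth of the stack.
%rbp serves as the frame pointer.
In some cases %rbp is used to record the base of the caller's stack frame.
CS:APP (3rd ed., English edition), p. 286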
void proc(long a1, long *a1p, int a2, int *a2p, short a3, short *a3p, char a4, char *a4p)
{
*a1p += a1;
*a2p += a2;
*a3p += a3;
*a4p += a4;
}
long call_proc()
{
long x1 = 1;
int x2 = 2;
short x3 = 3;
char x4 = 4;
proc(x1, &x1, x2, &x2, x3, &x3, x4, &x4);
return (x1+x2)*(x3-x4);
}
call_proc: # callee save
subq $32, %rsp # Allocate 32-byte stack frame
movq $1, 24(%rsp) # Store 1 in &x1
movl $2, 20(%rsp) # Store 2 in &x2
movw $3, 18(%rsp) # Store 3 in &x3
movb $4, 17(%rsp) # Store 4 in &x4
leaq 17(%rsp), %rax # Create &x4
movq %rax, 8(%rsp) # Store &x4 as argument 8
movl $4, (%rsp) # Store 4 as argument 7
leaq 18(%rsp), %r9 # Pass &x3 as argument 6
movl $3, %r8d # Pass 3 as argument 5
leaq 20(%rsp), %rcx # Pass &x2 as argument 4
movl $2, %edx # Pass 2 as argument 3
leaq 24(%rsp), %rsi # Pass &x1 as argument 2
movl $1, %edi # Pass 1 as argument 1
call proc
movslq 20(%rsp), %rdx # Get x2 and convert to long
addq 24(%rsp), %rdx # Compute x1+x2
movswl 18(%rsp), %eax # Get x3 and convert to int
movsbl 17(%rsp), %ecx # Get x4 and convert to int
subl %ecx, %eax # Compute x3-x4
cltq # Convert to long
imulq %rdx, %rax # Compute (x1+x2) * (x3-x4)
addq $32, %rsp # Deallocate stack frame
ret # Return
proc:
movq 16(%rsp), %rax
addq %rdi, (%rsi)
addl %edx, (%rcx)
addw %r8w, (%r9)
movl 8(%rsp), %edx
addb %dl, (%rax)
ret
The "callee-saved" case is the more common one.
CS:APP (3rd ed., English edition), p. 288
long P(long x, long y)
{
long u = Q(y);
long v = Q(x);
return u + v;
}
P:
pushq %rbp # Save %rbp | Callee-Saved
pushq %rbx # Save %rbx
subq $8, %rsp # Align stack frame
movq %rdi, %rbp # Save x | Caller-Saved
movq %rsi, %rdi # Move y to first argument
call Q # Call Q(y)
movq %rax, %rbx # Save result
movq %rbp, %rdi # Move x to first argument
call Q # Call Q(x)
addq %rbx, %rax # Add Q(y) to Q(x), believe rbp not changed before & after Q
addq $8, %rsp # Deallocate last part of stack
popq %rbx # Restore %rbx; note LIFO: registers are popped in the reverse order of the pushes
popq %rbp # Restore %rbp
ret
Example from the lecture video
unsigned long pcount_r(unsigned long x)
{
if (x==0)
return 0;
else
return (x & 1) + pcount_r(x >> 1);
}
pcount_r:
pushl %ebx # sp = sp-4; *sp = ebx;
subl $24, %esp # sp = sp-24;
movl 32(%esp), %eax # eax = *(sp+32) = x;
testl %eax, %eax #
jne L14 # if(eax != 0) goto L14;
L12:
addl $24, %esp # sp = sp+24;
popl %ebx # ebx = *sp; sp = sp+4;
ret
L14:
movl %eax, %ebx # ebx = eax;
andl $1, %ebx # ebx = ebx & 1;
shrl %eax # eax >>= 1;
movl %eax, (%esp) # *sp = eax;
call pcount_r #
addl %ebx, %eax # eax = eax + { ebx = (x & 1)}
# eax is never pushed; when the innermost callee returns, eax = 0
jmp L12
echo "eax serves as the input argument and, once it reaches 0 in the innermost callee, as the accumulated return value. Neat!"
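For complex data structures, it is advisable to break the declaration up with several nested typedefs so that the structure is clear.
//declare an array of 5 elements; each element is a pointer to a function that takes (int) and returns int*
int *(*a[5])(int);
//simplify the declaration with typedef
typedef int *(*pFun)(int);
pFun a[5];
//declare an array of 5 pointers to type-A functions; a type-A function takes a pointer to a type-B function and returns int*; a type-B function takes no arguments and returns nothing
int *(*b[5])(void(*)(void));
//simplify the declaration in two steps with typedef
typedef void(*pVoidFunc)(void); //pointer to a type-B function
typedef int *(*pFunc)(pVoidFunc);
pFunc b[5];
Note that typedef is treated as a storage-class keyword (like static, auto, mutable, register, etc.), so it cannot be combined with them:
typedef static int STCINT;
>> compile error: "more than one storage class"
Assembly programmers wanted something that looks like a high-level language yet keeps the flexibility and the room for tricks of assembly; that is how C was born.
Before that, operating systems were written entirely in assembly. To keep that flexibility, Kernighan, Dennis Ritchie, and others introduced pointer operations when creating C.
Before going further into pointers, note the following: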
int main()
{
int *p=NULL;
cout << sizeof(p)<<endl; // = 4
return 0;
}
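A pointer to int has size 4, which means addresses are not 64-bit. After checking with gcc -v, I realized I had been using a 32-bit compiler and promptly switched to a 64-bit one.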
>> gcc -v
...
Target: x86_64-w64-mingw32
...
int main()
{
int *p=NULL;
cout << sizeof(p)<<endl;
int a[8]={0};
cout<<"\nsizeof(a)"<<sizeof(a)<<"\n" // = 32
<<"\nsizeof(a[0])"<<sizeof(a[0])<<"\n" // = 4 a[0]=*(a+0)
<<"\nsizeof(*a)"<<sizeof(*a)<<"\n"<<endl; // = 4
int b[2][3]={0};
cout<<"\nsizeof(b)"<<sizeof(b)<<"\n" // = 24
<<"\nsizeof(b[0])"<<sizeof(b[0])<<"\n" // = 12 b[0]=*(b+0)
<<"\nsizeof(*b)"<<sizeof(*b)<<"\n" // = 12
<<"\nsizeof(b[1][1])"<<sizeof(b[1][1])<<"\n" // = 4 b[1][1]=*(b[1]+1)
<<"\nsizeof(*b[1])"<<sizeof(*b[1])<<"\n"<<endl; // = 4
cout <<a<<"=?="<<&a<<endl; // 0x61fdf0=?=0x61fdf0
cout <<b[1]<<"=?="<<b[0]<<":"<<b[1]-b[0]<<endl; // 0x61fddc=?=0x61fdd0:3
b[0][1]=1;
b[1][0]=2;
cout <<*b[1]<<endl; // *b[1] = 2 = *(*(b+1)), which shows that '[]' binds tighter than '*'
return 0;
}
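A 2-D array is structured as an array {row 1, row 2, …}, where each row is itself an array {element 1, element 2, …} (so b[i] behaves like a pointer to row i), and the whole 2-D array occupies one contiguous block of memory. The video calls this kind of array a nested array.
The following example shows that a 2-D array that is not declared directly is not allocated contiguously. The video calls this kind of array a multi-level array.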
int get_ele(int arr[3][3],size_t r,size_t c)
{
return arr[r][c];
}
int main()
{
int a1[3]={1,2,3},a2[3]={4,5,6},a3[3]={7,8,9};
int *(arr[3])={a1,a2,a3};
cout<<arr[2]<<"\n" // 0x61fdfc
<<arr[1]<<"\n" // 0x61fe08
<<arr[2]-arr[1]<<"\n" // -3
<<(char*)(arr[2])-(char*)(arr[1])<<endl; // -12
int arr2[3][3]={0};
cout<<arr2[2]<<"\n" // 0x61fdc8
<<arr2[1]<<"\n" // 0x61fdbc
<<arr2[2]-arr2[1]<<"\n" // 3
<<(char*)(arr2[2])-(char*)(arr2[1])<<endl; // 12
get_ele(arr2,1,2);
return 0;
}
get_ele:
leaq (%rdx,%rdx,2), %rdx # rdx = rdx + 2*rdx = 3*rdx
leaq 0(,%rdx,4), %rax # rax = 4 * rdx
addq %rax, %rcx # rcx = rcx + 12 * r
movl (%rcx,%r8,4), %eax # eax = *(rcx + j * 4)
ret
Nested arrays and multi-level arrays are completely different at the assembly level:
A nested array is contiguous, so a single memory reference fetches the element:
$NA[index][digit]=*(NA+index \cdot col \cdot sizeof(elem) + digit \cdot sizeof(elem))$
A multi-level array needs two memory references: the first fetches the row pointer, the second fetches the element:
$MA[index][digit]=*(*(MA+index \cdot sizeof(pointer)) + digit \cdot sizeof(elem))$
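Both formulas can be spelled out with explicit byte arithmetic and compared against ordinary indexing. The following is a rough sketch: `nested_get`, `multilevel_get` and `kCols` are names chosen for this example, and the column count is fixed at 3 to match the arrays used above.

```cpp
#include <cstddef>
#include <cstring>

constexpr std::size_t kCols = 3;

// Nested array: one memory reference at
//   base + r * kCols * sizeof(int) + c * sizeof(int)
int nested_get(int (*na)[kCols], std::size_t r, std::size_t c) {
    const char* base = reinterpret_cast<const char*>(na);
    int v;
    std::memcpy(&v, base + r * kCols * sizeof(int) + c * sizeof(int), sizeof v);
    return v;
}

// Multi-level array: first load the row pointer, then load the element.
int multilevel_get(int* const* ma, std::size_t r, std::size_t c) {
    const char* table = reinterpret_cast<const char*>(ma);
    int* row;
    std::memcpy(&row, table + r * sizeof(int*), sizeof row);   // 1st memory reference
    int v;
    std::memcpy(&v, reinterpret_cast<const char*>(row) + c * sizeof(int),
                sizeof v);                                      // 2nd memory reference
    return v;
}
```

Calling `nested_get(arr2, r, c)` on an `int arr2[3][3]` and `multilevel_get(arr, r, c)` on an `int *arr[3]` gives the same results as `arr2[r][c]` and `arr[r][c]`, but makes the one-load versus two-load difference visible.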
struct A
{
int a[4];
int i;
struct A *next;
};
void set_val(struct A* pA, int val)
{
while(pA)
{
pA->a[pA->i]=val;
pA=pA->next;
}
}
set_val: # rcx := pA, rax := i, edx := val
L7:
testq %rcx, %rcx
je L5
movslq 16(%rcx), %rax # 4 byte value and do sign extension
movl %edx, (%rcx,%rax,4)
movq 24(%rcx), %rcx # note that next is at offset 24 from the start of A
jmp L7
L5:
ret
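Note that next sits at offset 24 from the start of A because of data alignment: 4 padding bytes are left after i so that next is 8-byte aligned. Modern machines typically fetch memory in 64-byte blocks; if an object is misaligned and straddles two 64-byte blocks, the system needs many extra steps to "stitch the data together". On x86, misalignment only makes the program slower; on other systems it may cause a memory error outright.
Rather than forcing the compiler not to align with __attribute__((packed)), define structs with the "large" fields first and the "small" fields last to reduce the wasted padding bytes.
Alignment applies only to primitive data types (char, short, int, …); "aggregate data" (arrays, structs, …) does not exist at the assembly level.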
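The effect of field order on padding can be observed with `sizeof` and `offsetof`. A minimal sketch follows; the struct names `SmallFirst` and `LargeFirst` are invented here, and the concrete sizes mentioned afterwards assume a typical x86-64 ABI, so they may differ on other platforms.

```cpp
#include <cstddef>

// Small fields around a large one: padding is needed before d and after e.
struct SmallFirst {
    char c;
    double d;
    char e;
};

// Large field first: the two chars share the tail, so less padding is wasted.
struct LargeFirst {
    double d;
    char c;
    char e;
};

// The struct from the notes above, for checking where `next` lands.
struct A {
    int a[4];
    int i;
    struct A* next;
};

// Bytes that hold no field data in each layout.
constexpr std::size_t kPayload = sizeof(double) + 2 * sizeof(char);
constexpr std::size_t kWastedSmallFirst = sizeof(SmallFirst) - kPayload;
constexpr std::size_t kWastedLargeFirst = sizeof(LargeFirst) - kPayload;
```

On a typical x86-64 compiler `sizeof(SmallFirst)` is 24 and `sizeof(LargeFirst)` is 16, and `offsetof(A, next)` is 24, in line with the `movq 24(%rcx), %rcx` seen in the assembly.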
Scalar Operation
addss = add for scalar single precision
SIMD(single instruction multiple data)Operation
addps = add for pack single precision
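Integers use the regular registers and floating-point values use the XMM registers; one could also use XMM registers for everything to speed things up, at the cost of some waste. When passing arguments, integer and floating-point arguments can be interleaved: each one takes the next free register of its own class, in order.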
double double_test(float *pd, float Val)
{
float x=*(pd);
if(Val>x)
*pd=x+Val;
return x;
}
double_test:
movss (%rcx), %xmm0
comiss %xmm0, %xmm1
jbe .L6
addss %xmm0, %xmm1
movss %xmm1, (%rcx)
.L6:
cvtss2sd %xmm0, %xmm0
ret
miscellaneous topics
Current 64-bit systems use only 47 address bits, about $2^{47}\approx1.4\times10^{14}$ bytes, i.e. 128 TiB.
Terabytes << Petabytes << Exabytes (total information accumulated by Google) << Zettabytes (total information of all humankind)
HEX Address | Content | note |
---|---|---|
00007FFFFFFFFFFF | Stack | 0x7FFFFFFFFFFF - 0x7FFFFF7FFFFF = $2^{23}$ = 8M |
00007FFFFF7FFFFF | | |
 | Shared Libraries | Executable machine instructions, read only |
 | Heap | Dynamically allocated as needed by malloc(), calloc(), new; addresses grow upward |
 | Data | Statically allocated data: global vars, static vars, string constants |
 | Text | Executable machine instructions, read only |
400000 | | |

The table follows the 2015 slides; in the 2020 slides, the shared libraries sit at the highest addresses, above the stack.
On CentOS, ulimit -a shows all the system limits:
[root@VM-4-10-centos]# ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 14819
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 100001
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 14819
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Observing the allocated addresses:
#include <iostream>
#include <cstdlib>
using namespace std;
typedef int (*P1)(void);
typedef void (*P2)(void);
int global_arr[20]={0};
int global_var=0;
void stack_frame_obs()
{
int local_arr[20]={0};
int *pc=(int*)malloc(20*sizeof(int)); // room for 20 ints, so &pc[20] is one past the end
int *pc_last=&pc[20];
cout<<"stack_local_arr:\t"<<&local_arr<<"\n"
<<"stack_local_arr_last:\t"<<&local_arr[20]<<"\n"
<<"stack_pc:\t"<<pc<<"\n"
<<"stack_pc_last:\t"<<pc_last<<endl;
return;
}
void memory_obs(void)
{
int local_val=0;
int local_arr[20]={0};
// one-past-the-end addresses of the arrays
int *local_arr_last=&local_arr[20];
int *global_arr_last=&global_arr[20];
int *pc=(int*)malloc(20*sizeof(int)); // room for 20 ints
int *pc_last=&pc[20];
stack_frame_obs();
cout<<"local_val:\t"<<&local_val<<"\n"
<<"local_arr:\t"<<&local_arr<<"\n"
<<"local_arr_last:\t"<<local_arr_last<<"\n"
<<"pc:\t"<<pc<<"\n"
<<"pc_last:\t"<<pc_last<<"\n"
<<"global_var:\t"<<&global_var<<"\n"
<<"global_arr:\t"<<global_arr<<"\n"
<<"global_arr_last:\t"<<global_arr_last<<endl;
return;
}
int main()
{
memory_obs();
P1 pfunc1=main;
P2 pfunc2=memory_obs;
cout<<"Main:\t"<<(void *)pfunc1<<endl;
cout<<"Memory_obs:\t"<<(void *)pfunc2<<endl;
return 0;
}
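Results on CentOS
[root@VM-4-10-centos]# ./a.out
stack_local_arr: 0x7ffe44a783d0
stack_local_arr_last: 0x7ffe44a78420 // stack addresses are always higher than heap addresses
stack_pc: 0x8abf10 // higher than pc: the heap is allocated on demand, with increasing addresses
stack_pc_last: 0x8abf60
local_val: 0x7ffe44a7849c
local_arr: 0x7ffe44a78440 // higher than stack_local_arr: stack frames grow toward lower addresses
local_arr_last: 0x7ffe44a78490
pc: 0x8abeb0 // within the heap and within a stack frame, array element addresses increase
pc_last: 0x8abf00 // < stack_pc= 0x8abf10
global_var: 0x6021d0
global_arr: 0x602180
global_arr_last: 0x6021d0
Main: 0x400bde
Memory_obs: 0x4009f1 // text (executable instructions) is always at the lowest addresses
On Win64 the heap addresses are, surprisingly, higher than the stack addresses? Whatever.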
>>PS C:Users> .\a.exe
stack_local_arr: 0x61fcb0
stack_local_arr_last: 0x61fd00
stack_pc: 0xec1680
stack_pc_last: 0xec16d0
local_val: 0x61fdac
local_arr: 0x61fd50
local_arr_last: 0x61fda0
pc: 0xec1620
pc_last: 0xec1670
global_var: 0x408090
global_arr: 0x408040
global_arr_last: 0x408090
Main: 0x40182f
Memory_obs: 0x401656
Writing beyond the memory allocated for an array is a potential vulnerability.
Most of them come from (the culprit):
gets() was written in the 1970s, when UNIX had just been released and people did not yet think much about security.
// a sketch of how the Unix function gets() is implemented
char *gets(char *dest)
{
int c = getchar(); // EOF is an int; a char may not be large enough to hold it
char *p = dest;
while(c != EOF && c !='\n')
{
*p++ = c;
c = getchar();
}
*p='\0';
return dest;
}
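Others such as strcpy, strcat, scanf("%s"), sscanf, and fscanf have the same problem: they have no idea what the limit on the number of characters is. Typically, the return address is the first thing to be overwritten.
This is injection at the binary level, unlike SQL database injection.
finger user@host
The finger command used gets() to receive its input.
finger "exploit-code padding new-return-address", where exploit-code executed a root shell on the victim machine with a direct TCP connection to the attacker.
CERT (Computer Emergency Response Team) was founded as a result and made its home at CMU.
The AOL chat client had an injection vulnerability; AOL used the injection to test whether the PC was a Microsoft platform, with the aim of blocking MS. There were more than 10 skirmishes between MS and AOL.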
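For contrast with `gets()`, a reader that is told the buffer capacity can simply refuse to write past it. The sketch below is one possible shape of such a function, not a standard library API: `read_line_bounded` and the convenience wrapper `read_line_bounded_from_string` are hypothetical names, and extra characters on a long line are silently dropped (a design choice made here for brevity).

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Reads one line into dest, storing at most capacity-1 characters plus '\0'.
// Returns the number of characters stored. Assumption: capacity >= 1.
std::size_t read_line_bounded(std::istream& in, char* dest, std::size_t capacity) {
    std::size_t n = 0;
    int c = in.get();  // int, so that EOF can be told apart from any char value
    while (c != std::istream::traits_type::eof() && c != '\n') {
        if (n + 1 < capacity) {
            dest[n++] = static_cast<char>(c);  // keep only what fits
        }
        c = in.get();                          // still consume the rest of the line
    }
    dest[n] = '\0';
    return n;
}

// Convenience wrapper for trying the function out on a string.
std::size_t read_line_bounded_from_string(const std::string& text, char* dest,
                                          std::size_t capacity) {
    std::istringstream in(text);
    return read_line_bounded(in, dest, capacity);
}
```

In practice `fgets(buf, sizeof buf, stdin)` or `std::getline` into a `std::string` serve the same purpose; the point is only that the limit has to be passed in, which `gets()` has no way to do.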
Worms and Viruses
Protection
Suppose ret_orit is a library function
int ret_orit(int a,int b)
{
return a+b;
}
0000000000401596 <ret_orit>:
401596: 8d 04 11 lea (%rcx,%rdx,1),%eax
401599: c3 retq
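Gadget address = 0x401596; it performs %eax = %rcx + %rdx.
Interestingly, on x86 the ret instruction is the single byte 0xc3, so such fragments are easy to locate.
Suppose we always take the three bytes before 0xc3 to form an instruction: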
void ret_orit(int *p)
{
*p=0x11048d22;
return;
}
0000000000401596 <ret_orit>:
401596: c7 01 22 8d 04 11 movl $0x11048d22,(%rcx)
40159c: c3 retq
Gadget address = 0x401599; the three bytes 0x8d, 0x04, 0x11 likewise perform %eax = %rcx + %rdx.
“Just match the byte pattern of some existing code.”
Address | Content |
---|---|
stack | address of Gadget n code |
… | … |
%rsp | address of Gadget 1 code (used to be callee return address) |
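
Through a buffer overflow, the callee's return address and the stack slots after it are overwritten, in order, with gadget addresses. After jumping to a gadget and executing it, the gadget's final ret pops the next gadget address from the stack into %rip (advancing %rsp by 8), jumps again, rets again, … until the attack is complete.
A way to create an alias that lets you reference memory in different ways.
A union does not change the actual bits, only the way the bits are interpreted.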
#include <iostream>
using namespace std;
typedef union{
int a;
float b;
}i_a_f;
int main()
{
i_a_f t;
t.a=1;
float b=t.a;
cout<<"\nunion(int):"<<t.a // union(int):1
<<"\nunion(float):"<<t.b// union(float):1.4013e-45
<<"\ncast:"<<b; // cast:1
return 0;
}
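With a union it is easy to find out a machine's byte ordering.
Big endian: the most significant byte is at the lowest address.
Little endian: the least significant byte is at the lowest address (x86, ARM, iOS).
Bi-endian: can be configured either way.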
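A byte-order probe in the same spirit can be written as below. Note the swapped-in technique: this sketch uses `std::memcpy` instead of a union, because reading a union member other than the one last written is not well-defined in standard C++ (gcc tolerates it). The function names are made up for this example.

```cpp
#include <cstdint>
#include <cstring>

// Returns the byte stored at the lowest address of a 32-bit value.
unsigned char lowest_address_byte(std::uint32_t value) {
    unsigned char bytes[sizeof value];
    std::memcpy(bytes, &value, sizeof value);
    return bytes[0];
}

// Little endian: the least significant byte (0x04) comes first in memory.
bool is_little_endian() {
    return lowest_address_byte(0x01020304u) == 0x04;
}

// Big endian: the most significant byte (0x01) comes first in memory.
bool is_big_endian() {
    return lowest_address_byte(0x01020304u) == 0x01;
}
```

On an x86 machine `is_little_endian()` is expected to return true; the code itself makes no assumption about which answer it gets.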