4. Cache Lab

You can start this lab once you have finished lectures 11-12 of CMU's CSAPP course.
CSAPP course videos: https://search.bilibili.com/all?keyword=csapp&from_source=banner_search
Lab 4 download: http://csapp.cs.cmu.edu/3e/cachelab-handout.tar

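The task in this lab is clear: build your own cache system. Specifically:

Implement a cache simulator that, given a trace file, reports the resulting hits, misses, and evictions
Use the cache to speed up matrix transposition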

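This lab requires writing C code. CMU thoughtfully provides a crash course in basic C before the lab starts:
https://www.cs.cmu.edu/~213/activities/cbootcamp/cbootcamp_s19.pdf
It covers nearly all the C you need for the labs, and the slides also include plenty of runnable C code.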

(Valgrind installation guide: https://blog.csdn.net/SoaringLee_fighting/article/details/77925402)

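C review

Mind your code style; do not write sloppy code
Watch out for implicit type conversions
Watch out for undefined behavior
Watch out for memory leaks
Macros and pointer arithmetic are easy to get wrong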

Example 1

int foo(unsigned int u) {
    return (u > -1) ? 1 : 0;
}

Because u is an unsigned int, -1 is also converted to unsigned int for the comparison, so the test is effectively u > UINT_MAX and the function always returns 0.
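As a small illustration of the conversion described above, the sketch below spells out what the comparison effectively evaluates and shows one possible repair. The names foo_as_evaluated and foo_signed are hypothetical, and the repair assumes the intent was to test a signed value.

```c
#include <limits.h>

/* What the comparison in foo effectively does after the implicit
 * conversion: -1 becomes UINT_MAX, and nothing is greater than that. */
int foo_as_evaluated(unsigned int u) {
    return (u > UINT_MAX) ? 1 : 0;
}

/* Hypothetical repair: take a signed, wider type so -1 stays -1. */
int foo_signed(long long v) {
    return (v > -1) ? 1 : 0;
}
```

With the signed version, non-negative inputs return 1 and negative inputs return 0, which is presumably what the original author of foo had in mind.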

Example 2

int main() {
    int* a = malloc(100*sizeof(int));
    for (int i=0; i < 100; i++) {
        a[i] = i / a[i];
    }
    free(a);
    return 0;
}

None of the values in a are initialized here, so the behavior of main is undefined.

Example 3

int main() {
    char w[strlen("C programming")];
    strcpy(w, "C programming");
    printf("%s\n", w);
    return 0;
}

strlen returns the length without the trailing \0, so strcpy writes one byte past the end of w.
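One possible fix, sketched here under the assumption that a heap buffer is acceptable, is to reserve strlen + 1 bytes. copy_string is a hypothetical helper name, not something from the slides.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a heap copy of src, or NULL if allocation fails.
 * The +1 reserves room for the terminating '\0'. Caller frees. */
char *copy_string(const char *src) {
    size_t n = strlen(src) + 1;
    char *w = malloc(n);
    if (w == NULL)
        return NULL;
    memcpy(w, src, n);   /* copies the terminator as well */
    return w;
}
```

The same + 1 applies if you keep the stack array from the example instead of calling malloc.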

Example 4

struct ht_node {
    int key;
    int data;
};
typedef struct ht_node* node;
node makeNnode(int k, int e) {
    node curr = malloc(sizeof(node));
    curr->key = k;
    curr->data = e;
}

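Here node is defined as a pointer type, not a struct, so sizeof(node) is the size of a pointer rather than the size of struct ht_node. (The function also never returns curr.)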
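A corrected constructor might look like the sketch below. make_node is a hypothetical name, and the NULL check is an addition that the original example does not have.

```c
#include <stdlib.h>

struct ht_node {
    int key;
    int data;
};

/* sizeof(*curr) is the size of the struct itself, no matter how the
 * pointer type is spelled, so the allocation is always large enough. */
struct ht_node *make_node(int k, int e) {
    struct ht_node *curr = malloc(sizeof(*curr));
    if (curr == NULL)
        return NULL;
    curr->key = k;
    curr->data = e;
    return curr;
}
```

Writing sizeof(*curr) instead of naming the type keeps the allocation correct even if the typedef changes later.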

Example 5

char *strcdup(int n, char c){
    char dup[n+1];
    int i;
    for (i = 0; i < n; i++)
        dup[i] = c;
    dup[i] = '\0';
    char *A = dup;
    return A;
}

strcdup returns a pointer to memory allocated on the stack; once the function returns, the memory that A points to may be overwritten.
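A heap-based variant avoids the dangling pointer. The following is only a sketch: strcdup_heap is a hypothetical name, and it assumes the caller takes responsibility for calling free.

```c
#include <stdlib.h>

/* Builds a string of n copies of c on the heap, so the buffer
 * outlives the call. Returns NULL for n < 0 or on allocation failure. */
char *strcdup_heap(int n, char c) {
    if (n < 0)
        return NULL;
    char *dup = malloc((size_t)n + 1);
    if (dup == NULL)
        return NULL;
    for (int i = 0; i < n; i++)
        dup[i] = c;
    dup[n] = '\0';
    return dup;
}
```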

Example 6

#define IS_GREATER(a, b) a > b
inline int isGreater(int a, int b) {
    return a > b ? 1 : 0;
}
int m1 = IS_GREATER(1, 0) + 1;
int m2 = isGreater(1, 0) + 1;

IS_GREATER is a macro without parentheses, so m1 expands to 1 > 0 + 1, which evaluates to 0, whereas m2 is 2.
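The usual remedy is to parenthesize every macro argument and the whole expansion. A minimal sketch follows; IS_GREATER_SAFE and m1_safe are illustrative names.

```c
/* Parenthesize each argument and the expansion as a whole. */
#define IS_GREATER_SAFE(a, b) ((a) > (b))

/* Same expression shape as m1, now expanding to ((1) > (0)) + 1. */
int m1_safe(void) {
    return IS_GREATER_SAFE(1, 0) + 1;
}
```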

Example 7

#define NEXT_BYTE(a) ((char *)(a + 1));
int a1 = 54; // &a1 = 0x100
long long a2 = 42; // &a2 = 0x200
void* b1 = NEXT_BYTE(&a1);
void* b2 = NEXT_BYTE(&a2);

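Here b1 points to 0x104
Here b2 points to 0x208
The addition happens before the cast, so the pointer advances by the size of the pointed-to type (4 bytes for int, 8 for long long) rather than by one byte.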
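To really get the next byte, cast first and add afterwards. The sketch below measures both step sizes; NEXT_BYTE_FIXED, step_int, and step_fixed are hypothetical names used only for illustration.

```c
#include <stddef.h>

/* Cast first, then add: the pointer now steps by sizeof(char) == 1.
 * The stray trailing semicolon of the original macro is dropped too. */
#define NEXT_BYTE_FIXED(a) ((char *)(a) + 1)

/* Bytes covered by (a + 1) when a is an int*, as in the original macro. */
ptrdiff_t step_int(void) {
    int x = 54;
    return (char *)(&x + 1) - (char *)&x;
}

/* Bytes covered by the fixed macro, independent of the pointed-to type. */
ptrdiff_t step_fixed(void) {
    long long y = 42;
    return NEXT_BYTE_FIXED(&y) - (char *)&y;
}
```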

Keep every line of submitted code within 80 columns; this is checked by hand.

PART A Implementing a cache simulator

The steps are roughly as follows:

0. Read through the C style guide
http://www.cs.cmu.edu/~213/codeStyle.html

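1. Parse the arguments and save them.

The key to this step is getopt. The s, E, b, and t options must each be followed by a value, so each needs a trailing colon in the option string.
Build a globalArgs struct and store the parsed arguments in it for later use.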

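2. Write the code that reads the file line by line

The core is FILE and fgets. The slides cover both, and with a bit of Googling it should not be hard.
Next comes parsing each line: if the first character is I, just continue.
Otherwise split the line into the operation, the address, and the size (called block in the code) according to the trace format.
Then pass the address to tryToHitCache, the function from step 4.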
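The parsing step can be sketched with sscanf on a line that was read with fgets. This is an illustration only: parse_trace_line is a hypothetical helper, and the full program further down uses fscanf directly instead.

```c
#include <stdio.h>

/* Parses one trace line such as " L 10,4" or "I 0400d7d4,8".
 * Returns 1 if op, addr, and size were filled in and the line should
 * be simulated; returns 0 for instruction fetches or malformed lines. */
int parse_trace_line(const char *line, char *op,
                     unsigned long *addr, int *size) {
    if (sscanf(line, " %c %lx,%d", op, addr, size) != 3)
        return 0;
    if (*op == 'I')
        return 0;
    return 1;
}
```

Requiring exactly 3 converted fields rejects blank or truncated lines instead of simulating stale values.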

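3. Build the cache data structure from the arguments. Since the block data itself is never used, a two-dimensional array is enough.

Besides tracking which set an address belongs to, each set holds several lines, and each line has two fields: tag and valid.
On top of that, replacement must follow LRU: when a set is full, I evict the line with the smallest timestamp. For convenience I initialize every timestamp to 0 and increment it on each access, so timestamp = 0 means the line is not valid.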

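4. Implement the cache hit/miss logic

Split the memory address using the two parameters s and b into tag, set index, and block offset.
Then go to that set in the array and scan every line, a brute-force take on LRU.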
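The address split can also be written with masks instead of paired shifts. The sketch below is an alternative to what the full program does, not a copy of it; AddrParts and split_address are illustrative names, and it assumes a 64-bit unsigned long with s + b < 64.

```c
/* Illustrative container for the three address fields. */
typedef struct {
    unsigned long tag;
    unsigned long set;
    unsigned long offset;
} AddrParts;

/* Splits addr using s set-index bits and b block-offset bits.
 * Assumes s >= 0, b >= 0 and s + b < 64 so every shift is defined. */
AddrParts split_address(unsigned long addr, int s, int b) {
    AddrParts p;
    p.offset = addr & ((1UL << b) - 1);      /* low b bits */
    p.set = (addr >> b) & ((1UL << s) - 1);  /* next s bits */
    p.tag = addr >> (s + b);                 /* everything above */
    return p;
}
```

One reason to prefer masks: with s = 0 the set mask is simply 0, whereas a shift-left-then-right formulation would shift by the full word width, which is undefined in C.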

5. Add the -v option
If verbose is true in globalArgs, print the extra information the spec asks for.

Putting these ideas together, the complete code follows

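I hit a few snags while writing it, recorded here

Note that the argument passed in is s, so the number of sets must be computed with a shift, as in S = (1 << globalArgs.setNum) in the code
A two-dimensional array is awkward to initialize; use a one-dimensional array with calloc and compute the index yourself
To convert the address string (hexadecimal) into a number, use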

unsigned long addr;
sscanf(address, "%lx", &addr);

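strtok writes a terminator at every cut point while splitting a string, so the original string can no longer be used as before (it gets chopped).
The file path string comes from user arguments, so it needs a NULL check.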

/*
 * csim.c - implement a cache simulator
 */
#include "cachelab.h"
#include <getopt.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define ADDR_SIZE 64
typedef struct {
    char *filePath;
    int setNum;
    int lineNum;
    int blockNum;
    bool verbose;
} GlobalArgs;
GlobalArgs globalArgs;
static const char *optString = "hvs:E:b:t:";

/* 
 * setArgs - Read the arguments from user input,
 *           then store them into globalArgs.
 */
void setArgs(int argc, char **argv)
{
    int opt = getopt( argc, argv, optString );
    while( opt != -1 ) {
        switch( opt ) {
            case 'v':
                globalArgs.verbose = true; /* true */
                break;
            case 's':
                globalArgs.setNum = atoi(optarg);
                break;
            case 'E':
                globalArgs.lineNum = atoi(optarg);
                break;
            case 'b':
                globalArgs.blockNum = atoi(optarg);
                break;
            case 't':
                globalArgs.filePath = optarg;
                break;
            default:
                printf("wrong argument\n");
                break;
        }
        
        opt = getopt( argc, argv, optString );
    }
}
typedef struct {
    unsigned long tag;
    long timestamp; // 0 means not valid
} CacheLine;


/* 
 * tryToHitCache - Give a visiting address and check cache miss or hit
 *                 update the hit/miss count into res array
 */
void tryToHitCache(unsigned long address, CacheLine* cache, int res[3]) 
{
    static int timestamp = 0;
    int tagLen = (ADDR_SIZE - globalArgs.blockNum - globalArgs.setNum);
    int set = (int) ((address << tagLen) >> (tagLen + globalArgs.blockNum));
    unsigned long tag = (address >> (ADDR_SIZE - tagLen));
    int min = set * globalArgs.lineNum;
    for (int i = 0; i < globalArgs.lineNum; i++) {
        int idx = set * globalArgs.lineNum + i;
        if (cache[idx].tag == tag && cache[idx].timestamp != 0) {
            cache[idx].timestamp = ++timestamp;
            res[0]++;
            if (globalArgs.verbose) {
                printf(" hit");
            }
            return;
        } else if (cache[idx].timestamp == 0) {
            cache[idx].timestamp = ++timestamp;
            cache[idx].tag = tag;
            res[1]++;
            if (globalArgs.verbose) {
                printf(" miss");
            }
            return;
        }
        if (cache[idx].timestamp < cache[min].timestamp) {
            min = idx;
        }
    }
    cache[min].timestamp = ++timestamp;
    cache[min].tag = tag;
    res[1]++;
    res[2]++;
    if (globalArgs.verbose) {
        printf(" miss eviction");
    }
    return;
}
int main(int argc, char **argv)
{
    setArgs(argc,argv);
    if (globalArgs.filePath == NULL) {
        printf("File path not found. Please use -t {filepath}\n");
        exit(0);
    }
    FILE *fp = fopen(globalArgs.filePath,"r");
    if (fp == NULL) {
        printf("File not found %s\n", globalArgs.filePath);
        exit(0);
    }

    int res[3];//0:hit, 1: miss, 2 : evict
    for (int i = 0; i < 3; i++) {
        res[i] = 0;
    }

    int S = (1 << globalArgs.setNum);
    int E = globalArgs.lineNum; /* -E is the line count, not an exponent */
    CacheLine *cache = 
        calloc(S * E, sizeof(CacheLine));

    char opt;
    unsigned long address;
    int block;
    while (fscanf(fp," %c %lx,%d", &opt, &address, &block) > 0) {
        if (opt == 'I') continue;
        //int hit = res[0], miss = res[1], evict = res[2];
        if (globalArgs.verbose) {
            printf("%c %lx,%d",opt,address,block);
        }
        tryToHitCache(address,cache,res);
        if (opt == 'M') {
            tryToHitCache(address,cache,res);
        }
        if (globalArgs.verbose) {
            printf("\n");
        }
        
    }
    fclose(fp);
    free(cache);
    
    printSummary(res[0], res[1], res[2]);
    return 0;
}

Verifying there are no memory leaks

image.png

Verifying no line exceeds 80 columns

[hadoop@hadoop000 cachelab-handout]$ wc -L csim.c
76 csim.c

PART B Optimizing matrix transpose with locality

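My thought process: I knew from the start that the general idea is the blocking technique described here: http://csapp.cs.cmu.edu/public/waside/waside-blocking.pdf

For 32 * 32, I wrote a first naive version.
We know b = 5, i.e. 32-byte blocks, and there are also 32 sets, so the cache is 1KB in total.
An int takes 4 bytes, so one cache line holds 8 ints. Based on this, I set the block size to 8.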

See the figure below:

32 x 32

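The numbers here show which cache set each value maps to. Row 9 conflicts with row 1, but if we split the matrix into 8 x 8 blocks we can exploit the cache as much as possible (when the first element is read, the following 7 are already loaded into the cache).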
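The set map in the figure can be reproduced with a few lines of arithmetic. The sketch below assumes the matrix starts at an address that maps to set 0 and that int is 4 bytes; set_of_element is a hypothetical helper.

```c
/* Cache set that element [row][col] of an int matrix with ncols
 * columns maps to, for the s = 5, b = 5 direct-mapped cache.
 * Assumption: the matrix base address maps to set 0. */
int set_of_element(int row, int col, int ncols) {
    unsigned long byte_off =
        ((unsigned long)row * ncols + col) * sizeof(int);
    return (int)((byte_off >> 5) & 31);   /* drop 5 offset bits, keep 5 */
}
```

For a 32-column matrix, rows 0 and 8 (the 1st and 9th) both start in set 0, which is the conflict described above; for a 64-column matrix the same happens already at rows 0 and 4.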

The naive code is as follows

image.png
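Since the code above survives only as an image, here is a minimal sketch of what a naive blocked transpose can look like. It is not the author's exact code: transpose_blocked and the bsize parameter are illustrative, while the argument order follows the lab's int A[N][M], int B[M][N] convention.

```c
/* Naive blocked transpose: B[j][i] = A[i][j], one bsize x bsize tile
 * at a time. Assumes N and M are multiples of bsize. */
void transpose_blocked(int M, int N, int A[N][M], int B[M][N], int bsize) {
    for (int bi = 0; bi < N; bi += bsize)
        for (int bj = 0; bj < M; bj += bsize)
            for (int i = bi; i < bi + bsize; i++)
                for (int j = bj; j < bj + bsize; j++)
                    B[j][i] = A[i][j];
}
```

For the 32 * 32 case discussed here it would be called with bsize = 8.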

But full marks require fewer than 300 misses.

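Looking closer, what I overlooked is that every read of A is paired with a write to B. Because the matrix size is an exact multiple of the cache size, when I read A and then write B, B's cache line evicts A's, so the next read of A misses again (set 0 now holds B's line, and the tag does not match).
So let's analyze exactly where the 343 comes from.
First, the handout gives a very important hint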

image.png

That is, the diagonal and off-diagonal cases behave differently.
Consider first the off-diagonal blocks, as in the figure below

image.png

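Here reads of A and writes to B never evict each other, because their set indices differ. So off-diagonal blocks have a miss rate of 1/8.
What about the diagonal?
It is not hard to see that every row read of A incurs 2 misses. When B's columns are first built, all 8 B accesses miss; after that, filling each column of B costs 2 misses.
So within one 8*8 block we need 64 reads of A and 64 writes to B. With 8 rows, A has 16 misses, B has 8 misses at the start, and then 2 misses for each of the remaining 7 columns.

That is 38 misses in total.
Diagonal blocks make up 1/4 of all blocks
which gives the following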

image.png

which is very close to the measured 343.
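As a check on that estimate, the arithmetic can be written out. This is a reconstruction, since the original formula exists only as an image; it assumes 16 blocks of 8 x 8 with 4 on the diagonal, 38 misses per diagonal block, and 128 x 1/8 = 16 misses per off-diagonal block.

```latex
\[
\underbrace{4 \times 38}_{\text{diagonal blocks}}
+ \underbrace{12 \times 16}_{\text{off-diagonal blocks}}
= 152 + 192 = 344
\]
```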
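So how to optimize? The problem comes from the diagonal blocks, where A and B evict each other. If I first load 8 values of A into registers (local variables) and then write them from the registers into B, the problem is largely avoided. (So that is why the assignment allows 12 local variables.)
4 loop variables plus 8 registers for storage
The improved code is as follows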

image.png
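The improved version above is also only available as an image, so here is a hedged sketch of the idea: read a full row of the tile into locals, then write it out as a column of B. transpose_rowbuf is a hypothetical name, and this sketch gets by with 3 loop variables plus 8 temporaries rather than the 4 + 8 mentioned in the text.

```c
/* 8 x 8 blocked transpose that buffers one row of the A tile in
 * locals before touching B. Assumes N and M are multiples of 8. */
void transpose_rowbuf(int M, int N, int A[N][M], int B[M][N]) {
    int bi, bj, i;
    int t0, t1, t2, t3, t4, t5, t6, t7;
    for (bi = 0; bi < N; bi += 8) {
        for (bj = 0; bj < M; bj += 8) {
            for (i = bi; i < bi + 8; i++) {
                /* Read the whole row of the A tile first. */
                t0 = A[i][bj];     t1 = A[i][bj + 1];
                t2 = A[i][bj + 2]; t3 = A[i][bj + 3];
                t4 = A[i][bj + 4]; t5 = A[i][bj + 5];
                t6 = A[i][bj + 6]; t7 = A[i][bj + 7];
                /* Then write one column of the B tile. */
                B[bj][i] = t0;     B[bj + 1][i] = t1;
                B[bj + 2][i] = t2; B[bj + 3][i] = t3;
                B[bj + 4][i] = t4; B[bj + 5][i] = t5;
                B[bj + 6][i] = t6; B[bj + 7][i] = t7;
            }
        }
    }
}
```

Because all 8 reads of a row finish before any write to B, a B line that evicts the A line on a diagonal block no longer forces A to be reloaded in the middle of the row.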

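Before running it, let's do the calculation and see whether the result matches our expectation.
A has 1 miss per row.
B still starts with 8 misses, and then costs just 1 miss per column.
So in a diagonal block, the 128 accesses incur only 23 misses
1/4 * 23/128 + 3/4 * 1/8 = 0.13867 -> 0.13867 * 2048 = 284 (2048 = 2 * 32 * 32 accesses in total)
The measured result is 287, which is also very close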

image.png

64 * 64

Armed with the techniques above, let's move on and see what block size suits 64 * 64 best.
First, draw the cache set index map.

64 x 64

Because the width changed, row 5 now conflicts with row 1, so the original 8 x 8 certainly won't work. What about 4 x 4?
So I wrote the following naive code and computed its miss count

image.png

It turns out to exceed the required 1300 by quite a lot. Let's verify

image.png

About what I calculated.

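To get under 1300 I referred to a solution found online; I really could not have come up with this trick myself.
Even after reading the code, working out the miss count cost me 3 sheets of scratch paper and 2 hours.
Be prepared to put in a long effort too, if you really want to understand how these few lines of code dance through the cache. Code first. But do not copy it: study how it moves the data, then implement it yourself.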

image.png
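The code above survives only as an image. For readers who cannot see it, the following is a sketch of the widely circulated two-step approach that matches the description in the next paragraphs: park half of each row in a B region that is already cached, then swap it into place. It is not necessarily the author's exact code, and transpose_64 is a hypothetical name.

```c
/* 8 x 8 tiles handled as four 4 x 4 quadrants.
 * Assumes N and M are multiples of 8. Uses 11 locals. */
void transpose_64(int M, int N, int A[N][M], int B[M][N]) {
    int i, j, k;
    int a0, a1, a2, a3, a4, a5, a6, a7;
    for (i = 0; i < N; i += 8) {
        for (j = 0; j < M; j += 8) {
            /* Step 1: top 4 rows of the A tile. The left half goes to
             * its final place; the right half is parked in B's
             * top-right quadrant, which is already in the cache. */
            for (k = i; k < i + 4; k++) {
                a0 = A[k][j];     a1 = A[k][j + 1];
                a2 = A[k][j + 2]; a3 = A[k][j + 3];
                a4 = A[k][j + 4]; a5 = A[k][j + 5];
                a6 = A[k][j + 6]; a7 = A[k][j + 7];
                B[j][k] = a0;         B[j + 1][k] = a1;
                B[j + 2][k] = a2;     B[j + 3][k] = a3;
                B[j][k + 4] = a4;     B[j + 1][k + 4] = a5;
                B[j + 2][k + 4] = a6; B[j + 3][k + 4] = a7;
            }
            /* Step 2: one column of A's bottom half per iteration. */
            for (k = j; k < j + 4; k++) {
                /* 2-1: pick up the values parked in B. */
                a4 = B[k][i + 4]; a5 = B[k][i + 5];
                a6 = B[k][i + 6]; a7 = B[k][i + 7];
                /* 2-2: read a column of A's bottom-left quadrant. */
                a0 = A[i + 4][k]; a1 = A[i + 5][k];
                a2 = A[i + 6][k]; a3 = A[i + 7][k];
                /* 2-3: B's top-right quadrant gets its final values. */
                B[k][i + 4] = a0; B[k][i + 5] = a1;
                B[k][i + 6] = a2; B[k][i + 7] = a3;
                /* 2-4: parked values move to B's bottom-left. */
                B[k + 4][i] = a4;     B[k + 4][i + 1] = a5;
                B[k + 4][i + 2] = a6; B[k + 4][i + 3] = a7;
                /* 2-5: bottom-right quadrant, copied directly. */
                B[k + 4][i + 4] = A[i + 4][k + 4];
                B[k + 4][i + 5] = A[i + 5][k + 4];
                B[k + 4][i + 6] = A[i + 6][k + 4];
                B[k + 4][i + 7] = A[i + 7][k + 4];
            }
        }
    }
}
```

The sub-step labels 2-1 to 2-5 in the comments are meant to line up with the five sub-steps counted in the miss analysis.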

The code has 2 steps; the first is roughly as shown in the figure below.

image.png

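By my calculation, finishing the first step, i.e. going from A to what B looks like in the figure, costs 11 misses:
4 for A, 4 for B at the start, then 1 more each time after that, so 7 for B in total.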

Now for my scratch work

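First, the diagonal blocks.
Following the blank lines in the code above, it can be split into 5 sub-steps. The first 2 load the green part of B and the red part of A into registers.
The last three are shown in the figure below, in order: red first, then green, then yellow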

image.png

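Now the miss count within one loop iteration (the figure marks each sub-step, what is in the cache after it, and how many misses it costs)
STEP 2-5 has 2 misses; the figure is wrong there.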

image.png

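So for a diagonal 8 * 8 block, STEP 1 costs 8 misses. In STEP 2, the first iteration costs (4+1+0+1+2) = 8 misses,
because at that point the cache holds rows 1-4 of B.
The next 3 iterations cost (0+1+0+1+2) * 3 = 12 misses in total,
because the cache then holds rows 5-8 of A, so STEP 2-1 of the next iteration saves its 4 misses.
That makes 28 misses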

Now the off-diagonal blocks.

image.png

The first iteration costs 5 misses; each of the following 3 iterations adds only one more miss, in STEP 2-5.
So STEP 1 costs 8 misses and STEP 2 costs 8 misses:
16 misses

Verifying: the difference is indeed 16

image.png

So the final miss count is 28 * 8 (8 diagonal blocks) + 16 * 56 (56 off-diagonal blocks) = 1120

The actual result is close

image.png

PART C 61*67

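Since the dimensions are not multiples of 32, the addresses of A and B should not line up exactly onto the same sets. So we only need to enumerate block sizes and pick the one with the fewest misses.
I tried block sizes from 2 * 2 to 32 * 32
The results are as follows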

Function 0 (31 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 0 (Transpose submission): hits:5064, misses:3115, evictions:3083

Function 1 (31 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 1 (Transpose submission): hits:5531, misses:2648, evictions:2616

Function 2 (31 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 2 (Transpose submission): hits:5754, misses:2425, evictions:2393

Function 3 (31 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 3 (Transpose submission): hits:5883, misses:2296, evictions:2264

Function 4 (31 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 4 (Transpose submission): hits:5955, misses:2224, evictions:2192

Function 5 (31 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 5 (Transpose submission): hits:6027, misses:2152, evictions:2120

Function 6 (31 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 6 (Transpose submission): hits:6061, misses:2118, evictions:2086

Function 7 (31 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 7 (Transpose submission): hits:6087, misses:2092, evictions:2060

Function 8 (31 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 8 (Transpose submission): hits:6103, misses:2076, evictions:2044

Function 9 (31 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 9 (Transpose submission): hits:6090, misses:2089, evictions:2057

Function 10 (31 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 10 (Transpose submission): hits:6122, misses:2057, evictions:2025

Function 11 (31 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 11 (Transpose submission): hits:6131, misses:2048, evictions:2016

Function 12 (31 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 12 (Transpose submission): hits:6183, misses:1996, evictions:1964

Function 13 (31 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 13 (Transpose submission): hits:6158, misses:2021, evictions:1989

Function 14 (31 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 14 (Transpose submission): hits:6187, misses:1992, evictions:1960

Function 15 (31 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 15 (Transpose submission): hits:6229, misses:1950, evictions:1918

Function 16 (31 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 16 (Transpose submission): hits:6218, misses:1961, evictions:1929

Function 17 (31 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 17 (Transpose submission): hits:6200, misses:1979, evictions:1947

Function 18 (31 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 18 (Transpose submission): hits:6177, misses:2002, evictions:1970

Function 19 (31 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 19 (Transpose submission): hits:6222, misses:1957, evictions:1925

Function 20 (31 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 20 (Transpose submission): hits:6220, misses:1959, evictions:1927

Function 21 (31 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 21 (Transpose submission): hits:6251, misses:1928, evictions:1896

Function 22 (31 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 22 (Transpose submission): hits:6164, misses:2015, evictions:1983

Function 23 (31 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 23 (Transpose submission): hits:6072, misses:2107, evictions:2075

Function 24 (31 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 24 (Transpose submission): hits:5977, misses:2202, evictions:2170

Function 25 (31 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 25 (Transpose submission): hits:5881, misses:2298, evictions:2266

Function 26 (31 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 26 (Transpose submission): hits:5779, misses:2400, evictions:2368

Function 27 (31 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 27 (Transpose submission): hits:5684, misses:2495, evictions:2463

Function 28 (31 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 28 (Transpose submission): hits:5584, misses:2595, evictions:2563

Function 29 (31 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 29 (Transpose submission): hits:5588, misses:2591, evictions:2559

Function 30 (31 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 30 (Transpose submission): hits:5589, misses:2590, evictions:2558

In the end I chose a block size of 23 and got 1912.

image.png
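The block-size sweep described in this part needs a transpose that tolerates ragged edges, since neither 61 nor 67 is a multiple of most block sizes. A minimal sketch follows, assuming nothing beyond plain blocking with clipped inner loops; transpose_clipped and bsize are illustrative names, not the author's code.

```c
/* Blocked transpose for sizes that are not multiples of the block
 * size: the inner loops are clipped at the matrix edge. */
void transpose_clipped(int M, int N, int A[N][M], int B[M][N], int bsize) {
    for (int bi = 0; bi < N; bi += bsize)
        for (int bj = 0; bj < M; bj += bsize)
            for (int i = bi; i < bi + bsize && i < N; i++)
                for (int j = bj; j < bj + bsize && j < M; j++)
                    B[j][i] = A[i][j];
}
```

In an actual submission the block size would be a constant, and each candidate from 2 to 32 would be registered as its own function so the driver can report misses per candidate, as in the output above.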
