MIT 6.824 Lab 1: MapReduce

Lab 1 implements a MapReduce library on a single machine. Since there is no real distributed environment, we can only implement a sequential version and substitute parallel execution on one machine for genuinely distributed execution.

First, a look at the overall flow. The main function lives in src/main/wc.go, and we supply the map and reduce functions ourselves. This lab's application is word count, so mapF and reduceF are:

func mapF(filename string, contents string) []mapreduce.KeyValue {
    // Your code here (Part II).
    // Split contents into words: a word is a maximal run of letters.
    f := func(c rune) bool {
        return !unicode.IsLetter(c)
    }
    s := strings.FieldsFunc(contents, f)
    kv := make([]mapreduce.KeyValue, 0)
    for _, k := range s {
        // Emit <word, "1"> for every occurrence; reduceF sums the counts.
        kv = append(kv, mapreduce.KeyValue{Key: k, Value: "1"})
    }
    return kv
}

func reduceF(key string, values []string) string {
    // Your code here (Part II).
    // values holds the "1" strings emitted by mapF for this word;
    // summing them yields the word count.
    count := 0
    for _, v := range values {
        vv, _ := strconv.Atoi(v)
        count = count + vv
    }
    return strconv.Itoa(count)
}

There are two ways to test the code: Sequential and Distributed. Let's implement the Sequential path first; it lets you check whether your helper functions are implemented correctly.


Sequential

func Sequential(jobName string, files []string, nreduce int,
    mapF func(string, string) []KeyValue,
    reduceF func(string, []string) string,
) (mr *Master) {
    mr = newMaster("master")
    go mr.run(jobName, files, nreduce, func(phase jobPhase) {
        switch phase {
        case mapPhase:
            for i, f := range mr.files {
                doMap(mr.jobName, i, f, mr.nReduce, mapF)
            }
        case reducePhase:
            for i := 0; i < mr.nReduce; i++ {
                doReduce(mr.jobName, i, mergeName(mr.jobName, i), len(mr.files), reduceF)
            }
        }
    }, func() {
        mr.stats = []int{len(files) + nreduce}
    })
    return
}

The main function calls Sequential, implemented at src/mapreduce/master.go:60. The arguments are jobName, files (the input files), nreduce (the number of reduce tasks), and the map and reduce functions.

Sequential first creates a master and then starts a goroutine running run. run's third argument is a function: when the phase is mapPhase it calls doMap for every input file, and when it is reducePhase it calls doReduce for every reduce task. The fourth argument is the finish function.

func (mr *Master) run(jobName string, files []string, nreduce int,
    schedule func(phase jobPhase),
    finish func(),
) {
    mr.jobName = jobName
    mr.files = files
    mr.nReduce = nreduce

    fmt.Printf("%s: Starting Map/Reduce task %s\n", mr.address, mr.jobName)

    schedule(mapPhase)
    schedule(reducePhase)
    finish()
    mr.merge()

    fmt.Printf("%s: Map/Reduce task completed\n", mr.address)

    mr.doneChannel <- true
}

As run shows, mapPhase is executed first and reducePhase second, and the schedule argument is the serial version defined in Sequential. Finally, run merges the output files of the reduce tasks and sends true into mr.doneChannel (the mr.Wait() called from main blocks until a value appears on mr.doneChannel).
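
For reference, mr.Wait() in master.go is essentially just a blocking receive on that channel, roughly:

// Wait blocks until the currently scheduled work has completed.
func (mr *Master) Wait() {
    <-mr.doneChannel
}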

Now let's look at how doMap and doReduce are implemented.

func doMap(
    jobName string, // the name of the MapReduce job
    mapTask int, // which map task this is
    inFile string,
    nReduce int, // the number of reduce tasks that will be run ("R" in the paper)
    mapF func(filename string, contents string) []KeyValue,
) {
    // Read the whole input split and run the user-supplied map function on it.
    data, err := ioutil.ReadFile(inFile)
    if err != nil {
        log.Fatal("doMap: read ", inFile, " error: ", err)
    }
    mapkv := mapF(inFile, string(data))

    // Create one intermediate file (and JSON encoder) per reduce task.
    // os.Create truncates any stale file left over from a previous run.
    enc := make([]*json.Encoder, nReduce)
    for i := 0; i < nReduce; i++ {
        filename := reduceName(jobName, mapTask, i)
        f, err := os.Create(filename)
        if err != nil {
            log.Fatal("doMap: create ", filename, " error: ", err)
        }
        defer f.Close()
        enc[i] = json.NewEncoder(f)
    }

    // Partition each key/value pair by hashing its key, and append it
    // (JSON-encoded, one object per line) to the matching intermediate file.
    for _, kv := range mapkv {
        r := ihash(kv.Key) % nReduce
        enc[r].Encode(&kv)
    }
}

doMap takes one input file and must produce nReduce intermediate files. It first runs mapF on the file contents to get a slice of key/value pairs. It then creates the nReduce intermediate files (closing them with defer); for each key/value pair it hashes the key modulo nReduce to pick a partition and appends the pair, JSON-encoded, to that partition's file.
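
The intermediate and final-merge file names come from the reduceName and mergeName helpers in src/mapreduce/common.go. As a reminder of the naming scheme, they look roughly like this in the lab skeleton (check your own common.go for the exact strings):

func reduceName(jobName string, mapTask int, reduceTask int) string {
    return "mrtmp." + jobName + "-" + strconv.Itoa(mapTask) + "-" + strconv.Itoa(reduceTask)
}

func mergeName(jobName string, reduceTask int) string {
    return "mrtmp." + jobName + "-res-" + strconv.Itoa(reduceTask)
}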

type ByKey []KeyValue

func (a ByKey) Len() int           { return len(a) }
func (a ByKey) Swap(i, j int)      { a[i], a[j] = a[j], a[i] }
func (a ByKey) Less(i, j int) bool { return a[i].Key < a[j].Key }

func doReduce(
    jobName string, // the name of the whole MapReduce job
    reduceTask int, // which reduce task this is
    outFile string, // write the output here
    nMap int, // the number of map tasks that were run ("M" in the paper)
    reduceF func(key string, values []string) string,
) {
    // Collect this reduce task's key/value pairs from all nMap intermediate files.
    kvslice := make([]KeyValue, 0)
    for i := 0; i < nMap; i++ {
        filename := reduceName(jobName, i, reduceTask)
        f, err := os.Open(filename)
        if err != nil {
            log.Fatal("doReduce: open ", filename, " error: ", err)
        }
        dec := json.NewDecoder(f)
        for {
            var kv KeyValue
            if err := dec.Decode(&kv); err != nil {
                break // io.EOF: this intermediate file is fully consumed
            }
            kvslice = append(kvslice, kv)
        }
        f.Close()
    }

    // Sort by key so that all values belonging to the same key are adjacent.
    sort.Sort(ByKey(kvslice))

    ff, err := os.Create(outFile)
    if err != nil {
        log.Fatal("doReduce: create ", outFile, " error: ", err)
    }
    defer ff.Close()
    enc := json.NewEncoder(ff)

    // Walk the sorted slice; whenever the key changes, hand the accumulated
    // values for the previous key to reduceF and emit one output record.
    lenkv := len(kvslice)
    value := make([]string, 0)
    for i := 0; i < lenkv; i++ {
        if i != 0 && kvslice[i].Key != kvslice[i-1].Key {
            enc.Encode(KeyValue{kvslice[i-1].Key, reduceF(kvslice[i-1].Key, value)})
            value = make([]string, 0)
        }
        value = append(value, kvslice[i].Value)
    }
    if len(value) != 0 {
        enc.Encode(KeyValue{kvslice[lenkv-1].Key, reduceF(kvslice[lenkv-1].Key, value)})
    }
}

For each reduce task, doReduce reads the corresponding intermediate file produced by each of the nMap map tasks. It first computes the nMap intermediate file names, reads each file, and JSON-decodes every key/value record into kvslice. It then sorts the slice by key, which requires the three sort.Interface methods Len, Swap and Less defined on ByKey above. Finally it walks the sorted slice, passes each key's list of values to reduceF, and writes the resulting key/value pair to the output file.
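
As a side note, if you would rather not define the ByKey type, sort.Slice (available since Go 1.8) gives the same ordering with a closure:

sort.Slice(kvslice, func(i, j int) bool {
    return kvslice[i].Key < kvslice[j].Key
})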


Distributed

Sequential execution is straightforward; next comes Distributed. The distributed version is a bit more involved, and the main work is implementing the schedule function. Let's walk through the flow.

func Distributed(jobName string, files []string, nreduce int, master string) (mr *Master) {
    mr = newMaster(master)
    mr.startRPCServer()
    go mr.run(jobName, files, nreduce,
        func(phase jobPhase) {
            ch := make(chan string)
            go mr.forwardRegistrations(ch)
            schedule(mr.jobName, mr.files, mr.nReduce, phase, ch)
        },
        func() {
            mr.stats = mr.killWorkers()
            mr.stopRPCServer()
        })
    return
}

Distributed first calls startRPCServer() to start an RPC server:

func (mr *Master) startRPCServer() {
    rpcs := rpc.NewServer()
    rpcs.Register(mr)
    os.Remove(mr.address) // only needed for "unix"
    l, e := net.Listen("unix", mr.address)
    if e != nil {
        log.Fatal("RegstrationServer", mr.address, " error: ", e)
    }
    mr.l = l

    // now that we are listening on the master address, can fork off
    // accepting connections to another thread.
    go func() {
    loop:
        for {
            select {
            case <-mr.shutdown:
                break loop
            default:
            }
            conn, err := mr.l.Accept()
            if err == nil {
                go func() {
                    rpcs.ServeConn(conn)
                    conn.Close()
                }()
            } else {
                debug("RegistrationServer: accept error", err)
                break
            }
        }
        debug("RegistrationServer: done\n")
    }()
}

It registers an RPC server and listens on an address. Because the "distributed" mode of this lab actually runs in parallel on a single machine, the network type is "unix" and the address is a local Unix-domain socket. A goroutine then loops forever accepting connections, and each accepted connection is served in its own goroutine via ServeConn(conn). So there is always a local goroutine listening for requests.
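
The counterpart of this server is the call helper used throughout the lab (including in schedule and killWorkers below): it dials a Unix-domain socket, issues one synchronous RPC, and returns false on any error. This is roughly what src/mapreduce/common_rpc.go does, shown here for context:

func call(srv string, rpcname string, args interface{}, reply interface{}) bool {
    c, errx := rpc.Dial("unix", srv)
    if errx != nil {
        return false
    }
    defer c.Close()

    err := c.Call(rpcname, args, reply)
    if err == nil {
        return true
    }

    fmt.Println(err)
    return false
}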

Then, like Sequential, Distributed starts a goroutine running run. This time the schedule closure (run's third argument) creates a channel and spawns a goroutine running forwardRegistrations.

func (mr *Master) forwardRegistrations(ch chan string) {
    i := 0
    for {
        mr.Lock()
        if len(mr.workers) > i {
            // there's a worker that we haven't told schedule() about.
            w := mr.workers[i]
            go func() { ch <- w }() // send without holding the lock.
            i = i + 1
        } else {
            // wait for Register() to add an entry to workers[]
            // in response to an RPC from a new worker.
            mr.newCond.Wait()
        }
        mr.Unlock()
    }
}

forwardRegistrations' job is to push every worker that joins mr.workers onto the channel created above, so the channel acts as a queue of available workers.
Note the line go func() { ch <- w }(): ch is an unbuffered channel, so a send blocks until someone receives from it. Doing the send inline would block forwardRegistrations while it still holds the master's lock, so the send runs in its own goroutine to keep the parent from blocking.
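
The entries in mr.workers that forwardRegistrations is waiting for come from the Register RPC each worker sends when it starts. In the lab skeleton the handler looks roughly like this: it appends the worker's address and wakes forwardRegistrations through the condition variable.

func (mr *Master) Register(args *RegisterArgs, _ *struct{}) error {
    mr.Lock()
    defer mr.Unlock()
    debug("Register: worker %s\n", args.Worker)
    mr.workers = append(mr.workers, args.Worker)

    // Tell forwardRegistrations() that there's a new workers[] entry.
    mr.newCond.Broadcast()
    return nil
}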

The schedule closure then invokes our own schedule function once for mapPhase and once for reducePhase.

func schedule(jobName string, mapFiles []string, nReduce int, phase jobPhase, registerChan chan string) {
    var ntasks int
    var n_other int // number of inputs (for reduce) or outputs (for map)
    switch phase {
    case mapPhase:
        ntasks = len(mapFiles)
        n_other = nReduce
    case reducePhase:
        ntasks = nReduce
        n_other = len(mapFiles)
    }

    fmt.Printf("Schedule: %v %v tasks (%d I/Os)\n", ntasks, phase, n_other)

    // All ntasks tasks have to be scheduled on workers. Once all tasks
    // have completed successfully, schedule() should return.
    //
    // Your code here (Part III, Part IV).
    //
    // var done sync.WaitGroup
    // for count := 0; count < ntasks; count++{
    //  var w, filename string
    //  w = <- registerChan
    //  if phase == mapPhase{
    //      filename = mapFiles[count]
    //  }else{
    //      filename = ""
    //  }
    //  done.Add(1)
    //  go func(count int){
    //      defer done.Done()
    //      call(w, "Worker.DoTask", &DoTaskArgs{jobName, filename, phase, count, n_other}, new(struct{}))
    //      println("Start ...")
    //      registerChan <- w
    //      println("Stop ...")
    //  }(count)
    // }    
    // done.Wait()
    
    // Worker-pool approach: a buffered channel holds the ids of every task
    // that still needs to run. Each registered worker gets its own goroutine
    // that repeatedly pulls a task id, RPCs the worker, and either marks the
    // task done or puts the id back if the worker failed.
    var done sync.WaitGroup
    done.Add(ntasks)
    ch_task := make(chan int, ntasks) // buffered, so re-enqueueing a failed task never blocks
    for i := 0; i < ntasks; i++ {
        ch_task <- i
    }
    go func() {
        for {
            w := <-registerChan // a newly registered worker is available
            go func(w string) {
                for {
                    task_id := <-ch_task
                    var filename string
                    if phase == mapPhase {
                        filename = mapFiles[task_id]
                    } else {
                        filename = "" // reduce tasks read the intermediate files; no input file name
                    }
                    ok := call(w, "Worker.DoTask", &DoTaskArgs{jobName, filename, phase, task_id, n_other}, new(struct{}))
                    if ok == false {
                        // Worker failure: hand the task back so another worker retries it.
                        ch_task <- task_id
                    } else {
                        done.Done()
                    }
                }
            }(w)
        }
    }()
    done.Wait()
    fmt.Printf("Schedule: %v done\n", phase)
}

Two possible implementations are shown above; the uncommented one is better, and the commented-out one is buggy.
The second approach uses a sync.WaitGroup. It sets the WaitGroup counter to ntasks and creates a buffered task channel into which every task id is pushed. A goroutine then loops forever: each time the worker channel yields an idle worker, it spawns a goroutine dedicated to that worker, which keeps pulling unfinished task ids from the task channel and executing them. The master executes a task by calling the worker's Worker.DoTask method over RPC; the RPC plumbing is already in place, since mr started its RPC server earlier and each worker's server is set up by the lab's RunWorker code, so schedule only needs to issue call(). If a call fails, the task id is pushed back onto the task channel so another worker can retry it, which is how worker failures (Part IV) are handled.
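
On the worker side, DoTask just dispatches to doMap or doReduce according to the phase. A simplified sketch of what src/mapreduce/worker.go does (the real version also records per-worker statistics and checks parallelism, so treat this as an outline):

func (wk *Worker) DoTask(arg *DoTaskArgs, _ *struct{}) error {
    fmt.Printf("%s: given %v task #%d on file %s (nios: %d)\n",
        wk.name, arg.Phase, arg.TaskNumber, arg.File, arg.NumOtherPhase)

    switch arg.Phase {
    case mapPhase:
        doMap(arg.JobName, arg.TaskNumber, arg.File, arg.NumOtherPhase, wk.Map)
    case reducePhase:
        doReduce(arg.JobName, arg.TaskNumber, mergeName(arg.JobName, arg.TaskNumber),
            arg.NumOtherPhase, wk.Reduce)
    }

    fmt.Printf("%s: %v task #%d done\n", wk.name, arg.Phase, arg.TaskNumber)
    return nil
}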

The commented-out version is worse because it iterates over the tasks and grabs an available worker for each one, then hands the worker back with registerChan <- w after the task finishes. registerChan is unbuffered, so once the loop has dispatched the last tasks, nothing is receiving anymore: the remaining goroutines block on that send, their deferred done.Done() never runs, and done.Wait() hangs forever. You could work around this with go func() { registerChan <- w }(), but it still feels awkward; the MapReduce model reads more naturally as "each worker runs a long-lived loop and takes work as it arrives", which is what the second version does and is closer to the distributed algorithm.
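
For completeness, the minimal fix for that first version is to hand the worker back asynchronously, so the task goroutine can return and its deferred done.Done() actually fires; something like the following inside the loop (still ignoring worker failures, as the original did):

done.Add(1)
go func(count int) {
    defer done.Done()
    call(w, "Worker.DoTask", &DoTaskArgs{jobName, filename, phase, count, n_other}, new(struct{}))
    // Hand the worker back without blocking this goroutine, so the deferred Done() runs.
    go func() { registerChan <- w }()
}(count)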

Then, just as in Sequential, after the map and reduce phases complete, the finish function is called; in the distributed case it first has to killWorkers:

func (mr *Master) killWorkers() []int {
    mr.Lock()
    defer mr.Unlock()
    ntasks := make([]int, 0, len(mr.workers))
    for _, w := range mr.workers {
        debug("Master: shutdown worker %s\n", w)
        var reply ShutdownReply
        ok := call(w, "Worker.Shutdown", new(struct{}), &reply)
        if ok == false {
            fmt.Printf("Master: RPC %s shutdown error\n", w)
        } else {
            ntasks = append(ntasks, reply.Ntasks)
        }
    }
    return ntasks
}

killWorkers shuts each worker down via an RPC to Worker.Shutdown and collects the number of tasks each worker processed; afterwards the master shuts down its own RPC server with stopRPCServer.
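
stopRPCServer works by sending one last RPC to the master itself: the Master.Shutdown handler closes the shutdown channel and the listener, which is what breaks out of the accept loop in startRPCServer above. Roughly, per the lab skeleton:

// Shutdown is an RPC method that shuts down the Master's RPC server.
func (mr *Master) Shutdown(_, _ *struct{}) error {
    debug("Shutdown: registration server\n")
    close(mr.shutdown)
    mr.l.Close() // causes the Accept in startRPCServer to fail
    return nil
}

func (mr *Master) stopRPCServer() {
    var reply ShutdownReply
    ok := call(mr.address, "Master.Shutdown", new(struct{}), &reply)
    if ok == false {
        fmt.Printf("Cleanup: RPC %s error\n", mr.address)
    }
    debug("cleanupRegistration: done\n")
}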


This was my first time writing Go, so the code is probably clumsy and some terminology may be off; I still have plenty to learn.
