Original article: A Brief Discussion of Multi-Threading
INTRODUCTION
Program with multiple threads, become familiar with benchmarking tools for performance testing, and understand how a search engine works.
OBJECTIVES
- Work with multiple threads
- Become familiar with benchmarking
- Become more familiar with the structure of a search engine
SUMMARY
This week we're working with threads. One step of creating a search engine is to index all the items we might want to search. This gets around the problem of having to search through everything each time we query.
Here's an example: I want to get a list of movies of a particular genre, but if I don't have an index, the only thing I can do is load up a list of movies and then search through them (possibly sorting by genre first, then binary searching). But if I've indexed the list of movies by genre, then I already have that list. Searching through just the list of comedies, say, becomes much more efficient.
STEP 1: RUN THE BENCHMARKER
At this point, you've implemented the indexing system. We want to analyze our system to determine how many resources it's using. There are a few things to measure:
- How long does it take to index? (time)
- How big is the resulting index? (space)
- How fast/slow is it to make a query and get results? (time)
We basically have two ways to index: with either the TypeIndex (where we use SetOfMovies, indexing the movies by a field and storing them in a LinkedList of Movie structs), or the OffsetIndex (where we use MovieSet, and store just the DocIds and RowIds of where the movies are stored on disk).
Start by clarifying your hypotheses:
- Which is bigger, the TypeIndex or the OffsetIndex? Why?
- Which will be faster to query, the OffsetIndex or the TypeIndex? For what kinds of queries? Why?
- Which will be faster to build, the TypeIndex or the OffsetIndex? Why?
There are clearly queries that will be answered faster with one index than the other:
- Which movies have 'Seattle' in the title?
- How many Action movies are there in the database?
To make the query comparison fair, let's define some sample queries:
- What movies have Seattle in the title and are Rom/Coms?
- ??
- Are there any other queries you're curious about?
Use the benchmarker to do some of these tests. Feel free to modify the code to measure different things to help explore your hypotheses.
An example invocation:
adrienne@virtualbox$ ./benchmarker data_small/
[snip]
Created DocIdMap
Cur Real Mem: 700 Peak Real Mem: 700 Cur VirtMem: 4376 PeakVirtMem: 4376
Building the OffsetIndex
Parsing and indexing files...
22197 entries in the index.
Took XXXX seconds to execute.
Memory usage:
Cur Real Mem: XXXX Peak Real Mem: XXXX Cur VirtMem: XXXX PeakVirtMem: XXXX
Destroyed OffsetIndex
Cur Real Mem: XXXX Peak Real Mem: XXXX Cur VirtMem: XXXX PeakVirtMem: XXXX
Building the TypeIndex
10 entries in the index.
Took XXXX seconds to execute.
Memory usage:
Cur Real Mem: XXXX Peak Real Mem: XXXX Cur VirtMem: XXXX PeakVirtMem: XXXX
Destroyed TypeIndex
Cur Real Mem: XXXX Peak Real Mem: XXXX Cur VirtMem: XXXX PeakVirtMem: XXXX
Destroyed DocIdMap
Cur Real Mem: XXXX Peak Real Mem: XXXX Cur VirtMem: XXXX PeakVirtMem: XXXX
FINISHING STEP 1
When you're done with this step:
- Do a short write-up of your results above. Put it in report.md.
STEP 2: BUILD A MULTITHREADED INDEXER
Now that we have an indexer, we're going to multi-thread the FileParser.
Changes to make:
- In FileParser.c, fill in the methods ParseTheFiles_MT and IndexTheFile_MT.
- When ParseTheFiles_MT is called, it should create 5 new threads to run the function IndexTheFile_MT.
- In IndexTheFile_MT, the threads will share access to the iterator providing filenames, and to the index. You'll have to prevent the threads' accesses from colliding by using a mutex (in the readings: Mutexes).
FINISHING STEP 2
When you're done with this step:
- Commit your code! And push it to GitHub, if you haven't done so yet.
- Run your code on the provided "data" directory. The data directory was distributed as part of a7; either move that directory to the a9 directory, or point your program at a7/data.
- This will take a while.
- Rerun the benchmarker, calling the multi-threaded parser. Compare the results to the results from Part 1.
- Write this up in report.md.
EXPERIMENT
Now that your program works, let's do some experimentation. We discussed in class how a computer with a single processor won't benefit much from multiple threads. We can measure the difference, though.
Further, since we're running a virtual machine, we can tell it how many cores to use.
If your machine has multiple cores, it will show on this screen. Crank up the cores and see how performance improves (that is, run the benchmarker again)!
SUBMISSION
Submit your assignment by pushing your code to GitHub and tagging your last commit with the tag a9-final. Submit the tag on Blackboard by the due date, 9pm (Pacific Time).
Things to turn in:
- All files in the a9 repo, with ParseTheFiles_MT and IndexTheFile_MT.
- report.md, where you write up the points noted above
- Include a short section on how you might design the system differently to support some queries you might want to make.
Things to remember
- Make sure your code compiles and runs, even if it crashes or isn't complete. Comment out whatever you need to in order to make sure it compiles.
- Unable to compile = losing 50% of your grade.
- Commit early and often.
- If you mistakenly tag your repo as a9-final and find yourself having to create a new final tag, delete the old tag first. We don't want to figure out which of a9-final, a9-final-a, a9-final2, ... is the one we should be grading.
- If there's no tag, automatic 0.
- Valgrind should report that there are no memory leaks (all heap has been freed) as well as NO ERRORS.
- If you have issues (you got sick, changed jobs, moved, ...) that impact your ability to finish on time, TELL US BEFORE THE DEADLINE. You will lose at least 50% of your points if you don't talk to us first.
RELATED BACKGROUND
HOW DOES THIS ALL FIT TOGETHER?
In the bigger scheme of things, we're putting pieces together to build a search engine. A search engine has a few components: a file crawler, a file processor, an indexer, and a query processor. The file crawler starts in one place and traverses to find other files. On the web, this means starting at one page and following all the links on it; in a file system, it means starting in one directory and traversing through the directory structure until all files have been found. The file processor takes each of those files and finds significant words or concepts in the file. The indexer (essentially) builds a data structure that makes it easy to find documents that have similar topics, or that contain a certain word. The query processor lets us work with that data structure to get results based on what we ask for.
In this homework, I'm providing you with a working movie indexer. We have a bunch of movie data, and you're going to put the pieces together to index the titles of these movies. Once the movies have been indexed, when we query the index for a word, we will list out all of the movie titles that contain that word.
The trick is that we have about 5 million movies, videos and TV shows that we want to index. We can't just load them all into memory and search through them! We have to index them.
What does it mean to index them? Well, we start with a bunch of files (about 1000), and each file contains about 10,000 records. We open each file, get the words in the title of the movie from a row, and save which file and which row that record is in. Then, when we want to find a movie record whose title contains a given word, we look up that word in our index, which gives us the filename and row number of a movie with that word; we open that file, go to that row, and print it out.
So what are you building in this assignment? First, you're going to write the part that does the file processing. This should be straightforward; you've done this a bunch already. But we have so much data to deal with that we're going to throw a few threads at it at a time. Because your virtualbox is set up to have a single core, you won't see much improvement at first. But your machine likely has multiple cores available. In the very last step, you'll run your code with multiple cores, and you will hopefully see an improvement. Finally, if you haven't already, you will be required to use the INTERFACES provided for this assignment. If you use the provided interfaces, the work should be very straightforward; it will be much more complicated if you try to break the abstractions.
WHAT DOES OUR SYSTEM LOOK LIKE?
This week, you'll primarily be working with the FileParser (in orange), as well as the MainSystem.
We have a unique DocSet for every unique word in every movie title. The DocSet maintains a list of DocIds (which documents contain a movie with this word) and a list of rows in each document (which row the specific movie is in).
When we ask the index for a set of SearchResults for a given word, a list of docId-rowId pairs is returned. Then we use that information to look up the specific row in the file and return that row to the customer.
This is the overall flow:
- The FileCrawler starts at a given directory and puts all the files in the DocIdMap, which assigns a unique ID to that file.
- The FileParser gets all the files from the DocIdMap, reads in each row (into an application-specific data structure called a Movie) and gives that to the MovieIndex.
- The MovieIndex chooses how to index the Movie data structure, storing the information using the DocSet.
- When all the files have been indexed, the QueryProcessor gets a request from a customer (e.g. "Give me all movies with SISTER in the title"), asks the MovieIndex for the relevant records, and returns them to the customer.