COURSEWORK 1 FOR INF2B (ADS THREAD, 2018-19)

代写INF2B留学生作业、代做Java编程语言作业、Edinburgh作业代做、代写Java课程设计作业

COURSEWORK 1 FOR INF2B (ADS THREAD, 2018-19)

STRING MATCHING

Submission Deadlines: The coursework consists of two parts (of a different nature) relating to

one problem. As shown below these have separate deadlines, you cannot swap the parts round

or submit either part later than the stated deadline (unless you have permission to do so from the

yeare organiser).

Part 1 consists of Exercise 1 on page 4, §2.1 and Exercise 2 on page 9, §2.2; these involve

fairly straightforward analysis.

Submit Part 1 by: 4.00PM, 25 February 2019.

Part 2 consists of the tasks in §3; these are all software related.

Submit Part 2 by: 4.00PM, 08 March 2019.

See §4 at the end of this document for instructions on how to submit.

§1. Introduction

In this practical we will consider two methods of searching for all occurrences of a given pattern

within some text (both given as strings). For example we might have a long piece of English text

about Scotland and we wish to search for all occurrences of “Edinburgh” within it. Our aim is to

return the offset position of all occurrences, if there are no occurrences then we indicate this by

a special convention discussed below.

The practical has several aims:

1. Practice at asymptotic analysis (with guidance).

2. Careful implementation of one algorithm (an implementation of the other algorithm is

supplied).

3. Carry out timing experiments in order to:

(a) Compare the algorithm that you implement with one that is supplied and decide the

point from which one is more efficient than the other (i.e., the overheads are overweighed

by the advantages).

(b) Determine the constant for the asymptotic analysis as it applies to your particular

implementation.

You can concentrate on Part 1 (up to and including §2) first and then on the rest. However it

would be a good idea to read the entire document through initially (without going into great

detail) so that you know what is expected for the complete coursework. Note that although the

document is fairly long the tasks you are required to carry out are quite modest. The emphasis

here is on understanding the problem well and so plenty of discussion has been included.

1

The practical requires you to submit various things. Each of these has some appropriate

marks assigned. It would however be a serious error to assume that by carrying out only these

tasks you have obtained the most (or indeed the main) benefit from the practical. Throughout

the text you are prompted to think about various matters that are not part of the submission. You

should consider these carefully along with possibly other issues as they occur to you. In other

words avoid the pitfall of just going through the motions in a mechanical fashion; in any case this

is not possible for most parts. The practical is issued to help you develop your understanding of

the material, the marks are a by-product (of course an important one) and far from being its only

rationale.

Notes

Good Scholarly Practice: Please remember the University requirement as regards all assessed

work for credit. Details and advice about this can be found at:

http://web.inf.ed.ac.uk/infweb/admin/policies/academic-misconduct

and links from there. Note that, in particular, you are required to take reasonable measures to

protect your assessed work from unauthorised access. For example, if you put any such work

on a public repository then you must set access permissions appropriately (generally permitting

access only to yourself, or your group in the case of group practicals).

Following instructions: The various parts of this practical place certain requirements with a

penalty if these are not followed. No negotiation of any kind will be entered into where the

requirements are not followed. The point of the requirements is to ensure that in doing the

various parts you practice and demonstrate understanding of the appropriate areas of the course.

What you submit must be considered work; imagine that your work was being given to you and

others as a sample answer. If you, or others, would find it unclear and unhelpful then your work

needs further revision.

Regrettably the various requirements and penalties can appear somewhat censorious and limiting.

This is not the intention, indeed much of what is stated can be seen in the much more

positive light of reassurance that the various tasks have quite simple answers. Please be assured

that plenty of variation is possible even when following the requirements, i.e., there isn’t necessarily

a unique way to do and present things clearly, concisely and correctly.

Obtaining Help: If, after careful study and deep thought, you need help then use the forum for

questions that you will find on the link at:

http://www.inf.ed.ac.uk/teaching/courses/inf2b/coursework/cwk1.html.

Note that this forum is for questions on this practical only. Any questions must be posted here

and not sent by email. The TA for this thread will monitor the forum at various times. You should

allow at least 2 days for a response.

2

§2. String matching

Our discussion is based on the chapter on string matching from the book Introduction to Algorithms

by Cormen, Leiserson, Rivest and Stein. We are interested in strings over some fixed

alphabet (which we need not worry about at this stage, e.g., it could be ASCII characters). We

are given some text which is simply a string over the alphabet. We will represent the text by

means of an array T which we will assume is indexed starting with 1 (this keeps some of the operations

more straightforward1). Our convention will be to think of the text as the array T[1 ..n],

i.e., the array has n entries and T[i] gives us the ith character of the text. We are also given a

pattern which is simply another string over our alphabet. We will always represent the pattern

by an array P[1 ..m]. If m>n then the problem is trivial as there can be no occurrences of the

pattern within the text. We will therefore assume throughout that m ? n.

Before going any further let us look at a simple example. Suppose the text is bacababa and the

pattern is bab, so that n = 8 and m = 3. We can think of their representations diagrammatically

as follows:

text T b a c a b a b a

pattern P b a b

Clearly the pattern occurs in the the text starting at position 5 but nowhere else.

We will adopt the following convention. We say that the pattern P occurs in the text with

shift (or offset) s if and only if

P[1 ..m] = T[s + 1 ..s + m]. (1)

The equation is to be understood as stating that P[i] = T[s + i], for 1 ? i ? m. One advantage

of this definition is that it can be expressed as P[t..m] = T[s + t..s + m] where t is the base

address of our arrays (so in this handout we have t = 1 and in Java t = 0). The point is that the

offset is independent of t.

Our problem is to find all shifts s such that equation (1) holds. We will return these shifts as a

list in increasing order. Thus the pattern occurs in the text if and only if the list is not empty. So

for our example we would return a list of one cell containing the number 4, i.e., the shift (note

that the shift is one less than the position at which a pattern occurs in the text since we number

array locations from 1 onwards [what happens if we number them from 0?]). We can think of our

list as a queue, though we only ever add items to it (of course subsequent use of the list would

remove items to find out where the pattern occurs though we will not do this).

§2.1 Naive algorithm

There is an obvious algorithm for our problem. We can think of it in terms of placing the pattern

under the text in successive positions (starting with 1) and testing to see if the pattern matches

the text immediately above it.

1Although you are probably used to starting arrays at 0 it is an important part of your training to be able to follow

and adapt to other conventions that are in use. Demonstrating this ability is part of this assignment

3

Naive-Matcher(T,P)

1. n T.length

2. m P.length

3. create a new empty queue S

4. for s 0 to n

m do

5a. if P[1 ..m] = T[s + 1 ..s + m] then enqueue(s, S)

Of course the test P[1 ..m] = T[s + 1 ..s + m] has to be implemented as a loop, this has not

been shown above for the sake of clarity. Line 5a should be seen as shorthand for the following

code:

5. matches TRUE

6. for i 1 to m do

7. if P[i] 6= T[s + i] then

8. matches FALSE

9. exit for loop

10. if matches then enqueue(s, S)

The lines have been numbered so that this can be slotted into the preceding pseudocode and in

your analysis assume that this has been done so you can refer to the line numbers as given

We have also shown S as a parameter to enqueue for the sake of clarity, if implemented

in Java then we make the appropriate method call. Another point to note is that, as mentioned

above, our convention here is to number arrays starting at 1 whereas Java starts at 0. We will

discuss this below, for now let us not worry about the implementation language.

Exercise 1. For item 1 you must use the O(·) notation along with any appropriate properties as

discussed in Lecture Note 2. You may not use named constants as part of the analysis. For item 2

you could use named constants but in fact this is not necessary, the ?(·) notation enables things to

be expressed simply and well. Finally, note that for full marks you must justify each expression

or claim with reference to the algorithm. Any answers that do not follow these requirements will

be awarded 0.

Let TNM(n, m) denote the worst case runtime of the algorithm Naive-Matcher on inputs of

size n and m respectively.

1. Prove that TNM(n, m) = O((n

m + 1)m). Note that the additive constant +1 is needed

to take care of the case when m = n. [15%]

2. Prove that TNM(n, m) = ?((n

m + 1)m). Thus you must produce inputs of size n,

m that cause the algorithm to have runtime as claimed. You cannot fix either of n or m.

(HINT: There are very simple examples.) [10%]

3. Is it true to say that TNM(n, m) = ?((n

m + 1)m)? Justify your answer briefly. [5%]

Note: The second part of the exercise is a good opportunity to underline the important point

that for many algorithms the worst case runtime for given input size(s) does not happen for all

possible inputs but only for some appropriate ones. As an example here, if we take the text

4

to consist entirely of the character a and the pattern to consist of anything starting with the

character b then the runtime is O(n). We saw the same point in connection with the simple linear

search algorithm.

§2.2 The Knuth-Morris-Pratt algorithm

Let us look again at the simple example. In trying to see if there is a match of the pattern with

the text starting at position 1, the first two characters match but the third fails. Our knowledge of

the text can be pictured as:

text T b a ? ...

pattern P b a b

(It is true that we looked at the next character in the text but we want to relate our knowledge

of the text to the pattern, taking the extra character into account does not help in the following

discussion.) This tells us immediately that it is not worth trying a match of the pattern with the

text starting at position 2, since the first character of the pattern fails to match its second character.

On the other hand it is worth trying position 3. This observation (and its generalisation below) is

crucial for the development of the algorithm.

Consider the general situation where we are comparing the pattern against the text starting at

position s+ 1 (recall that s is the shift from position 1 of the text). We have a match exactly up to

position q of the pattern (i.e., position s+q of the text). We allow q = 0 to mean there is no match

at all (this makes sense, q is the number of characters matched). We have found a complete match

of the pattern in the text if and only if q = m otherwise there is a mismatch with the character at

position q + 1 of the pattern. Whatever is the case we now know that T[s+ 1 ..s+q] = P[1 ..q].

Given this knowledge, we can decide just from the pattern if it makes sense to try matching it

with the text starting at position s + 2. We do this by comparing the pattern with itself thus:

a1 a2 a3 a4 ... aq

a1 a2 a3 ... aq1

We see that it is only worth trying a match with the text at the next position if and only if the

match above is successful (this is the same as saying that T[s+2 ..s+q] = P[1 ..q1]).

Suppose

that this fails, i.e., the shift of the pattern does not match with the pattern up to position q. We

can again decide just from the pattern if it is worth trying at the next position. Our diagram is:

a1 a2 a3 a4 ... aq

a1 a2 ... aq2

Here we need T[s + 3 ..s + q] = P[1 ..q

2].

The preceding discussion shows that what we need is the least shift s0 > s such that

T[s0 + 1 ..s + q] = P[1 ..s + q

s0

].

5

We can rephrase this as: the least shift s0 > s such that

T[s0 + 1 ..s + q] = P[1 ..k],

where s0 + k = s + q. We can also write the equation as T[s0 + 1 ..s0 + k] = P[1 ..k] [why?].

The crucial point here is that we can pre-compute these shifts from the pattern alone (recall that,

by assumption, T[s0 + 1 ..s + q] consists of the first q characters of the pattern). An equivalent

interpretation is to find the largest k

we can eliminate s0 altogether and write the equation as

T[s + q

k + 1 ..s + q] = P[1 ..k]. (?)

(It is important to get the indices right but don’t let this disguise the main idea which is quite

simple and intuitive.) Note that such a k always does exist since k = 0 satisfies the condition

(vacuously) except that it might not be the largest. On the other hand the requirement k

means that the possible values of k are bounded from above and hence there is a maximum.

At this point it might be helpful to recall that our discussion is based on the assumption that

T[s + 1 ..s + q] = P[1 ..q]. It follows that the equation (?) above is equivalent to

P[q

k + 1 ..q] = P[1 ..k]

since T[s + q

k + 1 ..s + q] = P[q

k + 1 ..q]. In terms of a diagram the last displayed

equation states

a1 a2 ... aqk+1

aqk+2

... aq

a1 a2 ... ak

So once we fix q we can find k just from the pattern P.

We define the function ? : {1, 2,...,m} ! {0, 1,...,m

1} by requiring that ?(q) is the

value k as just described. (We do not need to define ?(0) because q = 0 means there was no

match at all and so we will automatically move on. ) Note that the function ? can be represented

by an array indexed from 1 to m and we will use this without further comment. From the

preceding discussion we know that

0 (q) < q

for all q. Make sure that you understand fully the meaning of this function and how it relates to

the discussion above. The following example will help.

The function for our simple pattern bab is given by:

q 1 2 3

(q) 0 0 1

Let us see exactly why these values are correct by looking at them in turn. Our discussion will

refer to T (to help you relate the calculations to their intended use) but remember that this is not

necessary for the calculations.

6

(1) : Here T[s + 1 ..s + 1] is the same as the pattern entries P[1 . . 1], i.e., T[s + 1] is the same

character as P[1]. We want the largest k such that P[1 ..k] = T[s0 + 1 ..s0 + k] where

s0 + k = s + 1. Since we require s0 > s this immediately means that s0 = s + 1 and k = 0.

Note that this argument applies to any pattern, we always have ?(1) = 0. It is exactly what

we would expect: we are told that the attempt to match the pattern with the text starting

at position s + 1 failed at position s + 2 (i.e., T[s + 2] 6= P[2]). It follows that the next

position at which we should try a match is at s + 2.

(2) : We have T[s + 1, s + 2] = P[1, 2]. Is it worth trying for a match at position s + 2 of the

text? Well such an attempt would look like:

Clearly this is bound to fail. Thus we set k = 0 and this means that s0 = s + 2, there is no

point in attempting a match at position s + 2 (recall that the next attempted match starts at

position s0 + 1).

(3) : We have T[s + 1, s + 2, s + 3] = P[1, 2, 3]. Is it worth trying for a match at position s + 2

or s + 3 of the text? An attempt at position s + 1 would look like:

b a b

b a

Clearly this would fail (we already knew this from the previous case of course). So now

we ask if it is worth trying for a match at position s + 3. Here the picture is:

b a b

b

Thus it is worth trying this (and from the information we have we cannot rule out a match).

So k = 1 and s0 = s + 2.

It is worth stressing again that although the discussion above has referred to parts of the text

input, all these are simply initial segments of the pattern. For any initial segment (i.e., value of q)

we have a value for ?(q). The references to the text were simply a matter of convenience and

for aiding understanding. As noted above, for a given value of q the value of(q) is simply the

largest k such that

P[q

k + 1 ..q] = P[1 ..k].

Note that for k = 0 both sides of this equation consist of an empty array so the equation holds.

In terms of a picture we require the pattern to look like:

Pattern up to position q a1 ··· aqk

a1 a2 ··· ak

Pattern shifted to position q

k + 1 a1 a2 ··· ak

7

In intuitive terms just think about keeping one copy of the pattern up to position q. We then

place another copy under it but shifted along by one position, ignore the last entry (it falls out

of scope), and check for a match. If not we shift the bottom along by one more place and again

check for a match with the characters directly above. We keep doing this till we either find a

match (in which case k is the number of entries matched) or the bottom copy falls off the right

hand end (in which case k = 0). You should now be able to see that the lengthy discussion above

for the pattern bab can be summarised as:

q 1 2 3

Pattern b a b

Shift 1 b a

Shift 2 b

(q) 0 0 1

Make sure you understand how we can work out ?(q) for each q straight from the array of the

pattern and its shifts. Start with q = 1, how much of the displayed array is relevant? For each q

cover up the irrelevant part, if any, and then look for the maximum number of characters that

match completely on any shift line with the relevant part of the pattern.

The method we have used so far to compute ? is not particularly efficient (as a function of m,

the length of the pattern P). The crucial aspect of the approach is that we can compute the

function very efficiently. Here is the pseudocode (recall that we hold the value of ?(i) in an array

which we will also call ?):

Compute-Prefix-Function(P)

1. m P.length

2. ?[1] 0

3. k 0

4. for q 2 to m do

5. while k > 0 and P[k + 1] 6= P[q] do k [k]

6. if P[k + 1] = P[q] then k k + 1

7. [q] k

8. return

Remember that in this section we assume arrays start at index 1 but of course in Java they start at

index 0 so you will need to make an adjustment from the pseudocode to the Java implementation

of Compute-Prefix-Function, see the note on p. 13. It is by no means obvious that this algorithm

is correct, indeed the runtime is also not obvious. A good way to begin the study of an algorithm

is to simulate it on simple examples. In order to help you with this here are two further examples

of patterns and their functions:

1. P = bbcbbabbc q 1 2 3 4 5 6 7 8 9

?(q) 0 1 0 1 2 0 1 2 3

2. P = ababbabababc

q 1 2 3 4 5 6 7 8 9 10 11 12

(q) 0 0 1 2 0 1 2 3 4 3 4 0

8

You should first check that these functions are correct by computing ? by the naive method. Then

simulate Compute-Prefix-Function, at least for the first few values. In fact, the supplied code

(see §3.1) has an implementation of the naive method of computing the prefix function (note

that this does not implement Compute-Prefix-Function). This is in the Matcher class and the

method is:

public static int[] buildPrefixFunctionNaively(String pattern)

Important: This method is supplied to help you in cross checking your work. You must not

use it as part of your implementation since it would invalidate all the timing tests (in effect you

would be implementing an algorithm that is significantly slower than naive matching!).

A proof of correctness of the Compute-Prefix-Function is not easy. Again it would be a

good exercise to try proving the correctness of the algorithm, or more accurately think about

how you might approach such a task (put a limit on the time you spend on this). You can find a

complete proof in the book Introduction to Algorithms by Cormen, Leiserson, Rivest and Stein.

The worst case runtime is ?(m) but a naive analysis does not work because of the presence

of a while loop (on line 5) within the overall for loop. We can derive the bound by using a form

of amortized analysis. We will not go into details for Compute-Prefix-Function because the

approach is similar to the one discussed below for KMP-Matcher.

We are now ready to look at a more efficient string matching algorithm.

KMP-Matcher(T,P)

1. n T.length

2. m P.length

3. Compute-Prefix-Function(P)

4. q 0

5. create new (empty) queue Q

6. for i 1 to n do

7. while q > 0 and P[q + 1] 6= T[i] do q [q]

8. if P[q + 1] = T[i] then q q + 1

9. if q = m then

10. enqueue(Q, i

m)

11. q [q]

12. return Q

The correctness of this algorithm is also not obvious but not so hard to establish given the correctness

of Compute-Prefix-Function. The next exercise will find the worst case runtime. Note

that this is a type of analysis that we do not otherwise cover in the course. It has therefore been

broken down into small and simple steps so that the things you have to do are ones that are common

with the type of analysis we have seen many times. Finally, although the question shows

three main items, the last one concludes the analysis for you and is not a question as such; in

your submission include only answers to the first two items.

Exercise 2. Define the function (s)

as follows: (0)

= 0 while for s > 0 the value of (s)

is

the value of q just after the body of the for loop of line 6 in KMP-Matcher has executed for the

value i = s. Consider an execution of the for loop on line 6 when i = s and count the following:

9

the number of executions of the statement controlled by the while loop on line 7;

the number of executions of the statement controlled by the if on line 8;

the number of executions of the statements on lines 10, 11 (controlled by the if on line 9).

Denote this count by cs. Define

cs = cs + (s).

This is called the amortized cost of the execution of the loop and the function

is called a

potential.

1. Let c be the total cost of executing the for loop in KMP-Matcher, counted in terms of

the statements listed above, and c the total amortized cost (i.e., c = c1 + ··· + cn and

c = c1 + ··· + cn). Prove that c = c + (n)

(0).

Explain briefly why it now follows

that c c. [15%]

2. Prove that cs = O(1) for all s (indeed cs 4). Break this down into the following easy

parts:

(a) Let q0 be the value of q before the while loop on line 7 is executed (i.e., q0 = (s1))

and let q1 be the value of q just after the while loop terminates. Let ws be the number

of times that the while loop body is executed. Prove that ws+q1q0

0. Recall that

(q) < q for all q; what happens to the value of q with each execution of the loop?

(b) Let q2 be the value of q after line 8 is executed. Prove that 1 + q2

q1 2.

(c) Let q3 the value of q just after the if statement starting on line 9 is executed (i.e.,

q3 = (s)).

Prove that 2 + q3

q2 2.

(d) Now put these observations together to deduce the overall claim. Note that cs ?

ws +1+2, thus

cs (ws + 1 + 2) + (s).

NOTE: Each of these requires no more than a few lines to justify. So think carefully and

express your justification very clearly. Overlong or unclear answers will receive very much

reduced credit and any answers that cannot be understood after 2 minutes (per part) will

be awarded 0. [20%]

3. We now use the preceding two main parts to deduce that the worst case runtime of the

algorithm is O(n); there is nothing for you to submit here but you should read through this

conclusion of the analysis and make sure you understand it. We use without proof the fact

that the worst case runtime of Compute-Prefix-Function is ?(m); recall that m  n.

Lines 1, 2, 4 and 12 of KMP-Matcher all cost O(1) while line 3 is O(m). The cost

of executing the for loop on line 6 for i = s is at most acs + b for some constants a

and b; if you have time give a brief justification of this observation. It follows that the

10

overall cost of executing the for loop on line 6 is at most ac + bn and hence at most

ac + bn. From the preceding part we deduce that ac + bn = O(n). Thus the total cost is

O(1) + O(1) + O(1) + O(1) + O(m) + O(n) = O(n) since m n.

In fact, we can improve our analysis to show that the worst case runtime is (n) but we

will not do this here (would you expect it to be sub-linear).

§3. Software tasks

We first give a guide to some of the suppled classes and methods (some more are mentioned

later).

§3.1 Supplied software

The files that you will need for this coursework can be found on-line at the coursework webpage:

http://www.inf.ed.ac.uk/teaching/courses/inf2b/coursework/cwk1.html

This page contains a file called inf2bcw1.tar which you should download. The file is tarred

so you need to extract it, e.g. by tar -xf inf2bcw1.tar, this will create a subdirectory

src of your current one that contains the software. The java package for this coursework is

package matcher. The only class that should be changed is the nearly empty class in the file

StudentClass.java which is where you will implement your part of the project. In fact,

the amount of code you need to write is fairly modest.

Recall that in the discussion above we assumed that arrays were indexed starting from 1.

However Java starts with 0. Rather than invent a new class in Java it is simplest to subtract 1

from all array indices. This is very important otherwise your software will not work.

Note: It is a part of the exercise that you translate the description of the algorithm based on

arrays starting at 1 to arrays starting at 0 in Java. As mentioned above, this is a deliberate choice

so that you practice switching conventions. In this case it is a very simple matter. You cannot

throw away location 0 of Java arrays and start using them from 1 onwards.

Important: Do not remove any headers from the supplied software including those in the

StudentClass.java file.

In the software the text and pattern will be represented by String variables text and

pattern; our alphabet is just any valid Java character (in experiments we will use a restricted

alphabet, this is discussed below). The following methods and classes are already supplied for

you.

class Queue. This implements the queue we use to hold the offsets for occurrences of the

text. In fact, for this practical we only need two methods.

– public Queue enqueue(Integer offset)

This puts s at the end of the queue.

11

– public Queue toString()

This simply returns a comma separated string of all the entries in the queue so that

you can print it out for checking your implementation. If the queue is empty then the

empty string is returned.

Class Matcher that contains various methods including

– public Queue matchNaively(String text, String pattern)

This takes text and pattern and uses NaiveMatcher to return a Queue structure

which consists of all the shifts in the text at which the pattern can be found. For example

if our text is "aababaab" and the pattern is "aa" then the returned Queue structure

has 0, 5 as its elements in this order. With the same text but with the pattern "aaa" the

returned Queue is empty.

§3.2 Tasks

Note: You are asked to write some code that is in fact fairly simple thanks to the supplied

methods. Any code that is incorrect will be awarded 0. Correct code will be marked for style

as well. This means that it should have useful comments and helpful indentation. Note that

pointless comments (e.g., stating the obvious such as “now we add the two variables together”)

or excessive indentation are almost as bad as the complete absence of either.

Important: The checking of your code includes an automated procedure. Partly for this reason,

you must follow the specification exactly and must not change any of the method names or their

parameters. Otherwise your code will fail the test and be awarded 0. In particular remember not

to remove any headers including those in the StudentClass.java file.

The “texts” that we will be searching are in fact DNA sequences. A DNA sequence or genetic

sequence is a succession of letters representing the primary structure of a real or hypothetical

DNA molecule or strand, with the capacity to carry information as described by molecular biology.

The possible letters are A, C, G, and T (note they are all in upper case) representing the four

nucleotide bases of a DNA strand: adenine, cytosine, guanine and thymine that are linked to a

phosphodiester backbone. Sequences can be derived from the biological raw material through a

process called DNA sequencing. This description is purely for background, you do not need to

know it for the exercise, all that is needed is that our alphabet consists of A, C, G, T (and indeed

this is only relevant to carrying out tests).

Your programming tasks are as follows.

1. In the StudentCode class, provide implementations of the method

public static int[] computePrefixFunction(String pattern) [10%]

In the StudentCode inner class KMPMatcher, implement the string matching algorithm:

public void search() [15%]

12

Note: Recall that the function

. In

§

2 we used the

convention that arrays start at index 1. However in Java they start at index 0. Since we

are just computing the array that defines

we now reduce each argument to the abstract

function

by 1. To be clear if pi is the Java array that gives us the values of the function

then

is given by pi[i-1]. If we were implementing a Java method getPi, say, then

we could hide this so that a call to getPi(i) looks up pi[i-1] and returns this as the

value. However to avoid extra coding we just work directly with the array pi

.

Keep your code straightforward, follow the algorithms as outlined in

§2.2 (taking care

of the issue regarding array indexing). Note that in designing the matching algorithm we

made the reasonable assumption that the pattern is not longer than the text. Your code must

test for this and return an empty queue if the pattern it strictly longer than the text. Remember

also that you must not use the supplied method buildPrefixFunctionNaively

as part of your implementation and if you do then you will be awarded 0 for the entire

coding part

.

Test your implementation by using the following methods in the Matcher class:  public static Boolean testPrefixFunction(String pattern)

This checks for occurrences of pattern with itself as discussed for building the

prefix function. Returns true if the list of occurrences (represented as shifts) is

correct otherwise false. You will not get any other information.

public static void testKMPMatcher(int t, int l)

Generates

t random patterns of length

l and finds their occurrences in a fixed text

using the method KMPMatcher.find. If all results are correct, then it returns an

appropriate message. Else, it reports on the failure. Note that the integer parameters

here and elsewhere must not be negative (an exception is raised otherwise).

Naturally the test procedures do not guarantee correctness. All we can say is that if the

code fails the tests then there is something wrong. If it passes all tests then it is probably

correct.

2. The aim of this part is to compare the runtimes for NaiveMatcher with those for

KMPMatcher. We would expect that as the pattern gets larger the differences would be

more significant. Note however that there could be patterns for which NaiveMatcher

runs faster (you should think carefully why this can be the case).

However we have a problem in trying to compare these two algorithms. The runtime of

Naive-Matcher depends on both

n and

m while that of KMP-Matcher depends only on

n

.

Of course for any fixed value of

m the runtime of Naive-Matcher is linear in

n, recall that

its runtime is

2. We would expect that for big enough fixed values of

m

the advantage of KMP-Matcher would be clear (but note that if

m is close to

n then the

runtime of Naive-Matcher is again linear in

n). In order to avoid complications we will

2Do not make the mistake of concluding falsely from the statement about fixed values of

m that the runtime of

Naive-Matcher is linear in

n. This is clearly nonsense, e.g., consider

m

13

carry out experiments with the following choices of

m for any given value

n: we use

m = 10

, 10

2,..., 10

r where

r is the smallest integer such that 10

. The

upper bound is chosen because

m has its maximum value at

The fact that we multiply the length of the pattern by 10 each time means that the number

of patterns is not enormous.

The supplied Matcher class method ? public static void getRuntimes(int p, int t, String f)

does the following:

(a) Generates

p patterns for each size from 10

, 10

2,... as discussed above using

n

=

10000 as the initial text size (the patterns are not random so that experiments are

repeatable).

(b) Searches for all occurrences of each pattern within a text of size

n using the classes

NaiveMatcher and KMPMatcher, and takes the cpu times for these.

In fact, the text is always taken from the initial portion of size

n of the same longer

text.

(c) Records the worst case times for each search taken over the

p patterns (note that it is

pefectly possible that the worst case runtimes for the two algorithm implementations

occur for different patterns).

(d) Repeats the above with texts of size 20000

, 30000,..., 10000

t

.

(e) Outputs the result for each iteration on the file named

f in the format

pattern-size text-size naiveMatcher-worst-case-runtime

KMPMatcher-worst-case-runtime

For

f you can give a path name but the simplest thing is to run code from your

working directory and use just the name for

f; the file will be placed in your working

directory.

(Note that the data above are given on a single line of the file, the text is too long to fit on

one line of this document.) Carry out experiments with increasing values of

t until you find

a clear difference between the two algorithms. In order to avoid excessive waiting choose

the value of

p to be fairly moderate. Naturally it makes sense to continue the experiment

a little beyond the point of the first clear difference (indeed due to various factors there

might be a temporary turn around, this doesn’t happen for long).

Finally, run the experiment with p=10

, t=100 and keep the results in a file with the name

matcherTimes.txt. Study the output to see what happens in terms of one algorithm

outperforming the other. [5%]

Use this file to find the crossover point(s) at which KMPMatcher is consistently faster

than naiveMatcher. This is a rough and ready way of obtaining an estimate for the

14

settling in period before the overheads outweigh the benefits. In this case our approach is

open to criticism (think about why this is the case) but is not too misleading.

From now on we will focus on KMPMatcher.

3. We know that the worst case runtime T(n) of KMPMatcher satisfies T(n) ? cn for some

constant c. Of course asymptotic notation allows for a settling in period, i.e., from some

value n0 of n onwards. In fact, this period is not really necessary at least from n = 1

onwards because we can just take a larger constant if needed (with the obvious downside

that this gives a pessimistic performance guarantee for all large enough values of n). It

should be fairly clear from the algorithm that the settling in period is not long; essentially

we just need n to be big enough so that cn dominates the cost of initializing variables and

setting up data structures.

The supplied Matcher method

public static void getRatios(int p, int t, int xOver,

String fRatios)

is similar to getRuntimes except that for each experiment it only uses KMPMatcher.

Once it has found the worst case runtime for a text size n it records the value of that runtime

divided by n. The results are output to the file fRatios (use the name

matcherRatios.txt for fRatios) in the format

integer-size ratio

one per line as before. At the end of this output it also appends four extra lines:

Ignoring ratios before cross over point: xOver

Sorted ratios are: r1, r2,...

Maximum ratio is: max-ratio

Average ratio is: ave-ratio

Thus the final two lines give us the overall maximum as well as the average of all the

ratios. The case for taking the average is that it smooths out to some extent the effects of

any garbage collection. In a more detailed study we need to be more sophisticated but here

we will keep things simple and rely on the plot (see below) to give us an idea of how useful

the average is. You can also use the sorted ratios to get an idea of the effects of garbage

collection (we would expect ratios affected by this to be significantly bigger than the “real”

ones).

Use the following values for the parameters:

p set to 10,

t set to 100.

15

xOver: use the crossover point (TextSize in matcherTimes.txt) you found

where KMPMatcher becomes more efficient than NaiveMatcher.

Finally you will plot the data from the file matcherTimes.txt of the previous part

and the worst case runtime as determined by the theoretical analysis together with the two

estimates for the constant found by the preceding experiment (the maximum and average).

Use the supplied Matcher method

public static void plotRuntimes(double c, double a,

String f)

to produce your plot. For c use the maximum constant found above, for a use the average

and for the file f pass the data in the file matcherTimes.txt from the preceding part.

This will bring up a plot which you should save in a file called plot.jpg. [5%]

Compiling and Running

From the commandline, type cd path-to-src-folder and compile with the command javac

matcher/*.java. Run the program with the command java matcher.ClassName

where ClassName is the .java file where your main() target is.

Notes

1. Java provides pattern matching. You must not use this in your implementation. The point

is to do things from scratch. If you ignore this requirement you will be awarded 0 for the

coding part.

2. As stated above your code must be well laid out showing its logical structure. Just like

good prose or good mathematical writing it must be set out to aid understanding. This way

you and others can maintain it as well as spot any errors more easily.

§4. Submitting the Coursework

You must submit the two parts of your work by the deadlines given at the head of this document.

If you complete the coursework in full, you should be submitting:

Part A: Analysis of algorithms (hardcopy only).

1. Your answer to Exercise 1 of §2.1.

2. Your answer to Exercise 2 of §2.2.

Part B: Software.

1. Your code in the class file StudentClass.java (electronic only).

2. matcherTimes.txt (electronic only).

3. plot.jpg (electronic only).

16

Note: Your answers to the two exercises of Part 1 are best handwritten (unless you have a medical

condition that prevents this) and neatly presented following the guidelines of Lecture Note 2 (and

the notes in general). Your hardcopy submission should be made at the appropriate time to:

ITO, Appleton Tower Room 6.05, Crichton Street, Edinburgh, EH8 9LE

Important: Put your matriculation number very clearly at the top of your submissions. You do

not have to put your name; marks will be returned to the ITO by matriculation number (this is

how the submit command (see below) organises things. If you need to submit more than one

piece of paper for a part please staple the sheets together and put your matriculation number on

the top of each sheet (in case they get separated). Do not use folders to hold several sheets as this

holds up the marking process significantly.

For electronic submission follow these instructions.

First put your submission files (and nothing else) into one directory called inf2b-cw1.

Submit this directory using the following command (when logged into DICE):

submit inf2b cw1 inf2b-cw1

Warning: The rule for courseworks is “We mark what is submitted.” Before you submit your

coursework, make sure that you are submitting the correct files. In some previous years students

submitted the wrong files and lost marks that way.

Bibliography

1. D. E. Knuth, J.H. Morris, Jr. and V.R. Pratt, Fast pattern matching in strings. SIAM Journal

on Computing, 6 (2), 1977, 323–350.

2. Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein, Introduction

to Algorithms. McGraw-Hill, 2002.

Kyriakos Kalorkoti

Software by Prachya Boonkwan, Xuan Huang,

Srikanth Ronanki and Naums Mogers


Oracle/SQL/PostgreSQL/Pig数据库代写/代做/辅导

你可能感兴趣的:(COURSEWORK 1 FOR INF2B (ADS THREAD, 2018-19))