CSCI-570 Analysis of Algorithms

CSCI-570 Fall 2021
Analysis of Algorithms
Final Project
Due: Dec 8, 2021
I. Guidelines
The project is related to the implementation of the two different solutions provided in chapter 6
of the Kleinberg textbook for the Sequence Alignment problem. You can work on this project in
groups of no more than 3 people.
II. Project Description
Implement the basic Dynamic Programming solution to the Sequence Alignment problem. Run
the test set provided and show your results.
A. Algorithm Description:
Suppose we are given two strings X and Y, where X consists of the sequence of symbols
x1, x2 . . . xm and Y consists of the sequence of symbols y1, y2 . . . yn. Consider the sets {1,
2, . . . , m} and {1, 2, . . . , n} as representing the different positions in the strings X and Y,
and consider a matching of these sets; recall that a matching is a set of ordered pairs with
the property that each item occurs in at most one pair. We say that a matching M of these
two sets is an alignment if there are no “crossing” pairs: if (i, j), (i’ , j’ ) ∈ M and i < i’ ,
then j < j’ . Intuitively, an alignment gives a way of lining up the two strings, by telling
us which pairs of positions will be lined up with one another.
Our definition of similarity will be based on finding the optimal alignment between X
and Y, according to the following criteria. Suppose M is a given alignment between X
and Y:

  1. First, there is a parameter δe > 0 that defines a gap penalty. For each position of X
    or Y that is not matched in M — it is a gap — we incur a cost of δ.
  2. Second, for each pair of letters p, q in our alphabet, there is a mismatch cost of αpq
    for lining up p with q. Thus, for each (i, j) ∈ M, we pay the appropriate mismatch
    cost αxiyj for lining up xi with yj. One generally assumes that αpp = 0 for each letter
    p—there is no mismatch cost to line up a letter with another copy of
    itself—although this will not be necessary in anything that follows.
  3. The cost of M is the sum of its gap and mismatch costs, and we seek an alignment
    of minimum cost.
    B. Input string Generator:
    The input to the program would be a text file, input.txt containing the following
    information:
  4. First base string
  5. Next j lines would consist of indices corresponding the after which the
    previous string to be added to the cumulative string
  6. Second base string
  7. Next k lines would consist of where the base string to be added to the
    cumulative string
    This information would help generate 2 strings from the original 2 base strings.
    This file could be used as an input to your program and your program could use
    the base strings and the rules to generate the actual strings. Also note that the
    numbers j and k corresponds to the first and the second string respectively. Make
    sure you validate the length of the first and the second string to be
    2jstr1.length and 2kstr2.length. Please note that the base strings need
    not have to be of equal length and similarly, j need not be equal to k.

Using the above numbers, the generated strings would be
ACACTGACTACTGACTGGTGACTACTGACTGG and
TATTATACGCTATTATACGCGACGCGGACGCG
Following is the step by step process on how the above strings are generated.
ACTG
ACTGACTG
ACTGACTACTGACTGG
ACACTGACTACTGACTGGTGACTACTGACTGG
TACG
TATACGCG
TATTATACGCGACGCG
TATTATACGCTATTATACGCGACGCGGACGCG
C. Values for Delta and Alphas
Values for α’s are as follows. δe is equal to 30.
A C G T
A 0 110 48 94
C 110 0 118 48
G 48 118 0 110
T 94 48 110 0
D. Programming/Scripting Languages:
Following are the list of languages which could be used:

  1. C
  2. C++
  3. Java
  4. Python
  5. Python3
    E. Bounds:
    Bounds on the length of the base strings and the values of m and n, along with the
    zip file will be released on 17 November 2021, i.e. 3 weeks before the due date.
    III. Goals
    Following are the goals to achieve for your project
    A. Your program should take input:
  6. 2 strings that need to be aligned, should be generated from the string
    generator mentioned above.
  7. Gap penalty (δe).
  8. Mismatch penalty (αpq).
    B. Your solution should output output.txt file containing the following information at
    the respective lines:
  9. The first 50 elements and the last 50 elements of the actual alignment.
  10. The time it took to complete the entire solution.
  11. Memory required.
    C. Implement the memory efficient version of this solution and repeat the tests in
    Part B.
    D. Plot the results of Part A and Part B:
  12. Single plot of CPU time vs problem size for the two solutions.
  13. Single plot of Memory usage vs problem size for the two solutions.
    IV. Notes and Hints
    A. You will be provided a zipfile which has a folder containing some sample test
    cases and corresponding output text files containing the correct answers. Please
    note that we will be having another set of test cases (that are not released) which
    will be used for testing your program/script.
    B. You should provide us with a zipfile containing your programs/scripts along with
    plots and a summary file.
    C. The name of your program/script, folder and all the other meta-data files should
    have the USC IDs (not email ids) of everyone in your group separated by
    underscore, as follows:
  14. Zip filename: 1234567890_1234567890_1234567890.zip, of the original
    foldername 1234567890_1234567890_1234567890.
  15. Program filename: 1234567890_1234567890_1234567890_basic.py and
    1234567890_1234567890_1234567890_efficient.py (if using Python).
  16. Plots: CPUPlot.png and MemoryPlot.png (you can use .jpg as well).
  17. Summaries: Summary.txt.
    D. Your programs/script should take input as input.txt file as an argument and output
    an output.txt file.
    E. Summarize your results and include any insights or observations drawn from your
    results in Summary.txt
    F. You also need to state each group members contribution to the project, e.g.coding,
    testing, report preparation, etc. Do that in Summary.txt.
    V. Grading
    A. Basic Algorithm: 25 points.
    B. If your program outputs a .txt file having all the 3 lines with correct answers: 15
    points.
    C. Memory efficient version: 30 points.
    D. Plots, analysis of results, insights and observations: 30 points.

你可能感兴趣的:(算法)