allway2

Dynamic programming and sequence alignment

Computer science aids molecular biology

Save Like

By Paul Reiners
Published March 11, 2008

Genetics databases hold extremely large amounts of raw data. The human genome alone has approximately 3 billion DNA base pairs. To search through all this data and find meaningful relationships within it, molecular biologists are depending more and more on efficient computer science string algorithms. This article introduces you to three such algorithms, all of which use dynamic programming, an advanced algorithmic technique that solves optimization problems from the bottom up by finding optimal solutions to subproblems. You’ll work through Java™ implementations of these algorithms, and you’ll learn about an open source Java framework for processing biological data.

Genetics and string algorithms

Strands of genetic material — DNA and RNA — are sequences of small units called nucleotides. For purposes of answering some important research questions, genetic strings are equivalent to computer science strings — that is, they can be thought of as simply sequences of characters, ignoring their physical and chemical properties. (Although, strictly speaking, their chemical properties are usually coded as parameters to the string algorithms you’ll be looking at in this article.)

This article’s examples use DNA, which consists of two strands of adenine (A), cytosine (C), thymine (T), and guanine (G) nucleotides. DNA’s two strands are reverse complements of each other. A and T are complementary bases, and C and G are complementary bases. This means that A s in one strand are paired with T s in the other strand (and vice versa), and C s in one strand are paired with G s in the other strand (and vice versa). So, if you know the sequence of one strand’s A s, C s, T s, and G s, you can derive the other strand’s sequence. Hence, you can think of a DNA strand simply as a string of the letters A, C, G, and T.

Dynamic programming

Dynamic programming is an algorithmic technique used commonly in sequence analysis. Dynamic programming is used when recursion could be used but would be inefficient because it would repeatedly solve the same subproblems. For example, consider the Fibonacci sequence: 0, 1, 1, 2, 3, 5, 8, 13, … The first and second Fibonacci numbers are defined to be 0 and 1, respectively. The _n_th Fibonacci number is defined to be the sum of the two preceding Fibonacci numbers. So, you can calculate the _n_th Fibonacci number with the recursive function in Listing 1:

Listing 1. Recursive function for calculating nth Fibonacci number

public int fibonacci1(int n) {
   if (n == 0) {
      return 0;
   } else if (n == 1) {
      return 1;
   } else {
      return fibonacci1(n ‑ 1) + fibonacci1(n ‑ 2);
   }
}

But Listing 1’s code is inefficient because it solves some of the same recursive subproblems repeatedly. For example, consider the computation of fibonacci1(5), represented in Figure 1:

Figure 1. Recursive computation of Fibonacci numbers

In Figure 1 you can see, for example, that fibonacci1(2) is computed three times. It would be much more efficient to build the Fibonacci numbers from the bottom up, as shown in Listing 2, rather than from the top down:

Listing 2. Building Fibonacci numbers from the bottom up

public int fibonacci2(int n) {
   int[] table = new int[n + 1];
   for (int i = 0; i < table.length; i++) {
      if (i == 0) {
         table[i] = 0;
      } else if (i == 1) {
         table[i] = 1;
      } else {
         table[i] = table[i ‑ 2] + table[i ‑ 1];
      }
   }

   return table[n];
}

Show less

Listing 2 stores the intermediate results in a table so that you can reuse them, rather than throwing them away and computing them multiple times. It’s true that storing the table is memory-inefficient because you use only two entries of the table at a time, but ignore that fact for now. I’m doing it this way to motivate your use of similar tables (although they will be two-dimensional) in this article’s more complicated later examples. The point is that Listing 2’s implementation is much more time-efficient than Listing 1’s. Listing 2’s implementation runs in O(n) time. I won’t prove this, but the running time of Listing 1’s naive, recursive implementation is exponential in n.

This is exactly how dynamic programming works. You take a problem that could be solved recursively from the top down and solve it iteratively from the bottom up instead. You store your intermediate results in a table for later use; otherwise, you would end up computing them repeatedly — an inefficient algorithm. But dynamic programming is usually applied to optimization problems like the rest of this article’s examples, rather than to problems like the Fibonacci problem. The next example is a string algorithm, like those commonly used in computational biology.

Longest common subsequence problem

You’ll first see how to use dynamic programming to find a longest common subsequence (LCS) of two DNA sequences. Biologists who find a new gene sequence typically want to know what other sequences it is most similar to. Finding an LCS is one way of computing how similar two sequences are: the longer the LCS is, the more similar they are.

The characters in a subsequence, unlike those in a substring, do not need to be contiguous. For example, ACE is a subsequence (but not a substring) of ABCDE. Consider the following two DNA sequences:

S1 = GCCCTAGCG
S2 = GCGCAATG

It turns out that an LCS of these two sequences is GCCAG. (Note that this is an LCS, rather than the LCS, because other common subsequences of the same length might exist. This and the other optimization problems you’ll look at might have more than one solution.)

LCS algorithm

First, think about how you might compute an LCS recursively. Let:

C1 be the right-most character of S1
C2 be the right-most character of S2
S1′ be S1 with C1 “chopped-off”
S2′ be S2 with C2 “chopped-off”

There are three recursive subproblems:

L1 = LCS(S1′, S2)
L2 = LCS(S1, S2′)
L3 = LCS(S1′, S2′)

I won’t prove this, but it can be shown (and it’s not hard to believe) that the solution to the original problem is whichever of these is the longest:

L1
L2
L3 appended with C1 if C1 equals C2, or L3 if C1 is not equal to C2

(The base case is whenever S1 or S2 is a zero-length string. In this case, the LCS of S1 and S2 is clearly a zero-length string.)

However, like the recursive procedure for computing Fibonacci numbers, this recursive solution requires multiple computations of the same subproblems. It can be shown that this recursive solution takes exponential time to run. In contrast, the dynamic programming solution to this problem runs in Θ(mn) time, where m and n are the lengths of the two sequences.

To compute the LCS efficiently using dynamic programming, you start by constructing a table in which you build up partial results. List one of the sequences across the top and the other down the left, as shown in Figure 2:

Figure 2. Initial LCS table

The idea is that you’ll fill up the table from top to bottom, and from left to right, and each cell will contain a number that is the length of an LCS of the two string prefixes up to that row and column. That is, each cell will contain a solution to a subproblem of the original problem. For example, consider the cell in the sixth row and the seventh column; it is to the right of the second C in GCGCAATG and below the T in GCCCTAGCG. This cell will eventually contain a number that is the length of an LCS of GCGC and GCCCT.

First consider what the entries should be for the table’s second row. These are the lengths of LCSs for the zero-length prefix of the sequence going down the left, GCGCAATG, and prefixes of the sequence along the top, GCCCTAGCG. Clearly, the value of any of these LCSs will be 0. Similarly, the values down the second columns will all be 0. This corresponds to the base case of the recursive solution. Now the table looks like Figure 3:

Figure 3. LCS table with base cases filled in

Next, you implement what corresponds to the recursive subcases in the recursive algorithm, but you use values that you’ve already filled in. In Figure 4, I’ve filled in about half of the cells:

Figure 4. LCS table half filled-in

When you fill in a cell, you consider:

The cell directly to the left of it
The cell directly above it
The cell to the above-left of it

The three values below correspond, respectively, to the values returned by the three recursive subproblems I listed earlier.

V1 = the value in the cell to the left
V2 = the value in the cell above
V3 = the value in the cell to the above-left

You fill in the empty cell with the maximum of these three numbers:

V1
V2
V3 + 1 if C1 equals C2, or V3 if C1 is not equal to C2, where C1 is the character above the current cell and C2 is the character to the left of the current cell

Note that I also add arrows that point back to which of those three cells I used to get the value for the current cell. You’ll use these arrows later in “tracing back” to construct an actual LCS (as opposed to just discovering the length of one).

Now fill in the next blank cell in Figure 4 — the one under the third C in GCCCTAGCG and to the right of the second C in GCGCAATG. You have a 2 above it, a 3 to the left of it, and a 2 to the above-left of it. The character above this cell and the character to the left of this cell are equal (they’re both C), so you must pick the maximum of 2, 3, and 3 (2 from the above-left cell + 1). So, the value of this cell will be 3. Draw an arrow back to the cell from which you got this new number. In this case, where the new number could have come from more than one cell, pick an arbitrary one: the one to the above-left, say.

As an exercise, you might want to try filling in the rest of the table. If, in the case of ties, you always choose the cell to the above-left over the cell above and the cell above over the cell to the left, you’ll get the table in Figure 5. (If you make different choices in the case of ties, your arrows will be different, of course, but the numbers will be the same.)

Figure 5. Filled-in LCS table

Recall that the number in any cell is the length of an LCS of the string prefixes above and below that end in the column and row of that cell. Hence, the number in the lower, right-most cell is the length of an LCS of the two strings S1 and S2— GCCCTAGCG and GCGCAATG in this case. So, the length of an LCS for these two sequences is 5.

This is a key point to keep in mind with all of these dynamic programming algorithms. Each cell in the table contains the solution to the problem for the sequence prefixes above and to the left that end at the column and row of that cell.

Tracing back to find an actual LCS

The next thing you want to do is to find an actual LCS. You do this in the traceback step in which you use the cell pointers that you drew. When you’re building up your table, remember that when you have a pointer to the above-left cell, and the value in the current cell is 1 more than the value of the above-left cell, this means that the characters to the left and above are equal. In building up an LCS, this corresponds to adding this character to the LCS. So, the way you construct an LCS is by starting in the lower-right corner cell and then following the pointer arrows backward. Every time you follow a pointer to a diagonal cell to the above-left and the value of the cell that is pointed to is 1 less than the value of the current cell, you prepend the corresponding common character to the LCS you’re constructing. Note that you prepend it because you’re starting at the end of the LCS. (In the case of Figure 5, the 5 in the lower-right cell corresponds to the fifth character you’ve added.)

So, proceed to build up your LCS. Starting in the lower-right cell, you see that you have the cell pointer pointing to the above-left and that the value in the current cell (5) is one more than the value in the cell to the above-left (4). So you prepend the character G to your initial zero-length string. The next arrow, from the cell containing a 4, also points up and to the left, but the value doesn’t change. And the next cell also points to the left and above, but its value also doesn’t change. Finally, that cell also points to the above and left, but the value went from 3 to 4. This means you added the common character in that row and column, which is an A. So, your LCS so far is AG. From there, you follow the pointer to the left (this corresponds to skipping over the T above) to another 3. Then there is a diagonal pointer pointing to a 2. Hence, you add the common letter in the current row and column, which is a C, yielding CAG. You continue in this fashion until you finally reach a 0. Figure 6 shows the entire traceback:

Figure 6. Filled-in LCS table with traceback

From the traceback, you get GCCAG as an LCS.

Dynamic programming implementation in the Java language

Now you’ll use the Java language to implement dynamic programming algorithms — the LCS algorithm first and, a bit later, two others for performing sequence alignment. In each example you’ll somehow compare two sequences, and you’ll use a two-dimensional table to store the solutions to subproblems. You’ll define an abstract DynamicProgramming class that contains code common to all the algorithms. All of this article’s sample code is available for Download.

To start, you need a class representing cells in the table, as shown in Listing 3:

Listing 3. Cell class (partial listing)

public class Cell {
   private Cell prevCell;
   private int score;
   private int row;
   private int col;
}

The first step in all the algorithms is to initialize the scores and sometimes the pointers in the table. What you set the initial scores and pointers to differs from algorithm to algorithm, which is why the DynamicProgramming class, as shown in Listing 4, defines two abstract methods:

Listing 4. DynamicProgramming initialization code

protected void initialize() {
   for (int i = 0; i < scoreTable.length; i++) {
      for (int j = 0; j < scoreTable[i].length; j++) {
         scoreTable[i][j] = new Cell(i, j);
      }
   }
   initializeScores();
   initializePointers();

   isInitialized = true;
}

protected void initializeScores() {
   for (int i = 0; i < scoreTable.length; i++) {
      for (int j = 0; j < scoreTable[i].length; j++) {
         scoreTable[i][j].setScore(getInitialScore(i, j));
      }
   }
}

protected void initializePointers() {
   for (int i = 0; i < scoreTable.length; i++) {
      for (int j = 0; j < scoreTable[i].length; j++) {
         scoreTable[i][j].setPrevCell(getInitialPointer(i, j));
      }
   }
}

protected abstract int getInitialScore(int row, int col);

protected abstract Cell getInitialPointer(int row, int col);

Next, you fill in each cell of the table with a score and a pointer. Again, how you do this varies from algorithm to algorithm, so you use an abstract method, fillInCell(Cell, Cell, Cell, Cell). Listing 5 shows DynamicProgramming‘s methods for filling in the table:

Listing 5. DynamicProgramming code for filling in the table

protected void fillIn() {
   for (int row = 1; row < scoreTable.length; row++) {
      for (int col = 1; col < scoreTable[row].length; col++) {
         Cell currentCell = scoreTable[row][col];
         Cell cellAbove = scoreTable[row ‑ 1][col];
         Cell cellToLeft = scoreTable[row][col ‑ 1];
         Cell cellAboveLeft = scoreTable[row ‑ 1][col ‑ 1];
         fillInCell(currentCell, cellAbove, cellToLeft, cellAboveLeft);
      }
   }
}

protected abstract void fillInCell(Cell currentCell, Cell cellAbove,
      Cell cellToLeft, Cell cellAboveLeft);

Finally, you get the traceback. How you do this varies across algorithms. Listing 6 shows the DynamicProgramming.getTraceback() method:

Listing 6. DynamicProgramming.getTraceback() method

abstract protected Object getTraceback();

Java LCS implementation

Now, you’re ready to code a Java implementation for the LCS algorithm.

Initializing the scores in the cells is easy: you just set them all initially to 0 (you’ll reset some of them later), as shown in Listing 7:

Listing 7. LCS initialization method

protected int getInitialScore(int row, int col) {
   return 0;
}

Listing 8 shows the code for filling in the score and pointer for an individual cell in the table:

Listing 8. LCS method for filling in a cell’s score and pointer

protected void fillInCell(Cell currentCell, Cell cellAbove, Cell cellToLeft,
      Cell cellAboveLeft) {
   int aboveScore = cellAbove.getScore();
   int leftScore = cellToLeft.getScore();
   int matchScore;
   if (sequence1.charAt(currentCell.getCol() ‑ 1) == sequence2
         .charAt(currentCell.getRow() ‑ 1)) {
      matchScore = cellAboveLeft.getScore() + 1;
   } else {
      matchScore = cellAboveLeft.getScore();
   }
   int cellScore;
   Cell cellPointer;
   if (matchScore >= aboveScore) {
      if (matchScore >= leftScore) {
         // matchScore >= aboveScore and matchScore >= leftScore
         cellScore = matchScore;
         cellPointer = cellAboveLeft;
      } else {
         // leftScore > matchScore >= aboveScore
         cellScore = leftScore;
         cellPointer = cellToLeft;
      }
   } else {
      if (aboveScore >= leftScore) {
         // aboveScore > matchScore and aboveScore >= leftScore
         cellScore = aboveScore;
         cellPointer = cellAbove;
      } else {
         // leftScore > aboveScore > matchScore
         cellScore = leftScore;
         cellPointer = cellToLeft;
      }
   }
   currentCell.setScore(cellScore);
   currentCell.setPrevCell(cellPointer);
}

Finally, you construct an actual LCS using the traceback:

Listing 9. LCS traceback code

protected Object getTraceback() {
   StringBuffer lCSBuf = new StringBuffer();
   Cell currentCell = scoreTable[scoreTable.length ‑ 1][scoreTable[0].length ‑ 1];
   while (currentCell.getScore() > 0) {
      Cell prevCell = currentCell.getPrevCell();
      if ((currentCell.getRow() ‑ prevCell.getRow() == 1 && currentCell
            .getCol()
            ‑ prevCell.getCol() == 1)
            && currentCell.getScore() == prevCell.getScore() + 1) {
         lCSBuf.insert(0, sequence1.charAt(currentCell.getCol() ‑ 1));
      }
      currentCell = prevCell;
   }

   return lCSBuf.toString();
}

It’s pretty easy to see that this algorithm takes Θ(mn) time (and space) to compute, where m and n are the lengths of the two sequences. Filling in each cell takes constant time — just a bounded number of additions and comparisons — and you must fill in mn cells. Also, the traceback runs in O(m + n) time.

The next two Java examples implement-sequence alignment algorithms: Needleman-Wunsch and Smith-Waterman.

Sequence alignment

Homology

Homology is an important biological concept. Two species are considered homologous if they share a common evolutionary ancestor. Homologous species have many parts of their DNA in common. Conversely, if two species have a similar substring of DNA, you can often infer that this similar DNA comes from a common ancestor. Sequence-alignment algorithms can be used to find such similar DNA substrings.

A major theme of genomics is comparing DNA sequences and trying to align the common parts of two sequences. If two DNA sequences have similar subsequences in common — more than you would expect by chance — then there is a good chance that the sequences are homologous (see ” Homology” sidebar). In aligning two sequences, you consider not only characters that match identically, but also spaces or gaps in one sequence (or, conversely, insertions in the other sequence) and mismatches, both of which can correspond to mutations. In sequence alignment, you want to find an optimal alignment that, loosely speaking, maximizes the number of matches and minimizes the number of spaces and mismatches. More formally, you can determine a score for each possible alignment by adding points for matching characters and subtracting points for spaces and mismatches.

Global and local sequence alignment

Global sequence alignment tries to find the best alignment between an entire sequence S1 and another entire sequence S2. Consider these two DNA sequences:

S1 = GCCCTAGCG
S2 = GCGCAATG

If you award matches one point, penalize spaces by two points, and penalize mismatches by one point, the following is an optimal global alignment:

S1′ = GCCCTAGCG
S2′ = GCGC-AATG

A dash (-) denotes a space. There are five matches, one space in S2′ (or, conversely, one insertion in S1′), and three mismatches. This yields a score of (5 1) + (1 -2) + (3 * -1) = 0, which is the best you can do.

With local sequence alignment, you’re not constrained to aligning the whole of both sequences; you can just use parts of each to obtain a maximum score. Using the same sequences S1 and S2 and the same scoring scheme, you obtain the following optimal local alignment S1” and S2”:

S1 = GCCCTAGCGS1= GCCCTAGCG
S1” = GCGS1'' = GCG
S2” = GCGS2'' = GCG
S2 = GCGCAATGS2= GCGCAATG

This local alignment doesn’t happen to have any mismatches or spaces, although, in general, local alignments can have them. This local alignment has a score of (3 1) + (0 -2) + (0 * -1) = 3. (The score of the best local alignment is greater than or equal to the score of the best global alignment, because a global alignment is a local alignment.)

Needleman-Wunsch algorithm

The Needleman-Wunsch algorithm is used for computing global alignments. The idea is similar to the LCS algorithm. Again, you have a two-dimensional table with one sequence along the top and one along the left side. Again, you can arrive at each cell in one of three ways:

From the cell above, which corresponds to aligning the character to the left with a space
From the cell to the left, which corresponds to aligning the character above with a space
From the cell diagonally to the above-left, which corresponds to aligning the characters to the left and above (which might or might not match)

I’ll first give you the whole table (see Figure 7), and you can refer back to it as I explain how it was filled in:

Figure 7. Filled-in Needleman-Wunsch table with traceback

First, you must initialize the table. This means filling in the scores and pointers for the second row and second column. Traveling to the right in the second row corresponds to using a character in the first sequence along the top and using a space, rather than the first character of the sequence going down the left. The space penalty is -2, so, each time you do this, you add -2 to the previous cell. The previous cell is the one to the left. So, this explains how you get the 0, -2, -4, -6, … sequence in the second row. Similarly, you obtain the scores and pointers going down the second column. Listing 10 shows initialization code for the Needleman-Wunsch algorithm:

Listing 10. Needleman-Wunsch initialization code

protected Cell getInitialPointer(int row, int col) {
   if (row == 0 && col != 0) {
      return scoreTable[row][col ‑ 1];
   } else if (col == 0 && row != 0) {
      return scoreTable[row ‑ 1][col];
   } else {
      return null;
   }
}

protected int getInitialScore(int row, int col) {
   if (row == 0 && col != 0) {
      return col ∗ space;
   } else if (col == 0 && row != 0) {
      return row ∗ space;
   } else {
      return 0;
   }
}

Next, you need to fill in the remaining cells. As with the LCS algorithm, for each cell you have three choices and pick the maximum one. You can come at each cell from above, from the left, or from the above-left. Let S1 and S2 be the strings you’re trying to align, and S1′ and S2′ be the strings in the resulting alignment. Coming at the cell from above is the same as adding the character at the left from S2 to S2′, while skipping the character in S1 above for now and introducing a space in S1′. Because a space has a score of -2, you would obtain a score for the current cell by subtracting 2 from the cell above. Similarly, you could come to the blank cell from the left by subtracting 2 from the score in the cell to the left. Finally, you could add the character above to S1′ and the character to the left to S2′. This corresponds to entering the blank cell from the above-left. These two characters will match, in which case the new score is the score in the cell to the above-left plus 1; or they won’t match, in which case the new score is the score in the cell to the above-left minus 1. Of these three possibilities, you pick one that gives you the maximum score (picking an arbitrary high-scoring cell, if there is a tie). If you look at the pointers in Figure 7, you can find examples of each of these three possibilities.

Listing 11 shows the code for filling in the blank cells:

Listing 11. Needleman-Wunsch code for filling in the table

protected void fillInCell(Cell currentCell, Cell cellAbove, Cell cellToLeft,
      Cell cellAboveLeft) {
   int rowSpaceScore = cellAbove.getScore() + space;
   int colSpaceScore = cellToLeft.getScore() + space;
   int matchOrMismatchScore = cellAboveLeft.getScore();
   if (sequence2.charAt(currentCell.getRow() ‑ 1) == sequence1
         .charAt(currentCell.getCol() ‑ 1)) {
      matchOrMismatchScore += match;
   } else {
      matchOrMismatchScore += mismatch;
   }
   if (rowSpaceScore >= colSpaceScore) {
      if (matchOrMismatchScore >= rowSpaceScore) {
         currentCell.setScore(matchOrMismatchScore);
         currentCell.setPrevCell(cellAboveLeft);
      } else {
         currentCell.setScore(rowSpaceScore);
         currentCell.setPrevCell(cellAbove);
      }
   } else {
      if (matchOrMismatchScore >= colSpaceScore) {
         currentCell.setScore(matchOrMismatchScore);
         currentCell.setPrevCell(cellAboveLeft);
      } else {
         currentCell.setScore(colSpaceScore);
         currentCell.setPrevCell(cellToLeft);
      }
   }
}

Next, you need to obtain the actual alignment strings —S1′ and S2′— and the alignment score. The score in the bottom-right cell contains the maximum alignment score for S1 and S2, just as it contains the length of an LCS in the LCS algorithm. And, similarly to the LCS algorithm, to obtain S1′ and S2′, you trace back from this bottom-right cell, following the pointers, and build up S1′ and S2′ in reverse. From constructing the table, you know that going down corresponds to adding the character to the left from S2 to S2′ while adding a space to S1′; going right corresponds to adding the character above from S1 to S1′ while adding a space to S2′; and going down and to the right means adding a character from S1 and S2 to S1′ and S2′, respectively.

The traceback code that you use for Needleman-Wunsch turns out to be identical to that used for Smith-Waterman for local alignment, except for determining which cell you start in and how you know when to finish the traceback. Listing 12 shows the code that the two algorithms share:

Listing 12. Traceback code used for both Needleman-Wunsch and Smith-Waterman

protected Object getTraceback() {
   StringBuffer align1Buf = new StringBuffer();
   StringBuffer align2Buf = new StringBuffer();
   Cell currentCell = getTracebackStartingCell();
   while (traceBackIsNotDone(currentCell)) {
      if (currentCell.getRow() ‑ currentCell.getPrevCell().getRow() == 1) {
         align2Buf.insert(0, sequence2.charAt(currentCell.getRow() ‑ 1));
      } else {
         align2Buf.insert(0, '‑');
      }
      if (currentCell.getCol() ‑ currentCell.getPrevCell().getCol() == 1) {
         align1Buf.insert(0, sequence1.charAt(currentCell.getCol() ‑ 1));
      } else {
         align1Buf.insert(0, '‑');
      }
      currentCell = currentCell.getPrevCell();
   }

   String[] alignments = new String[] { align1Buf.toString(),
         align2Buf.toString() };

   return alignments;
}

protected abstract boolean traceBackIsNotDone(Cell currentCell);

protected abstract Cell getTracebackStartingCell();

Listing 13 shows the traceback code specific to Needleman-Wunsch:

Listing 13. Needleman-Wunsch traceback code

protected boolean traceBackIsNotDone(Cell currentCell) {
   return currentCell.getPrevCell() != null;
}

protected Cell getTracebackStartingCell() {
   return scoreTable[scoreTable.length ‑ 1][scoreTable[0].length ‑ 1];
}

The original Needleman-Wunsch algorithm

Strictly speaking, I haven’t shown you the Needleman-Wunsch algorithm. The original algorithm published by Needleman-Wunsch runs in cubic time and is no longer used. However, the quadratic algorithm discussed here is still commonly referred to as the Needleman-Wunsch algorithm.

Tracing backwards gives you the optimal global alignment I mentioned at the beginning of this section:

S1′ = GCCCTAGCG
S2′ = GCGC-AATG

Clearly, this algorithm runs in O(mn) time.

Smith-Waterman algorithm

In the Smith-Waterman algorithm, you’re not constrained to aligning the entire sequences. This, and the fact that two zero-length strings is a local alignment with score of 0, means that in building up a local alignment you don’t need to “go into the red” and have partial scores that are negative. That would cause further alignments to have a score lower than you could get by “resetting” with two zero-length strings. Also, your local alignment doesn’t need to end at the end of either sequence, so you don’t need to start your traceback in the bottom-right corner; you can start it in the cell with the highest score.

This leads to three ways that the Smith-Waterman algorithm differs from the Needleman-Wunsch algorithm. First, in the initialization stage, the first row and first column are all filled in with 0s (and the pointers in the first row and first column are all null). Listing 14 shows the Smith-Waterman initialization code:

Listing 14. Smith-Waterman initialization

protected int getInitialScore(int row, int col) {
   return 0;
}

protected Cell getInitialPointer(int row, int col) {
   return null;
}

Second, when you fill in the table, if a score becomes negative, you put in 0 instead, and you add the pointer back only for cells that have positive scores. Note in Listing 15 that you also keep track of which cell has the high score; you’ll need that for the traceback:

Listing 15. Smith-Waterman code for filling in a cell

protected void fillInCell(Cell currentCell, Cell cellAbove, Cell cellToLeft,
      Cell cellAboveLeft) {
   int rowSpaceScore = cellAbove.getScore() + space;
   int colSpaceScore = cellToLeft.getScore() + space;
   int matchOrMismatchScore = cellAboveLeft.getScore();
   if (sequence2.charAt(currentCell.getRow() ‑ 1) == sequence1
         .charAt(currentCell.getCol() ‑ 1)) {
      matchOrMismatchScore += match;
   } else {
      matchOrMismatchScore += mismatch;
   }
   if (rowSpaceScore >= colSpaceScore) {
      if (matchOrMismatchScore >= rowSpaceScore) {
         if (matchOrMismatchScore > 0) {
            currentCell.setScore(matchOrMismatchScore);
            currentCell.setPrevCell(cellAboveLeft);
         }
      } else {
         if (rowSpaceScore > 0) {
            currentCell.setScore(rowSpaceScore);
            currentCell.setPrevCell(cellAbove);
         }
      }
   } else {
      if (matchOrMismatchScore >= colSpaceScore) {
         if (matchOrMismatchScore > 0) {
            currentCell.setScore(matchOrMismatchScore);
            currentCell.setPrevCell(cellAboveLeft);
         }
      } else {
         if (colSpaceScore > 0) {
            currentCell.setScore(colSpaceScore);
            currentCell.setPrevCell(cellToLeft);
         }
      }
   }
   if (currentCell.getScore() > highScoreCell.getScore()) {
      highScoreCell = currentCell;
   }
}

Finally, in the traceback, you start with the cell that has the highest score and work back until you reach a cell with a score of 0. Otherwise, the traceback works exactly the same as in the Needleman-Wunsch algorithm. Listing 16 shows the Smith-Waterman traceback code:

Listing 16. Smith-Waterman traceback code

protected boolean traceBackIsNotDone(Cell currentCell) {
   return currentCell.getScore() != 0;
}

protected Cell getTracebackStartingCell() {
   return highScoreCell;
}

Figure 8 illustrates running the Smith-Waterman algorithm on the S1 and S2 sequences that you’ve been using throughout this article:

Figure 8. Filled-in Smith-Waterman table with traceback

As with the Needleman-Wunsch algorithm, the optimal local alignment that you get from running the Smith-Waterman code (or from reading from Figure 8) is:

S1 = GCCCTAGCGS1= GCCCTAGCG
S1” = GCGS1'' = GCG
S2” = GCGS2'' = GCG
S2 = GCGCAATGS2= GCGCAATG

ALIGN, FASTA, and BLAST

This article shows you basic implementations of the Needleman-Wunsch and Smith-Waterman algorithms, without optimizations, for finding global and local alignments in O(mn) time. Real-world researchers are usually not comparing two sequences, but are instead trying to find all sequences similar to a particular sequence. If one of the similar sequences they find has a known biological function, then there is a good chance that the original sequence has a similar function because similar sequences are likely to have similar functions. ALIGN, FASTA, and BLAST (Basic Local Alignment Search Tool) are industrial-grade applications that find global (ALIGN) and local (FASTA and BLAST) alignments. BLAST searches large sequence databases for sequences that are similar (and possibly homologous) to a user-input sequence and ranks the results by similarity. BLAST doesn’t use Smith-Waterman directly because, even with a quadratic running time, it would be too slow at comparing a sequence against each sequence in extremely large databases of gene sequences, each of which may consist of as many as 3 billion base pairs (or more). Instead, BLAST first uses a process called seeding to find seeds, which are the beginnings of possible matches or hits. BLAST then uses a dynamic programming algorithm to extend the possible hits found to actual local alignments with the input sequence. Finally, it finds which of the matches are statistically significant and ranks them. This partly heuristic process isn’t as sensitive (accurate) as Smith-Waterman, but it’s much quicker.

BioJava

BioJava is an open source project developing a Java framework for processing biological data. Its features include objects for manipulating biological sequences, tools for making sequence-analysis GUIs, and analysis and statistical routines that include a dynamic-programming toolkit.

Listing 17 shows how to run the BioJava implementations of Needleman-Wunsch and Smith-Waterman on the same sequences and scoring scheme this article’s earlier examples use:

Listing 17. BioJava code sequence alignment code (based on BioJava example code by Andreas Dräger)

// The alphabet of the sequences. For this example DNA is chosen.
FiniteAlphabet alphabet = (FiniteAlphabet) AlphabetManager
      .alphabetForName("DNA");
// Use a substitution matrix with equal scores for every match and every
// replace.
int match = 1;
int replace = ‑1;
SubstitutionMatrix matrix = new SubstitutionMatrix(alphabet, match,
      replace);
// Firstly, define the expenses (penalties) for every single operation.
int insert = 2;
int delete = 2;
int gapExtend = 2;
// Global alignment.
SequenceAlignment aligner = new NeedlemanWunsch(match, replace, insert,
      delete, gapExtend, matrix);
Sequence query = DNATools.createDNASequence("GCCCTAGCG", "query");
Sequence target = DNATools.createDNASequence("GCGCAATG", "target");
// Perform an alignment and save the results.
aligner.pairwiseAlignment(query, // first sequence
      target // second one
      );
// Print the alignment to the screen
System.out.println("Global alignment with Needleman‑Wunsch:\n"
      + aligner.getAlignmentString());

// Perform a local alignment from the sequences with Smith‑Waterman.
aligner = new SmithWaterman(match, replace, insert, delete, gapExtend,
      matrix);
// Perform the local alignment.
aligner.pairwiseAlignment(query, target);
System.out.println("\nLocal alignment with Smith‑Waterman:\n"
      + aligner.getAlignmentString());

The BioJava methods have a little more generality to them. First, note the use of a SubstitutionMatrix. The examples so far have naively assumed that the penalty for a mismatch between DNA bases should be equal — for example, that a G is as likely to mutate into an A as a C. But this isn’t true in real biological sequences, especially amino acids in proteins. You want to penalize unlikely mismatches more than likely mismatches. A substitution matrix lets you assign match scores individually to each pair of symbols. In a sense, substitution matrices code up chemical properties. For example, the BLOSUM (BLOcks SUbstitution Matrix) matrices for proteins are commonly used in BLAST searches; the values in the BLOSUM matrices were empirically determined.

Next, note the use of insert and delete scores, rather than just a single space score. As I’ve said, you can think of a space as an insertion in the sequence without the space, or as a deletion in the sequence with the space. In general, there are two complementary ways to compare two sequences. You’ve been looking at them in a “static” manner and seeing how they differ. You can also compare them by finding the minimum number of insertions, deletions, and changes of individual symbols you’d have to make to one sequence to transform it into the other. This minimum number of changes is called the edit distance. When calculating the edit distance, you might want to assign different values to insertions and deletions. For example, maybe insertions are more common and you’d want to penalize them less than deletions.

Use of Perl in bioinformatics

Much of the big-server bioinformatics software is written in C or C . BLAST was originally written in C, and now there’s a C version. But many of the small applications written by researchers — who, in many cases, might be professional biologists first and programmers a distant second — are written in Perl. This could be because the biggest open source bioinformatics library, Bioperl, is written in Perl. If you want to get a job doing bioinformatics programming, you’ll probably need to learn Perl and Bioperl at some point.

Now note the gapExtend variable. Technically, a gap is a maximal sequence of contiguous spaces. However, some of the literature uses the term gap when it really means a space. You’ve scored all spaces equally even when they’re part of a larger gap. However, in nature, once a gap has started, the chance of it extending by another space is greater than the chance of it starting to begin with. So, to get meaningful results, you would want to penalize subsequent spaces in a gap less than the initial space in the gap. This is what the gapExtend variable is for. Keep in mind that, algorithmically speaking, all these scoring schemes are somewhat arbitrary, but obviously you want the string edit distances you’re computing to conform to evolutionary distances in nature as closely as possible. (Coming up with appropriate scoring schemes for different situations is quite an interesting and complicated subfield in itself.)

Finally, the insert, delete, and gapExtend variables have positive values, rather than the negative values you used earlier because they are defined as expenses (costs or penalties).

When you run the code in Listing 17, you get the following output:

Listing 18. Output

Global alignment with Needleman‑Wunsch:

  Length:9
  Score:‑0.0
  Query:query,Length:9
  Target:target,Length:8

Query: 1 gccctagcg 9
          || | |  |
Target: 1 gcgc‑aatg 8


Local alignment with Smith‑Waterman:

  Length:3
  Score:3.0
  Query:query,Length:9
  Target:target,Length:8

Query: 7 gcg 9
          |||
Target: 1 gcg 3

For both local and global alignment, you get the same scores as you did earlier. This implementation of Smith-Waterman gives you the same local alignment you obtained earlier. This implementation of Needleman-Wunsch gives you a different global alignment, but with the same score, from the one you obtained earlier. However, they’re both maximal global alignments. Recall that when you’re filling out your table, you can sometimes get a maximum score in a cell from more than one of the previous cells. Depending on which one you choose to point back to, you will end up with different alignments (but all with the same score).

Conclusion

This article has looked at three examples of problems that can be solved using dynamic programming. They all share these characteristics:

The solution to each of them could be expressed as a recurrence relation.
The naive implementation of this recurrence relation as a recursive method would have led to an inefficient solution involving multiple computations of subproblems.
An optimal solution to the problem could be constructed from optimal solutions to subproblems of the original problem.

Dynamic programming is also used in matrix-chain multiplication, assembly-line scheduling, and computer chess programs. It’s often needed to solve tough problems in programming contests. Interested readers can consult the book Introduction to Algorithms for more details on when dynamic programming is applicable and how the correctness of dynamic programming algorithms is usually proved.

Dynamic programming is maybe the most important use of computer science in biology, but certainly not the only one. Bioinformatics and computational biology are interdisciplinary fields that are quickly becoming disciplines in themselves with academic programs dedicated to them. Many molecular biologists now know a little programming, and there’s much interesting and important work to be done by programmers who can learn a little biology.

Acknowledgments

Thanks to Sonna Bristle, who got me interested in computational biology, and Carlos P. Sosa of IBM for reviewing a version of this article and giving helpful suggestions.

你可能感兴趣的:(Dynamic programming and sequence alignment)

老系统改造增加初始化，自动化数据源配置（tomcat+jsp+springmvc）
老系统改造增加初始化，自动化数据源配置一、前言二、改造描述1、环境说明2、实现步骤简要思考三、开始改造1、准备sql初始化文件2、启动时自动读取jdbc文件，创建数据源，如未配置，需要一个默认的临时数据源2.1去掉spingmvc原本配置的固定dataSource，改为动态dataSource2.2代码类，这里是示例，我就不管规范了，放到一起2.2.1DynamicDataSourceConfig
InnoDB引擎行存储结构
InnoDB引擎行存储结构文章目录InnoDB引擎行存储结构1.存储引擎2.InnoDB页的概念3.InnoDB行格式3.1指定行格式3.2COMPACT格式3.3REDUNDANT行格式3.4溢出列3.5DYNAMIC行格式和COMPRESSED行格式1.存储引擎[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-Y7BY5kOU-1643188470321)(C:\U
Dart 语言知识点总结小李飞飞砖 javascript 开发语言 ecmascript
Dart语言知识点总结Dart是Flutter框架的编程语言，是一种面向对象的、强类型的、支持垃圾回收的语言。以下是Dart语言的核心知识点：一、基础语法1.变量与常量//变量声明varname='Alice';//类型推断Stringname='Alice';//显式类型dynamicdynamicVar='String';//动态类型//常量finalfinalVar='不可修改';//运行时
Android-kotlin之Flow基础实战应用每次的天空 android kotlin 开发语言
一、Flow是什么？Flow是一种用于处理异步数据流的强大工具，它基于协程实现，支持响应式编程模式。Flow是一个冷流（ColdStream），即只有在被收集（collect）时才会开始执行，类似于Kotlin序列（Sequence）的惰性求值特性。它可以异步地发射多个值，支持背压（Backpressure）机制。核心特点异步/非阻塞：Flow中的代码可以挂起而不阻塞线程。支持协程上下文：可以在不
web 系统对接飞书三方登录完整步骤实战使用示例慧一居士架构总结架构系统架构
下面我将详细说明Web系统对接飞书三方登录的完整步骤，并提供实战示例（基于Node.js/Express）：一、完整对接流程注册飞书开放平台应用登录飞书开放平台创建企业自建应用→获取AppID和AppSecret配置安全域名和重定向URL（如https://yourdomain.com/auth/feishu/callback）OAuth2.0授权流程sequenceDiagram用户->>你的应
【Python】OpenAI API 宅男很神经 python 开发语言
【Python与OpenAIAPI深度探索：从基础到未来】第一章：OpenAIAPI概览与核心概念1.1OpenAIAPI是什么？能做什么？OpenAIAPI(ApplicationProgrammingInterface，应用程序编程接口)是一套允许开发者通过编程方式访问和使用OpenAI开发的各种先进人工智能模型的服务。这些模型经过海量数据的训练，能够在多种任务上达到甚至超越人类水平。通过AP
动态时间规整（Dynamic Time Warping，DTW）介绍 EmorZhong 机器学习人工智能深度学习数据结构算法
在时序数据分析中，动态时间规整（DynamicTimeWarping，DTW）是一种经典的用于度量两个时间序列相似度的算法。它的核心价值在于解决了传统距离度量（如欧氏距离）在处理时间序列时的局限性——尤其是当序列存在时间错位（如节奏快慢不同）或长度差异时，仍能准确捕捉它们的“形状相似性”。一、为什么需要DTW？传统的距离度量（如欧氏距离）要求两个时间序列必须长度相同且时间点严格对齐。但实际场景中，
Dynamics 365 核心技术深度分析洁辉架构
Dynamics365核心技术深度分析一、实体创建与设置（核心基础）实体是Dynamics365数据模型的核心单元，相当于数据库中的表结构1.实体创建流程确定业务需求创建新实体定义字段/属性设置关系配置视图/表单设置安全性发布自定义项2.关键设置项字段类型：单行文本、选项集（下拉菜单）、两个选项（布尔值）货币、日期时间、查找（关联其他实体）图像、文件（D365v9.0+）高级属性：//字段属性示例
动态时间规整（Dynamic Time Warping，DTW）补充案例 EmorZhong python 人工智能机器学习算法动态规划
DTW的边界条件是确保累积距离矩阵计算“有起点、有规则”的基础，它规定了矩阵中第一行和第一列的累积距离如何计算（因为这两行/列是路径的“起点边缘”，没有“上一步”的全部选择）。下面结合具体场景和例子展开说明：为什么需要边界条件？累积距离矩阵(D[i][j])的核心递归公式是：[D[i][j]=\text{dist}[i][j]+\min\left(D[i-1][j],\D[i][j-1],\D[i
@Api(description = “产品-后台“)：深度剖析 listByBrand 接口的设计与实现 ✨ 小丁学Java Spring Data JPA java
一次API(ApplicationProgrammingInterface,应用程序接口)请求的“奇幻漂流”：深度剖析listByBrand接口的设计与实现大家好！在日常开发中，我们每天都在编写和调用各种API(ApplicationProgrammingInterface,应用程序接口)。一个看似简单的列表查询接口，其背后可能隐藏着精妙的权限控制、清晰的分层设计和灵活的数据处理逻辑。今天，我们就
【论文阅读】SSCL-AMC：一种基于动态增强和集成学习的自监督自动调制分类方法
SSCL-AMC:ASelf-supervisedAutomaticModulationClassificationMethodviaDynamicAugmentationandEnsembleLearning摘要：与传统的手工自动调制分类（AMC）方法相比，深度学习已经显示出有希望的结果，AMC作为信号检测和调制之间的中间步骤发挥着关键作用。然而，获取大规模标记数据仍然具有挑战性，因为数据质量和
lstm 输入数据维度_keras中关于输入尺寸、LSTM的stateful问题 weixin_39856269 lstm 输入数据维度
补充：return_sequence,return_state都是针对一个时间切片(步长)内的h和c状态，而stateful是针对不同的batch之间的。多层LSTM需要设置return_sequence=True,后面再设置return_sequence=False.最近在学习使用keras搭建LSTM的时候，遇到了一些不明白的地方。有些搞懂了，有些还没有搞懂。现在记下来，因为很快就会忘记!-_
torch 填充补齐 AI算法网奇 python宝典 python
目录行填充补齐1.填充长度（Padding）2.掩码（Masking）3.排序优化（可选）行填充补齐importtorchfromtorch.nn.utils.rnnimportpad_sequence#原始序列（每个序列是二维张量，行数不同）batch_data=[torch.tensor([[1,2,3]])#1行#torch.tensor([[4,5,6],[7,8,9]]),#2行#tor
lstm 数据输入问题 AI算法网奇 python基础 lstm 人工智能
lstm我有20*6条数据，20个样本，每个样本6条历史数据，每条数据有5个值，我送给网络输入时应该是20*6*5还是6*20*5你的数据是：20个样本（batchsize=20）每个样本有6条历史数据（sequencelength=6）每条数据有5个值（inputsize=5）✅正确的输入形状是：(20,6,5)#即batch_size=20,seq_len=6,input_size=5前提是你
Spring Boot + Spring JPA + JDBC + Druid实现动态数据源切换 Apr01Chell 代码片段 spring java 数据库
SpringBoot+SpringJPA+JDBC+Druid实现动态数据源切换目录SpringBoot+SpringJPA+JDBC+Druid实现动态数据源切换AbstractRoutingDataSource源码分析需求代码实现DynamicDataSourceDBContextHolderDruidDbConfigDataSourcePropertiesAllDataSourcesExec
RAG 之 Prompt 动态选择的三种方式 2301_79306982 prompt rag ai
“如果我有5个prompt模板，我想只选择一个每次都自动五选一能做到吗怎么做？”完全可以做到。这在复杂的RAG或Agentic工作流中是一个非常普遍且关键的需求，通常被称为“条件路由（ConditionalRouting）”或“动态调度（DynamicDispatching）”。其核心思想是系统需要根据输入的上下文（Query）或其他中间状态，智能地判断哪一个Prompt模板最适合用于生成最终答案
使用STM32CubeMX在嵌入式系统中实现通过FMC读写SDRAM 程序员杨弋嵌入式开发 stm32 嵌入式硬件单片机嵌入式
嵌入式系统中的存储器是非常重要的组成部分，为了满足大容量和高速度要求，SDRAM（SynchronousDynamicRandomAccessMemory）是常用的选择之一。本文将介绍如何使用STM32CubeMX配置硬件FMC（FlexibleMemoryController）以实现在STM32微控制器上读写SDRAM。1、STM32CubeMX配置FMC和SDRAM首先，我们需要打开STM32
华为OD机试专栏--1.3 算法基础：1.3.3 动态规划入门 xiaoheshang_123 华为OD机试真题题库解析华为od 面试职场和发展算法
目录1.3算法基础1.3.3动态规划入门一、动态规划的核心思想1.1什么是动态规划？1.2动态规划的特点二、动态规划的基本步骤三、经典动态规划问题3.1斐波那契数列（FibonacciSequence）问题描述动态规划解法代码实现（Python）3.2背包问题（KnapsackProblem）问题描述动态规划解法代码实现（Python）3.3最长公共子序列（LongestCommonSubsequ
量子化学仿真软件：NWChem_（7）.分子动力学模拟 kkchenjj 化工仿真2 化工仿真模拟人工智能化工仿真
分子动力学模拟分子动力学（MolecularDynamics,MD）模拟是量子化学和材料科学中常用的一种计算方法，用于研究分子系统在不同时间和空间尺度下的行为。通过MD模拟，可以观察分子的运动轨迹、能量变化、温度和压力等物理性质，从而深入了解分子间的相互作用和系统的动力学性质。在NWChem中，MD模拟可以通过多种方法实现，包括经典MD和量子MD。本节将详细介绍如何在NWChem中进行分子动力学模
iOS 各种demo链接汇总~其它UI
//联系人:石虎QQ:1224614774昵称:嗡嘛呢叭咪哄一、其他UIAwesomeMenu-最多人用的Path菜单。DCPathButton-Path，4.0的弹出菜单，呼出或者关闭菜单时，多个小图标会分别按照逆时针和顺时针的方向进行滚动。SphereMenu-利用UIDynamicAnimator的有趣的菜单，path类似。KYGooeyMenu-KYGooeyMenu是一个具有GooeyE
OpenRocket 开发环境搭建指南邓朝昌Estra
OpenRocket开发环境搭建指南openrocketModel-rocketryaerodynamicsandtrajectorysimulationsoftware项目地址:https://gitcode.com/gh_mirrors/op/openrocket前言OpenRocket是一款开源的火箭设计与仿真软件，采用Java语言开发。本文将详细介绍如何搭建OpenRocket的开发环境，
2024年12月30日Github流行趋势
项目名称：free-programming-books项目地址url：https://github.com/EbookFoundation/free-programming-books项目语言：HTML历史star数：343,398今日star数：246项目维护者：vhf,eshellman,davorpa,MHM5000,kadhirash项目简介：免费提供的编程书籍。项目名称：public-a
深入探讨Qt智能指针的用法码农飞飞 QT+QML qt 智能指针指针转换内存泄漏内存管理 QSharedPointer QPointer
文章目录QPointerQSharedPointerQWeakPointerQScopedPointerQScopedArrayPointer类型转换qSharedPointerCastqSharedPointerDynamicCastqSharedPointerConstCastqWeakPointerCastqSharedPointerObjectCastQt的智能指针提供了方便的资源管理工具
Cesium billboard 实现动态效果 swift1224L Cesium 前端开发语言 webgl
Cesiumbillboard实现动态效果自转、跳动、闪烁、旋转等动态效果加载GIF自转、跳动、闪烁、旋转等动态效果/***利用BillboardGraphics的scale、rotation、width、position属性和和CallbackProperty实现动态改变entitybillboard*/loadDynamicBillboardByEntity(viewer:Cesium.Vie
人工智能发展简史——未来是属于AI人工智能的。 AI天才研究院 ChatGPT AI人工智能与大数据人工智能
目录人工智能发展简史第一章：起步期-20世纪50年代及以前1.1计算机象棋博弈（Programmingacomputerforplayingchess）1.2图灵测试（TuringTest）1.3达特茅斯学院人工智能夏季研讨会（DartmouthSummerResearchConferenceonArtificialIntelligence）1.4感知机（Perceptrons）第二章：第一次浪潮
集训DAY7之线性dp与前缀优化/stl优化心之所向凉月空 c++开发语言数据结构算法
集训DAY7之线性DP与前缀优化/STL优化目录DP的概念与思想核心DP的题目类型线性DP详解DP的优化策略后记DP的概念与思想核心DP的定义DP也就是动态规划(DynamicProgramming)是求解决策过程最优化的过程动态规划主要用于求解以时间划分阶段的动态过程的优化问题DP的基本思想动态规划算法通常用于求解具有某种最优性质的问题。在这类问题中我们常常需要在多个可行解中寻找最优解，其基本思
Java手动打印执行过的sql GoodStudyAndDayDayUp java sql 开发语言
1.拦截器packagecom.xxx.platform.common.interceptor;importcom.baomidou.dynamic.datasource.toolkit.DynamicDataSourceContextHolder;importcom.xxx.platform.common.aop.OLAPQuery;importcom.xxx.platform.constant
r语言改变数据框列名_数据决定离线强化学习将如何改变我们的语言习惯杨_明 python 大数据人工智能 java 机器学习
r语言改变数据框列名重点(Tophighlight)Aridesharingcompanycollectsadatasetofpricinganddiscountdecisionswithcorrespondingchangesincustomeranddriverbehavior,inordertooptimizeadynamicpricingstrategy.Anonlinevendorrec
Spark RDD 及性能调优 Aurora_NeAr spark wpf c#
RDDProgrammingRDD核心架构与特性分区（Partitions）：数据被切分为多个分区；每个分区在集群节点上独立处理；分区是并行计算的基本单位。计算函数（ComputeFunction）：每个分区应用相同的转换函数；惰性执行机制。依赖关系（Dependencies）窄依赖：1个父分区→1个子分区（map、filter）。宽依赖：1个父分区→多个子分区（groupByKey、join）。
蜜罐的工作原理和架构
大家读完觉得有帮助记得关注和点赞！！！蜜罐（Honeypot）是一种**主动防御技术**，通过部署虚假系统、服务或数据，诱骗攻击者入侵，从而**捕获攻击行为、分析攻击工具、收集威胁情报**。以下从工作原理到架构的深度解析：---###一、蜜罐核心工作原理####**欺骗三部曲**```mermaidsequenceDiagramattacker->>+honeypot:探测与攻击honeypot-
Algorithm 香水浓 java Algorithm
冒泡排序 public static void sort(Integer[] param) { for (int i = param.length - 1; i > 0; i--) { for (int j = 0; j < i; j++) { int current = param[j]; int next = param[j + 1];
mongoDB 复杂查询表达式开窍的石头 mongodb
1:count Pg: db.user.find().count(); 统计多少条数据 2:不等于$ne Pg: db.user.find({_id:{$ne:3}},{name:1,sex:1,_id:0}); 查询id不等于3的数据。 3：大于$gt $gte(大于等于) &n
Jboss Java heap space异常解决方法, jboss OutOfMemoryError : PermGen space 0624chenhong jvm jboss
转自 http://blog.csdn.net/zou274/article/details/5552630 解决办法： window->preferences->java->installed jres->edit jre 把default vm arguments 的参数设为-Xms64m -Xmx512m ----------------
文件上传下载解析相对路径不懂事的小屁孩文件上传
有点坑吧，弄这么一个简单的东西弄了一天多，身边还有大神指导着，网上各种百度着。下面总结一下遇到的问题：文件上传，在页面上传的时候，不要想着去操作绝对路径，浏览器会对客户端的信息进行保护，避免用户信息收到攻击。在上传图片，或者文件时，使用form表单来操作。前台通过form表单传输一个流到后台，而不是ajax传递参数到后台，代码如下: <form action=&
怎么实现qq空间批量点赞换个号韩国红果果 qq
纯粹为了好玩！！逻辑很简单 1 打开浏览器console；输入以下代码。先上添加赞的代码 var tools={}; //添加所有赞 function init(){ document.body.scrollTop=10000; setTimeout(function(){document.body.scrollTop=0;},2000);//加
判断是否为中文灵静志远中文
方法一： public class Zhidao { public static void main(String args[]) { String s = "sdf灭礌 kjl d{';\fdsjlk是"; int n=0; for(int i=0; i<s.length(); i++) { n = (int)s.charAt(i); if((
一个电话面试后总结 a-john 面试
今天，接了一个电话面试，对于还是初学者的我来说，紧张了半天。面试的问题分了层次，对于一类问题，由简到难。自己觉得回答不好的地方作了一下总结：在谈到集合类的时候，举几个常用的集合类，想都没想，直接说了list,map。然后对list和map分别举几个类型： list方面：ArrayList,LinkedList。在谈到他们的区别时，愣住了
MSSQL中Escape转义的使用 aijuans MSSQL
IF OBJECT_ID('tempdb..#ABC') is not null drop table tempdb..#ABC create table #ABC ( PATHNAME NVARCHAR(50) ) insert into #ABC SELECT N'/ABCDEFGHI' UNION ALL SELECT N'/ABCDGAFGASASSDFA' UNION ALL
一个简单的存储过程 asialee mysql 存储过程构造数据批量插入
今天要批量的生成一批测试数据，其中中间有部分数据是变化的，本来想写个程序来生成的，后来想到存储过程就可以搞定，所以随手写了一个，记录在此： DELIMITER $$ DROP PROCEDURE IF EXISTS inse
annot convert from HomeFragment_1 to Fragment 百合不是茶 android 导包错误
创建了几个类继承Fragment, 需要将创建的类存储在ArrayList<Fragment>中; 出现不能将new 出来的对象放到队列中,原因很简单; 创建类时引入包是:import android.app.Fragment; 创建队列和对象时使用的包是:import android.support.v4.ap
Weblogic10两种修改端口的方法 bijian1013 weblogic 端口号配置管理 config.xml
一.进入控制台进行修改 1.进入控制台: http://127.0.0.1:7001/console 2.展开左边树菜单域结构->环境->服务器-->点击AdminServer(管理) &
mysql 操作指令征客丶 mysql
一、连接mysql 进入 mysql 的安装目录； $ bin/mysql -p [host IP 如果是登录本地的mysql 可以不写 -p 直接 -u] -u [userName] -p 输入密码，回车，接连；二、权限操作［如果你很了解mysql数据库后，你可以直接去修改系统表，然后用 mysql> flush privileges; 指令让权限生效］ 1、赋权 mys
【Hive一】Hive入门 bit1129 hive
Hive安装与配置 Hive的运行需要依赖于Hadoop，因此需要首先安装Hadoop2.5.2，并且Hive的启动前需要首先启动Hadoop。 Hive安装和配置的步骤 1. 从如下地址下载Hive0.14.0 http://mirror.bit.edu.cn/apache/hive/ 2.解压hive，在系统变
ajax 三种提交请求的方法 BlueSkator Ajax jqery
1、ajax 提交请求 $.ajax({ type:"post", url : "${ctx}/front/Hotel/getAllHotelByAjax.do", dataType : "json", success : function(result) { try { for(v
mongodb开发环境下的搭建入门 braveCS 运维
linux下安装mongodb 1）官网下载mongodb-linux-x86_64-rhel62-3.0.4.gz 2）linux 解压 gzip -d mongodb-linux-x86_64-rhel62-3.0.4.gz; mv mongodb-linux-x86_64-rhel62-3.0.4 mongodb-linux-x86_64-rhel62-
编程之美-最短摘要的生成 bylijinnan java 数据结构算法编程之美
import java.util.HashMap; import java.util.Map; import java.util.Map.Entry; public class ShortestAbstract { /** * 编程之美最短摘要的生成 * 扫描过程始终保持一个[pBegin,pEnd]的range,初始化确保[pBegin,pEnd]的ran
json数据解析及typeof chengxuyuancsdn js typeof json解析
// json格式 var people='{"authors": [{"firstName": "AAA","lastName": "BBB"},' +' {"firstName": "CCC&
流程系统设计的层次和目标 comsci 设计模式数据结构 sql 框架脚本
流程系统设计的层次和目标
RMAN List和report 命令 daizj oracle list report rman
LIST 命令使用RMAN LIST 命令显示有关资料档案库中记录的备份集、代理副本和映像副本的信息。使用此命令可列出： • RMAN 资料档案库中状态不是AVAILABLE 的备份和副本 • 可用的且可以用于还原操作的数据文件备份和副本 • 备份集和副本，其中包含指定数据文件列表或指定表空间的备份 • 包含指定名称或范围的所有归档日志备份的备份集和副本 • 由标记、完成时间、可
二叉树:红黑树 dieslrae 二叉树
红黑树是一种自平衡的二叉树,它的查找,插入,删除操作时间复杂度皆为O(logN),不会出现普通二叉搜索树在最差情况时时间复杂度会变为O(N)的问题. 红黑树必须遵循红黑规则,规则如下 1、每个节点不是红就是黑。 2、根总是黑的 &
C语言homework3，7个小题目的代码 dcj3sjt126com c
1、打印100以内的所有奇数。 # include <stdio.h> int main(void) { int i; for (i=1; i<=100; i++) { if (i%2 != 0) printf("%d ", i); } return 0; } 2、从键盘上输入10个整数，
自定义按钮, 图片在上, 文字在下, 居中显示 dcj3sjt126com 自定义
#import <UIKit/UIKit.h> @interface MyButton : UIButton -(void)setFrame:(CGRect)frame ImageName:(NSString*)imageName Target:(id)target Action:(SEL)action Title:(NSString*)title Font:(CGFloa
MySQL查询语句练习题，测试足够用了 flyvszhb sql mysql
http://blog.sina.com.cn/s/blog_767d65530101861c.html 1.创建student和score表 CREATE TABLE student ( id INT(10) NOT NULL UNIQUE PRIMARY KEY , name VARCHAR
转：MyBatis Generator 详解 happyqing mybatis
MyBatis Generator 详解 http://blog.csdn.net/isea533/article/details/42102297 MyBatis Generator详解 http://git.oschina.net/free/Mybatis_Utils/blob/master/MybatisGeneator/MybatisGeneator.
让程序员少走弯路的14个忠告 jingjing0907 工作计划学习
无论是谁，在刚进入某个领域之时，有再大的雄心壮志也敌不过眼前的迷茫：不知道应该怎么做，不知道应该做什么。下面是一名软件开发人员所学到的经验，希望能对大家有所帮助 1.不要害怕在工作中学习。只要有电脑，就可以通过电子阅读器阅读报纸和大多数书籍。如果你只是做好自己的本职工作以及分配的任务，那是学不到很多东西的。如果你盲目地要求更多的工作，也是不可能提升自己的。放
nginx和NetScaler区别流浪鱼 nginx
NetScaler是一个完整的包含操作系统和应用交付功能的产品，Nginx并不包含操作系统，在处理连接方面，需要依赖于操作系统，所以在并发连接数方面和防DoS攻击方面，Nginx不具备优势。 2.易用性方面差别也比较大。Nginx对管理员的水平要求比较高，参数比较多，不确定性给运营带来隐患。在NetScaler常见的配置如健康检查，HA等，在Nginx上的配置的实现相对复杂。 3.策略灵活度方
第11章动画效果（下） onestopweb 动画
index.html <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/
FAQ - SAP BW BO roadmap blueoxygen BO BW
http://www.sdn.sap.com/irj/boc/business-objects-for-sap-faq Besides, I care that how to integrate tightly. By the way, for BW consultants, please just focus on Query Designer which i
关于java堆内存溢出的几种情况 tomcat_oracle java jvm jdk thread
【情况一】：　　 java.lang.OutOfMemoryError: Java heap space：这种是java堆内存不够，一个原因是真不够，另一个原因是程序中有死循环；　　如果是java堆内存不够的话，可以通过调整JVM下面的配置来解决：　　<jvm-arg>-Xms3062m</jvm-arg> 　　<jvm-arg>-Xmx
Manifest.permission_group权限组阿尔萨斯 Permission
结构继承关系 public static final class Manifest.permission_group extends Object java.lang.Object android. Manifest.permission_group 常量 ACCOUNTS 直接通过统计管理器访问管理的统计 COST_MONEY可以用来让用户花钱但不需要通过与他们直接牵涉的权限 D