A DNA sequence consists of four letters, A, C, G, and T. The GC-ratio of a DNA sequence is the number of Cs and Gs of the sequence divided by the length of the sequence. GC-ratio is important in gene finding because DNA sequences with relatively high GC-ratios might be good candidates for the starting parts of genes. Given a very long DNA sequence, researchers are usually interested in locating a subsequence whose GC-ratio is maximum over all subsequences of the sequence. Since short subsequences with high GC-ratios are sometimes meaningless in gene finding, a length lower bound is given to ensure that a long subsequence with high GC-ratio could be found. If, in a DNA sequence, a 0 is assigned to every A and T and a 1 to every C and G, the DNA sequence is transformed into a binary sequence of the same length. GC-ratios in the DNA sequence are now equivalent to averages in the binary sequence.
Index    | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
Sequence | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1  | 1  | 0  | 1  | 1  | 0  | 1  | 0  |
For the binary sequence above, if the length lower bound is 7, the maximum average is 6/8, attained by the subsequence [7,14]. Its length is 8, which is greater than the length lower bound 7. If the length lower bound is 5, then the subsequence [7,11] gives the maximum average 4/5. Its length is 5, which is equal to the length lower bound. For the subsequence [7,11], 7 is its starting index and 11 is its ending index.
Given a binary sequence and a length lower bound L, write a program to find a subsequence of the binary sequence whose length is at least L and whose average is maximum over all subsequences of the binary sequence. If two or more subsequences have the maximum average, then find the shortest one; and if two or more shortest subsequences with the maximum average exist, then find the one with the smallest starting index.
Your program is to read from standard input. The input consists of T test cases. The number of test cases T is given in the first line of the input. Each test case starts with a line containing two integers n (1 ≤ n ≤ 100,000) and L (1 ≤ L ≤ 1,000), which are the length of a binary sequence and a length lower bound, respectively. In the next line, a string, the binary sequence, of length n is given.
Your program is to write to standard output. Print the starting and ending index of the subsequence.
The following shows sample input and output for two test cases.
Sample input:
2
17 5
00101011011011010
20 4
11100111100111110000

Sample output:
7 11
6 9
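The task: find a substring of length at least L in which the proportion of 1s is as large as possible. If there are several, take the shortest; if there are still several, take the earliest.
I also thought of storing prefix sums in sum and maximizing (sum[b]-sum[a])/(b-a). But I could not work out how to find a: trying every a for each b is certainly too slow and would time out. In the end I looked the approach up online; I had not seen this technique before.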
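To make the prefix-sum formulation concrete, here is a minimal brute-force sketch of the naive idea. It is not the solution that passes the time limit (it is quadratic in n), but it is handy as a reference to check a faster version against. The function name `bestSegmentBruteForce` is my own, and averages are compared by cross-multiplication so no floating point is needed.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Brute-force reference: returns the 1-based {start, end} of the best
// subsequence of length >= L. O(n^2), so only suitable for small inputs.
std::pair<int, int> bestSegmentBruteForce(const std::string& bits, int L) {
    const int n = static_cast<int>(bits.size());
    assert(L >= 1 && L <= n);

    // sum[i] = number of 1s among the first i characters.
    std::vector<int> sum(n + 1, 0);
    for (int i = 1; i <= n; ++i) sum[i] = sum[i - 1] + (bits[i - 1] == '1');

    // The pair (a, b) stands for positions a+1..b, with average
    // (sum[b] - sum[a]) / (b - a).
    int bestA = 0, bestB = L;
    for (int a = 0; a + L <= n; ++a) {
        for (int b = a + L; b <= n; ++b) {
            // Compare the two averages by cross-multiplying.
            long long lhs = 1LL * (sum[b] - sum[a]) * (bestB - bestA);
            long long rhs = 1LL * (sum[bestB] - sum[bestA]) * (b - a);
            // Update only on a larger average, or an equal average with a
            // shorter length; since a ascends, the smallest start wins ties.
            if (lhs > rhs || (lhs == rhs && b - a < bestB - bestA)) {
                bestA = a;
                bestB = b;
            }
        }
    }
    return {bestA + 1, bestB};
}
```

For the first sample, `bestSegmentBruteForce("00101011011011010", 5)` gives the pair (7, 11).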
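Think of the prefix sums as a non-decreasing sequence of points, with i as the x-coordinate and sum[i] as the y-coordinate. The task is then to find two points on this plot whose x-coordinates differ by at least L and whose connecting line has the maximum slope. Use a queue to hold the starting points that may still be useful. Let i run from L up to N; at each step we add i-L as a new starting point. Let rear be the last element currently in the queue. If the slope from rear to i-L is less than or equal to the slope from the element before rear to rear, that is, the three points (the element before rear, rear, i-L) joined in order do not form a downward-convex curve, then rear can be deleted, because no later point can ever form the maximum slope with rear (draw a picture to see this). Keep deleting from the back until nothing more can be removed, then append i-L. Besides removing useless points from the back, we can also remove some from the front. If the slope from front to i is less than the slope from the element after front to i, then front can be deleted: since y is monotonically non-decreasing, any later point that could still improve the maximum slope clearly does better with the element after front. (On equal slopes front can be deleted too, because the later starting point gives the shorter subsequence.) Once the front has been trimmed in this way, i together with the current front is the best answer with i as the ending point.
In short, the points in the queue are exactly those that may still serve as a starting point, and they form a downward-convex curve. This saves time compared with scanning every starting point, because the useless points are skipped.
This problem really only becomes clear once you draw the picture.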
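The queue procedure described above can be sketched roughly as follows. This is my own minimal reconstruction under stated assumptions, not the original listing: the names `bestSegment`, `cmpSlope` and `q` are hypothetical, the queue is a plain array with `front`/`rear` indices, and slopes are compared by cross-multiplication in `long long` to avoid floating point. The front is trimmed on equal slopes as well, so that the shorter subsequence is preferred.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Returns the 1-based {start, end} of the subsequence of length >= L with
// maximum average; ties go to the shortest, then to the smallest start.
std::pair<int, int> bestSegment(const std::string& bits, int L) {
    const int n = static_cast<int>(bits.size());
    assert(L >= 1 && L <= n);

    // sum[i] = number of 1s among the first i characters.
    std::vector<int> sum(n + 1, 0);
    for (int i = 1; i <= n; ++i) sum[i] = sum[i - 1] + (bits[i - 1] == '1');

    // Sign of slope(a,b) - slope(c,d) for points (i, sum[i]), a < b, c < d.
    auto cmpSlope = [&](int a, int b, int c, int d) -> long long {
        return 1LL * (sum[b] - sum[a]) * (d - c)
             - 1LL * (sum[d] - sum[c]) * (b - a);
    };

    std::vector<int> q(n + 1);  // candidate starts; live part is [front, rear)
    int front = 0, rear = 0;
    int bestA = 0, bestB = L;   // (a, b) stands for positions a+1..b

    for (int i = L; i <= n; ++i) {
        const int cand = i - L;
        // Drop back elements that would break the downward-convex shape.
        while (rear - front >= 2 &&
               cmpSlope(q[rear - 1], cand, q[rear - 2], q[rear - 1]) <= 0) {
            --rear;
        }
        q[rear++] = cand;
        // Drop front elements while the next candidate is at least as steep.
        while (rear - front >= 2 &&
               cmpSlope(q[front], i, q[front + 1], i) <= 0) {
            ++front;
        }
        // q[front] is now the best start for end i; compare with the best so far.
        const int a = q[front];
        const long long c = cmpSlope(a, i, bestA, bestB);
        if (c > 0 || (c == 0 && i - a < bestB - bestA)) {
            bestA = a;
            bestB = i;
        }
    }
    return {bestA + 1, bestB};
}
```

Each index enters and leaves the queue at most once, so the loop does a linear amount of work per test case. For the two sample cases, `bestSegment("00101011011011010", 5)` gives (7, 11) and `bestSegment("11100111100111110000", 4)` gives (6, 9).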
#include
#include
#include
#include
#include
#include
#include
#include