Explanation: Java DNA Programming Assignment

Requirement

CSE 142, Winter 2018
Programming Assignment #7: DNA (20 points)
Due Tuesday, February 27th, 9:00 PM

Special thanks to UW CSE professor Martin Tompa for his help with the development of this assignment.

This assignment focuses on arrays and file/text processing. Turn in a file named DNA.java. You will also need the two input files dna.txt and ecoli.txt from the course web site. Save these files in the same folder as your program.

The assignment involves processing data from genome files. Your program should work with the two given input files. If you are curious (this is not required), the National Center for Biotechnology Information publishes many other bacteria genome files. The final section, Additional Input Files, tells you how to use your program to process other published genome files.

Background Information About DNA:

Note: This section explains some information from the field of biology that is related to this assignment. It is for your information only; you do not need to fully understand it to complete the assignment.

Deoxyribonucleic acid (DNA) is a complex biochemical macromolecule that carries genetic information for cellular life forms and some viruses. DNA is also the mechanism through which genetic information from parents is passed on during reproduction. DNA consists of long chains of chemical compounds called nucleotides. Four nucleotides are present in DNA: Adenine (A), Cytosine (C), Guanine (G), and Thymine (T).
DNA has a double-helix structure (see diagram below) containing complementary chains of these four nucleotides connected by hydrogen bonds.

Certain regions of the DNA are called genes. Most genes encode instructions for building proteins (they're called "protein-coding" genes). These proteins are responsible for carrying out most of the life processes of the organism. Nucleotides in a gene are organized into codons. Codons are groups of three nucleotides and are written as the first letters of their nucleotides (e.g., TAC or GGA). Each codon uniquely encodes a single amino acid, a building block of proteins. The process of building proteins from DNA has two major phases called transcription and translation, in which a gene is replicated into an intermediate form called mRNA, which is then processed by a structure called a ribosome to build the chain of amino acids encoded by the codons of the gene.

[Figures: the chemical structure of DNA; DNA translation.]

The sequences of DNA that encode proteins occur between a start codon (which we will assume to be ATG) and a stop codon (which is any of TAA, TAG, or TGA). Not all regions of DNA are genes; large portions that do not lie between a valid start and stop codon are called intergenic DNA and have other (possibly unknown) functions. Computational biologists examine large DNA data files to find patterns and important information, such as which regions are genes. Sometimes they are interested in the percentages of mass accounted for by each of the four nucleotide types. Often high percentages of Cytosine (C) and Guanine (G) are indicators of important genetic data.

For more information, visit the Wikipedia page about DNA.

In this assignment you read an input file containing named sequences of nucleotides and produce information about them. For each nucleotide sequence, your program counts the occurrences of each of the four nucleotides (A, C, G, and T).
The program also computes the mass percentage occupied by each nucleotide type, rounded to one digit past the decimal point. Next the program reports the codons (trios of nucleotides) present in each sequence and predicts whether or not the sequence is a protein-coding gene. For us, a protein-coding gene is a string that matches all of the following constraints:

• begins with a valid start codon (ATG)
• ends with a valid stop codon (one of the following: TAA, TAG, or TGA)
• contains at least 5 total codons (including its initial start codon and final stop codon)
• Cytosine (C) and Guanine (G) combined account for at least 30% of its total mass

(These are approximations for our assignment, not exact constraints used in computational biology to identify proteins.)

The DNA input data consists of line pairs. The first line has the name of the nucleotide sequence, and the second is the nucleotide sequence itself. Each character in a sequence of nucleotides will be A, C, G, T, or a dash character, "-". The nucleotides in the input can be either upper or lowercase.

Input file dna.txt (partial):

    cure for cancer protein
    ATGCCACTATGGTAG
    captain picard hair growth protein
    ATgCCAACATGgATGCCcGATAtGGATTgA
    bogus protein
    CCATt-AATgATCa-CAGTt
    …

The dash "-" characters represent "junk" or "garbage" regions in the sequence. For most of the program they should be ignored in your computations, though they do contribute to the total mass of the sequence as described later.

Program Behavior:

Your program begins with an introduction and prompts for input and output file names. You may assume the user will type the name of an existing input file that is in the proper format. Your program reads the input file to process its nucleotide sequences and outputs the results into the given output file. Notice the nucleotide sequence is output in uppercase, and that the nucleotide counts and mass percentages are shown in A, C, G, T order.
A given codon such as GAT might occur more than once in the same sequence.

Log of execution (the user types the file names after each prompt):

    This program reports information about DNA
    nucleotide sequences that may encode proteins.
    Input file name? dna.txt
    Output file name? output.txt

Output file output.txt after above execution (partial):

    Region Name: cure for cancer protein
    Nucleotides: ATGCCACTATGGTAG
    Nuc. Counts: [4, 3, 4, 4]
    Total Mass%: [27.3, 16.8, 30.6, 25.3] of 1978.8
    Codons List: [ATG, CCA, CTA, TGG, TAG]
    Is Protein?: YES

    Region Name: captain picard hair growth protein
    Nucleotides: ATGCCAACATGGATGCCCGATATGGATTGA
    Nuc. Counts: [9, 6, 8, 7]
    Total Mass%: [30.7, 16.8, 30.5, 22.1] of 3967.5
    Codons List: [ATG, CCA, ACA, TGG, ATG, CCC, GAT, ATG, GAT, TGA]
    Is Protein?: YES

    Region Name: bogus protein
    Nucleotides: CCATT-AATGATCA-CAGTT
    Nuc. Counts: [6, 4, 2, 6]
    Total Mass%: [32.3, 17.7, 12.1, 29.9] of 2508.1
    Codons List: [CCA, TTA, ATG, ATC, ACA, GTT]
    Is Protein?: NO

Implementation Guidelines, Hints, and Development Strategy:

The main purpose of this assignment is to demonstrate your understanding of arrays and array traversals with for loops. Therefore, you should use arrays to store the various data for each sequence. In particular, your nucleotide counts, mass percentages, and codons should all be stored using arrays. Additionally you should use arrays and for loops to transform the data from one form to another as follows:

• from the original nucleotide sequence string to nucleotide counts;
• from nucleotide counts to mass percentages; and
• from the original nucleotide sequence string to codon triplets.

These transformations are summarized by the following diagram using the "cure for cancer" protein data:

    Nucleotides: "ATGCCACTATGGTAG"

    What is computed                        Output to file
    Counts: {4, 3, 4, 4}                    Nuc. Counts: [4, 3, 4, 4]
    Mass %: {27.3, 16.8, 30.6, 25.3}        Total Mass%: [27.3, 16.8, 30.6, 25.3] of 1978.8
    Codons: {ATG, CCA, CTA, TGG, TAG}       Codons List: [ATG, CCA, CTA, TGG, TAG]
    Is protein?: YES

Recall that you can print any array using the method Arrays.toString. For example:

    // Arrays lives in java.util, so the program needs: import java.util.*;
    int[] numbers = {10, 20, 30, 40};
    System.out.println("my data is " + Arrays.toString(numbers)); // my data is [10, 20, 30, 40]

To compute mass percentages, use the following as the mass of each nucleotide (grams/mol). The dashes representing "junk" regions are excluded from many parts of your computations, but they do contribute mass to the total.

• Adenine (A): 135.128
• Cytosine (C): 111.103
• Guanine (G): 151.128
• Thymine (T): 125.107
• Junk (-): 100.000

For example, the mass of the sequence ATGG-AC is (135.128 + 125.107 + 151.128 + 151.128 + 100.000 + 135.128 + 111.103) or 908.722. Of this, 270.256 (29.7%) is from the two Adenines; 111.103 (12.2%) is from the Cytosine; 302.256 (33.3%) is from the two Guanines; 125.107 (13.8%) is from the Thymine; and 100.000 (11.0%) is from the "junk" dash.

We suggest that you start this program by writing the code to read the input file. Try writing code to simply read each protein's name and sequence of nucleotides and print them. Read each line from the input file using Scanner's nextLine method. This will read an entire line of input and return it as a String.

Next, write code to pass over a nucleotide sequence and count the number of As, Cs, Gs, and Ts. You can use a String's charAt method to get individual characters. Put your counts into an array of size 4. To map between nucleotides and array indexes, you may want to write a method that converts a single character (i.e. A, C, T, G) into indices (i.e. 0 to 3).

Once you have the counts working correctly, you can convert your counts into a new array of percentages of mass for each nucleotide using the preceding nucleotide mass values.
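
The counting step described above might be sketched as follows. This is only one possible shape, not the required solution: the class name NucleotideCountSketch and the method names toIndex and countNucleotides are illustrative assumptions, and the sketch simply skips dashes and any other character that is not A, C, G, or T.

```java
import java.util.Arrays;

// Illustrative sketch only: the class and method names are assumptions,
// not names required by the assignment.
class NucleotideCountSketch {
    public static final int NUCLEOTIDES = 4; // A, C, G, and T

    // Maps a nucleotide letter (either case) to an index in A, C, G, T order.
    // Returns -1 for anything else, such as a "junk" dash.
    public static int toIndex(char nucleotide) {
        return "ACGT".indexOf(Character.toUpperCase(nucleotide));
    }

    // Traverses the sequence once and tallies each nucleotide into an array.
    public static int[] countNucleotides(String sequence) {
        int[] counts = new int[NUCLEOTIDES];
        for (int i = 0; i < sequence.length(); i++) {
            int index = toIndex(sequence.charAt(i));
            if (index >= 0) {
                counts[index]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = countNucleotides("ATGCCACTATGGTAG");
        System.out.println(Arrays.toString(counts)); // [4, 3, 4, 4]
    }
}
```

Keeping the letter-to-index mapping in a single method means the same A, C, G, T order can be reused wherever another array is indexed by nucleotide.
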
If you've written code to map between nucleotide letters and array indexes, it may also help you to look up mass values in an array such as the following:

    double[] masses = {135.128, 111.103, 151.128, 125.107};

You may store your mass percentages already rounded to one digit past the decimal or you can round when printing the mass percentages array using printf. If you choose to store the percentages pre-rounded, use Math.round as follows:

    double num = 1.6666667;
    double rounded = Math.round(num * 10.0) / 10.0;
    System.out.print("the answer is " + rounded); // the answer is 1.7

Remember that the "junk" dashes do contribute mass to the total. For other parts of your program you may want to remove dashes from the input; consider using the replace method on the nucleotide string to eliminate these characters.

After computing mass percentages, you must break apart the sequence into codons and examine each codon. You may wish to review the methods of String objects as presented in Chapters 3 and 4, such as substring, charAt, indexOf, replace, toUpperCase, and toLowerCase.

We also suggest that you first get your program working correctly printing its output to the console before you save the output to a file. Once you have your program printing correct output to the console, save the output to a file by using a PrintStream as described in Section 6.4 of the textbook.

You may assume that the input file exists, is readable, and contains valid input. (In other words, you should not re-prompt for input or output file names.) You may assume that each sequence's number of nucleotides (without dashes) will be a multiple of 3, although the nucleotides on a line might be in either uppercase or lowercase or a combination. Your program should overwrite any existing data in the output file (this is the default PrintStream behavior).
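
The remaining transformations (counts to mass percentages, sequence to codons, and the protein test defined earlier) could look roughly like the sketch below. Every name in it (MassAndCodonSketch, MIN_CODONS, MIN_CG_PERCENT, JUNK_MASS, totalMass, massPercentages, codons, isProtein) is a hypothetical choice for illustration. It also assumes the percentages are stored pre-rounded and that the C and G test uses those rounded values, which the handout does not specify.

```java
import java.util.Arrays;

// Illustrative sketch only: all class, constant, and method names are assumptions.
class MassAndCodonSketch {
    public static final int MIN_CODONS = 5;           // minimum codons in a valid protein
    public static final int MIN_CG_PERCENT = 30;      // minimum combined C and G mass percentage
    public static final int NUCLEOTIDES = 4;          // A, C, G, and T
    public static final int NUCLEOTIDES_PER_CODON = 3;
    public static final double[] MASSES = {135.128, 111.103, 151.128, 125.107}; // A, C, G, T
    public static final double JUNK_MASS = 100.000;   // mass of one dash

    // Total mass: every counted nucleotide plus every junk dash.
    public static double totalMass(int[] counts, int junkCount) {
        double total = junkCount * JUNK_MASS;
        for (int i = 0; i < NUCLEOTIDES; i++) {
            total += counts[i] * MASSES[i];
        }
        return total;
    }

    // Percentage of the total mass per nucleotide, rounded to one decimal place.
    public static double[] massPercentages(int[] counts, double totalMass) {
        double[] percentages = new double[NUCLEOTIDES];
        for (int i = 0; i < NUCLEOTIDES; i++) {
            double percent = counts[i] * MASSES[i] / totalMass * 100.0;
            percentages[i] = Math.round(percent * 10.0) / 10.0;
        }
        return percentages;
    }

    // Removes dashes, uppercases, and cuts the sequence into groups of three.
    public static String[] codons(String sequence) {
        String cleaned = sequence.replace("-", "").toUpperCase();
        String[] result = new String[cleaned.length() / NUCLEOTIDES_PER_CODON];
        for (int i = 0; i < result.length; i++) {
            int start = i * NUCLEOTIDES_PER_CODON;
            result[i] = cleaned.substring(start, start + NUCLEOTIDES_PER_CODON);
        }
        return result;
    }

    // Applies the four constraints; uses the rounded percentages (an assumption).
    public static boolean isProtein(String[] codons, double[] percentages) {
        if (codons.length == 0 || codons.length < MIN_CODONS) {
            return false;
        }
        String last = codons[codons.length - 1];
        boolean validStart = codons[0].equals("ATG");
        boolean validStop = last.equals("TAA") || last.equals("TAG") || last.equals("TGA");
        boolean enoughCG = percentages[1] + percentages[2] >= MIN_CG_PERCENT;
        return validStart && validStop && enoughCG;
    }

    public static void main(String[] args) {
        int[] counts = {4, 3, 4, 4}; // counts for the "cure for cancer" sequence
        double[] percentages = massPercentages(counts, totalMass(counts, 0));
        String[] codonList = codons("ATGCCACTATGGTAG");
        System.out.println(Arrays.toString(percentages)); // [27.3, 16.8, 30.6, 25.3]
        System.out.println(Arrays.toString(codonList));   // [ATG, CCA, CTA, TGG, TAG]
        System.out.println(isProtein(codonList, percentages)); // true
    }
}
```

Passing the junk count into totalMass keeps the dashes in the total while the counts array stays at size 4, which matches the ATGG-AC example: two Adenines, one Cytosine, two Guanines, one Thymine, and one dash give 908.722.
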

Style Guidelines:

For this assignment you are required to have the following four class constants:

• one for the minimum number of codons a valid protein must have, as an integer (default of 5)
• a second for the percentage of mass from C and G in order for a protein to be valid, as an integer (default of 30)
• a third for the number of unique nucleotides (4, representing A, C, G, and T)
• a fourth for the number of nucleotides per codon (3)

For full credit it should be possible to change the first two constant values (minimum codons and minimum mass percentage) and cause your program to change its behavior for evaluating protein validity. The other two constants won't ever be changed but are still useful to make your program more readable. Refer to these constants in your code and do not refer to the bare number such as 4 or 3 directly. You may use additional constants if they make your code clearer.

We will grade your method structure strictly on this assignment. Use at least four nontrivial methods besides main. These methods should use parameters and returns, including arrays, as appropriate. The methods should be well-structured and avoid redundancy. No one method should do too large a share of the overall task. The textbook's case study at the end of Chapter 7 is a good example of a larger program with methods that pass arrays as parameters.

In particular, we require that you have a method that can be called exactly once to print all file output for a given potential protein (nucleotides, counts, %, is it a protein, etc.). In other words, all output to the file should be done through one method called on each nucleotide sequence from the input. Your other methods should do the computations to gather information to be passed to this output method.

Your main method should be a concise summary of the overall program. It is okay for main to contain some code such as println statements. But main should not perform too large a share of the overall work itself, such as examining each character of an input line. Also avoid "chaining," when many methods call each other without ever returning to main.

We will also check strictly for redundancy on this assignment. If you have a very similar piece of code that is repeated several times in your program, eliminate the redundancy such as by creating a method, by using for loops over the elements of arrays, and/or by factoring if/else code as described in section 4.3 of the textbook.

Since arrays are a key component of this assignment, part of your grade comes from using arrays properly. For example, you should reduce redundancy as appropriate by using traversals over arrays (for loops over the array's elements). This is preferable to writing out a separate statement for each array element (a statement for element [0], then another for [1], then for [2], etc.). Also carefully consider how arrays should be passed as parameters and/or returned from methods as you are decomposing your program. Recall that arrays use reference semantics when passed as parameters, meaning that an array passed to a method can be modified by that method and the changes will be seen by the caller.

You are limited to features in Chapters 1 through 7. Follow past style guidelines such as indentation, names, variables, types, line lengths, and comments (at the beginning of your program, on each method, and on complex sections of code).

Additional Input Files (Optional):

If you would like to generate additional input files to test your program, you can create them from actual NCBI genetic data. The following web site has many data files that contain complete genomes for viruses:

ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/

The site contains many directories with names of organisms. After entering a directory, you can find and save a genome file (a file whose name ends with .fna) and a protein table (a file whose name ends with .ptt). On the course web site we will provide you with a program to convert these .fna and .ptt files into input files suitable for your homework.

Reposted from: http://ass.3daixie.com/2018052696731851.html
