Winter 2020 (Jan 6 - Apr 6)
Faculty of Computer Science
CSCI 4152/6509: Natural Language Processing

Assignment 2

Assignment Instructions: The submission process for Assignment 2 is mostly based on the submit-nlp command on bluenose, as discussed in the lab, or equivalently on the course web site, where you need to follow ‘Login’ and then the ‘File Submission’ menu option. Some questions may have specific, different submission instructions, in which case you should follow those.

Important: You must make sure that your course files on bluenose are not readable by other users. For example, if you keep your files in the directory csci6509 or csci4152, you can check its permissions using the command:

    ls -ld csci6509    or    ls -ld csci4152

and the output must start with drwx------. If it does not, for example if it starts with drwxr-xr-x or similar, then the permissions should be fixed using the command:

    chmod 700 csci6509    or    chmod 700 csci4152

1) (22 marks) Complete Lab 3 as instructed. In particular, you will need to properly:
a) (3 marks) Submit the example file ‘array-examples.pl’ as instructed.
b) (3 marks) Submit the example file ‘test-hash.pl’ as instructed.
c) (5 marks) Submit the program file ‘letter_counter_blanks.pl’ and the output file ‘out_letters.txt’ (or out_letters.txt.gz) as instructed.
d) (6 marks) Submit the program ‘word_counter.pl’ and the output file ‘out_word_counter.txt’ as instructed.
e) (5 marks) Submit the program ‘word_counter2.pl’ as instructed (Step 6).

Note that any program that you submit needs to compile. Even if complete source code is given in the lab, you need to type it instead of using cut-and-paste, and you need to make sure not to introduce any errors into the program. This follows from the lab instructions that programs must be tested before submitting.

2) (26 marks) Complete Lab 4 (GitLab and Git) as instructed.
In particular, you will need to properly:
a) (5 marks) Prepare and submit via Git the file README.md as instructed.
b) (5 marks) Prepare and submit via Git the directory lab5g and your public key id_rsa.pub as instructed.
c) (6 marks) Prepare and submit via Git the files explore.pl (commits for version 1.0 and 1.1) and the Hamlet file as instructed.
d) (5 marks) Create the branch ada-main-program with the required commits, later merged into master, as instructed.
e) (5 marks) Create the branch bob-function-explore with the required commits, later merged into master, as instructed.

3) (12 marks) Complete Lab 5 (Python NLTK 1) as instructed, and submit files using the command submit-nlp as instructed. In particular, you will need to properly:
a) (3 marks) Submit the file ‘list_merge.py’ as instructed.
b) (3 marks) Submit the file ‘stop_word_removal.py’ as instructed.
c) (3 marks) Submit the file ‘explore_corpus.py’ as instructed.
d) (3 marks) Submit the file ‘movie_rev_classifier.py’ as instructed.

4) (25 marks) Your solution should be in a file named a2q4.pdf, a2q4.jpg, or a2q4.txt and it should be submitted using the submit-nlp command.

Your solution can be submitted as a plain-text, PDF, or JPG image file. Parts b) and c) require drawing curves, which you can create in any software that you like, or even draw by hand using a ruler and take a picture of it. It should be clear to read and easy to verify that it is correct.

Suppose that a search engine returned 16 ranked results to our query, and when we checked them, the following are our judgements on their relevance:

1. relevant
2. relevant
3. relevant
4. not relevant
5. relevant
6. relevant
7. not relevant
8. not relevant
9. relevant
10. relevant
11. not relevant
12. not relevant
13. relevant
14. not relevant
15. not relevant
16. not relevant

Assuming that the total number of relevant documents in the collection is 25, do the following tasks:
a) (5 marks) Calculate precision, recall, and F-measure for the returned results.
b) (10 marks) Draw the Precision-Recall curve for these results. Show how the appropriate coordinates were calculated.
c) (10 marks) Draw the interpolated precision curve. Show how the appropriate coordinates were calculated.

5) (35 marks) Submit your solution as a file named a2q5.txt or a2q5.pdf using the command submit-nlp. Let us assume that you work on a problem of detecting positive microblog messages about financial markets. After analyzing a set of messages you found that three features F1, F2, and F3 of messages are particularly useful in recognizing whether a message is positive or negative. You decided to work on creating a small Naïve Bayes classifier to classify messages into positive and not-positive classes. To summarize, your model uses the following features:

• The feature F1 ∈ {t, f}, which is set to ‘t’ (true) if feature F1 is present in a message, and otherwise it is set to ‘f’ (false).
• The feature F2 ∈ {t, f}, which is set to ‘t’ (true) if feature F2 is present in a message, and otherwise it is set to ‘f’ (false).
• The feature F3 ∈ {t, f}, which is set to ‘t’ (true) if feature F3 is present in a message, and otherwise it is set to ‘f’ (false).

The class itself is represented using the class variable C ∈ {p, n}, where p stands for a positive message, and n stands for a non-positive message.

The training data is presented in the following table:

    messages  F1  F2  F3  C
           8   t   t   t  p
           2   t   t   t  n
          42   t   t   f  p
           1   t   t   f  n
           4   t   f   t  p
          31   t   f   t  n
          22   t   f   f  p
           6   t   f   f  n
           1   f   t   t  p
          11   f   t   t  n
           3   f   t   f  p
           3   f   t   f  n
           1   f   f   t  p
          90   f   f   t  n
           4   f   f   f  p
          21   f   f   f  n

a) (15 marks) Calculate the conditional probability tables (CPTs) for the Naïve Bayes model.
b) (5 marks) Calculate P(C = p | F1 = f, F2 = t, F3 = f) using the Naïve Bayes model and briefly describe what this conditional probability represents.
c) (5 marks) What is the most likely value of the class variable C for the partial configuration (F1 = f, F2 = t, F3 = f) according to the Naïve Bayes model discussed in a) and b)?
d) (5 marks) What is P(C = p | F1 = f, F2 = t, F3 = f) if we use the Joint Distribution Model?
e) (5 marks) What is P(C = p | F1 = f, F2 = t, F3 = f) if we use the Fully Independent Model?

Note: In assignments, always include intermediate results and sufficient details about the way the results are obtained.

6) (30 marks) You need to write and submit a program in one of the languages Perl, Python, C, C++, or Java, according to the specifications given below. Depending on the language that you use, the program must be named either a2q6.pl (for Perl), a2q6.py (for Python), a2q6.c (for C), a2q6.cc (for C++), or a2q6.java (for Java). Python 2 and Python 3 versions of the program should be distinguished by the first line, which should be either ‘#!/local/bin/python2’ for Python 2 or ‘#!/local/bin/python3’ for Python 3 (according to the bluenose environment). The program must be submitted using the command submit-nlp (or via the equivalent form on the web site). The program must read the standard input, write to the standard output, and not open any files or use any other kind of interaction with the system or network.

A frequent task when working with HTML files from the internet is the removal of HTML tags and comments. You need to write a program similar to that, but in order to test it, we do not want simply to remove the tags but to hide them in a way. We want to replace any tag, such as [...], with the string [...]. Any character between the delimiters < and > should be replaced with a period (.), except new-line characters (\n). Additionally, we also want to recognize HTML comments in text, which start with <!-- and end with -->, and similarly replace all characters except new-lines inside comments with periods. For example, [...] should be replaced with [...].
In case the input contains a tag that starts with < or a comment that starts with <!-- but is never closed, replace all non-new-line characters with periods from that position to the end of input.

Since tags and comments may overlap in different ways, you must follow these principles. Always detect the leftmost start of a tag or comment: find the leftmost < character; if it begins the string <!--, it opens a comment, and otherwise it opens a tag. After that, you must search for the corresponding closing string, either --> for a comment or > for a tag, and then processing continues. It is not really important how exactly you do the processing, as long as the output is exactly as specified.

Below, you can find some short examples of input and output:

    Input:  Start 123
    Output: Start 123
    Input:  Start maybe?> end
    Output: Start maybe?> end
    Input:  Start ? maybe?> end
    Output: Start ? maybe?> end
    Input:  Start ? maybe?> end
    Output: Start ? [...]

For example, if we use the following file test1.html.txt as input:

    This is a header
    Normal text link.
    This is a multiline href=this_should_all_be hidden>And so on link.
    Start comment: this is all
    comment we can use > and until :this is out of comment.
    Check in browser if you want.

we should get the following output:

    This is a header
    Normal text link.
    This is a multiline ...................................>And so on link.
    Start comment: .......................................... :this is out of comment.
    Check in browser if you want.

These sample input and output files (test1.html.txt and test1.out) are provided in the assignment directory, and you can test your program with the commands:

    ./a2q6.pl < test1.html.txt > test1.new
    diff -s test1.out test1.new

If there are no differences, your program works correctly on this input. Additionally, if you check the number of lines and characters in both files with:

    wc test1.html.txt test1.out

you will notice that both files have exactly the same length and the same number of lines (output):

    13 53 339 test1.html.txt
    13 37 339 test1.out
    26 90 678 total

Note: You can also change the name of the file test1.html.txt to test1.html. We added the extension .txt so that it does not open as an HTML file when accessed via a web browser.

You can find another test case, test2.html.txt and test2.out, which is based on a Dalhousie course timetable page.
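
The left-to-right processing described in Question 6 can be sketched as a single scan over the input. The Python sketch below is illustrative only and is not the reference solution. The function names hide_markup and mask and the sample string are hypothetical. It assumes that the delimiters themselves (<, >, <!--, -->) are kept and only the characters between them are masked, and that the search for --> begins after the opening <!--.

```python
def mask(body: str) -> str:
    """Replace every character with a period, keeping new-lines."""
    return "".join(ch if ch == "\n" else "." for ch in body)


def hide_markup(text: str) -> str:
    """Mask the contents of tags and comments, scanning left to right."""
    out = []
    i = 0
    while i < len(text):
        if text[i] != "<":
            # Ordinary text is copied unchanged.
            out.append(text[i])
            i += 1
            continue
        # Leftmost '<': it opens a comment if it begins '<!--', else a tag.
        if text.startswith("<!--", i):
            opener, closer = "<!--", "-->"
        else:
            opener, closer = "<", ">"
        start = i + len(opener)
        end = text.find(closer, start)
        out.append(opener)
        if end == -1:
            # Never closed: mask everything up to the end of input.
            out.append(mask(text[start:]))
            break
        out.append(mask(text[start:end]))
        out.append(closer)
        i = end + len(closer)
    return "".join(out)


# Hypothetical sample input, not taken from the assignment files.
print(hide_markup("Start <b>bold</b> 123"))  # prints: Start <.>bold<..> 123
```

To use the sketch as a standard-input filter, one would pass sys.stdin.read() to hide_markup and write the result with sys.stdout.write. Because every masked character is replaced one for one, the output has the same length and the same number of lines as the input, which matches the wc check above.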