CITS 2401 计算机分析

Computer Analysis
and Visualisation
Assignment 1
Tweet Analysis
Worth: 5% of the unit
Submission: Answer the questions on the quiz server.
Deadline: 11 March 2021 5pm
Late submissions: late submissions attract 5% raw penalty per day up to 7 days (i.e., 18 March 2021 5pm). After, the mark
will be 0 (zero). Also, any plagiarised work will be marked zero.

  1. Outline
    Natural language processing (NLP) is useful yet a difficult task. Our UWA Cybersecurity Research Group
    has been focusing on rumour detection and generation in order to prevent rumours causing harm to the
    society. As a first step, we built an Automated Rumour Generation Hub (ARGH) that uses various machine
    learning (ML) and NLP techniques to generate rumours that are difficult to be identified by both humans
    and machines. In particular, Twitter has been used as the source dataset as we often observe different
    rumours circulating this social media platform. However, they don't provide analytical functions for us to
    summarise the data, so we have to do that ourselves.
    In this assignment, you will be carrying out simple data analysis tasks using tweets as outlined is the
    Tasks section below, mostly just to test your basic Excel competency. More complex tasks will be carried
    out in other assignments (stay tuned!).
    Note1: This is an individual assignment, please don't share your solution/code/files with others (only
    high-level discussion is allowed, e.g., the syntax of the formula, use of array formula with other examples
    etc.). If it is found to be not your original work, then you may be penalised.
    Note2: You may use intelligent formatting and colour combinations to display your worksheet in an
    understandable manner. However, don't "pimp" up the worksheet.
    Note3: You can find ARGH here: https://github.com/argh-rumor...,
    where you can run ARGH yourself using Google Colab.
    CITS 2401
    Computer Analysis
    and Visualisation
  2. Tasks
    Task 1
    Import the original.txt into excel word by word. Here the term "word" refers to any sequence of letters
    separated by a space. Note, the text qualifier should be set to {none} when you import the text. This
    sheet should be named words_data. Finally, the whole data range should be named words. Figure 1
    shows the example output of what it would look like if this task is done correctly.
    Figure 1. words_data sheet snippet.
    Task 2
    Create a new sheet named uniques_data. Import the list of unique words from the uniques.txt file
    provided. The words should be located from Cell A1. The whole range should be named uniques.
    Task 3
  3. In Column B: Calculate the frequency of the unique words from the words_data sheet. You must
    use an array formula to do this. Name the cell range as freq.
  4. In Column C: Calculate the number of letters used for each word from the words_data. This can
    be calculated by simply multiplying the number of letters by its frequency count. Name the cell
    range as letters.
  5. In Column D: Calculate the rank based on the frequency values. You must use an array formula
    to do this. Name the cell range as rank.
    In addition, apply conditional formatting on rank where the bottom 10 ranked values (i.e., the 10
    smallest values) are formatted with light red filled with dark red text.
    Task 4
    Create a new sheet named stats. Add the following columns From A2 to A7:
    CITS 2401
    Computer Analysis
    and Visualisation
  6. Average
  7. Max
  8. Min
  9. Median
  10. Mode
  11. SD
    Note, SD stands for standard deviation. Also, Average and SD should be rounded to 2 decimal places.
    Next, add labels as follows (Cell: Value):
  12. B1: Frequency
  13. C1: Letters
  14. D1: >Average
  15. E1:  The "Frequency" category (Column B) is using freq to calculate the statistics of the data.
     The "Letters" category (Column C) is using letters.
     The ">Average" category (Column D) uses letters where values are greater than the average (i.e.,
    value in cell C2).
     The "Populate all the statistics fields (i.e., B2:C7). You must not use any other supporting cells (i.e., you
    should calculate all those stats directly using Excel formulas using previously populated cells only).
    Note: the values in the uniques_data tab (i.e., freq and letters) should be treated as the entire
    population.
    Task 5
    Create a new sheet named charts. In this sheet, create a histogram of letters. The bin size should
    start from 0 and the gap between the bins are 20. For this, you must use an array formula.
    Then, format the chart as follows:
  16. The gap width is set to 0%.
  17. The series outline is solid black line.
  18. The series data labels are set to "Outside End".
  19. The title is removed.
    Task 6
    In "charts" sheet, create a scatter chart using freq in x-axis and letters in y-axis. Then, format the
    chart as follows:
  20. Remove the title.
  21. Add a linear trend line and display the R
  22. value.
  23. Add x-axis label "Frequency".
  24. Add y-axis label "Letters".
    The sample image of the charts sheet is provided in Figure 2, and your solution may look similar to this.
    However, the data shown in the image is sample data and is not the correct result (I.e., your figures
    may look different). The yellow blocks are added to the image to hide the sample data to avoid confusion.
    Task 7
    Insert a Treemap of uniques and letters into the "charts" sheet. Remove the title and the legend.
    Change the size to 15cm height and 25cm width.
    CITS 2401
    Computer Analysis
    and Visualisation
    Figure 2. Sample image of the charts sheet (treemap not shown)
  25. Submission
    You should answer the questions related to the tasks above on the quiz server by the due date - 11 March
  26. 5pm (drop dead due date 18 March 2021 with 5% raw penalty per day).
    Submit your Excel workbook on the quiz server, you should name the file as A1_[student id].xlsx.
    For example, if your student ID is 12345678, then your file name is A1_12345678.xlsx.
    Fail to follow this will result in a penalty of 50%.
    CITS 2401

你可能感兴趣的:(算法)