FIT5145 Assignment 1: Description
Due date: Sunday 26th April 2020 - 11:55pm
The aim of this assignment is to investigate and visualise data using various data science tools. It will test
your ability to:
- read data files in Python and extract related data from those files;
- wrangle and process data into the required formats;
- use various graphical and non-graphical tools to perform exploratory data analysis and
visualisation;
- use basic tools for managing and processing big data; and
- communicate your findings in your report.
You will need to submit two separate files (Note: submitting a zipped file will attract a penalty of 10%):
- A report in PDF containing your answers to all the questions. Note that you can use Word or other
word processing software to format your submission. Just save the final copy to a PDF before
submitting. Make sure to include code, the output and any screenshots/images of the graphs
you generate in order to justify your answers to all the questions. (Marks will be assigned to reports
based on their correctness and clarity. For example, higher marks will be given to reports
containing graphs with appropriately labelled axes.)
- The Python code, as a Jupyter notebook file (idnumber_FIT5145_A1.ipynb), that you wrote to
analyse and plot the data. (Note that the entire assignment should be completed using Python.)
Assignment Tasks:
The way we supply and use energy in Australia is changing. To understand these changes, to plan for
Australia’s energy future, and to make sound policy and investment decisions, we need timely, accurate,
comprehensive and readily-accessible energy data. The Department of Industry, Science, Energy and
Resources is responsible for compiling and publishing Australia’s official energy statistics and balances [1].
The dataset is updated annually and consists of historical energy consumption, production and trade statistics.
In this task, you are required to explore the statistics covering all electricity generation in Australia. This
includes generation by power plants, and by businesses and households for their own use, in all states and territories.
This also includes both on and off grid generation. We have extracted the data from the original files and
restricted it to a specific time period. Please download the dataset for this assignment from the following
link:
https://lms.monash.edu/mod/fo... ➤ energy_data.xlsx
[1] https://www.energy.gov.au/gov...
The Data File
The data file you have downloaded is in xlsx format. Each sheet contains the energy generation
statistics of each Australian State/Territory in GWh (gigawatt hours) for the years 2009 to 2018.
Field description
● State: Names of different Australian states.
● Fuel_Type: The type of fuel which was used.
● Category: Classification of fuel type: renewable or non-renewable.
● Years: The years for which energy generation is recorded.
There are two tasks (Task A and B) that you need to complete for this assignment. You need to use
Python v3.5+ to complete the tasks.
Task A: Exploratory Data Analysis of the Energy Dataset
This assessment aims to guide you in exploring the Australian Energy generation data set through the
process of exploratory data analysis (EDA), primarily through visualisation of that data using various
data science tools. You will need to draw on what you have learnt, and will continue to learn, in class.
You are also encouraged to seek out alternative information from reputable sources. If you use or are
‘inspired’ by any source code from one of these sources, you must reference it.
A1. Investigating the Energy Generation data for Victoria
- First, read the data for Victoria into a dataframe. You will observe that some values for the
fuel types (e.g. Black coal) are missing or shown as ‘NaN’. To handle this, replace these values with
zero (using appropriate Python code) before proceeding with the rest of the questions.
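A minimal sketch of this cleaning step, assuming pandas is used. The small frame below is synthetic stand-in data with invented values; the layout (one row per fuel type, one column per year) is an assumption about the sheet. In a real notebook the frame would more likely come from pandas' `read_excel`, with a sheet name that depends on the workbook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Victoria sheet (all values are invented).
# Assumed layout: one row per fuel type, one column per year.
vic = pd.DataFrame({
    "State": ["VIC", "VIC", "VIC"],
    "Fuel_Type": ["Brown coal", "Black coal", "Wind"],
    "Category": ["Non-renewable", "Non-renewable", "Renewable"],
    2009: [100.0, np.nan, 5.0],
    2010: [98.0, np.nan, 7.0],
})

# Replace missing generation values with zero in the year columns only.
year_cols = [2009, 2010]
vic[year_cols] = vic[year_cols].fillna(0)

# Total generation per year across all fuel types.
total_by_year = vic[year_cols].sum()
```

Restricting `fillna` to the year columns leaves any missing text fields untouched, which makes unexpected gaps in the descriptive columns easier to spot.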
a. Using Python, plot the total energy generation in Victoria over the time period covered in
the dataset (2009 to 2018). Describe the trend you see in the overall energy generation for
the given time period.
b. Draw a new plot showing the trend in total renewable and non-renewable energy
generation for the same time period. What trend can you observe from this graph?
c. Draw a bar chart showing the breakdown of the different fuel types used for energy
generation in 2009 vs in 2018. Explain your observation.
d. What was the most used energy resource (fuel-type) in 2015? Which renewable fuel type
was the least used in 2015?
e. Draw a plot showing the percentage of Victoria's energy generation coming from
Renewable vs Non-Renewable energy sources over the period 2009 to 2018. What can
you say about the trend you observe?
f. Using a linear regression model, predict what percentage of Victoria’s energy generation
will come from Renewable energy sources in the years 2030 and 2100. Do the predictions
seem reasonable?
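The linear regression in part (f) can be sketched as follows, assuming NumPy is available. The percentages are invented for illustration and are not taken from the dataset; `predict_pct` is a hypothetical helper name.

```python
import numpy as np

# Invented renewable share (%) per year, for illustration only.
years = np.arange(2009, 2019)
renewable_pct = 5.0 + 1.5 * (years - 2009)

# Fit a straight line: pct ~ slope * year + intercept.
slope, intercept = np.polyfit(years, renewable_pct, deg=1)

def predict_pct(year):
    """Extrapolate the fitted line to a given year (hypothetical helper)."""
    return slope * year + intercept

pred_2030 = predict_pct(2030)
pred_2100 = predict_pct(2100)
```

A fitted straight line is unbounded, so it is worth checking whether extrapolated values stay within the 0 to 100 range that a percentage can take.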
A2. Investigating the Energy Generation data for Australia
- Let’s do some further investigation by combining the data for all the states and territories in
Australia. Read the data for the rest of the states and merge them into a single dataframe. (Hint: you
can use a combination of merge, melt or concat operators to get your data in a format suitable for
answering the following questions.)
a. Plot a column chart showing the total energy generated in Australia by fuel type in the year
2018.
b. Which state had the highest energy production in 2018? What is the ratio (percentage
breakdown) of renewable vs non-renewable energy production for that state in 2018?
c. Draw a plot showing the percentage of energy generation from renewable energy sources
for each state over the period 2009 to 2018. From your graph, which state do you think is
making the most progress towards adopting green energy? Provide a reason for your
answer.
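The hint about concat and melt in A2 can be sketched as follows, assuming pandas. Both frames are synthetic with invented values, and the wide layout (one column per year) is an assumption about the sheets; the names `tidy` and `GWh` are illustrative only.

```python
import pandas as pd

# Synthetic stand-ins for two state sheets (all values are invented).
vic = pd.DataFrame({
    "State": ["VIC", "VIC"],
    "Fuel_Type": ["Brown coal", "Wind"],
    "Category": ["Non-renewable", "Renewable"],
    2017: [90.0, 10.0],
    2018: [80.0, 20.0],
})
tas = pd.DataFrame({
    "State": ["TAS", "TAS"],
    "Fuel_Type": ["Natural gas", "Hydro"],
    "Category": ["Non-renewable", "Renewable"],
    2017: [10.0, 90.0],
    2018: [5.0, 95.0],
})

# Stack the per-state frames, then reshape the year columns into long form.
combined = pd.concat([vic, tas], ignore_index=True)
tidy = combined.melt(
    id_vars=["State", "Fuel_Type", "Category"],
    var_name="Year",
    value_name="GWh",
)

# Renewable share (%) per state and year: rows are years, columns are states.
totals = tidy.groupby(["State", "Year"])["GWh"].sum()
renewable = (
    tidy[tidy["Category"] == "Renewable"].groupby(["State", "Year"])["GWh"].sum()
)
renewable_pct = (renewable / totals * 100).unstack("State")
```

The long form keeps one observation per row, which tends to make grouping by state, year or category a single `groupby` call.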
A3. Visualising the Relationship over Time
Now let's look at the relationship between all variables impacting the energy generation over time. Ensure
that you have combined all the data from the different states. Ensure that your data is aggregated by year
and state, includes the total energy produced (total_production), and has a separate column for each of the fuel types.
- Use Python to build a Motion Chart that visualises the energy production trend for Australia over
time. The motion chart should show the units of energy production using Wind on the x-axis and the
energy production using Natural gas on the y-axis; the colour should represent the states/territories and the
bubble size should show the total_production. (HINT: A Jupyter notebook containing a tutorial
on building motion charts in Python is available here)
- Run the visualisation from start to end. (Hint: In Python, to speed up the animation, set the timer
bar next to the play/pause button to the minimum value.) Then answer the following questions:
a. Comment generally on the trend you see in reliance on wind energy vs reliance on natural
gas for each Australian state over time. Is it reasonable to say that there is a relationship between
the two variables?
b. Which state relied most on natural gas for energy production in 2013? Please support your
answer with any relevant python code and the motion chart screenshot.
c. Comment on the trend in Queensland's (QLD) reliance on Natural gas between 2009 and
2018. What could be the reason contributing to this?
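The data layout described for A3 (one row per year and state, one column per fuel type, plus total_production) can be sketched with a pandas pivot. The input frame is synthetic with invented values, and the long-form column names `Year` and `GWh` are assumptions; the motion chart itself is not shown, since it depends on the library used in the tutorial.

```python
import pandas as pd

# Synthetic long-form data (invented): one row per state, year and fuel type.
tidy = pd.DataFrame({
    "State": ["QLD", "QLD", "QLD", "SA", "SA", "SA"],
    "Year": [2018] * 6,
    "Fuel_Type": ["Wind", "Natural gas", "Black coal"] * 2,
    "GWh": [1.0, 10.0, 50.0, 6.0, 7.0, 0.0],
})

# One row per (Year, State), one column per fuel type.
wide = tidy.pivot_table(
    index=["Year", "State"],
    columns="Fuel_Type",
    values="GWh",
    aggfunc="sum",
    fill_value=0,
).reset_index()
wide.columns.name = None

# Total generation across all fuel-type columns.
fuel_cols = [c for c in wide.columns if c not in ("Year", "State")]
wide["total_production"] = wide[fuel_cols].sum(axis=1)
```

Computing `total_production` from the pivoted fuel columns keeps the total consistent with the per-fuel values plotted on the axes.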
Sample snapshot of the expected Motion Chart
Task B: Exploratory Analysis of Data
In this task, you are presented with some pre-processed tweets about bushfires in Australia. The dataset
is available via the following link:
https://lms.monash.edu/mod/fo... > twitter_data.csv
Please refer to Table 1 if you want to know the meaning of each feature/column. For example, nFollows
shows the number of followers a user has. A user who has more than a thousand followers can be
considered a popular user. It should be noted that NOT every tweet in the data set is relevant to the
bushfires in Australia, as represented by the value in the last column (1 denotes a relevant tweet and 0 an
irrelevant tweet).
Table 1: Description of Columns in the Data File
You are required to investigate the features of the twitter dataset. Please clearly label and comment your
Python code used to answer each question.
B1. Investigating the Data
Please make sure to understand the data set and its variables properly before answering the
following questions. You need to have a good insight into the dataset to be able to understand
some of the questions properly and avoid confusion.
- How many tweets are there altogether in the data file? How many of these tweets were posted
from a verified account?
- Draw a histogram showing the distribution of #entities extracted from the tweets. Set an
appropriate bin size to present this information.
- Compute the descriptive statistics (mean, std, quartile1, median, quartile3 and max) of #entities
of relevant (i.e. relevanceJudge = 1) and non-relevant (i.e. relevanceJudge = 0) tweets in the
dataset. (Hint: You may use the describe() function for simplicity.) Explain any interesting findings.
- What is the average length of the tweets (in characters) that are judged as relevant? What is the
average length of a non-relevant tweet?
- To gain further insights into the twitter age of the users, it would be better to group the twitterAge
in categorical bins. Create a new column, twitter age group, in your dataframe based on twitterAge
by converting it into the following groupings or categories: [‘0-1’, ‘1-2’, ‘2-3’, ‘3-4’, ‘4-5’, ‘5+’]. (Hint:
You can use the cut() method to bin (categorise) your data in these suggested categories.)
a. Generate boxplots summarising the distribution of each twitter age group against their
tweet length. What do you observe? Is there much variation in tweet length across the age
groups?
b. Which age group has the lowest median tweet length and which one has the highest? State
these median values.
c. According to the current bushfire tweet dataset, which age group is the most active on twitter
(has posted the most tweets in the current processed set of tweets in your dataframe)? (Note:
Each record in the dataframe is a tweet.)
d. Create a plot showing the total number of tweets posted by each age group (from Part [c]
above).
e. Which age group on average has the highest number of followers on twitter?
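The binning and grouped statistics asked for in B1 can be sketched as follows, assuming pandas. The frame is synthetic with invented values. `twitterAge` and `relevanceJudge` are named in the brief, while `length` and `twitter_age_group` are hypothetical column names, and treating twitterAge as a number of years is an assumption.

```python
import pandas as pd

# Synthetic stand-in for the tweet data (all values are invented).
# "length" is a hypothetical name for the tweet-length column.
tweets = pd.DataFrame({
    "twitterAge": [0.5, 1.5, 2.5, 3.5, 4.5, 7.0, 0.2, 6.0],
    "length": [100, 120, 80, 140, 90, 110, 60, 130],
    "relevanceJudge": [1, 0, 1, 0, 1, 1, 0, 1],
})

# Bin edges [0,1), [1,2), ..., [5, inf). With right=False an age of exactly
# 1.0 falls into '1-2'; whether that is the intended boundary is an assumption.
edges = [0, 1, 2, 3, 4, 5, float("inf")]
labels = ["0-1", "1-2", "2-3", "3-4", "4-5", "5+"]
tweets["twitter_age_group"] = pd.cut(
    tweets["twitterAge"], bins=edges, labels=labels, right=False
)

# Number of tweets and median tweet length per age group.
tweets_per_group = tweets["twitter_age_group"].value_counts().sort_index()
median_length = tweets.groupby("twitter_age_group", observed=False)["length"].median()

# Descriptive statistics of tweet length, split by relevance.
stats_by_relevance = tweets.groupby("relevanceJudge")["length"].describe()
```

Because `cut()` returns an ordered categorical, sorting by index keeps the groups in the order '0-1' through '5+' rather than alphabetical or frequency order.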
B2. Exploring correlation in the Data
In this task, you are required to explore the above (twitter) dataset and report on any interesting
relationships/correlations you discover amongst the tweet variables. Your analysis should form a logical
story. The answer should contain visualisations (plots to represent the trend or correlation),
interpretation of your findings and an example of a prediction task (using simple linear regression).
[Note: There should be a clear reason behind each visualisation you create, followed by a concise
explanation of what message the visualisation is conveying.]
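A minimal sketch of a correlation check followed by a simple linear regression, assuming pandas and NumPy. The data are invented; `nFollows` is named in the brief, while `nRetweets` is a hypothetical column used only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data (invented); "nRetweets" is a hypothetical column name.
tweets = pd.DataFrame({
    "nFollows": [10, 200, 1500, 50, 3000, 800],
    "nRetweets": [0, 2, 15, 1, 31, 8],
})

# Pearson correlation matrix between the numeric columns.
corr = tweets.corr()

# Simple linear regression: nRetweets ~ slope * nFollows + intercept.
slope, intercept = np.polyfit(tweets["nFollows"], tweets["nRetweets"], deg=1)

# Predicted value for a user with 1000 followers.
predicted = slope * 1000 + intercept
```

Inspecting the correlation before fitting gives a reason for the choice of predictor, which fits the brief's request that each visualisation and model have a clear motivation.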
All the Best!!!