
groupname_assignment_b
Note that this is a group assignment - you are required to work in a group of 2-4 students for the whole assignment. One set of peer evaluation forms (submitted via Blackboard) is required for Assignment B.
1.1 Background
You have been employed by a company that sells apps and devices to help drivers reduce their risk of infringing on road rules (and getting caught!). The marketing department has come up with a two-pronged campaign that it wants to target to different demographics.
They plan to market the financial implications of infringements to university students, through an education campaign about the type and cost of infringements that are most likely to occur. They plan to market the safety aspects of their products to young families, focussing on situations where child safety is at risk.
Your job is to help support these campaigns: first, to establish the market for them, and secondly to provide information that will be used in the education pieces of the campaigns.
NB: The data set used in this assignment is both real and very recent (from the NSW Office of State Revenue, see [http://data.gov.au/] for all open government data sets, or [http://www.revenue.nsw.gov.au/info/statistics] for this particular one - the "Penalty Notice Data Set"). That means you may be the first person in the world to uncover an error, quirky fact, or meaningful result. Good luck!
1.2 Submission Instructions
1. Each group needs to submit a single Jupyter notebook (.ipynb file) which contains all of their code and analysis, via the link on Blackboard (Assessment, Assignment B Submission).
2. The provided material is a zip file containing a template notebook (this document), two data files, and an Excel spreadsheet describing the data set.
3. Complete the template notebook with your code. You may make extra cells as you prefer, but please leave the question cells there for ease of reading.
4. The notebook will be run using the menu "Cell->Run All" (using the latest Python 3 based Anaconda Python installation available on the date the assignment is posted), with the penalty data file in the same folder as the notebook.
5. All of your outputs (.csv files) need to be written to that same directory, with the filename and format as requested by the question.
6. The correctness of the produced .csv files will be assessed automatically (by a Python script), so specifications must be followed precisely. The most important thing is to have the exact column names and row ordering. The bold numbers (index) of the data frame will be ignored, so don't worry about them.
7. Use Markdown Cells for longer explanation of your work and analysis, as required by some of the questions.
8. A short assessment of the content of the notebook will be made (for code style, clarity of explanation, and validity of your approach).
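As an illustration of instructions 5 and 6, the following is a minimal sketch of writing a data frame to csv with pandas. The frame contents and the filename `example_output.csv` are made up for this example (each question specifies its own filename), and the file is written to a temporary folder here only to keep the sketch self-contained. Since the index is ignored in marking, passing `index=False` is one option for leaving it out of the file.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for a real answer; all values are made up.
df_example = pd.DataFrame(
    {"OFFENCE_CODE": [2, 1], "TOTAL_VALUE": [300, 100]},
    index=[7, 3],  # arbitrary index labels, like the bold numbers in a notebook
)

# Hypothetical output name; in the assignment, use the filename each question
# asks for and write it next to the notebook rather than to a temp folder.
out_path = os.path.join(tempfile.mkdtemp(), "example_output.csv")

# index=False leaves the index out; column names and row order are preserved.
df_example.to_csv(out_path, index=False)

# Read the file back to confirm what was written.
df_check = pd.read_csv(out_path)
print(list(df_check.columns))
```

Reading the file back in this way is a quick check that the column names and row order match the requested format.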
1.3 Marking Criteria
1. Correctness of results as per the given training / validation split.
2. Correctness of results on a different random training / validation split (to be determined by the marker after the assignment is handed in). This means that excessively tuning your results for the exact training/test data is not a good idea.
3. Clear, well commented code (using the "#" symbol to add comments to explain your thinking). This is particularly important when a result is incorrect, as you may still be able to get partial marks for your answer.
4. Specific marking criteria as described in the questions below.
1.4 Suggested Resources
While posting the questions online is strictly forbidden by the University's academic honesty policy, you may find help in a variety of ways:
• You should be able to do the whole assignment with the following packages, which have very helpful documentation on their websites:
  • pandas: http://pandas.pydata.org/
  • scikit-learn: http://scikit-learn.org/stable/index.html
• There are many helpful online forums where Python developers and data scientists discuss the best ways of solving particular problems. http://stackoverflow.com is the biggest, and will likely appear in any googling you do.
• If you still feel stuck with the basics, there are many free online resources to help you get up and running, e.g. http://datacamp.com, and inexpensive e-books such as those on O'Reilly.
1.5 Errors
If you believe there are any errors with the assignment please email the lecturer immediately.
1.6 Setup
The code below reads the data file, creates a training and test data set, and displays the first five rows for you. Add code in the cells below (make more cells if you like) to answer the questions.
DO NOT EDIT (except for adding in your group name)
List them in a dataframe with the description, the number of occurrences, and the total revenue brought in by that offence. Order from highest to lowest total revenue.
The format for the data frame "df_top_offences" before saving to csv should be:
OFFENCE_CODE OFFENCE_DESC TOTAL_NUMBER TOTAL_VALUE
79053 Use unregistered registrable Class A motor veh... 185602 116218644
6963 Disobey no stopping sign 456009 103148508
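One way to produce a frame of this shape is a group-by aggregation followed by a sort. The sketch below uses a small made-up frame (`df_toy`) with the four column names shown above; it assumes each row of the real data carries a TOTAL_NUMBER and TOTAL_VALUE that can simply be summed per offence, and it does not apply any selection of which offences to keep.

```python
import pandas as pd

# Made-up rows; only the column names follow the format shown above.
df_toy = pd.DataFrame({
    "OFFENCE_CODE": [10, 20, 10, 30],
    "OFFENCE_DESC": ["Offence A", "Offence B", "Offence A", "Offence C"],
    "TOTAL_NUMBER": [2, 5, 1, 4],
    "TOTAL_VALUE": [200, 250, 100, 400],
})

# Sum occurrences and revenue per offence, then order by revenue, highest first.
df_summary = (
    df_toy.groupby(["OFFENCE_CODE", "OFFENCE_DESC"], as_index=False)[
        ["TOTAL_NUMBER", "TOTAL_VALUE"]
    ]
    .sum()
    .sort_values("TOTAL_VALUE", ascending=False)
)

print(df_summary)
```

Grouping on both the code and the description keeps the description in the output without a separate merge, on the assumption that each code maps to a single description.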
1.7.1 Marking Guide
• 1 mark - Partially correct solution (fails automatic verification, but passes some manual inspection of code and results).
• 2 marks - Passes automatic verification for correct results
Take the data frame "df_top_offences" from Question 1, and restrict it to only those entries that mention the colour "red" (careful!). Save it as a csv in the same format as per Question 1.
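A plain substring search for "red" also matches words such as "unregistered", which is presumably the reason for the warning. The sketch below shows one possible way around this, using a regular expression with word boundaries in pandas. The descriptions are made up for illustration (apart from "Disobey no stopping sign"), and treating the match as case-insensitive is an assumption.

```python
import pandas as pd

# Made-up descriptions chosen to show the substring pitfall.
df_toy = pd.DataFrame({"OFFENCE_DESC": [
    "Drive unregistered vehicle",
    "Run red light",
    "Stop in RED zone",
    "Disobey no stopping sign",
]})

# \b marks a word boundary, so "red" inside "unregistered" does not match.
is_red = df_toy["OFFENCE_DESC"].str.contains(r"\bred\b", case=False, regex=True)

# Boolean indexing keeps the original row order.
df_red = df_toy[is_red]
print(df_red)
```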
1.8.1 Marking Guide
• 1 mark - Fails automatic verification, but solution has some correct aspects.
• 2 marks - Fails automatic verification with minor errors, e.g. text search not quite accurate, or wrong order.
• 3 marks - Passes automatic verification for correct results
("df") and find any offence (regardless of number of occurrences) that relates to children (or school zones), based on the text. You'll have to come up with your own definition for what this means - please explain it in a comment. Add a new boolean column called CHILD_RELATED that is True when the OFFENCE_DESC matches your search, and False when it does not. Leave rows in the same order as the df_train that you read in at the start.
Save data in csv of the following format:
CHILD_RELATED OFFENCE_DESC
False Proceed through red traffic light - Camera Det...
False Stop on/near marked foot crossing
False Enter restricted area without offering ticket ...
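A boolean column like this can be built with a text search over OFFENCE_DESC. The sketch below is one possible approach on made-up descriptions; the keyword list (`child`, `children`, `school`) is only a hypothetical definition of "child related", and a real answer needs its own justified definition.

```python
import pandas as pd

# Mostly made-up descriptions, kept in their original order.
df_toy = pd.DataFrame({"OFFENCE_DESC": [
    "Exceed speed limit in school zone",
    "Stop on/near marked foot crossing",
    "Child not restrained in vehicle",
    "Disobey no stopping sign",
]})

# Hypothetical definition: any description mentioning one of these whole words.
# The non-capturing group (?:...) avoids a pandas warning about match groups.
child_pattern = r"\b(?:child|children|school)\b"

df_toy["CHILD_RELATED"] = df_toy["OFFENCE_DESC"].str.contains(
    child_pattern, case=False, regex=True
)

# Select the two columns in the order shown in the required format.
df_out = df_toy[["CHILD_RELATED", "OFFENCE_DESC"]]
print(df_out)
```

Because the column is added in place and no sorting is done, the rows stay in the same order as the input frame.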
1.9.1 Marking Guide
• 1-2 marks - Solution incorrect, but some correct aspects.
• 3-4 marks - Some minor errors with the solution.
• 5 marks - Passes automatic verification for correct results (some leniency given to differing interpretations of "child related").
They want to "simplify" the data by removing precise details of the infringements:
• The OFFENCE_CODE and OFFENCE_DESC columns will no longer be given in future.
• The FACE_VALUE and TOTAL_NUMBER of infringement columns will be removed (but the TOTAL_VALUE column will stay).
2. The SCHOOL_ZONE_IND column will no longer be available in future.
Your marketing team panics that this data set, which is core to their "child related" strategy, is about to become useless for ongoing campaigns. You assure them that you can build a predictive model which can make a reasonable guess whether a line entry in the new data set is about a child related offence, based on the remaining columns that will be left in the data.
Build a model that predicts whether a line represents a CHILD_RELATED infringement, as defined previously, using the remaining variables in df_train. Hint: Using dates in prediction is probably unwise.
Write the predictions for the test data set to a csv file in the following format, preserving the same row order as df_test, where CHILD_RELATED is the same as in your answer to Question 3, and CHILD_RELATED_PREDICTION is the binary (True/False) output of your predictive model for each row:
CHILD_RELATED CHILD_RELATED_PREDICTION
False False
False False
False ...
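The overall shape of such a model build can be sketched as follows, using scikit-learn on made-up data. This is only an illustration under stated assumptions: `LOCATION_TYPE` is a hypothetical column standing in for whatever categorical columns remain, the decision tree is an arbitrary choice of classifier, and the toy frames are far too small to say anything about real performance.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up training and test frames. LOCATION_TYPE is a hypothetical column;
# TOTAL_VALUE and CHILD_RELATED follow the names used in the assignment.
toy_train = pd.DataFrame({
    "LOCATION_TYPE": ["school", "highway", "school", "street", "highway", "street"],
    "TOTAL_VALUE": [300, 100, 350, 120, 90, 110],
    "CHILD_RELATED": [True, False, True, False, False, False],
})
toy_test = pd.DataFrame({
    "LOCATION_TYPE": ["street", "school"],
    "TOTAL_VALUE": [115, 320],
    "CHILD_RELATED": [False, True],
})

feature_cols = ["LOCATION_TYPE", "TOTAL_VALUE"]

# One-hot encode the categorical column, then align the test columns to the
# training columns so both frames have identical features.
X_train = pd.get_dummies(toy_train[feature_cols])
X_test = pd.get_dummies(toy_test[feature_cols]).reindex(
    columns=X_train.columns, fill_value=0
)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, toy_train["CHILD_RELATED"])

# Build the output frame in the required column order, keeping test row order.
df_pred = pd.DataFrame({
    "CHILD_RELATED": toy_test["CHILD_RELATED"],
    "CHILD_RELATED_PREDICTION": model.predict(X_test).astype(bool),
})
print(df_pred)
```

Aligning the test columns with `reindex` matters when a category appears in only one of the two frames, which becomes more likely under a different random split.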
1.10.1 Marking Guide
• 1-4 marks - Code exhibits some aspects of a correct model build, but either no scores are produced, or the model is no better than random guessing.
• 5 marks - Model achieves fair (better than random) performance on the provided test set
• 6-8 marks - Model achieves fair to good performance on a different random split of the training/test data.
• 9-10 marks - Model achieves good to outstanding performance on an undisclosed test method.
NB: Questions about what "good" model performance is will not be answered, other than the generic "and 100% is a perfect model". We are simulating a "real world" model build, where you are not provided with a definition of "good enough" prior to building it! A range of binary performance metrics will be used in the assessment.
If you're lucky, you may have been asked by a business to do one fairly straightforward analysis, but you discover something else important along the way. More commonly, you will be provided with some data and a vague business goal, and expected to come up with something insightful that impacts the business. This is your job for Question 5 - still pretending you're working for the same company described above, perform an unsupervised learning analysis, and write a report (in markdown cells below) documenting what you find. You must use either PCA or k-means, do some visualisation of results, and explain what you see.
Pretend companies aside, this is a real, up-to-date open government data set. If you find out something important that relates to the real world, you're playing for more than just uni marks. Maybe you'll alert the government to policy error or fraud. You could discover something juicy that's of interest to the Australian media. 
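
As a starting point for the unsupervised analysis, the following is a minimal sketch of PCA and k-means with scikit-learn on made-up numeric data. The two synthetic blobs, the choice of two clusters, and the scaling step are all assumptions for illustration; the real data would first need its columns converted to numeric features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two made-up, well separated blobs in three numeric features.
X_toy = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 3)),
    rng.normal(loc=5.0, scale=0.5, size=(20, 3)),
])

# Scale features so that no single column dominates the distances.
X_scaled = StandardScaler().fit_transform(X_toy)

# Project to two principal components, e.g. as x/y coordinates for a scatter plot.
coords = PCA(n_components=2).fit_transform(X_scaled)

# Cluster the scaled data; n_clusters=2 is an arbitrary choice for this toy data.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

print(coords.shape, np.bincount(labels))
```

Plotting `coords` coloured by `labels` is one simple way to visualise the result before explaining what the clusters appear to represent.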