CSE 231 Spring 2021
Computer Project #07
Assignment Overview
This assignment focuses on the implementation of Python programs to read files and process data by
using lists and functions.
It is worth 55 points (5.5% of course grade) and must be completed no later than 11:59 PM on
Monday, March 15.
Assignment Deliverable
The deliverable for this assignment is the following file:
proj07.py – the source code for your Python program
Be sure to use the specified file name and to submit it for grading via Mimir before the project
deadline.
Assignment Background
One commonly hears reference to “the one percent” referring to the people whose income is in the
top 1% of incomes. What is the data behind that number and where do others fall? Using the
National Average Wage Index (AWI), an index used by the Social Security Administration to gauge
an individual's earnings for the purpose of calculating their retirement benefit, we can answer such
questions.
In this project, you will process AWI data. Example data for 2019 is provided in the file
year2019.txt (2019 is the most recent year of complete data). The data is a table with the first
row as the title and the second row defining the data fields; remaining rows are data. The URL for
the data is: https://www.ssa.gov/cgi-bin/n...
Here is the second line of data from the file followed by descriptions of the data. Notice that some
data are ints and some are floats:
5,000.00 — 9,999.99 12,620,757 32,801,513 19.37150 93,403,927,820.81 7,400.82
Column 0 is the bottom of this income range.
Column 1 is the dash separating the bottom of the range from the top (see note below).
Column 2 is the top of this income range (see note below).
Column 3 is the number of individuals in the income range.
Column 4 is the cumulative number of individuals in this income range and all lower ranges.
Column 5 is the Column 4 value represented as a cumulative percentage of all individuals.
Column 6 is the combined income of all the individuals in this range of income.
Column 7 is the average income of individuals in this range of income.
Note: The final row of the file is different than all the others. You must account for that.
Assignment Specifications
The program must provide the following functions to extract some statistics.
a) def open_file():
Prompts the user to enter a year number for the data file. The program checks whether
the year is between 1990 and 2019 (both inclusive). If the year is valid, the program
tries to open the data file named ‘yearXXXX.txt’, where XXXX is the year. An appropriate
error message should be shown if the year is invalid or if the data file cannot be opened.
The year is invalid if it is not a number between 1990 and 2019, inclusive; in that case the
invalid-year error is shown. If the year is valid but the file does not exist, the
file-name error is shown instead. This function loops until it receives proper input and
successfully opens the file. It returns a file pointer and the year. Hint: use string concatenation
to construct the file name.
i. Parameters: None
ii. Display: prompt and error message
iii. Return: file pointer and int
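The validation loop described above can be sketched roughly as follows. This is only one possible arrangement, not the required solution: the helper parse_year is a hypothetical name introduced here for illustration, and the prompt and error strings are copied from the Sample Output section.

```python
def parse_year(text):
    """Hypothetical helper: return the year as an int, or None if it is invalid."""
    try:
        year = int(text)
    except ValueError:
        return None          # not a number at all, e.g. "xxx"
    if 1990 <= year <= 2019:
        return year
    return None              # a number, but outside the allowed range


def open_file():
    """Loop until a valid year is entered and its data file opens."""
    while True:
        year = parse_year(input("Enter a year where 1990 <= year <= 2019: "))
        if year is None:
            print("Error in year. Please try again.")
            continue
        file_name = "year" + str(year) + ".txt"   # string concatenation
        try:
            fp = open(file_name)
        except FileNotFoundError:
            print("Error in file name:", file_name, "Please try again.")
            continue
        return fp, year
```

Separating the year check from the file check keeps the two error messages distinct, which is what Test 3 exercises.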
b) def handle_commas(s,T): returns int or float or None
The parameters are s, a string, and T, a string. The expected values of T are the strings
“int” and “float”; for any other value the function returns None. If the value of T is “int”, the
string s, with any commas removed, is converted to an int and that int value is returned.
Similarly for “float”. If s cannot be converted to the requested type, None is returned
(hint: use try-except).
Note: this is the same function we had in Project 5.
i. Parameters: str, str
ii. Display: nothing
iii. Returns: int or float or None
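A minimal sketch of this behavior, assuming that commas are simply removed before conversion (which is what the handle_commas function tests at the end of this handout suggest):

```python
def handle_commas(s, T):
    """Convert s to an int or float after removing commas; None on failure."""
    if T not in ("int", "float"):
        return None                      # unexpected type name
    cleaned = s.replace(",", "")
    try:
        if T == "int":
            return int(cleaned)
        return float(cleaned)
    except ValueError:
        return None                      # s does not look like that kind of number
```

Note that int() rejects a string such as "1234.56", so asking for an int from "1,234.56" yields None rather than a truncated value.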
c) def read_file(fp):
The function uses the file pointer parameter to read the data file. This function returns a list
of tuples where each tuple is the data on one line of the file, and is a mix of ints and floats as
follows:
tup = ((float, float), int, int, float, float, float)
the tuple is filled with the following data:
( (column 0, column 2), column 3, column 4, column 5, column 6, column 7)
Note that the numbers have commas that you should handle (Hint: use the handle_commas
function). There are also two header lines to skip. Also, the last line of the file has words
where data is supposed to be. Find which column this affects, and record that column as
None.
i. Parameter: file pointer
ii. Display: nothing
iii. Return: list of tuples
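The parsing steps above might look something like the sketch below. It rests on two assumptions that should be checked against the real data files: every data line splits on whitespace into at least eight fields, and the words on the final line fall into Column 2, where the failed conversion naturally produces None. A compact copy of handle_commas is included only to keep the example self-contained.

```python
def handle_commas(s, T):
    """Compact version of the function from part (b)."""
    if T not in ("int", "float"):
        return None
    try:
        return int(s.replace(",", "")) if T == "int" else float(s.replace(",", ""))
    except ValueError:
        return None


def read_file(fp):
    """Return a list of tuples, one per data line of the file."""
    data_list = []
    next(fp, None)                        # skip the title line
    next(fp, None)                        # skip the column-header line
    for line in fp:
        fields = line.split()
        if len(fields) < 8:               # assumption: ignore blank or short lines
            continue
        row = ((handle_commas(fields[0], "float"),
                handle_commas(fields[2], "float")),   # None when the field is a word
               handle_commas(fields[3], "int"),
               handle_commas(fields[4], "int"),
               handle_commas(fields[5], "float"),
               handle_commas(fields[6], "float"),
               handle_commas(fields[7], "float"))
        data_list.append(row)
    return data_list
```

Because a file object is its own iterator, next(fp, None) consumes one line at a time, and the for loop continues from the third line.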
d) def get_range(data_list, percent):
Takes a list of data (output from the read_file function) and a percent and returns data
for the first data line whose cumulative percentage (Column 5 in the data file) is greater than
or equal to the percent parameter. The function should return a tuple of the salary range
(Columns 0 and 2 in the file data), the cumulative percentage value (Column 5 in the data
file), and the average income (Column 7 in the data file):
( (column 0, column 2), column 5, column 7)
For testing using the 2014 data and a percent value of 90 your function will return
((90000.0, 94999.99), 90.80624, 92420.5)
i. Parameters: list of tuples, float
ii. Display: nothing
iii. Return: tuple
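One way to express this search is a simple linear scan, sketched below. It assumes data_list has the tuple layout produced by read_file, and returning None when no line qualifies is a choice made here, since the specification does not cover that case.

```python
def get_range(data_list, percent):
    """Return data for the first line whose cumulative percentage >= percent."""
    for income_range, _, _, cumulative_pct, _, average in data_list:
        if cumulative_pct >= percent:
            return (income_range, cumulative_pct, average)
    return None   # assumption: no line reaches the requested percent


# Three rows copied from the 2014 read_file output shown later in this handout.
rows_2014 = [
    ((85000.0, 89999.99), 1975400, 141929101, 89.72248, 172719042418.7, 87434.97),
    ((90000.0, 94999.99), 1714370, 143643471, 90.80624, 158442931588.44, 92420.5),
    ((95000.0, 99999.99), 1486636, 145130107, 91.74604, 144858203365.61, 97440.26),
]
print(get_range(rows_2014, 90))   # ((90000.0, 94999.99), 90.80624, 92420.5)
```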
e) def get_percent(data_list, income):
Takes a list of data (output from the read_file function) and an income and returns the
income range (Columns 0 and 2 in the file) that contains the specified income, together with
the corresponding cumulative percentage (Column 5 in the file):
( (column 0, column 2), column 5 )
For testing using the 2014 data and an income value of 150,000 your function will return
((150000.0, 154999.99), 96.87301)
i. Parameters: list of tuples, float
ii. Display: nothing
iii. Return: tuple
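A possible sketch of this lookup follows. It assumes the top of the last range is stored as None (as in the read_file output) and treats that range as open-ended; returning None for an income that falls in no range is likewise an assumption made here.

```python
def get_percent(data_list, income):
    """Return the income range containing income and its cumulative percentage."""
    for row in data_list:
        bottom, top = row[0]
        # top is None only for the final, open-ended range
        if income >= bottom and (top is None or income <= top):
            return (row[0], row[3])
    return None   # assumption: income is not covered by any range


# Rows copied from the 2014 read_file output shown later in this handout.
rows_2014 = [
    ((145000.0, 149999.99), 419003, 152855720, 96.62989, 61787674520.19, 147463.56),
    ((150000.0, 154999.99), 384581, 153240301, 96.87301, 58607775121.57, 152393.84),
    ((50000000.0, None), 134, 158186786, 100.0, 11564829969.82, 86304701.27),
]
print(get_percent(rows_2014, 150000))   # ((150000.0, 154999.99), 96.87301)
```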
f) def find_average(data_list):
Takes a list of data (output from the read_file function) and returns the average salary.
Round the result to cents (i.e. two decimal places) before returning the value.
Hints:
i. This is NOT (!) the average of the last column of data. It is not mathematically valid to
find an average by finding the average of averages—for example, in this case there are
many more in the lowest category than in the highest category.
ii. How many wage earners are considered in finding the average (denominator)? There
are a couple of ways to determine this. I think the easiest uses the “cumulative number”
column (Column 4 in the file), but using Column 3 is not hard and may make more
sense to some students.
iii. How does one find the total dollar value of income (numerator)? Notice that Column 6
in the file is the combined income of all the individuals in this range of income.
For testing your function, notice that for the 2014 data the average should be $44,569.20.
That value is listed on the web page referenced above.
iv. Parameters: list of tuples
v. Display: nothing
vi. Return: float # rounded to two decimal places
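The hints above amount to dividing total income by total earners. A minimal sketch, assuming the read_file tuple layout (index 2 is the cumulative count, index 4 is the combined income) and using small made-up rows purely to show why averaging the averages gives the wrong answer:

```python
def find_average(data_list):
    """Return total income divided by total earners, rounded to cents."""
    total_income = sum(row[4] for row in data_list)   # Column 6 of the file
    total_people = data_list[-1][2]                   # last cumulative count (Column 4)
    return round(total_income / total_people, 2)


# Hypothetical data: 1 person earning 1000 and 3 people earning 3000 each.
toy_rows = [
    ((0.01, 1999.99), 1, 1, 25.0, 1000.0, 1000.0),
    ((2000.0, None), 3, 4, 100.0, 9000.0, 3000.0),
]
print(find_average(toy_rows))   # 2500.0, whereas the average of averages is 2000.0
```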
g) def find_median(data_list):
Takes a list of data (output from the read_file function) and returns the median income.
Unfortunately, this file of data is not sufficient to find the true median, so we need to
approximate it from the data lines whose cumulative percentage is near 50%.
i. Here is the rule we will use: find the data line whose cumulative percentage (Column 5)
is closest to 50% and return its average income (Column 7). If two data lines are equally
close, return the smaller.
ii. Hint: Python’s abs() function (absolute value) is potentially useful here.
iii. Hint: your get_range() function should be useful here. The get_range()
function returns the first tuple where the cumulative percentage is greater than or equal to a
particular percentage. For the median the percentage is 50%.
iv. For testing your function, using our rule, the median income for the 2014 data is
$27,457.00
v. Parameters: list of tuples
vi. Display: nothing
vii. Return: float
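The "closest to 50%" rule can be sketched with min() and abs(), as below. This version does not use get_range(); it is just one way to apply the rule. It also assumes that "return the smaller" refers to the smaller average income when two lines are equally close.

```python
def find_median(data_list):
    """Return the average income of the line whose cumulative % is closest to 50."""
    closest = min(data_list,
                  key=lambda row: (abs(row[3] - 50), row[5]))  # tie: smaller income
    return closest[5]


# Two rows copied from the 2014 read_file output shown later in this handout.
rows_2014 = [
    ((20000.0, 24999.99), 10918555, 71176882, 44.99547, 245317570246.88, 22467.95),
    ((25000.0, 29999.99), 10192863, 81369745, 51.43903, 279865461187.05, 27457.0),
]
print(find_median(rows_2014))   # 27457.0
```

Here 51.43903 is about 1.44 away from 50 while 44.99547 is about 5.00 away, so the second line is chosen.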
h) def do_plot(x_vals,y_vals,year), provided by us, takes two equal-length lists of
numbers and plots them. You have to fill in the two labels (replace each empty string with the
appropriate string). Note that if you plot the whole file of data, the income ranges are so
skewed that the result is a nearly vertical plot at the leftmost edge so close to the edge that
you cannot see it in the plot—it looks like nothing was plotted. Plotting the lowest 40
income ranges results in a more easily readable plot.
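Preparing the two lists for do_plot() could look roughly like this. The helper name build_plot_lists and its count parameter are hypothetical, introduced only for illustration; the choice of columns follows the plotting step of main() (income from Column 0 as x values, cumulative percentage from Column 5 as y values).

```python
def build_plot_lists(data_list, count=40):
    """Hypothetical helper: x and y values for the lowest `count` income ranges."""
    rows = data_list[:count]                   # slicing is safe on shorter lists
    x_vals = [row[0][0] for row in rows]       # bottom of each income range
    y_vals = [row[3] for row in rows]          # cumulative percentage
    return x_vals, y_vals
```

The returned lists always have equal length, which is what do_plot() expects.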
i) def main():
a) Open the file
b) Print the year.
c) Read the file
d) Print the average income.
e) Print the median income.
f) Prompt for plotting (yes/no).
If yes, plot the data: cumulative percentage (Column 5 in the file (y values)) vs. income
(Column 0 in the file (x values)). Call the do_plot() function to plot the data. Plot the
lowest 40 income ranges.
g) Loop, prompting for either “r” for range, “p” for percent, or nothing
i. r: prompt for a percent and output the income that is below that percent. The percent
needs to be valid (between 0 and 100 inclusive). Hint: Call the get_range()
function to get the range of income for that percentage. The bottom of the income range
is what we are looking for.
ii. p: prompt for an income and output the percent that earned more. The income needs
to be valid (positive). Hint: Call the get_percent() function to get the
corresponding cumulative percentage.
iii. if only a carriage-return is entered, halt the program
This is a new and different requirement. Hint: if someone simply hits the Enter key,
what will be the value input?
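The stopping condition in step g) can be sketched as a loop that ends on an empty string, since input() returns "" when the user presses only the Enter key. The generator read_choices and its ask parameter are hypothetical, used here so the loop can be shown (and tried) separately from the rest of main().

```python
def read_choices(ask=input):
    """Hypothetical helper: yield each choice until an empty line is entered."""
    prompt = "Enter a choice to get (r)ange, (p)ercent, or nothing to stop: "
    choice = ask(prompt)
    while choice != "":          # pressing only Enter gives the empty string
        yield choice
        choice = ask(prompt)
```

In main() this could be used as `for choice in read_choices():`, with the r and p branches inside the loop body.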
Assignment Notes
- Items 1-9 of the Coding Standard will be enforced for this project.
- Files for year2000.txt, year2014.txt and year2019.txt are provided so that you
can test your program.
- For output you need to insert commas into the dollar amounts. The format specification
supports this: e.g., if you formatted a floating-point value without commas as {:<12.2f},
you can simply insert a comma before the dot as in {:<12,.2f}.
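As an illustration of the comma format specification, the two output lines from the Sample Output section could be produced along these lines. The function names are hypothetical, and the message text (including the space before the final period of the first message) is copied from Test 1.

```python
def range_message(percent, income):
    """Hypothetical helper: message for the (r)ange choice."""
    return "{:.2f}% of incomes are below ${:,.2f} .".format(percent, income)


def percent_message(income, percent):
    """Hypothetical helper: message for the (p)ercent choice."""
    return "An income of ${:,.2f} is in the top {:.2f}% of incomes.".format(
        income, percent)


print(range_message(90, 100000.0))
print(percent_message(100000, 90.01))
print("{:<12,.2f}|".format(1234.5))   # left-aligned in 12 columns, with a comma
```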
Sample Output
Test 1
Enter a year where 1990 <= year <= 2019: 2019
For the year 2019:
The average income was $51,916.27
The median income was $32,452.59
Do you want to plot the data (yes/no): no
Enter a choice to get (r)ange, (p)ercent, or nothing to stop: r
Enter a percent: 90
90.00% of incomes are below $100,000.00 .
Enter a choice to get (r)ange, (p)ercent, or nothing to stop: p
Enter an income: 100000
An income of $100,000.00 is in the top 90.01% of incomes.
Enter a choice to get (r)ange, (p)ercent, or nothing to stop:
Test 2 (no plotting)
Enter a year where 1990 <= year <= 2019: 2000
For the year 2000:
The average income was $30,846.09
The median income was $22,458.80
Do you want to plot the data (yes/no): no
Enter a choice to get (r)ange, (p)ercent, or nothing to stop: r
Enter a percent: 40
40.00% of incomes are below $15,000.00 .
Enter a choice to get (r)ange, (p)ercent, or nothing to stop: p
Enter an income: 50000
An income of $50,000.00 is in the top 87.41% of incomes.
Enter a choice to get (r)ange, (p)ercent, or nothing to stop:
Test 2 (plotting)
Enter a year where 1990 <= year <= 2019: 2000
For the year 2000:
The average income was $30,846.09
The median income was $22,458.80
Do you want to plot the data (yes/no): yes
Enter a choice to get (r)ange, (p)ercent, or nothing to stop:
Test 3
Enter a year where 1990 <= year <= 2019: xxx
Error in year. Please try again.
Enter a year where 1990 <= year <= 2019: 1900
Error in year. Please try again.
Enter a year where 1990 <= year <= 2019: 1999
Error in file name: year1999.txt Please try again.
Enter a year where 1990 <= year <= 2019: 2014
For the year 2014:
The average income was $44,569.20
The median income was $27,457.00
Do you want to plot the data (yes/no): no
Enter a choice to get (r)ange, (p)ercent, or nothing to stop: r
Enter a percent: 70
70.00% of incomes are below $45,000.00 .
Enter a choice to get (r)ange, (p)ercent, or nothing to stop: p
Enter an income: 150000
An income of $150,000.00 is in the top 96.87% of incomes.
Enter a choice to get (r)ange, (p)ercent, or nothing to stop:
Function Test: read_file
year2014.txt
[((0.01, 4999.99), 22574440, 22574440, 14.27075, 46647919125.68, 2066.4),
((5000.0, 9999.99), 13848841, 36423281, 23.02549, 102586913092.61, 7407.62),
((10000.0, 14999.99), 12329270, 48752551, 30.81961, 153566802438.45, 12455.47),
((15000.0, 19999.99), 11505776, 60258327, 38.09315, 200878198035.07, 17458.9),
((20000.0, 24999.99), 10918555, 71176882, 44.99547, 245317570246.88, 22467.95),
((25000.0, 29999.99), 10192863, 81369745, 51.43903, 279865461187.05, 27457.0),
((30000.0, 34999.99), 9487840, 90857585, 57.4369, 307828947411.16, 32444.58),
((35000.0, 39999.99), 8578215, 99435800, 62.85974, 321200755103.44, 37443.78),
((40000.0, 44999.99), 7553972, 106989772, 67.63509, 320563569965.15, 42436.43),
((45000.0, 49999.99), 6542882, 113532654, 71.77126, 310391706424.23, 47439.6),
((50000.0, 54999.99), 5723269, 119255923, 75.38931, 300016377448.51, 52420.46),
((55000.0, 59999.99), 4846517, 124102440, 78.4531, 278354367841.41, 57433.9),
((60000.0, 64999.99), 4201232, 128303672, 81.10897, 262203932128.68, 62411.2),
((65000.0, 69999.99), 3573471, 131877143, 83.36799, 240948179180.4, 67426.93),
((70000.0, 74999.99), 3094739, 134971882, 85.32437, 224145278103.36, 72427.85),
((75000.0, 79999.99), 2684481, 137656363, 87.0214, 207853372824.62, 77427.77),
((80000.0, 84999.99), 2297338, 139953701, 88.4737, 189370862869.17, 82430.56),
((85000.0, 89999.99), 1975400, 141929101, 89.72248, 172719042418.7, 87434.97),
((90000.0, 94999.99), 1714370, 143643471, 90.80624, 158442931588.44, 92420.5),
((95000.0, 99999.99), 1486636, 145130107, 91.74604, 144858203365.61, 97440.26),
((100000.0, 104999.99), 1309068, 146439175, 92.57358, 134083282259.67,
102426.52), ((105000.0, 109999.99), 1117128, 147556303, 93.27979,
120020513136.11, 107436.67), ((110000.0, 114999.99), 977055, 148533358, 93.89745,
109855105705.14, 112434.93), ((115000.0, 119999.99), 865889, 149399247, 94.44483,
101693061676.62, 117443.53), ((120000.0, 124999.99), 773339, 150172586, 94.93371,
94660281091.31, 122404.64), ((125000.0, 129999.99), 673971, 150846557, 95.35977,
85886152964.93, 127433.01), ((130000.0, 134999.99), 595827, 151442384, 95.73643,
78899843713.01, 132420.73), ((135000.0, 139999.99), 527341, 151969725, 96.0698,
72476546845.3, 137437.72), ((140000.0, 144999.99), 466992, 152436717, 96.36501,
66519743635.12, 142443.0), ((145000.0, 149999.99), 419003, 152855720, 96.62989,
61787674520.19, 147463.56), ((150000.0, 154999.99), 384581, 153240301, 96.87301,
58607775121.57, 152393.84), ((155000.0, 159999.99), 335391, 153575692, 97.08503,
52801735517.69, 157433.37), ((160000.0, 164999.99), 296048, 153871740, 97.27218,
48087213596.86, 162430.46), ((165000.0, 169999.99), 265309, 154137049, 97.4399,
44426198104.69, 167450.78), ((170000.0, 174999.99), 239515, 154376564, 97.59131,
41304379348.95, 172450.07), ((175000.0, 179999.99), 216255, 154592819, 97.72802,
38370042895.27, 177429.62), ((180000.0, 184999.99), 200592, 154793411, 97.85483,
36588064085.78, 182400.42), ((185000.0, 189999.99), 179005, 154972416, 97.96799,
33554727208.93, 187451.34), ((190000.0, 194999.99), 165277, 155137693, 98.07247,
31807897759.84, 192452.05), ((195000.0, 199999.99), 154070, 155291763, 98.16987,
30425466536.83, 197478.2), ((200000.0, 249999.99), 1039897, 156331660, 98.82726,
230863458226.21, 222006.08), ((250000.0, 299999.99), 565105, 156896765, 99.1845,
153945762663.99, 272419.75), ((300000.0, 349999.99), 333584, 157230349, 99.39537,
107708119615.81, 322881.55), ((350000.0, 399999.99), 219923, 157450272, 99.5344,
82117070706.61, 373390.1), ((400000.0, 449999.99), 151162, 157601434, 99.62996,
63997346472.5, 423369.28), ((450000.0, 499999.99), 108881, 157710315, 99.69879,
51583042398.64, 473756.14), ((500000.0, 999999.99), 345935, 158056250, 99.91748,
230331407862.96, 665822.79), ((1000000.0, 1499999.99), 65548, 158121798,
99.95892, 78672933288.58, 1200233.92), ((1500000.0, 1999999.99), 24140,
158145938, 99.97418, 41431838733.52, 1716314.78), ((2000000.0, 2499999.99),
12137, 158158075, 99.98185, 26997226154.27, 2224373.91), ((2500000.0,
2999999.99), 6871, 158164946, 99.98619, 18747446313.27, 2728488.77), ((3000000.0,
3499999.99), 4799, 158169745, 99.98923, 15507304422.66, 3231361.62), ((3500000.0,
3999999.99), 3258, 158173003, 99.99129, 12166741762.34, 3734420.43), ((4000000.0,
4499999.99), 2353, 158175356, 99.99277, 9970953222.98, 4237549.18), ((4500000.0,
4999999.99), 1822, 158177178, 99.99393, 8633941395.34, 4738716.46), ((5000000.0,
9999999.99), 6468, 158183646, 99.99802, 43887775808.42, 6785370.41),
((10000000.0, 19999999.99), 2230, 158185876, 99.99942, 30065006121.19,
13482065.53), ((20000000.0, 49999999.99), 776, 158186652, 99.99992,
22450911983.01, 28931587.61), ((50000000.0, None), 134, 158186786, 100.0,
11564829969.82, 86304701.27)]
Function Test: find_average
Instructor: 44569.2
Student: 44569.2
Function Test: find_median
year2014.txt
Instructor: 27457.0
Student: 27457.0
year2019.txt
Instructor: 32452.59
Student: 32452.59
Function Test: get_range
year2014.txt; get_range(data,90)
Instructor: ((90000.0, 94999.99), 90.80624, 92420.5)
Student: ((90000.0, 94999.99), 90.80624, 92420.5)
year2014.txt,get_range(data,50)
Instructor: ((25000.0, 29999.99), 51.43903, 27457.0)
Student: ((25000.0, 29999.99), 51.43903, 27457.0)
year2000.txt,get_range(data,90)
Instructor: ((60000.0, 64999.99), 91.31401, 62377.2)
Student: ((60000.0, 64999.99), 91.31401, 62377.2)
Function Test: get_percent
year2014.txt; get_percent(data,150000)
Instructor: ((150000.0, 154999.99), 96.87301)
Student: ((150000.0, 154999.99), 96.87301)
year2014.txt,get_percent(data,50000)
Instructor: ((50000.0, 54999.99), 75.38931)
Student: ((50000.0, 54999.99), 75.38931)
year2000.txt,get_percent(data,150000)
Instructor: ((150000.0, 154999.99), 98.72567)
Student: ((150000.0, 154999.99), 98.72567)
Function Test: handle_commas
s,T: 5 int
Instructor: 5
Student : 5
s,T: 5.3 float
Instructor: 5.3
Student : 5.3
s,T: 1,234 int
Instructor: 1234
Student : 1234
s,T: 1,234.56 float
Instructor: 1234.56
Student : 1234.56
s,T: 5.3 xxx
Instructor: None
Student : None
s,T: aaa int
Instructor: None
Student : None
s,T: 1,234.56 int
Instructor: None
Student : None
Scoring Rubric
Computer Project #07 Scoring Summary
General Requirements
__ 5 pts Coding Standard 1-9
(descriptive comments, function header, etc...)
Implementation:
0 (5 pts) open_file (manual grading)
0 (3 pts) Function Test handle_commas
0 (8 pts) Function Test read_file
0 (5 pts) Function Test find_average
0 (6 pts) Function Test find_median
0 (5 pts) Function Test get_range
0 (5 pts) Function Test get_percent
0 (5 pts) Test 1
0 (2 pts) Test 2 (no plotting)
0 (2 pts) Test 2 (plotting) (manual grading)
0 (4 pts) Test 3