Basic Measures Concepts: Ratio, Proportion, Percentage, and Rate
Regardless of the measurement scale, when the data are gathered we need to analyze them to extract meaningful information. Various measures and statistics are available for summarizing the raw data and for making comparisons across groups. In this section we discuss some basic measures such as ratio, proportion, percentage, and rate, which are frequently used in our daily lives as well as in various activities associated with software development and software quality. These basic measures, while seemingly easy, are often misused. There are also numerous sophisticated statistical techniques and methodologies that can be employed in data analysis. However, such topics are not within the scope of this discussion.
A ratio results from dividing one quantity by another. The numerator and denominator are from two distinct populations and are mutually exclusive. For example, in demography, sex ratio is defined as
Sex ratio = (Number of males / Number of females) × 100%
If the ratio is less than 100, there are more females than males; if it is greater than 100, there are more males than females. A ratio of exactly 100 means the two groups are equal in size.
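As a minimal sketch, the sex ratio and its interpretation can be computed as follows. The function names and the head counts are hypothetical and serve only to illustrate the formula.

```python
def sex_ratio(males: int, females: int) -> float:
    """Return the number of males per 100 females."""
    if females <= 0:
        raise ValueError("females must be positive")
    return males / females * 100


def interpret_sex_ratio(ratio: float) -> str:
    """Translate a sex ratio into a plain-language comparison."""
    if ratio < 100:
        return "more females than males"
    if ratio > 100:
        return "more males than females"
    return "equal numbers of males and females"


# Hypothetical counts, for illustration only.
ratio = sex_ratio(males=4_900, females=5_000)
print(f"{ratio:.1f}: {interpret_sex_ratio(ratio)}")
```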
Ratios are also used in software metrics. Perhaps the most often used is the ratio of the number of people in an independent test organization to the number in the development group. The test/development head-count ratio can range from 1:1 to 1:10, depending on the management approach to the software development process. In organizations near the 1:10 end of the range (few testers per developer), the development group usually is responsible for the complete development of the product, including extensive development tests, and the test group conducts system-level testing in terms of customer environment verifications. In organizations near the 1:1 end, the independent test group takes the major responsibility for testing (after debugging and code integration) and quality assurance.
Proportion is different from ratio in that the numerator in a proportion is a part of the denominator:
p = a/(a+b)
Proportion also differs from ratio in that ratio is best used for two groups, whereas proportion is used for multiple categories (or populations) of one group. In other words, the denominator in the preceding formula can be more than just a + b. If
a+b+c+d+e=N
then we have
a/N+b/N+c/N+d/N+e/N=1
When the numerator and the denominator are integers and represent counts of certain events, then p is also referred to as a relative frequency. For example, the following gives the proportion of satisfied customers of the total customer set:
[Number of satisfied customers]/[Total number of customers of a software product]
The numerator and the denominator in a proportion need not be integers. They can be frequency counts as well as measurements on a continuous scale (e.g., height in inches, weight in pounds). When the measurement unit is not an integer, proportions are called fractions.
A proportion or a fraction becomes a percentage when it is expressed in terms of per hundred units (the denominator is normalized to 100). The word percent means per hundred. A proportion p is therefore equal to 100p percent (100p%).
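The relationship between counts, proportions, and percentages can be sketched as follows. The survey counts and the helper names are assumptions made for illustration; the point is that the proportions of all categories of one group sum to 1, and that each proportion p corresponds to 100p percent.

```python
def proportions(counts: dict[str, float]) -> dict[str, float]:
    """Divide each category count by the total N, so the results sum to 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("total must be positive")
    return {name: value / total for name, value in counts.items()}


def as_percent(p: float) -> float:
    """Express a proportion p as 100p percent."""
    return 100 * p


# Hypothetical customer survey, for illustration only.
survey = {"satisfied": 170, "neutral": 20, "dissatisfied": 10}
p = proportions(survey)

for name, value in p.items():
    print(f"{name}: p = {value:.3f} ({as_percent(value):.1f}%)")
print(f"sum of proportions = {sum(p.values()):.3f}")
```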
Percentages are frequently used to report results, and as such are frequently misused. First, because percentages represent relative frequencies, it is important that enough contextual information be given, especially the total number of cases, so that the readers can interpret the information correctly. Jones (1992) observes that many reports and presentations in the software industry are careless in using percentages and ratios. He cites the example:
Requirements bugs were 15% of the total, design bugs were 25% of the total, coding bugs were 50% of the total, and other bugs made up 10% of the total.
Had the results been stated as follows, it would have been much more informative:
The project consists of 8 thousand lines of code (KLOC). During its development a total of 200 defects were detected and removed, giving a defect removal rate of 25 defects per KLOC. Of the 200 defects, requirements bugs constituted 15%, design bugs 25%, coding bugs 50%, and other bugs made up 10%.
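One way to make such context hard to omit is to build the statement from the underlying figures rather than from the percentages alone. The sketch below uses the numbers from the example above; the function name and the output wording are assumptions.

```python
def defect_summary(kloc: float, total_defects: int,
                   shares_pct: dict[str, float]) -> str:
    """Report percentages together with project size and total defects."""
    if kloc <= 0:
        raise ValueError("kloc must be positive")
    rate = total_defects / kloc  # defects per KLOC
    parts = ", ".join(f"{name} {pct:g}%" for name, pct in shares_pct.items())
    return (f"{kloc:g} KLOC, {total_defects} defects removed "
            f"({rate:g} defects per KLOC): {parts}")


summary = defect_summary(
    kloc=8,
    total_defects=200,
    shares_pct={"requirements": 15, "design": 25, "coding": 50, "other": 10},
)
print(summary)
```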
A second important rule of thumb is that the total number of cases must be sufficiently large to justify the use of percentages. Percentages computed from a small total are not stable; they also convey the impression that a large number of cases is involved. Some writers recommend that the minimum number of cases for which percentages should be calculated is 50. We recommend that, depending on the number of categories, the minimum number be 30, the smallest sample size required for parametric statistics. If the number of cases is too small, then absolute numbers, instead of percentages, should be used. For instance,
Of the total 20 defects for the entire project of 2 KLOC, there were 3 requirements bugs, 5 design bugs, 10 coding bugs, and 2 others.
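This rule of thumb can be sketched as a small reporting helper that falls back to absolute numbers when the total is below a threshold. The threshold of 30 follows the recommendation above; the function name and the output wording are assumptions.

```python
MIN_CASES_FOR_PERCENT = 30  # minimum N recommended in the text


def describe(counts: dict[str, int],
             min_cases: int = MIN_CASES_FOR_PERCENT) -> str:
    """Report absolute numbers for small totals, percentages otherwise."""
    total = sum(counts.values())
    if total < min_cases:
        body = ", ".join(f"{n} {name}" for name, n in counts.items())
        return f"Of {total} defects: {body}"
    body = ", ".join(
        f"{name} {100 * n / total:.1f}%" for name, n in counts.items()
    )
    return f"N = {total}: {body}"


small = {"requirements": 3, "design": 5, "coding": 10, "other": 2}
large = {"requirements": 30, "design": 50, "coding": 100, "other": 20}
print(describe(small))
print(describe(large))
```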
When results in percentages appear in table format, usually both the percentages and actual numbers are shown when there is only one variable. When there are more than two groups, such as the example in Table 3.1, it is better just to show the percentages and the total number of cases (N) for each group. With percentages and N known, one can always reconstruct the frequency distributions. The total of 100.0% should always be shown so that it is clear how the percentages are computed. In a two-way table, the direction in which the percentages are computed depends on the purpose of the comparison. For instance, the percentages in Table 3.1 are computed vertically (the total of each column is 100.0%), and the purpose is to compare the defect-type profile across projects (e.g., project B proportionally has more requirements defects than project A).
In Table 3.2, the percentages are computed horizontally. The purpose here is to compare the distribution of defects across projects for each type of defect. The interpretations of the two tables differ. Therefore, it is important to carefully examine percentage tables to determine exactly how the percentages are calculated.
Table 3.1 (percentages computed vertically; each column totals 100.0%)

Type of Defect    Project A (%)    Project B (%)    Project C (%)
Requirements           15.0             41.0             20.3
Design                 25.0             21.8             22.7
Code                   50.0             28.6             36.7
Others                 10.0              8.6             20.3
Total                 100.0            100.0            100.0
(N)                   (200)            (105)            (128)
Table 3.2 (percentages computed horizontally; each row totals 100.0%)

Type of Defect      Project A    Project B    Project C    Total    (N)
Requirements (%)       30.3         43.4         26.3      100.0    (99)
Design (%)             49.0         22.5         28.5      100.0    (102)
Code (%)               56.5         16.9         26.6      100.0    (177)
Others (%)             36.4         16.4         47.2      100.0    (55)
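The difference between the two directions of computation can be sketched with a single table of frequency counts. The counts below are assumed values, chosen to be consistent with the N reported for each project and for each defect type; the function names are hypothetical. The last helper illustrates why percentages plus N are enough to reconstruct the frequencies.

```python
# Assumed defect counts (rows: defect type, columns: project).
counts = {
    "Requirements": {"A": 30, "B": 43, "C": 26},
    "Design": {"A": 50, "B": 23, "C": 29},
    "Code": {"A": 100, "B": 30, "C": 47},
    "Others": {"A": 20, "B": 9, "C": 26},
}


def column_percentages(table):
    """Percentages computed vertically: each project's column sums to 100."""
    projects = list(next(iter(table.values())))
    totals = {p: sum(row[p] for row in table.values()) for p in projects}
    return {
        kind: {p: 100 * row[p] / totals[p] for p in projects}
        for kind, row in table.items()
    }


def row_percentages(table):
    """Percentages computed horizontally: each defect type's row sums to 100."""
    return {
        kind: {p: 100 * n / sum(row.values()) for p, n in row.items()}
        for kind, row in table.items()
    }


def reconstruct_counts(percentages, n):
    """Recover frequencies from one group's percentages and its total N."""
    return {kind: round(pct * n / 100) for kind, pct in percentages.items()}


by_project = column_percentages(counts)  # compares defect profiles of projects
by_type = row_percentages(counts)        # compares projects within a defect type
for kind in counts:
    print(kind,
          {p: round(v, 1) for p, v in by_project[kind].items()},
          {p: round(v, 1) for p, v in by_type[kind].items()})
```

Because percentages are rounded for display, a published row or column may be adjusted by 0.1 so that it totals exactly 100.0.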
Ratios, proportions, and percentages are static summary measures. They provide a cross-sectional view of the phenomena of interest at a specific time. The concept of rate is associated with the dynamics (change) of the phenomena of interest; generally it can be defined as a measure of change in one quantity (y) per unit of another quantity (x) on which the former (y) depends. Usually the x variable is time. It is important that the time unit always be specified when describing a rate associated with time. For instance, in demography the crude birth rate (CBR) is defined as:
Crude birth rate = (B / P) × K
where B is the number of live births in a given calendar year, P is the mid-year population, and K is a constant, usually 1,000.
The concept of exposure to risk is also central to the definition of rate, and it distinguishes rate from proportion. Simply stated, all elements or subjects in the denominator have to be at risk of becoming or producing the elements or subjects in the numerator. If we take a second look at the crude birth rate formula, we note that the denominator is the mid-year population, and we know that not everyone in the population is subject to the risk of giving birth. Therefore, the operational definition of CBR is not in compliance with the concept of population at risk, and for this reason it is a "crude" rate. A better measurement is the general fertility rate, in which the denominator is the number of women of childbearing age, usually defined as ages 15 to 44. In addition, there are other more refined measurements for birth rate.
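The effect of the choice of denominator can be sketched by computing both rates from the same number of births. All figures below are hypothetical, and the function name is an assumption; only the formula (events divided by the population at risk, times a constant K) comes from the text.

```python
def rate(events: float, population_at_risk: float, k: float = 1000) -> float:
    """Events per k units of the population exposed to the risk."""
    if population_at_risk <= 0:
        raise ValueError("population_at_risk must be positive")
    return events / population_at_risk * k


# Hypothetical figures for one calendar year, for illustration only.
live_births = 12_000
midyear_population = 1_000_000
women_aged_15_to_44 = 200_000

crude_birth_rate = rate(live_births, midyear_population)
general_fertility_rate = rate(live_births, women_aged_15_to_44)

print(f"CBR: {crude_birth_rate:.1f} births per 1,000 population per year")
print(f"GFR: {general_fertility_rate:.1f} births per 1,000 women aged 15 to 44 per year")
```

The same births yield a much larger rate once the denominator is restricted to the population actually at risk, which is why the time unit and the denominator should both be stated.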
In the quality literature, the risk exposure concept is expressed as opportunities for error (OFE). The numerator is the number of defects of interest. Therefore,
Defect rate = (Number of defects / OFE) × K
In software, defect rate is usually defined as the number of defects per thousand source lines of code (KLOC or KSLOC) in a given time unit (e.g., one year after the general availability of the product in the marketplace, or for the entire life of the product). Note that this metric, defects per KLOC, is also a crude measure. First, the opportunity for error is not known. Second, while any line of source code may be subject to error, a defect may involve many source lines. Therefore, the metric is only a proxy measure of defect rate, even assuming no other problems. Such limitations should be taken into account when analyzing results or interpreting data pertaining to software quality.
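A minimal sketch of the defects-per-KLOC metric follows, using source lines as the proxy for OFE and K = 1,000. The class name, its fields, and the figures are hypothetical; the time period is carried along with the number because a rate is not interpretable without it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DefectRate:
    defects: int       # number of defects of interest
    source_lines: int  # proxy for opportunities for error (OFE)
    period: str        # time unit over which the defects were counted

    def per_kloc(self) -> float:
        """Defects per thousand source lines of code (K = 1,000)."""
        if self.source_lines <= 0:
            raise ValueError("source_lines must be positive")
        return self.defects / self.source_lines * 1000

    def __str__(self) -> str:
        return f"{self.per_kloc():.2f} defects per KLOC ({self.period})"


# Hypothetical product, for illustration only.
field_rate = DefectRate(
    defects=60,
    source_lines=120_000,
    period="first year after general availability",
)
print(field_rate)
```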