Statistics

  1. Population: the collection of all items of interest. It is denoted with an uppercase N, and the numbers we obtain when using a population are called parameters.
  2. Sample: a subset of the population. It is denoted with a lowercase n, and the numbers we obtain when working with a sample are called statistics.
  3. Populations are hard to define and hard to observe in real life.
  4. A sample is much easier to gather. It is less time consuming and less costly. Time and resources are the main reasons we prefer drawing samples to analyzing an entire population.
  5. A sample must be both random and representative for an insight to be precise. A random sample is collected when each member of the sample is chosen from the population strictly by chance.
  6. A representative sample is a subset of the population that accurately reflects the members of the entire population.
  7. A variable represents the gender of a person. What is the level of measurement? Nominal.
  8. A variable represents the weight of a person. What is the level of measurement? Ratio. It is quantitative and there is a true zero point.
  9. There are two main types of data: categorical and numerical.
  10. Categorical data describes categories or groups, for example car brands. Another instance is answers to yes and no questions.
  11. Numerical data, on the other hand, as its name suggests, represents numbers. It is further divided into two subsets: discrete and continuous.
  12. Discrete data can usually be counted in a finite manner. A good example would be the number of children that you want to have. Another instance is grades on an exam.
  13. Discrete data is the opposite of continuous data. Continuous data is infinite and impossible to count. For example, your weight can take on every value in some range. Other examples of continuous data are height, area, distance, and time. Time on a clock is discrete; time in general is continuous.
  14. The measurement levels can be split into two groups: qualitative and quantitative data.
  15. Qualitative data can be nominal or ordinal. Nominal variables are categories like Mercedes, BMW, or Audi, or like the four seasons. They aren’t numbers and cannot be ordered.
  16. Ordinal data, on the other hand, consists of groups and categories which follow a strict order. Imagine you have been asked to rate your lunch and the options are disgusting, unappetising, neutral, tasty, and delicious. Although we have words and not numbers, it is obvious that these preferences are ordered from negative to positive. Thus the level of measurement is qualitative ordinal.
  17. Quantitative data is divided into two groups: interval and ratio. Intervals and ratios are both represented by numbers but have one major difference: ratios have a true zero and intervals don’t. Most things we observe in the real world are ratios. Their name comes from the fact that they can represent ratios of things. For instance, if I have two apples and you have six apples, you have three times as many as I do. Other examples are the number of objects in general, distance, and time. Interval variables are not as common. Temperature is the most common example of an interval variable. It cannot represent a ratio of things and doesn’t have a true zero. Usually temperature is expressed in Celsius or Fahrenheit. They are both interval variables.
  18. Temperatures shown in Kelvin are ratios, as we have a true zero and can make the claim that one temperature is two times more than another. Celsius and Fahrenheit have no true zero and are intervals.
  19. Visualizing data is the most intuitive way to interpret it. Some of the most common ways to visualize categorical variables are frequency distribution tables, bar charts, pie charts, and Pareto diagrams.
  20. A frequency distribution table has two columns: the category itself and the corresponding frequency. Imagine you own a car shop and you sell only German cars. The table lists each brand and the number of units sold. Using the same table, we can construct a bar chart, also known as a column chart. The vertical axis shows the number of units sold, while each bar represents a different category indicated on the horizontal axis. We can also represent the same data as a pie chart. In order to build one, we need to calculate what percentage of the total each brand represents. In statistics, this is known as relative frequency. Naturally, all relative frequencies add up to 100%. Pie charts are especially useful when we want to not only compare items among each other but also see their share of the total.
  21. Market share is so predominantly represented by pie charts that if you search for market share in Google Images, you would only get pie charts.
  22. A Pareto diagram is a special type of bar chart where categories are shown in descending order of frequency. By frequency, statisticians mean the number of occurrences of each item. The diagram combines the strong sides of the bar chart and the pie chart. It is easy to compare the data both between categories and as a part of the total.
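  23. Cumulative frequency is the running sum of the relative frequencies: for each category, it adds that category’s relative frequency to those of all the categories before it.
  24. The Pareto principle states that 80% of the effects come from 20% of the causes. The cumulative frequency curve on a Pareto diagram is designed to show how subtotals change with each additional category and provide us with a better understanding of our data.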
  25. When numerical data is grouped into intervals, a number is included in a particular interval if that number: 1) is greater than the lower bound; 2) is less than or equal to the upper bound.
  26. For many analyses it is useful to calculate the relative frequency of the data points in each interval. The relative frequency is the frequency of a given interval as part of the total.
  27. The most common graph used to represent numerical data is the histogram. As in the bar chart, the vertical axis is numerical and shows the absolute frequency. This time, though, the horizontal axis is numerical too. Each bar has a width equal to the interval and a height equal to the frequency. The bars touch to show that there is continuity between the intervals: each interval ends where the next one starts. In the bar chart, different bars represented different categories, so the bars were completely separate. Sometimes it is useful to plot the intervals against the relative rather than the absolute frequency. The histogram looks the same visually but gives different information to the audience, because relative frequency is made up of percentages. It is also possible to create a histogram with unequal intervals. There is no straightforward way to do that in Excel, but it is a useful thing to know. An example is a survey design with age groups. You have likely completed a survey where you were asked about your age and the possible answers were 18 to 25, then 26 to 30, 31 to 35, and so on until 60 plus. Clearly the interval widths vary and reflect different focus groups for the experiment, and the researchers had an explanation for that choice: maybe young adults under 25 cannot afford the product, while adults over 60 have no interest in it.
  28. Variables are divided into two types: categorical and numerical.
  29. The most common way to represent categorical variables is using cross tables or, as some statisticians call them, contingency tables.
  30. After building the cross table, we can proceed by visualizing the data. A very useful chart in such cases is a variation of the bar chart called the side-by-side bar chart. It represents the holdings of each investor in the different types of assets: stocks are in green and bonds are in red. All graphs are very easy to create and read once you have identified the type of data you are dealing with and decided on the best way to visualize it.
  31. The scatterplot is used to represent two numerical variables. Scatterplots usually represent lots and lots of observations. When interpreting a scatterplot, a statistician is not expected to look into single data points. They will be much more interested in getting the main idea of how the data is distributed.
  32. An outlier is a data point that goes against the logic of the whole dataset.
  33. The mean is also known as the simple average. We can find the mean of a dataset by adding up all of its components and then dividing the sum by their number. The mean is the most common measure of central tendency. It is easily affected by outliers.
  34. The median is basically the middle number in an ordered dataset. It is the number at position (n + 1) / 2 in the ordered list, where n is the number of observations. When n is even, this position falls between two values and the median is their average.
  35. The mode is the value that occurs most often. It can be used for both numerical and categorical data.
  36. The most commonly used tool to measure asymmetry is skewness. Skewness indicates whether the observations in a dataset are concentrated on one side.
  37. Mean > median. We say that there is a positive or a right skew. Right skewness means that the outliers are to the right. In this case the mean is bigger than the median, and the mode is the value with the highest visual representation (the highest bar in the chart).
  38. Mean = median = mode. The frequency of occurrence is completely symmetrical and we call this a zero or no skew. Most often you will hear people say that the distribution is symmetrical for the data set.
  39. Mean < median. We say that there is a negative or a left skew. The outliers are to the left.
  40. Skewness tells us a lot about where the data is situated.
  41. Measures of asymmetry like skewness are the link between central tendency measures and probability theory, which ultimately allows us to get a more complete understanding of the data we are working with.
  42. The most common ways to measure variability are variance, standard deviation, and the coefficient of variation.
  43. When you have the whole population, each data point is known, so you are 100% sure of the measures you are calculating. When you take a sample of this population and compute a sample statistic, it is interpreted as an approximation of the population parameter.
  44. The sample mean is the average of the sample data points. The population mean is the average of the population data points. Technically there are two different formulas but they are computed in the same way.
  45. Variance measures the dispersion of a set of data points around their mean value. Population variance, denoted by sigma squared (σ²), is equal to the sum of squared differences between the observed values and the population mean, divided by the total number of observations.
  46. Sample variance, on the other hand, is denoted by s squared (s²) and is equal to the sum of squared differences between the observed sample values and the sample mean, divided by the number of sample observations minus one.
  47. Squaring the differences has two main purposes. First, by squaring the numbers we always get non-negative values. Without going too deep into the mathematics of it, it is intuitive that dispersion cannot be negative. Dispersion is about distance and distance cannot be negative.
  48. If, on the other hand, we calculate the differences and do not raise them to the second power, we would obtain both positive and negative values that, when summed, would cancel out, leaving us with no information about the dispersion. Second, squaring amplifies the effect of large differences.
  49. Sample variance formula is used when our set of observations is a sample drawn from a bigger population.
  50. Why does the sample variance formula give a bigger result than the population variance formula for the same data? In the first case, we know the population: we had all the data and we calculated the variance. In the second case, we had a sample but did not know the population, so there is more uncertainty. Dividing by n - 1 instead of n makes the result slightly larger, which corrects for the tendency of a sample to understate the spread of the population.
  51. Standard deviation: variance is expressed in squared units, which makes it hard to interpret. The easy fix is to calculate its square root and obtain the standard deviation. In most analyses you perform, standard deviation will be much more meaningful than variance.
  52. The formulas are the square root of the population variance and the square root of the sample variance, respectively.
  53. The coefficient of variation (CV) is equal to the standard deviation divided by the mean. Another name for the term is relative standard deviation. It is used to compare two or more datasets.
  54. It is simply the standard deviation relative to the mean, as you probably guessed. There is a population formula and a sample formula.
  55. Standard deviation is the most common measure of variability for a single dataset.
  56. Variance gives results in squared units, while standard deviation gives results in the original units.
  57. Standard deviation is the preferred measure of variability, as it is directly interpretable.
  58. Dividing the standard deviations of two datasets by their respective means, we get the two coefficients of variation. Because the coefficient of variation has no unit of measurement, the two datasets can be compared directly.
  59. When two variables are correlated, the main statistic used to measure this relationship is called covariance.
  60. Covariance may be positive, equal to zero, or negative. Covariance gives a sense of direction. > 0: the two variables move together. < 0: the two variables move in opposite directions. = 0: there is no linear relationship between the two variables. Independent variables have zero covariance, but zero covariance alone does not guarantee independence.
  61. Correlation adjusts covariance so that the relationship between the two variables becomes easy and intuitive to interpret. The formulas for the correlation coefficient are the covariance divided by the product of the standard deviations of the two variables.
  62. A correlation of zero between two variables means that there is no linear relationship between them. Independent variables always have zero correlation, but the reverse is not guaranteed.
  63. Negative correlation coefficient: it can be a perfect negative correlation of minus 1 or, much more likely, an imperfect negative correlation with a value between minus 1 and zero.
  64. The correlation between two variables x and y is the same as the correlation between y and x. The formula is completely symmetrical with respect to both variables. It therefore says nothing about which variable drives the other, and it is very important for any analyst or researcher to understand the direction of causal relationships.
  65. Correlation does not imply causation.
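
The frequency ideas in notes 20 to 25 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sales figures are made up, and the helper names pareto_table and interval_index are hypothetical, not taken from any library. The first function builds the rows of a Pareto-style table (frequency, relative frequency, cumulative frequency); the second applies the interval rule from note 25 (greater than the lower bound, less than or equal to the upper bound).

```python
from bisect import bisect_left


def pareto_table(frequencies):
    """Rows of (category, frequency, relative, cumulative), most frequent first."""
    total = sum(frequencies.values())
    rows = []
    running = 0
    ordered = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
    for category, count in ordered:
        running += count
        # Relative frequency is the share of the total;
        # cumulative frequency is the running share so far.
        rows.append((category, count, count / total, running / total))
    return rows


def interval_index(value, edges):
    """Index i such that edges[i] < value <= edges[i + 1], or None if outside."""
    i = bisect_left(edges, value) - 1
    return i if 0 <= i < len(edges) - 1 else None


# Hypothetical unit sales for the car shop example (numbers are invented).
sales = {"Audi": 120, "BMW": 100, "Mercedes": 180}
table = pareto_table(sales)
for category, count, relative, cumulative in table:
    print(f"{category:<10}{count:>5}{relative:>8.0%}{cumulative:>8.0%}")

# Hypothetical interval bounds: (0, 20], (20, 40], (40, 60].
edges = [0, 20, 40, 60]
print(interval_index(20, edges))  # upper bound belongs to the first interval
print(interval_index(0, edges))   # lower bound itself is excluded
```

The cumulative column always ends at 100%, which is what the rising line on a Pareto diagram shows.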

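The measures of central tendency and variability in notes 33 to 58 can be checked on a small dataset. The sketch below uses Python's standard statistics module for the mean, median, and mode, and spells out the variance formulas by hand so the difference between dividing by n and by n - 1 is visible. The data values and the helper names variance and coefficient_of_variation are assumptions made for the illustration.

```python
import math
import statistics


def variance(values, sample=True):
    """Sum of squared differences from the mean, divided by n - 1 (sample) or n (population)."""
    n = len(values)
    mean = sum(values) / n
    squared_diffs = sum((x - mean) ** 2 for x in values)
    return squared_diffs / (n - 1 if sample else n)


def coefficient_of_variation(values, sample=True):
    """Standard deviation divided by the mean (relative standard deviation)."""
    mean = sum(values) / len(values)
    return math.sqrt(variance(values, sample)) / mean


# Hypothetical observations; the value 10 acts as an outlier on the right.
data = [2, 4, 4, 5, 10]

summary = {
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "mode": statistics.mode(data),
    "population_variance": variance(data, sample=False),
    "sample_variance": variance(data, sample=True),
    "sample_std": math.sqrt(variance(data, sample=True)),
    "cv": coefficient_of_variation(data),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

For this data the mean is 5 and the median is 4, so mean > median, which matches the right skew described in note 37. The sample variance (9) is larger than the population variance (7.2) because the same sum of squares is divided by 4 instead of 5.
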
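Covariance and correlation (notes 59 to 65) can be sketched the same way. The functions below use the sample formulas (dividing by n - 1); the names covariance and correlation and all data values are assumptions for the illustration. The last pair of variables is chosen so that y is completely determined by x, yet the covariance is zero, which shows why a zero value should be read as "no linear relationship" and not as proof that the variables are unrelated.

```python
import math


def covariance(xs, ys):
    """Sample covariance: sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Covariance divided by the product of the two sample standard deviations."""
    std_x = math.sqrt(covariance(xs, xs))  # covariance of a variable with itself is its variance
    std_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)


# Hypothetical data.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]      # moves together with x
y_down = [10, 8, 6, 4, 2]    # moves in the opposite direction

r_up = correlation(x, y_up)
r_down = correlation(x, y_down)
print(f"positive: {r_up:.2f}, negative: {r_down:.2f}")

# The formula is symmetrical: swapping the variables changes nothing.
print(math.isclose(correlation(x, y_up), correlation(y_up, x)))

# y depends entirely on x (y = x squared), but the covariance is zero.
x_sym = [-2, -1, 0, 1, 2]
y_sq = [v ** 2 for v in x_sym]
cov_sq = covariance(x_sym, y_sq)
print(f"covariance of x and x squared: {cov_sq}")
```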