木东居士 Study Plan: Week 4, Data Distributions, Hands-On with Python

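This week's task is to decide whether a dataset follows a normal distribution.
This article only considers whether the data are normally distributed; what the data mean is out of scope.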

  • Dataset URL: http://jse.amstat.org/datasets/normtemp.dat.txt
  • Dataset description: three columns in total: body temperature, gender, heart rate
  • Detailed dataset description: Journal of Statistics Education, V4N2: Shoemaker

Approach:

  1. Take a rough look at the shape of the distribution
  2. Call functions from scientific computing packages to test whether the data are normally distributed
  • kstest https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html
  • shapiro https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html?highlight=shapiro
  • normaltest https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html?highlight=normaltest#scipy.stats.normaltest
  • lilliefors https://www.statsmodels.org/devel/generated/statsmodels.stats.diagnostic.lilliefors.html
  • anderson https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.anderson.html
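Step 1 of the approach (a rough look at the shape) can be sketched without any plotting library. This is only an illustrative sketch: `sample` is an idealized stand-in built from evenly spaced normal quantiles, the location 98.25 and scale 0.73 are assumed values rather than numbers taken from the dataset, and `text_histogram` is a hypothetical helper name. `scipy.stats.probplot` is used only to compute the Q-Q fit; an `r` close to 1 means the Q-Q plot would be close to a straight line.

```python
import numpy as np
from scipy import stats

# Idealized stand-in for the temperature column (assumed values, not the real data)
quantiles = (np.arange(1, 131) - 0.5) / 130
sample = 98.25 + 0.73 * stats.norm.ppf(quantiles)


def text_histogram(values, bins=8, width=40):
    """Return one text line per histogram bin, with a bar scaled to the tallest bin."""
    counts, edges = np.histogram(values, bins=bins)
    lines = []
    for count, left, right in zip(counts, edges[:-1], edges[1:]):
        bar = "#" * int(round(width * count / counts.max()))
        lines.append(f"{left:7.2f} to {right:7.2f} | {bar}")
    return lines


for line in text_histogram(sample):
    print(line)

# Without a plot argument, probplot only computes the Q-Q points and the fitted line
(_, _), (slope, intercept, r) = stats.probplot(sample, dist="norm")
print(f"Q-Q correlation r = {r:.4f}")
```

With real data, replace `sample` with the loaded temperature column; a roughly bell-shaped histogram and an `r` near 1 are only a first hint, not a test.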

Preparation: load the data

import numpy as np
data = np.loadtxt('temp_data.txt')
temp = data[:,0]

Testing for normality

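  1. kstest
from scipy import stats

def check_normality(testData):
    # kstest is applied only to samples larger than 300;
    # for smaller samples this function returns None
    if len(testData) > 300:
        # Plain 'norm' means the standard normal N(0, 1), so pass the
        # fitted mean and standard deviation as the distribution arguments
        p_value = stats.kstest(testData, 'norm', args=stats.norm.fit(testData))[1]
        if p_value < 0.05:
            print("use kstest:")
            print("data are not normally distributed")
            return False
        else:
            print("use kstest:")
            print("data are normally distributed")
            return True

# check
print(check_normality(temp))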
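One detail of `kstest` is easy to miss: the string `'norm'` refers to the standard normal N(0, 1), so data measured on another scale are rejected for their location and spread alone. The sketch below illustrates this on an idealized sample (evenly spaced normal quantiles; 98.25 and 0.73 are assumed values, not taken from the dataset) and shows two equivalent ways around it.

```python
import numpy as np
from scipy import stats

# Idealized bell-shaped sample on a temperature-like scale (assumed values)
quantiles = (np.arange(1, 131) - 0.5) / 130
sample = 98.25 + 0.73 * stats.norm.ppf(quantiles)

# Compared with N(0, 1): rejected because of location and scale alone
p_standard = stats.kstest(sample, 'norm')[1]

# Compared with a normal that uses the fitted mean and standard deviation
loc, scale = stats.norm.fit(sample)
p_fitted = stats.kstest(sample, 'norm', args=(loc, scale))[1]

# Equivalent alternative: standardize first, then compare with N(0, 1)
z = (sample - loc) / scale
p_zscore = stats.kstest(z, 'norm')[1]

print(f"against N(0, 1):      p = {p_standard:.3g}")
print(f"against fitted norm:  p = {p_fitted:.3g}")
print(f"after standardizing:  p = {p_zscore:.3g}")
```

Because the mean and standard deviation are estimated from the same sample, the resulting p-value tends to be too lenient; the Lilliefors test is the variant of this test that corrects for that.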
  2. shapiro
    For samples smaller than 50, use the Shapiro-Wilk test
from scipy import stats

def check_normality(testData):
    # For other sample sizes this function returns None
    if len(testData) < 50:
        p_value = stats.shapiro(testData)[1]
        if p_value < 0.05:
            print("use shapiro:")
            print("data are not normally distributed")
            return False
        else:
            print("use shapiro:")
            print("data are normally distributed")
            return True

# check
print(check_normality(temp))
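  3. normaltest
    For sample sizes in (20, 50), use the normaltest algorithm to test for normality
from scipy.stats import normaltest

def check_normality(testData):
    # The original listing breaks off after "if 20"; the condition and body
    # below are assumed from the stated (20, 50) range and the other checks
    if 20 < len(testData) < 50:
        p_value = normaltest(testData)[1]
        print("use normaltest:")
        return p_value >= 0.05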
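As a self-contained sketch of this check, the function below applies `scipy.stats.normaltest` only when the sample size lies in (20, 50) and returns `None` otherwise. The function name `check_normality_normaltest`, the `alpha` parameter, and the two idealized samples (quantiles of a normal and of an exponential distribution) are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm, normaltest


def check_normality_normaltest(sample, alpha=0.05):
    """Return True/False from normaltest, or None if the size is outside (20, 50)."""
    n = len(sample)
    if not 20 < n < 50:
        return None
    statistic, p_value = normaltest(sample)
    return bool(p_value >= alpha)


# Two idealized samples of 30 points each, built from evenly spaced quantiles
q = (np.arange(1, 31) - 0.5) / 30
bell = norm.ppf(q)        # symmetric, bell-shaped
skewed = -np.log(1 - q)   # exponential quantiles, strongly right-skewed

print(check_normality_normaltest(bell))
print(check_normality_normaltest(skewed))
print(check_normality_normaltest(np.arange(10)))
```

`normaltest` combines a skewness test and a kurtosis test, which is why the skewed sample fails it even at this small size.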
  4. lilliefors
    For sample sizes in [50, 300], use the Lilliefors test
from statsmodels.stats.diagnostic import lilliefors

def check_normality(testData):
    # For other sample sizes this function returns None
    if 300 >= len(testData) >= 50:
        p_value = lilliefors(testData)[1]
        if p_value < 0.05:
            print("use lilliefors:")
            print("data are not normally distributed")
            return False
        else:
            print("use lilliefors:")
            print("data are normally distributed")
            return True

# check
print(check_normality(temp))
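The sample-size rules used so far can be folded into one helper that picks the test for you. This is one possible reading of those rules, not a definitive recipe: it uses shapiro below 50, lilliefors for [50, 300], and kstest above 300, and it leaves out normaltest because its (20, 50) range overlaps with shapiro. The name `check_normality_by_size` and the `alpha` parameter are assumptions, and statsmodels is imported only when the lilliefors branch actually runs.

```python
import numpy as np
from scipy import stats


def check_normality_by_size(sample, alpha=0.05):
    """Pick a normality test by sample size and return (test_name, is_normal)."""
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    if n < 50:
        name = "shapiro"
        p_value = stats.shapiro(sample)[1]
    elif n <= 300:
        # Imported here so the other branches work without statsmodels installed
        from statsmodels.stats.diagnostic import lilliefors
        name = "lilliefors"
        p_value = lilliefors(sample)[1]
    else:
        name = "kstest"
        # Compare with a normal using the fitted mean and standard deviation
        p_value = stats.kstest(sample, 'norm', args=stats.norm.fit(sample))[1]
    return name, bool(p_value >= alpha)


def ideal_normal(n):
    """Evenly spaced normal quantiles: an idealized bell-shaped sample of size n."""
    return stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)


print(check_normality_by_size(ideal_normal(30)))
print(check_normality_by_size(ideal_normal(500)))
```

Returning the test name together with the verdict keeps it visible which rule was applied to a given sample.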
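  5. anderson
from scipy.stats import anderson

# Returns the test statistic together with critical values
# at several significance levels, rather than a single p-value
print(anderson(temp))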
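Unlike the other tests, `anderson` does not hand back a single p-value by default; its result carries a `statistic` plus arrays of `critical_values` and `significance_level` (in percent). A minimal sketch of turning that into a yes/no answer follows, assuming a 5% level; the helper name `anderson_is_normal` and the idealized samples are assumptions for illustration.

```python
import numpy as np
from scipy import stats


def anderson_is_normal(sample, level=5.0):
    """Return True if the Anderson-Darling statistic is below the critical
    value for the significance level (in percent) closest to `level`."""
    result = stats.anderson(sample, dist='norm')
    levels = np.asarray(result.significance_level, dtype=float)
    idx = int(np.argmin(np.abs(levels - level)))
    return bool(result.statistic < result.critical_values[idx])


# Idealized samples of 130 points built from evenly spaced quantiles
q = (np.arange(1, 131) - 0.5) / 130
bell = 98.25 + 0.73 * stats.norm.ppf(q)   # assumed location and scale
skewed = -np.log(1 - q)                   # exponential quantiles

result = stats.anderson(bell, dist='norm')
for crit, sig in zip(result.critical_values, result.significance_level):
    verdict = "reject" if result.statistic > crit else "fail to reject"
    print(f"{sig:4.1f}%: critical value {crit:.3f}, {verdict} normality")

print(anderson_is_normal(bell), anderson_is_normal(skewed))
```

A statistic above the critical value means normality is rejected at that significance level.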

References:

  1. SciPy documentation: https://docs.scipy.org/doc/
  2. Post shared by 夜跑 on 知识星球: https://blog.csdn.net/YEPAO01/article/details/99197487
  3. Post shared by 追寻原风景 on 知识星球
  4. anderson: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.anderson.html
