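These notes record my progress through, and some reflections on, Allen B. Downey's book Think Stats: Probability and Statistics for Programmers.
Are most first babies born after their due date?
Place survey.py, the NSFG data-processing script, in the same directory as the NSFG data files and run it. The program reads the data files and prints the number of records in each file: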
>>>Number of respondents 7643
>>>Number of pregnancies 13593
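survey.py defines the following six classes:
Class | Description |
---|---|
Record | Object representing a single record |
Respondent | Subclass of Record; represents one respondent record |
Pregnancy | Subclass of Record; represents one pregnancy record |
Table | Table object representing a collection of records |
Respondents | Subclass of Table; the table of respondent records |
Pregnancies | Subclass of Table; the table of pregnancy records |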
Table
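Function prototype | Description | Parameters | Returns |
---|---|---|---|
ReadFile(self, data_dir, filename, fields, constructor, n=None) | Reads a (possibly gzip-compressed) data file and builds one object per record | data_dir: string directory name; filename: string name of the file to read; fields: sequence of (name, start, end, cast) tuples specifying the fields to extract; constructor: the kind of object to create; n: optional maximum number of records to read (None reads them all) | |
MakeRecord(self, line, fields, constructor) | Scans one line and returns an object with the appropriate fields | line: one line of text from the data file; fields: sequence of (name, start, end, cast) tuples specifying the fields to extract; constructor: callable that creates the object for the record | Record with the appropriate fields |
AddRecord(self, record) | Adds one record to the table | record: an object of one of the record types | |
ExtendRecords(self, records) | Adds a sequence of records to the table | records: a sequence of record objects | |
Recode(self) | Hook that subclasses can override to recode values; does nothing by default | | |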
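
The (name, start, end, cast) convention used by ReadFile and MakeRecord can be sketched on its own. The snippet below is a minimal illustration, not the code from survey.py: the column layout in `FIELDS`, the helper `make_record`, and the two sample lines are all made up for the example. Column positions are 1-based and inclusive, which is why the slice starts at `start - 1`.

```python
# Hypothetical fixed-width layout (not the real NSFG columns):
# 1-based, inclusive column ranges in (name, start, end, cast) form.
FIELDS = [
    ('caseid', 1, 4, int),
    ('prglength', 5, 6, int),
    ('outcome', 7, 7, int),
]


class Record(object):
    """Plain container whose attributes are set dynamically."""


def make_record(line, fields, constructor=Record):
    """Builds one object from a fixed-width line."""
    obj = constructor()
    for name, start, end, cast in fields:
        try:
            # start is 1-based and end is inclusive, hence start - 1
            val = cast(line[start - 1:end])
        except ValueError:
            # blank or malformed fields become the marker 'NA'
            val = 'NA'
        setattr(obj, name, val)
    return obj


# Made-up sample lines; the second one has a blank prglength field.
rec = make_record('0001391', FIELDS)
missing = make_record('0002  1', FIELDS)
print(rec.caseid, rec.prglength, rec.outcome)              # 1 39 1
print(missing.caseid, missing.prglength, missing.outcome)  # 2 NA 1
```

Storing 'NA' instead of raising keeps one bad field from aborting the whole read, at the cost that later code has to check for the marker before doing arithmetic.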
Respondents
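Function prototype | Description | Returns |
---|---|---|
ReadRecords(self, data_dir='.', n=None) | Reads the records and builds the respondent table | |
GetFilename(self) | Returns the name of the data file | 2002FemResp.dat.gz |
GetFields(self) | Returns a list of tuples specifying the fields to extract; these fields become attributes of each Record object | caseid: integer ID of the respondent |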
Pregnancies
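Function prototype | Description | Returns |
---|---|---|
ReadRecords(self, data_dir='.', n=None) | Reads the records and builds the pregnancy table | |
GetFilename(self) | Returns the name of the data file | 2002FemPreg.dat.gz |
GetFields(self) | Returns a list of tuples specifying the fields to extract; these fields become attributes of each Record object | caseid: integer ID of the respondent; prglength: length of the pregnancy in weeks; outcome: integer code for the outcome of the pregnancy, where 1 indicates a live birth; birthord: birth order of each live birth; finalwgt: statistical weight of the respondent; nbrnaliv; babysex; birthwgt_lb; birthwgt_oz; agepreg |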
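
Two of these fields are recoded after reading: Pregnancies.Recode in the survey.py listing divides agepreg by 100 and combines birthwgt_lb and birthwgt_oz into a total weight in ounces. The sketch below restates that logic as a standalone function so it can be tried without the data files. The function name `recode` and the sample values are assumptions made for illustration, and the comment about units reflects my reading of the divide-by-100 step.

```python
def recode(agepreg, birthwgt_lb, birthwgt_oz):
    """Returns (age in years, total birth weight in ounces).

    Hypothetical helper mirroring the checks in Pregnancies.Recode;
    missing values are represented by the string 'NA'.
    """
    # agepreg appears to be stored in hundredths of a year
    age = agepreg / 100.0 if agepreg != 'NA' else 'NA'
    # same validity checks as the listing: pounds < 20, ounces <= 16
    if (birthwgt_lb != 'NA' and birthwgt_lb < 20 and
            birthwgt_oz != 'NA' and birthwgt_oz <= 16):
        total_oz = birthwgt_lb * 16 + birthwgt_oz
    else:
        total_oz = 'NA'
    return age, total_oz


# Made-up values: 25.75 years, 7 lb 8 oz
print(recode(2575, 7, 8))       # (25.75, 120)
# Missing age and an out-of-range weight code
print(recode('NA', 99, 'NA'))   # ('NA', 'NA')
```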
"""This file contains code for use with "Think Stats", by Allen B. Downey, available from greenteapress.com
Copyright 2010 Allen B. Downey
License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html
"""
import sys
import gzip
import os
class Record(object):
    """Represents a record."""

class Respondent(Record):
    """Represents a respondent."""

class Pregnancy(Record):
    """Represents a pregnancy."""
class Table(object):
    """Represents a table as a list of objects"""

    def __init__(self):
        self.records = []

    def __len__(self):
        return len(self.records)

    def ReadFile(self, data_dir, filename, fields, constructor, n=None):
        """Reads a compressed data file and builds one object per record.
        Args:
            data_dir: string directory name
            filename: string name of the file to read
            fields: sequence of (name, start, end, cast) tuples specifying
                the fields to extract
            constructor: what kind of object to create
            n: maximum number of records to read (None reads all)
        """
        filename = os.path.join(data_dir, filename)
        if filename.endswith('gz'):
            fp = gzip.open(filename)
        else:
            fp = open(filename)
        for i, line in enumerate(fp):
            if i == n:
                break
            record = self.MakeRecord(line, fields, constructor)
            self.AddRecord(record)
        fp.close()

    def MakeRecord(self, line, fields, constructor):
        """Scans a line and returns an object with the appropriate fields.
        Args:
            line: string line from a data file
            fields: sequence of (name, start, end, cast) tuples specifying
                the fields to extract
            constructor: callable that makes an object for the record.
        Returns:
            Record with appropriate fields.
        """
        obj = constructor()
        for (field, start, end, cast) in fields:
            try:
                s = line[start-1:end]
                val = cast(s)
            except ValueError:
                # If you are using Visual Studio, you might see an
                # "error" at this point, but it is not really an error;
                # I am just using try...except to handle not-available (NA)
                # data. You should be able to tell Visual Studio to
                # ignore this non-error.
                val = 'NA'
            setattr(obj, field, val)
        return obj

    def AddRecord(self, record):
        """Adds a record to this table.
        Args:
            record: an object of one of the record types.
        """
        self.records.append(record)

    def ExtendRecords(self, records):
        """Adds records to this table.
        Args:
            records: a sequence of record objects
        """
        self.records.extend(records)

    def Recode(self):
        """Child classes can override this to recode values."""
        pass
class Respondents(Table):
    """Represents the respondent table."""

    def ReadRecords(self, data_dir='.', n=None):
        filename = self.GetFilename()
        self.ReadFile(data_dir, filename, self.GetFields(), Respondent, n)
        self.Recode()

    def GetFilename(self):
        return '2002FemResp.dat.gz'

    def GetFields(self):
        """Returns a list of tuples specifying the fields to extract.
        The elements of each tuple are field, start, end, cast.
            field is the name of the variable
            start and end are the indices as specified in the NSFG docs
            cast is a callable that converts the result to int, float, etc.
        """
        return [
            ('caseid', 1, 12, int),
            ]
class Pregnancies(Table):
    """Contains survey data about a Pregnancy."""

    def ReadRecords(self, data_dir='.', n=None):
        filename = self.GetFilename()
        self.ReadFile(data_dir, filename, self.GetFields(), Pregnancy, n)
        self.Recode()

    def GetFilename(self):
        return '2002FemPreg.dat.gz'

    def GetFields(self):
        """Gets information about the fields to extract from the survey data.
        Documentation of the fields for Cycle 6 is at
        http://nsfg.icpsr.umich.edu/cocoon/WebDocs/NSFG/public/index.htm
        Returns:
            sequence of (name, start, end, type) tuples
        """
        return [
            ('caseid', 1, 12, int),
            ('nbrnaliv', 22, 22, int),
            ('babysex', 56, 56, int),
            ('birthwgt_lb', 57, 58, int),
            ('birthwgt_oz', 59, 60, int),
            ('prglength', 275, 276, int),
            ('outcome', 277, 277, int),
            ('birthord', 278, 279, int),
            ('agepreg', 284, 287, int),
            ('finalwgt', 423, 440, float),
            ]

    def Recode(self):
        for rec in self.records:
            # divide mother's age by 100
            try:
                if rec.agepreg != 'NA':
                    rec.agepreg /= 100.0
            except AttributeError:
                pass
            # convert weight at birth from lbs/oz to total ounces
            # note: there are some very low birthweights
            # that are almost certainly errors, but for now I am not
            # filtering
            try:
                if (rec.birthwgt_lb != 'NA' and rec.birthwgt_lb < 20 and
                    rec.birthwgt_oz != 'NA' and rec.birthwgt_oz <= 16):
                    rec.totalwgt_oz = rec.birthwgt_lb * 16 + rec.birthwgt_oz
                else:
                    rec.totalwgt_oz = 'NA'
            except AttributeError:
                pass
def main(name, data_dir='.'):
    resp = Respondents()
    resp.ReadRecords(data_dir)
    print('Number of respondents', len(resp.records))
    preg = Pregnancies()
    preg.ReadRecords(data_dir)
    print('Number of pregnancies', len(preg.records))

if __name__ == '__main__':
    main(*sys.argv)
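
The reading loop in Table.ReadFile stops early when the line index reaches n, and reads everything when n is None. A minimal sketch of that pattern follows, assuming Python 3. The helper `read_lines`, the temporary file name, and the sample lines are hypothetical. Unlike the listing, the sketch opens the file in text mode ('rt') and uses a `with` block so the file is closed even if parsing fails.

```python
import gzip
import os
import tempfile


def read_lines(path, n=None):
    """Returns at most n lines from a plain or gzip-compressed text file."""
    opener = gzip.open if path.endswith('gz') else open
    lines = []
    with opener(path, 'rt') as fp:
        for i, line in enumerate(fp):
            # i == None is never true, so n=None reads the whole file
            if i == n:
                break
            lines.append(line.rstrip('\n'))
    return lines


# Write a tiny made-up data file so the sketch runs without the NSFG data.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'sample.dat.gz')
with gzip.open(path, 'wt') as fp:
    fp.write('0001391\n0002  1\n0003401\n')

first_two = read_lines(path, n=2)
everything = read_lines(path)
print(first_two)
print(len(everything))
```

Passing a small n is a convenient way to try the parsing code on a few records before reading the full files.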
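Place first.py, the script that computes mean pregnancy length, in the same directory as survey.py and the NSFG data files and run it. The program reads the data files and compares the mean pregnancy length of first babies with that of other babies: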
>>>Number of first babies 4413
>>>Number of others 4735
>>>Mean gestation in weeks:
>>>First babies 38.60095173351461
>>>Others 38.52291446673706
>>>Difference in days 0.5462608674428466
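
The last line of the output follows directly from the two means. As a quick check, the arithmetic below takes the two values printed above, converts the difference from weeks to days, and then to hours. The variable names are chosen for this example only.

```python
# Means copied from the program output above, in weeks.
mu_first = 38.60095173351461
mu_other = 38.52291446673706

diff_weeks = mu_first - mu_other   # about 0.078 weeks
diff_days = diff_weeks * 7.0       # about 0.546 days
diff_hours = diff_days * 24.0      # about 13.1 hours

print(round(diff_days, 3), round(diff_hours, 1))  # 0.546 13.1
```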
"""This file contains code used in "Think Stats",
by Allen B. Downey, available from greenteapress.com
Copyright 2010 Allen B. Downey
License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html
"""
import survey
# copying Mean from thinkstats.py so we don't have to deal with
# importing anything in Chapter 1
def Mean(t):
    """Computes the mean of a sequence of numbers.
    Args:
        t: sequence of numbers
    Returns:
        float
    """
    return float(sum(t)) / len(t)
def PartitionRecords(table):
    """Divides records into two lists: first babies and others.
    Only live births are included
    Args:
        table: pregnancy Table
    """
    firsts = survey.Pregnancies()
    others = survey.Pregnancies()
    for p in table.records:
        # skip non-live births
        if p.outcome != 1:
            continue
        if p.birthord == 1:
            firsts.AddRecord(p)
        else:
            others.AddRecord(p)
    return firsts, others
def Process(table):
    """Runs analysis on the given table.
    Args:
        table: table object
    """
    table.lengths = [p.prglength for p in table.records]
    table.n = len(table.lengths)
    table.mu = Mean(table.lengths)

def MakeTables(data_dir='.'):
    """Reads survey data and returns tables for first babies and others."""
    table = survey.Pregnancies()
    table.ReadRecords(data_dir)
    firsts, others = PartitionRecords(table)
    return table, firsts, others

def ProcessTables(*tables):
    """Processes a list of tables
    Args:
        tables: gathered argument tuple of Tables
    """
    for table in tables:
        Process(table)
def Summarize(data_dir):
    """Prints summary statistics for first babies and others.
    Args:
        data_dir: string directory name
    """
    table, firsts, others = MakeTables(data_dir)
    ProcessTables(firsts, others)
    # print() calls so the script runs under Python 3, matching the output above
    print('Number of first babies', firsts.n)
    print('Number of others', others.n)
    mu1, mu2 = firsts.mu, others.mu
    print('Mean gestation in weeks:')
    print('First babies', mu1)
    print('Others', mu2)
    print('Difference in days', (mu1 - mu2) * 7.0)

def main(name, data_dir='.'):
    Summarize(data_dir)

if __name__ == '__main__':
    import sys
    main(*sys.argv)
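
The partition-then-average logic of first.py can be tried without the NSFG files. The sketch below is a simplified stand-in rather than the script itself: `Preg` is a hypothetical namedtuple replacing survey.Pregnancy, the helpers work on plain lists instead of Table objects, and the five records are invented.

```python
from collections import namedtuple

# Hypothetical stand-in for survey.Pregnancy with only the fields used here.
Preg = namedtuple('Preg', ['outcome', 'birthord', 'prglength'])


def partition(records):
    """Splits live births into first babies and others."""
    firsts, others = [], []
    for p in records:
        # outcome 1 means live birth; everything else is skipped
        if p.outcome != 1:
            continue
        if p.birthord == 1:
            firsts.append(p)
        else:
            others.append(p)
    return firsts, others


def mean(values):
    """Computes the arithmetic mean of a non-empty sequence."""
    return float(sum(values)) / len(values)


# Invented records: four live births and one non-live outcome.
records = [
    Preg(1, 1, 40),
    Preg(1, 2, 39),
    Preg(4, 'NA', 9),
    Preg(1, 1, 38),
    Preg(1, 3, 38),
]

firsts, others = partition(records)
mu1 = mean([p.prglength for p in firsts])   # (40 + 38) / 2 = 39.0
mu2 = mean([p.prglength for p in others])   # (39 + 38) / 2 = 38.5
diff_days = (mu1 - mu2) * 7.0               # 0.5 weeks = 3.5 days
print(len(firsts), len(others), mu1, mu2, diff_days)
```

With so few records the difference means nothing statistically; the point is only to show which records land in which group and how the difference in days is derived.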
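On average, first babies are born about 13 hours later than other babies. This is an apparent effect, but the following questions still need to be considered: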