Descriptive statistics with Python
Universidad Surcolombiana — Facultad de Salud
This notebook is a minimalist text that aims to introduce new users and students to the descriptive statistics frequently required in curricular courses and exploratory research. Python is an object-oriented language, and it is easy to use and to write code that computes descriptive statistics. Python also lets us work with libraries that other people have designed to handle common tasks; in this text we will use only one of them, plus a built-in module.
pandas is an open source library designed for handling datasets.
Note: include in your script only the code that begins with the header #include. These pieces of code are consecutive, at least until stated otherwise.
#include
import os
import pandas as pd
The reserved word import lets us load this code (libraries). If pandas is not installed and you are on the Windows operating system, you will most likely need to open your Command Prompt and type: pip install pandas
os is a built-in module that lets us specify the working directory, namely the path on your computer from which your dataset or Excel files are loaded and to which they are saved. On the Windows operating system this path usually begins with C: or another drive letter, for example "C:/my_user/my_folder".
#include
os.chdir("C:/my_user/my_folder")
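To see what changing the working directory does, here is a minimal sketch. It uses a temporary folder created with tempfile as a stand-in for "C:/my_user/my_folder" (that substitution is an assumption made only so the example can run anywhere), and it shows that a bare file name such as "file.xlsx" is resolved against the current working directory.

```python
import os
import tempfile

# A temporary folder stands in for "C:/my_user/my_folder" (illustrative only).
original_dir = os.getcwd()
with tempfile.TemporaryDirectory() as folder:
    os.chdir(folder)
    # After chdir, relative file names are resolved against this folder.
    current_dir = os.getcwd()
    resolved = os.path.abspath("file.xlsx")
    print("Working directory:", current_dir)
    print("file.xlsx resolves to:", resolved)
    # Leave the folder before it is deleted.
    os.chdir(original_dir)
```

Calling os.getcwd() at any point is a quick way to check where pandas will look for your files.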
pandas provides an object class, the DataFrame, that is similar to an Excel spreadsheet and lets us arrange information in rows and columns, or observations and variables. See how a DataFrame looks in Python.
data=[[1,2,3,4,5],["A", "B", "A", "B", "A"]]
print("A data Frame with 2 Rows and 5 columns : \n ")
print("-"*16)
print(pd.DataFrame(data))
print("-"*16)

A data Frame with 2 Rows and 5 columns :

----------------
   0  1  2  3  4
0  1  2  3  4  5
1  A  B  A  B  A
----------------
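In the layout above each inner list became a row. When every variable should be a column instead, one option is to build the DataFrame from a dictionary. The sketch below assumes two hypothetical column names, "number" and "group", that are not part of the original text.

```python
import pandas as pd

# Each key becomes a column (variable); each position is a row (observation).
# The names "number" and "group" are hypothetical.
data = {"number": [1, 2, 3, 4, 5], "group": ["A", "B", "A", "B", "A"]}
frame = pd.DataFrame(data)

print(frame)
# shape is a (rows, columns) tuple.
print("Rows and columns:", frame.shape)
```

This orientation, one row per observation and one column per variable, is the one that the descriptive methods used later expect.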
The next objective is to use pd.read_excel() to load your dataset into a pandas DataFrame.
#include
df=pd.read_excel("file.xlsx")
Some points need to be stressed. First, in some cases you need to specify the sheet name: df=pd.read_excel("File.xls", sheet_name="sheet_name").
Second, and most important, each variable name in the Excel file must appear only in the first row, and there must be no merged cells. Finally, the name df, short for DataFrame, is used only by tradition (although it is a recommended practice).
To work with data it is important to distinguish between string and numerical variables, because each kind is described in a different way. Categorical variables such as educational level or profession are described with absolute and relative frequencies. Numerical variables are described with the mean and standard deviation when their distribution is approximately normal, or with the median and interquartile range when it is not.
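The two ways of describing a numerical variable can be compared on a small example. The sketch below uses made-up values with one extreme observation, so the data are an assumption for illustration only. It shows how the mean and standard deviation are pulled by the extreme value while the median and interquartile range are not.

```python
import pandas as pd

# Hypothetical values with one extreme observation (100).
values = pd.Series([1, 2, 3, 4, 100])

mean = values.mean()
sd = values.std()            # sample standard deviation (ddof=1)
median = values.median()
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1                # interquartile range

print("mean:", mean, "sd:", round(sd, 2))
print("median:", median, "IQR:", iqr)
```

Here the mean (22.0) is far from most of the observations, whereas the median (3.0) is not, which is why the median and interquartile range are preferred for skewed data.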
lista=[[1,"one"], [2,"two"], [3,"three"]]
print("Dataset:\n")
df=pd.DataFrame(lista, columns=["Numerical", "String"])
print(df)
print("\n")

Dataset:

   Numerical String
0          1    one
1          2    two
2          3  three
In pandas the type object is similar to string or categorical, int corresponds to integers, and float corresponds to floating point (decimal) numbers. To check the type of a variable in a DataFrame we use the attribute dtype (dtypes also works). Suppose that your dataset looks like the one above and is already loaded in Python's memory. To verify the type of the data we first need to select columns or rows. In general we can access a column like this: df["Variable_name"]
df["Numerical"]

0    1
1    2
2    3
Name: Numerical, dtype: int64

print("The type of the Numerical variable is ",df["Numerical"].dtypes)
print("The type of the String Variable is ",df["String"].dtypes)

The type of the Numerical variable is  int64
The type of the String Variable is  object
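The types of all columns can also be inspected in one step. The following sketch rebuilds the small dataset under a different name, df_types (a hypothetical name chosen so that it does not overwrite df), and uses select_dtypes to separate numeric columns from the rest.

```python
import pandas as pd

# Same small dataset as above, under the hypothetical name df_types.
df_types = pd.DataFrame(
    [[1, "one"], [2, "two"], [3, "three"]],
    columns=["Numerical", "String"],
)

# dtypes on the whole DataFrame lists the type of every column.
print(df_types.dtypes)

# select_dtypes splits the columns by type.
numeric_columns = list(df_types.select_dtypes(include="number").columns)
other_columns = list(df_types.select_dtypes(exclude="number").columns)
print("Numeric:", numeric_columns)
print("Non-numeric:", other_columns)
```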
Notice that it is important to get the appropriate descriptive measures according to the data type. For numerical variables we use the method describe() to get a set of measures such as the count, mean, standard deviation, min, max, the 50th percentile (median) and other measures of relative position. For categorical variables we use .value_counts() to get the number of times each value appears in a variable.
df["Numerical"].describe()

count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: Numerical, dtype: float64

df["String"].value_counts()

three    1
one      1
two      1
Name: String, dtype: int64
Notice that each value appears only once in the dataset. To get the relative frequency we need to divide these counts by the number of rows or observations recorded in the variable or dataset. The built-in function len() gives this number; you can also use shape or index. Keep in mind that len() counts every row, including rows with missing values, while value_counts() leaves missing values out by default.
print("The number it is", len(df))

The number it is 3

print(df["String"].value_counts()/len(df))

three    0.333333
one      0.333333
two      0.333333
Name: String, dtype: float64
The proportions add up to 1, that is 100%, as expected.
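How missing values affect relative frequencies can be checked directly. The sketch below uses a hypothetical variable, answers, with one missing value, and compares three options: value_counts(normalize=True), dividing by len(), and value_counts(normalize=True, dropna=False).

```python
import pandas as pd

# Hypothetical variable with one missing value (None).
answers = pd.Series(["A", "A", "B", None])

# normalize=True returns proportions; missing values are left out by default.
valid_share = answers.value_counts(normalize=True)

# Dividing by len() uses all rows, including the missing one.
total_share = answers.value_counts() / len(answers)

# dropna=False keeps the missing value as its own category.
with_missing = answers.value_counts(normalize=True, dropna=False)

print(valid_share)
print(total_share)
print(with_missing)
```

With missing data the second option no longer adds up to 1, so it is worth deciding explicitly which denominator the frequency table should use.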
There are other ways of getting these results automatically for all variables, but first we need to understand an important programming concept called a loop.
for k in range(5):
    print(k)

0
1
2
3
4
Intuitively, this loop repeats the operation of printing the current value of k, which goes from 0 to 4; range(5) does not include 5.
for x in range(4):
    print("This message will be printed in screen four times")

This message will be printed in screen four times
This message will be printed in screen four times
This message will be printed in screen four times
This message will be printed in screen four times
This leads us to an important concept: some Python objects are iterable, which means that the data type gives us access to each of its elements in turn. For instance, a list is a class of object that is iterable. Now we will see how this works.
elements=[1,2,3,4]
elements_st=["first","second","third","fourth"]

for a in elements:
    print("The number of position in arabic is", a)

The number of position in arabic is 1
The number of position in arabic is 2
The number of position in arabic is 3
The number of position in arabic is 4

for x in elements_st:
    print(x)

first
second
third
fourth

for a in elements:
    print("The number of position in arabic is", a , "While in words", elements_st[a-1])

The number of position in arabic is 1 While in words first
The number of position in arabic is 2 While in words second
The number of position in arabic is 3 While in words third
The number of position in arabic is 4 While in words fourth
Note that each element has been printed on screen, in the appropriate order.
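The last loop looks up the second list with elements_st[a-1], which only works because the numbers happen to match the positions. As an alternative sketch (not the author's approach), the built-in functions zip and enumerate walk through the lists without index arithmetic.

```python
elements = [1, 2, 3, 4]
elements_st = ["first", "second", "third", "fourth"]

# zip pairs the items of both lists position by position.
pairs = list(zip(elements, elements_st))
for number, word in pairs:
    print("The number of position in arabic is", number, "While in words", word)

# enumerate yields a counter next to each item; start=1 begins counting at 1.
for position, word in enumerate(elements_st, start=1):
    print(position, word)
```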
Iteration matters here because df.columns stores all the variable names in an iterable object. Now we can see how to apply a loop to get the name of each variable and its type.
for x in df.columns:
    print("The variable",x, "was stored in", df[x].dtypes)

The variable Numerical was stored in int64
The variable String was stored in object
More generally, we can make some statements run only when a condition is true. For instance, suppose we have an iterable object that contains the integers from 0 to 10 and we only need to print on screen the numbers greater than 6.
for x in range(11):
    if x>6:
        print(x)

7
8
9
10
Notice that only the numbers greater than six are printed on screen. The operator > is not the only one for evaluating expressions: we may need to check whether two values are equal using == or whether they are different using !=. The result of such an expression is a Boolean value that indicates whether the expression is True or False.
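A short sketch of these comparison operators follows. The variable names are hypothetical and the value 7 is chosen only for illustration.

```python
x = 7

# Each comparison evaluates to a Boolean value: True or False.
is_greater = x > 6
is_equal = x == 6
is_different = x != 6
is_at_most = x <= 7

print(is_greater, is_equal, is_different, is_at_most)
```

Because the results are ordinary values, they can be stored in variables, printed, or used directly as the condition of an if statement.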
set_n=[1,2,3,4,5,6]
In the list set_n we have a set of numbers, and we are interested in checking whether the value 4 is contained in this list. If the condition is true, the program can then execute a set of statements.
4 in set_n

True
Because the list contains the value 4, the result is True. This is important because, given a set of data types or features, we can run different code for each one.
if 4 in set_n:
    print("The number is in the list!")

The number is in the list!
Note in the above example that the statement print("The number is in the list!") will not be executed if the number is not in the list.
if 999 in set_n:
    print("The number is in the list!")
As expected, there is no output because the condition is not true. We can take advantage of this to get automatic descriptive statistics for all string and numeric variables in any dataset.
#include
for x in df.columns:
    if df[x].dtypes==object :
        print("Due",x ,"is a ", df[x].dtype)
        print("\n we can get its frequency table")
        print(df[x].value_counts() / len(df)*100)

Due String is a  object

 we can get its frequency table
three    33.333333
one      33.333333
two      33.333333
Name: String, dtype: float64
We can extend this to a more general dataset to see how it works without writing the variable names explicitly.
dataset=[["A",1,"circle",10], ["A",2,"circle",11],["C",2,"line",12]]
We have a list of three rows with four variables; now we convert it to a DataFrame.
df=pd.DataFrame(dataset, columns=["letter","number","shape","number2"])
print(df)

  letter  number   shape  number2
0      A       1  circle       10
1      A       2  circle       11
2      C       2    line       12

#include
for x in df.columns:
    if df[x].dtypes==object :
        print("Due",x ,"is a ", df[x].dtype)
        print("\n We can get its frequency table")
        print(df[x].value_counts() / len(df)*100)

Due letter is a  object

 We can get its frequency table
A    66.666667
C    33.333333
Name: letter, dtype: float64
Due shape is a  object

 We can get its frequency table
circle    66.666667
line      33.333333
Name: shape, dtype: float64
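The comparison with object relies on pandas storing text columns as object. Depending on the pandas version and settings, text may instead be stored with a dedicated string type, in which case that test would skip those columns. A possible variant, sketched below, treats every non-numeric column as categorical by using is_numeric_dtype from pandas.api.types. The function name frequency_tables and the DataFrame name example are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def frequency_tables(frame):
    """Return {column name: percentage table} for every non-numeric column."""
    tables = {}
    for name in frame.columns:
        if not is_numeric_dtype(frame[name]):
            tables[name] = frame[name].value_counts() / len(frame) * 100
    return tables


dataset = [["A", 1, "circle", 10], ["A", 2, "circle", 11], ["C", 2, "line", 12]]
example = pd.DataFrame(dataset, columns=["letter", "number", "shape", "number2"])

tables = frequency_tables(example)
for name, table in tables.items():
    print("Frequency table (%) for", name)
    print(table)
```

Returning the tables in a dictionary, rather than only printing them, also makes it possible to reuse or export them later.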
The method describe() can also be used directly on the object df.
print(df.describe())

         number  number2
count  3.000000      3.0
mean   1.666667     11.0
std    0.577350      1.0
min    1.000000     10.0
25%    1.500000     10.5
50%    2.000000     11.0
75%    2.000000     11.5
max    2.000000     12.0
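The table returned by describe() is itself a DataFrame, so individual measures can be picked out of it. As a sketch, the median and the interquartile range of each numeric variable can be obtained from its rows; the name example is hypothetical and holds the same data as above.

```python
import pandas as pd

dataset = [["A", 1, "circle", 10], ["A", 2, "circle", 11], ["C", 2, "line", 12]]
example = pd.DataFrame(dataset, columns=["letter", "number", "shape", "number2"])

summary = example.describe()

# Rows of the summary are selected by label with .loc.
medians = summary.loc["50%"]
iqr = summary.loc["75%"] - summary.loc["25%"]

print("Median of each numeric variable:")
print(medians)
print("Interquartile range of each numeric variable:")
print(iqr)
```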
The output of df.describe() can be saved in an object and exported to a spreadsheet using the DataFrame method to_excel() of pandas.
numeric_results=df.describe()
numeric_results.to_excel("numeric_results.xlsx")
This creates a file with the xlsx extension in the folder we defined with os.chdir(PATH..).
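Writing Excel files requires an additional engine such as openpyxl to be installed. If none is available, one fallback (an alternative, not part of the original text) is to export the same table as CSV, which any spreadsheet program can open. The sketch below writes to a temporary folder and reads the file back to confirm that nothing was lost; the file name numeric_results.csv is hypothetical.

```python
import os
import tempfile

import pandas as pd

example = pd.DataFrame({"number": [1, 2, 2], "number2": [10, 11, 12]})
numeric_results = example.describe()

# A temporary folder keeps the sketch from writing into your working directory.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "numeric_results.csv")
    numeric_results.to_csv(path)
    # index_col=0 restores the row labels (count, mean, std, ...).
    restored = pd.read_csv(path, index_col=0)

print(restored)
```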
At the end, your script should look similar to the following complete code, which can describe any dataset.
import os
import pandas as pd

os.chdir("C:/my_user/my_folder")
df=pd.read_excel("file.xlsx")

# To get the frequency tables.
for x in df.columns:
    if df[x].dtypes==object :
        print("Due",x ,"is a ", df[x].dtype)
        print("\n we can get its frequency table")
        print(df[x].value_counts() / len(df)*100)

# To get numeric measures.
print(df.describe())
Translated from: https://medium.com/@ivanandrestrujillo/minimal-guideline-to-get-descriptive-statistics-using-python-8ede26a7146d