Factors in R
In this tutorial, we’ll move on to understanding factors in R programming. One operation we perform frequently in data science is the estimation of a variable based upon the model we built. We are sometimes required to estimate the price of a share or a house, and sometimes we need to estimate what color car is likely to be sold the fastest.
Variables in data science fall under two categories – continuous and categorical. Continuous variables are those that can take numerical values including floating points. Prices of houses or shares, quantifiable variables like age, weight or height of a person are all continuous variables.
On the other hand, categorical variables take one of a fixed set of values, each of which can be represented by a label. Examples include marital status, gender, the color of a vehicle, a person’s highest educational degree, and so on.
Categorical variables are represented using factors in R.
Factors can be created using the factor() function.
factor(x=vector, levels, labels, ordered=TRUE/FALSE)
The first argument to the factor() function is the vector x of values that you wish to turn into a factor. Note that a factor is always one-dimensional: if you pass a matrix, factor() discards its dimensions and treats the values as a plain vector. x is typically a vector of character strings or integer values.
The second argument, levels, is a vector of the unique values that the factor may take. This argument is optional; if you omit it, R uses the sorted unique values of x.
The third argument is labels. Sometimes a variable is encoded as a vector of integers, and you need to specify which integer represents which label. For example, you could use 0 and 1 to represent male and female, and supply the labels “Male” and “Female” so that each level is displayed under a readable name. The labels are matched to the levels by position, so they act as the key for reading the factor.
Finally, there is a Boolean argument, ordered. Sometimes you may wish to retain the order among the levels. For example, you may encode the month of joining using the integers 1 to 12 to represent the months from January to December. In such cases, you need to set ordered to TRUE.
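The arguments described above can be combined in a single call. The following is a minimal sketch rather than part of the original tutorial; the sizes vector and the label names are hypothetical and chosen only for illustration.

```r
# Hypothetical data: T-shirt sizes recorded as short codes
sizes <- c("M", "S", "L", "M")

# levels fixes the allowed values and their order,
# labels renames each level by position,
# ordered = TRUE makes comparisons between elements meaningful
sizefact <- factor(sizes,
                   levels  = c("S", "M", "L"),
                   labels  = c("Small", "Medium", "Large"),
                   ordered = TRUE)

print(sizefact)
bigger <- sizefact[1] > sizefact[2]   # Medium > Small, so TRUE
print(bigger)
```

Because the labels are matched to the levels by position, "S" becomes Small, "M" becomes Medium, and "L" becomes Large.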
Let us look at examples of factors now.
#Encode the genders of people into a vector first
#These might be extracted from a dataset usually.
> genvector <- c("Male","Female","Female","Male","Male","Female")
#Create a factor from this vector
> genfact <- factor(genvector)
> genfact
[1] Male Female Female Male Male Female
Levels: Female Male
Notice how the levels are obtained automatically from the vector’s unique values and sorted alphabetically, which is why Female is listed before Male. Let us try another example where we encode male and female as 0 and 1 and attach labels.
#Define a vector with 0 for Male and 1 for Female.
> genvector2 <- c(0,1,1,0,0,1)
#Assign labels Male and Female to 0 and 1 when creating a Factor.
> genfact2 <-factor(genvector2,levels=c("0","1"),labels=c("Male","Female"))
> genfact2
[1] Male Female Female Male Male Female
Levels: Male Female
Observe that the labels you defined are displayed instead of the 0 and 1 values stored in the original vector.
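It can help to look at what a factor stores internally. The sketch below rebuilds the same factor and inspects it with the base R functions as.integer(), as.character() and levels(); it is an illustration added here, not part of the original example. Note that the internal codes are positions in the levels vector, so they start at 1 and are not the original 0 and 1 values.

```r
genvector2 <- c(0, 1, 1, 0, 0, 1)
genfact2 <- factor(genvector2, levels = c(0, 1), labels = c("Male", "Female"))

codes <- as.integer(genfact2)    # positions in levels(genfact2): 1 2 2 1 1 2
labs  <- as.character(genfact2)  # the labels as plain strings

print(codes)
print(labs)
print(levels(genfact2))          # "Male" "Female"
```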
Let us work through another example that uses the ordering of factor levels. Let us first define a vector representing the month of joining for 8 employees.
> moj <- c("Jan","Jun","May","Jan","Apr","Dec","Nov","Sep")
Now, R has no way of knowing that May comes before Jun in the order of months: character strings are compared alphabetically, and "Jun" sorts before "May". So the following comparison, which asks whether Jun comes after May, returns FALSE.
> moj[2]>moj[3]
[1] FALSE
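A quick way to see what R is doing here is to sort the plain character vector. This small sketch, added for illustration, shows that strings are ordered alphabetically rather than by calendar position.

```r
moj <- c("Jan", "Jun", "May", "Jan", "Apr", "Dec", "Nov", "Sep")

# Sorting a character vector is alphabetical
sortedchar <- sort(moj)
print(sortedchar)   # "Apr" "Dec" "Jan" "Jan" "Jun" "May" "Nov" "Sep"

# So April is treated as "less than" January
aprfirst <- "Apr" < "Jan"
print(aprfirst)     # TRUE
```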
To impose ordering, we need to define a vector with all the months in order first.
> ordermonths <-c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")
Now create a factor for our data using our moj vector, set the levels to ordermonths and set the argument ordered to TRUE.
> factormoj <- factor(x=moj, levels=ordermonths, ordered=TRUE)
Now factormoj displays as follows.
> factormoj
[1] Jan Jun May Jan Apr Dec Nov Sep
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < Oct < ... < Dec
R now knows the ordering among the months. Let us check whether it knows that May comes before June.
> factormoj[2]>factormoj[3]
[1] TRUE
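An ordered factor also sorts by its levels rather than alphabetically. The following sketch is an illustration only; it uses R’s built-in month.abb constant (the abbreviated month names in calendar order) in place of the hand-written ordermonths vector, which is a substitution made here and not something the original example does.

```r
moj <- c("Jan", "Jun", "May", "Jan", "Apr", "Dec", "Nov", "Sep")

# month.abb is a base R constant: "Jan" "Feb" ... "Dec"
factormoj <- factor(moj, levels = month.abb, ordered = TRUE)

sortedmoj <- sort(factormoj)   # calendar order, not alphabetical
earliest  <- min(factormoj)    # min() and max() work on ordered factors
latest    <- max(factormoj)

print(as.character(sortedmoj)) # "Jan" "Jan" "Apr" "May" "Jun" "Sep" "Nov" "Dec"
print(as.character(earliest))  # "Jan"
print(as.character(latest))    # "Dec"
```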
Each element of a factor can be assigned a value individually using indexing, just as we index vectors. Let us modify a value in genfact, the factor we created earlier in the tutorial.
> genfact
[1] Male Female Female Male Male Female
Levels: Female Male
> genfact[1]
[1] Male
Levels: Female Male
> genfact[1]<-"Female"
> genfact
[1] Female Female Female Male Male Female
Levels: Female Male
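One caveat when assigning by index is that the new value must be one of the factor’s existing levels. The sketch below works on a hypothetical copy named g and shows that assigning an undefined level stores NA instead. R normally also prints an "invalid factor level, NA generated" warning, which is silenced here with suppressWarnings() so the example runs quietly.

```r
g <- factor(c("Male", "Female", "Female"))

# "Other" is not in levels(g), so the element becomes NA
suppressWarnings(g[1] <- "Other")

print(g)
print(is.na(g[1]))   # TRUE
print(levels(g))     # "Female" "Male" (unchanged)
```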
To add a new level that hasn’t been defined earlier, you just need to extend the levels vector in the following manner. Let’s try this on our existing genfact variable.
> levels(genfact) <- c(levels(genfact),"Other")
> genfact
[1] Female Female Female Male Male Female
Levels: Female Male Other
You can now assign the newly defined level “Other” to elements of the factor as well.
> genfact[3] <- "Other"
> genfact
[1] Female Female Other Male Male Female
Levels: Female Male Other
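To check the result of these modifications, you can count how many elements fall under each level. The sketch below replays the steps from this tutorial and then uses the base R function table(); the counting step is an addition for illustration, not part of the original walkthrough.

```r
genfact <- factor(c("Male", "Female", "Female", "Male", "Male", "Female"))
genfact[1] <- "Female"                              # modify an element
levels(genfact) <- c(levels(genfact), "Other")      # add a new level
genfact[3] <- "Other"                               # use the new level

counts <- table(genfact)
print(counts)   # Female 3, Male 2, Other 1
```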
Translated from: https://www.journaldev.com/35599/factors-in-r