数据集准备及数据预处理
In my previous four articles, I worked on a single variable of a dataset. I have shown example code in T-SQL, R, and Python languages. I always used the same dataset. Therefore, you might have gotten the impression that in R and in Python, you can operate on a dataset the same way like you operate on an SQL Server table. However, there is a big difference between an SQL Server table and Python or R data frame.
在前四篇文章中,我研究了数据集的单个变量。 我已经用T-SQL,R和Python语言显示了示例代码。 我总是使用相同的数据集。 因此,您可能会产生一种印象,即在R和Python中,您可以像对SQL Server表一样操作数据集。 但是,SQL Server表和Python或R数据帧之间有很大的区别。
In this article, will do a bit more formal introduction to R and Python data frames. I will show how to make basic operations on data frames, like filter them, make a projection, join and bind them, and sort them. For the sake of completeness, I am starting with T-SQL. Of course, the first part is really just a very brief recapitulation of the basic SELECT statement.
在本文中,将对R和Python数据框架做一些更正式的介绍。 我将展示如何对数据框进行基本操作,例如过滤它们,进行投影,联接和绑定它们以及对其进行排序。 为了完整起见,我从T-SQL开始。 当然,第一部分实际上只是对基本SELECT语句的简要概述。
The simplest query to retrieve the data you can write includes the SELECT and the FROM clauses. In the select clause, you can use the star character, literally SELECT *, to denote that you need all columns from a table in the result set.
检索可写数据的最简单查询包括SELECT和FROM子句。 在select子句中,可以使用星号字符SELECT *表示您需要结果集中表中的所有列。
Better than using SELECT * is to explicitly list only the columns you need. This means you are returning only a projection on the table. A projection means you filter the columns. Of course, you can filter also the rows with the WHERE clause.
比使用SELECT *更好的是仅显式列出所需的列。 这意味着您只返回桌子上的投影。 投影意味着您可以过滤列。 当然,您也可以使用WHERE子句过滤行。
In a relational database, you typically have data spread in multiple tables. Each table represents a set of entities of the same kind, like customers, or products, or orders. In order to get result sets meaningful for the business your database supports, you most of the time need to retrieve data from multiple tables in the same query. You need to join two or more tables based on some conditions. The most frequent kind of a join is the inner join. Rows returned are those for which the condition in the join predicate for the two tables joined evaluates to true. Note that in a relational database, you have three-valued logic, because there is always a possibility that a piece of data is unknown. You mark the unknown with the NULL keyword. A predicate can thus evaluate to true, false or NULL. For an inner join, the order of the tables involved in the join is not important.
在关系数据库中,通常您的数据分布在多个表中。 每个表代表一组相同种类的实体,例如客户,产品或订单。 为了获得对数据库支持的业务有意义的结果集,大多数时候您需要从同一查询的多个表中检索数据。 您需要根据某些条件联接两个或多个表。 最常见的联接类型是内部联接。 返回的行是连接的两个表的连接谓词中的条件评估为true的行。 请注意,在关系数据库中,您具有三值逻辑,因为总有一种数据未知的可能性。 您使用NULL关键字标记未知。 因此,谓词可以评估为true , false或NULL 。 对于内部联接,联接中涉及的表的顺序并不重要。
In the query where you join multiple tables, you should use table aliases. If a column’s name is unique across all tables in the query, then you can use it without table name. You can shorten the two-part column names by using table aliases. You specify table aliases in the FROM clause. Once you specify table aliases, you must always use the aliases; you can’t refer to the original table names in that query anymore. Please note that a column name might be unique in the query at the moment when you write the query. However, later somebody could add a column with the same name in another table involved in the query. If the column name is not preceded by an alias or by the table name, you would get an error when executing the query because of the ambiguous column name. In order to make the code more stable and more readable, you should always use table aliases for each column in the query. You can specify column aliases as well.
在连接多个表的查询中,应使用表别名。 如果列名在查询中的所有表中都是唯一的,则可以不使用表名就使用它。 您可以使用表别名来缩短两部分的列名。 您可以在FROM子句中指定表别名。 一旦指定了表别名,就必须始终使用别名。 您将无法再在该查询中引用原始表名。 请注意,在编写查询时,列名在查询中可能是唯一的。 但是,稍后有人可以在查询所涉及的另一个表中添加具有相同名称的列。 如果列名前面没有别名或表名,则由于列名模棱两可,执行查询时会出现错误。 为了使代码更稳定和可读性更好,应始终对查询中的每一列使用表别名。 您也可以指定列别名。
A table in SQL Server represents a set. In a set, the order of the elements is not defined. Therefore, you cannot refer to a row or a column in a table by position. Also, the result of the SELECT statement does not guarantee any specific ordering. If you want to return the data in a specific order, you need to use the ORDER BY clause. The following query shows these basic SELECT elements.
SQL Server中的表代表一个集合。 在一组中,未定义元素的顺序。 因此,您不能按位置引用表中的行或列。 同样,SELECT语句的结果也不保证任何特定的顺序。 如果要按特定顺序返回数据,则需要使用ORDER BY子句。 以下查询显示了这些基本的SELECT元素。
USE AdventureWorksDW2016;
GO
-- Basic SELECT statement elements
SELECT c.CustomerKey, c.FirstName, c.LastName,
g.City, g.StateProvinceName AS State
FROM dbo.DimCustomer AS c
INNER JOIN dbo.DimGeography AS g
ON c.GeographyKey = g.GeographyKey
WHERE g.EnglishCountryRegionName = N'Australia'
ORDER BY c.LastName DESC, c.FirstName DESC;
The following figure shows the partial results, limited to the first six rows only. Please note the order of the rows and other elements of the basic SELECT statement implemented.
下图显示了部分结果,仅限于前六行。 请注意所执行的基本SELECT语句的行顺序和其他元素。
In R, you operate on matrices, not on tables. A matrix is a two-dimensional array of values of the same type. The elements are ordered, and you can refer to them by position.
在R中,您对矩阵进行操作,而不对表格进行操作。 矩阵是相同类型的值的二维数组。 元素是有序的,您可以按位置引用它们。
The most important data structure in R is a data frame. Most of the time, you analyze data stored in a data frame. Data frames are matrices where each variable can be of a different type. Remember, a variable is stored in a column, and all values of a single variable must be of the same type. Data frames are very similar to SQL Server tables. However, they are still matrices, meaning that you can refer to the elements by position and that they are ordered.
R中最重要的数据结构是数据帧。 大多数时候,您会分析存储在数据框中的数据。 数据帧是每个变量可以具有不同类型的矩阵。 请记住,变量存储在列中,并且单个变量的所有值都必须是同一类型。 数据框与SQL Server表非常相似。 但是,它们仍然是矩阵,这意味着您可以按位置引用元素并且它们是有序的。
Most of the times, you get a data frame from your data source, for example from a SQL Server database. You can also enter the data manually, or read it from many other sources, including text files, Excel, and many more. The following code reads the data from SQL Server AdventureWorksDW2016 demo database, the dbo.vTargetMail view.
大多数情况下,您是从数据源(例如从SQL Server数据库)获得数据帧的。 您还可以手动输入数据,或从许多其他来源读取数据,包括文本文件,Excel等。 以下代码从SQL Server AdventureWorksDW2016演示数据库dbo.vTargetMail视图中读取数据。
library(RODBC)
# Connecting and reading the data
con <- odbcConnect("AWDW", uid = "RUser", pwd = "Pa$$w0rd")
TM <- as.data.frame(sqlQuery(con,
"SELECT CustomerKey, MaritalStatus, Gender, Age
FROM dbo.vTargetMail;"),
stringsAsFactors = TRUE)
close(con)
The following code shows how you can access the data in a data frame by the position or mixed by the position and column names. Please note that the index of the first row or column is 1, and that the boundaries are included – you get three rows and two columns.
以下代码显示如何按位置访问数据帧中的数据,或按位置和列名混合访问数据帧中的数据。 请注意,第一行或第一列的索引为1,并且包括边界-您将获得三行两列。
TM[1:3, 1:2];
TM[1:3, c("MaritalStatus", "Gender")];
The result of the previous two rows is the same:
前两行的结果相同:
CustomerKey | MaritalStatus | |
1 | 11000 | M |
2 | 11001 | S |
3 | 11002 | M |
客户密钥 | 婚姻状况 | |
1个 | 11000 | 中号 |
2 | 11001 | 小号 |
3 | 11002 | 中号 |
You can refer to columns similarly like you refer to them in T-SQL, where you use table.column dotation; in R, the dollar sign replaces the dot, and therefore you refer to columns with dataframe$column. The following code shows how to refer to the columns. It does the crosstabulation of the MaritalStatus and Gender columns and shows this in a bar chart.
您可以像在T-SQL中使用table.column点号一样引用它们。 在R中,美元符号替换了点,因此您使用dataframe $ column引用列。 以下代码显示了如何引用这些列。 它对MaritalStatus和Gender列进行交叉制表,并在条形图中显示。
# $ Notation
mg <- table(TM$MaritalStatus, TM$Gender);
mg
barplot(mg,
main = 'Marital status and gender',
xlab = 'MaritalStatus', ylab = 'Gender',
col = c("blue", "yellow"),
beside = TRUE,
legend = TRUE)
Here is the bar chart.
这是条形图。
For a projection, you simply select appropriate columns, like the following code shows. It creates two new data frames with subset of the columns only.
对于投影,只需选择适当的列,如以下代码所示。 它仅使用列的子集创建两个新的数据框。
# Projections
cols1 <- c("CustomerKey", "MaritalStatus")
TM1 <- TM[cols1]
cols2 <- c("CustomerKey", "Gender")
TM2 <- TM[cols2]
TM1[1:3, 1:2]
TM2[1:3, 1:2]
Here are the first three rows of the two new data frames.
这是两个新数据帧的前三行。
CustomerKey | MaritalStatus | |
1 | 11000 | M |
2 | 11001 | S |
3 | 11002 | M |
CustomerKey | Gender | |
1 | 11000 | M |
2 | 11001 | M |
3 | 11002 | M |
客户密钥 | 婚姻状况 | |
1个 | 11000 | 中号 |
2 | 11001 | 小号 |
3 | 11002 | 中号 |
客户密钥 | 性别 | |
1个 | 11000 | 中号 |
2 | 11001 | 中号 |
3 | 11002 | 中号 |
You can merge two data frames using some column values from a column that appears in both of them. This is very similar to T-SQL join. The following code performs the merge of the two previously created data frames.
您可以使用同时出现在两个列中的某些列值来合并两个数据框。 这与T-SQL连接非常相似。 以下代码执行两个先前创建的数据帧的合并。
# Merge datasets
TM3 <- merge(TM1, TM2, by = "CustomerKey")
TM3[1:3, 1:3]
The results are:
结果是:
CustomerKey | MaritalStatus | Gender | |
1 | 11000 | M | M |
2 | 11001 | S | M |
3 | 11002 | M | M |
客户密钥 | 婚姻状况 | 性别 | |
1个 | 11000 | 中号 | 中号 |
2 | 11001 | 小号 | 中号 |
3 | 11002 | 中号 | 中号 |
However, data frames are ordered. You can rely on the order, unless you reorder a data frame, of course. Since I created the two projection data frames TM1 and TM2 from the original TM data frame without reordering them, they maintain the original order. Instead of joining them, I can simply bind the columns, row by row, by the cbind() function, like the following code shows.
但是,数据帧是有序的。 当然,您可以依赖该顺序,除非您对一个数据框重新排序。 由于我从原始TM数据帧中创建了两个投影数据帧TM1和TM2,而没有对其进行重新排序,因此它们保持了原始顺序。 不用加入它们,我可以简单地通过cbind()函数逐行绑定列,如以下代码所示。
# Binding datasets
TM4 <- cbind(TM1, TM2)
TM4[1:3, 1:4]
Note the results:
注意结果:
CustomerKey | MaritalStatus | CustomerKey.1 | Gender | |
1 | 11000 | M | 11000 | M |
2 | 11001 | S | 11001 | M |
3 | 11002 | M | 11002 | M |
客户密钥 | 婚姻状况 | CustomerKey.1 | 性别 | |
1个 | 11000 | 中号 | 11000 | 中号 |
2 | 11001 | 小号 | 11001 | 中号 |
3 | 11002 | 中号 | 11002 | 中号 |
The CustomerKey column is listed twice, and the second time it is automatically renamed.
CustomerKey列被列出两次,并且第二次被自动重命名。
You can filter a data frame by using a predicate when you refer to the index of rows or columns. The following code creates a two new data frames with the same columns, CustomerKey and MaritalStatus. The first data frame is filtered to include only two customers with the lowest CustomerKey, and the second includes only two customers with the highest CustomerKey values. Both data frames have the same columns. You can bind such data frames by rows with the rbind() function. This is a similar process like the UNION ALL clause does in T-SQL.
当您引用行或列的索引时,可以使用谓词来过滤数据框。 以下代码使用相同的列CustomerKey和MaritalStatus创建两个新的数据框。 过滤第一个数据帧以仅包括两个具有最低CustomerKey的客户,第二个数据帧仅包括两个具有最高CustomerKey值的客户。 两个数据帧具有相同的列。 您可以使用rbind()函数按行绑定此类数据帧。 这类似于UNION ALL子句在T-SQL中所做的类似过程。
# Filtering and row binding data
TM1 <- TM[TM$CustomerKey < 11002, cols1]
TM2 <- TM[TM$CustomerKey > 29481, cols1]
TM5 <- rbind(TM1, TM2)
TM5
Here is the content of complete TM5 data fame.
这是完整的TM5数据声誉的内容。
CustomerKey | MaritalStatus | |
1 | 11000 | M |
2 | 11001 | S |
18483 | 29482 | M |
18484 | 29483 | M |
客户密钥 | 婚姻状况 | |
1个 | 11000 | 中号 |
2 | 11001 | 小号 |
18483 | 29482 | 中号 |
18484 | 29483 | 中号 |
Finally, you can reorder a data frame by using the order() function in the index reference. The following code creates a reordered data frame by sorting the rows over the Age column. Note the minus sign in the order() function – it means sort descending.
最后,您可以使用索引引用中的order()函数对数据框进行重新排序。 以下代码通过对“年龄”列上的行进行排序来创建重新排序的数据框。 注意order()函数中的减号–这意味着降序排列。
# Sort
TMSortedByAge <- TM[order(-TM$Age), c("CustomerKey", "Age")]
TMSortedByAge[1:5, 1:2]
So here are the reordered first five rows.
因此,这是重新排序的前五行。
CustomerKey | Age | |
1726 | 12725 | 99 |
5456 | 16455 | 98 |
3842 | 14841 | 97 |
3993 | 14992 | 97 |
7035 | 18034 | 97 |
客户密钥 | 年龄 | |
1726 | 12725 | 99 |
5456 | 16455 | 98 |
3842 | 14841 | 97 |
3993 | 14992 | 97 |
7035 | 18034 | 97 |
In Python, there is also the data frame object, like in R. However, it is not part of the basic engine like in R. It is defined in the pandas library. You can communicate with SQL Server through the pandas data frames. But before getting there, you need first to refer to the numpy library, which brings efficient work with matrices to Python. The following code does the necessary imports and reads the data from SQL Server.
在Python中,也像R中一样有数据框对象。但是,它不像R中那样是基本引擎的一部分。它在pandas库中定义。 您可以通过pandas数据框与SQL Server通信。 但是在到达那里之前,您首先需要引用numpy库,该库为Python提供了矩阵处理的高效方法。 下面的代码执行必要的导入,并从SQL Server读取数据。
import numpy as np
import pandas as pd
import pyodbc
import matplotlib.pyplot as plt
# Connecting and reading the data
con = pyodbc.connect('DSN=AWDW;UID=RUser;PWD=Pa$$w0rd')
query = """SELECT CustomerKey, MaritalStatus, Gender, Age
FROM dbo.vTargetMail;"""
TM = pd.read_sql(query, con)
The pandas data frame has many built methods for which you need separate functions in R. For example, you can use the pandas crosstab() function to cross tabulate the data like the R table() function. However, you don’t need a separate function to plot the data; you ca use the pandas data frame plot() function, like the following code shows.
pandas数据框具有许多内置方法,您需要在R中使用单独的函数。例如,您可以使用pandas crosstab()函数像R table()函数那样对数据进行交叉制表。 但是,您不需要单独的函数即可绘制数据。 您可以使用pandas数据框plot()函数,如以下代码所示。
# A graph of marital status and gender
pd.crosstab(TM.MaritalStatus + TM.Gender,
columns = 'Count',
rownames = 'X').plot(kind = 'barh',
legend = False,
title = 'Marital status and gender',
fontsize = 12)
plt.show()
Not that the result of the pandas crosstab() function is a new data frame, and I am calling the plot() method of this data frame in order to produce this time a horizontal bar chart, like you can see in the following figure.
并不是说pandas crosstab()函数的结果是一个新的数据框,而是调用此数据框的plot()方法是为了这次生成水平条形图,如下图所示。
You make a projection of a data frame by selecting a subset of columns listed by their names in an array, like the following code shows.
您可以通过选择数组中按其名称列出的列的子集来对数据框进行投影,如以下代码所示。
# Projections
TM1 = TM[["CustomerKey", "MaritalStatus"]]
TM2 = TM[["CustomerKey", "Gender"]]
I created two new data frames in the previous code. Because a data frame is a matrix in Python as well, I can refer to the elements by their positional index with the iloc(), or index locate method:
我在前面的代码中创建了两个新的数据框。 因为数据框也是Python中的矩阵,所以我可以使用iloc()或索引定位方法通过元素的位置索引来引用元素:
# Positional access
TM1.iloc[0:3, 0:2]
TM2.iloc[0:3, 0:2]
Note that the index is zero-based. But there is another interesting difference from R. Please observe the results of the previous code.
请注意,索引是从零开始的。 但是与R还有另一个有趣的区别。请观察前面代码的结果。
CustomerKey | MaritalStatus | |
0 | 11000 | M |
1 | 11001 | S |
2 | 11002 | M |
CustomerKey | Gender | |
0 | 11000 | M |
1 | 11001 | M |
2 | 11002 | M |
客户密钥 | 婚姻状况 | |
0 | 11000 | 中号 |
1个 | 11001 | 小号 |
2 | 11002 | 中号 |
客户密钥 | 性别 | |
0 | 11000 | 中号 |
1个 | 11001 | 中号 |
2 | 11002 | 中号 |
Not that when you refer to the elements of a data frame by the index position, the upper boundary is not included. For example, the fourth row (index value 3) is not included in any of the results.
并非当您通过索引位置引用数据框的元素时,不包括上限。 例如,第四行(索引值3)不包含在任何结果中。
With the loc() method you can locate the elements based on a predicate for the rows and columns. For example, the following code selects Only rows where Age is greater than 97 and lists the three columns included in the results explicitly in an array.
使用loc()方法,您可以基于行和列的谓词来定位元素。 例如,以下代码仅选择Age大于97的行,并在数组中显式列出结果中包括的三列。
# Filter and projection
TM.loc[TM.Age > 97, ["CustomerKey", "Age", "Gender"]]
CustomerKey | Age | Gender | |
1725 | 12725 | 99 | F |
5455 | 16455 | 98 | F |
客户密钥 | 年龄 | 性别 | |
1725 | 12725 | 99 | F |
5455 | 16455 | 98 | F |
In order to join two data frames, you can use the merge() pandas function, and specify the column you want to use for the join. This is similar to R merge() function. You can see this process in the following code.
为了联接两个数据框,可以使用merge()pandas函数,并指定要用于联接的列。 这类似于R merge()函数。 您可以在以下代码中看到此过程。
# Joining data frames
TM3 = pd.merge(TM1, TM2, on = "CustomerKey")
TM3.iloc[0:3, 0:3]
The first three rows of the merged data frame are shown below.
合并数据帧的前三行如下所示。
CustomerKey | MaritalStatus | Gender | |
0 | 11000 | M | M |
1 | 11001 | S | M |
2 | 11002 | M | M |
客户密钥 | 婚姻状况 | 性别 | |
0 | 11000 | 中号 | 中号 |
1个 | 11001 | 小号 | 中号 |
2 | 11002 | 中号 | 中号 |
Finally, you can reorder a data frame with help of the sort() method, like shown in the following code.
最后,您可以借助sort()方法对数据框进行重新排序,如以下代码所示。
# Sort
TMSortedByAge = TM.sort(["Age"], ascending = False)
TMSortedByAge.iloc[0:5, 0:4]
And here is the last result in this article.
这是本文的最后结果。
CustomerKey | MaritalStatus | Gender | Age | |
1725 | 12725 | M | F | 99 |
5455 | 16455 | M | F | 98 |
3841 | 14841 | M | M | 97 |
7034 | 18034 | M | M | 97 |
3992 | 14992 | M | M | 97 |
客户密钥 | 婚姻状况 | 性别 | 年龄 | |
1725 | 12725 | 中号 | F | 99 |
5455 | 16455 | 中号 | F | 98 |
3841 | 14841 | 中号 | 中号 | 97 |
7034 | 18034 | 中号 | 中号 | 97 |
3992 | 14992 | 中号 | 中号 | 97 |
In this article, you learned how to do the basic operations on a whole dataset. For the next article, I plan to show how you can do more advanced operations, like grouping and aggregating data in t-SQL, R, and Python.
在本文中,您学习了如何对整个数据集执行基本操作。 在下一篇文章中,我计划展示如何执行更多高级操作,例如在t-SQL,R和Python中对数据进行分组和聚合。
Introduction to data science, data understanding and preparation |
Data science in SQL Server: Data understanding and transformation – ordinal variables and dummies |
Data science in SQL Server: Data analysis and transformation – binning a continuous variable |
Data science in SQL Server: Data analysis and transformation – Information entropy of a discrete variable |
Data understanding and preparation – basic work with datasets |
Data science in SQL Server: Data analysis and transformation – grouping and aggregating data I |
Data science in SQL Server: Data analysis and transformation – grouping and aggregating data II |
Interview questions and answers about data science, data understanding and preparation |
数据科学导论,数据理解和准备 |
SQL Server中的数据科学:数据理解和转换–序数变量和虚拟变量 |
SQL Server中的数据科学:数据分析和转换–合并连续变量 |
SQL Server中的数据科学:数据分析和转换–离散变量的信息熵 |
数据理解和准备–数据集的基础工作 |
SQL Server中的数据科学:数据分析和转换–分组和聚合数据I |
SQL Server中的数据科学:数据分析和转换–分组和聚合数据II |
面试有关数据科学,数据理解和准备的问答 |
翻译自: https://www.sqlshack.com/data-understanding-and-preparation-basic-work-with-datasets/
数据集准备及数据预处理