SQL vs. Pandas: Which One to Choose in 2020? Part 2

SQL and Pandas aren’t new technologies. Still, finding the corresponding functions in each is not always easy. That’s where this article and the previous one come in: they provide a detailed comparison between the two.

A couple of days ago, I covered the first part of this two-part series, which dealt with simpler comparisons between the two technologies.

Reading that article first is not a prerequisite, but will definitely help you to get a better understanding of the two. The technologies aren’t designed for the same job, but it’s nice to see corresponding functions between the two. As promised, today we’ll cover more advanced topics:

  • Joins
  • Unions
  • Groupings

Before we do so, let’s start with something simple: the DELETE statement.

Delete

The DELETE statement is used in SQL to remove rows from a table. The syntax for deleting rows in SQL is as follows:

DELETE FROM table_name
WHERE condition;

Deleting a row works slightly differently in Pandas. Rather than deleting rows in place, we typically select the rows we want to keep and discard the rest. Don’t worry if that sounds like a riddle; the example below illustrates it.

Let’s say we want to delete all the records from the Asian region.

SQL

DELETE FROM fert_data
WHERE region = 'Asia';

The rows have been successfully deleted. Now let’s see how to perform this task in Pandas.

Pandas

df = df.loc[df['region'] != 'Asia']

Here, we have selected all the rows where the region is not ‘Asia’ and then assigned the result back to our data frame. In other words, we have excluded all the rows where the region was ‘Asia’.
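The same idea can be tried on a small, self-contained data frame. This is only a sketch: the rows below are made-up stand-ins for fert_data, and the drop() variant is an alternative not used in the article, included to show that both routes leave the same rows behind.

```python
import pandas as pd

# Hypothetical stand-in for fert_data; the rows are invented for illustration.
df = pd.DataFrame({
    "country": ["Bangladesh", "Thailand", "Kenya", "Mali"],
    "region": ["Asia", "Asia", "Africa", "Africa"],
})

# Option 1: keep only the rows that do NOT match the condition.
kept = df.loc[df["region"] != "Asia"]

# Option 2: drop the matching rows by their index labels.
dropped = df.drop(df[df["region"] == "Asia"].index)

# Both approaches leave the same rows behind.
print(kept.equals(dropped))
print(kept["country"].tolist())
```

Note that neither option changes df itself; as in the article, the result has to be assigned back (df = ...) for the "delete" to stick.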

Joins

A JOIN is used in SQL to combine two or more tables based on a specific condition. There are four main types of join in SQL: LEFT, RIGHT, INNER, and FULL. Here is the syntax for JOIN:

SELECT *
FROM table_name_1 as t1 JOIN table_name_2 as t2
ON t1.column_name_1 = t2.column_name_2;

In Pandas, we can join two data frames using merge(); joining more than two means chaining merge() calls. By default, merge() performs an inner join, but you can use the how argument to perform other joins. The basic syntax for pd.merge() is as follows:

pd.merge(left_df, right_df, how='inner', on='column_name')

Here is an example to illustrate joins.

SQL

Given below is a table called country_sub_region. We have to join this table with fert_data using an inner join.

SELECT country, sub_region
FROM country_sub_region;

SELECT *
FROM fert_data as f INNER JOIN country_sub_region as c
ON f.country = c.country;

The tables have been successfully joined. Let’s see how to join them in Pandas.

Pandas

Here we have created a data frame similar to the country_sub_region table:

import pandas as pd

# First row holds the column names, the remaining rows hold the data
country_sub_reg = [
    ['country', 'subregion'],
    ['Kenya', 'East Africa'],
    ['Liberia', 'West Africa'],
    ['Mali', 'West Africa']
]
df_sr = pd.DataFrame(country_sub_reg[1:], columns=country_sub_reg[0])

We will merge df_sr with df on the country field using an inner join:

pd.merge(df, df_sr, on='country', how='inner')
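To see what the how argument changes, the merge can be repeated for each join type on toy data. The sketch below assumes a tiny, made-up df in place of the real fert_data; only the df_sr rows come from the article.

```python
import pandas as pd

# Hypothetical sample of fert_data; the real table is not reproduced here.
df = pd.DataFrame({
    "country": ["Kenya", "Mali", "Thailand"],
    "region": ["Africa", "Africa", "Asia"],
})
df_sr = pd.DataFrame({
    "country": ["Kenya", "Liberia", "Mali"],
    "subregion": ["East Africa", "West Africa", "West Africa"],
})

# Count the rows produced by each join type.
counts = {}
for how in ["inner", "left", "right", "outer"]:
    joined = pd.merge(df, df_sr, on="country", how=how)
    counts[how] = len(joined)

print(counts)
```

The inner join keeps only countries present in both frames, left and right keep every row of one side, and outer (SQL's FULL join) keeps every country from either side, filling the missing columns with NaN.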

Union

The UNION operator is used to combine the results of two or more SELECT statements in SQL. It has a companion operator called UNION ALL. The two differ in that UNION removes duplicate rows from the combined result, while UNION ALL keeps them.

The task of the UNION ALL operator can be performed in Pandas using pd.concat(). The UNION operator can be reproduced by first concatenating the data frames with pd.concat() and then calling the drop_duplicates() method on the result.

SQL

To illustrate the UNION and UNION ALL operators in SQL, we have created an additional table called fert_data_1. A data frame similar to this table is built in the Pandas section below.

Our task is to find the union of rows from the fert_data and fert_data_1 tables:

SELECT * FROM fert_data_1
UNION ALL
SELECT * FROM fert_data
ORDER BY country;

You will observe that there are some duplicate values. Yes, you guessed it right. You can use the UNION operator to remove them. Try it for yourself.
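One way to try this without a database server is Python's built-in sqlite3 module. The sketch below is an assumption-laden miniature: the two tables are cut down to two columns and the rows are invented, so only the UNION versus UNION ALL behaviour carries over to the real tables.

```python
import sqlite3

# In-memory database with hypothetical, cut-down versions of the two tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fert_data (country TEXT, region TEXT);
    CREATE TABLE fert_data_1 (country TEXT, region TEXT);
    INSERT INTO fert_data VALUES ('Bangladesh', 'Asia'), ('Kenya', 'Africa');
    INSERT INTO fert_data_1 VALUES ('USA', 'North.Amer'), ('Bangladesh', 'Asia');
""")

query = "SELECT * FROM fert_data_1 {op} SELECT * FROM fert_data ORDER BY country;"

# UNION ALL keeps the repeated Bangladesh row; UNION removes it.
union_all_rows = conn.execute(query.format(op="UNION ALL")).fetchall()
union_rows = conn.execute(query.format(op="UNION")).fetchall()
conn.close()

print(len(union_all_rows), len(union_rows))
```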

Pandas

In Pandas, we have created a data frame that is similar to the fert_data_1 table in SQL.

# First row holds the column names, the remaining rows hold the data
data = [
    ['country', 'region', 'tfr', 'contraceptors'],
    ['USA', 'North.Amer', 1.77, 20],
    ['UK', 'Europe', 1.79, 23],
    ['Bangladesh', 'Asia', 5.5, 40],
    ['Thailand', 'Asia', 2.3, 68]
]
df1 = pd.DataFrame(data[1:], columns=data[0])

Union of df and df1:

df_dupli = pd.concat([df1, df])

The data from the two data frames has been combined. However, in this case we get duplicate rows as well. For example, the goal is to have ‘Bangladesh’ listed only once, so let’s check how many rows it has:

df_dupli[df_dupli['country'] == 'Bangladesh']

We can drop duplicate records using drop_duplicates() as shown:

df_wo_dupli = pd.concat([df1, df]).drop_duplicates()

Let’s run the same query and see if we still get two rows.

df_wo_dupli[df_wo_dupli['country'] == 'Bangladesh']

Problem solved. No more duplicate rows.
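The whole union workflow can be sketched end to end on toy data. The frames below are hypothetical, much smaller than the real ones, and the ignore_index and sort_values steps are additions not shown in the article: the first gives the combined frame a clean 0..n-1 index, the second mirrors ORDER BY country from the SQL query.

```python
import pandas as pd

# Hypothetical miniature versions of df and df1 sharing one identical row.
df = pd.DataFrame({
    "country": ["Bangladesh", "Kenya"],
    "region": ["Asia", "Africa"],
    "tfr": [5.5, 8.0],
})
df1 = pd.DataFrame({
    "country": ["USA", "Bangladesh"],
    "region": ["North.Amer", "Asia"],
    "tfr": [1.77, 5.5],
})

# UNION ALL: stack the frames and keep every row.
union_all = pd.concat([df1, df], ignore_index=True)

# UNION: same, then drop rows that are identical in every column.
union = union_all.drop_duplicates().reset_index(drop=True)

# ORDER BY country
union_sorted = union.sort_values("country")

print(len(union_all), len(union))
print(union_sorted["country"].tolist())
```

Like UNION, drop_duplicates() compares whole rows by default, so two rows for the same country with different tfr values would both survive.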

Group by

The GROUP BY clause in SQL is used to produce summary rows by grouping records together. The clause is usually used in conjunction with aggregate functions such as AVG, SUM, COUNT, MIN, and MAX. Here is the basic syntax for the GROUP BY clause:

SELECT column_name_1, agg_func(column_name_2)
FROM table_name
GROUP BY column_name_1;

In Pandas, the groupby() method helps us summarize data by the values of one or more columns. The generic syntax is as follows:

df.groupby(['column_name_1']).agg_function()

Let’s try an example to understand it better: find the average tfr and the count of the contraceptors field for each region.

SQL

SELECT region, round(avg(tfr),2), count(contraceptors)
FROM fert_data
GROUP BY region;

Pandas

# 'count' skips missing values, like COUNT(column) in SQL
df.groupby('region').agg({'tfr': 'mean', 'contraceptors': 'count'}).round(2)

We got the same results from both queries. You may be wondering what agg() in Pandas is used for. It applies one or more aggregation operations over a specified axis.
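As a rough illustration of agg(), here is the same kind of summary on a made-up data frame. The rows and the output column names (avg_tfr, n_contraceptors) are assumptions for this sketch; it uses pandas named aggregation, which lets each output column be labelled much like an AS alias in SQL.

```python
import pandas as pd

# Hypothetical rows standing in for fert_data.
df = pd.DataFrame({
    "region": ["Asia", "Asia", "Africa", "Africa"],
    "tfr": [5.5, 2.3, 8.0, 7.0],
    "contraceptors": [40, 68, 9, 5],
})

# One output column per (source column, aggregation) pair.
summary = (
    df.groupby("region")
      .agg(avg_tfr=("tfr", "mean"), n_contraceptors=("contraceptors", "count"))
      .round(2)
      .reset_index()  # turn the group key back into a regular column
)

print(summary)
```

By default groupby() sorts the group keys, so the regions come back in alphabetical order, which a plain GROUP BY in SQL does not guarantee.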

Before you go

And that does it. You should now have a good picture of both technologies, at least as far as data analysis goes. It’s difficult to recommend one over the other, as that will depend on your previous experience, your biases, and the tools your company has opted for.

The good thing is that everything done in SQL here can also be done in Pandas, at least at this level. Feel free to choose the one you like better; you won’t make a mistake.

Thanks for reading.

Translated from: https://towardsdatascience.com/sql-vs-pandas-which-one-to-choose-in-2020-part-2-9268d4a69984
