

Most data scientists rely on Pandas, a widely used data manipulation library built on top of the Python programming language. It is a powerful Python package that makes importing, cleaning, analyzing, and exporting data easier.

In a nutshell, Pandas is like Excel for Python: it has tables (which pandas calls DataFrames), columns (which pandas calls Series), and many features that make it an excellent library for inspecting, processing, and manipulating data.


This article shares some useful insights and tricks in pandas that make data analysis more fun and convenient.


import numpy as np  # needed for np.nan in the examples that follow
import pandas as pd

When displaying a dataframe, we often find that not all rows are visible, which makes analyzing the data difficult. Pandas provides the “set_option” function, which lets us define the maximum number of rows and columns to display.

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
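If the larger limit is only needed for one inspection, the option can also be changed temporarily. The following is a minimal sketch using pd.option_context; the dataframe name df_wide and the limit of 200 are illustrative assumptions.

```python
import pandas as pd

# Hypothetical dataframe with more rows than the default display limit.
df_wide = pd.DataFrame({"value": range(100)})

before = pd.get_option("display.max_rows")

# Inside the block the row limit is 200; on exit the old value is restored.
with pd.option_context("display.max_rows", 200):
    inside = pd.get_option("display.max_rows")
    print(df_wide)

outside = pd.get_option("display.max_rows")
```

This keeps the global settings untouched, which helps when several notebooks share the same configuration.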

1. Importing the data

Here are the different formats of data that can be imported by using pandas read functionality.


CSV, Excel, HTML, binary files, pickle files, JSON files, SQL queries


The format most commonly used for machine learning is CSV (comma-separated values), and nearly every data scientist works with CSV files on a daily basis, so we restrict this section to CSV files.


Some of the important keyword arguments when reading a CSV file in pandas are:

  • delimiter: the blank space, comma, or other character or symbol that separates the cells in a row

  • header: the row to be used for the column names

  • index_col: the column or columns to be used as row labels

  • usecols: the names of the columns to read; if provided, only that subset of the file is read

  • skiprows: the number of lines to skip at the start of the file, typically because the file begins with blank lines or unnecessary content


pd.read_csv(filepath, delimiter, header, names, index_col, usecols, skiprows, parse_dates, keep_date_col, chunksize)
pd.read_csv("data.csv", header= 
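To see how these arguments interact, here is a small sketch that reads CSV text from memory instead of a file. The file contents, the column names, and the variable name cars are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical file contents: one junk line, then a semicolon-separated table.
raw = """exported by some tool
id;company;price;launched
1;Kia;20000;2019-01-15
2;Honda;18000;2017-06-01
3;Hyundai;15000;2015-03-20
"""

cars = pd.read_csv(
    io.StringIO(raw),                       # a path such as "data.csv" works the same way
    delimiter=";",                          # cells are separated by semicolons
    skiprows=1,                             # skip the junk line at the top
    header=0,                               # first remaining row holds the column names
    index_col="id",                         # use the id column as row labels
    usecols=["id", "company", "launched"],  # read only these columns
    parse_dates=["launched"],               # convert this column to datetimes
)
print(cars)
```

Note that header is counted after skiprows has been applied, so header=0 refers to the first line that was not skipped.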

Once the data is imported, it is held in what pandas calls a DataFrame.


2. Understanding the data

df = pd.DataFrame({
    "Company": ["Kia", "Hyundai", "Hyundai", "Hyundai", "Hyundai", "Honda", "Honda", "Honda", "Honda", "Kia"],
    "Segment": ["premium", "budget", "luxury", "premium", "budget", "premium", "budget", "budget", "premium", "luxury"],
    "Type": ["large", "small", "large", "small", "small", "large", "small", "small", "large", "large"],
    "CrashRating": [4.5, 2.5, 4, np.nan, 3, 4, 3, 4.2, 4.5, 4.2],
    "CustomerFeedback": [9, 7, 5, 5, 8, 5.6, np.nan, 9, 9, 4.8]})

We can check the first n rows of the raw data with “head(number of rows)” and the last n rows with “tail(number of rows)”.

df.head(5)  # displays the first 5 rows of the dataframe
df.tail(5)  # displays the last 5 rows of the dataframe

The data is shown as a table, just as it would appear in Excel or any other CSV reader. To interact with the data further, let's look at some useful built-in functions.

  • info: provides a summary of the dataframe, namely the number of rows, the number of columns, and the name of each column along with its count of non-null values and its data type

  • describe: used to analyze the data statistically; by default it returns results only for the numerical columns in the dataframe. The returned table contains the count, mean, standard deviation, minimum value, maximum value, and quantile values, which are useful for detecting outliers and seeing the distribution of the data.

  • memory_usage: used to understand the memory usage of each column in bytes


  • dtypes: used to check the data type of each column in the dataframe. Returns a Series holding the data type of each column

  • isnull or isna: used to find the missing values in the dataframe. On its own it returns a boolean (True or False) for every cell indicating whether the value is NA, and it can be combined with other functions in several ways


df.isnull().sum() #count the number of missing values in each column
df.isnull().mean()*100 #return the percentage of missing values in each column
  • unique: returns the distinct values in one pandas Series (i.e. in one specific column of the dataframe), while nunique returns how many there are. Generally used to analyze categorical columns.

  • shape: returns the dimensions of the dataframe as a (rows, columns) tuple
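The inspection functions listed above can be tried together on a small table. This is a sketch on a made-up four-row dataframe; the name cars and its values are assumptions chosen to keep the output short.

```python
import io

import numpy as np
import pandas as pd

# Small illustrative dataframe with one missing rating.
cars = pd.DataFrame({
    "Company": ["Kia", "Hyundai", "Honda", "Honda"],
    "Segment": ["premium", "budget", "budget", "luxury"],
    "CrashRating": [4.5, np.nan, 3.0, 4.2],
})

buffer = io.StringIO()
cars.info(buf=buffer)                  # info() prints, so capture its text here
summary_text = buffer.getvalue()

stats = cars.describe()                # numeric columns only by default
bytes_per_column = cars.memory_usage(deep=True)
column_types = cars.dtypes             # attribute, not a method call
missing_per_column = cars.isnull().sum()
distinct_companies = cars["Company"].unique()
n_distinct = cars["Company"].nunique()
n_rows, n_cols = cars.shape

print(summary_text)
print(stats)
```

Passing deep=True to memory_usage asks pandas to account for the actual size of string values rather than only the pointers to them.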



3. Exploring the Data

Now that we have loaded our data into a DataFrame and understood its structure, let’s select parts of the data and examine them. Data can be selected either by index or based on certain conditions. In this section, let’s go through each of these methods and do some exploratory analysis.

  • Selecting the Columns


The set of columns we need to analyze can be selected in the following ways:


df[['Company', 'Type']]
df.loc[:,['Company', 'Type']]
  • Selecting the Rows


Specific rows can be selected for analysis by position with iloc or by label with loc:


df.iloc[[1,2], :]
df.loc[[1,2], :]
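With the default integer index the two calls above return the same rows, which can hide the difference between them. A short sketch with a custom index (the labels 10, 20, 30 are arbitrary assumptions) makes the distinction visible:

```python
import pandas as pd

# Illustrative dataframe whose row labels differ from the row positions.
cars = pd.DataFrame(
    {"Company": ["Kia", "Hyundai", "Honda"], "Type": ["large", "small", "large"]},
    index=[10, 20, 30],
)

by_position = cars.iloc[[0, 1], :]   # first and second row, whatever their labels
by_label = cars.loc[[10, 20], :]     # rows whose index labels are 10 and 20

print(by_position)
print(by_label)
```

Here cars.loc[[0, 1], :] would raise a KeyError, because no row carries the label 0 or 1.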
  • Selecting the specific type of columns


Sometimes it is helpful to select only the columns that have a specific data type, and pandas provides a function for this.
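One way to do this, assuming the function meant here is DataFrame.select_dtypes, is sketched below on a made-up dataframe named cars:

```python
import pandas as pd

# Illustrative dataframe with one text column and two numeric columns.
cars = pd.DataFrame({
    "Company": ["Kia", "Hyundai", "Honda"],
    "CrashRating": [4.5, 2.5, 4.0],
    "CustomerFeedback": [9, 7, 5],
})

numeric_part = cars.select_dtypes(include="number")  # int and float columns
text_part = cars.select_dtypes(exclude="number")     # everything else

print(numeric_part.columns.tolist())
print(text_part.columns.tolist())
```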

  • Selecting both rows and columns


You might wonder whether pandas is so weak that only one axis can be selected at a time, either a set of rows or a set of columns. It is not: we can select a subset of rows and columns in a single expression.


df.iloc[0:2][['Segment', 'Type']]
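The chained form above works for reading data. As a sketch of alternatives, the same selection can be written as a single loc or iloc call; the dataframe cars below is a shortened, made-up copy of the example data.

```python
import pandas as pd

cars = pd.DataFrame({
    "Company": ["Kia", "Hyundai", "Hyundai"],
    "Segment": ["premium", "budget", "luxury"],
    "Type": ["large", "small", "large"],
})

chained = cars.iloc[0:2][["Segment", "Type"]]
by_label = cars.loc[0:1, ["Segment", "Type"]]  # label slices include the end label
by_position = cars.iloc[0:2, [1, 2]]           # position slices exclude the end

print(by_label)
```

A single loc or iloc call is usually the safer choice when the selection is later used for assignment, since chained indexing may operate on a copy.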
  • Applying filter


In real-world scenarios, selecting rows by their index is rarely practical. The usual requirement is to keep only the rows that satisfy a certain condition. With our dataset, for example, we can filter on a combination of conditions:

df[(df['Type']=='large') & (df['Segment']=='luxury')]
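Conditions can be combined in other ways as well. The sketch below shows a few common patterns on a shortened, made-up copy of the data; the variable names are illustrative.

```python
import pandas as pd

cars = pd.DataFrame({
    "Company": ["Kia", "Hyundai", "Hyundai", "Honda", "Honda"],
    "Segment": ["premium", "budget", "luxury", "premium", "budget"],
    "Type": ["large", "small", "large", "large", "small"],
    "CrashRating": [4.5, 2.5, 4.0, 4.0, 3.0],
})

# AND: both conditions must hold (each condition needs its own parentheses).
large_and_luxury = cars[(cars["Type"] == "large") & (cars["Segment"] == "luxury")]

# Membership test against a list of allowed values.
kia_or_honda = cars[cars["Company"].isin(["Kia", "Honda"])]

# OR: at least one condition must hold.
safe_or_premium = cars[(cars["CrashRating"] >= 4) | (cars["Segment"] == "premium")]

# NOT: invert a condition with ~.
not_budget = cars[~(cars["Segment"] == "budget")]

# The same AND filter written as a query string.
same_as_first = cars.query("Type == 'large' and Segment == 'luxury'")

print(large_and_luxury)
```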

4. Handling and Transforming the Data

After basic exploratory analysis of the data, it is time to handle missing values and transform the data for more advanced exploration.


  • Missing data handling


Handling missing values is one of the trickiest and most crucial parts of data manipulation, because replacing missing cells changes the distribution of the data. Depending on the characteristics of the dataset and the task, we can choose to:

  • Drop missing values: We can drop a row or column that has missing values. A common rule of thumb is that when more than about 40% of a column's values are missing, the whole column is dropped from the analysis. Dropping rows removes entire observations and so reduces the size of the dataframe.

  • Replace missing values: Depending on the distribution of the column, we can replace the missing values with a special value, with an aggregate value such as the mean or median, or with another dynamic value such as the average of similar observations. For time-series data, missing values are generally filled from a window of values before and after the observation.

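df.fillna(axis=0, method='ffill', limit=1)  # forward-fill at most one consecutive missing value per column
# recent pandas versions deprecate the method argument; df.ffill(axis=0, limit=1) is the equivalent call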
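The strategies described above can be sketched as follows. The dataframe cars, the Notes column, and the 40% threshold are illustrative assumptions, not a rule that fits every dataset.

```python
import numpy as np
import pandas as pd

cars = pd.DataFrame({
    "Company": ["Kia", "Hyundai", "Hyundai", "Honda", "Honda"],
    "CrashRating": [4.5, np.nan, 4.0, 3.0, 3.5],
    "CustomerFeedback": [9.0, 7.0, np.nan, 8.0, 6.0],
    "Notes": [np.nan, np.nan, np.nan, "checked", np.nan],
})

# Drop columns in which more than 40% of the values are missing.
missing_share = cars.isnull().mean()
kept = cars.loc[:, missing_share <= 0.4]

# Drop every remaining row that still has a missing value.
complete_rows = kept.dropna()

# Replace missing ratings with the overall mean of the column.
mean_filled = cars["CrashRating"].fillna(cars["CrashRating"].mean())

# Replace missing ratings with the mean of similar observations (same company).
company_mean = cars.groupby("Company")["CrashRating"].transform("mean")
group_filled = cars["CrashRating"].fillna(company_mean)

# Carry the previous value forward, at most one step.
forward_filled = cars["CustomerFeedback"].ffill(limit=1)

print(kept.columns.tolist())
print(group_filled.tolist())
```

None of these calls modifies cars itself; each returns a new object, so the strategies can be compared side by side.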
  • Drop column or rows


Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names.


Dropping a column is helpful with labeled data, for example when we want to remove the target column (y_true) from the training and test features.


df.drop(['CrashRating'], axis=1)
df.drop([0,1], axis=0)
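As a sketch of the labeled-data case, assume a hypothetical dataframe named labeled whose target column is called y_true:

```python
import pandas as pd

# Hypothetical labeled dataset: two feature columns and one target column.
labeled = pd.DataFrame({
    "CrashRating": [4.5, 2.5, 4.0],
    "CustomerFeedback": [9, 7, 5],
    "y_true": [1, 0, 1],
})

features = labeled.drop(columns=["y_true"])  # same as drop(["y_true"], axis=1)
target = labeled["y_true"]
trimmed = labeled.drop(index=[0, 1])         # same as drop([0, 1], axis=0)

print(features.columns.tolist())
```

drop returns a new dataframe and leaves the original untouched unless the result is assigned back.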
  • Group By


In many situations we split the data into sets, for example by grouping records into buckets by a categorical value, and then apply some function to each subset. In the apply step we can perform the following operations:

  • Aggregation − computing a summary statistic

  • Transformation − perform some group-specific operation

  • Filtration − discarding the data with some condition
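The three kinds of operation can be sketched on a small made-up dataframe. The column name DiffFromCompanyMean and the minimum group size of 2 are illustrative assumptions.

```python
import pandas as pd

cars = pd.DataFrame({
    "Company": ["Kia", "Hyundai", "Hyundai", "Honda", "Honda", "Honda"],
    "CrashRating": [4.5, 2.5, 4.0, 4.0, 3.0, 4.4],
})

by_company = cars.groupby("Company")["CrashRating"]

# Aggregation: one summary value per group.
average_rating = by_company.mean()

# Transformation: a group-specific value aligned with the original rows.
cars["DiffFromCompanyMean"] = cars["CrashRating"] - by_company.transform("mean")

# Filtration: keep only the rows of groups that satisfy a condition.
big_brands = cars.groupby("Company").filter(lambda group: len(group) >= 2)

print(average_rating)
print(big_brands)
```

Aggregation shrinks the data to one row per group, while transformation and filtration return rows in the shape of the original dataframe.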

  • Pivot table


A pivot table is very similar to “groupby”: it is also composed of counts, sums, or other aggregations derived from a table of data. You may have used this feature in spreadsheets, where you choose the rows and columns to aggregate on and the values for those rows and columns. It allows us to summarize data grouped by different values, including values in categorical columns.

Pivot table function in pandas takes certain arguments as input:


  • index, columns = the names of the columns whose values become the row labels and column labels of the resulting table

  • values = the name of the column of values to be aggregated in the final table, grouped by the index and columns and aggregated according to the aggregation function

  • aggfunc = (aggregation function) how rows are summarized, such as sum, mean, or count

df.pivot_table(index=['Company', 'Type'], columns=['Segment'], values=['CrashRating'], aggfunc='mean')
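To make the relationship with “groupby” concrete, the sketch below builds the same summary both ways on a small made-up dataframe; the name cars and its values are assumptions.

```python
import pandas as pd

cars = pd.DataFrame({
    "Company": ["Kia", "Kia", "Honda", "Honda", "Honda"],
    "Segment": ["premium", "luxury", "budget", "budget", "premium"],
    "CrashRating": [4.5, 4.0, 3.0, 4.0, 4.5],
})

# One row per company, one column per segment, mean rating in each cell.
pivot = cars.pivot_table(
    index="Company", columns="Segment", values="CrashRating", aggfunc="mean"
)

# The same numbers via groupby, then moving Segment from the index to the columns.
via_groupby = cars.groupby(["Company", "Segment"])["CrashRating"].mean().unstack()

print(pivot)
```

Combinations that never occur in the data, such as a budget Kia here, show up as NaN in both results.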
  • Merge and Concatenation


When data is imported from multiple files into separate dataframes, it becomes necessary to concatenate, merge, or join those dataframes into one.


  • concat(): performs all the concatenation operations along an axis, with optional set logic (union or intersection) applied to the indexes on the other axes

df_new = pd.DataFrame({
    "Company": ["Hyundai", "Honda", "Honda", "Honda", "Kia"],
    "Segment": ["premium", "budget", "luxury", "luxury", "luxury"],
    "Type": ["large", "small", "large", "large", "large"],
    "CrashRating": [3.8, 3.5, 4, 4.2, 3],
    "CustomerFeedback": [8, 7, 7, 6, 7.5]})

df_result = pd.concat([df, df_new])  # stack the rows of both dataframes
df_result = pd.concat([df, df_new], keys=['old', 'new'])  # tag each source in an outer index level
  • merge(): pandas has in-memory join operations very similar to those of relational databases, where SQL uses “join” to combine two tables on a common key.


df_launchingyear = pd.DataFrame({
    "Company": ["Hyundai", "Honda", "Honda", "Honda", "Kia"],
    "LaunchingYear": [2015, 2018, 2017, 2012, 2019]})

pd.merge(df, df_launchingyear, on='Company')  # inner join on the Company column
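The kind of join is controlled by the how argument of merge. The sketch below contrasts the default inner join with a left join, and also shows concat with a fresh index; the dataframes and the company "Tata" are made up for illustration.

```python
import pandas as pd

cars = pd.DataFrame({
    "Company": ["Kia", "Hyundai", "Honda"],
    "CrashRating": [4.5, 2.5, 4.0],
})
launches = pd.DataFrame({
    "Company": ["Hyundai", "Honda", "Tata"],
    "LaunchingYear": [2015, 2018, 2020],
})

# Inner join (the default): only companies present in both tables.
inner = pd.merge(cars, launches, on="Company")

# Left join: every row of cars, with NaN where no launch year is known.
left = pd.merge(cars, launches, on="Company", how="left")

# Concatenate rows and renumber the index from 0.
more_cars = pd.DataFrame({"Company": ["Tata"], "CrashRating": [3.9]})
stacked = pd.concat([cars, more_cars], ignore_index=True)

print(inner)
print(left)
```

When the key column has repeated values on both sides, as in the example above this sketch, merge returns one row for every matching pair, so the result can be larger than either input.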
  • Create dummy variables


Categorical columns of type ‘object’ cannot be used as they are to train most ML models, so we create dummy (one-hot) variables for such a column with the pandas “get_dummies” function.
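A minimal sketch of this step, on a made-up dataframe named cars, could look like this:

```python
import pandas as pd

cars = pd.DataFrame({
    "Company": ["Kia", "Honda", "Kia"],
    "CrashRating": [4.5, 4.0, 4.2],
})

# Replace the Company column with one indicator column per distinct value.
encoded = pd.get_dummies(cars, columns=["Company"])

print(encoded)
```

Each new column is named after the original column and the category it flags, for example Company_Kia.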


Saving a dataframe

After exploratory analysis of the dataset, we usually want to store the results in a new CSV file. This may be the table returned by pivot_table, a version of the data with unnecessary details filtered out, or a new dataframe obtained from concat or merge operations.


Exporting the results as a CSV file is a simple step: we just call the “to_csv()” function, many of whose arguments mirror the ones used when reading data from CSV.

df.to_csv('./data.csv', index_label=False)
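A related option is index=False, which leaves the row index out of the file entirely. The sketch below writes to an in-memory buffer instead of a file and reads the result back; the dataframe cars is a made-up example.

```python
import io

import pandas as pd

cars = pd.DataFrame({
    "Company": ["Kia", "Honda"],
    "CrashRating": [4.5, 4.0],
})

buffer = io.StringIO()
cars.to_csv(buffer, index=False)   # a path such as "./data.csv" works the same way
csv_text = buffer.getvalue()

buffer.seek(0)                     # rewind before reading the text back
restored = pd.read_csv(buffer)

print(csv_text)
```

Without index=False the index is written as an extra first column, which then reappears as an unnamed column when the file is read back.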

Conclusion

In this article, we have listed some general pandas functions for analyzing a dataset, gathered while working with Python and Jupyter Notebooks. We hope these simple tricks will be of use to you and that you take something away from this article. Until then, happy coding!

