ULTIMATE GUIDE
“A picture is worth a thousand words”
- Fred R. Barnard
Data visualization is the visual (or graphic) representation of data. It helps us find useful insights (i.e. trends and patterns) in the data and makes the process of data analysis easier and simpler.
The aim of data visualization is to give a quick, clear understanding of the data at first glance and to present the information in a form that is easy to comprehend.
In Python, several comprehensive libraries are available for creating high-quality, attractive, interactive, and informative statistical graphics (2D and 3D).
Some popular data visualization libraries available in Python
Matplotlib is one such popular visualization library. It lets us create high-quality graphics with a range of graphs such as scatter plots, line charts, bar charts, histograms, and pie charts.
Seaborn is another Python data visualization library, built on top of Matplotlib, which has a high-level interface with attractive designs. It also reduces the lines of code required to produce the same result as in Matplotlib.
Pandas is another great Python library for data analysis (data manipulation, time-series analysis, integrated indexing of data, etc.). Pandas Visualization (built on top of Matplotlib) is a tool of the Pandas library that lets us create visual representations of data frames (data aligned in tabular form of columns and rows) and series (one-dimensional labeled arrays capable of holding data of any type) much more quickly and easily.
Plotly is used for creating interactive and multidimensional plots, making data analysis easier by providing a better visualization of the data.
With this article, we will learn how to plot data with different Python libraries, visualize the data in different forms, and understand which library to use where.
Note: We can use Google Colaboratory to avoid installing the libraries. All of them can be used by simply importing them in the notebook.
Understanding the basics of Matplotlib
Figure: The entire area in which everything is drawn. It can contain multiple plots with axes, legends, axis ranges, a grid, plot titles, etc.
Axes: The area within the figure where the plot is constructed (the area your plot appears in). There can be multiple axes in a single figure.
Axis: The number line in the graph that represents the range of values for the plot (the X-axis and Y-axis, as shown in the figure above). A multi-dimensional graph can have more than two of them.
Plot title: The title is positioned in the center above the axes and gives an overview of the plot.
Importing the dataset
In this article, we will use the Iris data set as an example at various points. It is free and commonly used, since it is one of the best-known databases in the pattern recognition literature.
We can import this data set in two ways:
1. Using Scikit-learn library:
Importing the Iris dataset using the Scikit-learn library
Without downloading the .csv file, we can import the data set directly into the workspace using the scikit-learn library available in Python.
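As a minimal sketch of this first approach (the original code is not reproduced here, so the exact calls the author used are an assumption), scikit-learn's `load_iris()` returns the data without any download:

```python
from sklearn.datasets import load_iris

# load_iris() returns a dict-like Bunch holding arrays and metadata.
iris = load_iris()

print(iris.feature_names)  # names of the four measurement columns
print(iris.target_names)   # names of the three species
print(iris.data[:5])       # first five rows as a plain NumPy array
```

The data arrives as bare NumPy arrays, which is why the output is less readable than a labeled table.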
2. Using Pandas library:
Importing data set using Pandas library
Using the above code (which imports the Pandas library) and a downloaded .csv copy of the dataset, we can import the data into our workspace. These are the first five rows of the iris dataset:
Both of the above-mentioned methods can be used to import the dataset and create graphs, but we will use the latter because the data is more readable (as you can see from the difference in the output of the two methods).
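A rough sketch of the Pandas approach follows. To keep it self-contained, it reads the CSV text from memory instead of a downloaded file; the column names are an assumption, since the original file's header is not shown.

```python
import io
import pandas as pd

# Stand-in for the downloaded file: the first five rows of the Iris data
# as CSV text. The header names are assumed for illustration.
# With a real download, pass the file path to pd.read_csv() instead.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
"""

iris = pd.read_csv(io.StringIO(csv_text))
print(iris.head())  # head() shows the first five rows with column labels
```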
Getting started with Matplotlib
We begin by importing the library into our notebook using the following code:
Importing matplotlib
There are various styles available in this library for drawing the plot.
Line plots
A line plot (or line chart) represents data as a continuous series, showing how the values change along the number line. It can be used to compare sets of numerical values. It is one of the simplest graphs we can make using Python.
Code for a basic line chart
Here, using the numpy linspace() function, we generate data points and store them in the variable x, then calculate the square of each value of x and store the results in another variable y.
We use the plt.plot() function to plot the graph and plt.show() to display it.
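The steps described above can be sketched as follows. The range and number of points passed to `linspace()` are assumptions, since the original values are not shown.

```python
import numpy as np
import matplotlib.pyplot as plt

# 50 evenly spaced points between 0 and 10 (an arbitrary choice).
x = np.linspace(0, 10, 50)
y = x ** 2  # square of every value in x

plt.plot(x, y)  # draw the line
plt.show()      # display the figure
```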
We can add some more functions to our plot to make it much easier to interpret.
To add labels for the x-axis and y-axis, we use the plt.xlabel() and plt.ylabel() functions respectively.
We can also give a title to our plot using the plt.title() function.
A grid can be applied to the plot simply by calling plt.grid(True) (it makes the data easier to interpret).
With the addition of these functions, the graph becomes much more readable and easier to analyze.
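A short sketch that combines these three additions on the same kind of line chart. The label and title strings are placeholders chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 50)
y = x ** 2

plt.plot(x, y)
plt.xlabel("x")            # x-axis label (placeholder text)
plt.ylabel("y = x^2")      # y-axis label (placeholder text)
plt.title("Square of x")   # plot title (placeholder text)
plt.grid(True)             # turn the grid on
plt.show()
```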
We can add more than one line to our plot and make the lines distinguishable by using different colors and some other features:
In the above code, we have added another variable z=x**3 (z = x³) and changed the style and color of the line.
To change the color of a line in the line plot, we add the color='' parameter to the plt.plot() function.
To change the style of a line, we add the linestyle='' parameter to the plt.plot() function (or simply pass a format string such as '*' or '--').
This makes extracting information and comparing data variables easier.
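One way this two-line plot might look is sketched below. The specific colors and styles are assumptions, picked only to show both the keyword form and the format-string shorthand.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 50)
y = x ** 2
z = x ** 3

# Keyword form: explicit color and linestyle.
plt.plot(x, y, color="blue", linestyle="-")
# Format-string shorthand: "r--" means a red dashed line.
plt.plot(x, z, "r--")
plt.show()
```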
Similarly, we can create plots for mathematical functions as well:
Here, we have created a plot for sin(x) and cos(x).
We can adjust the limits of the axes by using plt.xlim(lower_limit, upper_limit) for the x-axis and plt.ylim(lower_limit, upper_limit) for the y-axis.
For further labeling of the plot, we can add a legend with the plt.legend() function; it helps to identify which line stands for which function.
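A possible version of the sin(x) and cos(x) plot with axis limits and a legend is sketched below. The limits and the sampling range are assumed values.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)

# The label= text is what plt.legend() displays for each line.
plt.plot(x, np.sin(x), label="sin(x)")
plt.plot(x, np.cos(x), label="cos(x)")

plt.xlim(0, 2 * np.pi)  # x-axis limits (assumed)
plt.ylim(-1.5, 1.5)     # y-axis limits (assumed)
plt.legend()
plt.show()
```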
Subplots
To create separate (multiple) plots in the same figure, we can use the plt.subplots(num_rows, num_cols) function. The details of each subplot can be different.
The plt.subplots() function creates a figure and a grid of subplots, in which we define the number of rows and columns by passing int values as parameters. We can also change the spacing between the subplots by using the gridspec_kw={'hspace': , 'wspace': } argument. After that, we can plot each graph simply by using the index of its subplot.
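As an illustration, a 2x2 grid could be built like this. The spacing values and the four functions plotted are assumptions, not the original example.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 50)

# 2 rows x 2 columns; hspace/wspace set vertical/horizontal spacing.
fig, axes = plt.subplots(2, 2, gridspec_kw={"hspace": 0.4, "wspace": 0.3})

# Each subplot is addressed by its [row, column] index.
axes[0, 0].plot(x, x)
axes[0, 1].plot(x, x ** 2)
axes[1, 0].plot(x, np.sin(x))
axes[1, 1].plot(x, np.cos(x))
plt.show()
```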
Scatter plots
This kind of plot uses dots to represent the numerical data for different variables.
Scatter plots can be used to analyze how one variable affects the other variables. (We can plot any number of variables on the graph.)
We use the dataset_name.plot() function to create the graph, passing kind='scatter' together with the column labels for the x-axis and y-axis as parameters. Check out the example below (iris dataset).
Here, we are comparing the petal length and petal width of the different species of flowers present in the dataset.
However, it is very difficult to analyze and extract information from this plot because we cannot differentiate between the classes present.
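A sketch of this first scatter plot follows. To stay self-contained it builds the data frame from scikit-learn rather than a CSV file, and the snake_case column names are an assumption.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Build an Iris data frame (column names assumed for illustration).
iris = load_iris(as_frame=True)
df = iris.data.copy()
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
df["species"] = iris.target.map(dict(enumerate(iris.target_names)))

# All points share one color, so the species cannot be told apart.
ax = df.plot(kind="scatter", x="petal_length", y="petal_width")
plt.show()
```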
So now we will try another approach that solves this problem: we use plt.scatter() to create the scatter plot.
To change the color of the dots based on the species of flower, we can create a dictionary that stores the color corresponding to the name of each species. Using a for loop, we create a single scatter plot of the three different species (each represented by a different color).
This plot is much better than the previous one. The species become easier to distinguish, which gives overall clarity and makes the information easier to analyze.
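The dictionary-and-loop approach might look like the sketch below. The colors, the column names, and the use of `groupby()` to split the species are all assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.data.copy()
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
df["species"] = iris.target.map(dict(enumerate(iris.target_names)))

# One color per species name (arbitrary choices).
colors = {"setosa": "red", "versicolor": "green", "virginica": "blue"}

# Draw each species onto the same axes with its own color.
for species, group in df.groupby("species"):
    plt.scatter(group["petal_length"], group["petal_width"],
                color=colors[species], label=species)

plt.xlabel("petal_length")
plt.ylabel("petal_width")
plt.legend()
plt.show()
```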
Bar plots
Bar graphs can be used to compare categorical data. We have to provide the categories and the frequency of each one that we want to represent on the plot.
Bar plot using matplotlib
Here we use the iris dataset to compare the counts of the different species of flowers (each of them turns out to be fifty). To find the count of each unique category in the dataset, we use the value_counts() function. The variables species and count in the code store the name of each unique category (the .index attribute) and the frequency of each category (the .values attribute).
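A minimal sketch of this bar plot, assuming the data frame has a `species` column (built here from scikit-learn to keep the example self-contained):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.data.copy()
df["species"] = iris.target.map(dict(enumerate(iris.target_names)))

counts = df["species"].value_counts()
species = counts.index   # name of each unique category
count = counts.values    # frequency of each category

plt.bar(species, count)
plt.show()
```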
This is the most basic kind of bar graph. You can try some variations of this plot, such as multiple bar plots in the same figure, changing the width of the bars (using the width= parameter), or creating a stacked bar plot (using the bottom parameter).
Box plots
Box plots help us compare values by plotting the distribution of the data based on the sample minimum, the lower quartile, the median, the upper quartile, and the sample maximum (known as the five-number summary). This helps us analyze the data to find outliers and variation.
Box plot
We have excluded the species column here, since we are only comparing the petal length, petal width, sepal length, and sepal width of all the flowers in the iris dataset. We create the box plot using the .boxplot() function.
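A sketch of the box plot, assuming the pandas `DataFrame.boxplot()` method is the `.boxplot()` call meant here and that the column names are as shown:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.data.copy()
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
df["species"] = iris.target.map(dict(enumerate(iris.target_names)))

# Drop the non-numeric species column, then draw one box per measurement.
features = df.drop(columns="species")
ax = features.boxplot()
plt.show()
```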
Histograms
Histograms are used to represent the frequency distribution (or, we can say, the probability distribution) of the data. We use the plt.hist() function to create the histogram, and we can also define the bins for the plot (i.e. break the entire range of values into a series of intervals and count the values falling in each interval).
Code for creating histogram using matplotlib
Histograms are a special kind of bar graph.
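For illustration, a histogram of one Iris measurement could be drawn as below. The choice of column and the bin count are assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.data.copy()
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# plt.hist() returns the count per bin and the bin edges.
counts, edges, _ = plt.hist(df["sepal_length"], bins=20)
plt.xlabel("sepal_length")
plt.ylabel("count")
plt.show()
```

With 20 bins there are 21 edges, and the counts add up to the number of rows.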
Error bars
Error bars are an excellent tool for finding the statistical difference between groups of data by giving a visual representation of the variation in the data. They help to show the error and precision in the process of data analysis (and to determine the quality of the model).
Code for creating error bars using matplotlib
To plot the error bars, we use the errorbar() function, where x and y are the data point locations, and yerr and xerr define the size of the error bars (in this code we are only using yerr).
We can also change the style and color of the error bars by using the fmt parameter (in this particular example we set the style to dots with 'o'), the color parameter to change the color of the dots, and the ecolor parameter to change the color of the vertical error lines.
By adding the loc='' parameter to the plt.legend() function, we can set the position of the legend in the plot.
Heat maps
Heat maps are used to represent data in the form of a color-coded image plot (values in the data are represented as colors) to find the correlation between the features in the data (cluster analysis). With the help of heat maps, we can analyze the data visually, quickly and in depth.
Create heatmap using matplotlib
In this example, we use the iris dataset to create a heat map. .corr() is a pandas DataFrame method used to find the correlation between the columns of the dataset. The heat map is created with the .imshow() function, to which we pass the correlation of the dataset and cmap (for setting the style and color of the plot) as arguments. To add the colorbar, we use the .figure.colorbar() function. Finally, to add annotations (the values shown over the color blocks), we use two for loops.
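The steps above might be put together as in this sketch. The colormap, the tick labeling, and the column names are assumptions rather than the original code.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.data.copy()
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

corr = df.corr()  # 4x4 correlation matrix

fig, ax = plt.subplots()
im = ax.imshow(corr.values, cmap="coolwarm")  # cmap is an arbitrary choice
ax.figure.colorbar(im, ax=ax)

ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)

# Two nested loops write each correlation value over its color block.
for i in range(len(corr.index)):
    for j in range(len(corr.columns)):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")

plt.show()
```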
Pie charts
Pie charts are used to show the composition of the categories in the data (as a percentage or proportion), where each slice represents a different category, giving a summary of the whole data.
Pie chart using matplotlib
To plot the pie chart, we use the plt.pie() function. The shadow=True parameter gives a 3D effect to the plot, the explode parameter shows a category separately from the rest of the plot, and the autopct parameter displays the percentage of each category. To keep the circle proportionate, we can use the plt.axis('equal') function.
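A sketch using the species counts as the pie slices. Which data the original chart showed is not known, so both the data and the explode offsets are assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
species = iris.target.map(dict(enumerate(iris.target_names)))
counts = species.value_counts()

plt.pie(counts.values,
        labels=counts.index,
        explode=(0.1, 0, 0),   # pull the first slice away from the rest
        autopct="%1.1f%%",     # print each slice's percentage
        shadow=True)           # drop shadow for a 3D-like effect
plt.axis("equal")              # keep the pie circular
plt.show()
```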
Seaborn
With seaborn's high-level interface and attractive designs, we can create amazing plots with better visualizations. Moreover, the lines of code required are reduced to a great extent (as compared to matplotlib).
Code for importing the library into the workspace:
Importing seaborn library
Line plots
We can create a line plot in the seaborn library simply by using the sns.lineplot() function.
Here we can also vary the color of the grid/background using the .set_style() function available in the library.
Scatter Plot
With the seaborn library, we can create the scatter plot in just a single line of code!
Scatter plot using seaborn library
Here, we have used the FacetGrid() function (with which we can quickly explore our dataset) to create the plot, in which we can define hue (i.e. the colors of the scatter dots), and the .map function to define the graph type. (An alternative method for creating a scatter plot is sns.scatterplot().)
Bar plots
We can create a bar plot in the seaborn library by using the sns.barplot() function.
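As a hedged example, the species counts from the earlier bar plot can be passed to `sns.barplot()`. What the original plotted is not shown, so this choice of data is an assumption.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
species = iris.target.map(dict(enumerate(iris.target_names)))
counts = species.value_counts()

# Categories on the x-axis, their frequencies on the y-axis.
ax = sns.barplot(x=counts.index, y=counts.values)
plt.show()
```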
Histogram
We can create a histogram in the seaborn library by using the sns.distplot() function (deprecated in newer seaborn releases in favor of sns.histplot() and sns.displot()). We can also estimate the probability density function (PDF), the cumulative distribution function (CDF), and the kernel density estimate (KDE) using this library for data analysis.
Seaborn offers some more features for data visualization than matplotlib.
Heat maps
Seaborn is very efficient at creating heat maps, significantly reducing the lines of code needed to create the figure.
The multiple lines of code needed in matplotlib are reduced to just two lines!
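A sketch of the shorter seaborn version, where the last two statements before `plt.show()` do the work of the matplotlib loops. The colormap and column names are assumptions.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.data.copy()
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

corr = df.corr()
# annot=True writes each value into its cell, replacing the manual loops.
ax = sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()
```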
Pair plots
This is a unique kind of plot available in the seaborn library. It plots the pairwise relationships in a dataset (in a single figure). It is an amazing tool for data analysis.
By using the sns.pairplot() function we can create pair plots (height is used to adjust the height of the plots).
Pandas Visualization
This library provides an easy way to plot graphs using pandas data frames and data structures. It is also built on top of matplotlib and requires fewer lines of code.
Histograms
It is very simple to create a histogram with this library: we simply use the .plot.hist() function. We can also create subplots in the same figure by using the subplots=True argument.
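A sketch of the pandas histogram with one subplot per column. The layout, bin count, and figure size are assumed values.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.data.copy()
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# subplots=True gives each column its own histogram in a 2x2 grid.
axes = df.plot.hist(subplots=True, layout=(2, 2), bins=20, figsize=(10, 6))
plt.show()
```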
Line plots
We can create line plots with this library by using the .plot.line() function. Legends are also added automatically.
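For example, plotting all four measurement columns against the row index (the column names are assumed):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.data.copy()
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# One line per column; pandas adds the legend on its own.
ax = df.plot.line()
plt.show()
```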
Plotly
With this library, we can create multidimensional interactive plots! It is an easy-to-use library with a high-level interface. We can import this library by using the following code:
4D-plot (Iris dataset)
Try running this code on your own to inspect and interact with the plot.
Conclusion
I hope that with this article you will be able to visualize data using different libraries in Python and start analyzing it.
For a better understanding of these concepts, I recommend that you try writing this code on your own. Keep exploring, and I am sure you will discover new features along the way. I am sharing my notebook repository at the end of the document for your reference.
If you have any questions or comments, please post them in the comment section.
If you want to improve the way you write your code, check out our other article:
Originally published at: www.patataeater.blogspot.com
Translated from: https://towardsdatascience.com/data-visualization-with-python-8bc988e44f22