Andrews Curves
An Andrews plot, also known as an Andrews curve, helps you visualize higher-dimensional, multivariate data by plotting each of your dataset's observations as a curve. The feature values of the observation act as the coefficients of the curve, so observations with similar characteristics tend to produce curves that lie close together. Because of this, Andrews curves are also of some use in outlier detection: an unusual observation shows up as a curve that strays from the rest.
Just as with Parallel Coordinates, every plotted feature must be numeric, since the curve equation is essentially the dot product of the observation's feature vector and the vector (1/sqrt(2), sin(t), cos(t), sin(2t), cos(2t), sin(3t), cos(3t), ...), which produces a Fourier series in t.
The Pandas implementation once again requires you to specify a GroupBy feature, which is then used to color code the curves as well as produce a chart legend:
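To make the equation concrete, here is a minimal NumPy sketch that evaluates the curve for a single observation. The function name andrews_curve and the choice to sample t between -pi and pi are assumptions made for this illustration; they are not part of the Pandas API:

```python
import numpy as np

def andrews_curve(x, t):
    """Evaluate the Andrews curve of one observation x at the angles t.

    f(t) = x[0]/sqrt(2) + x[1]*sin(t) + x[2]*cos(t) + x[3]*sin(2t) + x[4]*cos(2t) + ...
    """
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    # Build the basis vector (1/sqrt(2), sin(t), cos(t), sin(2t), cos(2t), ...)
    basis = [np.full_like(t, 1.0 / np.sqrt(2.0))]
    for i in range(1, len(x)):
        harmonic = (i + 1) // 2                  # 1, 1, 2, 2, 3, 3, ...
        trig = np.sin if i % 2 == 1 else np.cos  # sin and cos alternate
        basis.append(trig(harmonic * t))
    basis = np.stack(basis)                      # shape: (n_features, len(t))
    return x @ basis                             # one curve value per t

t = np.linspace(-np.pi, np.pi, 200)
observation = [5.1, 3.5, 1.4, 0.2]               # four numeric features (illustrative values)
curve = andrews_curve(observation, t)
```

Two observations with nearly identical feature values yield nearly identical coefficients, which is why their curves overlap when plotted against t.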
from sklearn.datasets import load_iris
from pandas.plotting import andrews_curves
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
# Look pretty...
matplotlib.style.use('ggplot')
# If the above line throws an error, use plt.style.use('ggplot') instead
# Load up SKLearn's Iris Dataset into a Pandas Dataframe
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target_names'] = [data.target_names[i] for i in data.target]
# Andrews Curves Start Here:
plt.figure()
andrews_curves(df, 'target_names')
plt.show()
One of the current weaknesses of the Pandas implementation (and this goes for Parallel Coordinates as well) is that every single observation is charted. In the MATLAB version, you can specify a quantile or probability distribution cutoff. This way, only a central curve for each group is plotted (the median, in MATLAB's case), with a transparent boundary marking the cutoffs. If you feel up to the challenge, a bonus assignment for you is to take the existing Pandas Andrews curve implementation and extend it with said functionality.
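As a starting point for that exercise, the numeric half of the idea can be sketched with NumPy alone. Everything here is an assumption for illustration: the helper names andrews_matrix and group_bands, the synthetic data, and the use of the median as the central curve (swap in member_curves.mean(axis=0) if you prefer the mean). It does not modify or reproduce the Pandas implementation:

```python
import numpy as np

def andrews_matrix(X, t):
    """Return Andrews curves for all rows of X, shape (n_observations, len(t))."""
    X = np.asarray(X, dtype=float)
    basis = [np.full_like(t, 1.0 / np.sqrt(2.0))]
    for i in range(1, X.shape[1]):
        trig = np.sin if i % 2 == 1 else np.cos
        basis.append(trig(((i + 1) // 2) * t))
    return X @ np.stack(basis)

def group_bands(X, labels, t, alpha=0.25):
    """Per group: the alpha quantile, median, and (1 - alpha) quantile curves."""
    curves = andrews_matrix(X, t)
    labels = np.asarray(labels)
    bands = {}
    for group in np.unique(labels):
        member_curves = curves[labels == group]
        # Quantiles are taken across observations, separately at each t
        lower, median, upper = np.quantile(
            member_curves, [alpha, 0.5, 1.0 - alpha], axis=0
        )
        bands[group] = (lower, median, upper)
    return bands

# Synthetic stand-in data: two groups of 50 observations with 4 features each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 4)), rng.normal(3.0, 1.0, (50, 4))])
labels = np.array(["a"] * 50 + ["b"] * 50)
t = np.linspace(-np.pi, np.pi, 100)
bands = group_bands(X, labels, t)
```

To draw the result, plot each median with plt.plot(t, median) and shade the cutoff region with plt.fill_between(t, lower, upper, alpha=0.3), so each group contributes one line and one translucent band instead of fifty curves.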
Imshow
One last higher-dimensionality visualization technique you should know how to use is MatPlotLib's .imshow() method. This command generates an image based on the normalized values stored in a matrix, or rectangular array of float64s. The properties of the generated image depend on the dimensions and contents of the array passed in:
An [X, Y] shaped array holds one value per pixel; the values are mapped through a colormap, which gives a grayscale image when a gray colormap such as cmap='gray' is used
An [X, Y, 3] shaped array results in a full-color image: 1 channel for red, 1 for green, and 1 for blue
An [X, Y, 4] shaped array results in a full-color image as before, with an extra channel for alpha
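The three cases can be tried side by side with random data. This is a small sketch with made-up array sizes; the 2-D array is drawn with cmap='gray' passed explicitly, since without it MatPlotLib applies its default colormap:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Float values are kept in [0, 1], the range imshow expects for RGB(A) data
scalar = rng.random((32, 48))                       # [X, Y]: one value per pixel
rgb = rng.random((32, 48, 3))                       # [X, Y, 3]: red, green, blue
alpha = np.tile(np.linspace(0.0, 1.0, 48), (32, 1)) # opacity ramps left to right
rgba = np.dstack([rgb, alpha])                      # [X, Y, 4]: RGB plus alpha

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
axes[0].imshow(scalar, cmap="gray", interpolation="nearest")
axes[0].set_title("[X, Y], cmap='gray'")
axes[1].imshow(rgb, interpolation="nearest")
axes[1].set_title("[X, Y, 3] RGB")
axes[2].imshow(rgba, interpolation="nearest")
axes[2].set_title("[X, Y, 4] RGBA")
fig.tight_layout()
```

Call plt.show() afterwards to display the figure. The third panel fades in from left to right because its alpha channel rises from 0 to 1 across the columns.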
Besides being a straightforward way to display .PNG and other images, the .imshow() method has quite a few other use cases. When you use the .corr() method on your dataset, Pandas calculates a correlation matrix for you that measures how close to linear the relationship between any two features in your dataset is. Correlation values range from -1 to 1. A value of 1 means the two features are perfectly positively correlated: plotted against each other, they fall exactly on a line with positive slope. A value of -1 means they are perfectly negatively correlated, again falling exactly on a line, this time with negative slope. Values close to 0 mean there is little to no linear relationship between the two variables (e.g., pizza sales and plant growth), so the further away from 0 the value is, the stronger the linear relationship between the features:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])
>>> df.corr()
          a         b         c         d         e
a  1.000000  0.007568  0.014746  0.027275 -0.029043
b  0.007568  1.000000 -0.039130 -0.011612  0.082062
c  0.014746 -0.039130  1.000000  0.025330 -0.028471
d  0.027275 -0.011612  0.025330  1.000000 -0.002215
e -0.029043  0.082062 -0.028471 -0.002215  1.000000
The matrix is symmetric because the correlation between any two features X and Y is, of course, identical to that of features Y and X. Correlation is also invariant to scale, so even if one feature is measured in inches and the other in centimeters, it makes no difference. This matrix, and others like the covariance matrix, are useful for inspecting how much of the variance of a feature is explained by the variance in other features, and for verifying how much new information each feature provides. But even looking at this little 5x5 matrix makes me dizzy, so you can imagine how easy it is to get lost in a higher-dimensionality dataset. You can circumvent this by plotting your correlation matrix with .imshow():
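Both properties are easy to check for yourself. The sketch below uses synthetic data and made-up column names: the same heights are stored once in inches and once in centimeters, and the correlation matrices are compared. The covariance matrix is included for contrast, since it does change with the units:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
height_in = rng.normal(68.0, 3.0, 500)                # synthetic heights in inches
weight = 2.0 * height_in + rng.normal(0.0, 5.0, 500)  # loosely related synthetic feature

df_in = pd.DataFrame({"height": height_in, "weight": weight})
df_cm = pd.DataFrame({"height": height_in * 2.54, "weight": weight})  # same heights in cm

corr_in = df_in.corr()
corr_cm = df_cm.corr()

# Symmetry: corr(X, Y) equals corr(Y, X)
print(np.allclose(corr_in.values, corr_in.values.T))
# Scale invariance: converting inches to centimeters leaves correlation unchanged
print(np.allclose(corr_in.values, corr_cm.values))
# Covariance, by contrast, depends on the units of measurement
print(np.allclose(df_in.cov().values, df_cm.cov().values))
```

The first two checks print True and the last prints False, because rescaling a feature multiplies its covariances by the scale factor while the correlation divides that factor back out.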
import matplotlib.pyplot as plt
plt.imshow(df.corr(), cmap=plt.cm.Blues, interpolation='nearest')
plt.colorbar()
tick_marks = list(range(len(df.columns)))
plt.xticks(tick_marks, df.columns, rotation='vertical')
plt.yticks(tick_marks, df.columns)
plt.show()
.imshow() can help you any time you have a matrix you want to visualize. Other matrices you might want to visualize include the covariance matrix and the confusion matrix. Later on, once you learn how to use certain machine learning algorithms that generate clusters living in your feature-space, you'll also be able to use .imshow() to peek into the brain of your algorithms as they run, so long as your features represent a rectangular image!
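As an example of one of those other matrices, here is a sketch that tallies a small confusion matrix by hand and displays it the same way as the correlation matrix. The true and predicted labels are hypothetical values invented for the illustration, not the output of any real classifier:

```python
import numpy as np
import matplotlib.pyplot as plt

class_names = ["setosa", "versicolor", "virginica"]
# Hypothetical true and predicted class indices, made up for illustration
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 1, 2, 2, 1])

# Rows are true classes, columns are predicted classes
confusion = np.zeros((3, 3), dtype=int)
np.add.at(confusion, (y_true, y_pred), 1)   # count each (true, predicted) pair

fig, ax = plt.subplots()
image = ax.imshow(confusion, cmap=plt.cm.Blues, interpolation="nearest")
fig.colorbar(image, ax=ax)
tick_marks = list(range(len(class_names)))
ax.set_xticks(tick_marks)
ax.set_xticklabels(class_names, rotation="vertical")
ax.set_yticks(tick_marks)
ax.set_yticklabels(class_names)
ax.set_xlabel("predicted")
ax.set_ylabel("true")
```

Call plt.show() afterwards to display the figure. Dark cells on the diagonal are correct predictions; any shading off the diagonal shows which classes are being mistaken for one another.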
Dive Deeper
Sound data is the cornerstone of data science, and not much can be done without it. Having learned how to look for and manipulate your data, in this module you experimented with numerous visualization techniques, such as scatter plots, histograms, and other higher-dimensionality methods, to ensure the data you've collected is sound. You probably also learned more about wheat kernels than you wanted to. We hope you've taken scrupulous notes about the best use cases for each of these plotting mechanisms and will be able to apply them on demand as needed!
The time has come for you to start applying real machine learning to your data. If you have some extra time, take a look at the following list of additional resources so that your visualization toolbox has all the tools you need to continue marching forward!
Basic Visualizations
Pandas Visualization with MatPlotLib
Radar Charts
Scatter-Histogram 2-Variable Distribution
Higher Dimensionality
Andrews Plot
Parallel Coordinates on Wikipedia
More on Parallel Coordinates Usage
Parallel Coords with Different Axes in MatPlotLib
Extras
MatPlotLib Markers
MatPlotLib ColorMaps