Operations
Pandas has many useful operations for inspecting, filtering, and transforming a DataFrame. This section walks through some of the most common ones.
import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df.head()
|   | col1 | col2 | col3 |
|---|------|------|------|
| 0 | 1    | 444  | abc  |
| 1 | 2    | 555  | def  |
| 2 | 3    | 666  | ghi  |
| 3 | 4    | 444  | xyz  |
Info on Unique Values
df['col2'].unique()
array([444, 555, 666])
df['col2'].nunique()
3
df['col2'].value_counts()
444 2
555 1
666 1
Name: col2, dtype: int64
Selecting Data
newdf = df[(df['col1']>2) & (df['col2']==444)]
newdf
|   | col1 | col2 | col3 |
|---|------|------|------|
| 3 | 4    | 444  | xyz  |
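The same boolean-mask pattern extends to other combinations. The following is a minimal sketch, assuming the same example DataFrame as above; the names `either`, `not_444`, and `picked` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': [444, 555, 666, 444],
                   'col3': ['abc', 'def', 'ghi', 'xyz']})

# | means "or"; each condition still needs its own parentheses
either = df[(df['col1'] > 3) | (df['col2'] == 555)]

# ~ negates a boolean mask
not_444 = df[~(df['col2'] == 444)]

# isin() keeps rows whose value appears in the given list
picked = df[df['col3'].isin(['abc', 'xyz'])]
```

The parentheses matter because `&`, `|`, and `~` bind more tightly than comparison operators in Python.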
Applying Functions
def times2(x):
    return x*2
df['col1'].apply(times2)
0 2
1 4
2 6
3 8
Name: col1, dtype: int64
df['col3'].apply(len)
0 3
1 3
2 3
3 3
Name: col3, dtype: int64
df['col1'].sum()
10
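For a one-off transformation, a lambda can be passed to `apply` instead of a named function. This is a small sketch using the same example column; `doubled` and `vectorized` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4]})

# Same result as times2, written inline as a lambda
doubled = df['col1'].apply(lambda x: x * 2)

# For simple arithmetic, operating on the whole column at once
# gives the same values without calling a Python function per element
vectorized = df['col1'] * 2
```

`apply` is most useful when the logic cannot be expressed as a direct column operation.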
**Permanently Removing a Column**
del df['col1']
df
|   | col2 | col3 |
|---|------|------|
| 0 | 444  | abc  |
| 1 | 555  | def  |
| 2 | 666  | ghi  |
| 3 | 444  | xyz  |
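If the original DataFrame should stay intact, `drop` returns a new DataFrame without the column rather than modifying the existing one. A minimal sketch, assuming the same starting data; `trimmed` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': [444, 555, 666, 444],
                   'col3': ['abc', 'def', 'ghi', 'xyz']})

# drop() returns a copy without 'col1'; df itself is unchanged
trimmed = df.drop(columns=['col1'])
```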
**Get column and index names:**
df.columns
Index(['col2', 'col3'], dtype='object')
df.index
RangeIndex(start=0, stop=4, step=1)
**Sorting and Ordering a DataFrame:**
df
|   | col2 | col3 |
|---|------|------|
| 0 | 444  | abc  |
| 1 | 555  | def  |
| 2 | 666  | ghi  |
| 3 | 444  | xyz  |
df.sort_values(by='col2')
|   | col2 | col3 |
|---|------|------|
| 0 | 444  | abc  |
| 3 | 444  | xyz  |
| 1 | 555  | def  |
| 2 | 666  | ghi  |
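`sort_values` also accepts an `ascending` argument and a list of columns. The sketch below assumes the same two-column DataFrame; `desc` and `multi` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'col2': [444, 555, 666, 444],
                   'col3': ['abc', 'def', 'ghi', 'xyz']})

# Largest col2 values first
desc = df.sort_values(by='col2', ascending=False)

# Sort by col2 ascending, breaking ties by col3 descending
multi = df.sort_values(by=['col2', 'col3'], ascending=[True, False])
```

Like the call above, both return a new DataFrame and keep the original index labels, so the row order of `df` itself does not change.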
**Finding and Dropping Null Values**
df.isnull()
|   | col2  | col3  |
|---|-------|-------|
| 0 | False | False |
| 1 | False | False |
| 2 | False | False |
| 3 | False | False |
df.dropna()
|   | col2 | col3 |
|---|------|------|
| 0 | 444  | abc  |
| 1 | 555  | def  |
| 2 | 666  | ghi  |
| 3 | 444  | xyz  |
**Filling in NaN values with something else:**
import numpy as np
df = pd.DataFrame({'col1':[1,2,3,np.nan],
'col2':[np.nan,555,666,444],
'col3':['abc','def','ghi','xyz']})
df.head()
|   | col1 | col2  | col3 |
|---|------|-------|------|
| 0 | 1.0  | NaN   | abc  |
| 1 | 2.0  | 555.0 | def  |
| 2 | 3.0  | 666.0 | ghi  |
| 3 | NaN  | 444.0 | xyz  |
df.fillna('FILL')
|   | col1 | col2 | col3 |
|---|------|------|------|
| 0 | 1    | FILL | abc  |
| 1 | 2    | 555  | def  |
| 2 | 3    | 666  | ghi  |
| 3 | FILL | 444  | xyz  |
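Filling with a string turns numeric columns into mixed-type columns, so a numeric fill is often more practical. The sketch below assumes the same NaN-containing DataFrame and shows two common variants; `filled` and `by_col` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, np.nan],
                   'col2': [np.nan, 555, 666, 444],
                   'col3': ['abc', 'def', 'ghi', 'xyz']})

# Fill one column's missing values with that column's mean
filled = df.copy()
filled['col1'] = filled['col1'].fillna(filled['col1'].mean())

# Pass a dict to choose a different fill value per column
by_col = df.fillna({'col1': 0, 'col2': df['col2'].mean()})
```

Neither call modifies `df`; `fillna` returns a new object unless the result is assigned back.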
**Pivot Tables:**
data = {'A':['foo','foo','foo','bar','bar','bar'],
'B':['one','one','two','two','one','one'],
'C':['x','y','x','y','x','y'],
'D':[1,3,2,5,4,1]}
df = pd.DataFrame(data)
df
|   | A   | B   | C | D |
|---|-----|-----|---|---|
| 0 | foo | one | x | 1 |
| 1 | foo | one | y | 3 |
| 2 | foo | two | x | 2 |
| 3 | bar | two | y | 5 |
| 4 | bar | one | x | 4 |
| 5 | bar | one | y | 1 |
df.pivot_table(values='D',index=['A', 'B'],columns=['C'])
| A   | B   | C = x | C = y |
|-----|-----|-------|-------|
| bar | one | 4.0   | 1.0   |
|     | two | NaN   | 5.0   |
| foo | one | 1.0   | 3.0   |
|     | two | 2.0   | NaN   |
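By default `pivot_table` aggregates with the mean, and combinations that have no rows appear as NaN. The sketch below, assuming the same `data` dictionary, sums instead and fills the empty cells with 0; `table` is an illustrative name.

```python
import pandas as pd

data = {'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
        'B': ['one', 'one', 'two', 'two', 'one', 'one'],
        'C': ['x', 'y', 'x', 'y', 'x', 'y'],
        'D': [1, 3, 2, 5, 4, 1]}
df = pd.DataFrame(data)

# aggfunc chooses how duplicate (A, B, C) combinations are combined;
# fill_value replaces the NaN cells for combinations with no rows
table = df.pivot_table(values='D', index=['A', 'B'], columns=['C'],
                       aggfunc='sum', fill_value=0)
```

Here every (A, B, C) combination occurs at most once, so the sum and the mean agree wherever data exists; the difference shows up only when several rows share the same keys.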