DataFrames Advanced Methods#

Section Title: DataFrames Advanced Methods

import pandas as pd

The following dataframes will be used throughout this section.

df1 = pd.DataFrame({'1C1':[3,4,5], '1C2':[30,40,50], 'col':[300,400,500]}, index=['A', 'B', 'C'])
df1
1C1 1C2 col
A 3 30 300
B 4 40 400
C 5 50 500
df2 = pd.DataFrame({'2C1':[5,6,9], '2C2':[50,60,90]}, index=['A', 'B', 'C'])
df2
2C1 2C2
A 5 50
B 6 60
C 9 90
df3 = pd.DataFrame({'3C1':[7,8], '3C2':[70,80], 'col':[700,800]}, index=['C', 'D'])
df3
3C1 3C2 col
C 7 70 700
D 8 80 800

concat()#

It is used to concatenate dataframes along an axis (horizontally or vertically).

# default axis=0: add as new rows
# two new rows coming from df3: C and D
pd.concat([df1,df3])        
1C1 1C2 col 3C1 3C2
A 3.0 30.0 300 NaN NaN
B 4.0 40.0 400 NaN NaN
C 5.0 50.0 500 NaN NaN
C NaN NaN 700 7.0 70.0
D NaN NaN 800 8.0 80.0
# axis=1, add as new columns
# three more rows are coming from df3: 3C1, 3C2, col 
pd.concat([df1,df3], axis=1)   
1C1 1C2 col 3C1 3C2 col
A 3.0 30.0 300.0 NaN NaN NaN
B 4.0 40.0 400.0 NaN NaN NaN
C 5.0 50.0 500.0 7.0 70.0 700.0
D NaN NaN NaN 8.0 80.0 800.0
# along only on common columns
pd.concat([df1,df2], join='inner')   # intersection of columns
A
B
C
A
B
C
  • If the indexes of two dataframes are the same, the column labels are different, and axis=1, then the second dataframe is concatenated horizontally.

# default axis=1: add as a new row
pd.concat([df1,df2], axis=1) 
1C1 1C2 col 2C1 2C2
A 3 30 300 5 50
B 4 40 400 6 60
C 5 50 500 9 90

shift()#

It is used to shift the data up or down.

df = pd.DataFrame(['A', 'B', 'C', 'D', 'E'])
df
0
0 A
1 B
2 C
3 D
4 E
# shift the values down by 1 row
df.shift()
0
0 None
1 A
2 B
3 C
4 D
# shift the values down by 2 rows
df.shift(2)
0
0 None
1 None
2 A
3 B
4 C
# shift the values up by 1 row
df.shift(-1)
0
0 B
1 C
2 D
3 E
4 None
# shift the values up by 2 rows
df.shift(-2)
0
0 C
1 D
2 E
3 None
4 None

pct_change()#

Percentage change between the current and a prior value.

df = pd.DataFrame([500, 400, 600, 150, 180])
df
0
0 500
1 400
2 600
3 150
4 180
df.pct_change()
0
0 NaN
1 -0.20
2 0.50
3 -0.75
4 0.20

rolling()#

It is used to do calculations using a rolling window.

  • The most commonly used methods following the rolling() method are sum(), mean(), median(), and std().

df
0
0 500
1 400
2 600
3 150
4 180
  • rolling() returns a Rolling object.

  • The window parameter specifies the size of the moving window and must be provided.

# window=3
type(df.rolling(3))
pandas.core.window.rolling.Rolling
df.rolling(window=3).sum()
0
0 NaN
1 NaN
2 1500.0
3 1150.0
4 930.0

With a moving window size of 3, the row triplets and corresponding values are as follows:

  • Rows 0, 1, and 2: The sum of 500, 400, and 600 is 1500.

  • Rows 1, 2, and 3: The sum of 400, 600, and 150 is 1150.

  • Rows 2, 3, and 4: The sum of 600, 150, and 180 is 930.

The first two rolling sum values are NaN (not a number) because a minimum of 3 values is required to compute the sum

corr()#

It returns the correlation coefficients of all column pairs.

Let’s find the correlation coefficients for the following DataFrame.

df4 = pd.DataFrame( {'x-values': [1,2,3,4,5], 'square-values': [1,4,9,16,25],'cube-values': [1,8,27,64,125],
                    'random':[9,4,3,1,8], 'reciprocal': [1/1,1/2,1/3,1/4,1/5]})
df4
x-values square-values cube-values random reciprocal
0 1 1 1 9 1.000000
1 2 4 8 4 0.500000
2 3 9 27 3 0.333333
3 4 16 64 1 0.250000
4 5 25 125 8 0.200000
df4.corr()
x-values square-values cube-values random reciprocal
x-values 1.000000 0.981105 0.943118 -0.233126 -0.901754
square-values 0.981105 1.000000 0.989216 -0.053368 -0.806339
cube-values 0.943118 0.989216 1.000000 0.088235 -0.722074
random -0.233126 -0.053368 0.088235 1.000000 0.553018
reciprocal -0.901754 -0.806339 -0.722074 0.553018 1.000000

The correlation matrix can be visualized using Seaborn’s heatmap function.

import seaborn as sns
sns.heatmap(df4.corr(), annot=True);
_images/77186bedeaba46a594999660930c816e9a5d7bb06d65df872f313280392ba6d5.png

groupby()#

The groupby() function is used to group rows based on column values.

  • It is commonly used to apply functions, such as calculating the sum, to each group.

df5 = pd.DataFrame({'gender':['M','F','F','M','F'], 'age':[35, 50, 25, 30, 40]})
df5
gender age
0 M 35
1 F 50
2 F 25
3 M 30
4 F 40
  • The groupby() method below returns a GroupBy object with two groups, corresponding to ‘M’ and ‘F’.

df5.groupby('gender')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x124ae3d90>
  • The sum() method below calculates the total of the age values for the ‘M’ and ‘F’ groups.

df5.groupby('gender').sum()
age
gender
F 115
M 65
  • The mean() method below calculates the mean of the age values for the ‘M’ and ‘F’ groups.

df5.groupby('gender').mean()
age
gender
F 38.333333
M 32.500000
  • The count() method below determines the number of samples (rows) in the ‘M’ and ‘F’ groups.

df5.groupby('gender').count()
age
gender
F 3
M 2

agg()#

The agg() methodapplies specified functions to values along a given axis.

  • Syntax: pd.agg(function, axis=0)

df = pd.DataFrame([[7,2], [3,4], [5,6]], columns=['A', 'B'])
df
A B
0 7 2
1 3 4
2 5 6
df.agg('sum')
A    15
B    12
dtype: int64
df.agg('sum', axis=1)
0     9
1     7
2    11
dtype: int64
df.agg(['sum', 'min'])
A B
sum 15 12
min 3 2
df.agg(lambda x: x**2)
A B
0 49 4
1 9 16
2 25 36
df
A B
0 7 2
1 3 4
2 5 6

iterrows()#

The iterrows() method iterates through the rows of a DataFrame, returning each row as an (index, Series) pair.

  • It returns a generator object.

  • The index represents the row index in the DataFrame.

df.iterrows()
<generator object DataFrame.iterrows at 0x13480ace0>
list(df.iterrows())
[(0,
  A    7
  B    2
  Name: 0, dtype: int64),
 (1,
  A    3
  B    4
  Name: 1, dtype: int64),
 (2,
  A    5
  B    6
  Name: 2, dtype: int64)]
list(df.iloc[1:,:].iterrows())
[(1,
  A    3
  B    4
  Name: 1, dtype: int64),
 (2,
  A    5
  B    6
  Name: 2, dtype: int64)]
for idx, row in df.iterrows():
    print(idx, row.values)
0 [7 2]
1 [3 4]
2 [5 6]

apply()#

The apply() method is used to apply a function to either rows or columns of a DataFrame.

  • Syntax: df.apply(function, axis=0)

  • The default value of axis is 0, meaning the function is applied to each column.

  • If axis=1, the function is applied to each row.

  • Additionally, a lambda function can be used within apply() to define the function inline.

We will use the apply() method on the following DataFrame.

df
A B
0 7 2
1 3 4
2 5 6

In the following example, a custom square function is used to compute the square of values in each column.

def square_function(x):
    return x**2
df.apply(square_function)
A B
0 49 4
1 9 16
2 25 36

In the following example, a lambda function is used to define the square function.

df.apply(lambda x: x**2)
A B
0 49 4
1 9 16
2 25 36

In the following example, the sum() function from NumPy is used to compute the sum of values in each column.

import numpy as np
df.apply(np.sum)
A    15
B    12
dtype: int64

In the following example, the sum() function from NumPy is used to compute the sum of values in each row.

df.apply(np.sum, axis=1)
0     9
1     7
2    11
dtype: int64

The following example calculates the weighted sum of particular DataFrame columns using a custom function and a lambda function.

def add_weighted_columns(row):
    return 0.2*row['A'] + 0.8*row['B']

df.apply(add_weighted_columns, axis=1)
0    3.0
1    3.8
2    5.8
dtype: float64
df.apply(lambda  row: 0.2*row['A'] + 0.8*row['B'], axis=1)
0    3.0
1    3.8
2    5.8
dtype: float64

at()#

The at() method is used to access a single value from a DataFrame using a specific row and column label pair.

It is similar to loc, but at() is used for retrieving a single value and does not support slicing.

df
A B
0 7 2
1 3 4
2 5 6
df.at[1,'B']
4