DataFrames Attributes and Methods#

Section Title: DataFrames Attributes and Methods

import pandas as pd

In this section, we will cover basic attributes and methods of dataframes.

  • Throughout this section, we’ll be using the following three dataframes.

state_height_dic = {'State':['NY', 'OK', 'TX'], 'Height':[160, 180, 140]}
df_sh = pd.DataFrame(state_height_dic, index=['John', 'Liz', 'Tim'])
df_sh
State Height
John NY 160
Liz OK 180
Tim TX 140
df_grades = pd.read_csv('https://raw.githubusercontent.com/datasmp/datasets/main/grades.csv')
df_grades.head()
Name ID Grade Gender HW Test-1 Test-2 Test-3 Test-4 Final
0 mqtvy 37047871 10 M 30 91 69 93 17 50
1 jbbsx 35439616 11 F 6 18 93 9 98 91
2 mrvab 35543247 11 M 78 92 60 43 34 26
3 bjyve 61282135 9 M 60 8 10 99 80 87
4 rlpsr 53448034 10 M 3 38 45 43 79 69
df_stock = pd.read_excel('https://raw.githubusercontent.com/datasmp/datasets/main/stock.xlsx')
df_stock.head()
Date APPLE TESLA AMAZON VISA SP500
0 2020-01-02 74.33 86.05 1898.01 189.66 3257.85
1 2020-01-03 73.61 88.60 1874.97 188.15 3234.85
2 2020-01-06 74.20 90.31 1902.88 187.74 3246.28
3 2020-01-07 73.85 93.81 1906.86 187.24 3237.18
4 2020-01-08 75.04 98.43 1891.97 190.45 3253.05

Attributes#

  • Attributes provide information about the dataframe.

shape and len()#

  • shape returns the tuple (number of rows, number of columns).

  • len() returns the number of rows.

# 3 rows and 2 columns
df_sh.shape
(3, 2)
# 3 rows
len(df_sh)
3
# 100 rows and 10 columns
df_grades.shape
(100, 10)
# 100 rows 
len(df_grades)
100

index and columns#

  • index returns row labels.

  • columns returns column labels.

# row labels
df_sh.index
Index(['John', 'Liz', 'Tim'], dtype='object')
# column labels
df_sh.columns
Index(['State', 'Height'], dtype='object')

values#

The values attribute removes labels and returns an array.

df_sh.values
array([['NY', 160],
       ['OK', 180],
       ['TX', 140]], dtype=object)

Methods#

head() and tail()#

  • head(n) displays the first n rows.

  • tail(n) displays the last n rows.

  • The default value of \(n\) is 5.

# first 5 rows
df_grades.head()
Name ID Grade Gender HW Test-1 Test-2 Test-3 Test-4 Final
0 mqtvy 37047871 10 M 30 91 69 93 17 50
1 jbbsx 35439616 11 F 6 18 93 9 98 91
2 mrvab 35543247 11 M 78 92 60 43 34 26
3 bjyve 61282135 9 M 60 8 10 99 80 87
4 rlpsr 53448034 10 M 3 38 45 43 79 69
# first 7 rows
df_grades.head(7)
Name ID Grade Gender HW Test-1 Test-2 Test-3 Test-4 Final
0 mqtvy 37047871 10 M 30 91 69 93 17 50
1 jbbsx 35439616 11 F 6 18 93 9 98 91
2 mrvab 35543247 11 M 78 92 60 43 34 26
3 bjyve 61282135 9 M 60 8 10 99 80 87
4 rlpsr 53448034 10 M 3 38 45 43 79 69
5 hotxq 96808584 9 F 29 18 61 17 72 39
6 ylskw 37901138 9 M 53 85 61 87 9 94
# last 5 rows
df_grades.tail()
Name ID Grade Gender HW Test-1 Test-2 Test-3 Test-4 Final
95 zxuaq 58332078 12 M 57 1 48 61 58 99
96 vegye 10019910 11 F 88 30 48 85 48 17
97 euwlj 27967150 12 F 70 15 72 77 20 65
98 oesms 36171468 10 M 88 16 100 62 76 19
99 buugr 88783139 10 M 23 74 14 17 40 59
# last 7 rows
df_grades.tail(7)
Name ID Grade Gender HW Test-1 Test-2 Test-3 Test-4 Final
93 yzljj 10884705 10 M 92 75 41 33 61 66
94 jloze 30408806 10 M 19 24 73 98 48 95
95 zxuaq 58332078 12 M 57 1 48 61 58 99
96 vegye 10019910 11 F 88 30 48 85 48 17
97 euwlj 27967150 12 F 70 15 72 77 20 65
98 oesms 36171468 10 M 88 16 100 62 76 19
99 buugr 88783139 10 M 23 74 14 17 40 59

info()#

The info() method provides basic information about the DataFrame, including:

  • row labels: RangeIndex: 100 entries, 0 to 99

  • column labels: ‘Name’, ‘ID’, ‘Grade’, ‘Gender’, ‘HW’, ‘Test-1’, ‘Test-2’, ‘Test-3’, ‘Test-4’, ‘Final’

  • number of columns: 10

  • Non-null (non-missing) values: 100 for each column

  • dtype: object (strings), int64

  • memory usage: 7.9+ KB

df_grades.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    100 non-null    object
 1   ID      100 non-null    int64 
 2   Grade   100 non-null    int64 
 3   Gender  100 non-null    object
 4   HW      100 non-null    int64 
 5   Test-1  100 non-null    int64 
 6   Test-2  100 non-null    int64 
 7   Test-3  100 non-null    int64 
 8   Test-4  100 non-null    int64 
 9   Final   100 non-null    int64 
dtypes: int64(8), object(2)
memory usage: 7.9+ KB
df_grades.columns
Index(['Name', 'ID', 'Grade', 'Gender', 'HW', 'Test-1', 'Test-2', 'Test-3',
       'Test-4', 'Final'],
      dtype='object')

describe()#

It returns the descriptive statistics of the columns of the DataFrame, including:

  • Count

  • Mean

  • Standard deviation (std)

  • Minimum (min)

  • 25th percentile

  • 50th percentile (median)

  • 75th percentile

  • Maximum (max)

df_stock.describe()
APPLE TESLA AMAZON VISA SP500
count 252.000000 252.000000 252.000000 252.000000 252.000000
mean 94.753056 288.347579 2678.569087 192.986984 3215.720079
std 21.836181 167.278855 545.632095 15.541861 318.045077
min 55.660000 72.240000 1676.610000 134.900000 2237.400000
25% 76.622500 149.745000 2146.120000 187.957500 3034.530000
50% 90.960000 232.830000 2884.500000 196.205000 3275.360000
75% 115.350000 421.035000 3164.790000 203.785000 3403.050000
max 136.490000 695.000000 3531.450000 218.020000 3735.360000

value_counts()#

It returns the number of occurrences for each value of the column.

# There are 54 occurrences of 'M' and 46 occurrences of 'F' in the column Gender.
df_grades['Gender'].value_counts()
Gender
M    54
F    46
Name: count, dtype: int64
# There are 27 occurrences of 'nine', 26 occurrences of 'eleven', 24 occurrences of 'twelve', 
# and 23 occurrences of 'ten' in the Grade column.
df_grades['Grade'].value_counts()
Grade
9     27
11    26
12    24
10    23
Name: count, dtype: int64

copy()#

The copy() method returns a copy of the DataFrame.

df_sh_copy = df_sh.copy()
df_sh_copy
State Height
John NY 160
Liz OK 180
Tim TX 140

drop()#

The drop() method is used to remove column(s) and row(s) from the DataFrame.

  • axis=0 means that the given row(s) are dropped (default).

  • axis=1 means that the given column(s) are dropped.

# row with label 'Liz' is removed
df_sh.drop('Liz', axis=0)
State Height
John NY 160
Tim TX 140
# column with label 'Height' is removed
df_sh.drop('Height', axis=1)
State
John NY
Liz OK
Tim TX
  • By default, the drop() method does not change the DataFrame.

  • It just displays how the DataFrame would look if the dropping were performed.

# no change on df_sh
df_sh
State Height
John NY 160
Liz OK 180
Tim TX 140
  • To make a change to the DataFrame, the parameter inplace should be set to True, which is False by default.

# no 'State' column any more
df_sh.drop('Height', axis=1, inplace=True)  
df_sh
State
John NY
Liz OK
Tim TX
# no 'Tim' row any more
df_sh.drop('Tim', inplace=True) #default value of axis=0  
df_sh
State
John NY
Liz OK

reset_index()#

The reset_index() method is used to reset the row labels of a DataFrame to default values: \(0, 1, 2, ...\)

  • It also adds the initial row labels as a new column with label index.

df_sh.reset_index()
index State
0 John NY
1 Liz OK
  • If you do not want to keep the initial index values, you can set the drop parameter to True.

# no index column
df_sh.reset_index(drop=True)
State
0 NY
1 OK
  • reset_index() does not change the DataFrame unless inplace=True.

# no change on row labels
df_sh
State
John NY
Liz OK
# row labels are changed
df_sh.reset_index(inplace=True)
df_sh
index State
0 John NY
1 Liz OK

set_index()#

The set_index() method is used to set the DataFrame row labels to an existing column.

  • The label of the column that becomes the new index is preserved for future use.

# Use 'State' column values as new row labels 
df_sh.set_index('State')
index
State
NY John
OK Liz
  • set_index() does not change the DataFrame unless inplace=True

# no change on row labels
df_sh
index State
0 John NY
1 Liz OK
# row labels are changed
df_sh.set_index('State', inplace=True)
df_sh
index
State
NY John
OK Liz
  • Let’s update the index of df_stock using the Date column.

df_stock.set_index('Date', inplace=True)
df_stock.head()
APPLE TESLA AMAZON VISA SP500
Date
2020-01-02 74.33 86.05 1898.01 189.66 3257.85
2020-01-03 73.61 88.60 1874.97 188.15 3234.85
2020-01-06 74.20 90.31 1902.88 187.74 3246.28
2020-01-07 73.85 93.81 1906.86 187.24 3237.18
2020-01-08 75.04 98.43 1891.97 190.45 3253.05

sum(), mean(), median(), std()#

When axis=0, computation is column-wise (vertically). This means that for each column, the operation is performed on different row values.

  • This is the default value.

When axis=1, computation is row-wise (horizontally). This means that for each row, the operation is performed on different column values.

# axis=0
df_stock.sum(0)    # sum of values in each column
APPLE      23877.77
TESLA      72663.59
AMAZON    674999.41
VISA       48632.72
SP500     810361.46
dtype: float64
# axis=1
df_stock.sum(1)    # sum of values in each row
Date
2020-01-02    5505.90
2020-01-03    5460.18
2020-01-06    5501.41
2020-01-07    5498.94
2020-01-08    5508.94
               ...   
2020-12-23    7857.00
2020-12-24    7877.67
2020-12-28    8031.80
2020-12-29    8063.74
2020-12-30    8064.21
Length: 252, dtype: float64
# mean of each column
df_stock.mean()
APPLE       94.753056
TESLA      288.347579
AMAZON    2678.569087
VISA       192.986984
SP500     3215.720079
dtype: float64
# median of each column
df_stock.median()
APPLE       90.960
TESLA      232.830
AMAZON    2884.500
VISA       196.205
SP500     3275.360
dtype: float64
# standard deviation of each column
df_stock.std()
APPLE      21.836181
TESLA     167.278855
AMAZON    545.632095
VISA       15.541861
SP500     318.045077
dtype: float64

rename()#

The rename() method is used to rename column or row labels.

df_sh
index
State
NY John
OK Liz
  • Let’s rename the index column as “Name”.

df_sh.rename(columns={'index':'Name'})
Name
State
NY John
OK Liz
# no change on df_sh
df_sh
index
State
NY John
OK Liz
  • rename() does not change the DataFrame unless inplace=True.

# 'index' --> 'Name'
df_sh.rename(columns={'index':'Name'}, inplace=True)
df_sh
Name
State
NY John
OK Liz

sort_values()#

It sorts the dataframe based on the provided column in ascending or descending order.

  • By default, sorting is done in ascending order.

  • To sort in descending order, set the ascending parameter to False.

# ascending
df_grades.sort_values('Test-2').head()
Name ID Grade Gender HW Test-1 Test-2 Test-3 Test-4 Final
12 jjgag 46794501 11 F 8 43 1 7 82 36
19 cxkpb 71808293 10 M 87 80 4 14 77 87
62 mwsqn 85824590 11 M 89 95 4 34 94 21
65 yxchm 48696307 9 M 33 54 5 39 94 99
61 rulsy 40694890 12 M 4 8 6 81 62 46
# descending
df_grades.sort_values('Test-2', ascending=False).head()
Name ID Grade Gender HW Test-1 Test-2 Test-3 Test-4 Final
16 abqln 59927557 11 F 26 97 100 46 94 78
98 oesms 36171468 10 M 88 16 100 62 76 19
74 xohww 81223224 9 M 32 24 100 81 59 73
90 wxtpu 98301627 12 M 44 57 99 87 9 62
25 tkhtm 56044667 12 F 20 67 97 13 73 41

duplicated()#

  • It returns a Series with boolean values showing if a row is a repeat or not.

state_height_dic = {'State':['NY', 'OK', 'TX', 'TX'], 'Height':[160, 180, 140, 140]}
df_sh = pd.DataFrame(state_height_dic, index=['John', 'Liz', 'Tim', 'Tim'])
df_sh
State Height
John NY 160
Liz OK 180
Tim TX 140
Tim TX 140
df_sh.duplicated()
John    False
Liz     False
Tim     False
Tim      True
dtype: bool

drop_duplicates()#

  • It returns a DataFrame with duplicate rows removed.

df_sh.drop_duplicates()
State Height
John NY 160
Liz OK 180
Tim TX 140

to_excel()#

to_csv()#