DataFrames#

A DataFrame is a two-dimensional data structure characterized by labeled axes for rows and columns.
Similar to a Microsoft Excel worksheet, it efficiently accommodates mixed data types.
Primarily used for data wrangling and manipulation.
DataFrames offer a wide range of methods for tasks such as data cleaning, exploration, and preprocessing.
They support vectorized operations across rows or columns and facilitate visualization.
As the main data structure of the pandas library, DataFrames require importing the library to create and manipulate them, and to utilize their methods.
It’s common practice to abbreviate pandas as pd for brevity in code
import pandas as pd.
Create Dataframes#
import pandas as pd
Use Tuples or Lists#
Use
pd.DataFrame()to create a dataframe.When using pd.DataFrame(), you can provide column and row labels. If they are not provided, default values \(0, 1, 2, \ldots\) are used.
state_tuple = ('NY', 'OK', 'TX')
df = pd.DataFrame(state_tuple)
display(df)
| 0 | |
|---|---|
| 0 | NY |
| 1 | OK |
| 2 | TX |
state_list = ['NY', 'OK', 'TX']
df = pd.DataFrame(state_tuple)
display(df)
| 0 | |
|---|---|
| 0 | NY |
| 1 | OK |
| 2 | TX |
You can provide row and column labels using the index and columns parameters
state_list = ['NY', 'OK', 'TX']
df = pd.DataFrame(state_tuple, index=['John', 'Liz', 'Tim'], columns=['State'])
display(df)
| State | |
|---|---|
| John | NY |
| Liz | OK |
| Tim | TX |
Use a Dictionary#
The keys of the dictionary serve as column labels.
state_height_dic = {'State':['NY', 'OK', 'TX'], 'Height':[160, 180, 140]}
df = pd.DataFrame(state_height_dic)
display(df)
| State | Height | |
|---|---|---|
| 0 | NY | 160 |
| 1 | OK | 180 |
| 2 | TX | 140 |
You can change row labels.
state_height_dic = {'State':['NY', 'OK', 'TX'], 'Height':[160, 180, 140]}
df_sh = pd.DataFrame(state_height_dic, index=['John', 'Liz', 'Tim'])
df_sh
| State | Height | |
|---|---|---|
| John | NY | 160 |
| Liz | OK | 180 |
| Tim | TX | 140 |
Use an Array#
import numpy as np
state_array = np.array([['NY',160], ['OK', 180], ['TX', 140]])
state_array
array([['NY', '160'],
['OK', '180'],
['TX', '140']], dtype='<U21')
df = pd.DataFrame(state_array)
display(df)
| 0 | 1 | |
|---|---|---|
| 0 | NY | 160 |
| 1 | OK | 180 |
| 2 | TX | 140 |
Import from Websites#
If the URL of the website ends with
.csv, import the data from the website as a CSV file.To import a CSV file, use
pd.read_csv().
If the URL ends with
.xlsx, import the data as an Excel file.To import an Excel file, use
pd.read_excel().
To import a table from a website, you can use pd.read_html().
This function returns a list containing all tables found on the website.
You can use indexing to access the specific dataframe you want from the list.
The
head()method displays the first 5 rows.The
tail()method displays the last 5 rows.
df_grades = pd.read_csv('https://raw.githubusercontent.com/datasmp/datasets/main/grades.csv')
df_grades.head()
| Name | ID | Grade | Gender | HW | Test-1 | Test-2 | Test-3 | Test-4 | Final | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | mqtvy | 37047871 | 10 | M | 30 | 91 | 69 | 93 | 17 | 50 |
| 1 | jbbsx | 35439616 | 11 | F | 6 | 18 | 93 | 9 | 98 | 91 |
| 2 | mrvab | 35543247 | 11 | M | 78 | 92 | 60 | 43 | 34 | 26 |
| 3 | bjyve | 61282135 | 9 | M | 60 | 8 | 10 | 99 | 80 | 87 |
| 4 | rlpsr | 53448034 | 10 | M | 3 | 38 | 45 | 43 | 79 | 69 |
df_stock = pd.read_excel('https://raw.githubusercontent.com/datasmp/datasets/main/stock.xlsx')
df_stock.head()
| Date | APPLE | TESLA | AMAZON | VISA | SP500 | |
|---|---|---|---|---|---|---|
| 0 | 2020-01-02 | 74.33 | 86.05 | 1898.01 | 189.66 | 3257.85 |
| 1 | 2020-01-03 | 73.61 | 88.60 | 1874.97 | 188.15 | 3234.85 |
| 2 | 2020-01-06 | 74.20 | 90.31 | 1902.88 | 187.74 | 3246.28 |
| 3 | 2020-01-07 | 73.85 | 93.81 | 1906.86 | 187.24 | 3237.18 |
| 4 | 2020-01-08 | 75.04 | 98.43 | 1891.97 | 190.45 | 3253.05 |
# list of tables from a website imported as dataframes
df_list = pd.read_html('https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States')
type(df_list)
list
# there are 26 tables in the website
len(df_list)
21
# second table
df_list[1].head()
| Flag, name and postal abbreviation[8] | Cities | Ratification or admission[A] | Population (2020)[10] | Total area[11] | Reps. | ||||
|---|---|---|---|---|---|---|---|---|---|
| Flag, name and postal abbreviation[8] | Flag, name and postal abbreviation[8].1 | Capital | Largest[12] | Ratification or admission[A] | Population (2020)[10] | mi2 | km2 | Reps. | |
| 0 | Alabama | AL | Montgomery | Huntsville | Dec 14, 1819 | 5024279 | 52420 | 135767 | 7 |
| 1 | Alaska | AK | Juneau | Anchorage | Jan 3, 1959 | 733391 | 665384 | 1723337 | 1 |
| 2 | Arizona | AZ | Phoenix | Phoenix | Feb 14, 1912 | 7151502 | 113990 | 295234 | 9 |
| 3 | Arkansas | AR | Little Rock | Little Rock | Jun 15, 1836 | 3011524 | 53179 | 137732 | 4 |
| 4 | California | CA | Sacramento | Los Angeles | Sep 9, 1850 | 39538223 | 163695 | 423967 | 52 |
Import from a Local Computer#
To import an Excel file, use pd.read_excel().
To import a CSV file, use pd.read_csv().
You need to provide the path of the Excel or CSV file that will be imported.
In the following example, the grades CSV file is located in the data_files folder.
df_grades = pd.read_csv('data_files/grades.csv')
df_grades.head()
| Name | ID | Grade | Gender | HW | Test-1 | Test-2 | Test-3 | Test-4 | Final | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | mqtvy | 37047871 | 10 | M | 30 | 91 | 69 | 93 | 17 | 50 |
| 1 | jbbsx | 35439616 | 11 | F | 6 | 18 | 93 | 9 | 98 | 91 |
| 2 | mrvab | 35543247 | 11 | M | 78 | 92 | 60 | 43 | 34 | 26 |
| 3 | bjyve | 61282135 | 9 | M | 60 | 8 | 10 | 99 | 80 | 87 |
| 4 | rlpsr | 53448034 | 10 | M | 3 | 38 | 45 | 43 | 79 | 69 |
df_stock = pd.read_excel('data_files/stock.xlsx')
df_stock.head()
| Date | APPLE | TESLA | AMAZON | VISA | SP500 | |
|---|---|---|---|---|---|---|
| 0 | 2020-01-02 | 74.33 | 86.05 | 1898.01 | 189.66 | 3257.85 |
| 1 | 2020-01-03 | 73.61 | 88.60 | 1874.97 | 188.15 | 3234.85 |
| 2 | 2020-01-06 | 74.20 | 90.31 | 1902.88 | 187.74 | 3246.28 |
| 3 | 2020-01-07 | 73.85 | 93.81 | 1906.86 | 187.24 | 3237.18 |
| 4 | 2020-01-08 | 75.04 | 98.43 | 1891.97 | 190.45 | 3253.05 |
Import from Google Drive#
You need to mount to establish the connection between your Google Drive and Google Colab notebook using the following code.
from google.colab import drive
drive.mount('/content/drive')
You can import the Excel or CSV file by providing the file’s path.
df_grades = pd.read_excel('/content/drive/My Drive/data_files/grades.xlsx')
In the code above data_files is a folder in My Drive.
grades.xlsx is an Excel file in data_files folder.
df_grades.head()
| Name | ID | Grade | Gender | HW | Test-1 | Test-2 | Test-3 | Test-4 | Final | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | mqtvy | 37047871 | 10 | M | 30 | 91 | 69 | 93 | 17 | 50 |
| 1 | jbbsx | 35439616 | 11 | F | 6 | 18 | 93 | 9 | 98 | 91 |
| 2 | mrvab | 35543247 | 11 | M | 78 | 92 | 60 | 43 | 34 | 26 |
| 3 | bjyve | 61282135 | 9 | M | 60 | 8 | 10 | 99 | 80 | 87 |
| 4 | rlpsr | 53448034 | 10 | M | 3 | 38 | 45 | 43 | 79 | 69 |
Accessing values#
Select columns#
A single column or multiple columns can be accessed using square brackets.
If multiple columns are accessed, column labels can be provided as a list.
When accessing single columns, if the column label consists of letters and digits, a period (.) can also be used instead of square brackets.
# df_stock has 6 columns.
df_stock.head()
| Date | APPLE | TESLA | AMAZON | VISA | SP500 | |
|---|---|---|---|---|---|---|
| 0 | 2020-01-02 | 74.33 | 86.05 | 1898.01 | 189.66 | 3257.85 |
| 1 | 2020-01-03 | 73.61 | 88.60 | 1874.97 | 188.15 | 3234.85 |
| 2 | 2020-01-06 | 74.20 | 90.31 | 1902.88 | 187.74 | 3246.28 |
| 3 | 2020-01-07 | 73.85 | 93.81 | 1906.86 | 187.24 | 3237.18 |
| 4 | 2020-01-08 | 75.04 | 98.43 | 1891.97 | 190.45 | 3253.05 |
df_stick[‘APPLE’] returns only the APPLE column as a Series type.
df_stock['APPLE'].head()
0 74.33
1 73.61
2 74.20
3 73.85
4 75.04
Name: APPLE, dtype: float64
Using a period (.) instead of square brackets ([]).
df_stock.APPLE.head()
0 74.33
1 73.61
2 74.20
3 73.85
4 75.04
Name: APPLE, dtype: float64
When [‘APPLE’,’TESLA’] is provided as a list inside the square brackets, it returns a DataFrame with only two columns.
df_stock[['APPLE','TESLA']].head()
| APPLE | TESLA | |
|---|---|---|
| 0 | 74.33 | 86.05 |
| 1 | 73.61 | 88.60 |
| 2 | 74.20 | 90.31 |
| 3 | 73.85 | 93.81 |
| 4 | 75.04 | 98.43 |
Select rows#
The loc operator can be used to access single or multiple rows using row labels.
# row with label 'Tim'
df_sh.loc['Tim']
State TX
Height 140
Name: Tim, dtype: object
# rows with labels 'John' and 'Tim'
df_sh.loc[['John', 'Tim']]
| State | Height | |
|---|---|---|
| John | NY | 160 |
| Tim | TX | 140 |
Slicing#
Slicing can be performed either by using the row and column indexes or labels.
For row and column indexes: Use iloc[row index, column index].
For row and column labels: Use loc[row label, column label].
Let’s recall the small DataFrame df_sh.
df_sh
| State | Height | |
|---|---|---|
| John | NY | 160 |
| Liz | OK | 180 |
| Tim | TX | 140 |
Let’s try to access Tim’s height, which is 140, using both loc and iloc.
The row label for Tim’s height of 140 is Tim.
The column label for the height is Height.
df_sh.loc['Tim', 'Height']
140
Let’s try to access Tim’s height, which is 140, using iloc.
The row index for Tim is the index of Tim in the list [‘John’, ‘Liz’, ‘Tim’], which is 2.
The column index for the height is the index of Height in the list [‘State’, ‘Height’], which is 1.
df_sh.iloc[2,1]
140
You can also perform slicing similar to strings, tuples, and lists.
When using slicing with loc (labels), there’s a small difference:
loc is inclusive for the end label, meaning the end label is included.
For example, considering the column labels of the df_sh DataFrame are [‘State’, ‘Height’]:
df_sh.loc[‘Liz’:,:’Height’] includes the columns up to and including ‘Height’.
# rows starting from index 1 (includes 1) --> row indexes: 1, 2
# columns up to index 1 (does not include 1) --> column index: 0
df_sh.iloc[1:,:1]
| State | |
|---|---|
| Liz | OK |
| Tim | TX |
# rows starting from 'Liz' (includes 'Liz') --> row labels: 1, 2
# columns up to label 'Height' (includes 'Height') --> column labels: 'State', 'Height'
df_sh.loc['Liz':,:'Height']
| State | Height | |
|---|---|---|
| Liz | OK | 180 |
| Tim | TX | 140 |
Filtering#
Boolean expressions can be used to select specific parts of the DataFrame based on certain conditions.
For example, df[‘column_label’] > 90 returns a Series with a boolean value for each row, indicating whether the condition is True or False for that row.
# returns a boolean value for each row.
df_grades['Test-3']>90
0 True
1 False
2 False
3 True
4 False
...
95 False
96 False
97 False
98 False
99 False
Name: Test-3, Length: 100, dtype: bool
Two conditions can be combined with a boolean operator to filter the DataFrame based on multiple criteria.
(df_grades['Test-2']>65) & (df_grades['Test-3']>90)
0 True
1 False
2 False
3 False
4 False
...
95 False
96 False
97 False
98 False
99 False
Length: 100, dtype: bool
These boolean values can be used within square brackets to select a subset of the DataFrame.
Example: The following code selects the rows for which the grade for Test-3 is more than 90.
df_grades[ df_grades['Test-3']>90 ]
| Name | ID | Grade | Gender | HW | Test-1 | Test-2 | Test-3 | Test-4 | Final | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | mqtvy | 37047871 | 10 | M | 30 | 91 | 69 | 93 | 17 | 50 |
| 3 | bjyve | 61282135 | 9 | M | 60 | 8 | 10 | 99 | 80 | 87 |
| 9 | iyhzx | 840574 | 11 | M | 91 | 89 | 37 | 93 | 96 | 28 |
| 24 | affft | 61590869 | 9 | M | 65 | 29 | 44 | 100 | 83 | 35 |
| 35 | gauus | 73384666 | 9 | M | 19 | 58 | 94 | 91 | 76 | 19 |
| 39 | micxs | 32507584 | 12 | M | 17 | 62 | 20 | 95 | 26 | 67 |
| 58 | rhjkj | 15745708 | 12 | F | 59 | 61 | 9 | 98 | 5 | 10 |
| 81 | mndro | 52386329 | 11 | M | 76 | 88 | 85 | 91 | 80 | 44 |
| 92 | lagws | 82526433 | 12 | F | 12 | 45 | 97 | 92 | 20 | 29 |
| 94 | jloze | 30408806 | 10 | M | 19 | 24 | 73 | 98 | 48 | 95 |
Example: The following code selects the rows for which the grade for Test-2 is more than 65 and the grade for Test-3 is more than 90.
df_grades[ (df_grades['Test-2']>65) & (df_grades['Test-3']>90) ]
| Name | ID | Grade | Gender | HW | Test-1 | Test-2 | Test-3 | Test-4 | Final | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | mqtvy | 37047871 | 10 | M | 30 | 91 | 69 | 93 | 17 | 50 |
| 35 | gauus | 73384666 | 9 | M | 19 | 58 | 94 | 91 | 76 | 19 |
| 81 | mndro | 52386329 | 11 | M | 76 | 88 | 85 | 91 | 80 | 44 |
| 92 | lagws | 82526433 | 12 | F | 12 | 45 | 97 | 92 | 20 | 29 |
| 94 | jloze | 30408806 | 10 | M | 19 | 24 | 73 | 98 | 48 | 95 |
Example: The following code selects the rows for which Gender is male and the grade for Final is more than 95.
df_grades[ (df_grades['Gender']=='M') & (df_grades['Final']>95) ]
| Name | ID | Grade | Gender | HW | Test-1 | Test-2 | Test-3 | Test-4 | Final | |
|---|---|---|---|---|---|---|---|---|---|---|
| 29 | htboh | 16544350 | 10 | M | 22 | 93 | 35 | 42 | 12 | 99 |
| 36 | hbkmj | 92596786 | 11 | M | 55 | 14 | 74 | 49 | 75 | 98 |
| 56 | hnewl | 2809344 | 12 | M | 4 | 15 | 45 | 18 | 27 | 97 |
| 65 | yxchm | 48696307 | 9 | M | 33 | 54 | 5 | 39 | 94 | 99 |
| 95 | zxuaq | 58332078 | 12 | M | 57 | 1 | 48 | 61 | 58 | 99 |
Adding Columns and Rows#
To add a new column, use df['new_col_label'] = New column values.
The length of the New column values should be equal to the number of rows to fit the DataFrame.
To add a new row, use df.loc['new_row_label'] = New row values.
The length of the New row values should be equal to the number of columns to fit the DataFrame.
df_sh
| State | Height | |
|---|---|---|
| John | NY | 160 |
| Liz | OK | 180 |
| Tim | TX | 140 |
To add a new column with values 150, 190, and 170 and label it ‘Weight’, you can use the following code:
df_sh['Weight'] = [150, 190, 170] # you can also use a tuple
df_sh
| State | Height | Weight | |
|---|---|---|---|
| John | NY | 160 | 150 |
| Liz | OK | 180 | 190 |
| Tim | TX | 140 | 170 |
To add a new row with values ‘Jack’, 190, 185 and label it ‘TX’, you can use the following code:
df_sh.loc['TX'] = ['Jack', 190, 185] # you can also use a tuple
df_sh
| State | Height | Weight | |
|---|---|---|---|
| John | NY | 160 | 150 |
| Liz | OK | 180 | 190 |
| Tim | TX | 140 | 170 |
| TX | Jack | 190 | 185 |
Vectorized Operations#
Vectorized operations can be performed using the columns of a DataFrame, meaning that a single algebraic operation, such as addition (+), can be executed for each row.
df_grades.head()
| Name | ID | Grade | Gender | HW | Test-1 | Test-2 | Test-3 | Test-4 | Final | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | mqtvy | 37047871 | 10 | M | 30 | 91 | 69 | 93 | 17 | 50 |
| 1 | jbbsx | 35439616 | 11 | F | 6 | 18 | 93 | 9 | 98 | 91 |
| 2 | mrvab | 35543247 | 11 | M | 78 | 92 | 60 | 43 | 34 | 26 |
| 3 | bjyve | 61282135 | 9 | M | 60 | 8 | 10 | 99 | 80 | 87 |
| 4 | rlpsr | 53448034 | 10 | M | 3 | 38 | 45 | 43 | 79 | 69 |
# sum of Test-1 and Test-2 grades for each student
df_grades['Test-1'] + df_grades['Test-2']
0 160
1 111
2 152
3 18
4 83
...
95 49
96 78
97 87
98 116
99 88
Length: 100, dtype: int64
# difference of Test-1 and Test-2 grades for each student
df_grades['Test-1'] - df_grades['Test-2']
0 22
1 -75
2 32
3 -2
4 -7
..
95 -47
96 -18
97 -57
98 -84
99 60
Length: 100, dtype: int64
# twice the Test-1 grades
2*df_grades['Test-1']
0 182
1 36
2 184
3 16
4 76
...
95 2
96 60
97 30
98 32
99 148
Name: Test-1, Length: 100, dtype: int64
To calculate the weighted average using the following formula and add it as a new column:
Weighted Average \(= 0.1\times HW + 0.15\times Test1+ 0.15\times Test2+ 0.15\times Test3+ 0.15\times Test4+0.3\times Final\)
df_grades['Weighted Average'] = 0.1*df_grades['HW']+0.15*df_grades['Test-1']+0.15*df_grades['Test-2']+0.15*df_grades['Test-3']+0.15*df_grades['Test-4']+0.3*df_grades['Final']
df_grades.head()
| Name | ID | Grade | Gender | HW | Test-1 | Test-2 | Test-3 | Test-4 | Final | Weighted Average | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | mqtvy | 37047871 | 10 | M | 30 | 91 | 69 | 93 | 17 | 50 | 58.50 |
| 1 | jbbsx | 35439616 | 11 | F | 6 | 18 | 93 | 9 | 98 | 91 | 60.60 |
| 2 | mrvab | 35543247 | 11 | M | 78 | 92 | 60 | 43 | 34 | 26 | 49.95 |
| 3 | bjyve | 61282135 | 9 | M | 60 | 8 | 10 | 99 | 80 | 87 | 61.65 |
| 4 | rlpsr | 53448034 | 10 | M | 3 | 38 | 45 | 43 | 79 | 69 | 51.75 |
Iterations and Dataframes#
DataFrames can be iterated through using their row and column labels or indexes. Here’s how you can access them:
Use df.index for row labels.
Use df.columns for column labels.
Use range(len(df)) or range(df.shape[0]) for row indexes.
Use range(df.shape[1]) for column indexes.
df_sh
| State | Height | Weight | |
|---|---|---|---|
| John | NY | 160 | 150 |
| Liz | OK | 180 | 190 |
| Tim | TX | 140 | 170 |
| TX | Jack | 190 | 185 |
# row labels
for row in df_sh.index:
print(row)
John
Liz
Tim
TX
# column labels
for col in df_sh.columns:
print(col)
State
Height
Weight
# access all values using the row and column labels
for row in df_sh.index:
for col in df_sh.columns:
print(df_sh.loc[row, col])
NY
160
150
OK
180
190
TX
140
170
Jack
190
185
for i in range(df_sh.shape[0]): # you can also use len(df_sh)
print(i)
0
1
2
3
# column indexes
for j in range(df_sh.shape[1]):
print(j)
0
1
2
# access all values using the row and column indexes
for i in range(df_sh.shape[0]):
for j in range(df_sh.shape[1]):
print(df_sh.iloc[i,j])
NY
160
150
OK
180
190
TX
140
170
Jack
190
185
Saving Dataframes#
Google Colab#
You need to mount to establish the connection between your Google Drive and Google Colab notebook using the following code.
from google.colab import drive
drive.mount('/content/drive')
You can save your DataFrame as an Excel or CSV file using the path of the folder where you want to save it.
df_sh.to_excel('/content/drive/My Drive/data_files/state_and_height.xlsx')
In the code above, data_files is a folder in My Drive.
The DataFrame df_sh is saved as an Excel file in the data_files folder, and the name of the Excel file is state_and_height.
Jupyter Notebook#
The following codes will save your DataFrame as an Excel or CSV file inside the folder where your Jupyter notebook is located. You can use the same codes with other editors, including Spider and PyCharm.
name of dataframe.to_excel(‘name of the excel file.xlsx’)
name of dataframe.to_csv(‘name of the csv file.csv’)
# save as an excel file
df_sh.to_excel('state_and_height.xlsx')
#save as a csv file
df_sh.to_csv('state_and_height.csv')
If you want to save your DataFrame in a different folder, you need to provide the path.
The following code saves the DataFrame as a CSV file in a folder called data_files:
df_sh.to_csv('data_files/state_and_height.csv')