DataFrames#

Section Title: DataFrames

A DataFrame is a two-dimensional data structure characterized by labeled axes for rows and columns.

  • Similar to a Microsoft Excel worksheet, it efficiently accommodates mixed data types.

  • Primarily used for data wrangling and manipulation.

  • DataFrames offer a wide range of methods for tasks such as data cleaning, exploration, and preprocessing.

  • They support vectorized operations across rows or columns and facilitate visualization.

  • As the main data structure of the pandas library, DataFrames require importing the library to create and manipulate them, and to utilize their methods.

  • It’s common practice to abbreviate pandas as pd for brevity in code import pandas as pd.

Create Dataframes#

import pandas as pd

Use Tuples or Lists#

  • Use pd.DataFrame() to create a dataframe.

  • When using pd.DataFrame(), you can provide column and row labels. If they are not provided, default values \(0, 1, 2, \ldots\) are used.

state_tuple = ('NY', 'OK', 'TX')
df = pd.DataFrame(state_tuple)
display(df)
0
0 NY
1 OK
2 TX
state_list = ['NY', 'OK', 'TX']
df = pd.DataFrame(state_tuple)
display(df)
0
0 NY
1 OK
2 TX
  • You can provide row and column labels using the index and columns parameters

state_list = ['NY', 'OK', 'TX']
df = pd.DataFrame(state_tuple, index=['John', 'Liz', 'Tim'], columns=['State'])
display(df)
State
John NY
Liz OK
Tim TX

Use a Dictionary#

The keys of the dictionary serve as column labels.

state_height_dic = {'State':['NY', 'OK', 'TX'], 'Height':[160, 180, 140]}
df = pd.DataFrame(state_height_dic)
display(df)
State Height
0 NY 160
1 OK 180
2 TX 140
  • You can change row labels.

state_height_dic = {'State':['NY', 'OK', 'TX'], 'Height':[160, 180, 140]}
df_sh = pd.DataFrame(state_height_dic, index=['John', 'Liz', 'Tim'])
df_sh
State Height
John NY 160
Liz OK 180
Tim TX 140

Use an Array#

import numpy as np
state_array = np.array([['NY',160], ['OK', 180], ['TX', 140]])
state_array
array([['NY', '160'],
       ['OK', '180'],
       ['TX', '140']], dtype='<U21')
df = pd.DataFrame(state_array)
display(df)
0 1
0 NY 160
1 OK 180
2 TX 140

Import from Websites#

  1. If the URL of the website ends with .csv, import the data from the website as a CSV file.

    • To import a CSV file, use pd.read_csv().

  2. If the URL ends with .xlsx, import the data as an Excel file.

    • To import an Excel file, use pd.read_excel().

  3. To import a table from a website, you can use pd.read_html().

    • This function returns a list containing all tables found on the website.

    • You can use indexing to access the specific dataframe you want from the list.

  • The head() method displays the first 5 rows.

  • The tail() method displays the last 5 rows.

df_grades = pd.read_csv('https://raw.githubusercontent.com/datasmp/datasets/main/grades.csv')
df_grades.head()
Name ID Grade Gender HW Test-1 Test-2 Test-3 Test-4 Final
0 mqtvy 37047871 10 M 30 91 69 93 17 50
1 jbbsx 35439616 11 F 6 18 93 9 98 91
2 mrvab 35543247 11 M 78 92 60 43 34 26
3 bjyve 61282135 9 M 60 8 10 99 80 87
4 rlpsr 53448034 10 M 3 38 45 43 79 69
df_stock = pd.read_excel('https://raw.githubusercontent.com/datasmp/datasets/main/stock.xlsx')
df_stock.head()
Date APPLE TESLA AMAZON VISA SP500
0 2020-01-02 74.33 86.05 1898.01 189.66 3257.85
1 2020-01-03 73.61 88.60 1874.97 188.15 3234.85
2 2020-01-06 74.20 90.31 1902.88 187.74 3246.28
3 2020-01-07 73.85 93.81 1906.86 187.24 3237.18
4 2020-01-08 75.04 98.43 1891.97 190.45 3253.05
# list of tables from a website imported as dataframes
df_list = pd.read_html('https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States')
type(df_list)
list
# there are 26 tables in the website
len(df_list)
21
# second table
df_list[1].head()
Flag, name and postal abbreviation[8] Cities Ratification or admission[A] Population (2020)[10] Total area[11] Reps.
Flag, name and postal abbreviation[8] Flag, name and postal abbreviation[8].1 Capital Largest[12] Ratification or admission[A] Population (2020)[10] mi2 km2 Reps.
0 Alabama AL Montgomery Huntsville Dec 14, 1819 5024279 52420 135767 7
1 Alaska AK Juneau Anchorage Jan 3, 1959 733391 665384 1723337 1
2 Arizona AZ Phoenix Phoenix Feb 14, 1912 7151502 113990 295234 9
3 Arkansas AR Little Rock Little Rock Jun 15, 1836 3011524 53179 137732 4
4 California CA Sacramento Los Angeles Sep 9, 1850 39538223 163695 423967 52

Import from a Local Computer#

  • To import an Excel file, use pd.read_excel().

  • To import a CSV file, use pd.read_csv().

  • You need to provide the path of the Excel or CSV file that will be imported.

    • In the following example, the grades CSV file is located in the data_files folder.

df_grades = pd.read_csv('data_files/grades.csv')
df_grades.head()
Name ID Grade Gender HW Test-1 Test-2 Test-3 Test-4 Final
0 mqtvy 37047871 10 M 30 91 69 93 17 50
1 jbbsx 35439616 11 F 6 18 93 9 98 91
2 mrvab 35543247 11 M 78 92 60 43 34 26
3 bjyve 61282135 9 M 60 8 10 99 80 87
4 rlpsr 53448034 10 M 3 38 45 43 79 69
df_stock = pd.read_excel('data_files/stock.xlsx')
df_stock.head()
Date APPLE TESLA AMAZON VISA SP500
0 2020-01-02 74.33 86.05 1898.01 189.66 3257.85
1 2020-01-03 73.61 88.60 1874.97 188.15 3234.85
2 2020-01-06 74.20 90.31 1902.88 187.74 3246.28
3 2020-01-07 73.85 93.81 1906.86 187.24 3237.18
4 2020-01-08 75.04 98.43 1891.97 190.45 3253.05

Import from Google Drive#

  1. You need to mount to establish the connection between your Google Drive and Google Colab notebook using the following code.

from google.colab import drive
drive.mount('/content/drive')
  1. You can import the Excel or CSV file by providing the file’s path.

  df_grades = pd.read_excel('/content/drive/My Drive/data_files/grades.xlsx')
  • In the code above data_files is a folder in My Drive.

  • grades.xlsx is an Excel file in data_files folder.

df_grades.head()
Name ID Grade Gender HW Test-1 Test-2 Test-3 Test-4 Final
0 mqtvy 37047871 10 M 30 91 69 93 17 50
1 jbbsx 35439616 11 F 6 18 93 9 98 91
2 mrvab 35543247 11 M 78 92 60 43 34 26
3 bjyve 61282135 9 M 60 8 10 99 80 87
4 rlpsr 53448034 10 M 3 38 45 43 79 69

Accessing values#

Select columns#

A single column or multiple columns can be accessed using square brackets.

  • If multiple columns are accessed, column labels can be provided as a list.

  • When accessing single columns, if the column label consists of letters and digits, a period (.) can also be used instead of square brackets.

# df_stock has 6 columns.
df_stock.head()
Date APPLE TESLA AMAZON VISA SP500
0 2020-01-02 74.33 86.05 1898.01 189.66 3257.85
1 2020-01-03 73.61 88.60 1874.97 188.15 3234.85
2 2020-01-06 74.20 90.31 1902.88 187.74 3246.28
3 2020-01-07 73.85 93.81 1906.86 187.24 3237.18
4 2020-01-08 75.04 98.43 1891.97 190.45 3253.05
  • df_stick[‘APPLE’] returns only the APPLE column as a Series type.

df_stock['APPLE'].head()
0    74.33
1    73.61
2    74.20
3    73.85
4    75.04
Name: APPLE, dtype: float64
  • Using a period (.) instead of square brackets ([]).

df_stock.APPLE.head()
0    74.33
1    73.61
2    74.20
3    73.85
4    75.04
Name: APPLE, dtype: float64
  • When [‘APPLE’,’TESLA’] is provided as a list inside the square brackets, it returns a DataFrame with only two columns.

df_stock[['APPLE','TESLA']].head()
APPLE TESLA
0 74.33 86.05
1 73.61 88.60
2 74.20 90.31
3 73.85 93.81
4 75.04 98.43

Select rows#

The loc operator can be used to access single or multiple rows using row labels.

# row with label 'Tim'
df_sh.loc['Tim']
State      TX
Height    140
Name: Tim, dtype: object
# rows with labels 'John' and 'Tim'
df_sh.loc[['John', 'Tim']]
State Height
John NY 160
Tim TX 140

Slicing#

Slicing can be performed either by using the row and column indexes or labels.

  • For row and column indexes: Use iloc[row index, column index].

  • For row and column labels: Use loc[row label, column label].

Let’s recall the small DataFrame df_sh.

df_sh
State Height
John NY 160
Liz OK 180
Tim TX 140

Let’s try to access Tim’s height, which is 140, using both loc and iloc.

  • The row label for Tim’s height of 140 is Tim.

  • The column label for the height is Height.

df_sh.loc['Tim', 'Height']
140

Let’s try to access Tim’s height, which is 140, using iloc.

  • The row index for Tim is the index of Tim in the list [‘John’, ‘Liz’, ‘Tim’], which is 2.

  • The column index for the height is the index of Height in the list [‘State’, ‘Height’], which is 1.

df_sh.iloc[2,1]
140

You can also perform slicing similar to strings, tuples, and lists.

  • When using slicing with loc (labels), there’s a small difference:

    • loc is inclusive for the end label, meaning the end label is included.

  • For example, considering the column labels of the df_sh DataFrame are [‘State’, ‘Height’]:

    • df_sh.loc[‘Liz’:,:’Height’] includes the columns up to and including ‘Height’.

# rows starting from index 1 (includes 1)    --> row indexes: 1, 2
# columns up to index 1 (does not include 1) --> column index: 0

df_sh.iloc[1:,:1]
State
Liz OK
Tim TX
# rows starting from  'Liz' (includes 'Liz')          --> row labels: 1, 2
# columns up to label 'Height' (includes 'Height')    --> column labels: 'State', 'Height'

df_sh.loc['Liz':,:'Height']
State Height
Liz OK 180
Tim TX 140

Filtering#

Boolean expressions can be used to select specific parts of the DataFrame based on certain conditions.

  • For example, df[‘column_label’] > 90 returns a Series with a boolean value for each row, indicating whether the condition is True or False for that row.

# returns a boolean value for each row.
df_grades['Test-3']>90
0      True
1     False
2     False
3      True
4     False
      ...  
95    False
96    False
97    False
98    False
99    False
Name: Test-3, Length: 100, dtype: bool
  • Two conditions can be combined with a boolean operator to filter the DataFrame based on multiple criteria.

(df_grades['Test-2']>65) & (df_grades['Test-3']>90)
0      True
1     False
2     False
3     False
4     False
      ...  
95    False
96    False
97    False
98    False
99    False
Length: 100, dtype: bool
  • These boolean values can be used within square brackets to select a subset of the DataFrame.

    • Example: The following code selects the rows for which the grade for Test-3 is more than 90.

df_grades[ df_grades['Test-3']>90 ]
Name ID Grade Gender HW Test-1 Test-2 Test-3 Test-4 Final
0 mqtvy 37047871 10 M 30 91 69 93 17 50
3 bjyve 61282135 9 M 60 8 10 99 80 87
9 iyhzx 840574 11 M 91 89 37 93 96 28
24 affft 61590869 9 M 65 29 44 100 83 35
35 gauus 73384666 9 M 19 58 94 91 76 19
39 micxs 32507584 12 M 17 62 20 95 26 67
58 rhjkj 15745708 12 F 59 61 9 98 5 10
81 mndro 52386329 11 M 76 88 85 91 80 44
92 lagws 82526433 12 F 12 45 97 92 20 29
94 jloze 30408806 10 M 19 24 73 98 48 95

Example: The following code selects the rows for which the grade for Test-2 is more than 65 and the grade for Test-3 is more than 90.

df_grades[ (df_grades['Test-2']>65) & (df_grades['Test-3']>90) ]
Name ID Grade Gender HW Test-1 Test-2 Test-3 Test-4 Final
0 mqtvy 37047871 10 M 30 91 69 93 17 50
35 gauus 73384666 9 M 19 58 94 91 76 19
81 mndro 52386329 11 M 76 88 85 91 80 44
92 lagws 82526433 12 F 12 45 97 92 20 29
94 jloze 30408806 10 M 19 24 73 98 48 95

Example: The following code selects the rows for which Gender is male and the grade for Final is more than 95.

df_grades[ (df_grades['Gender']=='M') & (df_grades['Final']>95) ]
Name ID Grade Gender HW Test-1 Test-2 Test-3 Test-4 Final
29 htboh 16544350 10 M 22 93 35 42 12 99
36 hbkmj 92596786 11 M 55 14 74 49 75 98
56 hnewl 2809344 12 M 4 15 45 18 27 97
65 yxchm 48696307 9 M 33 54 5 39 94 99
95 zxuaq 58332078 12 M 57 1 48 61 58 99

Adding Columns and Rows#

To add a new column, use df['new_col_label'] = New column values.

  • The length of the New column values should be equal to the number of rows to fit the DataFrame.

To add a new row, use df.loc['new_row_label'] = New row values.

  • The length of the New row values should be equal to the number of columns to fit the DataFrame.

df_sh
State Height
John NY 160
Liz OK 180
Tim TX 140
  • To add a new column with values 150, 190, and 170 and label it ‘Weight’, you can use the following code:

df_sh['Weight'] = [150, 190, 170]  # you can also use a tuple
df_sh
State Height Weight
John NY 160 150
Liz OK 180 190
Tim TX 140 170
  • To add a new row with values ‘Jack’, 190, 185 and label it ‘TX’, you can use the following code:

df_sh.loc['TX'] = ['Jack', 190, 185]  # you can also use a tuple
df_sh
State Height Weight
John NY 160 150
Liz OK 180 190
Tim TX 140 170
TX Jack 190 185

Vectorized Operations#

Vectorized operations can be performed using the columns of a DataFrame, meaning that a single algebraic operation, such as addition (+), can be executed for each row.

df_grades.head()
Name ID Grade Gender HW Test-1 Test-2 Test-3 Test-4 Final
0 mqtvy 37047871 10 M 30 91 69 93 17 50
1 jbbsx 35439616 11 F 6 18 93 9 98 91
2 mrvab 35543247 11 M 78 92 60 43 34 26
3 bjyve 61282135 9 M 60 8 10 99 80 87
4 rlpsr 53448034 10 M 3 38 45 43 79 69
# sum of Test-1 and Test-2 grades for each student
df_grades['Test-1'] + df_grades['Test-2']
0     160
1     111
2     152
3      18
4      83
     ... 
95     49
96     78
97     87
98    116
99     88
Length: 100, dtype: int64
# difference of Test-1 and Test-2 grades for each student
df_grades['Test-1'] - df_grades['Test-2']
0     22
1    -75
2     32
3     -2
4     -7
      ..
95   -47
96   -18
97   -57
98   -84
99    60
Length: 100, dtype: int64
# twice the Test-1 grades
2*df_grades['Test-1']
0     182
1      36
2     184
3      16
4      76
     ... 
95      2
96     60
97     30
98     32
99    148
Name: Test-1, Length: 100, dtype: int64

To calculate the weighted average using the following formula and add it as a new column:

  • Weighted Average \(= 0.1\times HW + 0.15\times Test1+ 0.15\times Test2+ 0.15\times Test3+ 0.15\times Test4+0.3\times Final\)

df_grades['Weighted Average'] = 0.1*df_grades['HW']+0.15*df_grades['Test-1']+0.15*df_grades['Test-2']+0.15*df_grades['Test-3']+0.15*df_grades['Test-4']+0.3*df_grades['Final'] 
df_grades.head()
Name ID Grade Gender HW Test-1 Test-2 Test-3 Test-4 Final Weighted Average
0 mqtvy 37047871 10 M 30 91 69 93 17 50 58.50
1 jbbsx 35439616 11 F 6 18 93 9 98 91 60.60
2 mrvab 35543247 11 M 78 92 60 43 34 26 49.95
3 bjyve 61282135 9 M 60 8 10 99 80 87 61.65
4 rlpsr 53448034 10 M 3 38 45 43 79 69 51.75

Iterations and Dataframes#

DataFrames can be iterated through using their row and column labels or indexes. Here’s how you can access them:

  • Use df.index for row labels.

  • Use df.columns for column labels.

  • Use range(len(df)) or range(df.shape[0]) for row indexes.

  • Use range(df.shape[1]) for column indexes.

df_sh
State Height Weight
John NY 160 150
Liz OK 180 190
Tim TX 140 170
TX Jack 190 185
# row labels
for row in df_sh.index:
    print(row)
John
Liz
Tim
TX
# column labels
for col in df_sh.columns:
    print(col)
State
Height
Weight
# access all values using the row and column labels
for row in df_sh.index:
    for col in df_sh.columns:
        print(df_sh.loc[row, col])
NY
160
150
OK
180
190
TX
140
170
Jack
190
185
for i in range(df_sh.shape[0]):    # you can also use len(df_sh)
    print(i)
0
1
2
3
# column indexes
for j in range(df_sh.shape[1]):
    print(j)
0
1
2
# access all values using the row and column indexes
for i in range(df_sh.shape[0]):
    for j in range(df_sh.shape[1]):
        print(df_sh.iloc[i,j])
NY
160
150
OK
180
190
TX
140
170
Jack
190
185

Saving Dataframes#

Google Colab#

  1. You need to mount to establish the connection between your Google Drive and Google Colab notebook using the following code.

from google.colab import drive
drive.mount('/content/drive')
  1. You can save your DataFrame as an Excel or CSV file using the path of the folder where you want to save it.

  df_sh.to_excel('/content/drive/My Drive/data_files/state_and_height.xlsx')
  • In the code above, data_files is a folder in My Drive.

  • The DataFrame df_sh is saved as an Excel file in the data_files folder, and the name of the Excel file is state_and_height.

Jupyter Notebook#

The following codes will save your DataFrame as an Excel or CSV file inside the folder where your Jupyter notebook is located. You can use the same codes with other editors, including Spider and PyCharm.

  • name of dataframe.to_excel(‘name of the excel file.xlsx’)

  • name of dataframe.to_csv(‘name of the csv file.csv’)

# save as an excel file
df_sh.to_excel('state_and_height.xlsx')
#save as a csv file
df_sh.to_csv('state_and_height.csv')
  • If you want to save your DataFrame in a different folder, you need to provide the path.

  • The following code saves the DataFrame as a CSV file in a folder called data_files:

df_sh.to_csv('data_files/state_and_height.csv')