DataFrames

DataFrames#

Section Title: DataFrames

A DataFrame is a two-dimensional data structure characterized by labeled axes for rows and columns.

Similar to a Microsoft Excel worksheet, it efficiently accommodates mixed data types.
Primarily used for data wrangling and manipulation.
DataFrames offer a wide range of methods for tasks such as data cleaning, exploration, and preprocessing.
They support vectorized operations across rows or columns and facilitate visualization.
As the main data structure of the pandas library, DataFrames require importing the library to create and manipulate them, and to utilize their methods.
It’s common practice to abbreviate pandas as pd for brevity in code import pandas as pd.

Create Dataframes#

import pandas as pd

Use Tuples or Lists#

Use pd.DataFrame() to create a dataframe.
When using pd.DataFrame(), you can provide column and row labels. If they are not provided, default values \(0, 1, 2, \ldots\) are used.

state_tuple = ('NY', 'OK', 'TX')
df = pd.DataFrame(state_tuple)
display(df)

	0
0	NY
1	OK
2	TX

state_list = ['NY', 'OK', 'TX']
df = pd.DataFrame(state_tuple)
display(df)

	0
0	NY
1	OK
2	TX

You can provide row and column labels using the index and columns parameters

state_list = ['NY', 'OK', 'TX']
df = pd.DataFrame(state_tuple, index=['John', 'Liz', 'Tim'], columns=['State'])
display(df)

	State
John	NY
Liz	OK
Tim	TX

Use a Dictionary#

The keys of the dictionary serve as column labels.

state_height_dic = {'State':['NY', 'OK', 'TX'], 'Height':[160, 180, 140]}
df = pd.DataFrame(state_height_dic)
display(df)

	State	Height
0	NY	160
1	OK	180
2	TX	140

You can change row labels.

state_height_dic = {'State':['NY', 'OK', 'TX'], 'Height':[160, 180, 140]}
df_sh = pd.DataFrame(state_height_dic, index=['John', 'Liz', 'Tim'])
df_sh

	State	Height
John	NY	160
Liz	OK	180
Tim	TX	140

Use an Array#

import numpy as np

state_array = np.array([['NY',160], ['OK', 180], ['TX', 140]])
state_array

array([['NY', '160'],
       ['OK', '180'],
       ['TX', '140']], dtype='<U21')

df = pd.DataFrame(state_array)
display(df)

	0	1
0	NY	160
1	OK	180
2	TX	140

Import from Websites#

If the URL of the website ends with .csv, import the data from the website as a CSV file.
- To import a CSV file, use pd.read_csv().
If the URL ends with .xlsx, import the data as an Excel file.
- To import an Excel file, use pd.read_excel().
To import a table from a website, you can use pd.read_html().
- This function returns a list containing all tables found on the website.
- You can use indexing to access the specific dataframe you want from the list.

The head() method displays the first 5 rows.
The tail() method displays the last 5 rows.

df_grades = pd.read_csv('https://raw.githubusercontent.com/datasmp/datasets/main/grades.csv')
df_grades.head()

	Name	ID	Grade	Gender	HW	Test-1	Test-2	Test-3	Test-4	Final
0	mqtvy	37047871	10	M	30	91	69	93	17	50
1	jbbsx	35439616	11	F	6	18	93	9	98	91
2	mrvab	35543247	11	M	78	92	60	43	34	26
3	bjyve	61282135	9	M	60	8	10	99	80	87
4	rlpsr	53448034	10	M	3	38	45	43	79	69

df_stock = pd.read_excel('https://raw.githubusercontent.com/datasmp/datasets/main/stock.xlsx')
df_stock.head()

	Date	APPLE	TESLA	AMAZON	VISA	SP500
0	2020-01-02	74.33	86.05	1898.01	189.66	3257.85
1	2020-01-03	73.61	88.60	1874.97	188.15	3234.85
2	2020-01-06	74.20	90.31	1902.88	187.74	3246.28
3	2020-01-07	73.85	93.81	1906.86	187.24	3237.18
4	2020-01-08	75.04	98.43	1891.97	190.45	3253.05

# list of tables from a website imported as dataframes
df_list = pd.read_html('https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States')

type(df_list)

list

# there are 26 tables in the website
len(df_list)

# second table
df_list[1].head()

	Flag, name and postal abbreviation[8]		Cities		Ratification or admission[A]	Population (2020)[10]	Total area[11]		Reps.
	Flag, name and postal abbreviation[8]	Flag, name and postal abbreviation[8].1	Capital	Largest[12]	Ratification or admission[A]	Population (2020)[10]	mi2	km2	Reps.
0	Alabama	AL	Montgomery	Huntsville	Dec 14, 1819	5024279	52420	135767	7
1	Alaska	AK	Juneau	Anchorage	Jan 3, 1959	733391	665384	1723337	1
2	Arizona	AZ	Phoenix	Phoenix	Feb 14, 1912	7151502	113990	295234	9
3	Arkansas	AR	Little Rock	Little Rock	Jun 15, 1836	3011524	53179	137732	4
4	California	CA	Sacramento	Los Angeles	Sep 9, 1850	39538223	163695	423967	52

Import from a Local Computer#

To import an Excel file, use pd.read_excel().
To import a CSV file, use pd.read_csv().
You need to provide the path of the Excel or CSV file that will be imported.
- In the following example, the grades CSV file is located in the data_files folder.

df_grades = pd.read_csv('data_files/grades.csv')
df_grades.head()

	Name	ID	Grade	Gender	HW	Test-1	Test-2	Test-3	Test-4	Final
0	mqtvy	37047871	10	M	30	91	69	93	17	50
1	jbbsx	35439616	11	F	6	18	93	9	98	91
2	mrvab	35543247	11	M	78	92	60	43	34	26
3	bjyve	61282135	9	M	60	8	10	99	80	87
4	rlpsr	53448034	10	M	3	38	45	43	79	69

df_stock = pd.read_excel('data_files/stock.xlsx')
df_stock.head()

	Date	APPLE	TESLA	AMAZON	VISA	SP500
0	2020-01-02	74.33	86.05	1898.01	189.66	3257.85
1	2020-01-03	73.61	88.60	1874.97	188.15	3234.85
2	2020-01-06	74.20	90.31	1902.88	187.74	3246.28
3	2020-01-07	73.85	93.81	1906.86	187.24	3237.18
4	2020-01-08	75.04	98.43	1891.97	190.45	3253.05

Import from Google Drive#

You need to mount to establish the connection between your Google Drive and Google Colab notebook using the following code.

from google.colab import drive
drive.mount('/content/drive')

You can import the Excel or CSV file by providing the file’s path.

  df_grades = pd.read_excel('/content/drive/My Drive/data_files/grades.xlsx')

In the code above data_files is a folder in My Drive.
grades.xlsx is an Excel file in data_files folder.

df_grades.head()

	Name	ID	Grade	Gender	HW	Test-1	Test-2	Test-3	Test-4	Final
0	mqtvy	37047871	10	M	30	91	69	93	17	50
1	jbbsx	35439616	11	F	6	18	93	9	98	91
2	mrvab	35543247	11	M	78	92	60	43	34	26
3	bjyve	61282135	9	M	60	8	10	99	80	87
4	rlpsr	53448034	10	M	3	38	45	43	79	69

Accessing values#

Select columns#

A single column or multiple columns can be accessed using square brackets.

If multiple columns are accessed, column labels can be provided as a list.
When accessing single columns, if the column label consists of letters and digits, a period (.) can also be used instead of square brackets.

# df_stock has 6 columns.
df_stock.head()

	Date	APPLE	TESLA	AMAZON	VISA	SP500
0	2020-01-02	74.33	86.05	1898.01	189.66	3257.85
1	2020-01-03	73.61	88.60	1874.97	188.15	3234.85
2	2020-01-06	74.20	90.31	1902.88	187.74	3246.28
3	2020-01-07	73.85	93.81	1906.86	187.24	3237.18
4	2020-01-08	75.04	98.43	1891.97	190.45	3253.05

df_stick[‘APPLE’] returns only the APPLE column as a Series type.

df_stock['APPLE'].head()

  74.33
  73.61
  74.20
  73.85
  75.04
Name: APPLE, dtype: float64

Using a period (.) instead of square brackets ([]).

df_stock.APPLE.head()

  74.33
  73.61
  74.20
  73.85
  75.04
Name: APPLE, dtype: float64

When [‘APPLE’,’TESLA’] is provided as a list inside the square brackets, it returns a DataFrame with only two columns.

df_stock[['APPLE','TESLA']].head()

	APPLE	TESLA
0	74.33	86.05
1	73.61	88.60
2	74.20	90.31
3	73.85	93.81
4	75.04	98.43

Select rows#

The loc operator can be used to access single or multiple rows using row labels.

# row with label 'Tim'
df_sh.loc['Tim']

State      TX
Height    140
Name: Tim, dtype: object

# rows with labels 'John' and 'Tim'
df_sh.loc[['John', 'Tim']]

	State	Height
John	NY	160
Tim	TX	140

Slicing#

Slicing can be performed either by using the row and column indexes or labels.

For row and column indexes: Use iloc[row index, column index].
For row and column labels: Use loc[row label, column label].

Let’s recall the small DataFrame df_sh.

df_sh

	State	Height
John	NY	160
Liz	OK	180
Tim	TX	140

Let’s try to access Tim’s height, which is 140, using both loc and iloc.

The row label for Tim’s height of 140 is Tim.
The column label for the height is Height.

df_sh.loc['Tim', 'Height']

Let’s try to access Tim’s height, which is 140, using iloc.

The row index for Tim is the index of Tim in the list [‘John’, ‘Liz’, ‘Tim’], which is 2.
The column index for the height is the index of Height in the list [‘State’, ‘Height’], which is 1.

df_sh.iloc[2,1]

You can also perform slicing similar to strings, tuples, and lists.

When using slicing with loc (labels), there’s a small difference:
- loc is inclusive for the end label, meaning the end label is included.
For example, considering the column labels of the df_sh DataFrame are [‘State’, ‘Height’]:
- df_sh.loc[‘Liz’:,:’Height’] includes the columns up to and including ‘Height’.

# rows starting from index 1 (includes 1)    --> row indexes: 1, 2
# columns up to index 1 (does not include 1) --> column index: 0

df_sh.iloc[1:,:1]

	State
Liz	OK
Tim	TX

# rows starting from  'Liz' (includes 'Liz')          --> row labels: 1, 2
# columns up to label 'Height' (includes 'Height')    --> column labels: 'State', 'Height'

df_sh.loc['Liz':,:'Height']

	State	Height
Liz	OK	180
Tim	TX	140

Filtering#

Boolean expressions can be used to select specific parts of the DataFrame based on certain conditions.

For example, df[‘column_label’] > 90 returns a Series with a boolean value for each row, indicating whether the condition is True or False for that row.

# returns a boolean value for each row.
df_grades['Test-3']>90

    True
   False
   False
    True
   False
      ...  
  False
  False
  False
  False
  False
Name: Test-3, Length: 100, dtype: bool

Two conditions can be combined with a boolean operator to filter the DataFrame based on multiple criteria.

(df_grades['Test-2']>65) & (df_grades['Test-3']>90)

    True
   False
   False
   False
   False
      ...  
  False
  False
  False
  False
  False
Length: 100, dtype: bool

These boolean values can be used within square brackets to select a subset of the DataFrame.
- Example: The following code selects the rows for which the grade for Test-3 is more than 90.

df_grades[ df_grades['Test-3']>90 ]

	Name	ID	Grade	Gender	HW	Test-1	Test-2	Test-3	Test-4	Final
0	mqtvy	37047871	10	M	30	91	69	93	17	50
3	bjyve	61282135	9	M	60	8	10	99	80	87
9	iyhzx	840574	11	M	91	89	37	93	96	28
24	affft	61590869	9	M	65	29	44	100	83	35
35	gauus	73384666	9	M	19	58	94	91	76	19
39	micxs	32507584	12	M	17	62	20	95	26	67
58	rhjkj	15745708	12	F	59	61	9	98	5	10
81	mndro	52386329	11	M	76	88	85	91	80	44
92	lagws	82526433	12	F	12	45	97	92	20	29
94	jloze	30408806	10	M	19	24	73	98	48	95

Example: The following code selects the rows for which the grade for Test-2 is more than 65 and the grade for Test-3 is more than 90.

df_grades[ (df_grades['Test-2']>65) & (df_grades['Test-3']>90) ]

	Name	ID	Grade	Gender	HW	Test-1	Test-2	Test-3	Test-4	Final
0	mqtvy	37047871	10	M	30	91	69	93	17	50
35	gauus	73384666	9	M	19	58	94	91	76	19
81	mndro	52386329	11	M	76	88	85	91	80	44
92	lagws	82526433	12	F	12	45	97	92	20	29
94	jloze	30408806	10	M	19	24	73	98	48	95

Example: The following code selects the rows for which Gender is male and the grade for Final is more than 95.

df_grades[ (df_grades['Gender']=='M') & (df_grades['Final']>95) ]

	Name	ID	Grade	Gender	HW	Test-1	Test-2	Test-3	Test-4	Final
29	htboh	16544350	10	M	22	93	35	42	12	99
36	hbkmj	92596786	11	M	55	14	74	49	75	98
56	hnewl	2809344	12	M	4	15	45	18	27	97
65	yxchm	48696307	9	M	33	54	5	39	94	99
95	zxuaq	58332078	12	M	57	1	48	61	58	99

Adding Columns and Rows#

To add a new column, use df['new_col_label'] = New column values.

The length of the New column values should be equal to the number of rows to fit the DataFrame.

To add a new row, use df.loc['new_row_label'] = New row values.

The length of the New row values should be equal to the number of columns to fit the DataFrame.

df_sh

	State	Height
John	NY	160
Liz	OK	180
Tim	TX	140

To add a new column with values 150, 190, and 170 and label it ‘Weight’, you can use the following code:

df_sh['Weight'] = [150, 190, 170]  # you can also use a tuple

df_sh

	State	Height	Weight
John	NY	160	150
Liz	OK	180	190
Tim	TX	140	170

To add a new row with values ‘Jack’, 190, 185 and label it ‘TX’, you can use the following code:

df_sh.loc['TX'] = ['Jack', 190, 185]  # you can also use a tuple

df_sh

	State	Height	Weight
John	NY	160	150
Liz	OK	180	190
Tim	TX	140	170
TX	Jack	190	185

Vectorized Operations#

Vectorized operations can be performed using the columns of a DataFrame, meaning that a single algebraic operation, such as addition (+), can be executed for each row.

df_grades.head()

	Name	ID	Grade	Gender	HW	Test-1	Test-2	Test-3	Test-4	Final
0	mqtvy	37047871	10	M	30	91	69	93	17	50
1	jbbsx	35439616	11	F	6	18	93	9	98	91
2	mrvab	35543247	11	M	78	92	60	43	34	26
3	bjyve	61282135	9	M	60	8	10	99	80	87
4	rlpsr	53448034	10	M	3	38	45	43	79	69

# sum of Test-1 and Test-2 grades for each student
df_grades['Test-1'] + df_grades['Test-2']

   160
   111
   152
    18
    83
     ... 
   49
   78
   87
  116
   88
Length: 100, dtype: int64

# difference of Test-1 and Test-2 grades for each student
df_grades['Test-1'] - df_grades['Test-2']

   22
  -75
   32
   -2
   -7
      ..
 -47
 -18
 -57
 -84
  60
Length: 100, dtype: int64

# twice the Test-1 grades
2*df_grades['Test-1']

   182
    36
   184
    16
    76
     ... 
    2
   60
   30
   32
  148
Name: Test-1, Length: 100, dtype: int64

To calculate the weighted average using the following formula and add it as a new column:

Weighted Average \(= 0.1\times HW + 0.15\times Test1+ 0.15\times Test2+ 0.15\times Test3+ 0.15\times Test4+0.3\times Final\)

df_grades['Weighted Average'] = 0.1*df_grades['HW']+0.15*df_grades['Test-1']+0.15*df_grades['Test-2']+0.15*df_grades['Test-3']+0.15*df_grades['Test-4']+0.3*df_grades['Final'] 

df_grades.head()

	Name	ID	Grade	Gender	HW	Test-1	Test-2	Test-3	Test-4	Final	Weighted Average
0	mqtvy	37047871	10	M	30	91	69	93	17	50	58.50
1	jbbsx	35439616	11	F	6	18	93	9	98	91	60.60
2	mrvab	35543247	11	M	78	92	60	43	34	26	49.95
3	bjyve	61282135	9	M	60	8	10	99	80	87	61.65
4	rlpsr	53448034	10	M	3	38	45	43	79	69	51.75

Iterations and Dataframes#

DataFrames can be iterated through using their row and column labels or indexes. Here’s how you can access them:

Use df.index for row labels.
Use df.columns for column labels.
Use range(len(df)) or range(df.shape[0]) for row indexes.
Use range(df.shape[1]) for column indexes.

df_sh

	State	Height	Weight
John	NY	160	150
Liz	OK	180	190
Tim	TX	140	170
TX	Jack	190	185

# row labels
for row in df_sh.index:
    print(row)

John
Liz
Tim
TX

# column labels
for col in df_sh.columns:
    print(col)

State
Height
Weight

# access all values using the row and column labels
for row in df_sh.index:
    for col in df_sh.columns:
        print(df_sh.loc[row, col])

NY
160
150
OK
180
190
TX
140
170
Jack
190
185

for i in range(df_sh.shape[0]):    # you can also use len(df_sh)
    print(i)

# column indexes
for j in range(df_sh.shape[1]):
    print(j)

0
1
2

# access all values using the row and column indexes
for i in range(df_sh.shape[0]):
    for j in range(df_sh.shape[1]):
        print(df_sh.iloc[i,j])

NY
160
150
OK
180
190
TX
140
170
Jack
190
185

Saving Dataframes#

Google Colab#

You need to mount to establish the connection between your Google Drive and Google Colab notebook using the following code.

from google.colab import drive
drive.mount('/content/drive')

You can save your DataFrame as an Excel or CSV file using the path of the folder where you want to save it.

  df_sh.to_excel('/content/drive/My Drive/data_files/state_and_height.xlsx')

In the code above, data_files is a folder in My Drive.
The DataFrame df_sh is saved as an Excel file in the data_files folder, and the name of the Excel file is state_and_height.

Jupyter Notebook#

The following codes will save your DataFrame as an Excel or CSV file inside the folder where your Jupyter notebook is located. You can use the same codes with other editors, including Spider and PyCharm.

name of dataframe.to_excel(‘name of the excel file.xlsx’)
name of dataframe.to_csv(‘name of the csv file.csv’)

# save as an excel file
df_sh.to_excel('state_and_height.xlsx')

#save as a csv file
df_sh.to_csv('state_and_height.csv')

If you want to save your DataFrame in a different folder, you need to provide the path.
The following code saves the DataFrame as a CSV file in a folder called data_files:

df_sh.to_csv('data_files/state_and_height.csv')