DataFrames Attributes and Methods#

import pandas as pd
In this section, we will cover basic attributes and methods of dataframes.
Throughout this section, we’ll be using the following three dataframes.
state_height_dic = {'State':['NY', 'OK', 'TX'], 'Height':[160, 180, 140]}
df_sh = pd.DataFrame(state_height_dic, index=['John', 'Liz', 'Tim'])
df_sh
| State | Height | |
|---|---|---|
| John | NY | 160 |
| Liz | OK | 180 |
| Tim | TX | 140 |
df_grades = pd.read_csv('https://raw.githubusercontent.com/datasmp/datasets/main/grades.csv')
df_grades.head()
| Name | ID | Grade | Gender | HW | Test-1 | Test-2 | Test-3 | Test-4 | Final | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | mqtvy | 37047871 | 10 | M | 30 | 91 | 69 | 93 | 17 | 50 |
| 1 | jbbsx | 35439616 | 11 | F | 6 | 18 | 93 | 9 | 98 | 91 |
| 2 | mrvab | 35543247 | 11 | M | 78 | 92 | 60 | 43 | 34 | 26 |
| 3 | bjyve | 61282135 | 9 | M | 60 | 8 | 10 | 99 | 80 | 87 |
| 4 | rlpsr | 53448034 | 10 | M | 3 | 38 | 45 | 43 | 79 | 69 |
df_stock = pd.read_excel('https://raw.githubusercontent.com/datasmp/datasets/main/stock.xlsx')
df_stock.head()
| Date | APPLE | TESLA | AMAZON | VISA | SP500 | |
|---|---|---|---|---|---|---|
| 0 | 2020-01-02 | 74.33 | 86.05 | 1898.01 | 189.66 | 3257.85 |
| 1 | 2020-01-03 | 73.61 | 88.60 | 1874.97 | 188.15 | 3234.85 |
| 2 | 2020-01-06 | 74.20 | 90.31 | 1902.88 | 187.74 | 3246.28 |
| 3 | 2020-01-07 | 73.85 | 93.81 | 1906.86 | 187.24 | 3237.18 |
| 4 | 2020-01-08 | 75.04 | 98.43 | 1891.97 | 190.45 | 3253.05 |
Attributes#
Attributes provide information about the dataframe.
shape and len()#
shapereturns the tuple (number of rows, number of columns).len()returns the number of rows.
# 3 rows and 2 columns
df_sh.shape
(3, 2)
# 3 rows
len(df_sh)
3
# 100 rows and 10 columns
df_grades.shape
(100, 10)
# 100 rows
len(df_grades)
100
index and columns#
index returns row labels.
columns returns column labels.
# row labels
df_sh.index
Index(['John', 'Liz', 'Tim'], dtype='object')
# column labels
df_sh.columns
Index(['State', 'Height'], dtype='object')
values#
The values attribute removes labels and returns an array.
df_sh.values
array([['NY', 160],
['OK', 180],
['TX', 140]], dtype=object)
Methods#
head() and tail()#
head(n) displays the first n rows.
tail(n) displays the last n rows.
The default value of \(n\) is 5.
# first 5 rows
df_grades.head()
| Name | ID | Grade | Gender | HW | Test-1 | Test-2 | Test-3 | Test-4 | Final | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | mqtvy | 37047871 | 10 | M | 30 | 91 | 69 | 93 | 17 | 50 |
| 1 | jbbsx | 35439616 | 11 | F | 6 | 18 | 93 | 9 | 98 | 91 |
| 2 | mrvab | 35543247 | 11 | M | 78 | 92 | 60 | 43 | 34 | 26 |
| 3 | bjyve | 61282135 | 9 | M | 60 | 8 | 10 | 99 | 80 | 87 |
| 4 | rlpsr | 53448034 | 10 | M | 3 | 38 | 45 | 43 | 79 | 69 |
# first 7 rows
df_grades.head(7)
| Name | ID | Grade | Gender | HW | Test-1 | Test-2 | Test-3 | Test-4 | Final | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | mqtvy | 37047871 | 10 | M | 30 | 91 | 69 | 93 | 17 | 50 |
| 1 | jbbsx | 35439616 | 11 | F | 6 | 18 | 93 | 9 | 98 | 91 |
| 2 | mrvab | 35543247 | 11 | M | 78 | 92 | 60 | 43 | 34 | 26 |
| 3 | bjyve | 61282135 | 9 | M | 60 | 8 | 10 | 99 | 80 | 87 |
| 4 | rlpsr | 53448034 | 10 | M | 3 | 38 | 45 | 43 | 79 | 69 |
| 5 | hotxq | 96808584 | 9 | F | 29 | 18 | 61 | 17 | 72 | 39 |
| 6 | ylskw | 37901138 | 9 | M | 53 | 85 | 61 | 87 | 9 | 94 |
# last 5 rows
df_grades.tail()
| Name | ID | Grade | Gender | HW | Test-1 | Test-2 | Test-3 | Test-4 | Final | |
|---|---|---|---|---|---|---|---|---|---|---|
| 95 | zxuaq | 58332078 | 12 | M | 57 | 1 | 48 | 61 | 58 | 99 |
| 96 | vegye | 10019910 | 11 | F | 88 | 30 | 48 | 85 | 48 | 17 |
| 97 | euwlj | 27967150 | 12 | F | 70 | 15 | 72 | 77 | 20 | 65 |
| 98 | oesms | 36171468 | 10 | M | 88 | 16 | 100 | 62 | 76 | 19 |
| 99 | buugr | 88783139 | 10 | M | 23 | 74 | 14 | 17 | 40 | 59 |
# last 7 rows
df_grades.tail(7)
| Name | ID | Grade | Gender | HW | Test-1 | Test-2 | Test-3 | Test-4 | Final | |
|---|---|---|---|---|---|---|---|---|---|---|
| 93 | yzljj | 10884705 | 10 | M | 92 | 75 | 41 | 33 | 61 | 66 |
| 94 | jloze | 30408806 | 10 | M | 19 | 24 | 73 | 98 | 48 | 95 |
| 95 | zxuaq | 58332078 | 12 | M | 57 | 1 | 48 | 61 | 58 | 99 |
| 96 | vegye | 10019910 | 11 | F | 88 | 30 | 48 | 85 | 48 | 17 |
| 97 | euwlj | 27967150 | 12 | F | 70 | 15 | 72 | 77 | 20 | 65 |
| 98 | oesms | 36171468 | 10 | M | 88 | 16 | 100 | 62 | 76 | 19 |
| 99 | buugr | 88783139 | 10 | M | 23 | 74 | 14 | 17 | 40 | 59 |
info()#
The info() method provides basic information about the DataFrame, including:
row labels: RangeIndex: 100 entries, 0 to 99
column labels: ‘Name’, ‘ID’, ‘Grade’, ‘Gender’, ‘HW’, ‘Test-1’, ‘Test-2’, ‘Test-3’, ‘Test-4’, ‘Final’
number of columns: 10
Non-null (non-missing) values: 100 for each column
dtype: object (strings), int64
memory usage: 7.9+ KB
df_grades.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 100 non-null object
1 ID 100 non-null int64
2 Grade 100 non-null int64
3 Gender 100 non-null object
4 HW 100 non-null int64
5 Test-1 100 non-null int64
6 Test-2 100 non-null int64
7 Test-3 100 non-null int64
8 Test-4 100 non-null int64
9 Final 100 non-null int64
dtypes: int64(8), object(2)
memory usage: 7.9+ KB
df_grades.columns
Index(['Name', 'ID', 'Grade', 'Gender', 'HW', 'Test-1', 'Test-2', 'Test-3',
'Test-4', 'Final'],
dtype='object')
describe()#
It returns the descriptive statistics of the columns of the DataFrame, including:
Count
Mean
Standard deviation (std)
Minimum (min)
25th percentile
50th percentile (median)
75th percentile
Maximum (max)
df_stock.describe()
| APPLE | TESLA | AMAZON | VISA | SP500 | |
|---|---|---|---|---|---|
| count | 252.000000 | 252.000000 | 252.000000 | 252.000000 | 252.000000 |
| mean | 94.753056 | 288.347579 | 2678.569087 | 192.986984 | 3215.720079 |
| std | 21.836181 | 167.278855 | 545.632095 | 15.541861 | 318.045077 |
| min | 55.660000 | 72.240000 | 1676.610000 | 134.900000 | 2237.400000 |
| 25% | 76.622500 | 149.745000 | 2146.120000 | 187.957500 | 3034.530000 |
| 50% | 90.960000 | 232.830000 | 2884.500000 | 196.205000 | 3275.360000 |
| 75% | 115.350000 | 421.035000 | 3164.790000 | 203.785000 | 3403.050000 |
| max | 136.490000 | 695.000000 | 3531.450000 | 218.020000 | 3735.360000 |
value_counts()#
It returns the number of occurrences for each value of the column.
# There are 54 occurrences of 'M' and 46 occurrences of 'F' in the column Gender.
df_grades['Gender'].value_counts()
Gender
M 54
F 46
Name: count, dtype: int64
# There are 27 occurrences of 'nine', 26 occurrences of 'eleven', 24 occurrences of 'twelve',
# and 23 occurrences of 'ten' in the Grade column.
df_grades['Grade'].value_counts()
Grade
9 27
11 26
12 24
10 23
Name: count, dtype: int64
copy()#
The copy() method returns a copy of the DataFrame.
df_sh_copy = df_sh.copy()
df_sh_copy
| State | Height | |
|---|---|---|
| John | NY | 160 |
| Liz | OK | 180 |
| Tim | TX | 140 |
drop()#
The drop() method is used to remove column(s) and row(s) from the DataFrame.
axis=0 means that the given row(s) are dropped (default).
axis=1 means that the given column(s) are dropped.
# row with label 'Liz' is removed
df_sh.drop('Liz', axis=0)
| State | Height | |
|---|---|---|
| John | NY | 160 |
| Tim | TX | 140 |
# column with label 'Height' is removed
df_sh.drop('Height', axis=1)
| State | |
|---|---|
| John | NY |
| Liz | OK |
| Tim | TX |
By default, the drop() method does not change the DataFrame.
It just displays how the DataFrame would look if the dropping were performed.
# no change on df_sh
df_sh
| State | Height | |
|---|---|---|
| John | NY | 160 |
| Liz | OK | 180 |
| Tim | TX | 140 |
To make a change to the DataFrame, the parameter
inplaceshould be set to True, which is False by default.
# no 'State' column any more
df_sh.drop('Height', axis=1, inplace=True)
df_sh
| State | |
|---|---|
| John | NY |
| Liz | OK |
| Tim | TX |
# no 'Tim' row any more
df_sh.drop('Tim', inplace=True) #default value of axis=0
df_sh
| State | |
|---|---|
| John | NY |
| Liz | OK |
reset_index()#
The reset_index() method is used to reset the row labels of a DataFrame to default values: \(0, 1, 2, ...\)
It also adds the initial row labels as a new column with label index.
df_sh.reset_index()
| index | State | |
|---|---|---|
| 0 | John | NY |
| 1 | Liz | OK |
If you do not want to keep the initial index values, you can set the drop parameter to True.
# no index column
df_sh.reset_index(drop=True)
| State | |
|---|---|
| 0 | NY |
| 1 | OK |
reset_index() does not change the DataFrame unless inplace=True.
# no change on row labels
df_sh
| State | |
|---|---|
| John | NY |
| Liz | OK |
# row labels are changed
df_sh.reset_index(inplace=True)
df_sh
| index | State | |
|---|---|---|
| 0 | John | NY |
| 1 | Liz | OK |
set_index()#
The set_index() method is used to set the DataFrame row labels to an existing column.
The label of the column that becomes the new index is preserved for future use.
# Use 'State' column values as new row labels
df_sh.set_index('State')
| index | |
|---|---|
| State | |
| NY | John |
| OK | Liz |
set_index() does not change the DataFrame unless inplace=True
# no change on row labels
df_sh
| index | State | |
|---|---|---|
| 0 | John | NY |
| 1 | Liz | OK |
# row labels are changed
df_sh.set_index('State', inplace=True)
df_sh
| index | |
|---|---|
| State | |
| NY | John |
| OK | Liz |
Let’s update the index of df_stock using the Date column.
df_stock.set_index('Date', inplace=True)
df_stock.head()
| APPLE | TESLA | AMAZON | VISA | SP500 | |
|---|---|---|---|---|---|
| Date | |||||
| 2020-01-02 | 74.33 | 86.05 | 1898.01 | 189.66 | 3257.85 |
| 2020-01-03 | 73.61 | 88.60 | 1874.97 | 188.15 | 3234.85 |
| 2020-01-06 | 74.20 | 90.31 | 1902.88 | 187.74 | 3246.28 |
| 2020-01-07 | 73.85 | 93.81 | 1906.86 | 187.24 | 3237.18 |
| 2020-01-08 | 75.04 | 98.43 | 1891.97 | 190.45 | 3253.05 |
sum(), mean(), median(), std()#
When axis=0, computation is column-wise (vertically). This means that for each column, the operation is performed on different row values.
This is the default value.
When axis=1, computation is row-wise (horizontally). This means that for each row, the operation is performed on different column values.
# axis=0
df_stock.sum(0) # sum of values in each column
APPLE 23877.77
TESLA 72663.59
AMAZON 674999.41
VISA 48632.72
SP500 810361.46
dtype: float64
# axis=1
df_stock.sum(1) # sum of values in each row
Date
2020-01-02 5505.90
2020-01-03 5460.18
2020-01-06 5501.41
2020-01-07 5498.94
2020-01-08 5508.94
...
2020-12-23 7857.00
2020-12-24 7877.67
2020-12-28 8031.80
2020-12-29 8063.74
2020-12-30 8064.21
Length: 252, dtype: float64
# mean of each column
df_stock.mean()
APPLE 94.753056
TESLA 288.347579
AMAZON 2678.569087
VISA 192.986984
SP500 3215.720079
dtype: float64
# median of each column
df_stock.median()
APPLE 90.960
TESLA 232.830
AMAZON 2884.500
VISA 196.205
SP500 3275.360
dtype: float64
# standard deviation of each column
df_stock.std()
APPLE 21.836181
TESLA 167.278855
AMAZON 545.632095
VISA 15.541861
SP500 318.045077
dtype: float64
rename()#
The rename() method is used to rename column or row labels.
df_sh
| index | |
|---|---|
| State | |
| NY | John |
| OK | Liz |
Let’s rename the index column as “Name”.
df_sh.rename(columns={'index':'Name'})
| Name | |
|---|---|
| State | |
| NY | John |
| OK | Liz |
# no change on df_sh
df_sh
| index | |
|---|---|
| State | |
| NY | John |
| OK | Liz |
rename() does not change the DataFrame unless inplace=True.
# 'index' --> 'Name'
df_sh.rename(columns={'index':'Name'}, inplace=True)
df_sh
| Name | |
|---|---|
| State | |
| NY | John |
| OK | Liz |
sort_values()#
It sorts the dataframe based on the provided column in ascending or descending order.
By default, sorting is done in ascending order.
To sort in descending order, set the
ascendingparameter to False.
# ascending
df_grades.sort_values('Test-2').head()
| Name | ID | Grade | Gender | HW | Test-1 | Test-2 | Test-3 | Test-4 | Final | |
|---|---|---|---|---|---|---|---|---|---|---|
| 12 | jjgag | 46794501 | 11 | F | 8 | 43 | 1 | 7 | 82 | 36 |
| 19 | cxkpb | 71808293 | 10 | M | 87 | 80 | 4 | 14 | 77 | 87 |
| 62 | mwsqn | 85824590 | 11 | M | 89 | 95 | 4 | 34 | 94 | 21 |
| 65 | yxchm | 48696307 | 9 | M | 33 | 54 | 5 | 39 | 94 | 99 |
| 61 | rulsy | 40694890 | 12 | M | 4 | 8 | 6 | 81 | 62 | 46 |
# descending
df_grades.sort_values('Test-2', ascending=False).head()
| Name | ID | Grade | Gender | HW | Test-1 | Test-2 | Test-3 | Test-4 | Final | |
|---|---|---|---|---|---|---|---|---|---|---|
| 16 | abqln | 59927557 | 11 | F | 26 | 97 | 100 | 46 | 94 | 78 |
| 98 | oesms | 36171468 | 10 | M | 88 | 16 | 100 | 62 | 76 | 19 |
| 74 | xohww | 81223224 | 9 | M | 32 | 24 | 100 | 81 | 59 | 73 |
| 90 | wxtpu | 98301627 | 12 | M | 44 | 57 | 99 | 87 | 9 | 62 |
| 25 | tkhtm | 56044667 | 12 | F | 20 | 67 | 97 | 13 | 73 | 41 |
duplicated()#
It returns a Series with boolean values showing if a row is a repeat or not.
state_height_dic = {'State':['NY', 'OK', 'TX', 'TX'], 'Height':[160, 180, 140, 140]}
df_sh = pd.DataFrame(state_height_dic, index=['John', 'Liz', 'Tim', 'Tim'])
df_sh
| State | Height | |
|---|---|---|
| John | NY | 160 |
| Liz | OK | 180 |
| Tim | TX | 140 |
| Tim | TX | 140 |
df_sh.duplicated()
John False
Liz False
Tim False
Tim True
dtype: bool
drop_duplicates()#
It returns a DataFrame with duplicate rows removed.
df_sh.drop_duplicates()
| State | Height | |
|---|---|---|
| John | NY | 160 |
| Liz | OK | 180 |
| Tim | TX | 140 |