DataFrames Attributes and Methods

DataFrames Attributes and Methods#

Section Title: DataFrames Attributes and Methods

import pandas as pd

In this section, we will cover basic attributes and methods of dataframes.

Throughout this section, we’ll be using the following three dataframes.

state_height_dic = {'State':['NY', 'OK', 'TX'], 'Height':[160, 180, 140]}
df_sh = pd.DataFrame(state_height_dic, index=['John', 'Liz', 'Tim'])
df_sh

	State	Height
John	NY	160
Liz	OK	180
Tim	TX	140

df_grades = pd.read_csv('https://raw.githubusercontent.com/datasmp/datasets/main/grades.csv')
df_grades.head()

	Name	ID	Grade	Gender	HW	Test-1	Test-2	Test-3	Test-4	Final
0	mqtvy	37047871	10	M	30	91	69	93	17	50
1	jbbsx	35439616	11	F	6	18	93	9	98	91
2	mrvab	35543247	11	M	78	92	60	43	34	26
3	bjyve	61282135	9	M	60	8	10	99	80	87
4	rlpsr	53448034	10	M	3	38	45	43	79	69

df_stock = pd.read_excel('https://raw.githubusercontent.com/datasmp/datasets/main/stock.xlsx')
df_stock.head()

	Date	APPLE	TESLA	AMAZON	VISA	SP500
0	2020-01-02	74.33	86.05	1898.01	189.66	3257.85
1	2020-01-03	73.61	88.60	1874.97	188.15	3234.85
2	2020-01-06	74.20	90.31	1902.88	187.74	3246.28
3	2020-01-07	73.85	93.81	1906.86	187.24	3237.18
4	2020-01-08	75.04	98.43	1891.97	190.45	3253.05

Attributes#

Attributes provide information about the dataframe.

shape and len()#

shape returns the tuple (number of rows, number of columns).
len() returns the number of rows.

# 3 rows and 2 columns
df_sh.shape

(3, 2)

# 3 rows
len(df_sh)

# 100 rows and 10 columns
df_grades.shape

(100, 10)

# 100 rows 
len(df_grades)

index and columns#

index returns row labels.
columns returns column labels.

# row labels
df_sh.index

Index(['John', 'Liz', 'Tim'], dtype='object')

# column labels
df_sh.columns

Index(['State', 'Height'], dtype='object')

values#

The values attribute removes labels and returns an array.

df_sh.values

array([['NY', 160],
       ['OK', 180],
       ['TX', 140]], dtype=object)

Methods#

head() and tail()#

head(n) displays the first n rows.
tail(n) displays the last n rows.
The default value of \(n\) is 5.

# first 5 rows
df_grades.head()

	Name	ID	Grade	Gender	HW	Test-1	Test-2	Test-3	Test-4	Final
0	mqtvy	37047871	10	M	30	91	69	93	17	50
1	jbbsx	35439616	11	F	6	18	93	9	98	91
2	mrvab	35543247	11	M	78	92	60	43	34	26
3	bjyve	61282135	9	M	60	8	10	99	80	87
4	rlpsr	53448034	10	M	3	38	45	43	79	69

# first 7 rows
df_grades.head(7)

	Name	ID	Grade	Gender	HW	Test-1	Test-2	Test-3	Test-4	Final
0	mqtvy	37047871	10	M	30	91	69	93	17	50
1	jbbsx	35439616	11	F	6	18	93	9	98	91
2	mrvab	35543247	11	M	78	92	60	43	34	26
3	bjyve	61282135	9	M	60	8	10	99	80	87
4	rlpsr	53448034	10	M	3	38	45	43	79	69
5	hotxq	96808584	9	F	29	18	61	17	72	39
6	ylskw	37901138	9	M	53	85	61	87	9	94

# last 5 rows
df_grades.tail()

	Name	ID	Grade	Gender	HW	Test-1	Test-2	Test-3	Test-4	Final
95	zxuaq	58332078	12	M	57	1	48	61	58	99
96	vegye	10019910	11	F	88	30	48	85	48	17
97	euwlj	27967150	12	F	70	15	72	77	20	65
98	oesms	36171468	10	M	88	16	100	62	76	19
99	buugr	88783139	10	M	23	74	14	17	40	59

# last 7 rows
df_grades.tail(7)

	Name	ID	Grade	Gender	HW	Test-1	Test-2	Test-3	Test-4	Final
93	yzljj	10884705	10	M	92	75	41	33	61	66
94	jloze	30408806	10	M	19	24	73	98	48	95
95	zxuaq	58332078	12	M	57	1	48	61	58	99
96	vegye	10019910	11	F	88	30	48	85	48	17
97	euwlj	27967150	12	F	70	15	72	77	20	65
98	oesms	36171468	10	M	88	16	100	62	76	19
99	buugr	88783139	10	M	23	74	14	17	40	59

info()#

The info() method provides basic information about the DataFrame, including:

row labels: RangeIndex: 100 entries, 0 to 99
column labels: ‘Name’, ‘ID’, ‘Grade’, ‘Gender’, ‘HW’, ‘Test-1’, ‘Test-2’, ‘Test-3’, ‘Test-4’, ‘Final’
number of columns: 10
Non-null (non-missing) values: 100 for each column
dtype: object (strings), int64
memory usage: 7.9+ KB

df_grades.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    100 non-null    object
 1   ID      100 non-null    int64 
 2   Grade   100 non-null    int64 
 3   Gender  100 non-null    object
 4   HW      100 non-null    int64 
 5   Test-1  100 non-null    int64 
 6   Test-2  100 non-null    int64 
 7   Test-3  100 non-null    int64 
 8   Test-4  100 non-null    int64 
 9   Final   100 non-null    int64 
dtypes: int64(8), object(2)
memory usage: 7.9+ KB

df_grades.columns

Index(['Name', 'ID', 'Grade', 'Gender', 'HW', 'Test-1', 'Test-2', 'Test-3',
       'Test-4', 'Final'],
      dtype='object')

describe()#

It returns the descriptive statistics of the columns of the DataFrame, including:

Count
Mean
Standard deviation (std)
Minimum (min)
25th percentile
50th percentile (median)
75th percentile
Maximum (max)

df_stock.describe()

	APPLE	TESLA	AMAZON	VISA	SP500
count	252.000000	252.000000	252.000000	252.000000	252.000000
mean	94.753056	288.347579	2678.569087	192.986984	3215.720079
std	21.836181	167.278855	545.632095	15.541861	318.045077
min	55.660000	72.240000	1676.610000	134.900000	2237.400000
25%	76.622500	149.745000	2146.120000	187.957500	3034.530000
50%	90.960000	232.830000	2884.500000	196.205000	3275.360000
75%	115.350000	421.035000	3164.790000	203.785000	3403.050000
max	136.490000	695.000000	3531.450000	218.020000	3735.360000

value_counts()#

It returns the number of occurrences for each value of the column.

# There are 54 occurrences of 'M' and 46 occurrences of 'F' in the column Gender.
df_grades['Gender'].value_counts()

Gender
M    54
F    46
Name: count, dtype: int64

# There are 27 occurrences of 'nine', 26 occurrences of 'eleven', 24 occurrences of 'twelve', 
# and 23 occurrences of 'ten' in the Grade column.
df_grades['Grade'].value_counts()

Grade
9     27
11    26
12    24
10    23
Name: count, dtype: int64

copy()#

The copy() method returns a copy of the DataFrame.

df_sh_copy = df_sh.copy()
df_sh_copy

	State	Height
John	NY	160
Liz	OK	180
Tim	TX	140

drop()#

The drop() method is used to remove column(s) and row(s) from the DataFrame.

axis=0 means that the given row(s) are dropped (default).
axis=1 means that the given column(s) are dropped.

# row with label 'Liz' is removed
df_sh.drop('Liz', axis=0)

	State	Height
John	NY	160
Tim	TX	140

# column with label 'Height' is removed
df_sh.drop('Height', axis=1)

	State
John	NY
Liz	OK
Tim	TX

By default, the drop() method does not change the DataFrame.
It just displays how the DataFrame would look if the dropping were performed.

# no change on df_sh
df_sh

	State	Height
John	NY	160
Liz	OK	180
Tim	TX	140

To make a change to the DataFrame, the parameter inplace should be set to True, which is False by default.

# no 'State' column any more
df_sh.drop('Height', axis=1, inplace=True)  

df_sh

	State
John	NY
Liz	OK
Tim	TX

# no 'Tim' row any more
df_sh.drop('Tim', inplace=True) #default value of axis=0  

df_sh

	State
John	NY
Liz	OK

reset_index()#

The reset_index() method is used to reset the row labels of a DataFrame to default values: \(0, 1, 2, ...\)

It also adds the initial row labels as a new column with label index.

df_sh.reset_index()

	index	State
0	John	NY
1	Liz	OK

If you do not want to keep the initial index values, you can set the drop parameter to True.

# no index column
df_sh.reset_index(drop=True)

	State
0	NY
1	OK

reset_index() does not change the DataFrame unless inplace=True.

# no change on row labels
df_sh

	State
John	NY
Liz	OK

# row labels are changed
df_sh.reset_index(inplace=True)

df_sh

	index	State
0	John	NY
1	Liz	OK

set_index()#

The set_index() method is used to set the DataFrame row labels to an existing column.

The label of the column that becomes the new index is preserved for future use.

# Use 'State' column values as new row labels 
df_sh.set_index('State')

	index
State
NY	John
OK	Liz

set_index() does not change the DataFrame unless inplace=True

# no change on row labels
df_sh

	index	State
0	John	NY
1	Liz	OK

# row labels are changed
df_sh.set_index('State', inplace=True)

df_sh

	index
State
NY	John
OK	Liz

Let’s update the index of df_stock using the Date column.

df_stock.set_index('Date', inplace=True)

df_stock.head()

	APPLE	TESLA	AMAZON	VISA	SP500
Date
2020-01-02	74.33	86.05	1898.01	189.66	3257.85
2020-01-03	73.61	88.60	1874.97	188.15	3234.85
2020-01-06	74.20	90.31	1902.88	187.74	3246.28
2020-01-07	73.85	93.81	1906.86	187.24	3237.18
2020-01-08	75.04	98.43	1891.97	190.45	3253.05

sum(), mean(), median(), std()#

When axis=0, computation is column-wise (vertically). This means that for each column, the operation is performed on different row values.

This is the default value.

When axis=1, computation is row-wise (horizontally). This means that for each row, the operation is performed on different column values.

# axis=0
df_stock.sum(0)    # sum of values in each column

APPLE      23877.77
TESLA      72663.59
AMAZON    674999.41
VISA       48632.72
SP500     810361.46
dtype: float64

# axis=1
df_stock.sum(1)    # sum of values in each row

Date
2020-01-02    5505.90
2020-01-03    5460.18
2020-01-06    5501.41
2020-01-07    5498.94
2020-01-08    5508.94
               ...   
2020-12-23    7857.00
2020-12-24    7877.67
2020-12-28    8031.80
2020-12-29    8063.74
2020-12-30    8064.21
Length: 252, dtype: float64

# mean of each column
df_stock.mean()

APPLE       94.753056
TESLA      288.347579
AMAZON    2678.569087
VISA       192.986984
SP500     3215.720079
dtype: float64

# median of each column
df_stock.median()

APPLE       90.960
TESLA      232.830
AMAZON    2884.500
VISA       196.205
SP500     3275.360
dtype: float64

# standard deviation of each column
df_stock.std()

APPLE      21.836181
TESLA     167.278855
AMAZON    545.632095
VISA       15.541861
SP500     318.045077
dtype: float64

rename()#

The rename() method is used to rename column or row labels.

df_sh

	index
State
NY	John
OK	Liz

Let’s rename the index column as “Name”.

df_sh.rename(columns={'index':'Name'})

	Name
State
NY	John
OK	Liz

# no change on df_sh
df_sh

	index
State
NY	John
OK	Liz

rename() does not change the DataFrame unless inplace=True.

# 'index' --> 'Name'
df_sh.rename(columns={'index':'Name'}, inplace=True)

df_sh

	Name
State
NY	John
OK	Liz

sort_values()#

It sorts the dataframe based on the provided column in ascending or descending order.

By default, sorting is done in ascending order.
To sort in descending order, set the ascending parameter to False.

# ascending
df_grades.sort_values('Test-2').head()

	Name	ID	Grade	Gender	HW	Test-1	Test-2	Test-3	Test-4	Final
12	jjgag	46794501	11	F	8	43	1	7	82	36
19	cxkpb	71808293	10	M	87	80	4	14	77	87
62	mwsqn	85824590	11	M	89	95	4	34	94	21
65	yxchm	48696307	9	M	33	54	5	39	94	99
61	rulsy	40694890	12	M	4	8	6	81	62	46

# descending
df_grades.sort_values('Test-2', ascending=False).head()

	Name	ID	Grade	Gender	HW	Test-1	Test-2	Test-3	Test-4	Final
16	abqln	59927557	11	F	26	97	100	46	94	78
98	oesms	36171468	10	M	88	16	100	62	76	19
74	xohww	81223224	9	M	32	24	100	81	59	73
90	wxtpu	98301627	12	M	44	57	99	87	9	62
25	tkhtm	56044667	12	F	20	67	97	13	73	41

duplicated()#

It returns a Series with boolean values showing if a row is a repeat or not.

state_height_dic = {'State':['NY', 'OK', 'TX', 'TX'], 'Height':[160, 180, 140, 140]}
df_sh = pd.DataFrame(state_height_dic, index=['John', 'Liz', 'Tim', 'Tim'])
df_sh

	State	Height
John	NY	160
Liz	OK	180
Tim	TX	140
Tim	TX	140

df_sh.duplicated()

John    False
Liz     False
Tim     False
Tim      True
dtype: bool

drop_duplicates()#

It returns a DataFrame with duplicate rows removed.

df_sh.drop_duplicates()

	State	Height
John	NY	160
Liz	OK	180
Tim	TX	140