Importing and Storing Data

Importing and Storing Data#

In this section, we will explore how to import comma-separated values (CSV) and Excel (XLSX) files into a DataFrame, as well as how to export DataFrames back to CSV or Excel formats using the Pandas library. We will also review the various parameters available to customize the import and export process effectively.

Since we will be using methods from the Pandas library, we first need to import it using:

import pandas as pd

read_csv()#

The read_csv() function reads a comma-separated values (CSV) file and loads it into a DataFrame.

The file path is provided as a string inside the read_csv() function, including the file name and its extension.
In the following code, the CSV file person_info_v1.csv is located in the sample_files folder.

pd.read_csv('sample_files/person_info_v1.csv')

	Name	Birthdate	State	Age	Weight
0	Amy	10/25/1972	NY	34	150
1	Joe	2/5/2010	NJ	21	175
2	Mia	10/23/2015	AZ	67	130
3	Amy	11/6/1983	FL	19	145
4	Liz	4/19/1994	CA	29	150
5	John	8/29/2023	OH	34	170

index_col#

Specifies which column to use as the index of the DataFrame, using the column’s index number or label.

df = pd.read_csv('sample_files/person_info_v1.csv', index_col=0)
df

	Birthdate	State	Age	Weight
Name
Amy	10/25/1972	NY	34	150
Joe	2/5/2010	NJ	21	175
Mia	10/23/2015	AZ	67	130
Amy	11/6/1983	FL	19	145
Liz	4/19/1994	CA	29	150
John	8/29/2023	OH	34	170

parse_dates#

Specifies columns to be parsed as Timestamp objects during import.

The values in the Date column are currently stored as strings.

df.loc['Joe', 'Birthdate']

'2/5/2010'

In the following code, Date column values imported as a Timestamp object.

df = pd.read_csv('sample_files/person_info_v1.csv', index_col=0, parse_dates=['Birthdate'])
df

	Birthdate	State	Age	Weight
Name
Amy	1972-10-25	NY	34	150
Joe	2010-02-05	NJ	21	175
Mia	2015-10-23	AZ	67	130
Amy	1983-11-06	FL	19	145
Liz	1994-04-19	CA	29	150
John	2023-08-29	OH	34	170

df.loc['Joe', 'Birthdate']

Timestamp('2010-02-05 00:00:00')

type(df.loc['Joe', 'Birthdate'])

pandas._libs.tslibs.timestamps.Timestamp

header#

The person_info_v2.csv file contains some unnecessary information in certain cells. As a result, importing the entire file may lead to many missing values.

pd.read_csv('sample_files/person_info_v2.csv')

	Name	Unnamed: 1	Person Info	Unnamed: 3	Unnamed: 4	Unnamed: 5	Unnamed: 6
0	Birthdate	NaN	N	D	S	A	W
1	State	NaN	NaN	NaN	NaN	NaN	NaN
2	Age	NaN	NaN	NaN	NaN	NaN	NaN
3	Weight	NaN	NaN	NaN	NaN	NaN	NaN
4	NaN	NaN	Name	Birthdate	State	Age	Weight
5	NaN	NaN	Amy	10/25/1972	NY	34	150
6	NaN	NaN	Joe	2/5/2010	NJ	21	175
7	NaN	NaN	Mia	10/23/2015	AZ	67	130
8	NaN	NaN	Amy	11/6/1983	FL	19	145
9	NaN	NaN	Liz	4/19/1994	CA	29	150
10	NaN	NaN	John	8/29/2023	OH	34	170

To avoid this, we should import only the relevant data, as demonstrated in the previous section.

First, the header parameter is used to skip the first 5 rows and treat the sixth row (index 5) as the header.

pd.read_csv('sample_files/person_info_v2.csv',header=5)

	Unnamed: 0	Unnamed: 1	Name	Birthdate	State	Age	Weight
0	NaN	NaN	Amy	10/25/1972	NY	34	150
1	NaN	NaN	Joe	2/5/2010	NJ	21	175
2	NaN	NaN	Mia	10/23/2015	AZ	67	130
3	NaN	NaN	Amy	11/6/1983	FL	19	145
4	NaN	NaN	Liz	4/19/1994	CA	29	150
5	NaN	NaN	John	8/29/2023	OH	34	170

The index_col parameter is used to exclude the first two columns and setting third one (index 2) as the index.

pd.read_csv('sample_files/person_info_v2.csv',header=5, index_col=2)

	Unnamed: 0	Unnamed: 1	Birthdate	State	Age	Weight
Name
Amy	NaN	NaN	10/25/1972	NY	34	150
Joe	NaN	NaN	2/5/2010	NJ	21	175
Mia	NaN	NaN	10/23/2015	AZ	67	130
Amy	NaN	NaN	11/6/1983	FL	19	145
Liz	NaN	NaN	4/19/1994	CA	29	150
John	NaN	NaN	8/29/2023	OH	34	170

The iloc indexer can be used to exclude the first two unnecessary columns by selecting only the desired columns based on their integer positions.

pd.read_csv('sample_files/person_info_v2.csv',header=5, index_col=2).iloc[:,2:]

	Birthdate	State	Age	Weight
Name
Amy	10/25/1972	NY	34	150
Joe	2/5/2010	NJ	21	175
Mia	10/23/2015	AZ	67	130
Amy	11/6/1983	FL	19	145
Liz	4/19/1994	CA	29	150
John	8/29/2023	OH	34	170

usecols#

This parameter is used to specify which columns should be imported from the file.

pd.read_csv('sample_files/person_info_v2.csv',header=5, index_col=0, 
            usecols=['Name', 'Birthdate', 'State', 'Age', 'Weight'])

	Birthdate	State	Age	Weight
Name
Amy	10/25/1972	NY	34	150
Joe	2/5/2010	NJ	21	175
Mia	10/23/2015	AZ	67	130
Amy	11/6/1983	FL	19	145
Liz	4/19/1994	CA	29	150
John	8/29/2023	OH	34	170

read_excel()#

It is similar to the read_csv() method but is used to import data from Excel files.

to_csv()#

The to_csv() method is used to write a DataFrame to a comma-separated values (CSV) file.

The file path is passed as a string to the to_csv() function, including the file name and its extension.

In the following code, the DataFrame df is saved as a CSV file named countries.csv inside the sample_files folder.

df = pd.DataFrame( index=['France', 'Germany', 'North America', 'Japan'],
                 columns=['Continent', 'Company'],
                 data = [['Europe', 'Carrefour'], ['Europe', 'Siemens'],
                         ['USA', 'Apple'],
                         ['Asia', 'Sony']])
df

	Continent	Company
France	Europe	Carrefour
Germany	Europe	Siemens
North America	USA	Apple
Japan	Asia	Sony

df.to_csv('sample_files/countries.csv')

A screenshot of an Excel file containing index, 2 columns and 4 rows of data.

index#

Set the index parameter False to prevent the DataFrame’s index from being written to the output file.

df.to_csv('sample_files/countries.csv', index=False)

A screenshot of an Excel file containing 2 columns and 4 rows of data.

columns#

Specifies which columns of the DataFrame should be written to the output file.

df.to_csv('sample_files/countries.csv', columns=['Company'])

A screenshot of an Excel file containing 2 columns and 4 rows of data.

header#

Specifies whether to write column labels to the output file. It can also accept a list of custom labels.

df.to_csv('sample_files/countries.csv', header=False)

A screenshot of an Excel file containing 2 columns and 4 rows of data.

df.to_csv('sample_files/countries.csv', header=['CONTINENT', 'COMPANY'])

A screenshot of an Excel file containing 2 columns and 4 rows of data.

to_excel()#

It is similar to the to_csv() method but is used to save data as an Excel files.

Importing and Storing Data

Contents

Importing and Storing Data#

read_csv()#

index_col#

parse_dates#

header#

usecols#

read_excel()#

to_csv()#

index#

columns#

header#

to_excel()#