# Stock Data Preparation
![](title_pict/stock_data_preparartion2.png)

In [2]:
import yfinance as yf
import pandas as pd
import matplotlib.pyplot as plt

There are several ways to import stock data for a specific period using a stock's symbol. One of the most popular tools is [yfinance](https://pypi.org/project/yfinance/), a Python module that conveniently fetches historical stock price data from Yahoo Finance.

## Description of Data
- Each stock has a unique symbol of up to five characters, which may include letters, '.', and '-'.

**Examples:**
|Company|Symbol|
|:--:|:--:|
|Alphabet|GOOGL|
|Amazon|AMZN|
|Apple|AAPL|
|Visa|V|
|Allstate|ALL|
|Tesla|TSLA|
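
The symbol rule stated above can be sketched as a small check. The helper name `looks_like_symbol` and the pattern are illustrative assumptions, not part of yfinance, and the check says nothing about whether a symbol is actually listed.

```python
import re

# Hypothetical helper: encodes only the rule described in this section
# (1 to 5 characters drawn from letters, '.', and '-').
SYMBOL_PATTERN = re.compile(r"[A-Za-z.\-]{1,5}")


def looks_like_symbol(text: str) -> bool:
    """Return True if text fits the simple symbol rule; does not verify existence."""
    return SYMBOL_PATTERN.fullmatch(text) is not None


for candidate in ["AAPL", "V", "BRK-B", "TOOLONG", "A1"]:
    print(candidate, looks_like_symbol(candidate))
```

A check like this only catches obvious typos before a request is made; a well-formed symbol can still return no data.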

- The symbol is used to import historical data.
- The `history()` method returns a DataFrame with the date as index and 7 columns:
    - *Open*  : The initial price of the stock at the beginning of the day
    - *High*  : The highest price of the stock during the day
    - *Low*   : The lowest price of the stock during the day
    - *Close* : The final price of the stock at the end of the day
    - *Volume*: The number of shares traded during the day
    - *Dividends*: The share of company earnings distributed to its investors
    - *Stock Splits*: The subdivision of each share of stock into a fixed number of new shares


- The `history()` method by default returns data for the business days of the last month.
    - Some dates may be missing, representing days when the market is closed.
    - The index includes dates and times.
    - The *start* and *end* parameters allow access to data within a specific range; in yfinance the *end* date itself is excluded.
        - Dates should be in the format 'YEAR-MONTH-DAY', where the month is numerical.
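
As a minimal sketch of the date format described above, the strings can be built from `datetime.date` objects instead of being typed by hand. The variable names and the chosen range are only examples.

```python
from datetime import date

# Example range only; any valid dates would do.
start = date(1995, 1, 1).isoformat()
end = date(2000, 12, 31).isoformat()

# Both strings are in numeric YEAR-MONTH-DAY form and could be passed
# as the start and end arguments of history().
print(start, end)
```

Building the strings this way rejects impossible dates such as February 30 at construction time, before any request is sent.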

## Import Data

- By default, `history()` returns only the last month of historical stock price data.

In [36]:
df = yf.Ticker('AAPL').history()
df.head().round(2)

Date,Open,High,Low,Close,Volume,Dividends,Stock Splits
2024-10-25 00:00:00-04:00,229.49,232.96,229.32,231.16,38802300,0.0,0.0
2024-10-28 00:00:00-04:00,233.06,234.47,232.29,233.14,36087100,0.0,0.0
2024-10-29 00:00:00-04:00,232.84,234.07,232.06,233.41,35417200,0.0,0.0
2024-10-30 00:00:00-04:00,232.35,233.21,229.3,229.85,47070900,0.0,0.0
2024-10-31 00:00:00-04:00,229.09,229.58,225.12,225.66,64370100,0.0,0.0


- Data is available only for business days when the stock market is open.

In [41]:
# data of 22 days
df.shape

(22, 7)
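
The row count can be sanity-checked by counting weekdays in the window. This is a rough sketch using only the standard library: `count_weekdays` is a hypothetical helper, the last date is an assumption (only the first rows are shown above), and market holidays are ignored.

```python
from datetime import date, timedelta


def count_weekdays(first: date, last: date) -> int:
    """Count Monday-to-Friday days from first to last, both inclusive."""
    total = 0
    day = first
    while day <= last:
        if day.weekday() < 5:  # 0-4 are Monday to Friday
            total += 1
        day += timedelta(days=1)
    return total


# 2024-10-25 is the first row shown above; 2024-11-25 is an assumed last row.
print(count_weekdays(date(2024, 10, 25), date(2024, 11, 25)))
```

When a market holiday falls inside the window, the DataFrame has fewer rows than this weekday count.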

- Setting `period='max'` returns all available data for a stock.

In [8]:
df = yf.Ticker('AAPL').history(period='max')
df.head().round(2)

Date,Open,High,Low,Close,Volume,Dividends,Stock Splits
1980-12-12 00:00:00-05:00,0.098834,0.099264,0.098834,0.098834,469033600,0.0,0.0
1980-12-15 00:00:00-05:00,0.094108,0.094108,0.093678,0.093678,175884800,0.0,0.0
1980-12-16 00:00:00-05:00,0.087232,0.087232,0.086802,0.086802,105728000,0.0,0.0
1980-12-17 00:00:00-05:00,0.088951,0.089381,0.088951,0.088951,86441600,0.0,0.0
1980-12-18 00:00:00-05:00,0.09153,0.091959,0.09153,0.09153,73449600,0.0,0.0


- Passing `start` and `end` in the form 'YEAR-MONTH-DAY' returns daily data between '1995-1-1' and '2000-12-31'.

In [10]:
df = yf.Ticker('AAPL').history(start='1995-1-1', end='2000-12-31')
df.head().round(2)

Date,Open,High,Low,Close,Volume,Dividends,Stock Splits
1995-01-03 00:00:00-05:00,0.289489,0.289489,0.282043,0.285766,103868800,0.0,0.0
1995-01-04 00:00:00-05:00,0.287627,0.295074,0.287627,0.293213,158681600,0.0,0.0
1995-01-05 00:00:00-05:00,0.292281,0.293213,0.288558,0.289489,73640000,0.0,0.0
1995-01-06 00:00:00-05:00,0.309967,0.321138,0.306244,0.31276,1076622400,0.0,0.0
1995-01-09 00:00:00-05:00,0.309968,0.311829,0.305313,0.306826,274086400,0.0,0.0


- Remove the last two columns (*Dividends* and *Stock Splits*).

In [12]:
df = yf.Ticker('AAPL').history(start='1995-1-1', end='2000-12-31').iloc[:,:-2]
df.head().round(2)

Date,Open,High,Low,Close,Volume
1995-01-03 00:00:00-05:00,0.289489,0.289489,0.282043,0.285766,103868800
1995-01-04 00:00:00-05:00,0.287627,0.295074,0.287627,0.293213,158681600
1995-01-05 00:00:00-05:00,0.292281,0.293213,0.288558,0.289489,73640000
1995-01-06 00:00:00-05:00,0.309968,0.321138,0.306244,0.31276,1076622400
1995-01-09 00:00:00-05:00,0.309967,0.311829,0.305313,0.306826,274086400
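
Positional slicing depends on the column order staying the same. As an alternative sketch, the same two columns can be dropped by name; the data below is made up purely for illustration and is not real price data.

```python
import pandas as pd

# Made-up numbers, used only to show the column handling.
sample = pd.DataFrame({
    "Open": [1.0, 1.1], "High": [1.2, 1.3], "Low": [0.9, 1.0],
    "Close": [1.1, 1.2], "Volume": [100, 200],
    "Dividends": [0.0, 0.0], "Stock Splits": [0.0, 0.0],
})

# Drop by name; errors="ignore" skips any listed column that is absent.
prices = sample.drop(columns=["Dividends", "Stock Splits"], errors="ignore")
print(list(prices.columns))
```

Dropping by name keeps working if extra columns appear or the order changes, whereas `iloc[:, :-2]` would silently remove whichever two columns happen to be last.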


## Remove Time
In this part, we remove the time from the index dates.

In [14]:
# Reset the index and set the previous index as the Date column.
df.reset_index(inplace=True)
df.head().round(2)

,Date,Open,High,Low,Close,Volume
0,1995-01-03 00:00:00-05:00,0.289489,0.289489,0.282043,0.285766,103868800
1,1995-01-04 00:00:00-05:00,0.287627,0.295074,0.287627,0.293213,158681600
2,1995-01-05 00:00:00-05:00,0.292281,0.293213,0.288558,0.289489,73640000
3,1995-01-06 00:00:00-05:00,0.309968,0.321138,0.306244,0.31276,1076622400
4,1995-01-09 00:00:00-05:00,0.309967,0.311829,0.305313,0.306826,274086400


In [15]:
# Use the .dt.date accessor to keep only the date part of each timestamp and assign it back to the 'Date' column.
df['Date'] = df['Date'].dt.date
df.head().round(2)

,Date,Open,High,Low,Close,Volume
0,1995-01-03,0.289489,0.289489,0.282043,0.285766,103868800
1,1995-01-04,0.287627,0.295074,0.287627,0.293213,158681600
2,1995-01-05,0.292281,0.293213,0.288558,0.289489,73640000
3,1995-01-06,0.309968,0.321138,0.306244,0.31276,1076622400
4,1995-01-09,0.309967,0.311829,0.305313,0.306826,274086400
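
An alternative sketch keeps the dates in the index as pandas timestamps instead of converting them to Python `date` objects, which may be convenient for later date-based slicing. The prices and the `America/New_York` timezone below are assumptions chosen to resemble the output above.

```python
import pandas as pd

# Made-up prices on a timezone-aware index shaped like the one shown above.
idx = pd.DatetimeIndex(
    ["1995-01-03", "1995-01-04"], tz="America/New_York", name="Date"
)
sample = pd.DataFrame({"Close": [0.29, 0.29]}, index=idx)

# Drop the timezone, then reset any time component to midnight.
sample.index = sample.index.tz_localize(None).normalize()
print(sample.index)
```

With this approach the index stays a `DatetimeIndex`, so partial-string selection such as `sample.loc['1995-01']` remains available.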
