Question-1: Threshold

import yfinance as yf
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statistics
from sklearn.metrics import accuracy_score

Title

Threshold-Based Pattern Mining for Stock Market Candlestick Analysis

Abstract

Pattern mining is an essential method for uncovering meaningful patterns in historical stock data. Among the available techniques, candlestick analysis is widely used: each day's price movement (open, high, low, and close) is captured in a coded representation. However, small, insignificant differences in these values can change the pattern code and, consequently, the predictions derived from it.

This project introduces a threshold-based approach to encoding and pattern mining for candlesticks that mitigates the influence of minor fluctuations. By tuning a threshold that excludes minimal price differences, we aim to improve pattern reliability and maximize returns. Performance is evaluated through backtesting, with a focus on finding the threshold that gives the most accurate stock price forecasts.

Data

def get_training_test_data(stock='AMZN', start='2019-1-1', end='2021-1-31', training_ratio=0.96):
    df = yf.Ticker(stock).history(start=start, end=end)
    # Keep only the price columns (drops Volume, Dividends, Stock Splits)
    df = df[['Open', 'High', 'Low', 'Close']].reset_index()
    df['Date'] = df['Date'].dt.date
    # fcc: sign of the next day's close minus today's close (NaN on the last row)
    df['fcc'] = np.sign(df['Close'].shift(-1) - df['Close'])
    # Chronological split: the first training_ratio of the rows train, the rest test
    training_length = int(len(df) * training_ratio)
    # Copy the slices so later edits do not trigger SettingWithCopyWarning
    training_data = df.iloc[:training_length].copy()
    test_data = df.iloc[training_length:].reset_index(drop=True)
    return training_data, test_data

df_train, df_test = get_training_test_data()
df_train.shape, df_test.shape

((503, 6), (21, 6))
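
As a rough illustration of the two steps inside `get_training_test_data`, the sketch below rebuilds the `fcc` label and the split size in plain Python. The helper name `next_day_direction` and the closing prices are made up for this example; only the sign-of-next-close rule, the `training_ratio` of 0.96, and the 503 + 21 row counts come from the cells above.

```python
import math

def next_day_direction(closes):
    """Sign of (next close - current close) per row; NaN for the last row."""
    labels = []
    for today, tomorrow in zip(closes, closes[1:]):
        diff = tomorrow - today
        labels.append(float((diff > 0) - (diff < 0)))  # +1.0, 0.0 or -1.0
    return labels + [math.nan]  # the last row has no next day

closes = [10.0, 12.0, 11.0, 11.0, 13.0]  # made-up closing prices
labels = next_day_direction(closes)
print(labels)

# Chronological split with the default training_ratio of 0.96
n_rows = 503 + 21  # total rows implied by the shapes printed above
training_length = int(n_rows * 0.96)
print(training_length, n_rows - training_length)
```

Because the split is chronological, the test rows are always the most recent ones, so no future prices leak into the training set.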

Encoding with Threshold

Each candlestick is encoded from the lengths of its lower shadow (\(l1\)), body (\(l2\)), and upper shadow (\(l3\)). Any segment whose length does not exceed the threshold, taken as a fraction of the closing price, is disregarded.

For example, take a red candlestick with code ‘a’. If \(l1 = close - low\) is smaller than the threshold, \(l1\) is ignored and set to 0, which changes the candlestick’s code to ‘d’.
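
The arithmetic behind that example can be sketched with one hypothetical red candlestick. The prices below are invented for illustration, and the 1% threshold is only an assumed setting; the segment definitions follow the text above.

```python
# Hypothetical red candlestick (open > close); prices are illustrative only
op, hp, lp, cp = 102.0, 104.0, 99.5, 100.0
threshold = 0.01 * cp  # assumed threshold: 1% of the close

l1 = min(op, cp) - lp  # lower shadow: 0.5
l2 = abs(op - cp)      # body: 2.0
l3 = hp - max(op, cp)  # upper shadow: 2.0

# A segment counts only if it is longer than the threshold
kept = {
    "lower shadow": l1 > threshold,
    "body": l2 > threshold,
    "upper shadow": l3 > threshold,
}
print(kept)
```

Here only the lower shadow falls below the threshold, so a candle that would otherwise be coded ‘a’ is coded ‘d’.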

def encoder_threshold(hp, op, cp, lp, threshold_pct=0.01):
    # Segments no longer than this fraction of the close price are ignored
    threshold = threshold_pct * cp

    l1 = min(op, cp) - lp            # lower shadow
    l2 = max(op, cp) - min(op, cp)   # body
    l3 = hp - max(op, cp)            # upper shadow

    # True when a segment is long enough to count
    lower, body, upper = l1 > threshold, l2 > threshold, l3 > threshold
    red = op > cp  # open above close

    if lower and body and upper:
        return 'a' if red else 'e'
    elif body and upper:       # lower shadow ignored
        return 'd' if red else 'h'
    elif lower and body:       # upper shadow ignored
        return 'b' if red else 'f'
    elif body:                 # both shadows ignored
        return 'c' if red else 'g'
    elif lower and upper:      # body ignored
        return 'i'
    elif lower:                # only the lower shadow counts
        return 'j'
    elif upper:                # only the upper shadow counts
        return 'l'
    else:                      # every segment ignored
        return 'k'

def df_encoder_threshold(data, threshold_pct=0.01):
    data_ = data.copy()
    # Encode each row from its High, Open, Close and Low prices
    data_['code'] = [
        encoder_threshold(hp, op, cp, lp, threshold_pct)
        for hp, op, cp, lp in zip(data_['High'], data_['Open'], data_['Close'], data_['Low'])
    ]
    return data_

df_train_ext = df_encoder_threshold(df_train, threshold_pct=0.03)
df_train_ext.head()

	Date	Open	High	Low	Close	fcc	code
0	2019-01-02	73.260002	77.667999	73.046501	76.956497	-1.0	g
1	2019-01-03	76.000504	76.900002	74.855499	75.014000	1.0	k
2	2019-01-04	76.500000	79.699997	75.915497	78.769501	1.0	k
3	2019-01-07	80.115501	81.727997	79.459503	81.475502	1.0	k
4	2019-01-08	83.234497	83.830498	80.830498	82.829002	1.0	k

df_train_ext.code.value_counts()

code
k    466
g     14
c     12
j      6
l      4
d      1
Name: count, dtype: int64

df_train_ext = df_encoder_threshold(df_train, threshold_pct=0.01)
df_train_ext.head()

	Date	Open	High	Low	Close	fcc	code
0	2019-01-02	73.260002	77.667999	73.046501	76.956497	-1.0	g
1	2019-01-03	76.000504	76.900002	74.855499	75.014000	1.0	d
2	2019-01-04	76.500000	79.699997	75.915497	78.769501	1.0	h
3	2019-01-07	80.115501	81.727997	79.459503	81.475502	1.0	g
4	2019-01-08	83.234497	83.830498	80.830498	82.829002	1.0	j

df_train_ext.code.value_counts()

code
k    210
g     81
c     71
j     47
l     29
b     14
f     13
h     12
i     10
d      8
a      6
e      2
Name: count, dtype: int64
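
The two value counts above show code ‘k’ (every segment ignored) covering 466 of 503 rows at `threshold_pct=0.03` but only 210 at `threshold_pct=0.01`. The sketch below illustrates why that share can only grow as the threshold rises. It runs on synthetic candles, not the AMZN data, and the helper `all_segments_ignored` and the random price model are assumptions made for this example.

```python
import random

def all_segments_ignored(hp, op, cp, lp, threshold_pct):
    """True when lower shadow, body and upper shadow are all at or below
    threshold_pct * cp (the case coded 'k' above)."""
    threshold = threshold_pct * cp
    top, bottom = max(op, cp), min(op, cp)
    return (bottom - lp <= threshold
            and top - bottom <= threshold
            and hp - top <= threshold)

# Synthetic candles for illustration only (not AMZN data)
random.seed(0)
candles = []
for _ in range(500):
    op = random.uniform(90, 110)
    cp = op * (1 + random.gauss(0, 0.01))
    hp = max(op, cp) * (1 + abs(random.gauss(0, 0.01)))
    lp = min(op, cp) * (1 - abs(random.gauss(0, 0.01)))
    candles.append((hp, op, cp, lp))

# Share of candles with every segment ignored, per threshold
shares = {}
for pct in (0.005, 0.01, 0.03):
    hits = sum(all_segments_ignored(*c, threshold_pct=pct) for c in candles)
    shares[pct] = hits / len(candles)
    print(f"threshold_pct={pct}: share of 'k' candles = {shares[pct]:.2f}")
```

A threshold that is too large collapses most candles into a single code and leaves little pattern information to mine, which is the trade-off the threshold search has to balance.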