Question-1: Threshold#

import yfinance as yf
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statistics
from sklearn.metrics import accuracy_score

Title#

Threshold-Based Pattern Mining for Stock Market Candlestick Analysis

Abstract#

Pattern mining is an essential method for uncovering meaningful patterns in historical stock data. Among various techniques, candlestick analysis is widely used to capture daily stock price movements—open, high, low, and close—through a coded representation. However, small, insignificant differences in these values can impact the pattern coding and, consequently, predictions

This project introduces a threshold-based approach to encoding and pattern mining for candlesticks, aiming to mitigate the influence of minor fluctuations. By optimizing a threshold to exclude minimal price differences, we aim to improve pattern reliability and maximize returns. Performance is evaluated through backtesting, with a focus on determining the ideal threshold for enhanced accuracy in stock price forecasting.

Data#

def get_training_test_data(stock='AMZN', start='2019-1-1', end='2021-1-31', training_ratio=0.96):
    df = yf.Ticker(stock).history(start=start, end=end)
    df = df.iloc[:,:-3]
    df.reset_index(inplace=True)
    df['Date'] = [i.date() for i in df.Date]
    df['fcc'] = [np.sign(df.Close.loc[i+1]-df.Close.loc[i]) for i in range(len(df)-1)]+[np.nan]
    training_length = int(len(df)*training_ratio)
    training_data = df.iloc[:training_length,:] 
    test_data = df.iloc[training_length:,:]
    test_data.reset_index(inplace=True, drop=True)
    return (training_data, test_data)
df_train, df_test = get_training_test_data()
df_train.shape, df_test.shape
((503, 6), (21, 6))

Encoding with Threshold#

Encoding can be performed based on the lengths of the lower shadow (\(l1\)), body (\(l2\)), and upper shadow (\(l3\)), while disregarding segments with small sizes.

  • For example, if \(l1 = close-low\) is smaller than a defined threshold for a red candlestick with code ‘a’, \(l1\) can be ignored and set to 0, changing the candlestick’s code to ‘d’.

def encoder_threshold(hp, op, cp, lp, threshold_pct=0.01):
    threshold = threshold_pct * cp

    l1 = min(op,cp) - lp
    l2 = max(op,cp) - min(op,cp)
    l3 = hp - max(op,cp)

    if (l1 > threshold) & (l2 > threshold) & (l3 > threshold):
        if op > cp: return 'a'
        if cp > op: return 'e'

    elif (l1 <= threshold) & (l2 > threshold) & (l3 > threshold):
        if op > cp: return 'd'
        if cp > op: return 'h'

    elif (l1 > threshold) & (l2 > threshold) & (l3 <= threshold):
        if op > cp: return 'b'
        if cp > op: return 'f'

    elif (l1 > threshold) & (l2 <= threshold) & (l3 > threshold):
        return 'i'

    elif (l1 <= threshold) & (l2 > threshold) & (l3 <= threshold):
        if op > cp: return 'c'
        if cp > op: return 'g'

    elif (l1 > threshold) & (l2 <= threshold) & (l3 > threshold):
        return 'i'

    elif (l1 > threshold) & (l2 <= threshold) & (l3 <= threshold):
        return 'j'

    elif (l1 < threshold) & (l2 < threshold) & (l3 < threshold):
        return 'k'

    elif (l1 < threshold) & (l2 < threshold) & (l3 > threshold):
        return 'l'

def df_encoder_threshold(data, threshold_pct=0.01):
    data_ = data.copy()
    encoder_list = []
    for i in data_.index:
        hp, op, cp, lp = data_[['High','Open', 'Close', 'Low']].loc[i]
        encoder_list.append(encoder_threshold(hp, op, cp, lp, threshold_pct))
    data_['code'] = encoder_list
    return data_
df_train_ext = df_encoder_threshold(df_train, threshold_pct=0.03)
df_train_ext.head()
Date Open High Low Close fcc code
0 2019-01-02 73.260002 77.667999 73.046501 76.956497 -1.0 g
1 2019-01-03 76.000504 76.900002 74.855499 75.014000 1.0 k
2 2019-01-04 76.500000 79.699997 75.915497 78.769501 1.0 k
3 2019-01-07 80.115501 81.727997 79.459503 81.475502 1.0 k
4 2019-01-08 83.234497 83.830498 80.830498 82.829002 1.0 k
df_train_ext.code.value_counts()
code
k    466
g     14
c     12
j      6
l      4
d      1
Name: count, dtype: int64
df_train_ext = df_encoder_threshold(df_train, threshold_pct=0.01)
df_train_ext.head()
Date Open High Low Close fcc code
0 2019-01-02 73.260002 77.667999 73.046501 76.956497 -1.0 g
1 2019-01-03 76.000504 76.900002 74.855499 75.014000 1.0 d
2 2019-01-04 76.500000 79.699997 75.915497 78.769501 1.0 h
3 2019-01-07 80.115501 81.727997 79.459503 81.475502 1.0 g
4 2019-01-08 83.234497 83.830498 80.830498 82.829002 1.0 j
df_train_ext.code.value_counts()
code
k    210
g     81
c     71
j     47
l     29
b     14
f     13
h     12
i     10
d      8
a      6
e      2
Name: count, dtype: int64