Predict low-risk profitable trading opportunities with high-frequency trading data

Team members:

  • 梁康华 1601213555
  • 李君涵 1601213559
  • 纪雪云 1601213544
  • 王昊炜 1601213612

0 Structure

1 Motivation

  • On one hand, in the study of market microstructure, many studies have shown that traders with private information buy and sell before uninformed traders.
  • On the other hand, market manipulation does exist in China: people with private information tend to manipulate the market and earn money.
  • Combining this theory with the phenomenon observed in China, we would like to see whether we can profit from them.
  • Traders manipulating the market tend to generate abnormal trading volume or price fluctuations.
  • High-frequency data captures these changes and behaviors more precisely and promptly.
  • These behaviors may generate complicated patterns, and our team attempts to employ machine learning algorithms to find these patterns and exploit profitable opportunities.

2 Data Descriptions

  • The dataset used in this project combines two basic datasets:
  • (1) High-Frequency Trading Volume Dataset: collected by a web spider from Sina Finance; it is level-2 data consisting of 'Active Buy' and 'Active Sell' high-frequency records.
  • (2) 5-min Frequency Trading Data: collected from Wind; it consists of open, high, low, close, trading volume, and trading amount at 5-minute frequency.
  • We combine these two basic datasets into the high-frequency dataset used in this project (a merging sketch follows this list).
  • All China A-share stocks are included.
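To make the merge concrete, here is a minimal sketch of how the two sources could be aligned. The file names and column names ('code', 'date') are assumptions for illustration, not the project's actual identifiers.

```python
import pandas as pd

# Hypothetical file and column names; the actual loaders live in the project code.
hf_volume = pd.read_csv("sina_level2_volume.csv")  # 'Active Buy'/'Active Sell' records
bars_5min = pd.read_csv("wind_5min_bars.csv")      # open/high/low/close, volume, amount

# Align the two sources on stock code and trading date.
merged = pd.merge(hf_volume, bars_5min, on=["code", "date"], how="inner")
```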

3 Feature Generation

3.1. Generate 5-min Features

Procedures in Code

Variable Descriptions

[Figure: dg_1]

  • buyprice: the highest price in the last 20 minutes of a trading day. We assume we buy at this price.
  • canbuy: if the stock has hit the daily price ceiling or floor, we assume we cannot buy it and set the variable to 0; otherwise we set it to 1. During training we exclude all samples with canbuy == 0.
  • buyret: the return from buying at buyprice and selling at the highest price over the next two days.
  • risk: the loss from buying at buyprice and selling at the lowest price over the next two days.
  • target: the training label used in this project. If, over the next two days, buyret > 0.03 and risk > -0.02, we regard it as a low-risk profitable trading opportunity and set the label to 1; otherwise we set it to 0.

Among these variables, two features are constructed from the 5-min frequency data (a construction sketch follows this list):

  • amplitude: equals 'daily highest price / daily lowest price - 1', measuring the stock's intraday variation.
  • above_mean: an indicator that equals 1 if the closing price is higher than the day's mean price at closing time, and 0 otherwise.
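As a rough illustration of how these labels and features could be computed from the 5-min bars for one stock, here is a sketch. The column names ('date', 'high', 'low', 'close') and the approximate price-limit check are assumptions, not the project's exact code.

```python
import pandas as pd

def make_labels(bars: pd.DataFrame) -> pd.DataFrame:
    """bars: 5-min OHLC rows for one stock, ordered by time, with a 'date' column."""
    # buyprice: highest price over the last 20 minutes (four 5-min bars) of each day.
    buyprice = bars.groupby("date").tail(4).groupby("date")["high"].max()

    daily = bars.groupby("date").agg(high=("high", "max"), low=("low", "min"),
                                     close=("close", "last"))
    daily["buyprice"] = buyprice

    # Highest / lowest prices over the next two trading days.
    nxt_high = pd.concat([daily["high"].shift(-1), daily["high"].shift(-2)], axis=1).max(axis=1)
    nxt_low = pd.concat([daily["low"].shift(-1), daily["low"].shift(-2)], axis=1).min(axis=1)

    daily["buyret"] = nxt_high / daily["buyprice"] - 1
    daily["risk"] = nxt_low / daily["buyprice"] - 1
    daily["target"] = ((daily["buyret"] > 0.03) & (daily["risk"] > -0.02)).astype(int)

    # canbuy: 0 when buyprice sits at the +/-10% A-share price limit (approximate check).
    prev_close = daily["close"].shift(1)
    daily["canbuy"] = ((daily["buyprice"] / prev_close - 1).abs() < 0.0995).astype(int)

    # Features derived from the 5-min bars.
    daily["amplitude"] = daily["high"] / daily["low"] - 1
    daily["above_mean"] = (daily["close"] > bars.groupby("date")["close"].mean()).astype(int)
    return daily
```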

3.2. Generate High-Frequency Features

Procedures in Code

Variable Descriptions

[Figure: dg_2]

  • The high-frequency volume is divided into buckets based on the following thresholds:
    vollist = [0, 10000, 50000, 100000, 200000, 300000, 400000, 500000, infinity]
  • The '0' in 'buy_rate_0' denotes the ratio of buying volume in [0, 10000] to total selling volume, measuring small traders' buying power relative to total selling power. The full mapping is:
    '0': [0, 10000]
    '1': [10000, 50000]
    '2': [50000, 100000]
    '3': [100000, 200000]
    '4': [200000, 300000]
    '5': [300000, 400000]
    '6': [400000, 500000]
    '7': [500000, infinity]
    A larger index captures the buying power of larger traders relative to total selling power.
  • The 'sell_rate_...' features are the opposite: selling volume in a given bucket relative to total buying power.
  • 'total_rate' indicates total buying power relative to total selling power.
  • 'pchange' is calculated as 'close price of the day / open price of the day - 1'.
  • 'lag_1' denotes a feature's value one trading day before; the dataset includes 'lag_1', 'lag_2', and 'lag_3'. A computation sketch for the bucket features follows.
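Here is a sketch of how the bucket features could be computed from one stock-day of level-2 records. The tick-level column names ('volume', 'side') are assumptions for illustration.

```python
import numpy as np
import pandas as pd

vollist = [0, 10000, 50000, 100000, 200000, 300000, 400000, 500000, np.inf]

def bucket_rates(ticks: pd.DataFrame) -> pd.Series:
    """ticks: one stock-day of level-2 records with 'volume' and 'side' in {'buy', 'sell'}."""
    bucket = pd.cut(ticks["volume"], bins=vollist, labels=range(8))
    is_buy = ticks["side"] == "buy"
    total_buy = ticks.loc[is_buy, "volume"].sum()
    total_sell = ticks.loc[~is_buy, "volume"].sum()

    feats = {}
    for b in range(8):
        sel = bucket == b
        feats[f"buy_rate_{b}"] = ticks.loc[sel & is_buy, "volume"].sum() / total_sell
        feats[f"sell_rate_{b}"] = ticks.loc[sel & ~is_buy, "volume"].sum() / total_buy
    feats["total_rate"] = total_buy / total_sell
    return pd.Series(feats)
```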

3.3. Conclusion on Features

  • There are 74 features for training
  • Features should be standardized before training because they have different units
  • PCA may be needed to handle the large number of features

4 Exploratory Data Analysis

After data preprocessing, we check whether there are any problems in the dataset.

4.1. Missing Value Detection

  • No missing values remain in the dataset after feature generation

4.2. Imbalance Check

  • The imbalanced-dataset problem is serious: only 16% of the labels are 1 and the rest are 0 (a quick check is sketched below).
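A quick sanity check of the label balance, assuming the merged dataset is held in a DataFrame named `data` (a hypothetical name) with the 'target' column defined in Section 3.1:

```python
# Fraction of each label in the dataset.
print(data["target"].value_counts(normalize=True))
# 0    0.84
# 1    0.16   <- only ~16% positives, so resampling is needed before training
```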

4.3. Conclusion on Data

  • Total observations: 347,364
  • Time range: 2017-09-04 to 2018-02-28
  • No missing data
  • Seriously imbalanced dataset

5 Training

5.1. Logistic Regression

Procedures in Code

5.1.1. Feature Preprocessing and Choice of Hyperparameters

  • We train on 70% of the sample and test on the remaining 30%
  • A SMOTE transformation is used to tackle the imbalanced-dataset problem
  • To increase training speed, the data are standardized
  • We use PCA to reduce dimensionality

5.1.2. Hyperparameter Tuning

  • To determine the parameter C in logistic regression, we use grid search (a pipeline sketch follows this list).
  • The best parameter is C = 1.
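A minimal sketch of the pipeline described above, using scikit-learn and imbalanced-learn. The C grid and the PCA variance target are illustrative assumptions; the 70/30 split, SMOTE, standardization, and PCA steps follow the report. `X` and `y` are assumed to hold the 74 features and the target labels.

```python
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# 70/30 split, as in the report.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Standardize, then oversample the minority class on the training set only.
scaler = StandardScaler().fit(X_train)
X_res, y_res = SMOTE(random_state=0).fit_resample(scaler.transform(X_train), y_train)

# PCA for dimensionality reduction (95% retained variance is an assumption).
pca = PCA(n_components=0.95).fit(X_res)

# Grid search over C (grid values illustrative); the report finds C = 1.
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(pca.transform(X_res), y_res)
print(grid.best_params_)
```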

5.1.3. Result

  • The precision is 0.37 at the default threshold of 50%.
  • However, precision improves as the threshold rises: if we buy stocks with a predicted probability above 85%, we succeed with 75% probability.
    [Figure: log-1 — precision vs. prediction threshold]
  • The AUC is 0.8, which indicates the model works quite well.
    [Figure: log-2 — ROC curve]

5.2. Decision Tree

Procedures in Code

5.2.1. Feature Preprocessing and Choice of Hyperparameters

  • We train on 70% of the sample and test on the remaining 30%
  • A SMOTE transformation is used to tackle the imbalanced-dataset problem
  • To increase training speed, the data are standardized
  • We use PCA to reduce dimensionality

5.2.2. Hyperparameter Tuning

  • To determine the parameter max_depth in the decision tree, we use grid search (see the sketch after this list).
  • The best parameter is max_depth = 10.
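Since the preprocessing mirrors Section 5.1, only the estimator and the grid change. A brief sketch, reusing `X_res`, `y_res`, and `pca` from the 5.1 sketch (grid values illustrative):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Grid search over tree depth; the report finds max_depth = 10.
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={"max_depth": [3, 5, 10, 20, None]}, cv=5)
grid.fit(pca.transform(X_res), y_res)
print(grid.best_params_)
```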

5.2.3. Result

  • The precision is 0.30 at the default threshold of 50%.
  • Even if we increase the threshold, the result remains poor.
    [Figure: dtree-1 — precision vs. prediction threshold]
  • The ROC curve shows that the performance is not as good as that of logistic regression.
    [Figure: dtree-2 — ROC curve]

5.3. Deep Neural Network

Procedures in Code

5.3.1. Feature Preprocessing

  • All features are used in training
  • To increase training speed, the data are standardized
  • We train on 70% of the sample and test on 30% of the sample

5.3.2. DNN Structure and Hyperparameter Tuning

  • We use the Keras package with TensorFlow as the backend
  • A Sequential model is used
  • 1 input layer, 5 hidden layers, 1 output layer (see the sketch below)
  • The input layer and all hidden layers use the ReLU activation function
  • The output layer uses the sigmoid activation function
  • The loss function used in backpropagation is 'binary_crossentropy'
  • The Adam optimizer is used because it incorporates momentum and helps avoid exploding gradients
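A sketch of the described architecture in Keras. The layer widths, epochs, and batch size are assumptions, since the report only fixes the layer count, activations, loss, and optimizer; `X_train_s` and `y_train` stand for the standardized training features and labels from 5.3.1.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential(
    [layers.Dense(128, activation="relu", input_shape=(74,))]   # input layer (74 features)
    + [layers.Dense(64, activation="relu") for _ in range(5)]   # 5 hidden layers
    + [layers.Dense(1, activation="sigmoid")]                   # output layer
)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[keras.metrics.Precision()])
model.fit(X_train_s, y_train, epochs=20, batch_size=256, validation_split=0.1)
```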

5.3.3. Result

When trading in practice, we care about whether we can profit from the model's output. If the model predicts '1' for a stock, we buy it and wait for the profit. Therefore, precision is the right metric for evaluating the model.
[Figure: dnn-1 — precision vs. prediction threshold]
The result is very encouraging. Using the trained model to predict out-of-sample data, the figure above shows that precision increases steadily as we raise the threshold for predicting label 1. If we buy stocks with a predicted probability above 75%, we succeed with 85% probability.
[Figure: dnn-2 — ROC curve]
The AUC is 0.81, which further confirms the good result. A sketch of the evaluation behind these figures follows.
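The variable names below follow the earlier sketches (`model`, `X_test_s`, `y_test`); the threshold grid is illustrative.

```python
import numpy as np
from sklearn.metrics import precision_score, roc_auc_score

proba = model.predict(X_test_s).ravel()   # out-of-sample predicted probabilities

# Precision as a function of the decision threshold.
for t in np.arange(0.50, 0.95, 0.05):
    pred = (proba >= t).astype(int)
    if pred.sum() > 0:
        print(f"threshold {t:.2f}: precision {precision_score(y_test, pred):.2f}")

print("AUC:", roc_auc_score(y_test, proba))   # the report finds AUC = 0.81
```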

6 Conclusion

  • Based on high-frequency trading data, we successfully predict low-risk profitable trading opportunities
  • The best model is the DNN: its precision rises fastest with the threshold and reaches the highest level among the three methods
  • The parametric models (logistic regression, DNN) outperform the non-parametric model (decision tree) in our tests
  • However, the results may be biased because of the short sample period