# Can we predict Thailand stock with Google trends and Covid19 stats?

Have you ever wondered whether Google search trends or Covid-19 stats have an impact on Thailand's stock market prices?

If so, can we use these values to predict stock prices, and optimise the predictions along the way?

# Thailand Stock market prediction with Google trends and Covid-19 stats

As you know, Covid-19 has dominated the world for a while now, but is it affecting not only our lives but our stock market as well? If so, can we use this information to predict stock prices?

In this project (https://github.com/petetae/SET_stock_prediction), we explore whether there is a correlation, try to predict the stock price, and then check whether Google Trends and Covid-19 stats end up among the features used for the prediction.

The process through the project is simple:

- Identify the data sources to use
- Explore the data to see what we can extract and use
- Prepare data for our prediction
- Train models from different algorithms
- Optimise the model
- Review finalised performance

The main goal of this project is to predict stock prices, and along the way we want to see whether the Covid-19 stats and Google Trends information end up among the main features.

# Data sources

Let’s get started with the data sources we are going to use. The first one is called StarfishX.

This is a library that lets us fetch details about Thai stocks, such as pricing and other stock industry information.

The second data source is Google Trends, which we access through pytrends.

This is a library that takes a keyword and returns the search interest in that keyword over time.

Lastly, we use the Covid-19 statistics for Thailand, which can be found at the following link: https://covid19.ddc.moph.go.th/api/Cases/timeline-cases-all

Now let’s move onto exploring the data from these datasets!

# Performance metrics first!

Before exploring the data, there is one thing we need to be clear on first: the performance metric for this project. It is the one thing to keep in mind throughout.

For this project, we use the Root Mean Square Error (RMSE) as the measure of performance.

RMSE can be a good indicator when stocks are trending sideways; however, it may not work as well if the stock has spikes and swings. This is where a visual plot for analysis comes in.

For example, as shown in the plot below, the model might predict the “red” line, which has a lower RMSE value than the “yellow” line, but ultimately we would say that the “yellow” line is better because it follows the swings of the price. Because of this, we combine the RMSE score with visual inspection to find the best model and parameter tuning.
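To make this caveat concrete, here is a minimal sketch of RMSE in plain Python. The numbers are made up for illustration and are not project data: the flat prediction plays the role of the “red” line, and the prediction with the right shape but a one-day delay plays the “yellow” line. The function name `rmse` is also just an assumption.

```python
import math

def rmse(actual, predicted):
    """Root Mean Square Error: square root of the mean squared difference."""
    if len(actual) != len(predicted):
        raise ValueError("series must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical prices with two spikes (illustrative values only).
actual = [10, 10, 10, 20, 10, 10, 10, 20, 10, 10]
# "Red" line: a flat prediction at the mean price.
flat = [12] * len(actual)
# "Yellow" line: the right shape, but each spike arrives one day late.
late = [10, 10, 10, 10, 20, 10, 10, 10, 20, 10]

print(round(rmse(actual, flat), 3))  # 4.0
print(round(rmse(actual, late), 3))  # 6.325
```

The flat line scores better on RMSE even though it never predicts a spike, which is why the plots are checked alongside the score.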

# Explore data

Now onto the topic of data exploration. We are going to go through an overview of what is available in each data source.

Starting with starfishX, we are able to capture the following information:

These two pieces of information are the main things we can use to create further features. With that said, here is a look at the stock that we will mainly be working with today, DELTA.

We can also create the moving average and Relative Strength Index (RSI) for all of the above information. For more details on that, please refer to the detailed report for this project: https://github.com/petetae/SET_stock_prediction.
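As a rough idea of what these indicators compute, here is a dependency-free sketch. It is not the project's code: the function names are hypothetical, the EMA is seeded with the first price, and the RSI uses simple averages of gains and losses (other RSI variants smooth them differently).

```python
def sma(values, window):
    """Simple moving average; None until a full window is available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out

def ema(values, span):
    """Exponential moving average with smoothing factor 2 / (span + 1)."""
    alpha = 2 / (span + 1)
    out = [values[0]]  # seeded with the first value
    for value in values[1:]:
        out.append(alpha * value + (1 - alpha) * out[-1])
    return out

def rsi(values, period):
    """RSI from simple averages of gains and losses over `period` changes."""
    out = [None] * len(values)
    for i in range(period, len(values)):
        changes = [values[j] - values[j - 1] for j in range(i - period + 1, i + 1)]
        gain = sum(c for c in changes if c > 0) / period
        loss = sum(-c for c in changes if c < 0) / period
        # A window with no losses (including a flat one) maps to 100 in this sketch.
        out[i] = 100.0 if loss == 0 else 100 - 100 / (1 + gain / loss)
    return out

closes = [10, 11, 12, 11, 13, 14]  # hypothetical closing prices
print(sma(closes, 3))
print(ema(closes, 3))
print(rsi(closes, 2))
```

The window sizes used in the project (for example 10, 30, 50) would simply be passed as `window`, `span`, or `period`.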

So… what else can this starfishX library do?

We can also fetch the list of SET50 and SET100 stock symbols, the SET50 and SET100 index values, and the foreign trade prices. These can become valuable for our feature engineering :D

One point to note with this dataset is that some data cannot be loaded, either because the server does not respond or because the stock itself is not available. We suggest skipping these stocks in your analysis too.

Moving on to the next data source, the Google Trends API. Usage is quite straightforward: pass in the list of keywords, then call the interest_over_time() method on the pytrends object to get the results.

Here, DELTA is the keyword whose search interest over time we looked at. isPartial shows whether the data is complete: True means the data for that day has not been completely calculated yet. The date index is self-explanatory.
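A minimal sketch of that call pattern is below. It assumes pytrends and pandas are installed and that network access is available when `fetch_interest` is called; the `timeframe` and `geo` values, and both function names, are assumptions for illustration rather than the project's settings. The helper also assumes the isPartial column is boolean.

```python
import pandas as pd

def fetch_interest(keywords, timeframe="today 3-m", geo="TH"):
    """Fetch interest over time for a list of keywords (needs network access)."""
    from pytrends.request import TrendReq  # imported lazily, only when fetching

    pytrends = TrendReq(hl="en-US")
    pytrends.build_payload(keywords, timeframe=timeframe, geo=geo)
    return pytrends.interest_over_time()

def drop_partial_rows(trends):
    """Keep only rows whose values are final, then drop the isPartial flag."""
    complete = trends[~trends["isPartial"].astype(bool)]
    return complete.drop(columns="isPartial")

# Example call (not executed here because it needs network access):
# delta_interest = drop_partial_rows(fetch_interest(["DELTA"]))
```

Dropping the partial rows is one way to avoid feeding a half-computed value for the latest period into the features.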

As a quick preview, let’s see how some search trends correlate with the stock price itself:

As you can see, there is a massive spike in search interest just before the stock price spikes as well! Could this mean that there is a correlation and that we can use it to predict the price trend?

Moving on, we identified the following keyword types to use:

- Stock symbol keyword interest over time
- Stock sector name keyword interest over time
- Stock industry name keyword interest over time
- Interest over time for the rank 1 top related query to the stock symbol keyword
- Interest over time for the rank 2 top related query to the stock symbol keyword

Lastly, the Covid-19 data source. This is quite straightforward. What we needed was the number of cases in Thailand over time, which can be found at https://covid19.ddc.moph.go.th/api/Cases/timeline-cases-all.

As shown in the plot and in the data itself, the cases in the "exclude abroad" data are very small and insignificant, so we remove them when using the data in our prediction and feature engineering.

# Prepare dataset for modelling

The next part that we are going to go through is the steps taken to prepare the dataset for modelling. In our case, we need to go through the following steps:

- Put all the stock data and features together
- Apply Granger causality testing and create new support dataframes
- Create new train-test and train-predict datasets
- Scale the data for modelling

Putting all the stock data and features together gives us the dataframe shown below, with the following features. For further details, please refer to https://github.com/petetae/SET_stock_prediction.

- Stock price open/high/low/close
- Stock volume
- Stock exponential moving average (10, 30, 50)
- Stock moving average (10, 30, 50, 100)
- Stock RSI (2, 6, 14, 30)
- SET index (SET50, SET100)
- Google trend interests (stock_symbol, sector_name, industry_name, top1_related, top2_related)
- Sim stock price ohlc by sector
- Sim stock volume by sector
- Sim stock RSI values by sector
- Sim stock moving average
- Sim stock exponential moving average
- Sim stock price ohlc by top 5 price correlation on SET100
- Sim stock volume by top 5 price correlation on SET100
- Sim stock Exponential moving average by top 5 price correlation on SET100
- Sim stock Moving average by top 5 price correlation on SET100
- Sim stock RSI values by top 5 price correlation on SET100
- Covid19 stats data (new_cases, total_cases)
- Foreign SET buy/sell

Note that these engineered features can be created from the stock data itself.

So next: what is Granger causality testing? And what is the train-predict dataset?

Let’s go through why we need them. First off, sorry for the messy handwriting, but I tried my best to illustrate this by drawing it out.

As seen here, the rightmost side is our normal train and test set. This is perfectly fine as we have the data up to today.

However, what do we do if we want to predict… the future?

Well, this is where Granger causality testing comes in. We can identify whether there is a lagged correlation between two series (the stock of interest and feature A), and then shift the values of feature A down by that lag. The shifted rows that fall after today can then be used as the prediction dataset (test_x dataset).

For example, say stock A has a significant relationship with feature B at a lag of 3, and with feature C at a lag of 7. This means that using features B and C we can predict 3 days ahead, and using only feature C we can predict 7 days ahead. This is the main reason Granger causality testing comes into play for future stock price prediction.
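The shifting step can be sketched in a few lines of plain Python. This is an illustration under simple assumptions (one feature, equal-length lists ordered by date and ending today), and the function name `align_with_lag` is hypothetical, not taken from the project.

```python
def align_with_lag(target, feature, lag):
    """Pair feature[t - lag] with target[t] and return the leftover future inputs.

    target and feature are equal-length lists ordered by date, ending today.
    """
    if lag < 1 or lag >= len(target):
        raise ValueError("lag must be between 1 and len(target) - 1")
    train_x = feature[:-lag]    # feature values that already have a known target
    train_y = target[lag:]      # target values `lag` days after each feature value
    predict_x = feature[-lag:]  # newest feature values: inputs for the next `lag` days
    return train_x, train_y, predict_x

stock = [1, 2, 3, 4, 5]          # hypothetical prices up to today
feature = [10, 20, 30, 40, 50]   # hypothetical feature values up to today
train_x, train_y, predict_x = align_with_lag(stock, feature, lag=2)
print(train_x, train_y, predict_x)  # [10, 20, 30] [3, 4, 5] [40, 50]
```

The last `lag` feature values have no matching price yet, which is exactly what makes them usable as inputs for the days after today.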

With that being said, the definition of Granger causality testing is as follows:

- The Granger causality test is a statistical hypothesis test for determining whether one time series is useful in forecasting another.
- Granger causality is a statistical concept of causality that is based on prediction.
- According to Granger causality, if a signal X1 “Granger-causes” (or “G-causes”) a signal X2, then past values of X1 should contain information that helps predict X2 above and beyond the information contained in past values of X2 alone.
- The test is based on an F-test that compares the sum of squared errors of regressions with and without the past values of X1.
- The sum of squares due to regression is the sum of the squared differences between the predicted values and the mean of the dependent variable.

If that truly confuses you, think of it this way. When we run the Granger causality test (details can be found in the GitHub project), we get a p-value for each feature and lag. If the p-value is less than 0.05, we can say the feature has a statistically significant Granger-causal relationship with the stock of interest at that lag, meaning that past values of the feature help predict the stock. We can then use these features to predict.
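For intuition, the F statistic behind the test can be computed by hand. statsmodels provides `grangercausalitytests` (in `statsmodels.tsa.stattools`), which reports this test together with p-values; the NumPy sketch below only shows what is being compared. The function names and the synthetic data are assumptions for illustration, not the project's code.

```python
import numpy as np

def lag_matrix(series, lag):
    """Columns hold series[t-1], ..., series[t-lag] for t = lag .. n-1."""
    n = len(series)
    return np.column_stack([series[lag - k:n - k] for k in range(1, lag + 1)])

def granger_f_stat(target, feature, lag):
    """F statistic for 'feature Granger-causes target' at the given lag."""
    y = np.asarray(target, dtype=float)
    x = np.asarray(feature, dtype=float)
    now = y[lag:]
    intercept = np.ones((len(now), 1))
    # Restricted model: the target explained by its own past only.
    restricted = np.hstack([intercept, lag_matrix(y, lag)])
    # Unrestricted model: add the past values of the feature.
    unrestricted = np.hstack([restricted, lag_matrix(x, lag)])

    def sum_squared_errors(design):
        coef, *_ = np.linalg.lstsq(design, now, rcond=None)
        residual = now - design @ coef
        return float(residual @ residual)

    sse_r = sum_squared_errors(restricted)
    sse_u = sum_squared_errors(unrestricted)
    dof = len(now) - unrestricted.shape[1]
    return ((sse_r - sse_u) / lag) / (sse_u / dof)

# Synthetic example: the target follows the feature with a 2-day delay.
rng = np.random.default_rng(0)
feature = rng.normal(size=200)
target = np.zeros(200)
target[2:] = 0.8 * feature[:-2] + 0.1 * rng.normal(size=198)

print(granger_f_stat(target, feature, lag=2))  # large: the feature helps predict the target
print(granger_f_stat(feature, target, lag=2))  # small: no help in the other direction
```

A large F statistic corresponds to a small p-value, which is the "less than 0.05" check described above.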

In our case, we have the following result for our stock DELTA. The table shows that, for example, at a lag of 6 there is a significant relationship with the DELTA_VOLUME feature.

For example, suppose that out of 100 features, only 5 have a Granger causality lag of 7 days. This means that we can predict up to 7 days ahead using only those 5 features.

Another case to consider is that, out of 100 features, 5 features (A, B, C, D, E) have a lag of 7 days and 3 features (F, G, H) have a lag of 4 days. This gives us 2 groups that we can predict with:

- Predict 7 days — Features used (A, B, C, D, E)

- Predict 4 days — Features used (A, B, C, D, E, F, G, H)
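The grouping above follows one rule: a feature can be used for a given horizon if its lag is at least that long. A small sketch, using the same hypothetical features A to H (the function name is an assumption):

```python
def features_for_horizon(feature_lags, horizon):
    """Features whose Granger lag is long enough to predict `horizon` days ahead."""
    return sorted(name for name, lag in feature_lags.items() if lag >= horizon)

feature_lags = {"A": 7, "B": 7, "C": 7, "D": 7, "E": 7, "F": 4, "G": 4, "H": 4}
for horizon in sorted(set(feature_lags.values()), reverse=True):
    print(horizon, features_for_horizon(feature_lags, horizon))
# 7 ['A', 'B', 'C', 'D', 'E']
# 4 ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
```

Shorter horizons therefore get more features, while longer horizons have to rely on fewer.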

This is why the train-predict dataset has to be built differently from the train-test dataset.

Lastly for this section, we have to scale our features. Because the features are on very different scales, we cannot use them together directly.

For example, a stock price might be in the range of 700–800 units while the Google Trends interest is within the 10–20 range. This causes bias and skewness in our training set. Therefore, we use MinMaxScaler() to transform our data to be between 0 and 1.
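As a sketch of what min-max scaling does per column, each value x becomes (x - min) / (max - min). scikit-learn's `MinMaxScaler` applies the same idea with its default range of 0 to 1; the plain-Python version below uses hypothetical function names and made-up values.

```python
def fit_min_max(train_column):
    """Learn the scaling range from the training data only."""
    return min(train_column), max(train_column)

def apply_min_max(column, low, high):
    """Map values so that low -> 0 and high -> 1."""
    span = (high - low) or 1.0  # avoid dividing by zero for a constant column
    return [(value - low) / span for value in column]

prices = [700, 725, 750, 800]  # hypothetical stock prices
interest = [10, 12, 15, 20]    # hypothetical Google Trends interest

scaled_prices = apply_min_max(prices, *fit_min_max(prices))
scaled_interest = apply_min_max(interest, *fit_min_max(interest))
print(scaled_prices)    # [0.0, 0.25, 0.5, 1.0]
print(scaled_interest)  # [0.0, 0.2, 0.5, 1.0]
```

After scaling, both columns live on the same 0 to 1 range, so neither dominates simply because of its units.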

Next, let’s go through our training models and algorithms used.

# Training models

This is the part where the fun happens! Let’s go through training our models as if they were our babies!

First, let’s go through the algorithms that we are looking to use to solve this problem and why.

- **Linear regression**: A basic algorithm that should be able to predict the trend.
- **Random forest regressor**: Predicts with a forest of decision trees learnt from previous data, which suits stock prediction.
- **SVM (SVR)**: Support vector machine with a linear kernel (linear SVR), used because it works well with series data and has tunable parameters, and is supported by scientific articles such as https://ieeexplore.ieee.org/document/6572570
- **Gradient boosting regressor**: Chosen based on online research and usage trends, as it predicts with gradient boosting and has tunable parameters: https://towardsdatascience.com/forecasting-stock-prices-using-xgboost-a-detailed-walk-through-7817c1ff536a
- **SVR (poly)**: Same as the SVR above, but with a polynomial kernel.
- **Facebook prophet**: Used as it supports forecasting at scale (https://facebook.github.io/prophet/) and has been shown to predict stock prices well to a certain extent through a combination of techniques.
- **LSTM (Long Short Term Memory)**: A neural network algorithm that is good at predicting stock prices because it learns from previous data: https://www.kdnuggets.com/2018/11/keras-long-short-term-memory-lstm-model-predict-stock-prices.html

With this in mind, and with a training size of 200 and a testing set of 50, we can start to look at our performance results (with RMSE).

Note: the reasoning behind choosing a training size of 200 can be found in the GitHub project.
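For time series, the split has to respect date order rather than shuffle rows. A minimal sketch, assuming the rows are sorted by date and that the most recent 250 rows are used (that choice and the function name are assumptions, not confirmed project details):

```python
def chronological_split(rows, train_size=200, test_size=50):
    """Take the latest train_size + test_size rows and split them in date order."""
    if len(rows) < train_size + test_size:
        raise ValueError("not enough rows for the requested split")
    window = rows[-(train_size + test_size):]
    return window[:train_size], window[train_size:]

days = list(range(300))  # stand-in for 300 dated rows, oldest first
train, test = chronological_split(days)
print(len(train), len(test))  # 200 50
print(train[-1], test[0])     # 249 250
```

Every test row comes after every training row, so the model is never evaluated on dates it has already seen.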

Running each of these algorithms with basic default parameters, we get the result shown below.

This does not look too bad, but how about its RMSE?

Okay, so the RMSE does not look that great. Let’s go through some parameter tuning and optimisation.

# Optimisation

For this project, there are multiple methods used to try and optimise the results. The ones that we are going to discuss here are as follows:

- Optimising model through parameter tuning
- Optimising model through ensemble model tuning

Parameter tuning is done manually in some cases, and with RandomizedSearchCV in others.

Through a process of manual tuning and testing, we are able to see an improved result.

From here, we decided to keep only the few algorithms that were really relevant and optimise their parameters to get the following result.

Furthermore, we use ensemble model optimisation (taking the mean of the predicted results of multiple selected models) to try and improve performance.

This resulted in the following predicted result.

To conclude our optimisation, we chose to take the mean of the following models:

- Gradient boosting 1
- Gradient boosting 2
- Facebook prophet 2
- LSTM 1

This combination yields the best RMSE value and also shows signs of adapting to sudden changes, which no single model was able to do on its own.
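Taking the mean of several models' predictions is simple to sketch. The model names below come from the list above, but the prediction values are hypothetical placeholders, not project results, and the function name is an assumption.

```python
def ensemble_mean(predictions):
    """Average the predictions of several models, step by step."""
    series = list(predictions.values())
    if len({len(s) for s in series}) != 1:
        raise ValueError("all models must predict the same number of steps")
    return [sum(step) / len(series) for step in zip(*series)]

# Hypothetical scaled predictions for three future days (not project results).
predictions = {
    "Gradient boosting 1": [0.50, 0.52, 0.60],
    "Gradient boosting 2": [0.54, 0.50, 0.64],
    "Facebook prophet 2": [0.46, 0.56, 0.58],
    "LSTM 1": [0.50, 0.54, 0.62],
}
combined = ensemble_mean(predictions)
print([round(value, 2) for value in combined])  # [0.5, 0.53, 0.61]
```

Averaging lets the models' individual errors partly cancel out, which is one plausible reason the combination handles sudden changes better than any single model.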

## Parameter tuning (details)

Here, we go through the specifics of the parameters that we tuned and why. Note that for each of the RandomizedSearchCV tuning cases, we ran 5-fold cross validation with the default 10 iterations.

**Linear regression**

No special optimisation is done for this case as it is linear regression.

**Random forest regressor**

For the random forest regressor, we optimised the following parameters:

- max_depth: Maximum depth of each tree. Here, it is best not to let the trees grow too deep, as the prediction may then deviate a lot from the price.
- min_samples_leaf: Minimum number of samples required at a leaf node. In our case, we only need a small value so that the prediction does not fluctuate much from the price.
- n_estimators: Number of trees in the forest. We tested various numbers of trees.

We then test manually, and also run the randomized search cross validation technique with the following parameters:

- *max_depth = [5,10,15]*
- *n_estimators = [50,100,200]*
- *min_samples_leaf = [1,2,5]*

Our first test is with its default values, then with a manual input and then through randomized search cv.

The final best parameters were max_depth = 5, n_estimators = 50, min_samples_leaf = 1.

With these parameters, we were able to improve the RMSE score from 0.144433 to 0.127165. (lower is better!)
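A search like this can be sketched with scikit-learn as follows. The parameter grid, the 5 folds, and the 10 iterations come from the text above; the RMSE scoring string, the random seeds, the function name, and the synthetic data are assumptions for illustration, so the output will not match the project's numbers.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "max_depth": [5, 10, 15],
    "n_estimators": [50, 100, 200],
    "min_samples_leaf": [1, 2, 5],
}

def tune_random_forest(X, y, seed=0):
    """Randomized search over the grid; returns the best parameters and their RMSE."""
    search = RandomizedSearchCV(
        RandomForestRegressor(random_state=seed),
        param_distributions=param_distributions,
        n_iter=10,  # the library default, written out for clarity
        cv=5,       # 5 cross validation folds
        scoring="neg_root_mean_squared_error",  # assumed scoring, not confirmed by the project
        random_state=seed,
    )
    search.fit(X, y)
    return search.best_params_, -search.best_score_

# Small synthetic dataset standing in for the scaled stock features.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=60)

best_params, best_rmse = tune_random_forest(X, y)
print(best_params, best_rmse)
```

Because only 10 of the 27 possible combinations are sampled, it is still worth comparing the result against the defaults and a few manual choices.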

**SVR (Support vector machine) — Linear**

For LinearSVR, we optimised the following parameters:

- C: Regularization parameter. We try different C values to see how the results change.
- epsilon: Epsilon parameter of the epsilon-insensitive loss function. Here, we try different epsilon values from 0.05 to 0.5 to see the effect.

We ran the randomized search CV to find the best results for linear SVR with the following inputs:

- *C = [1,3,5,10,20]*
- *epsilon = [0.05,0.1,0.2,0.3,0.4,0.5]*

Our first test is with the default values, then with randomized search CV.

The final best parameters were the default values of C=1, epsilon=0.

The best RMSE came from the default parameters at 0.169136, compared with 0.171206 from the randomized search CV.

**Gradient boosting regressor**

GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage a regression tree is fit on the negative gradient of the given loss function.

Here, we optimise it by changing the following parameters:

- learning_rate: The learning rate shrinks the contribution of each tree. Here, we test multiple learning rates to ensure that earlier data does not have too much effect on predictions 10 periods away, and so on.
- max_depth: Maximum depth of the individual regression estimators. We tested varying depth values.
- n_estimators: Number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting, so a large number usually results in better performance.
- min_samples_leaf: Minimum number of samples required at a leaf node. In our case, we only need a small value so that the prediction does not fluctuate much from the price.

We then run the randomized search cross validation technique with the following parameters:

- *learning_rate = [0.05,0.1,0.2]*
- *max_depth = [5,10,15]*
- *n_estimators = [50,100,200,500]*
- *min_samples_leaf = [1,2,5]*

Our first test is with the default values, then with manual input, and then with randomized search CV.

The final best parameters were learning_rate = 0.1, max_depth = 3, n_estimators = 100, min_samples_leaf = 1.

These are the default parameters, which gave an RMSE of 0.075891, compared with 0.246699 from the randomized search CV. This is because the randomized search only samples a random subset of the parameter combinations and may not cover the best one, so it is always worth comparing against some manually chosen parameters.

**SVR (Support vector machine) — Polynomial**

For SVR polynomial, we optimised the following parameters:

- C: Regularization parameter. We try different C values to see how the results change.
- epsilon: Epsilon parameter of the epsilon-insensitive loss function. Here, we try different epsilon values from 0.05 to 0.5 to see the effect.
- degree: Degree of the polynomial kernel function. We try different values, but too high a degree may overfit or increase the modelling complexity.

We ran manual tests with the following parameter combinations and got the following results:

- *(C = 1, epsilon=0, degree=3): RMSE = 0.148033 (default)*
- *(C = 1, epsilon=0.05, degree=3): RMSE = 0.161872*
- *(C = 3, epsilon=0, degree=2): RMSE = 0.087236*
- *(C = 3, epsilon=0.2, degree=3): RMSE = 0.095855*
- *(C = 5, epsilon=0.1, degree=2): RMSE = 0.083472*

From the above tests, we can see that the best parameters were C=5, epsilon=0.1, degree=2, resulting in an RMSE of 0.083472 compared with the initial default of 0.148033.

**Facebook prophet**

For Facebook prophet, optimisation is done by adding more features as regressors; the library then handles the fitting for us.

Below is the code to set up Facebook prophet without any optimisation (that is, with no additional regressors).

For optimisation, we referred to the Prophet documentation (https://facebook.github.io/prophet/) and added features as regressors in code as follows.

As shown in the figure below, this is the model without any added regressors.

Below is the model with regressors added to the Facebook prophet model.

As shown, with added features the Facebook prophet package fits and predicts a better output than with none, so we chose to add regressors when using it.

**LSTM (Long Short Term Memory)**

Lastly, the LSTM algorithm: https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM

For this case, we chose to optimise the following parts:

- units: Dimensionality of the output space. Here, we test different dimensions to find the best output.
- dropout layer: Adds a layer that helps prevent overfitting. Here, we also test various dropout rates.
- epoch: Number of complete passes through the training dataset.
- batch_size: Number of samples per gradient update. It must be less than or equal to the number of samples in the training dataset.

We ran manual tests with the following parameter combinations and got the following results:

- *(units = 10, no dropout layer, epoch=10, batch_size=30): RMSE = 0.447728*
- *(units = 30, 3 dropout layers of 0.1 each, epoch=10, batch_size=5): RMSE = 0.11797*
- *(units = 40, 1 dropout layer of 0.1, 1 dropout layer of 0.2, epoch=15, batch_size=20): RMSE = 0.11542*

From the above tests, we can see that the best parameters were units = 40, 1 dropout layer of 0.1, 1 dropout layer of 0.2, epoch=15, batch_size=20, resulting in an RMSE of 0.11542.

**Tuning summary**

Here is a summary of the RMSE data discussed above for further reference. Each row is a model, and the running number shows what was tested: 1 is the default values, and 2, 3, … are the manually tested or randomized search CV results.

# Review

Finally, it’s time to say goodbye… and see our predicted results as well!

Using all of the above, we created the train-predict dataset and then ran our optimised model on it to get the following results.

The graph shows that having 1 or 3 features lets us predict up to 9 or 10 days in advance, but the result is totally different from predicting the next 2–4 days with 13, 28, or 36 features. Since either extreme may overfit or underfit, we select a mid-range option of 5 or 6 features.

Combining this with the features that have been selected, we can see that Covid-19 stats may not have a significant impact on stock price prediction. However, the Google Trends data is selected at a lag value of 4 (i.e. the case where 13 features are used).

# Improvement

As an improvement, if I were to do this project again, I would test more keywords, build more features, explore more training sizes, and tune the ensemble further by adding weights. Other things I would improve are saving the data locally rather than fetching it in real time, as that takes a while to load each time, and the structure of the functions in the notebook.

These factors could potentially improve the final result, as stock prices by nature can change based on a specific news release about the company or any global crisis that occurs.

The current solution only takes into account the stock data, the Covid-19 statistics (which will become unused later on), and Google search trends, which may not fit Thailand’s demographics, as many people may get their news from Twitter or other social media platforms rather than searching on Google.

Another main thing that could be improved is the performance metric. Currently we use the RMSE of the whole prediction set, but breaking the RMSE down into different periods, or adding another performance measure, could improve how we optimise our models.

# Conclusion

To conclude, there may still be a lot of errors and issues with the way this has been modelled and prepared. This is because stock price changes may be due to other factors that have not been considered in this project, such as a change in the board of directors or other global crises.

All in all, I have personally learnt a lot from this project, having gone through the whole flow of getting the data, cleaning it, prepping it, and then modelling and predicting the final results with performance testing.

For future reference, there are many things that can be improved, including the performance metrics used, adding more features, and storing data locally for training and running the project, as runtime was another issue we faced.

For more information or questions relating to the project, please feel free to comment, suggest any further ideas, or drop me a hi :D. The documentation and report for the project can be found at https://github.com/petetae/SET_stock_prediction.

Hope you enjoyed the read!