2. Problem Statement
We work with two datasets, economic news and stock market closing values, and we apply a layered approach in this study, as illustrated in Figure 1. At the bottom layer, starting from the datasets, we build a feature extraction methodology. For the economic news, which is in natural language, we apply sentiment analysis via text mining methods. For the stock market closing values, on the other hand, we apply the random walk method for feature extraction.
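The two feature-extraction paths described above can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the sentiment lexicon, the news text, and the closing prices are hypothetical placeholders.

```python
# Hypothetical sketch of the two feature-extraction paths: a lexicon-based
# sentiment score for news text, and day-to-day increments under a
# random-walk view of closing prices. All data below is illustrative.

POSITIVE = {"growth", "gain", "recovery"}   # assumed sentiment lexicon
NEGATIVE = {"loss", "crisis", "decline"}

def sentiment_features(text: str) -> int:
    """Crude lexicon-based sentiment score for one news item."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def random_walk_features(closes: list[float]) -> list[float]:
    """Under a random-walk model, the informative feature is the
    day-to-day increment, not the price level itself."""
    return [b - a for a, b in zip(closes, closes[1:])]

print(sentiment_features("Strong growth and recovery despite one loss"))  # 1
print(random_walk_features([100.0, 101.5, 101.0]))  # [1.5, -0.5]
```

In a real system the lexicon would be replaced by a trained sentiment model and the increments would feed the classification layer above.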
Figure 1. Bottom-up overview of the study.
The correlation between news and the stock market is one of the indicators of speculative markets (Nikfarjam, Emadzadeh, & Muthaiyah, 2010).
One difficulty in this study is dealing with the natural language data source, which requires feature extraction. Another is dealing with the stock market values, which are treated as a signal. The size of the data, which can be considered big data, is also problematic. The dataset holds 131,248 distinct words, and when the feature vectors of all economic news items are collected, their total size exceeds 2.5 GB, which is beyond the computational capacity of a single computer for these classification algorithms. For example, just one of the classification algorithms, the support vector machine (SVM), requires slightly more than 1 TB of memory in this case.
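A back-of-envelope calculation makes the scale concrete. Only the 131,248-word vocabulary comes from the study; the 8-byte float width and the item count below are illustrative assumptions.

```python
# Rough estimate of the dense feature-matrix size. The vocabulary size is
# from the dataset; the float width and item count are assumptions for
# illustration only.

VOCAB = 131_248          # distinct words reported in the dataset
BYTES_PER_FEATURE = 8    # assuming 64-bit floats

per_item = VOCAB * BYTES_PER_FEATURE   # bytes per dense news vector
print(per_item / 2**20)                # about 1 MiB per news item

def matrix_gib(n_items: int) -> float:
    """Dense feature-matrix size in GiB for n_items news items."""
    return n_items * per_item / 2**30

print(round(matrix_gib(2_500), 2))     # a few thousand items already pass 2 GiB
```

This is why a dense representation quickly outgrows a single machine, and why kernel methods such as SVM, whose memory needs grow with the pairwise structure of the data, hit the limit even sooner.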
Research on stock markets has always been an interesting area of study because of its impact on the business and financial world. Furthermore, developments in computer science have opened the door to research on stock markets in parallel with text-based news since the 2000s. Most of the works apply text mining tools to the news. For example, one popular method is the bag-of-words model (2-6), where some use local dictionaries (2, 6) and some use text mining tools such as IBM Text Miner (Fung, Yu, & Lam, 2002). Others use the TF-IDF approach (3, 6), predefined term dictionaries (Rachlin, Last, Alberg, & Kandel, 2007), part-of-speech tagging (Mahajan, Dey, & Haque, 2008), or concept maps (Soni, Eck, Jan, & Uzay, 2007). Classification also varies, from latent Dirichlet allocation (Mahajan, Dey, & Haque, 2008) to SVM (3-8) or decision trees (Rachlin, Last, Alberg, & Kandel, 2007). Reported success rates vary from 45% to 82%, and the data sources include Reuters market 3000 extra (Fung, Yu, & Lam, 2002), PRNewswire (Mittermayer & Knolmayer, NewsCATS: A News Categorization and Trading System, 2006), FT Intelligence (Soni, Eck, Jan, & Uzay, 2007), the Australian Financial Review (Halgamuge, Y, & Hsu, 2007), the Forbes and Reuters web sites (Rachlin, Last, Alberg, & Kandel, 2007), Yahoo Finance (Schumaker & Chen, 2009), and the Wall Street Journal (Mahajan, Dey, & Haque, 2008).
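The bag-of-words and TF-IDF representations recurring in these works can be sketched in a few lines. This is a toy illustration, assuming a three-document corpus that is not drawn from any of the cited sources.

```python
# Minimal TF-IDF sketch in the spirit of the surveyed methods: rare terms
# receive higher weights than terms common across the corpus. The corpus
# below is an illustrative toy, not data from the cited studies.
import math
from collections import Counter

corpus = [
    "central bank raises rates",
    "bank profits rise on rates",
    "tech stocks rally",
]
docs = [doc.split() for doc in corpus]

def tfidf(doc: list[str]) -> dict[str, float]:
    """Term frequency times inverse document frequency for one document."""
    tf = Counter(doc)
    n = len(docs)
    return {
        w: (tf[w] / len(doc)) * math.log(n / sum(w in d for d in docs))
        for w in doc
    }

weights = tfidf(docs[0])
# "central" occurs in one document, "bank" in two, so "central" weighs more:
print(weights["central"] > weights["bank"])  # True
```

In the surveyed systems these per-document weight vectors become the feature vectors fed to a classifier such as an SVM or a decision tree.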