1 Introduction
In finance, the classic strong efficient market hypothesis (EMH) posits that stock prices follow a random walk and cannot be predicted fama1965behavior . Consequently, the well-known capital asset pricing model (CAPM) sharpe1964capital ; lintner1975valuation ; black1972capital serves as the foundation for portfolio management and asset pricing, among many other applications in financial engineering. The CAPM assumes a linear relationship between the expected return of an asset (e.g., a portfolio, an index, or a single stock) and its covariance with the market return, i.e., for a single stock, CAPM simply predicts its return $r_i$ within a certain market with the linear equation
$$\mathbb{E}[r_i] = \alpha_i + \beta_i \, \mathbb{E}[r_m],$$
where the Alpha ($\alpha_i$) describes the stock’s ability to beat the market, also referred to as its “excess return” or “edge”, and the Beta ($\beta_i$) is the sensitivity of the expected returns of the stock to the expected market returns ($\mathbb{E}[r_m]$). Both Alpha and Beta are often fitted using simple linear regression based on the historical data of returns. Under the EMH, the Alphas are entirely random with an expected value of zero, and cannot be predicted.
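As a minimal sketch of the regression fit mentioned above, the following computes Alpha and Beta by ordinary least squares. The function name `fit_capm` and the return series are illustrative assumptions (the data are synthetic, not taken from this study).

```python
from statistics import mean


def fit_capm(stock_returns, market_returns):
    """Ordinary least squares fit of r_stock = alpha + beta * r_market."""
    if len(stock_returns) != len(market_returns) or len(stock_returns) < 2:
        raise ValueError("need two equally long return series with >= 2 points")
    mean_s = mean(stock_returns)
    mean_m = mean(market_returns)
    # Beta is the covariance with the market divided by the market variance.
    cov_sm = sum((s - mean_s) * (m - mean_m)
                 for s, m in zip(stock_returns, market_returns))
    var_m = sum((m - mean_m) ** 2 for m in market_returns)
    if var_m == 0:
        raise ValueError("market returns have zero variance")
    beta = cov_sm / var_m
    # Alpha is the intercept: the part of the return not explained by the market.
    alpha = mean_s - beta * mean_m
    return alpha, beta


# Synthetic daily returns: the stock is constructed as 0.001 + 1.5 * market.
market = [0.010, -0.020, 0.005, 0.015, -0.010]
stock = [0.001 + 1.5 * m for m in market]
alpha, beta = fit_capm(stock, market)
print(f"alpha={alpha:.4f}, beta={beta:.4f}")
```

Because the synthetic stock series is an exact linear function of the market series, the fit recovers the constructed intercept and slope; with real returns the residual noise would make both estimates uncertain.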
In practice, however, financial markets are more complicated than the idealized and simplified strong EMH and CAPM. Active traders and empirical studies suggest that the financial market is never perfectly efficient and thus the stock prices as well as the Alphas can be predicted, at least to some extent. Based on this belief, stock prediction has long played a key role in numerous data-driven decision-making scenarios in financial markets, such as deriving trading strategies. Among various methods for stock market prediction, the classical Box-Jenkins models box1968some , exponential smoothing techniques, and state space models hyndman2008forecasting for time series analysis are the most widely adopted, in which the factors of autoregressive structure, trend, seasonality, etc. are independently estimated from the historical observations of each single series. In recent years, researchers as well as the industry have deployed various machine learning models to forecast the stock market, such as k-nearest neighbors (kNN) alkhatib2013stock ; chen2017feature , hidden Markov model (HMM) hassan2007fusion ; hassan2013hmm , support vector machine (SVM) yang2002support ; huang2009hybrid , artificial neural network (ANN) wang2011forecasting ; guresen2011using ; kristjanpoller2014volatility ; wang2015back ; goccken2016integrating , and various hybrid and ensemble methods patel2015predicting ; booth2014automated ; barak2015developing ; weng2018macroeconomic , among many others. The literature has demonstrated that machine learning models typically outperform traditional statistical time series models, which may be mainly due to the following reasons: 1) less strict assumptions on the data distribution, 2) various model architectures can effectively learn complex linear and nonlinear relationships from data, 3) sophisticated regularization techniques and feature selection procedures provide flexibility and strength in handling correlated input features and controlling overfitting, so that more features can be included in the machine learning models. As the fluctuation of the stock market indeed depends on a variety of related factors, in addition to utilizing the historical information of stock prices and volumes as in traditional technical analysis
murphy1999technical , recent research on stock market forecasting has been focusing on informative external sources of data, for instance, the accounting performance of the company fama1993common , macroeconomic effects tetlock2008more ; weng2018macroeconomic , government intervention and political events li2016tensor , etc. With the increased popularity of web technologies and their continued evolution, the opinions of the public from relevant news weng2018predicting and social media texts bollen2011twitter ; oliveira2017impact have an increasing effect on stock movement, and various studies have confirmed that combining extensive crowdsourcing and/or financial news data facilitates more accurate prediction wang2018combining .
During the last decade, with the emergence of deep learning, various neural network models have been developed and achieved success in a broad range of domains, such as computer vision lecun1998gradient ; krizhevsky2012imagenet ; simonyan2014very ; redmon2016yolo and natural language processing mikolov2013efficient ; pennington2014glove ; devlin2018bert . For stock prediction specifically, recurrent neural networks (RNNs) are the most preferred deep learning models to be implemented
rather2015recurrent ; fischer2018deep . Convolutional neural networks (CNNs) have also been utilized; however, most of the work transformed the financial data into images to apply 2D convolutions as in standard computer vision applications. For example, the authors of sezer2018algorithmic converted the technical indicator data to 2D images and classified the images with a CNN to predict the trading signals. Alternatively, hu2018candlestick directly used candlestick chart graphs as inputs to determine the Buy, Hold and Sell behavior as a classification task, while in sezer2019bar , bar chart images were fed into a CNN. The authors of hoseinzade2019cnnpred used a 3D CNN-based framework to extract features from various sources of data, including different markets, for predicting the next day’s direction of movement of five major stock indices, which showed significantly improved prediction performance compared to the baseline algorithms. There also exists research combining RNNs and CNNs, in which the temporal patterns were learned by RNNs, while CNNs were only used for either capturing the correlation between nearby series (in which the order matters if there are more than two series) or learning from images, see long2019deep ; jiang2017deep . The deployment of CNNs in all these studies differs significantly from ours, since we aim at capturing the temporal patterns without relying on two-dimensional convolutions. In di2016artificial , a 1D causal CNN was used for making predictions based on the history of closing prices only, while no other features were considered.
Note that all of the aforementioned work has put its effort into learning more accurate Alphas, and most of the existing research focuses on deriving separate models for each stock, while only a few authors consider the correlation among different stocks over the entire market as a possible source of information. In other words, the Betas are often ignored. At the same time, since it is natural to assume that markets can have a nontrivial correlation structure, it should be possible to extract useful information from the group behavior of assets. Moreover, rather than the simplified linearity assumed in CAPM, the true Betas may exhibit more complicated nonlinear relationships between the stock and the market.
In this paper, we propose a new deep learning framework that leverages both the underlying Alphas and (nonlinear) Betas. In particular, our approach innovates in the following aspects:

From the model architecture perspective, we build a hybrid model that combines the advantages of both representation learning and deep networks. With representation learning, specifically, we use embeddings in the deep learning model to derive implicit Betas, which we refer to as Stock2Vec, that not only give us insight into the correlation structure among stocks, but also help the model learn more effectively from the features, thus improving prediction performance. In addition, building on recent advances in deep learning architectures, in particular the temporal convolutional network, we further refine Alphas by letting the model automatically extract temporal information from raw historical series.

From the data source perspective, unlike many time series forecasting works that learn directly from raw series, we generate technical indicator features supplemented with external sources of information such as online news. Our approach also differs from most research built on machine learning models, since in addition to explicit hand-engineered temporal features, we use the raw series as augmented data input. More importantly, instead of training separate models on each single asset as in most stock market prediction research, we learn a global model on the available data over the entire market, so that the relationships among different stocks can be revealed.
The rest of this paper is organized as follows. Section 2 lists several recent advances that are related to our method, in particular deep learning and its applications in forecasting, as well as representation learning. Section 3 illustrates the building blocks and details of our proposed framework, specifically the Stock2Vec embedding and the temporal convolutional network, as well as how our hybrid models are built. Our models are evaluated on the S&P 500 stock price data and benchmarked against several others; the evaluation results as well as the interpretation of Stock2Vec are shown in Section 5. Finally, we conclude our findings and discuss meaningful directions for future work in Section 6.
2 Related Work
Recurrent neural networks (RNNs) and their sequence-to-sequence (Seq2Seq) variants sutskever2014sequence have achieved great success in many sequential modeling tasks, such as machine translation cho2014learning , speech recognition sak2014long , natural language processing bengio2003neural , and, in recent years, extensions to autoregressive time series forecasting salinas2019deepar ; rangapuram2018deep . However, RNNs can suffer from several major challenges. Due to their inherent temporal nature (i.e., the hidden state is propagated through time), the training cannot be parallelized. Moreover, trained with backpropagation through time (BPTT) werbos1990backpropagation , RNNs severely suffer from the problem of vanishing gradients and thus often cannot capture long-term dependencies pascanu2013difficulty . More elaborate RNN architectures use gating mechanisms to alleviate the vanishing gradient problem, with the long short-term memory (LSTM) hochreiter1997lstm and its simplified variant, the gated recurrent unit (GRU) chung2014gru , being the two canonical architectures commonly used in practice.
Another approach, convolutional neural networks (CNNs) lecun1989backpropagation , can be easily parallelized, and recent advances effectively eliminate the vanishing gradient issue and hence help in building very deep CNNs. These works include the residual network (ResNet) he2016deep and its variants such as the highway network srivastava2015training , DenseNet huang2017densely , etc. In the area of sequential modeling, 1D convolutional networks have offered an alternative to RNNs for decades waibel1989phoneme . In recent years, oord2016wavenet proposed WaveNet, a dilated causal convolutional network as an autoregressive generative model. Since then, multiple research efforts have shown that with a few modifications, certain convolutional architectures achieve state-of-the-art performance in the fields of audio synthesis oord2016wavenet , language modeling dauphin2017language , machine translation gehring2017convolutional , action detection lea2017temporal , and time series forecasting binkowski2018autoregressive ; chen2020probabilistic . In particular, bai2018empirical abandoned the gating mechanism in WaveNet and proposed the temporal convolutional network (TCN). The authors benchmarked TCN against LSTM and GRU on several sequence modeling problems, and demonstrated that TCN exhibits substantially longer memory and achieves better performance.
Learning distributed representations has also been extensively studied bengio2000modeling ; paccanaro2001learning ; hinton1986learning , with arguably the most well-known application being word embedding bengio2003neural ; mikolov2013efficient ; pennington2014glove in language modeling. Word embedding maps words and phrases into distributed vectors in a semantic space in which words with similar meanings are closer, and some interesting relations among words can be revealed, such as
$$\mathrm{vec}(\text{King}) - \mathrm{vec}(\text{Man}) + \mathrm{vec}(\text{Woman}) \approx \mathrm{vec}(\text{Queen}),$$
as shown in mikolov2013efficient . Motivated by Word2Vec, neural embedding methods have been extended to other domains in recent years. The authors of barkan2016item2vec obtained item embeddings for recommendation systems through a collaborative filtering neural model, and called it Item2Vec, which is capable of inferring relations between items even when user information is not available. Similarly, choi2016multi proposed Med2Vec, which learns medical concepts from the sequential order and co-occurrence of the concept codes within patients’ visits, and showed higher prediction accuracy in clinical applications. In guo2016entity , the authors mapped each categorical feature into an “entity embedding” space for structured data and applied it successfully in a Kaggle competition; they also showed that the learned geometric embedding coincides with the real map surprisingly well when projected to 2D space.
In the field of stock prediction, the term “Stock2Vec” has already been used before. Specifically, minh2018deep trained a word embedding specialized in sentiment analysis on top of the original GloVe and Word2Vec language models; using such a “Stock2Vec” embedding and a two-stream GRU model to generate the input data from financial news and stock prices, the authors predicted the price direction of the S&P 500 index. The authors of wu2019deep proposed another “Stock2Vec” which can also be seen as a specialized Word2Vec, trained using a co-occurrence matrix whose entries are the numbers of news articles that mention both stocks. The Stock2Vec model proposed here differs from these homonymous approaches and has its own distinct characteristics. First, our Stock2Vec is an entity embedding that represents the stock entities, rather than a word embedding that denotes the stock names through language modeling. Since the difference between entity embedding and word embedding may seem ambiguous, we stress a second, more important distinction: instead of training linguistic models on the co-occurrences of words, our Stock2Vec embedding is trained directly as features through the overall predictive model, with the direct objective of minimizing prediction errors, thus illustrating the relationships among entities, while the others are actually fine-tuned subsets of the original Word2Vec language model. Particularly inspiring for our work are the entity embedding guo2016entity and the temporal convolutional network bai2018empirical .
3 Methodology
3.1 Problem Formulation
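We focus on predicting the future values of stock market assets given the past. More formally, our input consists of a fully observable time series signal $\mathbf{y}_{1:T} = (y_1, \dots, y_T)$ together with another related multivariate series $\mathbf{X}_{1:T} = (\mathbf{x}_1, \dots, \mathbf{x}_T)$, in which $\mathbf{x}_t \in \mathbb{R}^{n-1}$, and $n$ is the total number of series in the data. We aim at generating the corresponding target series $\hat{\mathbf{y}}_{T+1:T+h} = (\hat{y}_{T+1}, \dots, \hat{y}_{T+h})$ as the output, where $h \geq 1$ is the prediction horizon in the future. To achieve the goal, we will learn a sequence modeling network with parameters $\theta$ to obtain a nonlinear mapping from the input state space to the predicted series, i.e., $\hat{\mathbf{y}}_{T+1:T+h} = f(\mathbf{X}_{1:T}, \mathbf{y}_{1:T} \mid \theta)$, so that the distribution of our output is as close to the distribution of the true future values as possible. That is, we wish to find
$$\min_{\theta} \; \mathbb{E}_{\mathbf{X}, \mathbf{y}} \sum_{t=T+1}^{T+h} \mathrm{KL}\left( y_t \,\|\, \hat{y}_t \right).$$
Here, we use the Kullback-Leibler (KL) divergence to measure the difference between the distributions of the true future values $\mathbf{y}_{T+1:T+h}$ and the predictions $\hat{\mathbf{y}}_{T+1:T+h}$. Note that our formulation can be easily extended to multivariate forecasting, in which the output and the corresponding input become multivariate series $\hat{\mathbf{Y}}_{T+1:T+h} \in \mathbb{R}^{s \times h}$ and $\mathbf{Y}_{1:T} \in \mathbb{R}^{s \times T}$, respectively, where $s$ is the number of forecasting variables. The related input series is then $\mathbf{X}_{1:T} \in \mathbb{R}^{(n-s) \times T}$, and the overall objective becomes
$$\min_{\theta} \; \mathbb{E}_{\mathbf{X}, \mathbf{Y}} \sum_{t=T+1}^{T+h} \sum_{i=1}^{s} \mathrm{KL}\left( y_{i,t} \,\|\, \hat{y}_{i,t} \right).$$
In this paper, in order to increase the sample efficiency and maintain a relatively small number of parameters, we will train separate models to forecast each series individually.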
3.2 A Distributional Representation of Stocks: Stock2Vec
In machine learning, categorical variables that are not ordinal are often one-hot encoded into a sparse representation, i.e.,
$$x \mapsto \boldsymbol{\delta}_x = \left( \delta_{x,c} \right)_{c=1}^{m},$$
where $\delta_{x,c}$ is the Kronecker delta, in which each dimension $c$ represents a possible category. Let the number of categories of $x$ be $m$; then $\boldsymbol{\delta}_x$ is a vector of length $m$ with the only element set to 1 for $c = x$, and all others being zero. Note that although it provides a convenient and simple way of representing categorical variables with numeric values for computation, one-hot encoding has various limitations. First of all, it does not place similar categories closer to one another in vector space: within one-hot encoded vectors, all categories are orthogonal to each other and thus totally uncorrelated, i.e., the encoding cannot provide any information on similarity or dissimilarity between the categories. In addition, if $m$ is large, one-hot encoded vectors are high-dimensional and sparse, which means that a prediction model has to involve a large number of parameters, resulting in inefficient computations. For the cross-sectional data that we use for the stock market, the number of interactions between all pairs of stocks grows quadratically with the number of symbols we consider; for example, there are on the order of $10^5$ pairwise interactions among the S&P 500 stocks. This number grows even faster as we add more features to describe the stock price performance. Therefore, trading on cross-sectional signals is remarkably difficult, and approximation methods are often applied.
We would like to overcome the above-mentioned issues by reducing the dimensionality of the categorical variables. Common (linear) dimensionality reduction techniques include principal component analysis (PCA) and singular value decomposition (SVD), which operate by retaining the first few eigenvectors or singular vectors corresponding to the largest few eigenvalues or singular values. PCA and SVD make efficient use of the statistics of the data and have proven effective in various fields, yet they do not scale well to big matrices (e.g., the computational cost is $O(pq^2)$ for a $p \times q$ matrix with $p \geq q$), and they cannot adapt to minor changes in the data. In addition, the unsupervised transformations based on PCA or SVD do not use the response variable, and hence it is possible that the derived components that serve as surrogate predictors have no suitable relationship with the target. Moreover, since PCA and SVD utilize only the first and second moments, they rely heavily on the assumption that the original data are approximately Gaussian, which also limits their effectiveness.
Neural embedding is another approach to dimensionality reduction. Instead of computing and storing global information about the big dataset as in PCA or SVD, neural embedding provides a way to learn iteratively and directly on a supervised task. In this paper, we present a simple probabilistic method, Stock2Vec, that learns a dense distributional representation of stocks in a relatively low-dimensional space, and is able to capture the correlations and other more complicated relations between stock prices as well.
The idea is to design a model whose parameters are the embeddings. We call a mapping $\phi: x \mapsto \mathbb{R}^{D}$ a $D$-dimensional embedding of $x$, and $\phi(x)$ the embedded representation of $x$. Suppose the transformation is linear; then the embedded representation can be written as
$$\phi(x) = \mathbf{W}^{\top} \boldsymbol{\delta}_x,$$
where $\boldsymbol{\delta}_x \in \{0,1\}^{m}$ is the one-hot encoding of $x$ over its $m$ categories and $\mathbf{W} \in \mathbb{R}^{m \times D}$. The linear embedding mapping is equivalent to an extra fully-connected layer without nonlinearity on top of the one-hot encoded input. Each output of the extra linear layer is then given as
$$z_d = \sum_{c=1}^{m} w_{cd} \, \delta_{x,c} = w_{xd},$$
where $d$ stands for the index of the embedding layer, and $w_{cd}$ is the weight connecting the one-hot encoding layer to the embedding layer. The number of dimensions $D$ for the embedding layer is a hyperparameter that can be tuned based on experimental results, usually bounded between 1 and $m - 1$. For our Stock2Vec, as we will introduce in Section 4, there are 503 different stocks, and we will map them into a 50-dimensional space.
The assumption behind learning a distributional representation is that series with similar or opposite movements tend to be correlated with each other, which is consistent with the assumption of CAPM that the return of a stock is correlated with the market return, which in turn is determined by all stocks’ returns in the market. We learn the embeddings as part of the neural network for the target task of stock prediction. In order to learn the intrinsic relations among different stocks, we train the deep learning model on data of all symbols over the market, where each datum maintains the features for its particular symbol’s own properties, including the symbol itself as a categorical feature, with the target of predicting the next day’s price. The training objective is to minimize the mean squared error of the predicted prices, as usual.
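The equivalence between an embedding lookup and a linear layer applied to a one-hot vector can be checked numerically. The sketch below uses the sizes quoted in the text (503 stocks, 50 dimensions); the random weights, the chosen index, and the helper names are illustrative assumptions, not trained Stock2Vec parameters.

```python
import random


def one_hot(index, num_categories):
    """Return the one-hot vector with a single 1.0 at position `index`."""
    return [1.0 if c == index else 0.0 for c in range(num_categories)]


def linear_layer(one_hot_vec, weights):
    """Dense layer without bias or nonlinearity: z_d = sum_c delta_c * w[c][d]."""
    dim = len(weights[0])
    return [
        sum(one_hot_vec[c] * weights[c][d] for c in range(len(weights)))
        for d in range(dim)
    ]


random.seed(0)
NUM_STOCKS, EMBED_DIM = 503, 50  # sizes quoted in the text
# Randomly initialised weight matrix standing in for a trained embedding table.
weights = [[random.gauss(0.0, 0.1) for _ in range(EMBED_DIM)]
           for _ in range(NUM_STOCKS)]

stock_index = 42  # hypothetical position of one ticker in the vocabulary
via_matmul = linear_layer(one_hot(stock_index, NUM_STOCKS), weights)
via_lookup = weights[stock_index]  # row lookup, as an embedding layer does

max_diff = max(abs(a - b) for a, b in zip(via_matmul, via_lookup))
print(f"max abs difference: {max_diff}")
```

The lookup avoids materialising the length-503 one-hot vector, which is why embedding layers are implemented as table lookups in practice.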
3.3 Temporal Convolutional Network
In contrast to standard fully-connected neural networks, in which a separate weight describes the interaction between each input and output pair, CNNs share parameters across multiple mappings. This is achieved by constructing a collection of kernels (a.k.a. filters) of fixed size (generally much smaller than that of the input), each consisting of a set of trainable parameters; the number of parameters is therefore greatly reduced. Multiple kernels are usually trained and used together, each specialized in capturing a specific feature from the data. Note that the so-called convolution operation is technically a cross-correlation in general, which generates linear combinations of a small subset of the input, thus focusing on local connectivity. In CNNs we generally assume that the input data has some grid-like topology and that a pattern has the same characteristics at every location, which yields the property of equivariance to translation goodfellow2016deep . The size of the output then depends not only on the size of the input, but also on several settings of the kernels: the stride, the padding, and the number of kernels. The stride denotes the interval between two consecutive convolution centers, and can be thought of as downsampling the output. With padding, we add values (most often zeros) at the boundary of the input, which is primarily used to control the output size but, as we will show later, can also be applied to manage the starting position of the convolution operation on the input. The number of kernels adds another dimension to the output, and is often referred to as the number of channels.
3.3.1 1D Convolutional Networks
Sequential data often display long-term correlations and can be thought of as a 1D grid with samples taken at regular time intervals. CNNs have shown success in time series applications, in which the 1D convolution is simply an operation of sliding dot products between the input vector and the kernel vector. However, we make several modifications to traditional 1D convolutions according to recent advances. The detailed building blocks of our temporal CNN components are illustrated in the following subsections.
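To make the sliding dot product and the role of stride and padding concrete, here is a small sketch of a plain (non-causal) 1D convolution in its cross-correlation form. The function names and the toy input are assumptions for illustration; the output-length formula is the standard one and assumes the padded input is at least as long as the kernel.

```python
def conv1d(series, kernel, stride=1, padding=0):
    """Slide `kernel` over the zero-padded series and take dot products."""
    padded = [0.0] * padding + list(series) + [0.0] * padding
    k = len(kernel)
    return [
        sum(kernel[i] * padded[start + i] for i in range(k))
        for start in range(0, len(padded) - k + 1, stride)
    ]


def conv1d_output_length(n, k, stride=1, padding=0):
    """Standard output size: floor((n + 2 * padding - k) / stride) + 1."""
    return (n + 2 * padding - k) // stride + 1


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # toy input series
w = [1.0, 0.0, -1.0]                 # toy kernel of size 3
y = conv1d(x, w, stride=2, padding=1)
print(len(y), conv1d_output_length(len(x), len(w), stride=2, padding=1))
```

A larger stride shortens the output (downsampling), while padding lengthens it; the causal variant described next pads only on the left.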
3.3.2 Causal Convolutions
As mentioned above, in a traditional 1D convolutional layer, the filters are slid across the input series. As a result, each output depends on the inputs both before and after it. As shown in Figure 2(a), by applying a filter of width 2 without padding, the predicted outputs $\hat{x}_1, \dots, \hat{x}_{T-1}$ are generated using the input series $x_1, \dots, x_T$. The most severe problem with this structure is that we use the future to predict the past, e.g., we have used $x_2$ to generate $\hat{x}_1$, which is not appropriate in time series analysis. To avoid this issue, causal convolutions are used, in which the output $\hat{x}_t$ is convolved only with input data from the previous layer that are earlier than or at time $t$. We achieve this by explicit zero padding of length $k - 1$, where $k$ is the filter width, at the beginning of the input series; as a result, we have effectively shifted the outputs by a number of time steps. In this way, the prediction at time $t$ is only allowed to connect to historical information, i.e., in a causal structure; thus we prevent the future from affecting the past and avoid information leakage. The resulting causal convolution is visualized in Figure 2(b).
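The left-padding construction can be illustrated with a few lines of plain Python. This is a sketch under the assumption that the filter has width $k$ and that $k - 1$ zeros are prepended; the function name and the toy series are not from the paper.

```python
def causal_conv1d(series, kernel):
    """Causal 1D convolution via left zero padding of length len(kernel) - 1.

    output[t] = sum_i kernel[i] * series[t - i], with series[j] = 0 for j < 0,
    so output[t] depends only on series[0..t].
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(series)
    return [
        sum(kernel[i] * padded[t + (k - 1) - i] for i in range(k))
        for t in range(len(series))
    ]


x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [0.5, 0.5]  # width-2 filter: average of the current and previous value
y = causal_conv1d(x, w)

# Changing only the last two (future) inputs must leave earlier outputs intact.
x_changed = x[:3] + [100.0, -100.0]
y_changed = causal_conv1d(x_changed, w)
print(y[:3] == y_changed[:3])
```

The output has the same length as the input, and perturbing inputs after time $t$ never alters the outputs up to $t$, which is the no-leakage property the section describes.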
3.3.3 Dilated Convolutions
Time series often exhibit long-term autoregressive dependencies. Hence, with neural network models, we require the receptive field of the output neuron to be large. That is, the output neuron should be connected with the neurons that receive the input data from many time steps in the past. A major disadvantage of the aforementioned basic causal convolution is that in order to have a large receptive field, either very large filters are required, or many layers need to be stacked. With the former, the merit of the CNN architecture is lost, and with the latter, the model can become computationally intractable. Following oord2016wavenet , we adopted dilated convolutions in our model instead, which are defined as
$$F(s) = (\mathbf{x} *_{d} f)(s) = \sum_{i=0}^{k-1} f(i) \cdot x_{s - d \cdot i},$$
where $\mathbf{x} \in \mathbb{R}^{T}$ is a 1D series input, $f: \{0, \dots, k-1\} \to \mathbb{R}$ is a filter of size $k$, $d$ is called the dilation rate, and $s - d \cdot i$ accounts for the direction of the past. In a dilated convolutional layer, filters are not convolved with the inputs in a simple sequential manner, but instead skip a fixed number of inputs in between, determined by the dilation rate $d$. By increasing the dilation rate multiplicatively with the layer depth (e.g., a common choice is $d = 2^{j}$ at depth $j$), we increase the receptive field exponentially, i.e., with filter size $k$ there are on the order of $2^{l-1} k$ inputs in the first layer that can affect the output in the $l$-th hidden layer. Figure 3 compares non-dilated and dilated causal convolutional layers.
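The exponential growth of the receptive field can be verified with a small experiment. The sketch below implements a dilated causal convolution in plain Python and counts how many output positions a single input impulse reaches after stacking layers with dilations 1, 2, 4. The helper names, the all-ones filter, and the dilation schedule are illustrative assumptions; the closed form $1 + (k-1)\sum_j d_j$ is the general receptive-field size for stacked dilated layers with the same filter size.

```python
def dilated_causal_conv1d(series, kernel, dilation):
    """F(s) = sum_i kernel[i] * series[s - dilation * i], zero for s - d*i < 0."""
    out = []
    for s in range(len(series)):
        total = 0.0
        for i, weight in enumerate(kernel):
            j = s - dilation * i
            if j >= 0:
                total += weight * series[j]
        out.append(total)
    return out


def receptive_field(kernel_size, dilations):
    """Number of inputs that can influence one output of the stacked layers."""
    return 1 + (kernel_size - 1) * sum(dilations)


def stack(series, kernel, dilations):
    """Apply one dilated causal layer per dilation rate, in order."""
    out = list(series)
    for d in dilations:
        out = dilated_causal_conv1d(out, kernel, d)
    return out


impulse = [1.0] + [0.0] * 15          # a single non-zero input at time 0
dilations = [1, 2, 4]                 # d = 2**j at depth j = 0, 1, 2
response = stack(impulse, [1.0, 1.0], dilations)
touched = sum(1 for v in response if v != 0.0)
print(touched, receptive_field(2, dilations))
```

With filter size 2, three layers already cover 8 time steps, whereas three non-dilated layers of the same size would cover only 4.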
3.3.4 Residual Connections
In traditional neural networks, each layer feeds into the next. In a network with residual blocks, by utilizing skip connections, a layer may also shortcut to jump over several others. The residual network (ResNet) he2016deep has proven very successful and has become the standard way of building deep CNNs. The core idea of ResNet is the use of a shortcut connection which skips one or more layers and directly connects to later layers (the so-called identity mapping), in addition to the standard layer stacking connection $\mathcal{F}(\cdot)$. Figure 4 illustrates a residual block, which is the basic unit in ResNet. A residual block consists of the above-mentioned two branches, and its output is then $\sigma(\mathbf{x} + \mathcal{F}(\mathbf{x}))$, where $\mathbf{x}$ denotes the input to the residual block, and $\sigma$ is the activation function.
By reusing activations from a previous layer until the adjacent layer learns its weights, CNNs can effectively avoid the problem of vanishing gradients. In our model, we implemented double-layer skips.
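A residual block with a double-layer skip can be sketched as follows. For brevity, the two stacked layers here are small dense layers rather than the dilated causal convolutions used in a TCN, and all names and weights are illustrative assumptions rather than the implementation used in this work.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def dense(values, weights, bias):
    """Affine map: one output per weight row."""
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, bias)]


def residual_block(x, w1, b1, w2, b2):
    """o = relu(x + F(x)), where F is two stacked layers (double-layer skip)."""
    hidden = relu(dense(x, w1, b1))   # first layer of the stacked branch
    fx = dense(hidden, w2, b2)        # second layer, no activation yet
    return relu([a + f for a, f in zip(x, fx)])  # add the identity shortcut


x = [1.0, 2.0, 0.5]
zeros = [[0.0] * 3 for _ in range(3)]
no_bias = [0.0] * 3

# With an untrained (all-zero) stacked branch, the block passes x through.
out = residual_block(x, zeros, no_bias, zeros, no_bias)
print(out)
```

The demonstration shows why the shortcut helps early in training: even when the stacked branch contributes nothing, the signal (and hence the gradient) still flows through the identity path.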
3.4 The Hybrid Model
Our overall prediction model is constructed as a hybrid, combining the Stock2Vec embedding approach with an advanced implementation of TCN, as schematically represented in Figure 5. Compared with Figure 1, it contains an additional TCN module. However, instead of producing final prediction outputs of size 1, we let the TCN module output a vector as a feature map that contains information extracted from the temporal series. As a result, it adds a new source of features, which can be concatenated with the learned Stock2Vec features. Note that the TCN module can be replaced by any other architecture that learns temporal patterns, for example, an LSTM-type network. Finally, a series of fully-connected layers (referred to as “head layers”) is applied to the combined features, producing the final prediction output. Implementation details are discussed in Section 5.1.
Note that in each TCN block, the convolutional layers use dropout in order to limit the influence that earlier data have on learning srivastava2014dropout ; gal2016dropout . Each is then followed by a batch normalization layer ioffe2015batchnorm . Both dropout and batch normalization provide a regularization effect that helps avoid overfitting. The most widely used activation function, the rectified linear unit (ReLU) nair2010rectified , is used after each layer except for the last one.
4 Data Specification
The case study is based on daily trading data for 503 assets listed on the S&P 500 index, downloaded from Yahoo! Finance for the period of 2015/01/01–2020/02/18 (out of 505 assets listed on https://en.wikipedia.org/wiki/List_of_S%26P_500_companies, two did not have data spanning the whole period). Following the literature, we use the next day’s closing price as the target label for each asset, while the adjusted closing prices up until the current date can be used as inputs. In addition, we use the downloaded open/high/low prices and volume data to calculate some commonly used technical indicators that reflect price variation over time, which serve as augmented features. In our study, eight commonly used technical indicators murphy1999technical are selected, which are described in Table 1. As we discussed in Section 1, it has also been shown in the literature that assets’ media exposure and the corresponding text sentiment are highly correlated with stock prices. To account for this, we acquired another set of features through the Quandl API. The database “FinSentS Web News Sentiment” (https://www.quandl.com/databases/NS1/) is used in this study. The queried dataset includes the daily number of news articles about each stock, as well as a sentiment score that measures the texts used in media, based on proprietary algorithms for web scraping and natural language processing.
We further extracted several date/time-related variables for each entry to explicitly capture seasonality; these features include month of year, day of month, day of week, etc. All of the above-mentioned features are dynamic features that are time-dependent. In addition, we gathered a set of static features that are time-independent. Static covariates (e.g., the symbol name, sector and industry category, etc.) can assist the feature-based learner in capturing series-specific information such as the scale level and trend of each series. The distinction between dynamic and static features is important for model architecture design, since it is unnecessary to process the static covariates with RNN cells or CNN convolution operations for capturing temporal relations (e.g., autocorrelation, trend, and seasonality).
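Extracting the calendar variables named above is straightforward with the standard library. In this sketch the function name and dictionary keys are illustrative assumptions; note that `date.weekday()` numbers Monday as 0.

```python
from datetime import date


def calendar_features(day):
    """Return simple date-derived features for one trading day."""
    return {
        "month_of_year": day.month,
        "day_of_month": day.day,
        "day_of_week": day.weekday(),  # Monday = 0, ..., Sunday = 6
    }


# Last date of the study period quoted in the text.
features = calendar_features(date(2020, 2, 18))
print(features)
```

Such integer-coded variables can then be treated as categorical inputs and passed through embedding layers like any other categorical feature.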
Table 1: Description of the technical indicators used in this study.

| Technical Indicator | Category | Description |
|---|---|---|
| Moving average convergence or divergence (MACD) | Trend | Reveals price change in strength, direction and trend duration |
| Parabolic Stop And Reverse (PSAR) | Trend | Indicates whether the current trend is to continue or to reverse |
| Bollinger Bands (BB®) | Volatility | Forms a range of prices for trading decisions |
| Stochastic Oscillator (SO) | Momentum | Indicates turning points by comparing the price to its range |
| Rate Of Change (ROC) | Momentum | Measures the percent change of the prices |
| On-Balance Volume (OBV) | Volume | Accumulates volume on price direction to confirm price moves |
| Force Index (FI) | Volume | Measures the amount of strength behind price move |
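As an example of how one of these indicators is derived from raw prices, the sketch below computes the Rate Of Change using its standard definition, $\mathrm{ROC}_t = 100 \cdot (P_t - P_{t-p}) / P_{t-p}$. The look-back period $p$ and the sample prices are assumptions for illustration; the paper does not state which periods were used.

```python
def rate_of_change(prices, period):
    """Percent change of each price relative to the price `period` steps earlier.

    Positions without enough history are filled with None.
    """
    if period < 1:
        raise ValueError("period must be a positive integer")
    roc = [None] * min(period, len(prices))
    for t in range(period, len(prices)):
        past = prices[t - period]
        roc.append(100.0 * (prices[t] - past) / past)
    return roc


closes = [100.0, 102.0, 101.0, 105.0, 110.0]  # made-up closing prices
roc2 = rate_of_change(closes, period=2)
print(roc2)
```

The leading `None` entries mark days for which the indicator is undefined; in a training pipeline such rows would typically be dropped or imputed.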
Note that the features can also be split into categorical and continuous. Each of the categorical features is mapped to dense numeric vectors via embedding; in particular, the vectors embedded from the stock name as a categorical feature are called Stock2Vec. We scale all continuous features (as well as the next day’s price as the target) to between 0 and 1, since it is widely accepted that neural networks are hard to train and are sensitive to input scale glorot2010understanding ; ioffe2015batchnorm , while some alternative approaches, e.g., decision trees, are scale-invariant covington2016deep . It is important to note that we performed scaling separately on each asset, i.e., a linear transformation is performed so that the lowest and highest prices for asset A over the training period are 0 and 1, respectively. Also note that scaling statistics are obtained from the training set only, which prevents leakage of information from the test set and avoids introducing look-ahead bias.
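The per-asset scaling with training-only statistics can be sketched as follows. The ticker names, prices, and function names are hypothetical; the point is that the minimum and maximum are fitted on the training period and then reused unchanged for later data.

```python
def fit_min_max(train_prices_by_asset):
    """Record (min, max) per asset using training-period prices only."""
    return {asset: (min(prices), max(prices))
            for asset, prices in train_prices_by_asset.items()}


def transform(asset, prices, stats):
    """Apply the asset's training-period min-max scaling to any prices."""
    low, high = stats[asset]
    span = high - low
    if span == 0:
        return [0.0 for _ in prices]  # constant training series: map to 0
    return [(p - low) / span for p in prices]


# Hypothetical tickers and prices, for illustration only.
train = {"AAA": [10.0, 20.0, 15.0], "BBB": [100.0, 300.0, 200.0]}
stats = fit_min_max(train)

scaled_train = transform("AAA", train["AAA"], stats)
scaled_test = transform("AAA", [25.0], stats)  # unseen, later price
print(scaled_train, scaled_test)
```

Because the statistics come from the training period alone, a later price above the training maximum maps to a value greater than 1; this is expected and is the cost of avoiding look-ahead bias.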
As a tentative illustration, Figure 6 shows the 20 most important features for predicting the next day’s stock price, according to the XGBoost model we trained for benchmarking.
In our experiments, the data are split into training, validation, and test sets. The last 126 trading days of data, covering the period from 2019/08/16 to 2020/02/18 and comprising 61000 samples, are used as the test set. The remaining data are used for training the model: the last 126 trading days of this portion, from 2019/02/15 to 2019/08/15, are used as the validation set, while the first 499336 samples, covering the period from 2015/01/02 to 2019/02/14, form the training set. Table 2 provides a summary of the datasets used in this research.
Training set  Validation set  Test set  

Starting date  2015/01/02  2019/02/15  2019/08/16 
End date  2019/02/14  2019/08/15  2020/02/18 
Sample size  499336  61075  61000 
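The chronological split summarized in Table 2 can be sketched as below. The helper name and the use of integers as stand-ins for trading days are assumptions for illustration; the point is that the split is made by time, never by random shuffling.

```python
def chronological_split(trading_days, n_valid=126, n_test=126):
    """Split trading days into train/validation/test sets by time."""
    days = sorted(trading_days)
    test = days[-n_test:]                          # most recent block
    valid = days[-(n_test + n_valid):-n_test]      # block just before the test set
    train = days[:-(n_test + n_valid)]             # everything earlier
    return train, valid, test


days = list(range(1000))  # stand-in for 1000 consecutive trading days
train_days, valid_days, test_days = chronological_split(days)
```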
5 Experimental Results and Discussions
5.1 Benchmark Models, Hyperparameters and Optimization Strategy
In the computational experiments below, we compare the performance of seven models. Two models are based on time series analysis only (TSTCN and TSLSTM), two use static features only (random forest breiman2001random and XGBoost chen2016xgboost ), one is the pure Stock2Vec model, and finally two are versions of the proposed hybrid model (LSTMStock2Vec and TCNStock2Vec). This way we can evaluate the effect of different model architectures and data features. Specifically, we are interested in evaluating whether employing feature embedding leads to an improvement (Stock2Vec vs. random forest and XGBoost) and whether a further improvement can be achieved by incorporating time-series data in the hybrid models.
Random forest and XGBoost are ensemble models that deploy enhanced bagging and gradient boosting, respectively. We pick these two models since both have shown strong predictive ability and achieved state-of-the-art performance in various fields. Both are tree-based models that are invariant to scale and perform splits on one-hot encoded categorical inputs, which makes them suitable for comparison with the embeddings in our Stock2Vec models. We built 100 bagging/boosting trees for these two models. The LSTM and TCN models are constructed on pure time series data, i.e., the inputs and outputs are single series, without any other features as augmented series. In what follows, we call these two models TSLSTM and TSTCN, respectively. The Stock2Vec model is a fully-connected neural network with embedding layers for all categorical features; it has exactly the same inputs as XGBoost and random forest. As introduced in Section 3.4, our hybrid model combines the Stock2Vec model with an extra TCN module to learn the temporal effects. For comparison purposes, we also evaluated the hybrid model with an LSTM as the temporal module. We call them TCNStock2Vec and LSTMStock2Vec, correspondingly.
Our deep learning models are implemented in PyTorch paszke2017pytorch . In Stock2Vec, the embedding sizes are set to half of the original number of categories, capped at 50 (i.e., the maximum dimension of the embedding output is 50). These are just heuristics, as there is no common standard for choosing the embedding sizes. We concatenate the continuous inputs with the outputs of the embedding layers, followed by two fully-connected layers with sizes of 1024 and 512, respectively. The dropout rates are set to 0.001 and 0.01 for the two hidden layers, correspondingly.
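The embedding-size heuristic stated above can be written as a one-line rule. The sketch below assumes rounding up for odd category counts, which the text does not specify, and the cardinalities are hypothetical values chosen only for illustration.

```python
def embedding_size(n_categories, cap=50):
    """Half the number of categories (rounded up, an assumption), capped at `cap`."""
    return min(cap, (n_categories + 1) // 2)


# Hypothetical cardinalities, for illustration only.
cardinalities = {"symbol": 500, "sector": 12, "day_of_week": 5}
sizes = {name: embedding_size(n) for name, n in cardinalities.items()}
```

In PyTorch, each such size would typically be passed as the `embedding_dim` argument of `torch.nn.Embedding`, with the number of categories as `num_embeddings`.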
For the RNN module, we implement a two-layer stacked LSTM, i.e., in each LSTM cell (which denotes a single time step), there are two LSTM layers sequentially connected, and each layer consists of 50 hidden units. We need an extra fully-connected layer to control the output size of the temporal module, depending on whether it should produce the final prediction as in TSLSTM (with an output size of 1) or a temporal feature map as in LSTMStock2Vec. We set the size of the temporal feature map to 30 in order to compress the information for both LSTMStock2Vec and TCNStock2Vec. In TCN, we use another convolutional layer to achieve the same effect. To implement the TCN module, we build a 16-layer dilated causal CNN as the component that focuses on capturing the autoregressive temporal relations from the series’ own history. Each layer contains 16 filters, and each filter has a width of 2. Every two consecutive convolutional layers form a residual block, after which the previous inputs are added to the flow. The dilation rate increases exponentially along the stacked residual blocks, i.e., to be , which allows our TCN component to capture the autoregressive relation for more than half a year (there are 252 trading days in a year). Again, dropout (with probability ), a batch normalization layer, and ReLU activation are used for each TCN block.
The MSE loss is used for all models. The deep learning models were trained using stochastic gradient descent (SGD) with a batch size of 128. In particular, the Adam optimizer kingma2014adam with an initial learning rate of was used to train TSTCN and TSLSTM. To train Stock2Vec, we deployed the super-convergence scheme as in smith2019super and used a cyclical learning rate over every 3 epochs, with a maximum value of . In the two hybrid models, while the weights of the head layers were randomly initialized as usual, we loaded the weights from the pre-trained Stock2Vec and TSTCN/TSLSTM for the corresponding modules. By doing this, we applied a transfer learning scheme pan2009survey ; bengio2012deep ; long2017deep , so that the transferred modules can process features effectively from the beginning. The head layers were trained for 2 cycles (each containing 2 epochs) with a maximum learning rate of while the transferred modules were frozen. Afterwards, the entire network was fine-tuned for 10 epochs by the standard Adam optimizer with a learning rate of , during which an early stopping paradigm yao2007early was applied to retrieve the model with the smallest validation error. We select the hyperparameters based on the model performance on the validation set.
5.2 Performance Evaluation Metrics
To evaluate the performance of our forecasting models, four commonly used evaluation criteria are used in this study: (a) the root mean square error (RMSE), (b) the mean absolute error (MAE), (c) the mean absolute percentage error (MAPE), and (d) the root mean square percentage error (RMSPE):
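$\mathrm{RMSE} = \sqrt{\frac{1}{H}\sum_{t=1}^{H}\left(y_t - \hat{y}_t\right)^2}$  (1)
$\mathrm{MAE} = \frac{1}{H}\sum_{t=1}^{H}\left|y_t - \hat{y}_t\right|$  (2)
$\mathrm{MAPE} = \frac{1}{H}\sum_{t=1}^{H}\left|\frac{y_t - \hat{y}_t}{y_t}\right|$  (3)
$\mathrm{RMSPE} = \sqrt{\frac{1}{H}\sum_{t=1}^{H}\left(\frac{y_t - \hat{y}_t}{y_t}\right)^2}$  (4)
where $y_t$ is the actual target value for the $t$-th observation, $\hat{y}_t$ is the predicted value for the corresponding target, and $H$ is the forecast horizon.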
The RMSE is the most popular measure of the error of regression models; as the number of observations grows, it converges to the standard deviation of the theoretical prediction error. However, the quadratic error may not be an appropriate evaluation criterion for all prediction problems, especially in the presence of large outliers, to which the RMSE is sensitive. In addition, the RMSE is scale-dependent. The MAE uses the absolute deviation as the loss and is a more “robust” measure for prediction, since, compared with the squared error, the absolute error is more sensitive to small deviations and much less sensitive to large ones. However, since the training process of many learning models is based on a squared loss function, the MAE could be (logically) inconsistent with the model optimization criterion. The MAE is also scale-dependent, and thus not suitable for comparing prediction accuracy across different variables or time ranges. In order to achieve scale independence, the MAPE measures the error in proportion to the target value, while the RMSPE, which uses squared rather than absolute values, can be seen as the root mean squared version of the MAPE. The MAPE and RMSPE, however, are extremely unstable when the actual value is small (consider the case when the denominator is equal or close to 0). We consider all four measures in order to obtain a more complete view of the performance of the models, given the limitations of each individual measure. In addition, we compare the running time as an additional evaluation criterion.
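The four measures can be sketched in a few lines of Python. This is an illustration based on the standard definitions of these metrics (mean of squared or absolute errors, optionally divided by the actual value), not the authors' evaluation code; MAPE and RMSPE are returned as fractions and would be multiplied by 100 to obtain percentages.

```python
import math


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, as a fraction; undefined if any actual is 0."""
    return sum(abs((y - p) / y) for y, p in zip(actual, predicted)) / len(actual)


def rmspe(actual, predicted):
    """Root mean square percentage error, as a fraction; undefined if any actual is 0."""
    return math.sqrt(sum(((y - p) / y) ** 2 for y, p in zip(actual, predicted)) / len(actual))


actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
scores = {
    "RMSE": rmse(actual, predicted),
    "MAE": mae(actual, predicted),
    "MAPE": mape(actual, predicted),
    "RMSPE": rmspe(actual, predicted),
}
```

In this toy example the same absolute error of 10 contributes 10% to the percentage-based measures for the first observation but only 5% for the second, which illustrates why the percentage measures are scale-independent but unstable for small actual values.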
5.3 Stock2Vec: Analysis of Embeddings
As introduced in Section 3, the main goal of training the Stock2Vec model is to learn the intrinsic relationships among stocks, such that similar stocks are close to each other in the embedding space, so that we can exploit the interactions in cross-sectional data, or more specifically the market information, to make better predictions. To show that this is the case, we extract the weights of the embedding layers from the trained Stock2Vec model, project them onto a two-dimensional space using PCA, and visualize the entities to see what the embedding spaces look like. Note that besides Stock2Vec, we also learned embeddings for the other categorical features.
Figure 7(a) shows the first two principal components of the sectors. Note that here the first two components account for close to 75% of the variance. We can generally observe that Health Care, Technology/Consumer Services, and Finance occupy opposite corners of the plot, i.e., they represent unique sectors most dissimilar from one another. On the other hand, a collection of more traditional sectors (Public Utilities, Energy, Consumer Durables and Non-Durables, Basic Industries) are generally grouped closer together. The plot thus allows for a natural interpretation that is in accordance with our intuition, indicating that the learned embedding can be expected to be reasonable.
Similarly, from the trained Stock2Vec embeddings, we can obtain a 50-dimensional vector for each separate stock. We similarly visualize the learned Stock2Vec with PCA in Figure 8(a), and color each stock by the sector it belongs to. It is important to note that in this case, the first two components of PCA account for less than 40% of the variance. In other words, the plotted groupings do not represent the learned information as well as in the previous case. Indeed, when viewed all together, individual assets do not exhibit readily discernible patterns. This is not necessarily an indicator of a deficiency of the learned embedding; instead, it suggests that two dimensions are not sufficient in this case.
However, many useful insights can still be gained from the distributed representations; the similarities between stocks in the learned vector space are one example of these benefits, as we show below.
To reveal some additional insights from the similarity distance, we sort the pairwise cosine distances (in the embedded space) between the stocks in ascending order. In Figure 8(a), we plot the ticker “NVDA” (Nvidia) as well as its six nearest neighbors in the embedding space. The six companies that are closest to Nvidia, according to the embeddings of learned weights, are either of the same type (technology companies) as Nvidia: Facebook, Akamai, Cognizant Tech Solutions, Charte; or fast growing during the past ten years (as was the case for Nvidia during the tested period): Monster, Discover Bank. Similarly, we plot the ticker of Wells Fargo (“WFC”) and its six nearest neighbors in Figure 8(b), all of which are either banks or companies that provide other financial services.
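The nearest-neighbor lookup described above amounts to ranking all other stocks by cosine distance to a query stock. The following is a minimal sketch with toy three-dimensional vectors and hypothetical ticker names standing in for the learned 50-dimensional embeddings.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest_neighbors(query, embeddings, k):
    """Names of the k entries closest to `query`, sorted by ascending distance."""
    others = [
        (cosine_distance(embeddings[query], vec), name)
        for name, vec in embeddings.items()
        if name != query
    ]
    return [name for _, name in sorted(others)[:k]]


# Toy vectors with hypothetical tickers, for illustration only.
toy_embeddings = {
    "AAA": [1.0, 0.0, 0.0],
    "BBB": [0.9, 0.1, 0.0],
    "CCC": [0.0, 1.0, 0.0],
    "DDD": [-1.0, 0.0, 0.0],
}
neighbors = nearest_neighbors("AAA", toy_embeddings, k=2)
```

Note that the ranking is computed in the full embedding space, so it need not agree with the distances visible in a two-dimensional projection.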
These observations are another indicator that Stock2Vec can be expected to learn some useful information, and indeed is capable of coupling together insights from a number of unrelated sources, in this case, an asset’s sector and its performance.
The following points must be noted here. First, most of the nearest neighbors are not the closest points in the two-dimensional plots, due to the imprecision of mapping into two dimensions. Second, although the nearest neighbors are meaningful for many companies, in that they either are in the same sector (or industry) or exhibit a similar stock price trend over the last few years, this insight does not hold true for all companies, or the interpretation can be hard to discern. For example, the nearest neighbors of Amazon.com (AMZN) include transportation and energy companies (perhaps due to its heavy reliance on these industries for its operation) as well as technology companies. Finally, note that there exist many other visualization techniques for projecting high-dimensional vectors onto 2D spaces that could be used here instead of PCA, for example, t-SNE maaten2008visualizing or UMAP mcinnes2018umap . However, neither provided a visual improvement of the grouping effect over Figure 8(a), and hence we do not present those results here.
Based on the above observations, Stock2Vec provides several benefits: 1) it reduces the dimensionality of the categorical feature space, so that computational performance improves with a smaller number of parameters; 2) it maps the sparse high-dimensional one-hot encoded vectors onto a dense distributional vector space (with lower dimensionality), so that similar categories are placed closer to one another in the embedding space, unlike in the one-hot encoding vector space, where every pair of categories yields the same distance and is orthogonal. Therefore, the outputs of the embedding layers can serve as more meaningful features for the later layers of the neural network, enabling more effective learning. Moreover, the meaningful embeddings can be used for visualization, which provides more interpretability of the deep learning models.
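The claim that one-hot vectors are mutually orthogonal and equidistant is easy to check numerically. The sketch below builds one-hot vectors for a small, arbitrarily chosen number of categories and collects all pairwise Euclidean distances and dot products.

```python
import math
from itertools import combinations


def one_hot(index, size):
    """One-hot vector of length `size` with a 1 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


n_categories = 5  # arbitrary, for illustration
vectors = [one_hot(i, n_categories) for i in range(n_categories)]

# Sets collapse repeated values, so a single element means "all pairs agree".
distances = {round(euclidean(u, v), 12) for u, v in combinations(vectors, 2)}
dot_products = {dot(u, v) for u, v in combinations(vectors, 2)}
```

Every pair is at distance $\sqrt{2}$ with a zero dot product, so a one-hot encoding carries no notion of similarity between categories; a learned embedding is free to place related categories closer together.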
5.4 Prediction Results
Table 3 and Figure 10 report the overall average (over the individual assets) forecasting performance for the out-of-sample period from 2019-08-16 to 2020-02-14. We observe that TSLSTM and TSTCN perform worst. We can conclude that this is because these two models only consider the target series and ignore all other features. TCN outperforms LSTM, probably because it is capable of extracting temporal patterns over a long history while more effectively avoiding the vanishing gradient problem. Moreover, the training speed of our 18-layer TCN is about five times faster than that of LSTM per iteration (i.e., per batch) on a GPU, and the overall training speed (with all overhead included) is also around two to three times faster. By learning from all the features, the random forest and XGBoost models perform better than the purely time-series-based TSLSTM and TSTCN, with the XGBoost predictions being slightly better than those from random forest. This demonstrates the usefulness of our data sources, especially the external information combined into the inputs. We can then observe that, despite having the same inputs as random forest and XGBoost, our proposed Stock2Vec model further improves the accuracy of the predictions, as the RMSE, MAE, MAPE, and RMSPE decrease by about 36%, 38%, 41%, and 43% relative to the XGBoost predictions, respectively. This indicates that the use of deep learning models, in particular the Stock2Vec embedding, improves the predictions by learning from the features more effectively than the tree-based ensemble models. With the integration of temporal modules, there is again a significant improvement in prediction accuracy. The two hybrid models, LSTMStock2Vec and TCNStock2Vec, not only learn from the features we provide explicitly, but also employ either a hidden state or a convolutional temporal feature map to implicitly learn relevant information from historical data. Our TCNStock2Vec achieves the best performance across all models, as the RMSE and MAE decrease by about 25%, the MAPE decreases by 20%, and the RMSPE decreases by 14% compared with Stock2Vec without the temporal module.
RMSE  MAE  MAPE(%)  RMSPE (%)  

TSLSTM  6.35  2.36  1.62  2.07 
TSTCN  5.79  2.15  1.50  1.96 
Random Forest  4.86  1.67  1.31  1.92 
XGBoost  4.57  1.66  1.28  1.83 
Stock2Vec  2.94  1.04  0.76  1.05 
LSTMStock2Vec  2.57  0.85  0.68  1.04 
TCNStock2Vec  2.22  0.78  0.61  0.90 
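The relative improvements quoted in the text can be recomputed from the rounded entries of Table 3. The sketch below is a simple arithmetic check; the dictionary layout and function name are illustrative assumptions.

```python
# Rounded values copied from Table 3.
table3 = {
    "XGBoost": {"RMSE": 4.57, "MAE": 1.66, "MAPE": 1.28, "RMSPE": 1.83},
    "Stock2Vec": {"RMSE": 2.94, "MAE": 1.04, "MAPE": 0.76, "RMSPE": 1.05},
    "TCNStock2Vec": {"RMSE": 2.22, "MAE": 0.78, "MAPE": 0.61, "RMSPE": 0.90},
}


def relative_reduction(baseline, model):
    """Percentage decrease of each error metric relative to the baseline."""
    return {
        m: round(100.0 * (baseline[m] - model[m]) / baseline[m], 1)
        for m in baseline
    }


stock2vec_vs_xgboost = relative_reduction(table3["XGBoost"], table3["Stock2Vec"])
tcn_vs_stock2vec = relative_reduction(table3["Stock2Vec"], table3["TCNStock2Vec"])
```

The reductions obtained this way are close to the figures quoted in the text; small differences are to be expected because the table entries are themselves rounded to two decimals.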
Figure 10 shows the boxplots of the prediction errors of the different approaches, from which we can see that our proposed models achieve smaller absolute prediction errors in terms of not only the mean but also the variance, which indicates more robust forecasts. The median absolute prediction error (and the interquartile range, IQR) of our TSTCN model is around 1.01 (1.86), while they are around 0.74 (1.39), 0.45 (0.87), and 0.36 (0.66) for XGBoost, Stock2Vec, and TCNStock2Vec, respectively.
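For reference, the median and interquartile range of a set of absolute errors can be obtained with the Python standard library as sketched below. The error values are made up for illustration, and the "inclusive" quantile method is an assumption; the paper does not state which quantile convention was used for the boxplots.

```python
import statistics


def median_and_iqr(abs_errors):
    """Return (median, IQR) of absolute errors using inclusive quartiles."""
    q1, q2, q3 = statistics.quantiles(abs_errors, n=4, method="inclusive")
    return q2, q3 - q1


# Made-up absolute errors with one large outlier.
abs_errors = [0.1, 0.2, 0.4, 0.5, 0.9, 1.3, 3.0]
median, iqr = median_and_iqr(abs_errors)
```

Unlike the mean, neither statistic is moved much by the single large error, which is why boxplots are a useful complement to the averaged metrics.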
Similarly, we aggregate the metrics at the sector level and calculate the average performance within each sector. We report the RMSE, MAE, MAPE, and RMSPE in Tables 4, 5, 6, and 7, respectively, from which we can again see that our Stock2Vec performs better than the two tree-ensemble models for all sectors, and that adding the temporal module further improves the forecasting accuracy. TCNStock2Vec achieves the best RMSE, MAE, MAPE, and RMSPE in all sectors with one exception. The better performance at different aggregation levels demonstrates the power of our proposed models.
We further showcase the predicted results for 20 symbols to gauge the forecasting performance of our model under a wide range of industries, volatilities, growth patterns, and other general conditions. The stocks have been chosen to evaluate how the proposed methodologies perform under different circumstances. For instance, Amazon’s (AMZN) stock was consistently increasing in price across the analysis period, while the stock price of Verizon (VZ) was very stable, and Chevron’s stock (CVX) had periods of both growth and decline. In addition, these 20 stocks cover several industries: (a) retail (e.g., Walmart), (b) restaurants (e.g., McDonald’s), (c) finance and banks (e.g., JPMorgan Chase and Goldman Sachs), (d) energy and oil & gas (e.g., Chevron), (e) technology (e.g., Facebook), (f) communications (e.g., Verizon), etc. Tables 8, 9, 10, and 11 show the out-of-sample RMSE, MAE, MAPE, and RMSPE, respectively, of the predictions given by the five models we discussed above. Again, Stock2Vec generally performs better than random forest and XGBoost, and the two hybrid models have quite similar performance, which is significantly better than that of the others. While there are a few stocks for which LSTMStock2Vec or even Stock2Vec without the temporal module produces the most accurate predictions, for most of the stocks the TCNStock2Vec model performs best. This demonstrates that our models generalize well to most symbols.
Furthermore, we plot the prediction patterns of the competing models for the above-mentioned stocks on the test set in Appendix C, compared with the actual daily prices. We observe that the random forest and XGBoost models predict ups and downs with a lag most of the time, as the current price plays too large a role as a predictor, probably mainly due to scaling. There also occasionally exist flat predictions over a period for some stocks (see approximately 2019/09 in Figure 15, 2020/01 in Figure 18, and 2019/12 in Figure 30), which is a typical effect of tree-based methods and indicates insufficient splitting and underfitting even though many ensemble trees were used. With entity embeddings, our Stock2Vec model can learn from the features much more effectively, and its predictions coincide with the actual ups and downs much more accurately. Although it overestimates the volatility by exaggerating the amplitude as well as the frequency of oscillations, its overall prediction errors are smaller than those of the two tree-ensemble models. Our LSTMStock2Vec and TCNStock2Vec models further benefit from the temporal learning modules by automatically capturing the historical characteristics of the time series data, especially the nonlinear trend and complex seasonality that are difficult to capture with hand-engineered features such as technical indicators, as well as the common temporal factors that are shared among all series across the whole market. As a result, with the ability to extract long-term autoregressive dependencies both within and across series from historical data, the predictions from these two models alleviate wild oscillations and are much closer to the actual prices, while still correctly predicting the ups and downs most of the time through effective learning from the input features.
6 Concluding Remarks and Future Work
Our argument that Alphas and Betas are implicitly learned from cross-sectional data, viewed from the CAPM perspective, is novel; however, it is more of an insight than a systematic analysis. In this paper, we built a global hybrid deep learning model to forecast the S&P stock prices. We applied state-of-the-art 1D dilated causal convolutional layers (TCN) to extract temporal features from the historical information, which helps us refine the learning of the Alphas. In order to integrate the Beta information into the model, we learn a single model from the data over the whole market and apply entity embeddings to the categorical features; in particular, we obtain Stock2Vec, which reveals the relationships among stocks in the market. From that point of view, our model can be seen as a supervised dimension reduction method. The experimental results show that our models improve the forecasting performance. Although not demonstrated in this work, learning a global model from the data over the entire market offers the additional benefit of handling the cold-start problem, in which some series contain very little data (i.e., many missing values): our model can infer the historical information from the structure learned from other series as well as from the correlation between the cold-start series and the market. The result might not be accurate, but it is much more informative than what can be learned from the little data in the single series.
There are several other directions that we can explore more deeply in future work. First of all, stock prices are heavily affected by external information; combining extensive crowd-sourcing, social media, and financial news data may facilitate a better understanding of collective human behavior in the market, which could help investors make effective decisions. These data can be obtained from the internet, so we could expand the data sources and incorporate their influence into the model as extra features. In addition, although we have shown that convolutional layers have several advantages over the most widely used recurrent neural network layers for time series, the temporal learning layers in our model could be replaced by any other type; for instance, recent advances in attention models make them a good candidate. Also, more sophisticated models can be adopted to build Stock2Vec, keeping in mind that we aim at learning the implicit intrinsic relationships between stock series. Furthermore, learning the relationships over the market would help us build portfolios aimed at maximizing the investment gain, e.g., by using standard Markowitz portfolio optimization to find the positions. In that case, a simulation of trading in the market should provide a more realistic and robust performance evaluation than the aggregated levels we reported above. Liquidity and market impact can be taken into account in the simulation, and we can use Profit & Loss (P&L) and the Sharpe ratio as the evaluation metrics.
References
 (1) E. F. Fama, The behavior of stock-market prices, The Journal of Business 38 (1) (1965) 34–105.
 (2) W. F. Sharpe, Capital asset prices: A theory of market equilibrium under conditions of risk, The Journal of Finance 19 (3) (1964) 425–442.
 (3) J. Lintner, The valuation of risk assets and the selection of risky investments in stock portfolios and capital budgets, in: Stochastic optimization models in finance, Elsevier, 1975, pp. 131–155.
 (4) M. C. Jensen, F. Black, M. S. Scholes, The capital asset pricing model: Some empirical tests (1972).
 (5) G. E. Box, G. M. Jenkins, Some recent advances in forecasting and control, Journal of the Royal Statistical Society. Series C (Applied Statistics) 17 (2) (1968) 91–109.
 (6) R. Hyndman, A. B. Koehler, J. K. Ord, R. D. Snyder, Forecasting with exponential smoothing: the state space approach, Springer Science & Business Media, 2008.
 (7) K. Alkhatib, H. Najadat, I. Hmeidi, M. K. A. Shatnawi, Stock price prediction using k-nearest neighbor (knn) algorithm, International Journal of Business, Humanities and Technology 3 (3) (2013) 32–44.
 (8) Y. Chen, Y. Hao, A feature weighted support vector machine and k-nearest neighbor algorithm for stock market indices prediction, Expert Systems with Applications 80 (2017) 340–355.
 (9) M. R. Hassan, B. Nath, M. Kirley, A fusion model of hmm, ann and ga for stock market forecasting, Expert Systems with Applications 33 (1) (2007) 171–180.
 (10) M. R. Hassan, K. Ramamohanarao, J. Kamruzzaman, M. Rahman, M. M. Hossain, A hmm-based adaptive fuzzy inference system for stock market forecasting, Neurocomputing 104 (2013) 10–25.
 (11) H. Yang, L. Chan, I. King, Support vector machine regression for volatile stock market prediction, in: International Conference on Intelligent Data Engineering and Automated Learning, Springer, 2002, pp. 391–396.
 (12) C.-L. Huang, C.-Y. Tsai, A hybrid sofm-svr with a filter-based feature selection for stock market forecasting, Expert Systems with Applications 36 (2) (2009) 1529–1539.
 (13) J.-Z. Wang, J.-J. Wang, Z.-G. Zhang, S.-P. Guo, Forecasting stock indices with back propagation neural network, Expert Systems with Applications 38 (11) (2011) 14346–14355.
 (14) E. Guresen, G. Kayakutlu, T. U. Daim, Using artificial neural network models in stock market index prediction, Expert Systems with Applications 38 (8) (2011) 10389–10397.
 (15) W. Kristjanpoller, A. Fadic, M. C. Minutolo, Volatility forecast using hybrid neural network models, Expert Systems with Applications 41 (5) (2014) 2437–2442.
 (16) L. Wang, Y. Zeng, T. Chen, Back propagation neural network with adaptive differential evolution algorithm for time series forecasting, Expert Systems with Applications 42 (2) (2015) 855–863.
 (17) M. Göçken, M. Özçalıcı, A. Boru, A. T. Dosdoğru, Integrating metaheuristics and artificial neural networks for improved stock price prediction, Expert Systems with Applications 44 (2016) 320–331.
 (18) J. Patel, S. Shah, P. Thakkar, K. Kotecha, Predicting stock and stock price index movement using trend deterministic data preparation and machine learning techniques, Expert Systems with Applications 42 (1) (2015) 259–268.
 (19) A. Booth, E. Gerding, F. Mcgroarty, Automated trading with performance weighted random forests and seasonality, Expert Systems with Applications 41 (8) (2014) 3651–3661.
 (20) S. Barak, M. Modarres, Developing an approach to evaluate stocks by forecasting effective features with data mining methods, Expert Systems with Applications 42 (3) (2015) 1325–1339.
 (21) B. Weng, W. Martinez, Y.-T. Tsai, C. Li, L. Lu, J. R. Barth, F. M. Megahed, Macroeconomic indicators alone can predict the monthly closing price of major us indices: Insights from artificial intelligence, time-series analysis and hybrid models, Applied Soft Computing 71 (2018) 685–697.
 (22) J. J. Murphy, Technical analysis of the financial markets: A comprehensive guide to trading methods and applications, Penguin, 1999.
 (23) E. F. Fama, K. R. French, Common risk factors in the returns on stocks and bonds, Journal of (1993).
 (24) P. C. Tetlock, M. Saar-Tsechansky, S. Macskassy, More than words: Quantifying language to measure firms’ fundamentals, The Journal of Finance 63 (3) (2008) 1437–1467.
 (25) Q. Li, Y. Chen, L. L. Jiang, P. Li, H. Chen, A tensor-based information framework for predicting the stock market, ACM Transactions on Information Systems (TOIS) 34 (2) (2016) 1–30.
 (26) B. Weng, L. Lu, X. Wang, F. M. Megahed, W. Martinez, Predicting short-term stock prices using ensemble methods and online data sources, Expert Systems with Applications 112 (2018) 258–273.
 (27) J. Bollen, H. Mao, X. Zeng, Twitter mood predicts the stock market, Journal of Computational Science 2 (1) (2011) 1–8.
 (28) N. Oliveira, P. Cortez, N. Areal, The impact of microblogging data for stock market prediction: Using twitter to predict returns, volatility, trading volume and survey sentiment indices, Expert Systems with Applications 73 (2017) 125–144.
 (29) Q. Wang, W. Xu, H. Zheng, Combining the wisdom of crowds and technical analysis for financial market prediction using deep random subspace ensembles, Neurocomputing 299 (2018) 51–61.
 (30) Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.
 (31) A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in neural information processing systems, 2012, pp. 1097–1105.
 (32) K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
 (33) J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
 (34) T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013).
 (35) J. Pennington, R. Socher, C. D. Manning, Glove: Global vectors for word representation, in: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 1532–1543 (2014).
 (36) J. Devlin, M.W. Chang, K. Lee, K. Toutanova, Bert: Pretraining of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
 (37) A. M. Rather, A. Agarwal, V. Sastry, Recurrent neural network and a hybrid model for prediction of stock returns, Expert Systems with Applications 42 (6) (2015) 3234–3241 (2015).
 (38) T. Fischer, C. Krauss, Deep learning with long shortterm memory networks for financial market predictions, European Journal of Operational Research 270 (2) (2018) 654–669 (2018).
 (39) O. B. Sezer, A. M. Ozbayoglu, Algorithmic financial trading with deep convolutional neural networks: Time series to image conversion approach, Applied Soft Computing 70 (2018) 525–538 (2018).
 (40) G. Hu, Y. Hu, K. Yang, Z. Yu, F. Sung, Z. Zhang, F. Xie, J. Liu, N. Robertson, T. Hospedales, et al., Deep stock representation learning: From candlestick charts to investment decisions, in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018, pp. 2706–2710 (2018).
 (41) O. B. Sezer, A. M. Ozbayoglu, Financial trading model with stock bar chart image time series with deep convolutional neural networks, arXiv preprint arXiv:1903.04610 (2019).
 (42) E. Hoseinzade, S. Haratizadeh, Cnnpred: Cnnbased stock market prediction using a diverse set of variables, Expert Systems with Applications 129 (2019) 273–285 (2019).
 (43) W. Long, Z. Lu, L. Cui, Deep learningbased feature engineering for stock price movement prediction, KnowledgeBased Systems 164 (2019) 163–173 (2019).
 (44) Z. Jiang, D. Xu, J. Liang, A deep reinforcement learning framework for the financial portfolio management problem, arXiv preprint arXiv:1706.10059 (2017).
 (45) L. Di Persio, O. Honchar, Artificial neural networks architectures for stock price prediction: Comparisons and applications, International journal of circuits, systems and signal processing 10 (2016) (2016) 403–413 (2016).
 (46) I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learning with neural networks, in: Advances in neural information processing systems, 2014, pp. 3104–3112 (2014).
 (47) K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using rnn encoderdecoder for statistical machine translation, arXiv preprint arXiv:1406.1078 (2014).
 (48) H. Sak, A. W. Senior, F. Beaufays, Long shortterm memory recurrent neural network architectures for large scale acoustic modeling (2014).
 (49) Y. Bengio, R. Ducharme, P. Vincent, C. Jauvin, A neural probabilistic language model, Journal of machine learning research 3 (Feb) (2003) 1137–1155 (2003).
 (50) D. Salinas, V. Flunkert, J. Gasthaus, T. Januschowski, DeepAR: Probabilistic forecasting with autoregressive recurrent networks, International Journal of Forecasting (2019).
 (51) S. S. Rangapuram, M. W. Seeger, J. Gasthaus, L. Stella, Y. Wang, T. Januschowski, Deep state space models for time series forecasting, in: Advances in neural information processing systems, 2018, pp. 7785–7794.
 (52) P. J. Werbos, Backpropagation through time: what it does and how to do it, Proceedings of the IEEE 78 (10) (1990) 1550–1560.
 (53) R. Pascanu, T. Mikolov, Y. Bengio, On the difficulty of training recurrent neural networks, in: International conference on machine learning, 2013, pp. 1310–1318.
 (54) S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (8) (1997) 1735–1780.
 (55) J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, arXiv preprint arXiv:1412.3555 (2014).
 (56) Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel, Backpropagation applied to handwritten zip code recognition, Neural computation 1 (4) (1989) 541–551.
 (57) K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
 (58) R. K. Srivastava, K. Greff, J. Schmidhuber, Training very deep networks, in: Advances in neural information processing systems, 2015, pp. 2377–2385.
 (59) G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.
 (60) A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, K. J. Lang, Phoneme recognition using time-delay neural networks, IEEE transactions on acoustics, speech, and signal processing 37 (3) (1989) 328–339.
 (61) A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, K. Kavukcuoglu, WaveNet: A generative model for raw audio, arXiv preprint arXiv:1609.03499 (2016).
 (62) Y. N. Dauphin, A. Fan, M. Auli, D. Grangier, Language modeling with gated convolutional networks, in: International conference on machine learning, 2017, pp. 933–941.
 (63) J. Gehring, M. Auli, D. Grangier, D. Yarats, Y. N. Dauphin, Convolutional sequence to sequence learning, arXiv preprint arXiv:1705.03122 (2017).
 (64) C. Lea, M. D. Flynn, R. Vidal, A. Reiter, G. D. Hager, Temporal convolutional networks for action segmentation and detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 156–165.
 (65) M. Binkowski, G. Marti, P. Donnat, Autoregressive convolutional neural networks for asynchronous time series, in: International Conference on Machine Learning, 2018, pp. 580–589.
 (66) Y. Chen, Y. Kang, Y. Chen, Z. Wang, Probabilistic forecasting with temporal convolutional neural network, Neurocomputing (2020).
 (67) S. Bai, J. Z. Kolter, V. Koltun, An empirical evaluation of generic convolutional and recurrent networks for sequence modeling, arXiv preprint arXiv:1803.01271 (2018).
 (68) Y. Bengio, S. Bengio, Modeling high-dimensional discrete data with multi-layer neural networks, in: Advances in Neural Information Processing Systems, 2000, pp. 400–406.
 (69) A. Paccanaro, G. E. Hinton, Learning distributed representations of concepts using linear relational embedding, IEEE Transactions on Knowledge and Data Engineering 13 (2) (2001) 232–244.
 (70) G. E. Hinton, et al., Learning distributed representations of concepts, in: Proceedings of the eighth annual conference of the cognitive science society, Vol. 1, Amherst, MA, 1986, p. 12.
 (71) O. Barkan, N. Koenigstein, Item2vec: neural item embedding for collaborative filtering, in: 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), IEEE, 2016, pp. 1–6.
 (72) E. Choi, M. T. Bahadori, E. Searles, C. Coffey, M. Thompson, J. Bost, J. Tejedor-Sojo, J. Sun, Multi-layer representation learning for medical concepts, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1495–1504.
 (73) C. Guo, F. Berkhahn, Entity embeddings of categorical variables, arXiv preprint arXiv:1604.06737 (2016).
 (74) D. L. Minh, A. Sadeghi-Niaraki, H. D. Huy, K. Min, H. Moon, Deep learning approach for short-term stock trends prediction based on two-stream gated recurrent unit network, IEEE Access 6 (2018) 55392–55404.
 (75) Q. Wu, Z. Zhang, A. Pizzoferrato, M. Cucuringu, Z. Liu, A deep learning framework for pricing financial instruments, arXiv preprint arXiv:1909.04497 (2019).
 (76) I. Goodfellow, Y. Bengio, A. Courville, Deep learning, MIT press, 2016.
 (77) N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, The journal of machine learning research 15 (1) (2014) 1929–1958.
 (78) Y. Gal, Z. Ghahramani, Dropout as a Bayesian approximation: Representing model uncertainty in deep learning, in: International conference on machine learning, 2016, pp. 1050–1059.
 (79) S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, arXiv preprint arXiv:1502.03167 (2015).
 (80) V. Nair, G. E. Hinton, Rectified linear units improve restricted Boltzmann machines, in: ICML, 2010.
 (81) X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the thirteenth international conference on artificial intelligence and statistics, 2010, pp. 249–256.
 (82) P. Covington, J. Adams, E. Sargin, Deep neural networks for YouTube recommendations, in: Proceedings of the 10th ACM conference on recommender systems, 2016, pp. 191–198.
 (83) L. Breiman, Random forests, Machine learning 45 (1) (2001) 5–32.
 (84) T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785–794.
 (85) A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, A. Lerer, Automatic differentiation in PyTorch (2017).
 (86) D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
 (87) L. N. Smith, N. Topin, Super-convergence: Very fast training of neural networks using large learning rates, in: Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, Vol. 11006, International Society for Optics and Photonics, 2019, p. 1100612.
 (88) S. J. Pan, Q. Yang, A survey on transfer learning, IEEE Transactions on knowledge and data engineering 22 (10) (2009) 1345–1359.
 (89) Y. Bengio, Deep learning of representations for unsupervised and transfer learning, in: Proceedings of ICML workshop on unsupervised and transfer learning, 2012, pp. 17–36.
 (90) M. Long, H. Zhu, J. Wang, M. I. Jordan, Deep transfer learning with joint adaptation networks, in: International conference on machine learning, 2017, pp. 2208–2217.
 (91) Y. Yao, L. Rosasco, A. Caponnetto, On early stopping in gradient descent learning, Constructive Approximation 26 (2) (2007) 289–315.
 (92) L. v. d. Maaten, G. Hinton, Visualizing data using t-SNE, Journal of machine learning research 9 (Nov) (2008) 2579–2605.
 (93) L. McInnes, J. Healy, J. Melville, UMAP: Uniform manifold approximation and projection for dimension reduction, arXiv preprint arXiv:1802.03426 (2018).
Appendix A Sector Level Performance Comparison
Sector  Random Forest  XGBoost  Stock2Vec  LSTM-Stock2Vec  TCN-Stock2Vec
Basic Industries  1.70  1.61  1.06  0.85  0.76 
Capital Goods  11.46  10.30  6.25  6.01  5.10 
Consumer Durables  1.78  1.67  0.99  0.93  0.83 
Consumer Non-Durables  1.57  1.55  0.98  0.87  0.75
Consumer Services  4.75  4.69  3.34  2.76  2.30 
Energy  1.50  1.44  0.76  0.76  0.67 
Finance  2.08  2.06  1.39  1.05  1.00 
HealthCare  3.44  3.37  1.95  1.98  1.60 
Miscellaneous  8.23  7.96  5.22  4.14  3.73 
Public Utilities  0.94  0.95  0.64  0.52  0.52 
Technology  4.20  4.23  2.91  1.90  1.94 
Transportation  2.00  1.90  1.15  1.03  0.88 
Sector  Random Forest  XGBoost  Stock2Vec  LSTM-Stock2Vec  TCN-Stock2Vec
Basic Industries  1.06  1.03  0.64  0.52  0.49 
Capital Goods  3.13  3.07  1.93  1.57  1.47 
Consumer Durables  1.21  1.18  0.71  0.63  0.57 
Consumer Non-Durables  0.96  0.93  0.57  0.52  0.45
Consumer Services  1.83  1.84  1.19  0.98  0.88 
Energy  0.98  0.95  0.50  0.51  0.45 
Finance  1.19  1.17  0.79  0.55  0.54 
HealthCare  1.99  1.96  1.15  1.10  0.92 
Miscellaneous  3.18  3.18  2.08  1.56  1.44 
Public Utilities  0.63  0.64  0.44  0.33  0.34 
Technology  1.95  1.98  1.26  0.92  0.91 
Transportation  1.26  1.23  0.74  0.65  0.56 
Sector  Random Forest  XGBoost  Stock2Vec  LSTM-Stock2Vec  TCN-Stock2Vec
Basic Industries  1.34  1.31  0.74  0.65  0.61 
Capital Goods  1.21  1.24  0.76  0.59  0.56 
Consumer Durables  1.30  1.26  0.73  0.68  0.60 
Consumer Non-Durables  1.48  1.32  0.76  0.85  0.65
Consumer Services  1.24  1.23  0.71  0.66  0.59 
Energy  2.04  1.88  0.97  1.08  0.92 
Finance  1.18  1.16  0.74  0.53  0.53 
HealthCare  1.43  1.35  0.79  0.79  0.65 
Miscellaneous  1.23  1.23  0.81  0.66  0.60 
Public Utilities  0.88  0.90  0.57  0.49  0.47 
Technology  1.44  1.43  0.83  0.68  0.66 
Transportation  1.26  1.23  0.71  0.66  0.57 
Sector  Random Forest  XGBoost  Stock2Vec  LSTM-Stock2Vec  TCN-Stock2Vec
Basic Industries  1.86  1.80  0.98  0.91  0.83 
Capital Goods  1.63  1.65  1.01  0.83  0.75 
Consumer Durables  1.79  1.68  0.96  0.95  0.81 
Consumer Non-Durables  2.41  2.02  1.13  1.37  1.01
Consumer Services  1.88  1.82  0.99  1.07  0.91 
Energy  2.89  2.66  1.29  1.51  1.25 
Finance  1.60  1.56  1.00  0.78  0.72 
HealthCare  2.17  2.00  1.15  1.24  0.99 
Miscellaneous  1.66  1.63  1.05  0.95  0.81 
Public Utilities  1.25  1.23  0.74  0.71  0.64 
Technology  2.09  2.00  1.13  1.04  0.95 
Transportation  1.76  1.68  0.98  0.95  0.80 
Appendix B Performance comparison of different models for one-day-ahead forecasting on different symbols
Symbol  Random Forest  XGBoost  Stock2Vec  LSTM-Stock2Vec  TCN-Stock2Vec
AAPL (Apple)  4.71  4.52  2.86  2.16  1.81 
AFL (Aflac)  0.59  0.62  0.46  0.31  0.27 
AMZN (Amazon.com)  29.91  28.47  23.80  17.73  14.45 
BA (Boeing)  6.00  6.44  3.98  3.83  3.49 
CVX (Chevron)  1.42  1.62  1.03  0.75  0.65 
DAL (Delta Air Lines)  0.79  0.77  0.48  0.40  0.32 
DIS (Walt Disney)  1.95  1.91  1.17  1.10  0.92 
FB (Facebook)  3.51  5.54  2.15  1.72  1.44 
GE (General Electric)  0.39  0.30  0.14  0.29  0.18 
GM (General Motors)  0.58  0.57  0.30  0.30  0.28 
GS (Goldman Sachs Group)  3.11  3.00  1.86  1.27  1.31 
JNJ (Johnson & Johnson)  1.80  1.49  1.00  0.93  0.80 
JPM (JPMorgan Chase)  1.72  1.63  1.59  0.66  0.68 
MAR (Marriott Int’l)  2.02  2.02  1.52  0.89  1.07 
KO (Coca-Cola)  0.49  0.50  0.32  0.26  0.25
MCD (McDonald’s)  2.67  2.50  1.51  1.26  1.16 
NKE (Nike)  1.27  1.23  1.01  0.61  0.62 
PG (Procter & Gamble)  1.43  1.35  0.91  0.70  0.61 
VZ (Verizon Communications)  0.54  0.55  0.46  0.29  0.26 
WMT (Walmart)  1.34  1.43  1.06  0.55  0.50 
Symbol  Random Forest  XGBoost  Stock2Vec  LSTM-Stock2Vec  TCN-Stock2Vec
AAPL (Apple)  3.63  3.56  2.15  1.72  1.42 
AFL (Aflac)  0.45  0.44  0.35  0.20  0.21 
AMZN (Amazon.com)  22.19  21.36  17.87  11.53  10.29 
BA (Boeing)  4.59  5.10  2.87  2.87  2.74 
CVX (Chevron)  1.07  1.22  0.75  0.57  0.50 
DAL (Delta Air Lines)  0.59  0.58  0.36  0.29  0.24 
DIS (Walt Disney)  1.37  1.40  0.87  0.77  0.67 
FB (Facebook)  2.54  3.80  1.65  1.16  1.06 
GE (General Electric)  0.30  0.22  0.11  0.25  0.15 
GM (General Motors)  0.44  0.44  0.23  0.23  0.22 
GS (Goldman Sachs Group)  2.48  2.37  1.31  1.01  1.05 
JNJ (Johnson & Johnson)  1.21  1.04  0.72  0.64  0.59 
JPM (JPMorgan Chase)  1.34  1.23  1.17  0.51  0.52 
MAR (Marriott Int’l)  1.63  1.66  1.13  0.65  0.87 
KO (Coca-Cola)  0.39  0.37  0.25  0.19  0.19
MCD (McDonald’s)  1.99  1.96  1.26  0.89  0.89 
NKE (Nike)  0.97  0.98  0.77  0.46  0.49 
PG (Procter & Gamble)  1.14  1.03  0.70  0.52  0.48 
VZ (Verizon Communications)  0.43  0.42  0.36  0.22  0.20 
WMT (Walmart)  1.02  1.10  0.87  0.41  0.41 
Symbol  Random Forest  XGBoost  Stock2Vec  LSTM-Stock2Vec  TCN-Stock2Vec
AAPL (Apple)  1.43  1.39  0.80  0.68  0.54 
AFL (Aflac)  0.88  0.86  0.66  0.39  0.39 
AMZN (Amazon.com)  1.21  1.17  0.97  0.63  0.56 
BA (Boeing)  1.33  1.47  0.82  0.83  0.80 
CVX (Chevron)  0.94  1.06  0.65  0.50  0.43 
DAL (Delta Air Lines)  1.03  1.02  0.63  0.51  0.43 
DIS (Walt Disney)  0.99  1.01  0.61  0.55  0.48 
FB (Facebook)  1.29  1.92  0.82  0.59  0.54 
GE (General Electric)  2.99  2.13  1.10  2.53  1.44 
GM (General Motors)  1.22  1.23  0.63  0.63  0.61 
GS (Goldman Sachs Group)  1.14  1.09  0.59  0.46  0.48 
JNJ (Johnson & Johnson)  0.90  0.77  0.51  0.47  0.43 
JPM (JPMorgan Chase)  1.08  1.00  0.90  0.40  0.42 
MAR (Marriott Int’l)  1.21  1.23  0.81  0.48  0.63 
KO (Coca-Cola)  0.72  0.68  0.45  0.35  0.35
MCD (McDonald’s)  0.98  0.96  0.61  0.44  0.44 
NKE (Nike)  1.05  1.06  0.80  0.49  0.53 
PG (Procter & Gamble)  0.94  0.85  0.57  0.43  0.40 
VZ (Verizon Communications)  0.73  0.71  0.60  0.37  0.34 
WMT (Walmart)  0.88  0.94  0.73  0.35  0.35 
Symbol  Random Forest  XGBoost  Stock2Vec  LSTM-Stock2Vec  TCN-Stock2Vec
AAPL (Apple)  1.89  1.76  1.04  0.85  0.68 
AFL (Aflac)  1.15  1.19  0.87  0.60  0.53 
AMZN (Amazon.com)  1.60  1.55  1.28  0.95  0.78 
BA (Boeing)  1.74  1.85  1.13  1.11  1.02 
CVX (Chevron)  1.25  1.42  0.88  0.65  0.57 
DAL (Delta Air Lines)  1.39  1.36  0.83  0.71  0.57 
DIS (Walt Disney)  1.41  1.38  0.81  0.79  0.66 
FB (Facebook)  1.77  2.75  1.06  0.85  0.73 
GE (General Electric)  3.96  2.89  1.35  2.91  1.72 
GM (General Motors)  1.62  1.60  0.82  0.84  0.77 
GS (Goldman Sachs Group)  1.44  1.39  0.84  0.58  0.61 
JNJ (Johnson & Johnson)  1.33  1.11  0.72  0.70  0.60 
JPM (JPMorgan Chase)  1.40  1.33  1.20  0.53  0.54 
MAR (Marriott Int’l)  1.49  1.50  1.07  0.66  0.78 
KO (Coca-Cola)  0.90  0.92  0.57  0.47  0.45
MCD (McDonald’s)  1.30  1.22  0.73  0.62  0.57 
NKE (Nike)  1.38  1.34  1.03  0.65  0.67 
PG (Procter & Gamble)  1.19  1.11  0.73  0.58  0.50 
VZ (Verizon Communications)  0.93  0.93  0.76  0.49  0.45 
WMT (Walmart)  1.15  1.23  0.89  0.47  0.43 