JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. XX, XXX 1
DeepLOB: Deep Convolutional Neural Networks
for Limit Order Books
Zihao Zhang, Stefan Zohren, and Stephen Roberts
Abstract: We develop a large-scale deep learning model to
predict price movements from limit order book (LOB) data
of cash equities. The architecture utilises convolutional filters
to capture the spatial structure of the limit order books as
well as LSTM modules to capture longer time dependencies.
The proposed network outperforms all existing state-of-the-art
algorithms on the benchmark LOB dataset [1]. In a more
realistic setting, we test our model by using one year of market
quotes from the London Stock Exchange and the model delivers
a remarkably stable out-of-sample prediction accuracy for a
variety of instruments. Importantly, our model translates well to
instruments which were not part of the training set, indicating
the model’s ability to extract universal features. In order to
better understand these features and to go beyond a “black
box” model, we perform a sensitivity analysis to understand the
rationale behind the model predictions and reveal the components
of LOBs that are most relevant. The ability to extract robust
features which translate well to other instruments is an important
property of our model which has many other applications.
I. INTRODUCTION
In today’s competitive financial world more than half of the
markets use electronic Limit Order Books (LOBs) [2] to
record trades [3]. Unlike traditional quote-driven marketplaces,
where traders can only buy or sell an asset at one of the prices
made public by market makers, traders can now directly
view all resting limit orders¹ in the limit order book of an
exchange. Because limit orders are arranged into different
levels based on their submitted prices, the evolution in time of
a LOB represents a multidimensional problem with elements
representing the numerous prices and order volumes/sizes at
multiple levels of the LOB on both the buy and sell sides.
A LOB is a complex dynamic environment with high
dimensionality, inducing modelling complications that
traditional methods struggle to cope with. Mathematical finance is
often dominated by models of evolving price sequences. This
leads to a range of Markov-like models with stochastic driving
terms, such as the vector autoregressive model (VAR) [4] or
the autoregressive integrated moving average model (ARIMA)
[5]. These models, to avoid excessive parameter spaces, often
rely on hand-crafted features of the data. However, given
the billions of electronic market quotes that are generated
every day, it is natural to employ more modern data-driven
machine learning techniques to extract such features.
The authors are with the Oxford-Man Institute of Quantitative Finance,
Department of Engineering Science, University of Oxford. (email: {zihao,
zohren, sjrob}@robots.ox.ac.uk)
¹ Limit orders are orders that do not match immediately upon submission
and are also called passive orders. This is opposed to orders that match
immediately, so-called aggressive orders, such as a market order. A LOB
is simply a record of all resting/outstanding limit orders at a given point in
time.
In addition, limit order data, like any other financial time
series data, is notoriously non-stationary and dominated by
stochastics. In particular, orders at deeper levels of the LOB
are often placed and cancelled in anticipation of future price
moves and are thus even more prone to noise. Other problems,
such as auctions and dark pools [6], add further difficulties,
bringing ever more unobservability into the environment.
The interested reader is referred to [7] in which a number of
these issues are reviewed.
In this paper we design a novel deep neural network
architecture that incorporates both convolutional layers as well
as Long Short-Term Memory (LSTM) units to predict future
stock price movements in large-scale high-frequency LOB
data. One advantage of our model over previous research [8]
is that it has the ability to adapt to many stocks by extracting
representative features from highly noisy data.
In order to avoid the limitations of hand-crafted features, we
use a so-called Inception Module [9] to wrap convolutional and
pooling layers together. The Inception Module helps to infer
local interactions over different time horizons. The resulting
feature maps are then passed into LSTM units which can
capture dynamic temporal behaviour. We test our model on
a publicly available LOB dataset, known as FI-2010 [1], and
our method remarkably outperforms all existing state-of-the-art
algorithms. However, the FI-2010 dataset is only made up
of 10 consecutive days of down-sampled pre-normalised data
from a less liquid market. While it is a valuable benchmark set,
it is arguably not sufficient to fully verify the robustness of an
algorithm. To ensure the generalisation ability of our model,
we further test it by using one year of order book data for 5
stocks from the London Stock Exchange (LSE). To minimise
the problem of overfitting to backtest data, we carefully optimise
any hyper-parameter on a separate validation set before
moving to the out-of-sample test set. Our model delivers robust
out-of-sample prediction accuracy across stocks over a test
period of three months.
As well as presenting results on out-of-sample data (in a
timing sense) from stocks used to form the training set, we
also test our model on out-of-sample (in both timing and
data stream sense) stocks that are not part of the training set.
Interestingly, we still obtain good results over the whole testing
period. We believe this observation shows not only that the
proposed model is able to extract robust features from order
books, but also indicates the existence of universal features
in the order book that modulate stock demand and price. The
ability to transfer the model to new instruments opens up a
number of possibilities that we consider for future work.
To show the practicability of our model we use it in a simple
trading simulation. We focus on sufficiently liquid stocks
so that slippage and market impact are small. Indeed, these
stocks are generally harder to predict than less liquid ones.
Since our trading simulation is mainly meant as a method of
comparison between models, we assume trading takes place at
the mid-price² and compare gross profits before fees. The former
assumption is equivalent to assuming that one side of the
trade may be entered into passively and the latter assumes
that different models trade similar volumes and would thus be
subject to similar fees. Our focus here is using a simulation as
a measure of the relative value of the model predictions in a
trading setting. Under these simplifications, our model delivers
significantly positive returns with a relatively small risk.
Although our network achieves good performance, a complex
“black box” system, such as a deep neural network,
has limited use for financial applications without some
understanding of the rationale behind the model predictions.
Here we exploit the model-agnostic LIME method [10] to
highlight highly relevant components in the order book to gain
a better understanding of the relationship between our predictions
and model inputs. Reassuringly, these conform to sensible (though arguably
unusual) patterns of activity in both price and volume within
the order book.
Outline: The remainder of the paper is organised as follows.
Section II introduces background and related work. Section
III describes limit order data and the various stages of data
preparation. We present our network architecture in Section IV
and give justifications behind each component of the model. In
Section V we compare our work with a large group of popular
methods. Section VI summarises our findings and considers
extensions and future work.
II. BACKGROUND AND RELATED WORK
Research on the predictability of stock markets has a long
history in the financial literature, e.g., [11, 12]. Although opinions
differ regarding the efficiency of markets, many widely
accepted studies show that financial markets are to some extent
predictable [13, 14, 15, 16]. Two major classes of work which
attempt to forecast financial time-series are, broadly speaking,
statistical parametric models and data-driven machine learning
approaches [17]. Traditional statistical methods generally
assume that the time-series under study are generated from
a parametric process [18]. There is, however, agreement that
stock returns behave in more complex ways, typically highly
nonlinearly [19, 20]. Machine learning techniques are able to
capture such arbitrary nonlinear relationships with little, or no,
prior knowledge regarding the input data [21].
Recently, there has been a surge of interest in predicting
limit order book data by using machine learning algorithms
[1, 22, 23, 24, 25, 26, 27, 20, 28, 29]. Among many machine
learning techniques, pre-processing or feature extraction is often
performed as financial time-series data is highly stochastic.
Generic feature extraction approaches have been implemented,
such as the Principal Component Analysis (PCA) and the
Linear Discriminant Analysis (LDA) in the work of [24]. However,
these extraction methods are static pre-processing steps,
which are not optimised to maximise the overall objective
of the model that observes them. In the work of [25, 24],
the Bag-of-Features model (BoF) is expressed as a neural
layer and the model is trained end-to-end using the
back-propagation algorithm, leading to notably better results on the
FI-2010 dataset [1]. These works suggest the importance of a
data-driven approach to extract representative features from a
large amount of data. In our work, we advocate end-to-end
training and show that the deep neural network by itself not
only leads to even better results but also transfers well to new
instruments (not part of the training set), indicating the ability
of networks to extract “universal” features from the raw data.
² The average of the best buy and best sell prices in the market at the time.
Arguably, one of the key contributions of modern deep
learning is the addition of feature extraction and representation
as part of the learned model. The Convolutional Neural Network
(CNN) [30] is a prime example, in which information
extraction, in the form of filter banks, is automatically tuned to
the utility function that the entire network aims to optimise.
CNNs have been successfully applied to various application
domains, for example, object tracking [31], object detection
[32] and segmentation [33]. However, there have been but
a few published works that adopt CNNs to analyse financial
microstructure data [34, 35, 26] and the existing CNN
architectures are rather unsophisticated and lack thorough
investigation. Just like when moving from “AlexNet” [36] to
“VGGNet” [37], we show that a careful design of network
architecture can lead to better results compared with all existing
methods.
The Long Short-Term Memory (LSTM) [38] was originally
proposed to solve the vanishing gradients problem [39] of
recurrent neural networks, and has been largely used in
applications such as language modelling [40] and sequence-to-sequence
learning [41]. Unlike CNNs, which are less widely
applied in financial markets, the LSTM has been popular in
recent years, with [42, 28, 43, 44, 45, 46, 47, 20] all utilising
LSTMs to analyse financial data. In particular, [20] uses
limit order data from 1000 stocks to test a four-layer LSTM
model. Their results show a stable out-of-sample prediction
accuracy across time, indicating the potential benefits of deep
learning methods. To the best of our knowledge, there is no
work that combines CNNs with LSTMs to predict stock price
movements and this is the first extensive study to apply a
nested CNN-LSTM model to raw market data. In particular,
the usage of the Inception Module in this context is novel and is
essential in inferring the optimal “decay rates” of the extracted
features.
III. DATA, NORMALISATION AND LABELLING
A. Limit Order Books
We first introduce some basic definitions of limit order
books (LOBs). For classical references on market microstructure
the reader is referred to [48, 49] and for a short review
on LOBs in particular we refer to [7]. Here we follow the
conventions of [7]. A LOB has two types of orders: bid orders
and ask orders. A bid (ask) order is an order to buy (sell) an
asset at or below (above) a specified price. The bid orders have
prices $P_b(t)$ and sizes/volumes $V_b(t)$, and the ask orders have
prices $P_a(t)$ and sizes/volumes $V_a(t)$. Both $P(t)$ and $V(t)$ are
vectors representing values at different price levels of an asset.

[Figure 1: volume against price (in $) for the bid and ask sides at time t and at time t + 1, with levels L1 to L4 marked on each side.]
Figure 1. A slice of LOB at time t and t + 1. L1 represents the respective
first level, L2 the second, etc. $p_a^{(1)}(t)$ is the lowest ask price (best ask) and
$p_b^{(1)}(t)$ is the highest bid price (best bid) at time t.

Figure 1 illustrates the above concepts. The upper plot
shows a slice of a LOB at time t. Each square in the
plot represents an order of nominal size 1. This is done
for simplicity; in reality, different orders can be of different
sizes. The blue bars represent bid orders and the yellow bars
represent ask orders. Orders are sorted into different levels
based on their submitted prices, where L1 represents the first
level and so on. Each level contains two values: price and
volume. On the bid side, $P_b(t)$ and $V_b(t)$ are 4-vectors in this
example. We use $p_b^{(1)}(t)$ to denote the highest available price
for a buying order (first bid level). Similarly, $p_a^{(1)}(t)$ is the
lowest available price for a selling order (first ask level). The
bottom plot shows the action of an incoming market order to buy
5 shares at time t + 1. As a result, the entire first and second ask
levels are executed against that order and $p_a^{(1)}(t+1)$ moves to 20.8
from 20.6 at time t.
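The matching step in the bottom plot of Figure 1 can be sketched in a few lines of Python. This is a minimal illustration rather than an exchange matching engine: the function name `execute_market_buy` is hypothetical, and the ask volumes (2 and 3 shares at the first two levels) are assumptions chosen so that a 5-share order consumes exactly two levels, as described above.

```python
def execute_market_buy(asks, quantity):
    """Match a market buy order against the ask side of a LOB.

    asks: list of (price, volume) tuples sorted by increasing price.
    Returns (fills, remaining_asks), where fills lists the
    (price, volume) pairs that were executed.
    """
    fills = []
    remaining = []
    for price, volume in asks:
        if quantity > 0:
            traded = min(volume, quantity)
            fills.append((price, traded))
            quantity -= traded
            if volume > traded:
                # Level only partially consumed: keep the rest.
                remaining.append((price, volume - traded))
        else:
            remaining.append((price, volume))
    return fills, remaining


# Hypothetical ask side at time t; volumes are illustrative only.
asks_t = [(20.6, 2), (20.7, 3), (20.8, 4), (20.9, 1)]
fills, asks_t1 = execute_market_buy(asks_t, quantity=5)
best_ask_t1 = asks_t1[0][0]  # new best ask after the order
```

With these assumed volumes the order is filled at 20.6 and 20.7, and the best ask at time t + 1 becomes 20.8.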
B. Input Data
We test our model on two datasets: the FI-2010 dataset
[1] and one year of limit order book data from the
London Stock Exchange (LSE). The FI-2010 dataset [1] is the
first publicly available benchmark dataset of high-frequency
limit order data and contains time-series data extracted for five
stocks from the Nasdaq Nordic stock market for a time period of
10 consecutive days. Many earlier algorithms are tested on
this dataset and we use it to establish a fair comparison to
other algorithms. However, 10 days is an insufficient amount
of data to fully test the robustness and generalisation ability
of an algorithm as the problem of overfitting to backtest data
is severe and we often expect a signal to be consistent over a
few months.
To address the above concerns, we train and test our model
on limit order book data of one year length for Lloyds Bank,
Barclays, Tesco, BT and Vodafone. These five instruments are
among the most liquid stocks listed on the London Stock
Exchange. It is generally more difficult to train models on
more liquid stocks, but at the same time, those instruments
are easier to trade without price impact, which makes the simple
trading simulation used to assess performance more realistic.
The data includes all LOB updates for the above names.
It spans all trading days from 3rd January 2017 to 24th
December 2017 and we restrict it to the interval between
08:30:00 and 16:00:00, so that only normal trading activities
occur and no auction takes place. Each state of the LOB
contains 10 levels on each side and each level contains
information on both price and volume. Therefore, we have
a total of 40 features at each timestamp. Note that the FI-2010
dataset is actually down-sampled limit order book data
because the authors followed [50] to create additional features
by using every non-overlapping block of 10 events. We did
not perform any processing on our data and only feed raw
order book information to our algorithm.
Overall, our LSE dataset is made up of 12 months, and has
more than 134 million samples. On average, there are 150,000
events per day per stock. The events are irregularly spaced
in time. The time interval, $\Delta_{k,k+1}$, between two events can
vary considerably from a fraction of a second to seconds, and
$\Delta_{k,k+1}$ is on average 0.192 seconds in the dataset. We take the
first 6 months as training data, the next 3 months as validation
data and the last 3 months as test data. In the context of
high-frequency data, 3 months of test data corresponds to millions of
observations and therefore provides sufficient scope for testing
model performance and estimating model accuracy.
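A chronological split of this kind can be sketched as follows. The helper `split_by_month` is a hypothetical name, the samples are assumed to be `(date, payload)` pairs, and the use of calendar-month boundaries is an assumption; the text only states the 6/3/3 division.

```python
from datetime import date


def split_by_month(samples):
    """Split (date, payload) pairs chronologically by calendar month.

    Months 1-6 -> training, 7-9 -> validation, 10-12 -> test.
    No shuffling is applied, so the test period lies strictly
    after the data used for fitting and tuning.
    """
    train, val, test = [], [], []
    for day, payload in samples:
        if day.month <= 6:
            train.append((day, payload))
        elif day.month <= 9:
            val.append((day, payload))
        else:
            test.append((day, payload))
    return train, val, test


# One placeholder sample per month of 2017.
samples = [(date(2017, m, 15), f"lob-state-{m}") for m in range(1, 13)]
train, val, test = split_by_month(samples)
```

Keeping the three periods in time order avoids leaking information from the test period into training or hyper-parameter selection.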
C. Data Normalisation and Labelling
The FI-2010 dataset [1] provides three differently normalised
versions of the data: z-score, min-max and decimal precision
normalisation. We used data normalised by z-score without any
modification and found only subtle differences when using the other
two normalisation schemes. For the LSE dataset, we again use
standardisation (z-score) to normalise our data, but use the
mean and standard deviation of the previous 5 days’ data to
normalise the current day’s data (with a separate normalisation
for each instrument). We want to emphasise the importance
of normalisation because the performance of machine learning
algorithms often depends on it. As financial time-series usually
experience regime shifts, using a static normalisation scheme
is not appropriate for a dataset of one year length. The above
method is dynamic and the normalised data often falls into
a reasonable range. We use the 100 most recent states of the
LOB as an input to our model for both datasets. Specifically, a
single input is defined as
$X = [x_1, x_2, \cdots, x_t, \cdots, x_{100}]^T \in \mathbb{R}^{100 \times 40}$,
where $x_t = [p_a^{(i)}(t), v_a^{(i)}(t), p_b^{(i)}(t), v_b^{(i)}(t)]_{i=1}^{n=10}$.
$p^{(i)}$ and $v^{(i)}$ denote the price and volume size at the $i$-th level of a
limit order book.
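The rolling normalisation described above can be sketched for a single feature as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the statistics pool all observations of the previous five days, the population standard deviation is used, and the first five days are simply dropped because they lack a full history. None of these details are specified in the text.

```python
from statistics import mean, pstdev


def normalise_day(previous_days, current_day):
    """Z-score one feature of the current day using the mean and
    standard deviation of the previous days' values only."""
    history = [x for day in previous_days for x in day]
    mu = mean(history)
    sigma = pstdev(history)  # assumes the history is not constant
    return [(x - mu) / sigma for x in current_day]


def rolling_normalise(days, lookback=5):
    """Normalise each day with statistics from the preceding
    `lookback` days; days without a full history are skipped."""
    return [
        normalise_day(days[i - lookback:i], days[i])
        for i in range(lookback, len(days))
    ]


# Toy single-feature example: six "days" of three observations each.
days = [[1.0, 2.0, 3.0]] * 5 + [[2.0, 4.0, 6.0]]
normalised = rolling_normalise(days)
```

Because only past days enter the statistics, the current day's values never influence their own normalisation.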
After normalising the limit order data, we use the mid-price
$$ p_t = \frac{p_a^{(1)}(t) + p_b^{(1)}(t)}{2}, \tag{1} $$
to create labels that represent the direction of price changes.
Although no order can transact exactly at the mid-price,
it expresses a general market value for an asset and it is
frequently quoted when we want a single number to represent
an asset price.
Because financial data is highly stochastic, if we simply
compare $p_t$ and $p_{t+k}$ to decide the price movement, the
resulting label set will be noisy. In the works of [1] and [26],
two smoothing labelling methods are introduced. We briefly
recall the two methods here. First, let $m_-$ denote the mean
of the previous k mid-prices and $m_+$ denote the mean of the
next k mid-prices:
$$ m_-(t) = \frac{1}{k} \sum_{i=0}^{k} p_{t-i}, \qquad m_+(t) = \frac{1}{k} \sum_{i=1}^{k} p_{t+i} \tag{2} $$
where $p_t$ is the mid-price defined in Equation (1) and k is the
prediction horizon. Both methods use the percentage change
($l_t$) of the mid-price to decide directions. We can now define
$$ l_t = \frac{m_+(t) - p_t}{p_t} \tag{3} $$
$$ l_t = \frac{m_+(t) - m_-(t)}{m_-(t)} \tag{4} $$
Both are methods to define the direction of price movement
at time t, where the former, Equation 3, was used in [1] and
the latter, Equation 4, in [26].
The labels are then decided based on a threshold ($\alpha$) for
the percentage change ($l_t$). If $l_t > \alpha$, we define the
movement as up (+1); if $l_t < -\alpha$, we define it as down (−1).
For anything else, we consider it as stationary (0). Figure 2
provides a graphical illustration of the two labelling methods for
the same threshold ($\alpha$) and the same prediction horizon (k).
All the labels classified as down (−1) are shown as red areas and
up (+1) as green areas. The uncoloured (white) regions correspond
to stationary (0) labels.
The FI-2010 dataset [1] adopts the method in Equation 3
and we directly used their labels for fair comparison to other
methods. However, the produced labels are less consistent, as
shown on the top of Figure 2, because this method fits closer
to real prices as smoothing is only applied to future prices.
This is essentially detrimental for designing trading algorithms
as signals are not consistent here, leading to many redundant
trading actions and thus incurring larger transaction costs.
Further, the FI-2010 dataset was collected in 2010 and
the instruments were less liquid compared to now. We experimented
with the approach of [1] on our data from the
London Stock Exchange and found the resulting labels are
rather stochastic; therefore, we adopt the method in Equation 4
for our LSE dataset to produce more consistent signals.
[Figure 2: two panels showing the mid-price $p_t$ (roughly 26.10 to 26.30) over 1,000 time steps, with shaded label regions.]
Figure 2. An example of two smoothed labelling methods based on the same
threshold ($\alpha$) and the same prediction horizon (k). Green shading represents a +1
signal and red a −1. Top: [1]’s method and Bottom: [26]’s method.
IV. MODEL ARCHITECTURE
A. Overview
We here detail our network architecture, which comprises
three main building blocks: standard convolutional layers, an
Inception Module and an LSTM layer, as shown in Figure 3.
The main idea of using CNNs and Inception Modules is to
automate the process of feature extraction as it is often difficult
in financial applications since financial data is notoriously
noisy with a low signal-to-noise ratio. Technical indicators
such as MACD and the Relative Strength Index are often included as
inputs and pre-processing mechanisms such as principal component
analysis (PCA) [51] are often used to transform raw
inputs. However, none of these processes is trivial: they make
tacit assumptions and, further, it is questionable if financial
data can be well-described with parametric models with fixed
parameters. In our work, we only require the history of LOB
prices and sizes as inputs to our algorithm. Weights are learned
during training and features, learned from a large training set,
are data-adaptive, removing the above constraints. An LSTM
layer is then used to capture additional time dependencies
among the resulting features. We note that very short time
dependencies are already captured in the convolutional layer
which takes “space-time images” of the LOB as inputs.
B. Details of Each Component
a) Convolutional Layer: Recently developed electronic
trading algorithms often submit and cancel vast numbers
of limit orders over short periods of time as part of their
trading strategies [52]. These actions often take place deep
in a LOB and it is seen [7] that more than 90% of orders end
in cancellation rather than matching; therefore, practitioners
consider levels further away from best bid and ask levels to
be less useful in any LOB. In addition, the work of [53]
suggests that the best ask and best bid (L1-Ask and L1-Bid)
contribute most to the price discovery and the contribution
of all other levels is considerably less, estimated at as little
as 20%. As a result, it would be otiose to feed all level
information to a neural network as levels deep in a LOB are
less useful and can potentially even be misleading. Naturally,
contained in deeper levels. We note that convolution filters
used in any CNN architecture are discrete convolutions, or
finite impulse response (FIR) filters, from the viewpoint of
JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. XX, XXX 5
Input
Conv
1x2@16 (1,2)
4x1@16
4x1@16
1x10@16
4x1@16
4x1@16
1x10@16
4x1@16
4x1@16
Conv
1x10@16
4x1@16
4x1@16
Conv
1x10@16
4x1@16
4x1@16
Inception@32
LSTM@64 Units
Conv
1x2@16 (stride = 1x2)
4x1@16
4x1@16
1x2@16 (stride = 1x2)
4x1@16
4x1@16
1x10@16
4x1@16
4x1@16
Figure 3. Model architecture schematic. Here 1x2@16 represents a convolu
tional layer with 16 filters of size (1× 2). ‘1’ convolves through time indices
and ‘2’ convolves different limit order book levels.
signal processing [54]. FIR filters are popular smoothing
techniques for denoising target signals and they are simple
to implement and work with. We can write any FIR filter in
the following form:
y(n) =
M∑
k=0
bkx(n− k) (5)
where the output signal y(n) at any time is a weighted sum
of a finite number of past values of the input signal x(n). The
filter order is denoted as M and bk is the filter coefficient.
In a convolutional neural network, the coefficients of the
filter kernel are not obtained via a statistical objective from
traditional signal filtration theory, but are left as degrees of
freedom which the network infers so as to extremise its value
function at output.
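The FIR form in Equation (5) can be made concrete with a short Python sketch. The function name `fir_filter` and the zero treatment of samples before the start of the signal are assumptions for illustration; the coefficients here are fixed by hand, whereas in a CNN they are learned.

```python
def fir_filter(x, b):
    """Apply y(n) = sum_{k=0}^{M} b[k] * x(n - k).

    Samples before the start of the signal are treated as zero
    (an assumption; other boundary conventions are possible).
    """
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, coeff in enumerate(b):
            if n - k >= 0:
                acc += coeff * x[n - k]
        y.append(acc)
    return y


# Order M = 2 moving average: three equal coefficients.
signal = [3.0, 6.0, 9.0, 6.0, 3.0]
smoothed = fir_filter(signal, b=[1 / 3, 1 / 3, 1 / 3])
```

With equal coefficients the filter is a simple moving average; replacing `b` with learned values gives the one-dimensional analogue of a convolutional kernel.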
The details of the first convolutional layer inevitably need
some consideration. As convolutional layers operate a small
kernel to “scan” through input data, the layout of limit order
book information is vital. Recall that we take the 100 most
recent updates of an order book to form a single input and
there are 40 features per time stamp, so the size of a single
input is (100 × 40). We organise the 40 features as follows:
$$ \{p_a^{(i)}(t), v_a^{(i)}(t), p_b^{(i)}(t), v_b^{(i)}(t)\}_{i=1}^{n=10} \tag{6} $$
where i denotes the i-th level of a limit order book. The
size of our first convolutional filter is (1 × 2) with stride of
(1 × 2). The first layer essentially summarises information
between price and volume $\{p^{(i)}, v^{(i)}\}$ at each order book
level. The usage of stride is necessary here as an important
property of convolutional layers is parameter sharing. This
property is attractive as fewer parameters are estimated, largely
avoiding overfitting problems. However, without strides, we
would apply the same parameters to $\{p^{(i)}, v^{(i)}\}$ and $\{v^{(i)}, p^{(i+1)}\}$.
In other words, $p^{(i)}$ and $v^{(i)}$ would share the same parameters
because the kernel filter moves by one step, which is obviously
wrong as price and volume form different dynamic behaviours.
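The effect of the stride can be seen by listing which features a (1 × 2) kernel covers as it slides along one row of the input. The helper `kernel_windows` and the feature labels are hypothetical and serve only to illustrate the layout of Equation (6).

```python
def kernel_windows(features, width=2, stride=2):
    """Return the groups of features a (1 x width) kernel sees as it
    slides along one row of the input with the given stride."""
    return [
        tuple(features[i:i + width])
        for i in range(0, len(features) - width + 1, stride)
    ]


# First two levels of one LOB snapshot, laid out as in Equation (6).
row = ["p_a1", "v_a1", "p_b1", "v_b1", "p_a2", "v_a2", "p_b2", "v_b2"]

with_stride = kernel_windows(row, stride=2)
without_stride = kernel_windows(row, stride=1)
```

With stride 2 every window is a (price, volume) pair from the same side and level. With stride 1 the kernel also lands on mixed windows such as `("v_a1", "p_b1")`, so the same two weights would be applied to price in one window and to volume in the next.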
Because the first layer only captures information at each
order book level, we would expect representative features to be
extracted when integrating information across multiple order
book levels. We can do this by utilising another convolutional
layer with filter size (1 × 2) and stride (1 × 2). The resulting
feature maps actually form the micro-price defined by [55]:
$$ p_{\text{micro price}} = I p_a^{(1)} + (1 - I) p_b^{(1)}, \qquad I = \frac{v_b^{(1)}}{v_a^{(1)} + v_b^{(1)}} \tag{7} $$
The weight I is called the imbalance. The micro-price is an
important indicator as it considers volumes on bid and ask side,
and the imbalance between bid and ask size is a very strong
indicator of the next price move. This feature of imbalances
has been reported by a variety of researchers [56, 57, 58, 59,
60]. Unlike the micro-price where only the first order book
level is considered, we utilise convolutions to form micro-prices
for all levels of a LOB so the resulting feature maps
are of size (100, 10) after two layers with strides. Finally, we
integrate all information by using a large filter of size (1 × 10)
and the dimension of our feature maps before the Inception
Module is (100, 1).
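For reference, the hand-crafted micro-price of Equation (7) can be computed directly. The function name `micro_price` and the quotes below are hypothetical; the convolutional layers learn their own weights and are not constrained to reproduce this formula exactly.

```python
def micro_price(p_ask, v_ask, p_bid, v_bid):
    """Micro-price of Equation (7) for one order book level."""
    imbalance = v_bid / (v_ask + v_bid)
    return imbalance * p_ask + (1 - imbalance) * p_bid


# Hypothetical first-level quotes with equal and unequal volumes.
balanced = micro_price(p_ask=20.6, v_ask=100, p_bid=20.5, v_bid=100)
bid_heavy = micro_price(p_ask=20.6, v_ask=100, p_bid=20.5, v_bid=300)
```

With equal volumes the micro-price coincides with the mid-price of Equation (1); when more volume rests on the bid, it moves towards the ask, reflecting buying pressure.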
We apply zero padding to every convolutional layer so the
time dimension of our inputs does not change and Leaky Rectified
Linear Units (Leaky-ReLU) [61] are used as activation
functions. The hyper-parameter (the small gradient when the
unit is not active) of the Leaky-ReLU is set to 0.01, selected
by grid search on the validation set.
Another important property of convolution is that of
equivariance to translation [62]. Specifically, a function $f(x)$ is
equivariant to a function $g$ if $f(g(x)) = g(f(x))$. For example,
suppose that there exists a main classification feature $m$
located at $(x_m, y_m)$ of an image $I(x, y)$. If we shift every
pixel of $I$ one unit to the right, we get a new image $I'$
where $I'(x, y) = I(x - 1, y)$. We can still obtain the main
classification feature $m'$ in $I'$ and $m = m'$, while the location
of $m'$ will be at $(x_{m'}, y_{m'}) = (x_m + 1, y_m)$. This is important
to time-series data, because convolution can find universal
features that are decisive to final outputs. In our case, suppose
a feature that studies imbalance is obtained at time t. If the
same event happens later at time t′ in the input, the exact
feature can be extracted later at t′.
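Equivariance to translation can be checked numerically on a toy sequence. The sketch below uses a causal one-dimensional convolution with zero padding on the left; the helper names `conv1d` and `shift_right` and the difference kernel are illustrative assumptions, not part of the model.

```python
def conv1d(x, kernel):
    """Same-length causal convolution with zero padding on the left."""
    out = []
    for n in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if n - k >= 0:
                acc += w * x[n - k]
        out.append(acc)
    return out


def shift_right(x, steps):
    """Delay a sequence by `steps` samples, padding with zeros."""
    return [0.0] * steps + x[:len(x) - steps]


kernel = [1.0, -1.0]  # simple difference detector
x = [0.0, 0.0, 5.0, 5.0, 0.0, 0.0, 0.0, 0.0]

a = conv1d(shift_right(x, 2), kernel)  # shift, then convolve
b = shift_right(conv1d(x, kernel), 2)  # convolve, then shift
```

Both orders of operation give the same output, and the detected jump simply appears two steps later, which is the behaviour described above for an event recurring at a later time t′.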
We do not use any pooling layer except in the Inception
Modules. Although pooling layers help us find representations
invariant to translations of the input, the smoothing nature
of pooling can cause under-fitting. Common pooling layers
are designed for image processing tasks, and they are most
powerful when we only care if certain features exist in the
inputs instead of where they exist [62]. Time-series data
has different characteristics from images and the location of
representative features is important. Our experience shows
that pooling layers in the convolutional layers, at least, cause
under-fitting problems on the LOB data. However, we think
pooling is important and new pooling methods should be
designed to process time-series data as it is a promising
solution to extract invariant features.

[Figure 4: three parallel branches from the input (Conv 1x1@32 → Conv 3x1@32; Conv 1x1@32 → Conv 5x1@32; Maxpool 3x1 → Conv 1x1@32), merged by concatenation to form Inception@32.]
Figure 4. The Inception Module used in the model. For example, 3 × 1@32
represents a convolutional layer with 32 filters of size (3 × 1).

b) Inception Module: We note that all filters of a standard
convolutional layer have fixed size. If, for example, we
employ filters of size (4 × 1), we capture local interactions
amongst data over four time steps. However, we can capture
dynamic behaviours over multiple timescales by using Inception
Modules to wrap several convolutions together. We find
that this offers a performance improvement to the resultant
model.
The idea of the Inception Module can also be considered as
using different moving averages in technical analysis. Practitioners
often use moving averages with different decay weights
to observe time-series momentum [63]. If a large decay weight
is adopted, we get a smoother time-series that well represents
the long-term trend, but we could miss small variations that are
important in high-frequency data. In practice, it is a daunting
task to set the right decay weights. Instead, we can use
Inception Modules and the weights are then learned during
back-propagation.
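The trade-off between decay weights can be illustrated with an exponential moving average, used here as one common form of moving average. The function name `ema` and the decay values are assumptions for illustration; the Inception Module itself learns its filter weights rather than applying a fixed average.

```python
def ema(series, decay):
    """Exponential moving average s_t = decay*s_{t-1} + (1-decay)*x_t.

    A decay close to 1 gives a smooth, slow-moving average; a decay
    close to 0 tracks the raw series closely.
    """
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(decay * smoothed[-1] + (1 - decay) * x)
    return smoothed


# Step change in a toy price series.
prices = [10.0] * 5 + [11.0] * 5
fast = ema(prices, decay=0.2)
slow = ema(prices, decay=0.9)
```

Five steps after the jump, the fast average has almost reached the new level while the slow one has covered less than half of the move, which is why a single hand-picked decay weight is hard to get right.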
In our case, we split the input into a small set of lower-dimensional
representations by using 1 × 1 convolutions,
transform the representations by a set of filters, here 3 × 1
and 5 × 1, and then merge the outputs. A max-pooling layer
is used inside the Inception Module, with stride 1 and zero
padding. “Inception@32” represents one module and indicates
all convolutional layers have 32 filters in this module, and
the approach is depicted schematically in Figure 4. The 1 × 1
convolutions form the Network-in-Network approach proposed
in [64]. Instead of applying a simple convolution to our data,
the Network-in-Network method uses a small neural network
to capture nonlinear properties of our data. We find this
method to be effective and it gives us an improvement on
prediction accuracy.
c) LSTM Module and Output: In general, a fully connected
layer is used to classify the input data. However, all
inputs to the fully connected layer are assumed independent of
each other unless multiple fully connected layers are used. Due
to the usage of the Inception Module in our work, we have a large
number of features at the end. Just using one fully connected layer
with 64 units would result in more than 630,000 parameters
to be estimated, not to mention multiple layers. In order
to capture the temporal relationships that exist in the extracted
features, we replace the fully connected layers with LSTM
units. The activation of an LSTM unit is fed back to itself
and the memory of past activations is kept with a separate
set of weights, so the temporal dynamics of our features can
be modelled. We use 64 LSTM units in our work, resulting in
about 60,000 parameters, roughly 10 times fewer parameters
to be estimated. The last output layer uses a softmax activation
function and hence the final output elements represent the
probability of each price movement class at each time step.
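The source of the saving can be sketched with the standard parameter-count formulas for a dense layer and an LSTM layer. The input shape used here (100 time steps with 96 features each) is an assumption for illustration only, so the resulting counts are indicative and need not match the figures quoted above.

```python
def dense_params(n_inputs, units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * units + units


def lstm_params(n_features, units):
    """Standard LSTM: four gates, each with input, recurrent and bias weights."""
    return 4 * (units * (n_features + units) + units)


# Assumed shapes (not stated in the text): 100 time steps, 96 features per step.
TIME_STEPS, FEATURES, UNITS = 100, 96, 64

# A dense layer sees the flattened sequence, so its size grows with TIME_STEPS.
dense_total = dense_params(TIME_STEPS * FEATURES, UNITS)
# An LSTM reuses the same weights at every time step.
lstm_total = lstm_params(FEATURES, UNITS)
```

The dense layer needs one weight per (time step, feature, unit) triple, whereas the LSTM shares its weights across time, which is why its count does not depend on the sequence length.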
V. EXPERIMENTAL RESULTS
A. Experiments Settings
We apply the same architecture to all our experiments in
this section and the proposed model is denoted as DeepLOB.
We learn the parameters by minimising the categorical cross-entropy
loss. The Adaptive Moment Estimation algorithm,
ADAM [65], is utilised and we set the parameter “epsilon” to
1 and the learning rate to 0.01. Learning is stopped when
validation accuracy does not improve for 20 more epochs. This
is about 100 epochs for the FI-2010 dataset and 40 epochs for
the LSE dataset.
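The stopping rule can be sketched as follows. The sketch assumes that "does not improve for 20 more epochs" means 20 consecutive epochs without a new best validation accuracy; the function name and this reading of the rule are assumptions.

```python
def early_stopping_epoch(val_accuracies, patience=20):
    """Return the 1-based epoch at which training stops.

    Training stops once validation accuracy has failed to improve on its best
    value for `patience` consecutive epochs (assumed interpretation). If that
    never happens, all supplied epochs are used.
    """
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, accuracy in enumerate(val_accuracies, start=1):
        if accuracy > best:
            best = accuracy
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_accuracies)


# Hypothetical validation curve: best value at epoch 3, flat afterwards.
history = [0.50, 0.60, 0.70, 0.65, 0.65, 0.65, 0.65, 0.65]
stop_at = early_stopping_epoch(history, patience=3)
```

In a Keras workflow the same behaviour is usually obtained with an early-stopping callback rather than a hand-written loop.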
We train with mini-batches of size 32. We choose a small
mini-batch size due to the findings in [66], which suggest
that large-batch methods tend to converge to narrow deep
minima of the training functions, whereas small-batch methods
consistently converge to shallow broad minima. All models
are built using Keras [67] with the TensorFlow backend
[68], and we train them using a single NVIDIA Tesla P100
GPU.
B. Experiments on the FI-2010 Dataset
There are two experimental setups using the FI-2010
dataset. Following the convention of [24], we denote them
as Setup 1 and Setup 2. Setup 1 splits the dataset into 9 folds
on a daily basis (a standard anchored forward split). In
the ith fold, we train our model on the first i days and test it
on the (i + 1)th day, where i = 1, ..., 9. The second setting,
Setup 2, originates from the works [26, 28, 27, 25] in which
deep network architectures were evaluated. As deep learning
techniques often require a large amount of data to calibrate
weights, the first 7 days are used as the training data and the last
3 days are used as the test data in this setup. We evaluate our
model in both setups here.
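The two splitting schemes can be sketched as follows, assuming 10 trading days indexed from 1 (consistent with 9 folds in Setup 1 and a 7/3 split in Setup 2). The function names are hypothetical.

```python
def anchored_forward_splits(n_days):
    """Yield (train_days, test_day) pairs with 1-based day indices.

    Fold i trains on days 1..i and tests on day i + 1.
    """
    for i in range(1, n_days):
        yield list(range(1, i + 1)), i + 1


def fixed_split(n_days, n_train):
    """Single split: first `n_train` days for training, the rest for testing."""
    days = list(range(1, n_days + 1))
    return days[:n_train], days[n_train:]


setup1 = list(anchored_forward_splits(10))  # 9 folds over 10 days
setup2 = fixed_split(10, n_train=7)         # 7 training days, 3 test days
```

The first fold of Setup 1 trains on a single day, which makes the data shortage for deep models in early folds easy to see.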
Table I shows the results of our model compared to other
methods in Setup 1. Performance is measured by calculating
the mean accuracy, recall, precision, and F1 score over all
folds. As the FI-2010 dataset is not well balanced, [1] suggests
focusing on the F1 score for a fair comparison. We have
compared our model to all existing experimental results,
including Ridge Regression (RR) [1], Single-Layer-Feedforward
Network (SLFN) [1], Linear Discriminant Analysis (LDA)
[22], Multilinear Discriminant Analysis (MDA) [22],
Multilinear Time-series Regression (MTR) [22], Weighted
Multilinear Time-series Regression (WMTR) [22], Multilinear
Table I
SETUP 1: EXPERIMENT RESULTS FOR THE FI-2010 DATASET
Model Accuracy % Precision % Recall % F1 %
Prediction Horizon k = 10
RR [1] 48.00 41.80 43.50 41.00
SLFN [1] 64.30 51.20 36.60 32.70
LDA [22] 63.83 37.93 45.80 36.28
MDA [22] 71.92 44.21 60.07 46.06
MCSDA [23] 83.66 46.11 48.00 46.72
MTR [22] 86.08 51.68 40.81 40.14
WMTR [22] 81.89 46.25 51.29 47.87
BoF [24] 57.59 39.26 51.44 36.28
NBoF [24] 62.70 42.28 61.41 41.63
B(TABL) [25] 73.62 66.16 68.81 67.12
C(TABL) [25] 78.01 72.03 74.04 72.84
DeepLOB 78.91 78.47 78.91 77.66
Prediction Horizon k = 50
RR [1] 43.90 43.60 43.30 42.70
SLFN [1] 47.30 46.80 46.40 45.90
BoF [24] 50.21 42.56 49.57 39.56
NBoF [24] 56.52 47.20 58.17 46.15
B(TABL) [25] 69.54 69.12 68.84 68.84
C(TABL) [25] 74.81 74.58 74.27 74.32
DeepLOB 75.01 75.10 75.01 74.96
Prediction Horizon k = 100
RR [1] 42.90 42.90 42.90 41.60
SLFN [1] 47.70 45.30 43.20 41.00
BoF [24] 50.97 42.48 47.84 40.84
NBoF [24] 56.43 47.27 54.99 46.86
B(TABL) [25] 69.31 68.95 69.41 68.86
C(TABL) [25] 74.07 73.51 73.80 73.52
DeepLOB 76.66 76.77 76.66 76.58
Class-specific Discriminant Analysis (MCSDA) [23], Bag-of-Feature
(BoF) [24], Neural Bag-of-Feature (NBoF) [24], and
Attention-augmented-Bilinear-Network with one hidden layer
(B(TABL)) and two hidden layers (C(TABL)) [25]. More
methods, such as PCA and Autoencoder (AE), are actually
tested in those works but, for simplicity, we report only their
best results. Our model achieves the highest F1 score at every
prediction horizon in Table I.
However, Setup 1 is not ideal for training deep learning
models since, as mentioned, deep networks often require
a large amount of data to calibrate weights. This anchored
forward setup leaves only one or two days of training data for
the first few folds, and we observe worse performance in the
first few days. As the training data grows, we observe remarkably
better results, as shown in Table II, which compares
our network to other methods in Setup 2. In
particular, the important difference between our model and
CNN-I [26] and CNN-II [27] lies in the network architecture,
and we can see large improvements in performance here. In
Table III, we compare the parameter sizes of DeepLOB with
CNN-I [26]. Although our model has many more layers, there
are far fewer parameters in our network due to the usage of
LSTM layers instead of fully connected layers.
We also report the computation time (forward pass) in
milliseconds (ms) for available algorithms in Table III. Due
to the development of GPUs, training deep networks is now
feasible and predictions are fast to compute, which makes such
models viable for high-frequency trading. We will discuss this more in the
next section.
Table II
SETUP 2: EXPERIMENT RESULTS FOR THE FI-2010 DATASET
Model Accuracy % Precision % Recall % F1 %
Prediction Horizon k = 10
SVM [28] - 39.62 44.92 35.88
MLP [28] - 47.81 60.78 48.27
CNN-I [26] - 50.98 65.54 55.21
LSTM [28] - 60.77 75.92 66.33
CNN-II [27] - 56.00 45.00 44.00
B(TABL) [25] 78.91 68.04 71.21 69.20
C(TABL) [25] 84.70 76.95 78.44 77.63
DeepLOB 84.47 84.00 84.47 83.40
Prediction Horizon k = 20
SVM [28] - 45.08 47.77 43.20
MLP [28] - 51.33 65.20 51.12
CNN-I [26] - 54.79 67.38 59.17
LSTM [28] - 59.60 70.52 62.37
CNN-II [27] - - - -
B(TABL) [25] 70.80 63.14 62.25 62.22
C(TABL) [25] 73.74 67.18 66.94 66.93
DeepLOB 74.85 74.06 74.85 72.82
Prediction Horizon k = 50
SVM [28] - 46.05 60.30 49.42
MLP [28] - 55.21 67.14 55.95
CNN-I [26] - 55.58 67.12 59.44
LSTM [28] - 60.03 68.58 61.43
CNN-II [27] - 56.00 47.00 47.00
B(TABL) [25] 75.58 74.58 73.09 73.64
C(TABL) [25] 79.87 79.05 77.04 78.44
DeepLOB 80.51 80.38 80.51 80.35
Table III
AVERAGE COMPUTATION TIME OF STATE-OF-THE-ART MODELS
Models Forward (ms) Number of parameters
BoF [24] 0.972 86k
NBoF [24] 0.524 12k
CNN-I [26] 0.025 768k
LSTM [28] 0.061 -
C(TABL) [25] 0.229 -
DeepLOB 0.253 60k
C. Experiments on the London Stock Exchange (LSE)
As we suggested, the FI-2010 dataset is not sufficient to
verify a prediction model: it is far too short, downsampled
and taken from a less liquid market. To perform a meaningful
evaluation that can hold up to modern applications, we further
test our method on one year of data for stocks from the LSE,
with a testing period of three months. As mentioned in Section
III, we train our model on five stocks: Lloyds Bank (LLOY),
Barclays (BARC), Tesco (TSCO), BT and Vodafone (VOD).
The recent work of [20] suggests that deep learning techniques
can extract universal features for limit order data. To test this
universality, we directly apply our model to five more stocks
that were not part of the training data set (transfer learning).
We select HSBC, Glencore (GLEN), Centrica (CNA), BP and
ITV for transfer learning because they are also among the most
liquid stocks on the LSE. The testing period is the same three
months as before, and the classes are roughly balanced.
Table IV presents the results of our model for all stocks on
different prediction horizons. To better investigate the results,
Table IV
EXPERIMENT RESULTS FOR THE LSE DATASET
Prediction Horizon Accuracy % Precision % Recall % F1 %
Results on LLOY, BARC, TSCO, BT and VOD
k=20 70.17 70.17 70.17 70.15
k=50 63.93 63.43 63.93 63.49
k=100 61.52 60.73 61.52 60.65
Results on Transfer Learning (GLEN, HSBC, CNA, BP, ITV)
k=20 68.62 68.64 68.63 68.48
k=50 63.44 62.81 63.45 62.84
k=100 61.46 60.68 61.46 60.77
[Figure 5 panels: six 3 × 3 confusion matrices. Both axes are ordered Down, Stationary, Up; cell counts are listed one matrix row per line.]
Top (LLOY, BARC, TSCO, BT and VOD), k = 20:
9667343 2532164 907266
2910692 10879399 2570201
1177182 2617167 9364113
Top, k = 50:
9546221 3069960 889661
3603711 7524565 4373401
652603 2776885 10169020
Top, k = 100:
9996056 2650105 673195
4999538 6688453 4162581
900722 2996586 9506291
Bottom (transfer learning: GLEN, HSBC, CNA, BP, ITV), k = 20:
14188991 3414391 1189903
5234454 15298391 4761173
1532771 3627615 13738088
Bottom, k = 50:
14673322 4028731 975993
6493267 10999252 6046319
1095414 4376190 14277789
Bottom, k = 100:
14718401 4289685 940541
7111634 9936780 5968307
1284541 4662741 14021147
Figure 5. Confusion matrices. Top: results on LLOY, BARC, TSCO, BT and
VOD. From left to right, the prediction horizon (k) equals 20, 50 and 100;
Bottom: results on transfer learning (GLEN, HSBC, CNA, BP, ITV).
we display the confusion matrices in Figure 5 and calculate
the accuracy for every day and for every stock across the
testing period. We use the boxplots in Figure 6 to present
this information, and we observe consistent and robust
performance, with a narrow interquartile range (IQR) and few
outliers, for all stocks across the testing period. The ability
of our model to generalise well to data not in the training
set indicates that the CNN block in the algorithm, acting to
extract features from the LOB, can capture universal patterns
that relate to the price formation mechanism. We find this
observation most interesting.
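The accuracy reported in Table IV can be cross-checked against the confusion-matrix counts of Figure 5. The sketch below uses the k = 20 counts for the five training stocks. It assumes that rows hold the true classes and columns the predicted classes; the axis orientation is not stated in the text, and only the per-class scores depend on it, not the accuracy.

```python
# Figure 5 counts, training stocks, k = 20; classes ordered Down, Stationary, Up.
CM_K20 = [
    [9667343, 2532164, 907266],
    [2910692, 10879399, 2570201],
    [1177182, 2617167, 9364113],
]


def accuracy(cm):
    """Fraction of samples on the diagonal."""
    return sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))


def per_class_scores(cm):
    """List of (precision, recall, f1) per class.

    Assumes rows = true class and columns = predicted class.
    """
    n = len(cm)
    scores = []
    for i in range(n):
        true_positive = cm[i][i]
        recall = true_positive / sum(cm[i])
        precision = true_positive / sum(cm[r][i] for r in range(n))
        f1 = 2 * precision * recall / (precision + recall)
        scores.append((precision, recall, f1))
    return scores


def weighted_recall(cm):
    """Recall averaged over classes, weighted by class support."""
    total = sum(map(sum, cm))
    scores = per_class_scores(cm)
    return sum(sum(cm[i]) / total * scores[i][1] for i in range(len(cm)))


acc_k20 = accuracy(CM_K20)
```

The computed accuracy rounds to the 70.17% listed for k = 20 in Table IV. Support-weighted recall is algebraically equal to accuracy, which would be consistent with the identical Accuracy and Recall columns in Table IV, although the averaging scheme is not stated in the text.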
D. Performance of the Model in a Simple Trading Simulation
A simple trading simulation is designed to test the practicability
of our results. We set the number of shares per trade, µ,
to one, both for simplicity and to minimise the market impact,
so that orders can be executed at the best price. Although
µ can be optimised to maximise the returns (for example,
prediction probabilities are used to size the orders in [69]), we
would like to show that our algorithm can work even under
this simple setup.
To reduce the number of trades, we use the following rules
to take actions. At each time step, our model generates a
signal from the network outputs (−1, 0, +1) to indicate the
price movement in k steps. Signals (−1, 0, +1) correspond
to actions (sell, wait and buy). If our model produces a
prediction of +1 at time t, we buy µ shares at time t + 5
Figure 6. Boxplots of daily accuracy for the different prediction horizons (k = 20, 50 and 100).
Top: results on LLOY, BARC, TSCO, BT and VOD; Bottom: results on
transfer learning (GLEN, HSBC, CNA, BP, ITV).
(taking slippage into account), and hold until −1 appears to
sell all µ shares (we do nothing if 0 appears). We apply the
same rule to short selling and repeat the process during a day.
All positions are closed by the end of the day, so we hold no
stocks overnight. We make sure no trades take place at the
time of auction, so no abnormal profits are generated.
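The trading rules above can be sketched for a single day as follows. Several details are assumptions rather than statements from the text: exits are filled with the same 5-step delay as entries, an opposing signal only closes the position (a new position is opened by a later signal), all fills are at the mid-price, and any open position is closed at the last mid-price of the day. The function name and arguments are hypothetical.

```python
def simulate_day(signals, mid_prices, delay=5, shares=1):
    """Return the list of per-trade profits for one day.

    signals[t] is -1, 0 or +1; mid_prices[t] is the mid-price at step t.
    """
    n = len(mid_prices)
    position = 0        # +1 long, -1 short, 0 flat
    entry_price = None
    profits = []
    for t, signal in enumerate(signals):
        fill = t + delay  # assumed: every order is filled `delay` steps later
        if fill >= n:     # the order could not be filled before the close
            break
        if position == 0 and signal != 0:
            # Open a long (+1) or short (-1) position.
            position, entry_price = signal, mid_prices[fill]
        elif position != 0 and signal == -position:
            # Opposing signal: close the position, stay flat afterwards.
            profits.append(position * shares * (mid_prices[fill] - entry_price))
            position, entry_price = 0, None
    if position != 0:
        # No overnight holdings: close at the final mid-price of the day.
        profits.append(position * shares * (mid_prices[-1] - entry_price))
    return profits


# Hypothetical day: prices rise by one unit per step.
prices = [100.0 + t for t in range(20)]
signals = [0] * 20
signals[0], signals[5] = 1, -1   # buy signal at t = 0, sell signal at t = 5
trades = simulate_day(signals, prices)
```

Dividing the sum of `trades` by the number of trades gives a per-trade daily profit of the kind used when results are normalised by trade count.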
The focus of our work is on predictions, and the above
simple simulation is a way of showing that these predictions are in
principle monetisable. In particular, our aim is not to present
a fully developed, stand-alone trading strategy. Realistic high-frequency
strategies often require a combination of various
trading signals, in particular to time the exact entry and exit
points of the trade. For the purpose of the above simulation we
make two assumptions: trades are executed at mid-prices, and
there are no transaction costs. While the second assumption
in particular is not reasonable for a
stand-alone strategy, we argue that (i) it is enough for a relative
comparison of the above models and (ii) it is a good indicator
of the relative value of the above predictor to a more complex
high-frequency trading model. Regarding the first assumption,
a mid-mid simulation, we note that in high-frequency trading
many participants are involved in market making, as it is
difficult to design profitable fully aggressive strategies with
such short holding periods. If we assume that we are able
to enter the trade passively, while we exit it aggressively,
crossing the spread, then this is effectively equivalent to a
mid-mid trade. Such a situation arises naturally, for example,
in investment banks which are involved in client market
making. Regarding the second assumption, careful timing of
the entry points as well as more elaborate trading rules, such
as position upsizing, should be able to generate
additional profits to cover the transaction costs. In any case,
as merely a metric for testing the predictability of our model, the
above simple simulation suffices.
Figure 7 presents the boxplots for normalised daily profits
(profits divided by number of trades in that day) for different
[Figures 7 and 8 cover the stocks LLOY, BARC, TSCO, BT, VOD, GLEN, HSBC, ITV, BP and CNA, each for k = 20, 50 and 100.]
Figure 7. Boxplots for normalised daily profits and t-statistics for different stocks and prediction horizons (k). Profits are in GBX (= GBP/100).
Figure 8. Normalised cumulative profits for test periods for different stocks and prediction horizons (k). Profits are in GBX (= GBP/100).
stocks and prediction horizons. We use a t-test to check whether
the profits are statistically greater than 0. The t-statistic is
essentially the same as the Sharpe ratio, but it is a more consistent
evaluation metric for high-frequency trading. Figure 8 shows
the cumulative profits across the testing period. We observe
consistent profits and significant t-values over the testing
period for all stocks. Although we obtain worse accuracy for
longer prediction horizons, the cumulative profits are actually
higher, as a more robust signal is generated.
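The link between the t-statistic and the Sharpe ratio can be made explicit with a small sketch. It assumes a one-sample t-test of daily profits against zero and a per-day Sharpe ratio with a zero benchmark and no annualisation; both choices and the function names are assumptions, and the profit values are hypothetical.

```python
import math
import statistics


def t_statistic(daily_profits):
    """One-sample t-statistic for the null hypothesis of zero mean profit."""
    n = len(daily_profits)
    mean = statistics.mean(daily_profits)
    std = statistics.stdev(daily_profits)  # sample standard deviation (n - 1)
    return mean / (std / math.sqrt(n))


def sharpe_ratio(daily_profits):
    """Per-day Sharpe ratio with a zero benchmark (assumed, not annualised)."""
    return statistics.mean(daily_profits) / statistics.stdev(daily_profits)


# Hypothetical normalised daily profits over five trading days.
profits = [1.0, 2.0, 3.0, 2.0, 2.0]
t_value = t_statistic(profits)
sharpe = sharpe_ratio(profits)
```

Under these assumptions the t-statistic equals the Sharpe ratio multiplied by the square root of the number of days, so the two differ only by a factor that depends on the sample size.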
E. Sensitivity Analysis
Trust and risk are fundamental in any financial application.
If we take actions based on predictions, it is always important
to understand the reasons behind those predictions. Neural
networks are often considered as “black boxes” which lack
interpretability. However, if we understand the relationship
between the inputs’ components (e.g. words in text, patches in
an image) and the model’s prediction, we can compare those
relationships with our domain knowledge to decide if we can
accept or reject a prediction.
The work of [10] proposes a method, which they call LIME,
to obtain such explanations. In our case, we use LIME to reveal
the components of LOBs that are most important for predictions
and to understand why the proposed model DeepLOB works
better than other network architectures such as CNN-I [26].
LIME uses an interpretable model to approximate the prediction
of a complex model on a given input. It locally perturbs
the input and observes variations in the model’s predictions,
thus providing some measure of information regarding input
importance and sensitivity.
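The perturb-and-observe idea can be illustrated with a much simpler procedure than LIME itself. The sketch below is not LIME: LIME fits an interpretable surrogate model to many random perturbations, whereas this sketch replaces one input component at a time by a baseline value and records the change in the prediction. The toy classifier, its weights and the function names are hypothetical.

```python
import math


def occlusion_importance(predict, x, baseline=0.0):
    """Change in predict(x) when each component is replaced by `baseline`.

    Positive values mark components that support the prediction,
    negative values mark components that work against it.
    """
    reference = predict(x)
    importances = []
    for i in range(len(x)):
        perturbed = list(x)
        perturbed[i] = baseline        # occlude a single component
        importances.append(reference - predict(perturbed))
    return importances


def toy_model(x):
    """Hypothetical stand-in for a classifier: a logistic score of three inputs."""
    weights = [2.0, -1.0, 0.0]
    z = sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


importance = occlusion_importance(toy_model, [1.0, 1.0, 1.0])
```

For this toy model the first component is supportive, the second works against the prediction and the third has no influence, which mirrors the green, red and uncoloured regions of a LIME plot.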
Figure 9 presents an example that shows how DeepLOB
and CNN-I [26] react to a given input. In the figure we
show the top 10 areas of pros (in green) and cons (in red)
for the predicted class (yellow being the boundary). Uncoloured
areas represent the components of inputs that are
less influential on the predicted results, or “unimportant”. We
note that most components of the input are inactive for
CNN-I [26]. We believe that this is due to the two max-pooling layers
used in that architecture. Because [26] used large-size filters
in the first convolutional layer, any representation deep in the
network actually represents information gleaned from a large
portion of the inputs. Our experiments applying LIME to many
examples indicate that this observation is a common feature.
VI. CONCLUSION
In this paper, we introduce the first hybrid deep neural network
to predict stock price movements using high-frequency
limit order data. Unlike traditional hand-crafted models, where
[Figure 9 example: real label = Stationary. CNN-I: P(Down) = 0.27, P(Stationary) = 0.46, P(Up) = 0.27. DeepLOB: P(Down) = 0.27, P(Stationary) = 0.71, P(Up) = 0.04. The y-axis labels include L10AskSize, L10BidSize, L10AskPrice and L10BidPrice; the x-axis is time, from t = 1 to t = 100.]
Figure 9. LIME plots. The x-axis represents time stamps and the y-axis represents
levels of the LOB, as labelled in the top image. Top: original image.
Middle: importance regions for CNN-I [26]. Bottom: importance regions
for the DeepLOB model. Regions supportive of the prediction are shown in green,
and regions against it in red. The boundary is shown in yellow.
features are carefully designed, we utilise a CNN and an
Inception Module to automate feature extraction and use
LSTM units to capture time dependencies.
The proposed method is evaluated against several baseline
methods on the FI-2010 benchmark dataset and the results
show that our model performs better than other techniques in
predicting short-term price movements. We further test the
robustness of our model by using one year of limit order
data from the LSE with a testing period of three months. An
interesting observation from our work is that the proposed
model generalises well to instruments that did not form part
of the training data. This suggests the existence of universal
features that are informative for price formation and our model
appears to capture these features, learning from a large data
set including several instruments. A simple trading simulation
is used to further test our model and we obtain good profits
that are statistically significant.
To go beyond the often-criticised “black box” nature of
deep learning models, we use LIME, a method for sensitivity
analysis, to indicate the components of inputs that contribute to
predictions. A good understanding of the relationship between
the input’s components and the model’s prediction can help
us decide if we can accept a prediction. In particular, we see
how the information on prices and sizes at different levels and
horizons contributes to the prediction, which is in accordance
with our econometric understanding.
In a recent extension of this work we have modified the
DeepLOB model to use Bayesian neural networks [69]. This
allows us to provide uncertainty measures on the network’s
outputs, which can be used, for example, to upsize positions,
as demonstrated in [69].
In subsequent continuations of this work we would like
to investigate more detailed trading strategies, using Reinforcement
Learning, which are based on the feature extraction
performed by DeepLOB.
ACKNOWLEDGEMENTS
The authors would like to thank members of the Machine
Learning Research Group at the University of Oxford for their
helpful comments on drafts of this paper. We are most grateful
to the Oxford-Man Institute of Quantitative Finance, who provided
limit order data and other support. Computation for our
work was supported by Arcus Phase B and JADE HPC at the
University of Oxford and Hartree national computing facilities,
U.K. We also thank the Royal Academy of Engineering U.K.
for their support.
REFERENCES
[1] A. Ntakaris, M. Magris, J. Kanniainen, M. Gabbouj,
and A. Iosifidis, “Benchmark dataset for midprice fore
casting of limit order book data with machine learning
methods,” Journal of Forecasting, vol. 37, no. 8, pp. 852–
866, 2018.
[2] C. A. Parlour and D. J. Seppi, “Limit order markets:
A survey,” Handbook of financial intermediation and
banking, vol. 5, pp. 63–95, 2008.
[3] I. Rosu et al., “Liquidity and information in order driven
markets,” Tech. Rep., 2010.
[4] E. Zivot and J. Wang, “Vector autoregressive models for
multivariate time series,” Modeling Financial Time Series
with S-PLUS®, pp. 385–429, 2006.
[5] A. A. Ariyo, A. O. Adewumi, and C. K. Ayo, “Stock
price prediction using the ARIMA model,” in Computer
Modelling and Simulation (UKSim), 2014 UKSimAMSS
16th International Conference on. IEEE, 2014, pp. 106–
112.
[6] C. Carrie, “The new electronic trading regime of dark
books, mashups and algorithmic trading,” Trading, vol.
2006, no. 1, pp. 14–20, 2006.
[7] M. D. Gould, M. A. Porter, S. Williams, M. McDonald,
D. J. Fenn, and S. D. Howison, “Limit order books,”
Quantitative Finance, vol. 13, no. 11, pp. 1709–1742,
2013.
[8] W.C. Chiang, D. Enke, T. Wu, and R. Wang, “An
adaptive stock index trading decision support system,”
Expert Systems with Applications, vol. 59, pp. 195–207,
2016.
[9] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,
D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabi
novich, “Going deeper with convolutions,” in Proceed
ings of the IEEE conference on computer vision and
pattern recognition, 2015, pp. 1–9.
[10] M. T. Ribeiro, S. Singh, and C. Guestrin, “Why should I
trust you?: Explaining the predictions of any classifier,”
in Proceedings of the 22nd ACM SIGKDD international
conference on knowledge discovery and data mining.
ACM, 2016, pp. 1135–1144.
[11] A. Ang and G. Bekaert, “Stock return predictability: Is it
there?” The Review of Financial Studies, vol. 20, no. 3,
pp. 651–707, 2006.
[12] P. Bacchetta, E. Mertens, and E. Van Wincoop, “Pre
dictability in financial markets: What do survey expec
tations tell us?” Journal of International Money and
Finance, vol. 28, no. 3, pp. 406–426, 2009.
[13] T. Bollerslev, J. Marrone, L. Xu, and H. Zhou, “Stock
return predictability and variance risk premia: Statistical
inference and international evidence,” Journal of Finan
cial and Quantitative Analysis, vol. 49, no. 3, pp. 633–
661, 2014.
[14] M. A. Ferreira and P. SantaClara, “Forecasting stock
market returns: The sum of the parts is more than the
whole,” Journal of Financial Economics, vol. 100, no. 3,
pp. 514–537, 2011.
[15] B. Mandelbrot and R. L. Hudson, The Misbehavior of
Markets: A fractal view of financial turbulence. Basic
books, 2007.
[16] B. B. Mandelbrot, “How Fractals Can Explain What’s
Wrong with Wall Street,” Scientific American, vol. 15,
no. 9, p. 2008, 2008.
[17] J. Agrawal, V. Chourasia, and A. Mittra, “Stateofthe
art in stock prediction techniques,” International Journal
of Advanced Research in Electrical, Electronics and
Instrumentation Engineering, vol. 2, no. 4, pp. 1360–
1366, 2013.
[18] R. C. Cavalcante, R. C. Brasileiro, V. L. Souza, J. P.
Nobrega, and A. L. Oliveira, “Computational intelligence
and financial markets: A survey and future directions,”
Expert Systems with Applications, vol. 55, pp. 194–211,
2016.
[19] Q. Cao, K. B. Leggio, and M. J. Schniederjans, “A com
parison between Fama and French’s model and artificial
neural networks in predicting the Chinese stock market,”
Computers Operations Research, vol. 32, no. 10, pp.
2499–2512, 2005.
[20] J. Sirignano and R. Cont, “Universal features of price
formation in financial markets: perspectives from deep
learning,” arXiv preprint arXiv:1803.06917, 2018.
[21] G. S. Atsalakis and K. P. Valavanis, “Surveying stock
market forecasting techniques–Part II: Soft computing
methods,” Expert Systems with Applications, vol. 36,
no. 3, pp. 5932–5941, 2009.
[22] D. T. Tran, M. Magris, J. Kanniainen, M. Gabbouj,
and A. Iosifidis, “Tensor representation in highfrequency
financial data for price change prediction,” in Computa
tional Intelligence (SSCI), 2017 IEEE Symposium Series
on. IEEE, 2017, pp. 1–7.
[23] D. T. Tran, M. Gabbouj, and A. Iosifidis, “Multilinear
classspecific discriminant analysis,” Pattern Recognition
Letters, vol. 100, pp. 131–136, 2017.
[24] N. Passalis, A. Tefas, J. Kanniainen, M. Gabbouj, and
A. Iosifidis, “Temporal bagoffeatures learning for pre
dicting mid price movements using high frequency limit
order book data,” IEEE Transactions on Emerging Topics
in Computational Intelligence, 2018.
[25] D. T. Tran, A. Iosifidis, J. Kanniainen, and M. Gabbouj,
“Temporal attentionaugmented bilinear network for fi
nancial timeseries data analysis,” IEEE transactions on
neural networks and learning systems, 2018.
[26] A. Tsantekidis, N. Passalis, A. Tefas, J. Kanniainen,
M. Gabbouj, and A. Iosifidis, “Forecasting stock prices
from the limit order book using convolutional neural
networks,” in Business Informatics (CBI), 2017 IEEE
19th Conference on, vol. 1. IEEE, 2017, pp. 7–12.
[27] ——, “Using Deep Learning for price prediction by
exploiting stationary limit order book features,” arXiv
preprint arXiv:1810.09965, 2018.
[28] ——, “Using deep learning to detect price change in
dications in financial markets,” in Signal Processing
Conference (EUSIPCO), 2017 25th European. IEEE,
2017, pp. 2511–2515.
[29] M. Dixon, D. Klabjan, and J. H. Bang, “Classification
based financial markets prediction using deep neural
networks,” Algorithmic Finance, vol. 6, no. 34, pp. 67–
77, 2017.
[30] Y. LeCun, Y. Bengio et al., “Convolutional networks for
images, speech, and time series,” The handbook of brain
theory and neural networks, vol. 3361, no. 10, p. 1995,
1995.
[31] N. Wang and D.Y. Yeung, “Learning a deep compact
image representation for visual tracking,” in Advances in
neural information processing systems, 2013, pp. 809–
817.
[32] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich
feature hierarchies for accurate object detection and
semantic segmentation,” in Proceedings of the IEEE
conference on computer vision and pattern recognition,
2014, pp. 580–587.
[33] J. Long, E. Shelhamer, and T. Darrell, “Fully convolu
tional networks for semantic segmentation,” in Proceed
ings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2015, pp. 3431–3440.
[34] J.F. Chen, W.L. Chen, C.P. Huang, S.H. Huang, and
A.P. Chen, “Financial timeseries data analysis using
deep convolutional neural networks,” in Cloud Com
puting and Big Data (CCBD), 2016 7th International
Conference on. IEEE, 2016, pp. 87–92.
[35] J. Doering, M. Fairbank, and S. Markose, “Convolu
tional neural networks applied to highfrequency market
microstructure forecasting,” in Computer Science and
Electronic Engineering (CEEC), 2017. IEEE, 2017, pp.
31–36.
[36] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet
classification with deep convolutional neural networks,”
in Advances in neural information processing systems,
2012, pp. 1097–1105.
[37] K. Simonyan and A. Zisserman, “Very Deep Convolu
tional Networks for LargeScale Image Recognition,” in
International Conference on Learning Representations,
2015.
[38] S. Hochreiter and J. Schmidhuber, “Long shortterm
memory,” Neural computation, vol. 9, no. 8, pp. 1735–
1780, 1997.
[39] Y. Bengio, P. Simard, and P. Frasconi, “Learning long
term dependencies with gradient descent is difficult,”
IEEE transactions on neural networks, vol. 5, no. 2, pp.
157–166, 1994.
[40] M. Sundermeyer, R. Schlüter, and H. Ney, “LSTM neural
networks for language modeling,” in Thirteenth Annual
Conference of the International Speech Communication
Association, 2012.
[41] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to
sequence learning with neural networks,” in Advances in
neural information processing systems, 2014, pp. 3104–
3112.
[42] W. Bao, J. Yue, and Y. Rao, “A deep learning framework
for financial time series using stacked autoencoders and
longshort term memory,” PloS one, vol. 12, no. 7, p.
e0180944, 2017.
[43] S. Selvin, R. Vinayakumar, E. Gopalakrishnan, V. K.
Menon, and K. Soman, “Stock price prediction using
LSTM, RNN and CNNsliding window model,” in Ad
vances in Computing, Communications and Informatics
(ICACCI), 2017 International Conference on. IEEE,
2017, pp. 1643–1647.
[44] T. Fischer and C. Krauss, “Deep learning with long short
term memory networks for financial market predictions,”
European Journal of Operational Research, vol. 270,
no. 2, pp. 654–669, 2018.
[45] L. Di Persio and O. Honchar, “Artificial neural networks
architectures for stock price prediction: Comparisons and
applications,” International Journal of Circuits, Systems
and Signal Processing, vol. 10, pp. 403–413, 2016.
[46] M. Dixon, “Sequence classification of the limit order
book using recurrent neural networks,” Journal of com
putational science, vol. 24, pp. 277–286, 2018.
[47] D. M. Nelson, A. C. Pereira, and R. A. de Oliveira,
“Stock market’s price movement prediction with LSTM
neural networks,” in Neural Networks (IJCNN), 2017
International Joint Conference on. IEEE, 2017, pp.
1419–1426.
[48] L. Harris, Trading and exchanges: Market microstructure
for practitioners. Oxford University Press, USA, 2003.
[49] M. O’Hara, Market microstructure theory. Blackwell
Publishers Cambridge, MA, 1995, vol. 108.
[50] A. N. Kercheval and Y. Zhang, “Modelling high
frequency limit order book dynamics with Support Vector
Machines,” Quantitative Finance, vol. 15, no. 8, pp.
1315–1329, 2015.
[51] A. Abraham, B. Nath, and P. K. Mahanti, “Hybrid intelli
gent systems for stock market analysis,” in International
Conference on Computational Science. Springer, 2001,
pp. 337–345.
[52] T. Hendershott, C. M. Jones, and A. J. Menkveld, “Does
algorithmic trading improve liquidity?” The Journal of
Finance, vol. 66, no. 1, pp. 1–33, 2011.
[53] C. Cao, O. Hansch, and X. Wang, “The information
content of an open limitorder book,” Journal of futures
markets, vol. 29, no. 1, pp. 16–41, 2009.
[54] S. J. Orfanidis, Introduction to signal processing.
PrenticeHall, Inc., 1995.
[55] J. Gatheral and R. C. Oomen, “Zerointelligence realized
variance estimation,” Finance and Stochastics, vol. 14,
no. 2, pp. 249–283, 2010.
[56] Y. Nevmyvaka, Y. Feng, and M. Kearns, “Reinforcement
learning for optimized trade execution,” in Proceedings
of the 23rd international conference on Machine learn
ing. ACM, 2006, pp. 673–680.
[57] M. Avellaneda, J. Reed, and S. Stoikov, “Forecasting
prices from LevelI quotes in the presence of hidden
liquidity,” Algorithmic Finance, vol. 1, no. 1, pp. 35–43,
2011.
[58] Y. Burlakov, M. Kamal, and M. Salvadore, “Optimal
limit order execution in a simple model for market
microstructure dynamics,” 2012.
[59] L. Harris, “Maker-taker pricing effects on market quotations,”
USC Marshall School of Business Working Paper. Available at
http://bschool.huji.ac.il/.upload/hujibusiness/Makertaker.pdf, 2013.
[60] A. Lipton, U. Pesavento, and M. G. Sotiropoulos, “Trade
arrival dynamics and quote imbalance in a limit order
book,” arXiv preprint arXiv:1312.0514, 2013.
[61] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier
nonlinearities improve neural network acoustic models,”
in Proc. icml, vol. 30, no. 1, 2013, p. 3.
[62] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learn
ing. MIT Press, 2016,
http://www.deeplearningbook.org.
[63] T. J. Moskowitz, Y. H. Ooi, and L. H. Pedersen, “Time
series momentum,” Journal of financial economics, vol.
104, no. 2, pp. 228–250, 2012.
[64] M. Lin, Q. Chen, and S. Yan, “Network in network,” in
International Conference on Learning Representations,
2014.
[65] D. Kingma and J. Ba, “Adam: A method for stochastic
optimization,” Proceedings of the International Confer
ence on Learning Representations 2015, 2015.
[66] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy,
and P. T. P. Tang, “On largebatch training for deep
learning: Generalization gap and sharp minima,” in Inter
national Conference on Learning Representations, 2017.
[67] F. Chollet et al., “Keras,” https://keras.io, 2015.
[68] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen,
C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin,
S. Ghemawat, I. Goodfellow, A. Harp, G. Irving,
M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur,
J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray,
C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever,
K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan,
F. Viégas, O. Vinyals, P. Warden, M. Wattenberg,
M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large
scale machine learning on heterogeneous systems,”
2015, software available from tensorflow.org. [Online].
Available: https://www.tensorflow.org/
[69] Z. Zhang, S. Zohren, and S. Roberts, “BDLOB: Bayesian
Deep Convolutional Neural Networks for Limit Order
Books,” in Third workshop on Bayesian Deep Learning
(NeurIPS 2018), 2018.