Return Predictions From Trade Flow
January 30, 2024
1 Introduction
Here you will assess trade flow as a means of generating profit opportunities in 3 cryptotoken markets. We stress the word “opportunity” because at high data rates like these, and given the markets’ price-time priority, it is far easier to identify desirable trades in the data stream than it is to inject oneself profitably into the fray.
2 Data
We have preprocessed level 3 exchange messages from the Coinbase WebSocket API for you into a more digestible format of truncated level 2 data.
2.1 Treatment
Load the 2023 data for all 3 pairs from the class website. For each one, split it into test and training sets, with your training set containing the first 40% of the data and the test set containing the remainder.
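The split can be sketched as follows. This is a minimal illustration, assuming the rows are already in time order and that “the first 40% of the data” means the first 40% of rows (splitting by elapsed time would be another reasonable reading). The name `split_train_test` and the list-of-rows representation are hypothetical, not part of the provided data.

```python
from typing import List, Sequence, Tuple, TypeVar

Row = TypeVar("Row")


def split_train_test(
    rows: Sequence[Row], train_fraction: float = 0.40
) -> Tuple[List[Row], List[Row]]:
    """Split time-ordered rows into (train, test) without shuffling."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    cut = int(len(rows) * train_fraction)  # index of the first test row
    return list(rows[:cut]), list(rows[cut:])


rows = list(range(10))  # stand-in for 10 time-ordered records
train_rows, test_rows = split_train_test(rows)
print(len(train_rows), len(test_rows))
```

Keeping the split chronological, rather than random, avoids fitting on data that comes after the test observations.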
2.2 Format
The data has the following structure.
2.2.1 Trades
Columns: received utc nanoseconds, timestamp utc nanoseconds, PriceMillionths, SizeBillionths, Side.
The Side is actually a sum of trade sides at the same price and time.
2.2.2 Book
Columns (the table is shown transposed): Ask1PriceMillionths, Bid1PriceMillionths, Ask1SizeBillionths, Bid1SizeBillionths, Ask2PriceMillionths, Bid2PriceMillionths, Ask2SizeBillionths, Bid2SizeBillionths, received utc nanoseconds, timestamp utc nanoseconds, Mid (sample value: 22971350000).
Here, the received time comes from the clock of the recording device, which was not synchronized to the exchange clock. Such inaccuracies in clock settings, i.e. “clock skew”, can cause exchange timestamps to appear later than the time at which they are recorded as having been received.
As noted in class, exchange timestamps are not actionable, in the sense that no market participant would see the data until considerably later. On the other hand, received timestamps, while actionable, may be subject to poor recording techniques on the client side. For this homework you may choose either, but I recommend the exchange timestamps.
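To see how much the two clocks disagree, one can look at the difference between received and exchange timestamps. The sketch below uses made-up timestamps and a hypothetical helper `receive_lags_ms`. It also assumes, from the column name alone, that the “Millionths” columns are integers in units of one millionth of the quote currency.

```python
from typing import Iterable, List, Tuple

NANOS_PER_MILLI = 1_000_000
MILLIONTHS = 1_000_000  # assumed scale of the *PriceMillionths and Mid columns


def receive_lags_ms(stamps: Iterable[Tuple[int, int]]) -> List[float]:
    """stamps holds (received_ns, exchange_ns) pairs; return received - exchange in ms."""
    return [(received - exchange) / NANOS_PER_MILLI for received, exchange in stamps]


# Made-up nanosecond timestamps, for illustration only.
sample = [
    (1_000_000_000, 995_000_000),    # received 5 ms after the exchange stamp
    (2_000_000_000, 2_003_000_000),  # exchange stamp is later than receipt: clock skew
    (3_000_000_000, 2_990_000_000),  # received 10 ms after the exchange stamp
]
lags = receive_lags_ms(sample)
skewed_share = sum(lag < 0 for lag in lags) / len(lags)  # share of rows showing skew
mid_price = 22971350000 / MILLIONTHS  # the sample Mid value under the assumed scale
print(lags, skewed_share, mid_price)
```

A noticeable share of negative lags is a sign of the clock skew described above.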
3 Exercise
Write code to find the $\tau$-interval trade flow $F_i^{(\tau)}$ just prior¹ to each trade data point² $i$. Compute $T$-second forward returns³ $r_i^{(T)}$. Regress them against each other in your training set to find a regression coefficient $\beta$.
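One possible shape for these three steps is sketched below. It is not a reference solution and rests on several assumptions:

- Timestamps are sorted in non-decreasing order, in integer nanoseconds.
- Flow is the sum of signed trade sizes over the window from $t_i - \tau$ (inclusive) up to $t_i$ (exclusive), so trades sharing trade $i$'s timestamp are left out.
- Forward returns use the last trade price at or before $t_i + T$.
- The regression has no intercept, matching a prediction of the form $\beta \cdot F$.

All function names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from itertools import accumulate
from typing import List, Optional, Sequence


def trade_flow(times: Sequence[int], signed_sizes: Sequence[float], tau: int) -> List[float]:
    """Flow before trade i: sum of signed sizes with t_i - tau <= t_j < t_i."""
    prefix = [0.0] + list(accumulate(signed_sizes))  # prefix[k] = sum of first k sizes
    flows = []
    for t in times:
        lo = bisect_left(times, t - tau)  # first trade inside the window
        hi = bisect_left(times, t)        # first trade stamped exactly t (excluded)
        flows.append(prefix[hi] - prefix[lo])
    return flows


def forward_returns(
    times: Sequence[int], prices: Sequence[float], horizon: int
) -> List[Optional[float]]:
    """Return from trade i's price to the last trade price at or before t_i + horizon."""
    last_time = times[-1]
    out: List[Optional[float]] = []
    for t, price in zip(times, prices):
        if t + horizon > last_time:
            out.append(None)  # horizon runs past the end of the data
            continue
        k = bisect_right(times, t + horizon) - 1
        out.append(prices[k] / price - 1.0)
    return out


def fit_beta(flows: Sequence[float], returns: Sequence[Optional[float]]) -> float:
    """Least-squares slope through the origin: sum(F * r) / sum(F * F)."""
    pairs = [(f, r) for f, r in zip(flows, returns) if r is not None]
    denom = sum(f * f for f, _ in pairs)
    if denom == 0.0:
        raise ValueError("flow is identically zero; beta is undefined")
    return sum(f * r for f, r in pairs) / denom


# Toy data in arbitrary time units; note the duplicate timestamp at t = 10.
times = [0, 10, 10, 25, 40]
signed_sizes = [1.0, -2.0, 0.5, 3.0, -1.0]
prices = [100.0, 101.0, 101.0, 99.0, 102.0]

flows = trade_flow(times, signed_sizes, tau=20)
rets = forward_returns(times, prices, horizon=15)
beta = fit_beta(flows, rets)
print(flows, rets, beta)
```

The prefix sums plus binary search keep the flow computation at O(n log n), which matters at these data rates. Both trades at t = 10 get the same flow, since neither sees the other.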
For each data point in your test set you already have $F_i^{(\tau)}$, so your return prediction is $\hat{r}_i := \beta \cdot F_i^{(\tau)}$.
Define thresholds $j$ for $\hat{r}_i$ and assume you might attempt to trade whenever $j < |\hat{r}_i|$. Good values for $j$ will have relatively frequent participation, but not anywhere near 100%.
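To get a feel for how a threshold maps to participation, one can tabulate the share of points whose predicted return clears it. This is a small sketch with made-up numbers; `beta`, the flows, and the candidate thresholds are illustrative values, not estimates from the data.

```python
from typing import List, Sequence


def predict(beta: float, flows: Sequence[float]) -> List[float]:
    """Predicted return for each point: beta times its trade flow."""
    return [beta * f for f in flows]


def participation(predictions: Sequence[float], threshold: float) -> float:
    """Fraction of points where threshold < |prediction|, i.e. where we would trade."""
    return sum(abs(p) > threshold for p in predictions) / len(predictions)


beta = 0.002                                # illustrative coefficient
flows = [0.0, 1.0, -3.0, 0.5, 4.0, -0.2]    # illustrative trade flows
predictions = predict(beta, flows)

for j in (0.0005, 0.005):                   # candidate thresholds
    print(j, participation(predictions, j))
```

One way to pick $j$, offered only as a suggestion, is to choose it from the training-set distribution of $|\hat{r}_i|$ so that the test-set choice is not tuned on test data.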
4 Analysis
Assess the trading opportunities arising from using these return predictions in your test set, both with and without trading cost assumptions. Examine Sharpe ratios, drawdowns and tails. As part of this assessment, comment on the reliability/stability of β (most easily done by further splitting the data set), how you chose j, and what you might expect from using much longer training and test periods.
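The assessment could start from per-trade profit and loss, as in the sketch below. It assumes one unit-sized position per signal, taken in the direction of the prediction and held for the full horizon, and a flat cost per trade expressed in return units. The cost figure and all function names are hypothetical, and the Sharpe ratio shown is per trade rather than annualized.

```python
from itertools import accumulate
from statistics import mean, stdev
from typing import List, Sequence


def trade_pnl(
    predictions: Sequence[float],
    realized: Sequence[float],
    threshold: float,
    cost: float = 0.0,
) -> List[float]:
    """Per-trade PnL: realized return in the predicted direction, minus a flat cost."""
    pnl = []
    for pred, ret in zip(predictions, realized):
        if abs(pred) > threshold:
            direction = 1.0 if pred > 0 else -1.0
            pnl.append(direction * ret - cost)
    return pnl


def sharpe(pnl: Sequence[float]) -> float:
    """Per-trade Sharpe ratio (not annualized): mean over sample standard deviation."""
    return mean(pnl) / stdev(pnl)


def max_drawdown(pnl: Sequence[float]) -> float:
    """Largest peak-to-trough drop of cumulative PnL, starting from zero."""
    peak = 0.0
    worst = 0.0
    for equity in accumulate(pnl):
        peak = max(peak, equity)
        worst = max(worst, peak - equity)
    return worst


# Made-up predictions and realized forward returns, for illustration only.
predictions = [0.004, -0.003, 0.0001, 0.005, -0.006]
realized = [0.002, 0.001, 0.010, -0.004, -0.005]

gross = trade_pnl(predictions, realized, threshold=0.001)
net = trade_pnl(predictions, realized, threshold=0.001, cost=0.0002)
print(sharpe(gross), sharpe(net), max_drawdown(gross), min(gross))
```

Running the same functions on sub-periods of the test set gives a rough view of how stable the results are, and `min(gross)` is only the crudest look at the left tail.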
¹ We do not include the trade $i$ data itself, because we are evaluating trade $i$ in terms of the flow we would have been aware of just before it happened.
² NOTE: the trade data series does not necessarily have strictly increasing timestamps. Be sure not to include other trades at the same timestamp in your computation of $F_i^{(\tau)}$.
³ You need not handle latency in your homework, but for your edification: a more careful implementation would account for lags. For a pessimistic approach we could choose $L$ as, say, twice the 99th percentile of computational and communications lag. Then, it would use book data (not just trade data) to help compute return from time $t_i + L$ to $t_i + L + T$ and run regressions using that. The idea here is that it takes approximately time $L$ to “do anything” about trade information.