MSDM5058 Information Science Computational Project I:

Association Rules and Prediction Rules for Financial Data Mining

1 Data Preprocessing

You can download the data from any credible source, e.g. Yahoo Finance. A stock's price is sometimes unavailable, so select any stock with prices available on more than 4000 days (around 15 years) and denote its time series with $s(t)$. Compute your stock's daily return rate $x(t)$ with

$$x(t) = \frac{s(t) - s(t-1)}{s(t-1)} \tag{1}$$

and denote the length of $x(t)$ with $N$.

You may download your time series from other credible sources as long as it spans more than 4000 days. In this case, please cite your source properly.

2 Exponential Moving Average

Define the stock's $a$-day exponential moving average (EMA) at time $t$ as

$$E(t; a, \tau) = \frac{1}{Z} \sum_{t'=t-a+1}^{t} s(t') \exp\left(-\frac{t-t'}{\tau}\right), \tag{2}$$

where $\tau$ represents the EMA's memory length. The normalization factor $Z$ satisfies

$$Z = \sum_{t'=t-a+1}^{t} \exp\left(-\frac{t-t'}{\tau}\right). \tag{3}$$

Plot $E(t; a, a-1)$ for $a$ = 30, 100, and 300. The three windows roughly correspond to one month, one season, and one year.

Plot $E[t; a, n(a-1)]$ for $n$ = 1, 2, and 3 with $a$ = 30.

Discuss the effect of $a$ and $\tau$. How do your EMAs change as the two parameters change?

3 Cumulative Distribution Function

Let us regard the values of $x(t)$ as realizations of a random variable $X$. Plot its CDF $F_X(x)$.

3.1 Fermi-Dirac Distribution

A Fermi-Dirac distribution from statistical mechanics is defined as

$$F(x) = \frac{1}{1 + \exp[-b(x - x_0)]}. \tag{4}$$

We may fit the CDF $F_X(x)$ with a Fermi-Dirac distribution.

What is $F_X(x_0)$? Hence what is your empirical $x_0$?

Let $f(x) = F'(x)$. What is $f(0)$? Hence what is your empirical $b$?

Plot $F(x)$ atop $F_X(x)$.

3.2 Bonus: Kolmogorov-Smirnov Test

We can test if our fitting is good with the Kolmogorov-Smirnov test. Our null hypothesis is that $F(x)$ fits $F_X(x)$ well or, more precisely, that $x$ is drawn according to $f(x)$. Define $D$ as the maximum absolute difference between the empirical CDF and the fitting, so $D = \max_x |F_X(x) - F(x)|$. If $D > D_\alpha$, we reject the null hypothesis at a significance level $\alpha$. The threshold $D_\alpha$ solves

$$\frac{\sqrt{2\pi}}{\sqrt{N} D_\alpha} \sum_{k=1}^{\infty} \exp\left[-\frac{(2k-1)^2 \pi^2}{8 N D_\alpha^2}\right] = 1 - \alpha. \tag{5}$$

Does the test reject the null hypothesis at, say, $\alpha$ = 0.05?
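The threshold equation of the Kolmogorov-Smirnov test has no closed-form solution, so the threshold is normally found numerically. The sketch below assumes that the threshold is the (1 - alpha) quantile of the Kolmogorov distribution of sqrt(N) * D, which is the usual large-N form of the test. The function names `kolmogorov_cdf`, `ks_threshold`, and `ks_statistic` are illustrative and not part of the assignment.

```python
import math


def kolmogorov_cdf(z, terms=100):
    """Approximate P(sqrt(N) * D <= z) with the Kolmogorov series."""
    if z <= 0:
        return 0.0
    total = sum(
        math.exp(-((2 * k - 1) ** 2) * math.pi ** 2 / (8 * z * z))
        for k in range(1, terms + 1)
    )
    return math.sqrt(2 * math.pi) / z * total


def ks_threshold(n, alpha=0.05, tol=1e-10):
    """Solve kolmogorov_cdf(sqrt(n) * D) = 1 - alpha for D by bisection."""
    lo, hi = 1e-6, 5.0  # assumed bracket for z = sqrt(n) * D
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kolmogorov_cdf(mid) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi) / math.sqrt(n)


def ks_statistic(sample, cdf):
    """Largest absolute gap between the empirical CDF of `sample` and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs((i + 1) / n - f), abs(i / n - f))
    return d
```

With these helpers, the null hypothesis would be rejected when `ks_statistic(returns, fitted_cdf) > ks_threshold(len(returns), 0.05)`, where `returns` and `fitted_cdf` stand for your own return series and fitted Fermi-Dirac CDF. For N = 4000 and a significance level of 0.05, the threshold is roughly 1.358 / sqrt(4000), i.e. about 0.0215.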
(At this significance level, a true null hypothesis is wrongly rejected once in every twenty tests.)

4 Probability Density Function

On one hand, we can estimate $X$'s PDF with its fitted CDF's derivative. On the other hand, we can estimate the PDF of $X$, i.e. $f_X$, with a $k$-bin normalized histogram, where each bin is $h = (x_{\max} - x_{\min})/k$ units wide. The $i$-th bin counts the occurrence of

$$\begin{cases} x_{\min} + (i-1)h \le x < x_{\min} + ih & (i < k) \\ x_{\min} + (k-1)h \le x \le x_{\max} & (i = k) \end{cases} \tag{6}$$

relative to $N$ so that all bin values add up to one.

Plot $f(x) = F'(x)$ for the $F(x)$ fitted in the last section.

Plot three $k$-bin histograms for $k$ = 20, 100, and 400.

Bonus. Plot two more $k$-bin histograms with $k$ respectively determined with the Sturges formula and the Freedman-Diaconis formula.

Compare your histograms with each other and with $f$. Discuss the effect of $k$.

5 Descriptive Statistics

Now rename $s(t)$ as $s_1(t)$ and $x(t)$ as $x_1(t)$. Then pick another stock's closing-price time series. Denote the new series with $s_2(t)$ and compute its daily return rate $x_2(t)$.

Compute both return rates' all-time mean $\mu$, variance $\sigma^2$, and Sharpe ratio $\mu/\sigma$. Then compute their covariance $\sigma_{12}$ and correlation coefficient $\rho_{12}$.

Repeat the last step with the data on the $K$ most recent days, i.e. $t \in [N-K+1, N]$, for $K$ = 30, 100, and 300.

6 Mean-Variance Analysis

We would like to perform a mean-variance analysis on $x_1(t)$ and $x_2(t)$ and accordingly construct the minimum-risk portfolio $s_p(t) = p\,s_1(t) + (1-p)\,s_2(t)$ for some fraction of investment $p \in [0, 1]$.

6.1 An All-Time Analysis

Consider the all-time performance of $s_1(t)$ and $s_2(t)$, so $p$ is a constant.

Determine $p$ according to the returns' all-time means $\mu$ and all-time variances $\sigma^2$.

Plot the resultant portfolio $s_p(t)$ with $s_1(t)$ and $s_2(t)$.

Plot $s_p(t)$'s price $r_i$ relative to $s_1(t)$ and $s_2(t)$.

6.2 A K-Day Analysis

As the relevance of $x(t)$ decays with time, it is more sensible to consider the stocks' performance on the last $K$ days only. In other words, we only infer information from $\{x(t-i) \mid i \in [1, K]\}$ at time $t$. Hence the fraction of investment $p(t; K)$ varies with time and depends on $K$. Complete the following tasks for $K$ = 30, 100, and 300.

Plot $p(t; K)$, which is determined according to the returns' $K$-day mean $\mu(t; K)$ and $K$-day variance $\sigma^2(t; K)$.

Plot the resultant portfolio $s_p(t; K)$ with $s_1(t)$ and $s_2(t)$.

Plot $s_p(t; K)$'s price $r_i$ relative to $s_1(t)$ and $s_2(t)$.

Plot $s_p(t; K)$'s $K$-day Sharpe ratio, i.e. $\mu(t; K)/\sigma(t; K)$.

An all-time analysis in some sense sets $K \to \infty$. Compare the performance of your four portfolios (including the all-time portfolio) with $s_1(t)$ and $s_2(t)$.

7 Digitization of Time Series

Now focus on the first stock, so its subscript 1 is hereafter dropped and implied. Digitize $x(t)$ as $d(t)$ with three alphabets, viz. D for down, U for up, and H for hold:

$$d(t) = \begin{cases} \text{D} & [x(t) < -0.002] \\ \text{U} & [x(t) > 0.002] \\ \text{H} & (\text{otherwise}) \end{cases} \tag{7}$$

In real applications, you may of course replace 0.002 with other values.

Calculate the probability $P[d(t) = X]$ for $X \in \{\text{D}, \text{U}, \text{H}\}$.

Calculate the conditional probability $P[d(t+1) = X \mid d(t) = Y]$ for all nine possible pairs of $(X, Y) \in \{\text{D}, \text{U}, \text{H}\} \times \{\text{D}, \text{U}, \text{H}\}$.

8 Association Rules

We would like to find five-day patterns $A$ that associate well with an immediate down (D). Formally, there are $3^5 = 243$ possible rules in the form of

$$R: A = \{d(t-4), d(t-3), d(t-2), d(t-1), d(t)\} \Rightarrow d(t+1) = \text{D}.$$

We may simplify our notation and label each rule $R$ with the five alphabets in $A$; for example, the rule $R$ = UUUUU predicts a down after five consecutive ups.

Divide $d(t)$ into an $M$-day-long learning set $L$ and an $(N-M)$-day-long testing set $T$. Their length ratio $M : (N-M)$ should be around 3:1.

9 Prediction with Association Rules

9.1 An Experimental Approach

The most practical way to verify a rule's goodness is via a bidding experiment. When a pattern $A$ occurs in $T$, we bid for an immediate down. We earn \$$u$ if $A$ indeed precedes a down; otherwise we lose \$$v$. Consider the case $u$ = 1 and $v$ = 0.

Tabulate the ten most profitable rules in $T$ and denote the set as $\{R_{\text{expt}}\}_T$. Report their support and confidence.
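The digitization, rule counting, and bidding experiment can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference solution: a rule's support is taken as the relative frequency of its pattern A among all five-day windows that have a next day, its confidence is the fraction of those occurrences followed by D, and the names `digitize`, `rule_stats`, `evaluate`, and `top_rules` are hypothetical helpers.

```python
from collections import Counter


def digitize(returns, eps=0.002):
    """Map each return to 'D' (< -eps), 'U' (> eps), or 'H' (otherwise)."""
    return "".join("D" if x < -eps else "U" if x > eps else "H" for x in returns)


def rule_stats(d, length=5, target="D"):
    """Count each pattern's occurrences and how often `target` follows it."""
    seen, hits = Counter(), Counter()
    # Only windows that still have a following day inside `d` are counted.
    for t in range(length - 1, len(d) - 1):
        pattern = d[t - length + 1 : t + 1]
        seen[pattern] += 1
        if d[t + 1] == target:
            hits[pattern] += 1
    return seen, hits


def evaluate(d, u=1.0, v=0.0, length=5):
    """Return {pattern: (support, confidence, profit)} for one data set."""
    seen, hits = rule_stats(d, length)
    total = sum(seen.values())
    table = {}
    for pattern, n in seen.items():
        h = hits[pattern]
        # Earn u for every correct bid and lose v for every wrong one.
        table[pattern] = (n / total, h / n, u * h - v * (n - h))
    return table


def top_rules(table, column, n=10):
    """Top-n patterns by column: 0 = support, 1 = confidence, 2 = profit."""
    return sorted(table, key=lambda p: table[p][column], reverse=True)[:n]
```

A 3:1 split could then be written as `m = 3 * len(d) // 4`, followed by `evaluate(d[:m])` for the learning set and `evaluate(d[m:])` for the testing set, and `top_rules(evaluate(d[m:]), 2)` would list the ten most profitable patterns in the testing set. Ties are broken by first appearance here, which is a design choice rather than a requirement.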
9.2 A Naïve Comparison

In order to judge the goodness of a rule, we need a measure that can indicate the performance of a rule in both $L$ and $T$. The simplest way is to check by confidence whether a confident rule in $L$ remains confident in $T$.

Tabulate the top ten rules with the highest confidence in $L$ and denote the set as $\{R_{\text{conf}}\}_L$. Report their support and confidence. If you find more than ten rules having the same highest confidence, please list all of them.

Tabulate the top ten rules with the highest confidence in $T$ and denote the set as $\{R_{\text{conf}}\}_T$. Report their support and confidence. If you find more than ten rules having the same highest confidence, please list all of them.

How many common rules does $\{R_{\text{conf}}\}_T$ share with $\{R_{\text{conf}}\}_L$?

Bonus. Compute the correlation between a rule's confidence in $L$ and that in $T$. Since a rule's rank is more important than its exact confidence, you may want to use Spearman's or Kendall's correlation instead of Pearson's correlation. Specify which one you use.

On the other hand, how much do the rules in $\{R_{\text{conf}}\}_L$ earn? How many rules does $\{R_{\text{conf}}\}_L$ share with $\{R_{\text{expt}}\}_T$? Discuss why some profitable rules in $\{R_{\text{expt}}\}_T$ may be missing from $\{R_{\text{conf}}\}_L$, and the potential flaws of measuring a rule's goodness with confidence.

Bonus. Compute the correlation between a rule's confidence in $L$ and its profit in $T$. Again, you are free to choose the form of your correlation.

10 Further Analysis of Association Rules

In this final part, we examine the goodness of a rule by further analyzing the bidding experiment. By simple probability, a rule $R$'s profit in $T$, $\pi_R$, can be found related to both its support and confidence in $T$,

$$\pi_R \propto \text{supp}_T(R) \{u\,\text{conf}_T(R) - v[1 - \text{conf}_T(R)]\}, \tag{8}$$

where the proportionality depends on the length of $T$. This matches our intuition: a rule is good if it is both frequent (for a high support) and accurate (for a high confidence). Now, simplify the case with $v$ = 0, so $\pi_R \sim \text{supp}_T(R)\,\text{conf}_T(R)$.

The problem is that in reality we can only measure $\text{supp}_L(R)$ and $\text{conf}_L(R)$ from past data, but their values can deviate much from their counterparts in $T$, i.e. future data.

10.1 Geometric Mean and Arithmetic Mean

Let us first boldly assume $\text{supp}_T(R) = \text{supp}_L(R) = S_R$ and $\text{conf}_T(R) = \text{conf}_L(R) = C_R$, yielding $\pi_R \sim S_R C_R$. Since the ranking of $\pi_R$ does not change when we take a square root on the right-hand side, we may predict the rank of $\pi_R$ with the geometric mean of $R$'s support and confidence.

Tabulate the top ten rules with the highest geometric mean, i.e. $\sqrt{S_R C_R}$, in $L$ and denote the set as $\{R_{\text{geo}}\}_L$.

Tabulate the top ten rules with the highest arithmetic mean of support and confidence, i.e. $\frac{1}{2}(S_R + C_R)$, in $L$ and denote the set as $\{R_{\text{ari}}\}_L$.

How many rules do $\{R_{\text{geo}}\}_L$ and $\{R_{\text{ari}}\}_L$ share with $\{R_{\text{expt}}\}_T$? Also report the geometric means and arithmetic means of the rules in $\{R_{\text{expt}}\}_T$. Hence compare the two means' respective performance with that of confidence.

10.2 Bonus: Generalized Mean

Now we will account for the discrepancy between a rule's confidence in $L$ and $T$, i.e. $\text{conf}_T(R) \ne \text{conf}_L(R) = C_R$. We are pessimistic, so we expect $R$'s confidence to depreciate over time; still, we also expect a more confident rule in $L$ to remain more confident in $T$. The two assumptions suggest $\text{conf}_T(R) = C_R^m$ for some $m > 1$. With the scale of $\pi_R$ maintained, we may formulate $\pi_R$ with a generalized mean:

$$\left(S_R C_R^m\right)^{\frac{1}{1+m}} = S_R^{1-\lambda} C_R^{\lambda}, \tag{9}$$

where the tuning parameter $\lambda = 1 - \frac{1}{1+m} \in [0, 1]$. As $\lambda$ rises from 0 to 1, $S_R^{1-\lambda} C_R^{\lambda}$ smoothly slides from a rule's support $S_R$ to its confidence $C_R$. When $1 - \lambda$ strikes 1/3, $S_R^{1/3} C_R^{2/3} = \sqrt[3]{S_R C_R^2}$, i.e. the cubic root of $R$'s rule power factor (RPF).

Tabulate the top ten rules with the highest RPF in $L$ and denote the set as $\{R_{\text{RPF}}\}_L$. Also report the RPFs of those from $\{R_{\text{expt}}\}_T$.

How many rules does $\{R_{\text{RPF}}\}_L$ share with $\{R_{\text{expt}}\}_T$? Hence compare the performance of RPF with that of confidence.

Do you think there will be an optimal value for $\lambda$ that best indicates a rule's goodness? If yes, what may it be? If no, why not?
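Ranking rules by a generalized mean can be sketched in a few lines. The code below assumes the mean takes the weighted-geometric form S^(1 - lam) * C^lam, so that lam = 0 ranks by support, lam = 1/2 by the geometric mean, lam = 2/3 by the cubic root of the rule power factor S * C^2, and lam = 1 by confidence. The names `generalized_mean` and `rank_rules`, and the layout of `stats`, are illustrative assumptions.

```python
def generalized_mean(support, confidence, lam):
    """Weighted geometric mean S**(1 - lam) * C**lam, with 0 <= lam <= 1."""
    return support ** (1.0 - lam) * confidence ** lam


def rank_rules(stats, lam, n=10):
    """Rank patterns by generalized mean.

    `stats` maps each pattern to a (support, confidence) pair measured
    in the learning set.
    """
    score = {p: generalized_mean(s, c, lam) for p, (s, c) in stats.items()}
    return sorted(score, key=score.get, reverse=True)[:n]
```

Sweeping `lam` over a grid and counting how many of the returned rules also appear among the most profitable rules of the testing set is one possible way to explore whether an optimal value exists.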