Mathematica slow at estimating vector autoregressive model parameters

6

I have been trying to obtain VAR model parameters using EstimatedProcess and FindProcessParameters. It seems to take a very long time as soon as I have more than 6 or 7 time series and lag length >2.

The data I am using is the z-scored daily log return of financial time series, with around 1000 data points.

It works fine for up to 8 series with lag 2, and only up to 6 series with lag 3 (the latter taking 15 minutes). Anything above that is either really slow or doesn't seem to finish at all. EstimatedProcess has now been running for 3 hours on a p=3 VAR process with 7 time series.

I have also attempted this on data I simulated using ARProcess, and the problem is the same: it takes a long time, and above a certain number of series and lag length it doesn't return results (unless I didn't wait long enough). Here is the unprocessed data (data71517) https://www.wolframcloud.com/obj/078b7f99-bd9c-4b3e-b0ca-07a8cf65693c and the code I am using (a small simulation sketch follows the code below):

stocks7 = {"GOOG", "FB", "AMZN", "NFLX", "AAPL", "MSFT", "TSLA"};
data71517 = 
Table[QuantityMagnitude[
  FinancialData[list, "Close", {date}]], {list, stocks7}, {date, 
  DateRange[{2015, 1, 1}, {2017, 12, 31}][[All, 1 ;; 3]]}];
n = 7;
datath = Table[Thread[{Range[1096], data71517[[i]]}], {i, n}];
datap = Table[Partition[
If[datath[[i]][[#, 2]] == datath[[i]][[# + 1, 2]], # + 1, 
   Nothing] & /@ Range[1095], 1], {i, n}];
datacl = Table[Delete[datath[[i]], datap[[i]]], {i, n}];
datalo = Table[Log[Interpolation[datacl[[i]], pts]], {i, n}, {pts, Range[1096]}];
datadif = Table[Differences[datalo[[i]]], {i, n}];
dataz = Table[(datadif[[i]] - Mean[datadif[[i]]])/(1*
  StandardDeviation[datadif[[i]]]), {i, n}];
elm := Function[{a, b, c}, Table[ToExpression[ToString
  [a] <> ToString[i] <> ToString[j]], {i, b}, {j, c}]]
es = EstimatedProcess[Transpose[dataz],ARProcess[{elm[aa, n, n], elm[ab, n, n], elm[ac, n, n]}, elm[e, n, n]]] 
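For reference, here is a minimal sketch of the kind of simulation check mentioned above: it generates data from a small, known 2-series VAR(1) with RandomFunction and times EstimatedProcess on it. The coefficient and covariance values are illustrative only, not taken from the post.

(* simulate a known 2-series VAR(1), then time the symbolic estimation *)
procSim = ARProcess[{{{0.5, 0.1}, {0.0, 0.3}}}, {{1.0, 0.2}, {0.2, 1.0}}];
simData = RandomFunction[procSim, {0, 1000}];
AbsoluteTiming[
 EstimatedProcess[simData,
  ARProcess[{Array[a, {2, 2}]}, Array[e, {2, 2}]]]]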

Any help would be greatly appreciated.

Charles Dörre

Posted 2020-02-10T03:09:51.000

Reputation: 81

This cannot be answered without any code and without the underlying dataset. – Henrik Schumacher – 2020-02-10T06:00:25.263

Hm. I cannot really tell what you are doing, but already the way you load the data is super inefficient. Better to load it with one query like a = FinancialData[stocks7, "Close", {{2015, 1, 1}, {2017, 12, 31}}];. The datacl step looks as if you are trying to remove those days on which there is actually no data. You only have to do that because you forced it to return data for the missing days as well. If I am not mistaken, you can just do datacl = QuantityMagnitude /@ Through[a["Values"]];. But then again, you interpolate over the list of all days with datalo. So I do not really get it. – Henrik Schumacher – 2020-02-10T07:46:07.930

What looks really, really evil is elm. What is it supposed to do? – Henrik Schumacher – 2020-02-10T08:01:09.607

Yes, I know it is quite inefficient and there is probably a better way of doing this. What I am doing is forcing it to give me data for all days so that the data sets for all the stocks have equal length; I then replace the "filler" data with an interpolation. If I were to do FinancialData[stocks7, "Close", {{2015, 1, 1}, {2017, 12, 31}}], I would get data sets of unequal length, or sometimes of equal length but with different days missing, and I wouldn't know where to interpolate. – Charles Dörre – 2020-02-10T08:02:13.127

elm just gives me a matrix of coefficients, e.g. elm[a, 2, 2] would give me {{a11, a12}, {a21, a22}}. It saves me from having to write out matrices of coefficients by hand, as I would like to eventually do this for a larger number of series. – Charles Dörre – 2020-02-10T08:04:52.573

Oh, that can be done more easily with Array[aa, {n, n}], Array[ab, {n, n}], Array[ac, {n, n}], etc. – Henrik Schumacher – 2020-02-10T08:08:28.923
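For illustration (not part of the original comment): Array builds a matrix of indexed symbols rather than symbols with concatenated names, so no ToExpression is needed.

Array[aa, {2, 2}]
(* {{aa[1, 1], aa[1, 2]}, {aa[2, 1], aa[2, 2]}} *)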

The TimeSeries objects in the list a already do an "interpolation" of their own: for a nonexistent date, the value of the last previous date is used. That is much more meaningful than interpolation, isn't it? – Henrik Schumacher – 2020-02-10T08:20:20.993
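A small illustration of this behavior, reusing the variable a from the comments above; per that comment, weekend dates should simply repeat the last prior close.

a = FinancialData[stocks7, "Close", {{2015, 1, 1}, {2017, 12, 31}}];
a[[1]][{{2015, 1, 2}, {2015, 1, 3}, {2015, 1, 4}}]
(* Jan 3 and 4, 2015 fall on a weekend; they should return Jan 2's close *)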

I am not sure I follow. When I do FinancialData[stocks7, "Close", {{2015, 1, 1}, {2017, 12, 31}}] I get 7 time series, of length 746, 749, 749, 755, 755, 755, 743. If there is a way of getting the last value carried forward, that would be great, as obtaining the data would be much faster. I would still replace the last-value-carried-forward entries by an interpolation, though, as I am also trying to study the correlation structure and partial directed coherence between the series, and those measures would be spiked by sets having stagnant data in the same positions. – Charles Dörre – 2020-02-10T08:31:14.430

Oh, I see. I think I figured out how to get the last value carried forward using the time series generated by your code. Thanks! That will save me a lot of time when downloading the data. – Charles Dörre – 2020-02-10T09:11:51.543

You're welcome. – Henrik Schumacher – 2020-02-10T09:15:52.670

Answers

3

Just for the record: this is how I would write the code. At least the loading of the data is now optimized to minimize communication overhead.

stocks7 = {"GOOG", "FB", "AMZN", "NFLX", "AAPL", "MSFT", "TSLA"};
n = Length[stocks7];
α = {2015, 1, 1};
β = {2017, 12, 31};
data0 = FinancialData[stocks7, "Close", {α, β}];
data = Developer`ToPackedArray[QuantityMagnitude[
    Quiet[
     Through[data0[DateRange[α, β][[All, 1 ;; 3]]]]]
    ]];
datadif = Differences /@ Log[data];
dataz = (datadif - Mean /@ datadif)/(StandardDeviation /@ datadif);

The symbolic coefficient arrays at the end can be obtained with Array. However, I do not know whether this call to EstimatedProcess is meaningful. Other users may have more experience with it.

es = EstimatedProcess[
   Transpose[dataz],
   ARProcess[
    {Array[a, {n, n}], Array[b, {n, n}], Array[c, {n, n}]}, 
     Array[e, {n, n}]]
   ];

At least EstimatedProcess does not throw errors and just runs... maybe forever (well, more than 3 seconds). I observed that only one of my CPU cores is used; maybe that means that EstimatedProcess is not parallelized, but maybe it also means that we called EstimatedProcess in a suboptimal way that forces some intermediate symbolic computations...
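As an aside (not verified in this thread): one option that might be worth experimenting with is ProcessEstimator, since a moments-based estimator is typically cheaper than full maximum likelihood. Whether "MethodOfMoments" is actually supported for the vector ARProcess case is an assumption here. A sketch:

(* hedged sketch: same call as above, with a cheaper estimator requested *)
es = EstimatedProcess[
   Transpose[dataz],
   ARProcess[
    {Array[a, {n, n}], Array[b, {n, n}], Array[c, {n, n}]},
    Array[e, {n, n}]],
   ProcessEstimator -> "MethodOfMoments"];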

Henrik Schumacher

Posted 2020-02-10T03:09:51.000

Reputation: 85 430

Thank you, this has been very helpful so far. Something that might be worth saying is that, unless I am mistaken, there is no way of estimating the lag length of a VAR model in Mathematica. So I estimate the lag length for AR models of the individual time series (using TimeSeriesModel) and pick the longest lag (in this case 3), and then estimate the parameters of the VAR model. – Charles Dörre – 2020-02-10T09:56:30.703
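A minimal sketch of the per-series lag selection described in this comment, assuming TimeSeriesModelFit with the "AR" model family and its "BestFit" property (the exact call used by the commenter is not shown in the thread):

fits = TimeSeriesModelFit[#, "AR"] & /@ dataz;  (* univariate AR fit per series, order chosen automatically *)
orders = Length[#["BestFit"][[1]]] & /@ fits;   (* AR order = number of fitted AR coefficients *)
p = Max[orders]                                 (* longest lag, used as the VAR lag length *)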