How to merge monthly, daily and weekly data?

12

9

Google Trends returns weekly data so I have to find a way to merge them with my daily/monthly data.

What I have done so far is to break each series into daily data, for example:

from:

2013-03-03 - 2013-03-09 37

to:

2013-03-03 37
2013-03-04 37
2013-03-05 37
2013-03-06 37
2013-03-07 37
2013-03-08 37
2013-03-09 37
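In pandas this expansion is a reindex with forward fill; a minimal sketch (the second week's value is made up to show the fill boundary):

```python
import pandas as pd

# One value per week, keyed by the week's first day (hypothetical data).
weekly = pd.Series([37, 42], index=pd.to_datetime(["2013-03-03", "2013-03-10"]))

# Build the full daily range and repeat each weekly value
# forward across the seven days of its week.
days = pd.date_range("2013-03-03", "2013-03-16", freq="D")
daily = weekly.reindex(days, method="ffill")

print(daily["2013-03-09"])  # 37: still inside the first week
```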

But this is adding a lot of complexity to my problem. I was trying to predict Google searches from the last 6 months of values, i.e. 6 values in monthly data. Daily data would mean working with 180 past values. (I have 10 years of data, so 120 points in monthly data, 500+ in weekly data, 3500+ in daily data.)

The other approach would be to "merge" daily data into weekly/monthly data. But some questions arise from this process. Some data can be aggregated because their sum represents something. Rainfall, for example: the amount of rain in a given week is the sum of the amounts for each day composing that week.

In my case I am dealing with prices, financial rates and other things. For prices it is common in my field to take the volume exchanged into account, so the weekly data would be a weighted average. For financial rates it is a bit more complex, as some formulas are involved to build weekly rates from daily rates. For the other things I don't know the underlying properties. I think those properties are important to avoid meaningless indicators (an average of financial rates would be nonsense, for example).
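A volume-weighted weekly average of the kind described can be computed in pandas; a sketch with made-up prices and volumes (two days in the same week):

```python
import pandas as pd

# Hypothetical daily prices and traded volumes.
df = pd.DataFrame(
    {"price": [10.0, 20.0], "volume": [1, 3]},
    index=pd.to_datetime(["2013-03-04", "2013-03-05"]),
)

# Volume-weighted average price per week:
# sum(price * volume) / sum(volume) within each weekly bucket.
weekly_vwap = (
    (df["price"] * df["volume"]).resample("W").sum()
    / df["volume"].resample("W").sum()
)

print(weekly_vwap.iloc[0])  # 17.5 = (10*1 + 20*3) / (1 + 3)
```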

So three questions:

For known and unknown properties, how should I proceed to go from daily to weekly/monthly data?

I feel like breaking weekly/monthly data into daily data as I've done is somewhat wrong, because I am introducing quantities that have no meaning in real life. So, almost the same question:

For known and unknown properties, how should I proceed to go from weekly/monthly to daily data?

Last but not least: when given two time series with different time steps, which is better: using the smallest or the largest time step? I think this is a compromise between the amount of data and the complexity of the model, but I can't see any strong argument to choose between those options.

Edit: if you know a tool (in R, Python, or even Excel) to do it easily, it would be very appreciated.

For Python, the standard tool is pandas. It was specifically designed to deal with financial time series. pandas timeseries

– seanv507 – 2015-01-03T23:08:17.697

Care to expand a bit on what you mean by "unknown property"? – TheGrimmScientist – 2015-01-04T21:20:51.923

9

when given two time series with different time steps, which is better: using the smallest or the largest time step?

For your timeseries analysis you should do both: get to the highest granularity possible with the daily dataset, and also repeat the analysis with the monthly dataset. With the monthly dataset you have 120 data points, which is sufficient to get a timeseries model even with seasonality in your data.

For known and unknown properties, how should I proceed to go from daily to weekly/monthly data?

To obtain, say, weekly or monthly data from daily data, you can use smoothing functions. For financial data you can use a moving average or exponential smoothing, but if those do not work for your data, you can use the spline smoothing function "smooth.spline" in R: https://stat.ethz.ch/R-manual/R-patched/library/stats/html/smooth.spline.html

The model returned will have less noise than the original daily dataset, and you can get values for the desired time points. Finally, these data points can be used in your timeseries analysis.
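The same idea is easy to sketch in pandas (the series here is made up; for an actual analogue of `smooth.spline` you would reach for SciPy's smoothing splines, but a moving average or exponential smoothing often suffices):

```python
import numpy as np
import pandas as pd

# Hypothetical noisy daily series: 60 days of a slow oscillation.
idx = pd.date_range("2013-01-01", periods=60, freq="D")
daily = pd.Series(np.sin(np.linspace(0.0, 6.0, 60)), index=idx)

# Smooth with a 7-day moving average, then keep one value per week.
weekly_ma = daily.rolling(window=7).mean().resample("W").last()

# Exponential smoothing alternative.
weekly_ewm = daily.ewm(span=7).mean().resample("W").last()
```

Note that the rolling average needs a full window, so its first week is NaN, while the exponentially weighted version is defined from the first observation.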

For known and unknown properties, how should I proceed to go from weekly/monthly to daily data?

To obtain daily data when you have monthly or weekly data, you can use interpolation. First, you should find an equation to describe the data. In order to do this you should plot the data (e.g. price over time). When factors are known to you, this equation should be influenced by those factors. When factors are unknown, you can use a best fit equation. The simplest would be a linear function or piecewise linear function, but for financial data this won't work well. In that case, you should consider piecewise cubic spline interpolation. This link goes into more detail on possible interpolation functions: http://people.math.gatech.edu/~meyer/MA6635/chap2.pdf.

In R, there is a method for doing interpolation of timeseries data. Here you would create a vector with, say, weekly values and NAs in the gaps for the daily values, and then use the "interpNA" function to get the interpolated values for the NAs. However, this function uses the "approx" function to get the interpolated values, which applies either a linear or constant interpolation. To perform cubic spline interpolation in R, you should use the "splinefun" function instead.
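A comparable workflow in pandas: build the NA-gapped daily vector, then interpolate (the weekly numbers are hypothetical; the cubic variant requires SciPy to be installed):

```python
import numpy as np
import pandas as pd

# Weekly observations with NaN placeholders on the intervening days.
days = pd.date_range("2013-03-03", "2013-03-17", freq="D")
s = pd.Series(np.nan, index=days)
s[pd.to_datetime(["2013-03-03", "2013-03-10", "2013-03-17"])] = [37.0, 44.0, 30.0]

linear = s.interpolate(method="time")          # straight lines between weekly points
# smooth = s.interpolate(method="cubicspline") # smoother curve; requires SciPy

# Three days along the 7-day rise from 37 to 44, i.e. approximately 40.
print(linear["2013-03-06"])
```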

Something to be aware of is that timeseries models typically do some sort of averaging to forecast future values whether you are looking at exponential smoothing or Auto-Regressive Integrated Moving Average (ARIMA) methods amongst others. So a timeseries model to forecast daily values may not be the best choice, but the weekly or monthly models may be better.

When you interpolate using e.g. cubic splines in financial time series, would you not be introducing a forward-look bias? I think this might be particularly important if implementing a machine-learning model? – tsando – 2017-06-13T14:30:57.463

Seems to be a practical answer. Not sure if this is applicable to financial time series because of arbitrage. – lcrmorin – 2015-01-09T15:38:33.107

I think the answers to your questions are still valid. For the timeseries model you may want to look at ARCH (AutoRegressive Conditional Heteroskedasticity) models. – gchaks – 2015-01-13T17:27:15.343

5

I'm not an expert in this area, but I believe that your question concerns time series aggregation and disaggregation. If that is the case, here are some hopefully relevant resources, which might be helpful in solving your problem (the first five items are the main, representative ones; the last two are supplementary):

2

This won't be a very satisfying answer, but here's my take...

For known and unknown properties, how should I proceed to go from daily to weekly/monthly data?

For known and unknown properties, how should I proceed to go from weekly/monthly to daily data?

Same answer for both: you can't do this for unknown properties, and for known properties it will depend on how the values were computed.

As you alluded to:

(an average of financial rates would be nonsense, for example)

There is no single transformation that will be appropriate in all cases, whether the properties/values are known or unknown. Even with known properties, you'll likely need a unique transformation for each type: mean, median, mode, min, max, boolean, etc.

when given two time series with different time steps, which is better: using the smallest or the largest time step?

Whenever possible, try to preserve the full granularity of the smallest possible step. Assuming you know how to transform the values, you can always roll-up the steps (e.g., day to month, month to year)... but you won't necessarily be able to reconstruct smaller steps from larger ones following a lossy conversion.
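A tiny illustration of why the conversion is lossy (made-up numbers):

```python
import pandas as pd

# Hypothetical daily values rolled up to monthly means.
daily = pd.Series(
    [1.0, 3.0, 2.0, 6.0],
    index=pd.to_datetime(["2013-01-01", "2013-01-02", "2013-02-01", "2013-02-02"]),
)
monthly = daily.resample("M").mean()

print(monthly.tolist())  # [2.0, 4.0]
# Many different daily series share these monthly means,
# so the original days cannot be reconstructed from `monthly`.
```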

2

For known and unknown properties, how should I proceed to go from daily to weekly/monthly data?

Aggregation.

For example, you have the number of times people searched for 'widgets' each day. Add up the daily totals for a month to get monthly totals. I would need to see more specifics about the actual data collected at each granularity to give you a more complete answer.
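In pandas this kind of additive roll-up is a one-liner; a sketch with illustrative counts:

```python
import pandas as pd

# Hypothetical daily search counts for 'widgets'.
daily = pd.Series(
    [5, 7, 6, 9],
    index=pd.to_datetime(["2013-03-01", "2013-03-02", "2013-04-01", "2013-04-02"]),
)

# Counts are additive, so monthly totals are plain sums.
monthly = daily.resample("M").sum()

print(monthly.tolist())  # [12, 15]
```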

For known and unknown properties, how should I proceed to go from weekly/monthly to daily data?

You can't.

In physics, a comparable idea is the Nyquist frequency. The general idea is that you can't add more information than what is already present in your data without bringing in more data. Given only the day someone ran a query, how can you tell what time of day that query was run? You may be able to make some inferences, but the only way to answer the question is to directly or indirectly bring more information into the system. There are things you can do to make informed guesses at the daily state of monthly variables (as gchaks mentioned, interpolation), but your data is still fundamentally monthly data stretched to look daily.

When given two time series with different time steps, which is better: using the smallest or the largest time step?

That totally depends on what you're trying to answer.

The smaller granularity will be more sensitive to noise and other anomalies. The larger granularity will be able to answer questions more confidently, but will lose some of its usefulness. For example, if you're trying to see when people start looking up venues for weekend plans, to know when to launch marketing campaigns for a new night club, you'll want to be looking at daily data, if not smaller. If you're looking at the general trend of night clubs to figure out which one you want to invest in, then monthly would probably be better.

2

Being able to aggregate data based on date segments is a piece of cake using Tableau software. You would simply plug your data into the tool, and then you can drag and drop both the metric and date dimensions onto a report body space. Tableau will instantaneously whip up aggregate visualizations and/or detail data tables, on the fly. You can group/sum by year, quarter, month, week, day, hour, etc. (standard, built-in, out-of-the-box functionality offered by the tool).

Also, if you wanted to incorporate additional data later on (which I assume is what you meant by 'unknown properties'), you can import another data set and easily append it onto the first one, as long as it also offers dates.

I would recommend checking out the free version, which I believe is called Tableau Public.