How do I find a repeating pattern of unknown length and start within a string




I am trying to do some analysis on our customer's around identifying early markers for debt delinquency. We bill customers on a weekly basis and payment patterns will differ between accounts. Good customers will normally pay Weekly, Fortnightly, 4 weekly or monthly on a 4-4-5 pattern. Other customers will have an erratic payment pattern.

Even where the payment pattern is consistent, they may be overpaying or not be paying the balance in full so there may be an underlying rising or falling debt.

Work So Far:

To this end I have created two strings to represent:

  1. Whether the weekly balance has decreased (v), increased (^) or not moved (-) (infer a payment has been made if balance is down or unchanged) and ...
  2. The level of indebtedness on the account based on the weekly charge ( - for in credit, . for balance cleared, 0-9 for balance less than 10 weeks, X for balance over 10 weeks)

So I now have a number of strings that look like this (good accounts)

^v^v^v^^^v^^^v^^^v^^^v^^^v^^^v^^^v^v^^^v^^^v^^^v^^^v^^^v^^^v^^^v^^^v^^^v^^^v^^^^    231212-012-012-012-012-0120112-012-0-0120012-012-012-0120112-012-012-012-012-0123
^v^^^v^^^v^^^v^^^v^^^v^^^v^^^v^^^v^v^^^v^^^v^^^v^^^v^^^v^^^v^^^v^^^v^^^v^^^v^^^^    12-012-012-012-012-012-012-012-012-0-012-012-012-012-012-012-012-012-012-012-0123

and like this for bad/erratic accounts

^v^^^vv^^v^^^v^^^v^v^v^^^v^^^v^^^v^v^^vv^^^v^^vv^^vv^^vv^^vv^^vv^^vv^^vv^^vv^^v^    341234112312231234233412341234123423123312341233123301220122-011-011-011--00--001
v^v^v^^v^vv^vv^vv^v^^^^v^^^v^^^^v^v^^^v^^^v^^^^v^^^v^^^^v^^^v^^^v^^^^v^^^v^^^v^ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX9XXX9XXXX9XXX9XXXX9XXX89XX89XXX89XX789X78

Next Steps:

What I want to do now is identify accounts with specific payment patterns:

  • Fortnightly = ^v^v
  • 4weekly = ^^^v^^^v^^^v
  • Monthly = ^^^v^^^v^^^^v (note 4-4-5)

I also want to identify where those patterns have changed. Bearing in mind that on a 4 weekly or monthly pattern we may not be at the start of a pattern I also need to try and find the 'seed' week for each pattern

I will regard a pattern as consistent if I have seen it repeat at least 3 times so 12 weeks for the fortnightly pattern, 36 weeks for the 4 weekly and 36 weeks for the Monthly [monthly could be 445, 454 or 544 but must repeat] so I then want to flag up accounts that were consistent that have become erratic or changed their pattern. I have data going back up to 5 years.

I have the data in Microsoft SQL and currently the only analysis tools I have to hand are Microsoft SQL and Excel. I'm not looking for code, just what the next practical steps should be in my strategy to analyse the data.

I'm thinking find the first down step in the pattern and remove this and the preceding up ticks (and associated characters from the multiplier) then run the remaining string through a number of REPLACE() functions but this doesn't feel very 'smart'

Is there a better strategy to adopt as it feels like I am MUNGing the data?

Aaron Reese

Posted 2017-11-08T15:37:01.970

Reputation: 61



From your description, it sounds like autocorrelation could be a good initial step to identify the frequency. Autocorrelation is basically the product of the sequence with itself shifted at 0, 1, 2, 3, and so forth. If a client tends to pay monthly, you will see the first positive bump around 4, because that is where each payment aligns with the payment after it. Note that this will work irrespective of the start of the sequence, and is even resilient to some noise. Even Excel seems to have an autocorrelation function.

Of course, you could refine this further with other steps, and/or try this on different windows (in case the patter changes over time).

Ami Tavory

Posted 2017-11-08T15:37:01.970

Reputation: 887