Planning: coping with data errors
23/11/2010by Rob Findlay
Garbage in, garbage out.
When you’re planning for next financial year, you don’t want your modelling to be shot through with data errors. But neither do you want to have to pick through your data, line by line, looking for errors and fixing them manually (and perhaps inconsistently).
So how can you detect and fix common data errors automatically? Naturally, it depends on what’s wrong, and the difficulties comes in several flavours.
The data you get directly from your activity database is likely to be pretty complete (accuracy is a different matter), so that takes care of completeness for activity counts, actual lengths of stay, urgency rates (which are essential for calculating waiting times), and so on. You are more likely to find gaps in data that comes from elsewhere, such as:
- demand growth assumptions
- waiting time targets
- waiting list sizes
- removal rates
- bed occupancy and bed utilisation assumptions
- bed, theatre and clinic performance assumptions
The important thing is to be explicit about the assumptions you want to make when any of these data items are missing. In many cases zero will be an acceptable default when data is missing, but sometimes it won’t and it’s dangerous to assume.
For instance if you know the waiting list size at the start, but not the end, of your historical data period, then zero would not be a safe default because your model would then be based on a rapidly shrinking list size (and therefore a high level of historical non-recurring activity). So it would be more sensible to assume the waiting list had remained a constant size, and populate the missing end list by copying the start list. Or vice versa, if it’s the start list size that’s missing.
For demand growth (or waiting time targets) you may have a standard set of assumptions, such as 3% growth for non-electives (or 90% admitted within 8 weeks), in which case you just need to ensure that your standard assumption is used wherever it is needed (but without over-writing any exceptions).
For removals, zero is often a good-enough default because any systematic errors in the removal rate should be second-order in a well-constructed demand calculation.
Capacity performance figures are proportional to the capacity being calculated, and so it is important to get these numbers right. However it is common for very broad assumptions to be made without proper consideration. For instance, 85% bed occupancy is often assumed to be a suitable buffer against fluctuations, but for this figure to be arbitrarily raised to 95% when the calculations show that a bigger hospital would be needed! This is a big subject in its own right, but the broad message is that Trusts would benefit from closer attendance to capacity assumptions.
Even if the raw data in a low-volume service is accurate, it can still be misleading for modelling purposes because the data is “noisy”. For instance, in a service that is provided only occasionally by one consultant, demand might fluctuate wildly:
- 10 in 2007-08
- 5 in 2008-09
- 18 in 2009-10
You can see the danger. If you used only the years from 2008-09 to 2009-10 when calculating the trend, you might conclude that demand was growing at 2,600% per year and so future demand would be:
- 65 in 2010-11
- 233 in 2011-12
- 840 in 2012-13
Which would be ridiculous, but not necessarily easy to spot if you’re crunching dozens (or even hundreds) of services automatically in a giant spreadsheet.
Instead, you need to cap demand growth within sensible limits. It is also sensible to avoid conducting detailed waiting time modelling on very small services (again because of noisy data leading you astray), and instead assume simply that demand must be met.
Additions data is used in waiting time calculations, but this data source is notoriously unreliable. The standard check for data quality (in the absence of suspensions and deferrals) is the reconciliation formula:
start list + additions = end list + activity + removals
So in most cases you can cross-check additions data against the other data, and if it lies outside a defined tolerance then you can cap it. So far, so simple.
However there is a complication when it comes to admitted patients. Daycases who stay overnight automatically become inpatients. So when you use the formula, you might find that it is out of balance because some of the daycase additions ended up as inpatient activity.
The solution is to consider daycases and inpatients together when detecting errors in the additions figures for a given specialty. If the reconciliation works out well in total, then the separate additions figures do not need adjusting.
Waiting list data
Sometimes there is a delay between receiving an elective referral (or making a decision to admit) and logging that patient onto the IT system as an addition to the outpatient (or inpatient/daycase) waiting list. So if you were to extract yesterday’s waiting list you would miss any patients that haven’t been keyed-in yet.
For planning purposes, this problem can usually be avoided just by using older data. When planning for next financial year, some months before it even starts, it makes little difference whether you rely on data up to the end of last month or the month before.