The Most Accurate Call Centre Planner and Forecaster In The World

OrderlyPlanner

At Orderly Telecoms we understand just how tricky planning your staffing rota can be.

With legacy queue systems, a balance must be struck between key performance indicators like answer rate and service level, and the need to keep costs reasonable. Peaks and troughs mean a law of diminishing returns on adding extra staff as you get towards 100% answer rate, which makes answering all callers prohibitively expensive for the majority of businesses.

Our OrderlyQ traffic shaper fully solves this problem, taking you to 99.9% answer rate with the staff you already have. For call centres that don't yet have OrderlyQ we provide the OrderlyC staff planner and forecaster so that you can accurately work out the staff you need to fulfil your performance requirements. The OrderlyC workforce planner provides up to four times more accurate predictions than Erlang C or even Erlang A (see full study) - indeed we understand that OrderlyC is the most accurate call centre planning and forecasting tool in the world.

Your Data

The planner takes the following inputs:

  • Calls: the number of calls in a given period.
  • Average Handle Time: include time spent on phone and time working after call. Usually between 3 and 5 minutes.
  • Roster Period (minutes): this is the time interval over which you organise your staff roster.
  • Opens at / Closes at: for 24 hour call centres, set the Opening and Closing time to the same time.
  • Target Answer Rate %: percent calls answered.

Plan and Forecast

For each roster period of the day (for example 8:00 - 9:00 through to 17:00 - 18:00), the planner shows the Calls and the Mean and Max Seats.

Want to compare accuracy at a glance? That's here.

Call Centre Forecasting and Planning Methods: A Systematic Large Scale Trial of Erlang C, Erlang A and OrderlyC with Real World Call Centre Data

Introduction

Getting the right number of staff to achieve performance targets is tricky, due to inherent randomness in call patterns and a highly non-linear relationship between staffing levels and key performance indicators.

Call Centre Managers use a variety of workforce management tools to attempt to forecast staffing levels. Having reviewed the online tools widely available, it appears that by far the most prevalent is Erlang C.

Erlang C was developed by A K Erlang and published in 1917, during the early days of telephony, and relates staff levels to call volumes and average speed of answer (ASA). It was designed for telephone exchanges which were staffed by people known as operators. At this time, to make a long distance call a person would call an operator at the local exchange and ask for a call to be set up, and then hang up. The operator would speak to a succession of other operators at other exchanges, who manually connected the calls with wires on their switchboards. The operator would then call back the requester to say the call was ready to be connected (so the request had been handled).

Perhaps this is why Erlang C has no notion of abandonment; it simply looks at how long it takes for the operator to handle each call request, and assumes that all requests will be handled eventually. There is no abandonment in the above use-case.

In 1946, Swedish mathematician C Palm published a refined model to incorporate abandonment, making it suitable for use at call centres with on-hold queues. This more sophisticated model is known as Erlang A. Despite the publication of a number of papers advocating a switch to Erlang A, Erlang C seems to remain more prevalent in terms of actual usage, according to our review. It is surprising that a mathematical model that was not designed for the needs of call centres is still in such widespread use.

We did wonder why this was. Could it be that Erlang C is more accurate than Erlang A, despite the added sophistication of the Erlang A model? Or is it, as some have suggested, simply because Erlang C is easy to do in Excel? When we tried to find out, we couldn't find any large scale real-world trials comparing the two Erlang methods - though there are a number of published model trials, performed using stochastic simulations of call centres, rather than actual call centre data.

Providing a quantitative answer to this question with real call centre data is one of the two aims of this paper.

Now, the Erlang methods of call centre forecasting and planning are both based on the mathematics of the Poisson distribution.

The Poisson distribution is only a suitable model of call centre traffic if its requirements are fulfilled. So, for Erlang predictions to be accurate:

  • the number of servers (seats) must be a constant integer.
  • the average caller demand must be constant.
  • the average handle time must be constant.
  • calls must be independent of each other.
  • the number of calls within any given interval of time must take an integer value.
  • the probability of a call arriving in a small interval is proportional to the length of the interval.
  • no two calls may arrive at exactly the same time.

At real call centres, only the last requirement is normally fulfilled. The others are almost always broken as:

  • agents go on breaks or become unavailable unpredictably, and call centre managers add and remove agents as required.
  • as well as the random fluctuations envisaged by Poisson, average caller demand is also changing at all times throughout the day.
  • average handle time varies for many reasons, including time of day and abandonment.
  • abandoned calls earlier mean more calls later. Some callers must also speak with agents more than once.
  • often predictions are most useful when an average number of calls for a particular hour of the day is used, which may not be an integer value.
  • you are more likely to have calls at some times of day than others, so proportionality is not ensured.

It is therefore unreasonable to expect either Erlang method to produce perfectly accurate results.
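To make the effect of changing demand concrete, here is a minimal simulation sketch. It is not part of the study. The rates (40 calls per hour, or a mix of 20 and 60), the function names and the seed are all illustrative assumptions. With a constant rate, the variance of the hourly call count is close to its mean, as Poisson requires. When the underlying rate itself changes from hour to hour, the variance becomes much larger than the mean.

```python
import random


def arrivals_in_hour(rate_per_hour, rng):
    """Count Poisson arrivals in one hour by summing exponential gaps."""
    count = 0
    elapsed = rng.expovariate(rate_per_hour)
    while elapsed < 1.0:
        count += 1
        elapsed += rng.expovariate(rate_per_hour)
    return count


def mean_and_variance(values):
    """Sample mean and (unbiased) sample variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return mean, variance


def simulate(hours=5000, seed=7):
    """Compare constant demand with demand that changes between hours."""
    rng = random.Random(seed)
    # Constant demand: every hour has the same underlying rate (assumed 40).
    constant = [arrivals_in_hour(40.0, rng) for _ in range(hours)]
    # Changing demand: each hour is either quiet (20) or busy (60), assumed values.
    changing = [arrivals_in_hour(rng.choice([20.0, 60.0]), rng) for _ in range(hours)]
    return mean_and_variance(constant), mean_and_variance(changing)


(const_mean, const_var), (chg_mean, chg_var) = simulate()
print(f"constant demand: mean {const_mean:.1f}, variance {const_var:.1f}")
print(f"changing demand: mean {chg_mean:.1f}, variance {chg_var:.1f}")
```

Both scenarios have the same average demand, yet only the first behaves like a single Poisson process.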

In addition to the above, Erlang C adds the following restrictions:

  • All calls are answered
  • All callers are infinitely patient
  • Sufficient seats must be present to answer all calls eventually.

The first of these is usually false. The second of these is always false. The third of these is frequently false.

Because Erlang C has no notion of abandonment, Erlang C can only make predictions based on average speed of answer (ASA), often expressed in terms of a service level target (e.g. 80% of calls answered in 20 seconds).
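For readers who want to see what an Erlang C calculation involves, the following is a minimal sketch of the standard textbook formulas. It is not the code of any tool reviewed here, and the function names and the example figures at the end are illustrative assumptions. Intensity is in Erlangs, and the handle time and threshold are in seconds.

```python
import math


def erlang_b(servers, intensity):
    """Blocking probability, using the standard numerically stable recursion."""
    b = 1.0
    for n in range(1, servers + 1):
        b = intensity * b / (n + intensity * b)
    return b


def erlang_c(servers, intensity):
    """Probability that a call has to wait for an agent."""
    if servers <= intensity:
        return 1.0  # the model has no steady state here: every call waits
    b = erlang_b(servers, intensity)
    rho = intensity / servers
    return b / (1.0 - rho * (1.0 - b))


def average_speed_of_answer(servers, intensity, aht_seconds):
    """Mean wait in seconds, averaged over all calls."""
    if servers <= intensity:
        return math.inf
    return erlang_c(servers, intensity) * aht_seconds / (servers - intensity)


def service_level(servers, intensity, aht_seconds, threshold_seconds):
    """Fraction of calls answered within the threshold."""
    if servers <= intensity:
        return 0.0
    p_wait = erlang_c(servers, intensity)
    decay = math.exp(-(servers - intensity) * threshold_seconds / aht_seconds)
    return 1.0 - p_wait * decay


def servers_for_service_level(intensity, aht_seconds, threshold_seconds, target):
    """Smallest number of servers that meets the service level target."""
    servers = math.floor(intensity) + 1
    while service_level(servers, intensity, aht_seconds, threshold_seconds) < target:
        servers += 1
    return servers


# Hypothetical example: 10 Erlangs, 4 minute handle time, 80% answered in 20 seconds.
print(servers_for_service_level(10.0, 240.0, 20.0, 0.80))
```

Note that nothing in these formulas refers to abandonment: every call is assumed to wait until it is answered, which is why the only available targets are waiting-time measures.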

This means Erlang C cannot be used to plan or forecast around the more useful Answer Rate figure. Answer Rate is more useful to a business because it's far more important to know whether a caller is answered or not, than to know how long that caller was waiting. Certainly, callers prefer being answered to not being answered too. It is therefore more desirable to know how many staff must be available in order to answer the calls, rather than how long those who are lucky enough to be answered will have to wait.

As it turns out, greater accuracy is possible when answer rate is used as the target performance indicator too.

Erlang A does model abandonment, so the additional restrictions of Erlang C are lifted. Erlang A can also be used to express targets in terms of answer rate. However, the patience time (the average amount of time callers are prepared to wait) must also be known to use Erlang A, and again this may vary.
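As an illustration of how patience time enters the calculation, here is a minimal sketch of the textbook Erlang A model, in which callers wait in a queue and each abandons after an exponentially distributed patience time. It is not the Erlang A implementation used in the study. The function names and example figures are illustrative assumptions, and the simple summation is only intended for modest call volumes.

```python
def erlang_a_answer_rate(servers, calls_per_hour, aht_seconds, patience_seconds):
    """Long-run fraction of calls answered under the textbook Erlang A model."""
    lam = calls_per_hour / 3600.0      # arrival rate per second
    mu = 1.0 / aht_seconds             # service rate per busy agent
    theta = 1.0 / patience_seconds     # abandonment rate per queued caller

    weight = 1.0      # unnormalised probability of the current state
    total = 1.0       # running sum of the weights
    queue_sum = 0.0   # running sum of weight * queue length
    n = 0             # number of calls in the system (in service plus queued)
    while True:
        n += 1
        busy = min(n, servers)
        queued = max(n - servers, 0)
        # Birth-death balance: step from n - 1 calls up to n calls.
        weight *= lam / (busy * mu + queued * theta)
        total += weight
        queue_sum += weight * queued
        # Stop once the remaining tail is negligible.
        if n > servers and weight < 1e-15 * total:
            break

    mean_queue = queue_sum / total
    abandon_fraction = theta * mean_queue / lam
    return 1.0 - abandon_fraction


def servers_for_answer_rate(target, calls_per_hour, aht_seconds, patience_seconds):
    """Smallest number of servers meeting the answer rate target (target < 1)."""
    servers = 0
    while erlang_a_answer_rate(servers, calls_per_hour, aht_seconds, patience_seconds) < target:
        servers += 1
    return servers


# Hypothetical example: 120 calls per hour, 4 minute handle time, 1 minute patience.
print(servers_for_answer_rate(0.95, 120, 240.0, 60.0))
```

Changing patience_seconds changes the predicted number of servers, which is why an inaccurate patience estimate feeds directly into the forecast.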

The patience time can be estimated by totalling the time that abandoning callers spend in the queue, and dividing by the number of abandoned calls. However, this will tend to underestimate the true patience time, as the more patient callers are selected out of the abandoning set by being answered. Other methods for patience time estimation include Kaplan-Meier, which will be the subject of future investigation. Perhaps difficulty in estimating patience time is also a reason why Erlang A is not more widely adopted than Erlang C. Our understanding is that this issue remains an open problem with Erlang A.
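The size of this underestimate can be seen in a small simulation. This is a sketch under stated assumptions, not part of the study: patience is taken to be exponential with a true mean of 60 seconds, and the wait each caller would face before being answered is taken to be exponential with a mean of 30 seconds. All names and figures are hypothetical. The second estimator shown (total queue time of all callers divided by the number of abandoned calls) is a standard textbook correction that is only valid for exponential patience; it is included for comparison and is not the method used in the study.

```python
import random


def simulate_patience_estimates(true_patience_seconds=60.0,
                                mean_offered_wait_seconds=30.0,
                                calls=200_000,
                                seed=1):
    """Return (naive estimate, censoring-aware estimate) of mean patience."""
    rng = random.Random(seed)
    abandoned_calls = 0
    abandoned_queue_time = 0.0   # queue time of callers who gave up
    total_queue_time = 0.0       # queue time of every caller, answered or not
    for _ in range(calls):
        patience = rng.expovariate(1.0 / true_patience_seconds)
        offered_wait = rng.expovariate(1.0 / mean_offered_wait_seconds)
        if patience < offered_wait:
            # The caller gives up before an agent becomes free.
            abandoned_calls += 1
            abandoned_queue_time += patience
            total_queue_time += patience
        else:
            # The caller is answered, so we never observe their full patience.
            total_queue_time += offered_wait
    naive = abandoned_queue_time / abandoned_calls
    censoring_aware = total_queue_time / abandoned_calls
    return naive, censoring_aware


naive, censoring_aware = simulate_patience_estimates()
print(f"naive estimate: {naive:.1f} s, censoring-aware estimate: {censoring_aware:.1f} s")
```

Under these assumptions the naive figure comes out at roughly a third of the true 60 seconds, because only the least patient callers ever appear in the abandoned set.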

Because of this discrepancy between the Erlang model requirements and real world conditions, we wondered if a more accurate model might be possible if the requirements were lifted.

So, we devised a mathematical model of call centre traffic that does not require any of the strict assumptions of Erlang C or Erlang A to hold.

We call it OrderlyC.

It has very strong predictive power.

You can use it when underlying caller demand is changing, as well as to cope with Poisson-type peaks and troughs.
You can use it with changing numbers of agents.
You can use it with changing average handle time.
You can use it when you don't know the patience time.
You can use it even if not all calls are answered, or if there are not enough staff to answer every call.
You can use it when some calls depend on other calls, or arrive two-at-a-time.

Best of all, it targets answer rate rather than average speed of answer.

OrderlyC is therefore more suitable for use at real call centres than any Erlang method.

But is it more accurate?

The short answer: Yes, very much more. Check out this cool table.

The long answer:

Method

To find out the answer to this question, we checked OrderlyC, as well as Erlang C and Erlang A against almost a million hours of call centre data from hundreds of call centres all around the world, to find out which was the most accurate and true. Indeed, this work may well be the largest international accuracy study of Erlang methods ever performed; certainly we know of no other to this scale.

For each hour, we used the measured ASA for Erlang C and measured Answer Rate and Patience Time for Erlang A in order to find the number of servers predicted by each model. We compared the predictions of all three models with the maximum number of people simultaneously handling calls during each hour (= Max Seats).

Why Max Seats? Because if you know, or can forecast the max seats, you know how many staff you will need ready and available to take calls during that hour. If you don't know exactly when the peaks will occur within a period, you must have that many people in the call centre throughout the period in order to achieve the target performance. So, max seats defines the number of people your call centre must employ.
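As an illustration of what the Max Seats figure means, the sketch below counts the maximum number of calls being handled at the same moment within a period, given the start and end time of each handled call. The data layout (times in seconds as pairs) and the function name are assumptions made for the example, not a description of how the study data was stored.

```python
def max_seats(handling_intervals, period_start, period_end):
    """Peak number of calls being handled simultaneously within the period."""
    events = []
    for start, end in handling_intervals:
        # Clip each call to the period of interest.
        start = max(start, period_start)
        end = min(end, period_end)
        if start < end:
            events.append((start, 1))    # an agent becomes busy
            events.append((end, -1))     # an agent becomes free
    # Sorting puts (t, -1) before (t, 1), so a seat freed at time t
    # can be reused by a call starting at the same time t.
    events.sort()
    busy = 0
    peak = 0
    for _, change in events:
        busy += change
        peak = max(peak, busy)
    return peak


# Hypothetical calls within one hour (0 to 3600 seconds).
calls = [(0, 300), (100, 400), (200, 260), (300, 500)]
print(max_seats(calls, 0, 3600))  # prints 3
```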

Now, Erlang C and Erlang A make no distinction between mean and max seats, as both of these models make predictions based on a constant integer number of servers. Indeed, one could argue that Erlang C and Erlang A aren't really predicting what we are measuring, as they can only cope with a fixed number of servers. Still, whenever staffing levels do happen to be constant, the number of Erlang servers should be identical to the Max Seats figure: all the servers must handle calls simultaneously at some point in time, because otherwise the call centre could achieve the same performance target with fewer servers.

However, we don't know of any call centre that actually has a constant number of servers all the time, or that changes the number of servers every hour on the hour, as required by the Erlang models.

So, in the more usual case where there is a changing number of servers across the period under consideration, we can expect that there will sometimes be fewer servers, and sometimes more servers than the constant staffing level that would yield the same performance target.

This means that the Erlang methods should underestimate the Max Seats value. It's worth pointing out that there's no way to pull a figure out of the real world data that complies with the constant servers requirement of the Erlang models. This makes Erlang C and Erlang A predictions formally untestable with real world data. All we can do is test the predictions of these models against the observable that we do have, the real max seats, which is the crucial figure for planning.

When assessing predictive models, there are two types of error one must analyse.

The first is systematic error, which is the tendency of the model to consistently overestimate (or underestimate) when compared with real world data. A perfect model has zero systematic error. When a model’s systematic error is always in the same direction, regardless of input, we describe that as bias.

The second is residual error, which tells the amount by which each model prediction varies from each individual real world measurement. A perfect model predicts each and every value perfectly, and has a residual error of zero. In the field of measurement, residual error is also known as random error.

For example, consider a simple model of throwing a die that always predicts the outcome 3.5. The systematic error of this model is zero, because this is indeed the average of all the numbers on the faces. Furthermore, the model does not consistently overestimate or underestimate, and therefore has zero bias.

However, no die ever lands with 3.5 showing because the closest available numbers are 3 and 4. The residual error is therefore going to be large. Residual error is usually measured by taking the square root of the mean of the square of the difference between each die roll and the corresponding prediction, known as Root Mean Square or RMS error, which works out at 1.71 in this particular case. That's 48.8% of the predicted value, so one can also talk about an RMS of 48.8%. So, you might see a prediction of this model written as 3.5 ± 1.7 or (equivalently) 3.5 ± 50%. We use the latter convention in what follows.

Normally one expects about 70% of the measured values to be at least as close to the predicted values as the RMS value. We can see here that this is the case: the numbers 2, 3, 4 and 5 are all within 1.71 of the value predicted by this very simple model, and these make up two thirds (66.7%) of the equally probable outcomes of a die roll.
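The die example can be checked in a few lines. This is just the arithmetic described above written out, with variable names chosen for the illustration.

```python
import math

faces = [1, 2, 3, 4, 5, 6]   # equally probable outcomes of a die roll
prediction = 3.5             # the simple model always predicts this value

# Systematic error: average prediction minus average outcome.
systematic_error = prediction - sum(faces) / len(faces)

# Residual (RMS) error: root of the mean squared difference.
rms = math.sqrt(sum((face - prediction) ** 2 for face in faces) / len(faces))
rms_percent = 100.0 * rms / prediction

# Outcomes that lie within one RMS of the prediction.
within = [face for face in faces if abs(face - prediction) <= rms]

print(f"systematic error: {systematic_error}")
print(f"RMS error: {rms:.2f} ({rms_percent:.1f}% of the prediction)")
print(f"within one RMS: {within} = {100.0 * len(within) / len(faces):.1f}% of outcomes")
```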

Note that these two types of error are not independent; a high systematic error will tend to shift the predictions away from the real world result, which contributes towards a higher RMS value too. This is why it is normally the RMS error in predictions and measurements that is quoted. RMS is usually considered the more significant value and indicator of overall accuracy.

Variation in the underlying data set also contributes to RMS error, so while a systematic error of zero is both possible and desirable, an RMS error of zero is generally not to be expected.

Results

Systematic Error

So, to assess systematic error (and bias), we compared the average predictions of Erlang A and Erlang C to the measured average maximum number of servers at different levels of intensity (intensity = calls per unit time * average handle time, measured in units called Erlangs) within each hour. It is reasonable to expect forecasting models to be free from systematic error, at least.
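For readers unfamiliar with Erlangs as a unit, here is the intensity calculation written out. The 5 minute handle time in the example is an assumed figure for illustration only; the call volumes are those quoted for the quietest and busiest bands in this study.

```python
def intensity_erlangs(calls, period_minutes, aht_minutes):
    """Intensity = calls per unit time * average handle time."""
    return calls / period_minutes * aht_minutes


# With an assumed 5 minute average handle time:
print(intensity_erlangs(45, 60, 5))    # 45 calls in an hour  -> 3.75 Erlangs
print(intensity_erlangs(555, 60, 5))   # 555 calls in an hour -> 46.25 Erlangs
```

One Erlang therefore corresponds to one agent being kept continuously busy for the whole period.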

Hour by Hour Systematic Error

Intensity (Hourly)   Erlang A   Erlang A*   Erlang C   OrderlyC
Over 40              -15.2%     -15.7%      18.5%      1.82%
35 to 40             -12.9%     -14.1%      9.3%       0.38%
30 to 35             -14.0%     -15.1%      10.4%      -0.52%
25 to 30             -14.9%     -15.6%      11.1%      -0.53%
20 to 25             -13.5%     -14.4%      9.4%       -0.31%
15 to 20             -12.3%     -12.3%      10.8%      0.36%
10 to 15             -14.9%     -13.5%      11.7%      -0.12%
5 to 10              -25.6%     -23.5%      8.6%       -1.61%
Up to 5              -28.9%     -19.4%      -5.2%      0.26%

The above table shows the systematic error of each model at different bands of intensity. Negative values indicate a tendency to underestimate. The error is expressed as a percentage of the sample mean of the maximum seats in each hour in each band. The coloured bars represent this graphically, with larger, red bars representing worse accuracy. The purple line in the centre represents zero systematic error. Underestimates swing to the left, and overestimates swing to the right of the purple centre line. The shorter the bar, the smaller the systematic error. Models that mostly swing in one direction are biased.
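A calculation of this kind can be sketched as follows. The record layout (intensity, predicted seats, actual max seats for each hour), the fixed 5 Erlang band width and the sample figures are assumptions made for the illustration, and are not the study's actual code or data.

```python
from collections import defaultdict


def systematic_error_by_band(hours, band_width=5.0):
    """Systematic error per intensity band, as a percentage of mean max seats.

    hours: iterable of (intensity_erlangs, predicted_seats, actual_max_seats).
    Returns {band lower edge: percent error}; negative means underestimate.
    """
    sums = defaultdict(lambda: [0.0, 0.0])   # band -> [sum predicted, sum actual]
    for intensity, predicted, actual in hours:
        band = int(intensity // band_width) * band_width
        sums[band][0] += predicted
        sums[band][1] += actual
    # The ratio of sums equals the ratio of means within a band.
    return {band: 100.0 * (p - a) / a for band, (p, a) in sorted(sums.items())}


# Hypothetical hours: (intensity, predicted seats, measured max seats).
sample = [(3.0, 4, 5), (4.5, 6, 5), (12.0, 18, 15), (14.0, 15, 15)]
print(systematic_error_by_band(sample))
```

In the first band of the sample the individual errors cancel, so the systematic error is zero even though neither prediction is exact. This is why residual error has to be assessed separately.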

The systematic error for Erlang C ranged from just 5% at the lowest intensity (less than 5 Erlangs, which was around 45 calls per hour), where it underestimated, up to 19% at greater than 40 Erlangs (average of 555 calls per hour, the busiest band in our study), where it overestimated. So, for example, where a call centre actually had 48 max seats, Erlang C predicted 57 servers, on average. At the intermediate levels of traffic, the systematic error of Erlang C was an overestimate of around 10% of the number of seats required to deliver the measured ASA.

Erlang A always tended to underestimate the maximum number of seats required. Unlike Erlang C, this is as expected. Erlang A had a systematic error of 29% at the lowest intensity, 26% in the 5 - 10 Erlang band (about 126 calls per hour), and falling to 15% at the greater than 40 Erlang band, which was the only band where it had less systematic error than Erlang C. It underestimated by about 14% in the intermediate bands. The particularly high systematic error at the lower bands can be attributed - at least partially - to the fact that average patience time and answer rate cannot be measured with certainty with the small number of abandoned calls in each hour at low call volumes. We did also try Erlang A using the patience time for the whole day containing the hour in question. At the lowest call volumes, this reduced Erlang A systematic error by about a third, but slightly increased the systematic error at high call volumes (by around 1%). We denote the Erlang A predictions with daily patience time Erlang A* in the tables.

OrderlyC did very much better than either Erlang method. At less than 5 Erlangs, the systematic error was less than 0.3%. The only bands where the 1% threshold was exceeded were 5 - 10 Erlangs (1.6%) and greater than 40 Erlangs (1.8%). For all the rest of the bands, the systematic error was around 0.4% (i.e. negligible). It is likely that the tiny remaining systematic error may be attributable to statistical variation in sample mean from the true mean, rather than an actual disturbance in the OrderlyC model - though we have not yet conducted a formal analysis. Furthermore, and consistent with that proposition, OrderlyC neither consistently overestimated nor underestimated across the bands; rather it is essentially bias free when single hours are considered.

Certainly OrderlyC had between 5 times and 100 times less systematic error than Erlang C, and between 16 times and 122 times less systematic error than the Erlang A implementation we looked at.

Day by Day Systematic Error

In theory, good models should become more accurate with more data. So, giving a model a whole day's worth of hours, and then totalling up the model's predictions and comparing them with the actual day's totals should yield greater accuracy.

In order to test, we did exactly that with our database of almost a million hours of call centre data and all three models. We summed the hourly data into 24 hour periods for each call centre, and banded the resulting days by total intensity (so we added up the intensity in Erlangs for each hour of the day). The lowest band 0 - 25 total Erlangs corresponded to around 300 calls per day. The highest band of over 200 Erlangs covered up to 15000 calls in a single day.

Intensity (Daily)   Erlang A   Erlang A*   Erlang C   OrderlyC
Over 200            -15.5%     -14.9%      9.5%       0.7%
175 to 200          -14.2%     -10.7%      9.7%       1.2%
150 to 175          -14.6%     -8.7%       11.0%      2.7%
125 to 150          -18.3%     -7.7%       13.3%      3.8%
100 to 125          -22.5%     -12.2%      14.6%      3.3%
75 to 100           -28.0%     -22.1%      8.0%       -4.3%
50 to 75            -29.9%     -24.1%      1.5%       -3.7%
25 to 50            -35.0%     -26.3%      -1.5%      -1.1%
0 to 25             -40.8%     -17.5%      -7.3%      -0.2%

Surprisingly, Erlang C performed best at the highest and lowest ends, and worst in the middle. At 0 - 25 Erlangs, the systematic error was an underestimation by 7.3%, and at 25 - 50 Erlangs (around 750 calls per day), this fell to an underestimate of just 1.5%, which was Erlang C's best day-to-day result. However, Erlang C's systematic error quickly rose to an overestimate of 14.6% at 100 Erlangs (around 1500 calls per day), before falling again to an overestimate of 9.5% at the highest traffic band. The anomalously low systematic error around 50 Erlangs represents a 'sweet spot' where Erlang C switches from underestimating to overestimating. Any model (of anything) that switches bias in this manner must have a point at which its systematic error is zero. It would therefore be misleading to represent Erlang C as more accurate than other models on this basis; certainly sweet spot areas should not form the basis by which model accuracy is assessed.

Erlang A consistently underestimated the total seat hours, by 41% in the lowest band, but systematic error did reduce with increasing calls, and it fell to an underestimate of 15.5% in the highest demand band. Erlang A systematic error was greater in magnitude than Erlang C systematic error in every band. Systematic error for Erlang A was significantly reduced when the daily patience time was used for each hour in the day, rather than the specific patience time for that hour. At the lowest band the systematic error reduced to an underestimate of 17.5% (so Erlang A accuracy for small call centres was more than doubled by this one change), with the benefit falling to a reduction of systematic error of 0.6% at the highest band.

Once again OrderlyC outperformed both Erlang models. At the lowest traffic levels it underestimated total seats requirement for the day by just 0.2% (which is 30 times less systematic error than Erlang C at this same band). As with Erlang C, its maximum systematic error was in the middle, with an underestimate of 4.3% in the 75 - 100 Erlang band, and an overestimate of 3.3% at the 100 - 125 Erlang band. OrderlyC was twice as accurate as Erlang C in the first of these bands, and four times as accurate in the second. Although it's not shown in the table, we can confirm that there is a sweet spot at the boundary between these bands. However, unlike Erlang C, the magnitude of the OrderlyC systematic error does tend towards zero with increasing demand, as it should, rather than growing. From there OrderlyC accuracy increased dramatically, with systematic error reduced to an overestimate of just 0.7% at the highest band. It was more accurate than Erlang C in every band save the Erlang C sweet spot. Averaging across all bands, OrderlyC was 7.5 times more free of systematic error than Erlang C, and had 29.6 times less systematic error than Erlang A.

So, did the models become more accurate when a whole day's data was used? In terms of systematic error, sometimes yes, sometimes no. The reason for this is that a single day at a typical call centre contains highly variable levels of demand, from quiet times at opening and closing to comparatively large peaks (normally mid morning and mid afternoon). Some of our call centres had on average 100 times more traffic during their peak hours when compared with their quietest. Averaging across all the call centres, peak hours were 28 times busier than quiet hours. Because the systematic error of all the models varies by demand, this results in a mixed picture when comparing systematic error on a day by day basis with systematic error for individual hours treated separately.

None the less, the systematic error should tend towards zero at the highest daily intensity. OrderlyC was the only model that achieved this.


Residual Error (aka RMS or Random Error)

The RMS error gives an indication of how much each individual prediction varies from each individual measured result (as opposed to the averages used to assess systematic error). The smaller the RMS, the more accurate the model.

It is important to point out that there is natural variation in the real life max seats data, even given the exact same calls per hour, handle time and answer rate or ASA, so one should not expect any call centre forecasting model that only has these measures as input to have zero RMS.

To quantify the sample variation, we looked at every value of intensity from 5 to 10 Erlangs, grouped together in 0.1 Erlang bands, and the specific answer rates 95% - 96%, and 96%-97% (so 100 different groupings in total). The natural RMS variation in max seats even with these very narrow ranges of input was 15.5% of each group's mean Max Seats value. This is a fundamental limit on the accuracy of any model that uses these inputs, so one should not expect any model to have an RMS less than this, or single-hour forecasts to be more accurate than this, at least in the 5 to 10 Erlang band.

We also repeated this exercise with ASA instead of answer rate, where intensity was grouped by 0.1 Erlang bands and ASA was grouped by 5 second bands. We found the natural sample variation under these measures was 25.3%, which is much larger than the sample variation for answer rate. This means that models which take calls per hour, handle time and ASA as input for predictions (like Erlang C), even when perfect, cannot be as accurate as models that take calls per hour, handle time and answer rate as input (like OrderlyC). This is despite the fact that in theory the ASA contains more information than the answer rate, as the ASA is an average of times (each time being any amount of time), rather than the average of a binary value (1 or 0, answered or not answered).

We do not have sufficient data to do a similar pinpoint comparison for the combination of calls per hour, handle time, ASA, patience time and answer rate that would enable a comparable direct assessment of the limit of accuracy possible for models that take the Erlang A inputs. It could be smaller if each additional parameter reduces the sample variation. It could be larger if the inherent sample variation within the volume defined by the parameter space increases with larger dimensionality. We are working to answer this question.

However, it is worth noting at this point that Erlang A can be used to target either answer rate or ASA. We tried both methods, and found that Erlang A was much more accurate when answer rate was targeted. When ASA was targeted instead, Erlang A switched from underestimating to overestimating, when really it should underestimate (see the later Observation discussion), and the amount of systematic error was also increased. RMS error was also increased substantially. The results presented in this paper all use answer rate as the Erlang A target.

This lends further weight to the proposition that the greatest accuracy is possible when answer rate is used as the target performance indicator. Call centres still using service level or average speed of answer should switch.

For all models, the least accuracy is expected at the smallest number of calls, as averages cannot be accurately measured with a small sample size, and Poisson type errors become significant (whereby the effect of a single call coming in or not coming in, or being answered or not being answered, is a large change in measured averages). Erlang A is expected to be particularly sensitive to this, as the Patience Time input relies entirely on a small number of abandoned calls in each hour, so we also tried Erlang A with the daily average patience time as before, again shown as Erlang A* in the tables.

None the less, good models should become increasingly accurate the more calls there are.

On to the results.

Hour by Hour RMS Error

Intensity (Hourly)   Erlang A   Erlang A*   Erlang C   OrderlyC
Over 40              17.9%      18.3%       34.4%      9.0%
35 to 40             16.3%      17.4%       22.8%      8.6%
30 to 35             18.7%      19.5%       22.3%      10.8%
25 to 30             20.3%      21.5%       22.4%      12.7%
20 to 25             20.5%      20.9%       20.6%      13.0%
15 to 20             19.5%      20.2%       21.6%      12.9%
10 to 15             22.5%      23.5%       24.7%      14.8%
5 to 10              31.5%      32.9%       30.8%      18.5%
Up to 5              39.1%      41.9%       30.6%      22.7%

The above table gives the RMS error as a percentage of the average of the maximum seats of each hour within the band. Residual error has no particular direction, so all the values are positive. The size of the RMS is indicated by the size of the circle. The smaller the pinpoint, the more accurate the model. OrderlyC hourly RMS values are likely to be close to the minimum possible for any model.

Looking across all bands, Erlang C had the highest RMS error (so least accuracy) at the lowest call volumes (< 5 Erlangs, and 5 - 10 Erlangs), where the square root of the average of the square of the difference between its predictions and the actual measured max seats in both bands was 31%. So, for example, in hours where a call centre actually had 10 seats, Erlang C predicted 7, 8, 9, 10, 11, 12 or 13 seats about 70% of the time. From there RMS fell smoothly to a low point (so highest accuracy) in the 20 - 30 Erlang range (about 200 calls per hour), of 21%. After that, the RMS grew again (so less accurate), to a maximum value of 34% at the greater than 40 Erlang range (so for call centre hours that actually had 50 people handling calls at the same time, Erlang C predicted between 33 and 67 servers about 70% of the time).

Erlang A was less accurate than Erlang C for very low intensities (39% RMS at < 5 Erlangs), and comparably accurate to Erlang C up to the 20 - 25 Erlang range (around 200 calls per hour). At this point it became more accurate than Erlang C, with smaller RMS values, falling to an RMS of just 18% at traffic in excess of 40 Erlangs, so Erlang A was found to be twice as accurate as Erlang C for high call volumes. So, for example, again when a call centre actually had 50 seats, Erlang A predicted between 41 and 59 seats about 70% of the time. This smaller RMS range makes Erlang A more accurate than Erlang C at high call volumes, despite having more bias. We found that using the hourly patience time produced smaller RMS residuals (so more accuracy) than the daily patience time for all levels of intensity, but the difference wasn't very large (about 0.2 seats).

Again OrderlyC did much better than either Erlang model, at all levels of demand. As is to be expected, OrderlyC had its worst accuracy at very small call volumes (as, for example, less than 100 calls means the answer rate cannot be measured with a precision of more than 1%), but with an RMS of 23% in the less than 5 Erlang band OrderlyC was still one third more accurate than Erlang C and almost twice as accurate as Erlang A.

OrderlyC accuracy improved rapidly with increasing demand.

In the 5 - 10 Erlang band, RMS was down to 18.5% (which is only 3% more than the sample variation described above, which is the fundamental limit on accuracy at this level of demand), dropping to 14.8% at 10-15 Erlangs, and then falling smoothly to less than 9% at the greater than 40 Erlang band.

So, unlike the Erlang models, OrderlyC continued to become more accurate with increasing call volume all the way up the scale. In terms of RMS error, OrderlyC was between 25% more accurate (very low demand) and four times as accurate (very high demand) as Erlang C, and between 34% more accurate (low demand) and twice as accurate (very high demand) as Erlang A.

Day by Day RMS error

Daily RMS error was found by totalling the max seats from each hour in the day, totalling the predicted max seats for the same hours, and finding the square root of the average of the squares of the difference between these two daily totals.
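The calculation can be sketched as below. The data layout and sample figures are hypothetical. Expressing the result as a percentage of the mean daily total of max seats is an assumption made here by analogy with the hourly tables.

```python
import math


def daily_rms_error_percent(days):
    """RMS error of daily seat totals, as a percentage of the mean daily total.

    days: list of days, each a list of (predicted_seats, actual_max_seats)
    pairs, one pair per hour.
    """
    squared_differences = []
    actual_totals = []
    for hours in days:
        predicted_total = sum(predicted for predicted, _ in hours)
        actual_total = sum(actual for _, actual in hours)
        squared_differences.append((predicted_total - actual_total) ** 2)
        actual_totals.append(actual_total)
    rms = math.sqrt(sum(squared_differences) / len(squared_differences))
    mean_actual_total = sum(actual_totals) / len(actual_totals)
    return 100.0 * rms / mean_actual_total


# Two hypothetical two-hour days.
sample_days = [
    [(10, 12), (20, 18)],   # hourly errors cancel: totals are 30 and 30
    [(15, 10), (10, 10)],   # totals are 25 predicted against 20 actual
]
print(f"{daily_rms_error_percent(sample_days):.1f}%")
```

Because the differences are taken between daily totals, hourly errors in opposite directions cancel within a day, which is one reason a model can score better day by day than hour by hour.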

It should be emphasised that daily RMS error as shown in the table below is the key accuracy measure for call centre managers who plan their staffing rotas for daily traffic patterns, i.e. most call centre managers - so we've saved the most important and practical result for last.

Intensity (Daily)                       Erlang A   Erlang A*   Erlang C   OrderlyC
Over 200 (up to 15000 calls per day)    17.5%      17.5%       16.2%      3.7%
175 to 200                              16.2%      13.3%       11.9%      5.4%
150 to 175                              16.9%      12.6%       13.4%      6.8%
125 to 150                              20.7%      13.4%       17.0%      7.9%
100 to 125                              24.6%      19.1%       20.7%      10.2%
75 to 100                               29.7%      27.2%       21.0%      11.7%
50 to 75                                32.3%      28.2%       21.3%      11.8%
25 to 50                                38.0%      31.5%       23.5%      12.5%
0 to 25 (around 300 calls per day)      46.6%      31.3%       21.5%      14.4%

Small green dots mean pinpoint accuracy.

Erlang C RMS error for a whole day of data was highest in the 25 - 50 Erlang band (around 750 calls per day), at 23.5%. It was actually slightly lower for the 0 - 25 Erlang band, at 21.5%. It fell to a minimum of 11.9% at the 175 - 200 Erlang band, and then rose again at the highest traffic band to 16.2% - so that's quite a large error on all bands. Certainly Erlang C should never be relied upon to provide predictions accurate to within 15%; expecting errors of 15 - 25% is more reasonable.

Erlang A RMS error dropped smoothly from 46.6% at the lowest traffic band to 16.2% at the second highest band. At the maximum band, it rose again slightly to 17.5%. Erlang A accuracy was substantially improved by using the daily patience time, shown in the second column. When the daily patience time was used, Erlang A became more accurate than Erlang C in many intensity bands, and was about as accurate as Erlang C in the remaining bands. Erlang A is to be preferred over Erlang C for this reason.

OrderlyC RMS error started low and kept dropping, so OrderlyC accuracy does indeed increase when more calls and hours are in play. At the lowest band, 0-25 Erlangs, the OrderlyC RMS error was just 14.4%. This dropped immediately to 12.5% in the 25-50 Erlang band, and was down to 10% by 100 Erlangs. From there RMS error continued to fall to a comparatively minuscule 3.7% at the highest band, which by far exceeds the accuracy of any Erlang method at any band.

OrderlyC was again very much more accurate than either Erlang method in all bands, ranging from 33% more accurate to over four times as accurate as Erlang C, and from twice as accurate to almost five times as accurate as Erlang A.
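The comparisons above can be checked directly against the daily RMS table. The sketch below uses only the lowest and highest bands from that table. The helper names are hypothetical. Reading "33% more accurate" as a 33% smaller RMS error, and "N times as accurate" as a ratio of RMS errors, is our interpretation rather than a stated definition.

```python
# Daily RMS error (%) for the lowest and highest intensity bands, from the table.
rms = {
    "0 to 25":  {"Erlang A": 46.6, "Erlang A*": 31.3, "Erlang C": 21.5, "OrderlyC": 14.4},
    "Over 200": {"Erlang A": 17.5, "Erlang A*": 17.5, "Erlang C": 16.2, "OrderlyC": 3.7},
}


def times_as_accurate(band, model):
    """Ratio of the model's RMS error to OrderlyC's in the same band."""
    return rms[band][model] / rms[band]["OrderlyC"]


def error_reduction(band, model):
    """Fraction by which OrderlyC's RMS error is smaller than the model's."""
    return 1 - rms[band]["OrderlyC"] / rms[band][model]


print(f"{error_reduction('0 to 25', 'Erlang C'):.0%}")       # 33%
print(f"{times_as_accurate('Over 200', 'Erlang C'):.2f}")    # 4.38
print(f"{times_as_accurate('Over 200', 'Erlang A'):.2f}")    # 4.73
```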

Observations & Recommendations

A number of observations are worth summarizing:

Regarding Limits of Accuracy

There is a limit to the accuracy possible for any predictive model, due to natural variation within the real world data. When a single hour of data is considered, this was measured as 15.5% RMS for smaller call centres with answer rates around 95%. We can also infer that the hourly limit of accuracy for large call centres across all answer rates must be less than 9%, as OrderlyC achieved an RMS error below this. Similarly, the daily limit of accuracy must be within 4%, as OrderlyC achieved a daily RMS error of 3.7% at the highest band.

Sample variation was much larger when ASA was used as the defining measure, when measured at the same band. Erlang C, which is purely ASA-based, was outperformed both by Erlang A, which also incorporates answer rate, and by OrderlyC, which does not use the ASA measure at all. In addition, Erlang A is much more accurate when answer rate is targeted. Together, these findings strongly indicate that models that predict based on answer rate should generally be preferred over those that target ASA, as they offer the greatest scope for accuracy.

We therefore recommend that call centres that plan around a service level or ASA target as their key performance indicator switch to answer rate. This is also of the most benefit to callers, and the business.

Regarding Erlang A

When looking at Erlang A with individual hours, using the daily patience time rather than the hourly patience time does reduce systematic error at low call volumes. However, this comes with a reduction in RMS accuracy at all hourly call volumes, so for hour-by-hour use it may seem best avoided.

However, when the hours are grouped into days (like our Day Planner), these errors cancel each other out, and it is definitely a better idea to use the daily patience time rather than the hourly patience time, as this significantly improves both Systematic and Residual error in Erlang A, regardless of demand, when forecasts for a whole day are required.

We therefore recommend call centres using Erlang A apply the daily patience time in normal usage.

Given the limits of accuracy described above, we would also normally recommend that Erlang A is used to forecast and/or plan for a target answer rate rather than target average speed of answer or service level (Erlang A can be used for either one). The Erlang A results shown in this paper target answer rate.

Regarding Erlang A and Erlang C

On an hourly basis, the systematic tendency towards underestimating in Erlang A is more than compensated for by the reduction in residual error over Erlang C for hours with more than about 15 max seats. This indicates that Erlang A is more suitable for use than Erlang C at call centres that normally have more seats than this.

On a daily basis, we again see Erlang A become more accurate than Erlang C once a threshold of activity is exceeded, so long as daily patience time is used.

The consistent bias of Erlang A towards underestimating the max seats requirement may initially seem troubling, but is explainable, as Erlang A uses a fixed number of servers, but real world call centres have varying numbers, sometimes more, sometimes fewer. It is therefore always likely that a call centre would achieve the same answer rate with a smaller maximum number of servers if this value were held constant throughout the hour, which is the assumption that underpins both Erlang methods. The tendency towards underestimation is therefore not a failure of the Erlang A model per se; rather it can be construed as a result of the methodology used in this experiment, in terms of the way the Erlang A prediction was matched against the real world data.

We suggest that Erlang A can be safely used so long as one is mindful that if you do vary staff numbers for any reason, then more seats will be required at some point than the output of Erlang A. However, the same cannot be said for Erlang C, which usually and inexplicably overestimated the number of servers.

Furthermore, Erlang A can be used (and is most accurate) when staffing around answer rate, which is the superior performance indicator both in terms of accuracy and commercial and caller value, whereas Erlang C cannot.

We therefore recommend Erlang A over Erlang C.

Regarding Erlang C, Erlang A and OrderlyC

Neither Erlang method could come close to matching the performance of OrderlyC, at any level of demand, on either an hourly or daily basis, in terms of either RMS or systematic error.

OrderlyC was the only model with no detectable bias.

OrderlyC was also the only model to tend towards zero systematic error with increasing demand.

Using the standard RMS measure, OrderlyC was between 25% more accurate and four times as accurate as any other model, across all bands.

We therefore recommend OrderlyC over both Erlang methods in all circumstances.

Conclusion

We echo the calls from many academics to move away from Erlang C to more accurate models, including Erlang A. Such a move is supported by our data.

OrderlyC has been found to be more accurate than either Erlang C or Erlang A at all observed levels of demand, with negligible systematic error, no bias, and a much smaller RMS error, both at the level of individual hours and also across entire days.

OrderlyC will therefore give you the best predictions of the number of seats you need in order to achieve a specific answer rate target, as well as the expected answer rate given a particular number of seats.

We therefore believe that OrderlyC is the most accurate call centre planning algorithm in the world.

In order to help call centre managers with the tricky job of planning and forecasting, we have decided to make OrderlyC freely available to use in our Block and Day planners. We invite you to check the accuracy for yourself.

If you have any questions or comments about our work, please Contact Us.

If you make a call centre planning product or tool and would be interested in delivering higher quality forecasts for your customers, then please contact us for further information.

The Erlang A library used in the production of this paper was graciously provided by Dr Wyean Chan of the University of Montreal, to whom we are most grateful. Dr Chan also provides an excellent online Erlang A calculator.

Mean Seats
The Mean Seats figure can be calculated exactly for any call centre data so long as the answer rate, number of calls and average handle time are known.

It represents the number of people that would need to be continuously handling calls throughout the period in order to achieve the specified answer rate.

Put another way, the specified answer rate cannot be reached with fewer seats than the mean seats figure shown; it is the mathematical lower limit.

So, you always need at the very least the mean seats figure available if you want to have any hope of achieving your target answer rate. It cannot be done, even in theory, with fewer than the Mean Seats.
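The page does not give the formula, but the description above (people continuously handling calls for the whole period) suggests the following sketch. It assumes Mean Seats is the answered workload divided by the length of the period. The function name and the sample figures are hypothetical.

```python
def mean_seats(calls, answer_rate, aht_minutes, period_minutes):
    """Seats that would need to be continuously busy to answer the target share of calls.

    Assumption: mean seats = answered calls x average handle time / period length.
    """
    answered_calls = calls * answer_rate
    return answered_calls * aht_minutes / period_minutes


# Illustrative figures: 200 calls in one hour, 95% answer rate, 4 minute handle time.
print(f"{mean_seats(200, 0.95, 4, 60):.1f}")  # 12.7
```

On this reading, 190 answered calls at 4 minutes each is 760 minutes of handling time, which cannot fit into a 60 minute period with fewer than about 12.7 people working without a break.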


Max Seats
In practice, just having the mean seats available will never be enough. This is because the level of caller demand during any period is not constant (and neither is the average handle time), which means peaks and troughs, so you will have to have more agents working at some times than at others. This necessarily makes the max greater than the mean. In addition to peaks and troughs caused by changes in demand, there are also statistical peaks and troughs caused by the Poisson distribution - for even when average demand is constant, some 5 minute intervals will have more calls than others.
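The statistical peaks and troughs mentioned above can be illustrated with a small simulation. This is a sketch under stated assumptions, not part of OrderlyC: calls arrive at a constant average rate with exponentially distributed gaps (a Poisson process), and the function name, rate and seed are made up for the example.

```python
import random


def calls_per_interval(calls_per_hour, interval_minutes=5, seed=1):
    """Simulate one hour of Poisson arrivals and count calls in each interval."""
    rng = random.Random(seed)
    counts = [0] * (60 // interval_minutes)
    rate_per_minute = calls_per_hour / 60
    t = 0.0
    while True:
        # Gaps between calls in a Poisson process are exponentially distributed.
        t += rng.expovariate(rate_per_minute)
        if t >= 60:
            break
        counts[int(t // interval_minutes)] += 1
    return counts


counts = calls_per_interval(120)
print(counts)                    # twelve 5 minute counts, averaging about 10
print(min(counts), max(counts))  # quietest and busiest interval
```

Even though the average demand never changes in this simulation, the twelve counts will typically differ from one another, which is why the busiest interval needs more seats than the average one.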

By contrast, the Max Seats figure predicted by OrderlyC corresponds to the actual maximum number of people handling calls at any moment in each one hour period in our enormous data set. The call centres did have at least this many agents ready to take calls at some point in each hour in order for all those people to handle calls at the same time, which is why Max Seats is the basis for forecasting and costing.

Can you get away with fewer? The Max Seats figure is a reflection of what actually happened, not what is possible, so maybe, in some cases. For example, if you know that your call centre is quiet when it opens at 9am and then gets busy at 10am, you might get away with having fewer agents towards the start of the hour and more towards the end. However, this still means you need the maximum seats at some point in the hour, and you still have to pay wages for these people. You might be able to have some of them doing other tasks like email, for some of the time, so long as they can be moved swiftly onto calls when the peak comes in.

Conversely, if you don't know when your peaks will occur within any particular hour, then you need the whole team sat in the call centre for the whole hour to be sure of coping with any peaks well enough to reach the target answer rate, which is why Max Seats is the figure to use for planning purposes.

By the way, if you want Max Seats answer rates with Mean Seats numbers of people (and who wouldn't), really the only way to go about it is to use the OrderlyQ traffic shaper, which ensures that your agents are not idle waiting for callers, and also that your callers are never holding for agents. You can request a free trial of OrderlyQ by getting in touch from our Contacts page. If you currently have an 85% answer rate you should expect a 99%+ answer rate with OrderlyQ, and we offer a 300% return on investment guarantee, so you can only win with OrderlyQ.