Understand what you are and are not simulating

If a customer has a history of upgrade problems, we might decide to collect some configurations from live sites, load them up in our lab, and practice the upgrade before doing the real one – that should help, right?

It does help some. There are two parts to whether an upgrade succeeds or fails. First there’s “does the upgrade process run ok, with nothing aborting, no process crashing, and no system rebooting when it isn’t supposed to”, and then there’s “afterwards, do all the call/data flows that used to work still work”.

Our lab is really not set up to do a complete simulation of the customer environment: one that uses the customer IP interfaces unchanged, with test equipment that can run calls using the IPs of actual peer nodes in that network, much less one that produces call flows (message sequences) that look like the ones presented to the live system. Instead, we load the customer database and then add some additional configuration so the test tools can run test calls using the local lab network addresses. This does pretty well for finding problems with the actual upgrade process: stuff like “this system has some corrupted configuration data that needs to be fixed before we try to upgrade”, and “this system is using feature X, which needs to be disabled before the upgrade and re-enabled afterwards”.

But if the upgrade on the live system runs ok and some calls fail afterwards, we would not expect the practice run to have found that. The calls we had up during the practice upgrade do not look much like the calls that come into the live system. The test calls use just a few basic call flows, have a simple set of INVITE headers, and are processed by the basic configuration settings used for a lab network. The live calls have more varied headers and message sequences, and those headers and call flows interact with the customer’s network-specific configuration settings, which can get somewhat, um, exotic.
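One rough way to see that gap is to compare the header names in a lab test INVITE against those in an INVITE captured from the live network. The sketch below is only an illustration under some assumptions: both messages are made up (they are not from any customer), the function name header_names is hypothetical, and it does not handle SIP compact header forms.

```python
def header_names(sip_message):
    """Return the lower-cased header names found in one raw SIP message."""
    names = set()
    lines = sip_message.strip().splitlines()
    for line in lines[1:]:           # skip the request line
        if not line.strip():         # a blank line ends the header section
            break
        if line[0] in " \t":         # folded continuation of the previous header
            continue
        name, sep, _ = line.partition(":")
        if sep:
            names.add(name.strip().lower())
    return names


# Invented example of a simple lab test call.
LAB_INVITE = """INVITE sip:1000@lab.example SIP/2.0
Via: SIP/2.0/UDP 192.0.2.10:5060
From: <sip:tester@lab.example>;tag=1
To: <sip:1000@lab.example>
Call-ID: lab-1
CSeq: 1 INVITE
Content-Length: 0
"""

# Invented example of a more varied live call.
LIVE_INVITE = """INVITE sip:+15550100@carrier.example SIP/2.0
Via: SIP/2.0/UDP 198.51.100.7:5060
From: <sip:+15550199@carrier.example>;tag=9
To: <sip:+15550100@carrier.example>
Call-ID: live-1
CSeq: 1 INVITE
P-Asserted-Identity: <sip:+15550199@carrier.example>
Diversion: <sip:+15550123@carrier.example>;reason=unconditional
Privacy: id
Content-Length: 0
"""

# Headers the live call carries that the lab call never exercised.
uncovered = header_names(LIVE_INVITE) - header_names(LAB_INVITE)
print(sorted(uncovered))
```

Header names are only a first approximation. Two calls with the same header names can still take different paths through the customer’s configuration depending on the header values.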

This testbed also looks attractive for investigating the failed-calls problem, because it already has the customer database loaded, but it may not really be good for that. Could we adapt the configuration and test tools for the problem investigation? Maybe yes, but maybe a different testbed, perhaps one already used for ongoing support for this customer, would be better.

Building a simulation of a customer’s live traffic is hard: you have to know what their call flows and headers look like, and some customers don’t themselves have a complete picture of that. And then you have to program the test tools … I don’t know if building such a simulation would be worthwhile. For a customer whose call flows we understood well (and who didn’t have huge variation), maybe it would produce some insight into how to prevent those post-upgrade failures.
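A first step toward “knowing what their call flows look like” could be as simple as tallying the distinct message sequences seen per call. This is a minimal sketch, assuming something else has already reduced a capture to (Call-ID, message label) pairs in arrival order. The function name flow_signatures and the sample events are hypothetical, and the labels are simplified.

```python
from collections import Counter, defaultdict


def flow_signatures(events):
    """Count how many calls followed each distinct message sequence.

    events: iterable of (call_id, label) pairs in arrival order, where
    label is a request method or a response status code as a string.
    """
    per_call = defaultdict(list)
    for call_id, label in events:
        per_call[call_id].append(label)
    return Counter(" > ".join(seq) for seq in per_call.values())


# Invented sample: three interleaved calls, two of them with the same flow.
events = [
    ("a", "INVITE"), ("b", "INVITE"), ("a", "180"), ("c", "INVITE"),
    ("b", "180"), ("a", "200"), ("c", "183"), ("a", "ACK"),
    ("c", "PRACK"), ("b", "200"), ("c", "200"), ("b", "ACK"),
    ("c", "200"), ("c", "ACK"), ("a", "BYE"), ("b", "BYE"),
]

signatures = flow_signatures(events)
for flow, count in signatures.most_common():
    print(count, flow)
```

If a tally like this showed only a handful of flows covering nearly all calls, that would be the “not huge variation” case where programming the test tools starts to look feasible.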

It would probably be fun to build.