This was initially believed to be DNS times so the system was updated to pre-resolve DNS (and use the IP address in the test directly where possible). Though this does seem to have slightly reduced some time delays the issue still seems to be there. This has been raised as a bug (ID 148).
The problem can be seen in the image at the bottom which shows two tests at various levels of zoom (rangemax). The two are identical web time tests, the one on the left being the first test on the node and the right-hand one the fourth.
I think this is down to the sheer number of tests running at once. The system fires off testers every five minutes and spawns them all near-simultaniously. The load in terms of MySQL connections (and in this case outgoing network sockets) is therefore at it's maximum right at the start. Later tests are likely to have less running alongside them (other nodes with only one or two tests configured having finished) or at least be more out-of-sequence with each other (the first tests will execute near-simultaniously the but second tests will only begin after the first has finished which will take varying times on different nodes so the second tests and onwards execute out of step with each other).
There are various potential ways to solve this problem (if that is indeed the cause) but none of them are cleanly implementable with the current setup (either requring a dirty hack or a big rewrite to core code).
As the plan is to eventually have a much better time system that isn't just a crude five-minute job then it may be better to wait (perhaps just implementing an optional dirty fix for those users who find it a specific problem) and do that big rewrite with the problem in mind.
Anyway - investigations continue...
If anyone has any ideas on what may be causing this or a neato solution please drop me a line in public here or in private here.
So far the "fix" bodges considered include:
- Random start delays - the node tests are randomly delayed between 10 and 60 seconds before starting to avoid clashing with each other as much as possible
- Rolling Testing - the nodes are tested over a rolling four minute period so if you have two nodes they are started at 0s and 120s, four 0s, 60s, 120s, 180s and 240s etc...
Example showing the same configured web time tests running first and fourth in sequence on a node. The first test can be seen to consistently take 0.05-0.1s longer and also have many more "above normal" times as well