1 minute error error messages

The forum for help and support with FreeNATS as well as any useful hints and tips
paullanders
Posts: 92
Joined: Thu Sep 04, 2008 9:48 pm

1 minute error error messages

Post by paullanders » Thu Oct 02, 2008 7:56 pm

Hi Dave.

I noted the following error message in the System Event log at 1 minute intervals:

13:47:02 02/10/2008 Tester Start 5 Tester 345779 Started
13:47:01 02/10/2008 Tester Stop 5 Tester 345778 Finished
13:47:01 02/10/2008 Tester Start 5 Tester 345778 Started
13:46:24 02/10/2008 Tester Error 1 Tester Already Running: Aborted
13:46:23 02/10/2008 Tester Stop 5 Tester 345777 Finished
13:46:23 02/10/2008 Tester Start 5 Tester 345777 Started
13:46:22 02/10/2008 Tester Stop 5 Tester 345776 Finished
13:46:22 02/10/2008 Tester Start 5 Tester 345776 Started

Is this a cause for concern?

Thanks!

Paul

dave
Site Admin
Posts: 260
Joined: Fri May 30, 2008 9:09 pm
Location: UK

Re: 1 minute error error messages

Post by dave » Thu Oct 02, 2008 10:30 pm

Hi Paul,

Do you get them every minute? If so, it might be a problem.

... A bit of background ... when a test is started for a node (or, in the old days, for a group of nodes one after another) a fntestrun record is created. At startup the script checks whether a session is already running just by looking at this table. If one is already running then that error is generated and the script exits.
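As a rough illustration of that check (not the actual FreeNATS code, which is PHP), here is a minimal Python sketch using an in-memory SQLite database. Only the table name fntestrun comes from the description above; the column names id, started_at and finished_at and the function names are assumptions made up for the example.

```python
import sqlite3


def open_db():
    """Create an in-memory database with a simplified fntestrun table.

    The columns are hypothetical; the real schema may differ.
    """
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE fntestrun ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " started_at INTEGER NOT NULL,"
        " finished_at INTEGER)"  # NULL means the session is still running
    )
    return db


def start_session(db, now):
    """Return a new session id, or None if a session is already running."""
    running = db.execute(
        "SELECT COUNT(*) FROM fntestrun WHERE finished_at IS NULL"
    ).fetchone()[0]
    if running > 0:
        # Corresponds to the "Tester Already Running: Aborted" event
        return None
    cur = db.execute("INSERT INTO fntestrun (started_at) VALUES (?)", (now,))
    db.commit()
    return cur.lastrowid


def finish_session(db, session_id, now):
    """Mark a session finished so the next run is allowed to start."""
    db.execute(
        "UPDATE fntestrun SET finished_at = ? WHERE id = ?", (now, session_id)
    )
    db.commit()
```

A second call to start_session before finish_session returns None, which is the situation the log entry reports.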

This was to avoid having to use pid-control and also as a safeguard (years ago I managed to bring down a struggling server when my tester relentlessly kept opening a new socket every minute whilst the others were still connected). In fact this is the cause of one of the more common bugs: following a reboot or a process crash the record is still live, and the process can never complete to mark it finished. The cleanup script closes sessions over a certain age though, so generally if a process bombs you only go a day without any tests before it automatically starts again.
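The cleanup idea can be sketched in the same way. This is only an illustration in Python with SQLite, not the real cleanup script: the column names are hypothetical, and the 24-hour cutoff is an assumption based on the "only go a day" remark, not a documented setting.

```python
import sqlite3

# Assumed cutoff: sessions still open after a day are treated as stale
MAX_AGE_SECONDS = 24 * 60 * 60


def make_db():
    """In-memory stand-in for the fntestrun table (hypothetical columns)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE fntestrun ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " started_at INTEGER NOT NULL,"
        " finished_at INTEGER)"  # NULL means still marked as running
    )
    return db


def close_stale_sessions(db, now, max_age=MAX_AGE_SECONDS):
    """Mark open sessions older than max_age as finished.

    Returns the number of sessions that were closed.
    """
    cur = db.execute(
        "UPDATE fntestrun SET finished_at = ?"
        " WHERE finished_at IS NULL AND started_at < ?",
        (now, now - max_age),
    )
    db.commit()
    return cur.rowcount
```

A session left behind by a crash stays open until this runs, which is why a stuck record can block testing until it is cleaned up or manually terminated.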

It is not uncommon for test sessions to genuinely overlap.

The most common cause is a test which is failing or taking a long time: a 3 retry http test with a 120s timeout may take six minutes to fail, and so the next five or six sessions may be skipped. You may also have some nodes tested every minute and others not. When lots are due, node X is started later than the others, and the script then fires again almost straight away, this time without the delay of starting all the other node tests.
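The arithmetic behind that example, as a small Python sketch. It assumes "3 retry" means three attempts in total and that the tester is fired once a minute; both function names are made up for the illustration.

```python
def worst_case_seconds(attempts, timeout_seconds):
    """Longest a test can take if every attempt runs to its timeout."""
    return attempts * timeout_seconds


def skipped_runs(duration_seconds, interval_seconds=60):
    """Number of scheduled starts that fall while a session is still running.

    Starts are assumed to fire at interval, 2 * interval, ... after the
    session began; a start landing exactly at the finish is not counted.
    """
    if duration_seconds <= 0:
        return 0
    return (duration_seconds - 1) // interval_seconds


duration = worst_case_seconds(3, 120)  # 3 attempts x 120s = 360s, six minutes
print(duration, skipped_runs(duration))
```

With exactly 360 seconds this gives five skipped starts; a few seconds of overhead on top pushes it to six, which matches the "five or six" above.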

So....

If it's just happening occasionally then I really wouldn't worry.

If it is happening every time, do you perhaps have a stuck test session? (System Settings - Test Sessions will show you any still "running" and you can manually mark them terminated.) Or there may be something else going wrong.

Cheers,

Dave.

(Sorry for the long reply)

paullanders
Posts: 92
Joined: Thu Sep 04, 2008 9:48 pm

Re: 1 minute error error messages

Post by paullanders » Thu Oct 02, 2008 10:46 pm

Hmm, yeah I didn't even know that I had test sessions running! I killed 2 sessions and will monitor it to see if the errors go away.

Thanks!

Paul

dave
Site Admin
Posts: 260
Joined: Fri May 30, 2008 9:09 pm
Location: UK

Re: 1 minute error error messages

Post by dave » Fri Oct 03, 2008 11:06 pm

Yes it's a bit of a problem sometimes. I will make the tester script able to clear stale locks itself at some point.

Cheers,

Dave.
