ARC Infosys PBS.pm Speed Gains

1. Introduction to Nikhef's ARC-CE
2. Issues with the ARC infosys slow updates
3. Diving under the hood
4. Putting the data to the test
5. Sorting out the data
6. Putting the new code to work
- 6.1. Update [2021-11-16 Tue]

1 Introduction to Nikhef's ARC-CE

This bit is boring and can be skipped without penalty.

Nikhef operates a compute cluster with several thousand cores; the actual number has varied over the years from around six to currently close to ten thousand.

We operate in federation with other labs world-wide to join and share compute resources; this requires a front-end service to proxy between the global user identity and their local mapped identities. Mundane tasks such as user identity mapping, authentication and authorisation, copying of input and output data, queuing compute jobs on the local batch system, and reporting the status of jobs befall this front-end service.

For a number of years we ran a system called the CREAM-CE, which nobody outside our niche ever heard of; this was only one of a few such system that implemented the required interfaces.

Things never last for ever, and the funding for the continued support for CREAM ran out. We (and plenty of others) had to switch brands in 2019. The logical choice was the ARC-CE, coming out of the Nordugrid corner.

I could make all kinds of allusions about trading in designer Italian furniture for stylish Scandinavian, but I won't. More than anything we suffered from switching from running a system with well-known quirks and bugs to stepping into one pit-fall after another.

2 Issues with the ARC infosys slow updates

One of the components of the ARC-CE is the information system, which reports on all the jobs in the system via an LDAP directory. The directory is continually updated by asking the batch system about its jobs and cross-referencing these with the jobs the ARC-CE knows about. It's not a one-on-one correspondence because we have multiple ARC-CEs submitting to the same batch system, and jobs are entering and leaving the systems all the time. After leaving the batch system they linger on the ARC-CE until the user retrieves the results. At any time as many as thirty thousand jobs could be in some state of progress on the system.

We started noticing that the update process became excruciatingly slow. So slow, in fact, that our users started to complain about it. The jobs they had submitted were not reporting back until a half our later so they had no idea how their jobs were doing. It's like sending control commands to a mars probe that has to navigate a perilous path among sand traps and craters.

Finding out what was keeping the system busy was easy enough: the process that occupied the CPU for most of the time stood out like a sore thumb when running top:

top - 23:01:21 up 13 days,  5:25,  1 user,  load average: 4,81, 2,72, 2,46
Tasks: 228 total,   2 running, 226 sleeping,   0 stopped,   0 zombie
%Cpu(s): 31,0 us, 21,1 sy,  0,0 ni, 47,8 id,  0,1 wa,  0,0 hi,  0,0 si,  0,1 st
KiB Mem : 16243660 total,   349800 free,  2248336 used, 13645524 buff/cache
KiB Swap:  2097148 total,  2097148 free,        0 used.  9622540 avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                                                                                                                                 
111639 root      20   0  545092  51752   9824 S  99,7  0,3   5963:04 arched                                                                                                                                                                                                                                                  
 47468 root      20   0  386484 215584   2744 R  99,3  1,3   0:07.68 CEinfo.pl                                                                                                                                                                                                                                               
 66066 ldap      20   0 2249892 696328   3748 S   7,3  4,3   1001:05 slapd                                                                                                                                                                                                                                                   
   594 nscd      20   0  613728   2276   1500 S   0,7  0,0  20:42.09 nscd

Besides the master arched process itself, the CEinfo.pl program is running in a seeming endless loop. This is the program responsible for creating the ldif data that feeds into the information system. I could see it was running for a very long time.

The log file in /var/log/arc/bdii/bdii-update.log shows the time stamps that match beginning and end of this process so we had a pretty good idea how long it was running, at 100% CPU as well.

3 Diving under the hood

As it is a Perl program, it is easy enough to inspect the source to try and find out where it is spending its time. It turns out that it calls a module related to the local batch system, which, in our case, is that good old dinosaur Torque (of PBS lineage).

With some deliberate print statements and calls to gettimeofday I could really whittle down the code to find the nested loop that was responsible for all of the time spent. Here it is, starting in PBS.pm on line 853.

my ( %hoh_qstatf ) = read_qstat_f($path);
foreach my $pbsjid (keys %hoh_qstatf) {
    # only jobids known by A-REX are processed
    my $jid = undef;
    foreach my $j (@$jids) {
         if ( $pbsjid =~ /^$j$/ ) {
             $jid = $j;
             last;
         }
    }
    next unless defined $jid;
    # handle qstat attributes of the jobs
    foreach my $k (keys %{$hoh_qstatf{$jid}} ) {
        my $v = $hoh_qstatf{$jid}{$k};
        &$handle_attr($jid, $k, $v);
    }
    # count cpus for this jobs
    set_cpucount($lrms_jobs{$jid});
}

This seemingly innocent chunk of code purports to do the following job. The output of qstat -f, which is basically all of the jobs in Torque, are correlated with all the jobs the ARC-CE knows about, held in the array @$jids. Both data structures are keyed on the Torque job ID which look like 98251226.korf.nikhef.nl. For each match, the Torque attributes for the job are filed away to end up in LDIF form the information system.

There is nothing wrong with what the code tries to achieve, but everything is wrong in how it goes about it. The outer loop is a simple linear traversal of the qstat jobs. But for every job, another linear search is initiated in the $jids array to find a match. The total time spent searching in this way is on the order of the square of the number of jobs in the system, or $O(N^2)$ in computer science speak.

Besides the expensive array traversals, the actual matching code uses a regular expression which looks like something that forces an exact match. Experimentally this seems to differ little from a straight up string match with the Perl eq operator, but there is a certain cost to each comparison anyhow.

The real clincher, though, is the fact that this loop is run once for every queue defined in the configuration file, even though the exact same data is parsed again and again. We have thirteen queues, so however slow this code already was, it is now thirteen times slower.

4 Putting the data to the test

I had a pretty decent idea how this could be improved, but in order to do that I had to get my hands on a good sample of data. Updating the code live and hoping I didn't make any programming errors was going against my better judgement as a system administrator, so I wanted to run this on a test bed first. But without actual data (thousands of jobs) the effects of any code improvement would be impossible to demonstrate.

The first step was to add a couple of lines to the code that would dump out both large data structures to file so I could copy them over to a development system. I could test out this small change on a test bed version of ARC-CE first to have a proof of concept.

--- /tmp/PBS.pm         2021-10-26 23:58:42.142526735 +0200
+++ /tmp/PBS-nstore.pm  2021-10-26 23:58:42.022530905 +0200
@@ -24,6 +24,8 @@

 use LogUtils ( 'start_logging', 'error', 'warning', 'debug' );

+use Storable qw(nstore);
+
 ##########################################
 # Saved private variables
 ##########################################
@@ -773,7 +775,7 @@

     # LRMS job IDs from Grid Manager (jobs with "INLRMS" GM status)
     my ($jids) = shift;
-
+    nstore \@{$jids}, "/tmp/jids.dat";
     # Return data structure %lrms_jobs{$lrms_local_job_id}{$keyword}
     #
     # All values should be defined, empty values "" are ok if field
@@ -856,6 +858,7 @@
     };

     my ( %hoh_qstatf ) = read_qstat_f($path);
+    nstore \%hoh_qstatf, "/tmp/hoh_qstatf.dat";
     foreach my $pbsjid (keys %hoh_qstatf) {
         # only jobids known by A-REX are processed
         my $jid = undef;

Perl comes with the Storable module to dump out data structures in a way that can be read in later. It is pretty fast as it is implemented in C.

After proving that the added lines were not harming the operations on the test bed, I ran this on one of our ARC systems for a couple of update cycles. Here the advantage of running an interpreted language was once again obvious, as no restarts of any services were required.

cd /usr/share/arc/
mv PBS.pm PBS.pm-orig
cp -p PBS.pm-with-nstore  PBS.pm

After capturing the output files and copying them to my laptop, I reverted the changes.

The files were of a decent size but not unmanageable.

-rw------- 1 user user  14M 26 okt 18:53  hoh_qstatf.dat
-rw------- 1 user user 221K 26 okt 18:53  jids.dat

Now I could write a minimal example program that had the exact same data and the exact same loop as present on the production system.

I came up with roughly the following.

#!/usr/bin/perl -w
use Time::HiRes qw( usleep ualarm gettimeofday tv_interval nanosleep
                    clock_gettime clock_getres clock_nanosleep clock
                    stat lstat utime);
use Storable;
my ($jids) = retrieve 'jids.dat';
my $x  = retrieve 'hoh_qstatf.dat';
my %hoh_qstatf = %$x;
my $t1 = [gettimeofday];
my $undefcount = 0;
foreach my $pbsjid (keys %hoh_qstatf) {
  # only jobids known by A-REX are processed
  my $jid = undef;
  foreach my $j (@$jids) {
    if ( $pbsjid =~ /^$j$/ ) {
      $jid = $j;
      last;
    }
  }
  if (! defined $jid) {
    $undefcount++;
    print "$undefcount\n" if $undefcount % 10 == 0;
    next;
  }
  my $t2 = [gettimeofday];
  #print tv_interval($t1, $t2), "\n";
  $t1 = $t2;
  # handle qstat attributes of the jobs
  foreach my $k (keys %{$hoh_qstatf{$jid}} ) {
    my $v = $hoh_qstatf{$jid}{$k};
    &handle_attr($jid, $k, $v);
  }
  # count cpus for this jobs
  #set_cpucount($lrms_jobs{$jid});
}
sub handle_attr($$$) {
  #my $t2 = [gettimeofday];
  #print tv_interval($t1, $t2), "\n";
  #$t1 = $t2;
  print shift, " ", shift, ": ", shift,"\n";
}

I played around a bit with printing some timings, and printing every tenth time a job was not found. This was mostly out of curiosity. The actual handling of the job attributes was replaced by a stub subroutine that just printed the stuff.

Running the code confirmed what I already suspected: it was very slow. The lines scrolled by on the screen faster than I could read, but not so fast that I could not follow it at all. And most noticeable: the fan came on and went to high-speed for all the time it ran.

time ./pbs.pl  > /tmp/out

real	2m30,209s
user	2m29,906s
sys	0m0,260s

5 Sorting out the data

I wouldn't claim to be a brilliant programmer, but I did see how to improve the code immediately. A novice may try to be clever and come up with a better search algorithm for the inner loop, like turning it into a sorted list and performing a binary search, and yes, that would work; it would reduce the run time to $O(N \log N)$ and I can't really improve on that. But the elegance of my solution lies in the realisation that there is considerable overlap between the two lists, and the order of the output does not really matter. (I really took a swing at that novice straw man there, didn't I?)

So after sorting both indices, we can start traversing both lists linearly, in lockstep, advancing the one with the lesser index, acting only when both agree. One linear traversal of the lot and we are done. The code is not truly linear because it requires the sorting of the indices which itself is a $O(N \log N)$ operation but the simplicity of the code is a joy.

#!/usr/bin/perl -w
use Time::HiRes qw( usleep ualarm gettimeofday tv_interval nanosleep
                    clock_gettime clock_getres clock_nanosleep clock
                    stat lstat utime);
use Storable;
my ($jids) = retrieve 'jids.dat';
my $x  = retrieve 'hoh_qstatf.dat';
my %hoh_qstatf = %$x;
my $t1 = [gettimeofday];
my $undefcount = 0;
# make two sorted indices
my @qstatkeys = sort keys %hoh_qstatf;
my @sjids = sort (@$jids);
my $jidindex = 0;
foreach my $pbsjid (@qstatkeys) {
  # only jobids known by A-REX are processed
  my $jid = undef;
  while ($sjids[$jidindex] lt $pbsjid) {
    last if ($jidindex == $#sjids);
    $jidindex++;
  }
  if ( $pbsjid =~ /^$sjids[$jidindex]$/ ) {
    $jid = $sjids[$jidindex];
  }
  if (! defined $jid) {
    $undefcount++;
    print "$undefcount\n" if $undefcount % 10 == 0;
    next;
  }
  my $t2 = [gettimeofday];
  #print tv_interval($t1, $t2), "\n";
  $t1 = $t2;
  # handle qstat attributes of the jobs
  foreach my $k (keys %{$hoh_qstatf{$jid}} ) {
    my $v = $hoh_qstatf{$jid}{$k};
    &handle_attr($jid, $k, $v);
  }
  # count cpus for this jobs
  #set_cpucount($lrms_jobs{$jid});
}
sub handle_attr($$$) {
  #my $t2 = [gettimeofday];
  #print tv_interval($t1, $t2), "\n";
  #$t1 = $t2;
  print shift, " ", shift, ": ", shift,"\n";
}

There is a little safeguard on advancing the @sjids index beyond the end of the array. This and several other subtleties had to be discovered by trial and error; although I only present the final result, I cannot take pride in the many intermediate steps I took.

Superficially this still looks like a nested linear search, but all the heavy lifting is done by the two calls to sort before the outer loop.

I could now verify that this program produces the same output as before, which it did not. But that is due, of course, to the fact that the keys, now sorted, are traversed in a different order. Sorting and comparing the output of both programs confirmed their identical result, but the really important part:

time ./pbs-fast.pl  > /tmp/fast-output

real	0m0,339s
user	0m0,254s
sys	0m0,052s

6 Putting the new code to work

Having a proof of concept is great, but now it was time to see it in action. Would it actually be so much faster on the ARC-CE system?

The good part of having this code tested on actual data is that we can be pretty sure it holds up in production. And since I had used the exact snippet from the original, the fix was now a drop-in replacement.

For completeness sake, here is the patch as it applies to the original file:

--- PBS.pm-orig 2021-09-30 08:32:59.468493554 +0200
+++ PBS-fast.pm 2021-10-26 21:44:02.014181320 +0200
@@ -856,14 +856,21 @@
     };

     my ( %hoh_qstatf ) = read_qstat_f($path);
-    foreach my $pbsjid (keys %hoh_qstatf) {
+
+    # make two sorted indices
+    my @qstatkeys = sort keys %hoh_qstatf;
+    my @sjids = sort (@$jids);
+    my $jidindex = 0;
+
+    foreach my $pbsjid (@qstatkeys) {
         # only jobids known by A-REX are processed
         my $jid = undef;
-        foreach my $j (@$jids) {
-             if ( $pbsjid =~ /^$j$/ ) {
-                 $jid = $j;
-                 last;
-             }
+        while ($sjids[$jidindex] lt $pbsjid) {
+            last if ($jidindex == $#sjids);
+            $jidindex++;
+        }
+        if ( $pbsjid =~ /^$sjids[$jidindex]$/ ) {
+            $jid = $sjids[$jidindex];
         }
         next unless defined $jid;
         # handle qstat attributes of the jobs

The end result? The update times are down from around 25 minutes to about 1 minute 30 seconds, of which the CEinfo.pl accounts for less than 30 seconds.

Not bad for four extra lines of code.

For the extra curious, if you want to see this in action and have an actual data set to work with, I can probably provide that. But unless you are a close colleague I can't share the real data unless it is cleaned up and vetted to be GDPR safe.

6.1 Update [2021-11-16 Tue]

No code update is so small or it has a bug in it. An edge cage when there are no ARC jobs leaves the array empty and we should really have checked for that. Not only does the code reference a non-existent entry in the empty array, it increases the index infinitely because the stop condition is never met.

The subroutine should return an empty hash in this case, so the check should be near the top of the subroutine before getting any other work done.

@@ -776,6 +781,9 @@

     my (%lrms_jobs);

+    # edge case if @$jids is empty, return now
+    return %lrms_jobs if ($#$jids < 0);
+
     # Fill %lrms_jobs here (da implementation)

     # rank is treated separately as it does not have an entry in