Mitch Richling: UNIX System Admin Tools
Author: Mitch Richling
Updated: 2021-09-22
1. Introduction
Here you will find several simple tools that may be of use to UNIX system administrators. My Guide to UNIX System Programming has a few programs that may be useful as well.
2. fstat: Follow symlinks and get file information
If given a link, or chain of links, fstat.pl will follow them, printing out bits of information along the way, until it finds a real file the link(s) terminate with. It will then print out most of the data available via the stat(1) and file(1) commands provided by most UNIX variants.
Some years ago I came across a virtual forest of symbolic links inside an application tree that required the resolution of as many as 58 links before a real file was to be found!!! While I wrote fstat.pl to help with that contract, it has proven to be so handy over the years that it has found a home in my private bin directory. Aside from the link following capability, the ability of this script to provide access to stat(3) data from the command line in a consistent way on different UNIX variants is probably its most charming feature.
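To make the behavior concrete, here is a minimal Perl sketch of the link-following idea. It is not the real fstat.pl - the hop limit and the output format are just illustrative - but it shows the essential readlink/stat loop, finishing with a call to file(1):

  #!/usr/bin/perl
  # Minimal sketch of the link-following idea (not the real fstat.pl):
  # walk a chain of symlinks, reporting each hop, then stat the final
  # target and hand it to file(1).
  use strict; use warnings;
  use File::Spec;
  use File::Basename qw(dirname);

  my $path = shift or die "usage: $0 <file>\n";
  my $hops = 0;
  while (-l $path) {
      die "too many levels of symbolic links\n" if ++$hops > 64;
      my $target = readlink($path) // die "readlink $path: $!\n";
      print "link: $path -> $target\n";
      # relative link targets are resolved against the directory holding the link
      $path = File::Spec->rel2abs($target, dirname($path));
  }
  my @st = stat($path) or die "stat $path: $!\n";
  printf "file: %s  size: %d  uid: %d  gid: %d  mode: %04o\n",
         $path, $st[7], $st[4], $st[5], $st[2] & 07777;
  system "file", $path;    # let file(1) describe the content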
3. stats.pl: Compute Statistics
stats.pl started life as a bit of one line perl magic, and has grown over the years into what you see here. The idea is simple: bust the text data up into columns of numeric values and then report various statistical information. The stats.pl script is capable of some fairly sophisticated statistical computations. Of course the standard things like mean, max, min, count, standard deviation, variance, regression lines, and histograms are all available. On the more exotic side, the script understands combinatorial (factor) variables - a feature commonly found only in dedicated statistical packages. Another advanced feature, which is quite rare even in advanced statistical software, is the ability to generate weighted histograms. All of the various computations may be performed on the input data or on data computed from the input data - rather like how one sometimes adds a computed column to a spreadsheet. Finally, the format in which all of the statistical computations are reported is quite customizable, allowing a range of formats from machine readable ones like CSV and TSV to human consumable reports using fixed width tables.
Perhaps the most complex, and useful, feature of the stats.pl script is the powerful set of techniques it uses to extract the data in the first place. After all, there is no point in having sophisticated computational capabilities if one can't extract the data and get it into the tool - this is a barrier every working statistician learns about very soon after entering the real world!! This is doubly important for UNIX geeks who tend to deal with numerous oddly formatted text files on a daily basis. Note that the script is not only capable of using the data it extracts, it is also capable of outputting the filtered and scrubbed data in various formats (like CSV). Many people tell me they primarily use the script in this mode, as a sort of general purpose "data extractor and filter" allowing them to feed data into tools like R, SAS, or (goodness forbid) Excel. I know of no other tool that even comes close in terms of flexibility in data extraction.
For simple cases, the script "just works" with the default values; however, more complex examples are easy to find in the day-to-day life of a UNIX system administrator (a small sketch of the underlying extraction idea follows this list):
- How do I extract the data from vmstat? - The output of vmstat is funny in that the second line has the titles while the first and third lines are junk, with the data starting on line four. That sounds painful, but stats.pl makes it easy: -skipLines=3 -headerLine=2
- How do I extract the data from mpstat? - The output of mpstat is another odd one in that the first line and every fourth line consists of column titles. How kooky is that? We note that each title line has the string CPU and none of the data lines do. So we can use something like this: -headerLine -notRegex=CPU
- OK, I got the data from mpstat, but how do I get a summary for each CPU? - The CPU is labeled in the output of mpstat in a column called CPU - the column we used in the previous FAQ entry to delete the title lines. All we need do is tell stats.pl about this column. The following options will do the trick: -headerLine=1 -notRegex=CPU -cats=CPU
- How do I get the data from sar? - The output from sar is more complex. The first three lines are bogus, the fourth line has titles mixed with data, and the last two lines are junk (a blank line and an "Average" line). Still, it isn't too bad telling stats.pl how to get the data. Because this one is so complex, there are different ways to do it. Here are three: -notRegex=Average -goodColCnt=5 or -stopRegex='^$' -skipLines=4 or -notRegex='(^$|Average)' -skipLines=4
- How can I get better titles from sar data? - First, see the previous question about how to get the data. Use one of those option sets, and add the following to the command line: -colNames=time,usr,sys,wio,idle
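To make the option names above concrete, here is a minimal Perl sketch of the extraction idea for the vmstat case. This is not stats.pl itself - the skipLines, headerLine, and notRegex variables simply mirror the flags above, and the little mean report at the end stands in for the real script's statistics:

  #!/usr/bin/perl
  # Sketch of the extraction idea only (not stats.pl): skip junk lines,
  # take column names from a header line, drop lines matching a regex, and
  # split the survivors into whitespace-separated columns.  The settings
  # below mirror the vmstat case above (-skipLines=3 -headerLine=2).
  use strict; use warnings;

  my ($skipLines, $headerLine, $notRegex) = (3, 2, undef);

  my (@names, @rows);
  while (my $line = <>) {
      if ($. == $headerLine) { @names = split ' ', $line; next; }
      next if $. <= $skipLines;
      next if defined $notRegex && $line =~ /$notRegex/;
      push @rows, [ split ' ', $line ];
  }

  # Report a mean for each numeric column; the real script does far more.
  for my $c (0 .. $#names) {
      my @v = grep { /^-?\d+(?:\.\d+)?$/ } map { $_->[$c] // '' } @rows;
      next unless @v;
      my $sum = 0;
      $sum += $_ for @v;
      printf "%-10s n=%-6d mean=%.3f\n", $names[$c], scalar(@v), $sum / @v;
  }

One might feed it with something like vmstat 5 12, assuming the output is piped in on standard input.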
4. newhostid.pl: Change Solaris host IDs
SPARC-based computers running Solaris have a feature known as the "host ID" - a 32-bit integer intended to uniquely identify the host. This host ID is burned into the PROM of older hardware, and is programmed into the NVRAM of newer SPARC platforms. The design of the UNIX operating system is such that software always interacts with the actual hardware via the kernel - thus one may effectively change the host ID by manipulating the running kernel. The kernel in a UNIX system is nothing more than a program, and thus may be manipulated with a debugger. The host ID is stored in the kernel symbol hw_serial. Unfortunately, stuffing a new host ID into hw_serial is rather convoluted:
- Convert the hex host ID into decimal.
- Compute the ASCII code for each decimal digit.
- Compute the hex equivalent of each ASCII code.
- Pack these hex ASCII codes into groups of four (one 32-bit word per group).
- Pad the last word with zeros.
- Place the resulting three 32-bit hex integers into hw_serial, hw_serial+4, and hw_serial+8.
The script newhostid.pl performs the necessary conversions, and then uses the adb debugger to make the changes to the running kernel. BTW, you can change the host ID by patching the binary file /kernel/misc/sysinit on Solaris x64.
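To make the recipe concrete, here is a small Perl illustration using a made-up host ID of 12345678 (hex). It only performs the arithmetic; newhostid.pl does the equivalent and then feeds the resulting words to adb against the live kernel:

  #!/usr/bin/perl
  # Worked example of the hw_serial encoding described above (illustration
  # only; the host ID value here is made up, and newhostid.pl does the real work).
  use strict; use warnings;

  my $hostid  = "12345678";                     # hypothetical hex host ID
  my $decimal = sprintf "%u", hex $hostid;      # step 1: "305419896"
  my $hex     = join "", map { sprintf "%02x", ord }
                         split //, $decimal;    # steps 2-3: hex ASCII code of each digit
  $hex .= "0" x (24 - length $hex);             # steps 4-5: pad out to three 32-bit words
  my @words = unpack "(A8)*", $hex;             #            and split into groups of four bytes
  printf "hw_serial+%d: 0x%s\n", 4 * $_, $words[$_] for 0 .. $#words;

  # Output - the three words the script pokes into the kernel:
  #   hw_serial+0: 0x33303534
  #   hw_serial+4: 0x31393839
  #   hw_serial+8: 0x36000000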
5. Solaris Patch Tools
Solaris system administrators are often required to perform several extremely tedious tasks related to Solaris patches - hopefully the following scripts will help a bit…
- checkpatch.pl - Takes a list of patches on the command line, or a stream of text on STDIN that contains patch IDs, and determines if the current system has the patches, or newer versions, installed. The text stream sent to STDIN can contain text other than the patch IDs, and the application will extract the patch IDs from the text - so one may simply cat a README file or e-mail into the script and let it find the patch numbers.
- diffpatches.pl - Takes two host names as arguments. It then tells you what the differences are between the patches on the two hosts. It even tells you information regarding the versions of the various patches found on both hosts. This script can be invaluable when trying to find out why a program works on host A and doesn't on host B.
- patchmach.sh - Takes a host name, and will tell you what patches need to be installed on the current host to bring it up to the same patch levels as the given host. This is handy when you simply want to bring a particular host up to the level of some reference host on your network.
- kerpatch.sh - This is a template for a script that can install Solaris kernel patches. Kernel patches generally require a reboot and need to be installed at run level 2. This script checks to see if a particular patch is installed on the current host based upon the OS version. If the patch is not installed, it installs it in the correct way for the OS version. It then reboots the host. It logs EVERYTHING. If it fails, it will NOT attempt to install the patch on the next boot until the lock file is removed. This prevents a host from falling into a "reboot loop". This is a simple, but very handy little script that can make the install of a kernel patch on thousands of hosts painless. It can be changed to install other patches as well.
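As an illustration of the kind of work checkpatch.pl does, here is a small Perl sketch (not the real script) that pulls patch IDs out of arbitrary text on STDIN and compares them against the installed revisions; it assumes the usual Solaris showrev -p listing as its source of installed patch information:

  #!/usr/bin/perl
  # Sketch of the checkpatch.pl idea (not the real script): pull Solaris patch
  # IDs (nnnnnn-nn) out of whatever text arrives on STDIN and compare them with
  # the revisions reported by showrev -p on the current host.
  use strict; use warnings;

  my %installed;                                 # highest installed revision per patch ID
  for (qx(showrev -p)) {
      next unless /^Patch:\s+(\d{6})-(\d{2})/;
      $installed{$1} = $2 if !defined $installed{$1} || $2 > $installed{$1};
  }

  while (my $line = <STDIN>) {
      while ($line =~ /\b(\d{6})-(\d{2})\b/g) {  # patch IDs buried in arbitrary text
          my ($id, $rev) = ($1, $2);
          my $have = $installed{$id};
          printf "%s-%s: %s\n", $id, $rev,
                !defined $have ? "NOT installed"
              : $have >= $rev  ? "OK (rev $have installed)"
              :                  "OLDER rev $have installed";
      }
  }

One could cat a patch README or an e-mail into it, just as described above.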
6. Work with syslog message files
The syslogd daemon is a part of most versions of UNIX, ranging from commercial systems like Solaris and HP-UX to free systems like Linux and FreeBSD. This daemon provides a central and uniform mechanism through which all applications on a computer, or set of networked computers, may log messages. Unfortunately, this venerable tool has some quirks that make it difficult to use. These problems include:
- The "
last message repeated n times
" messages make line oriented tools likegrep
less than useful. - It is often difficult to judge the spread of a network wide problem by looking at the messages file for a large network.
- Many times one simply wishes to count the number of errors by host in order to judge the severity of a problem on a large network.
- It is very difficult to find time dependencies in syslog data. For example, it is difficult to see hourly or daily repeating errors.
- All of the "don't care" messages can really get in the way and obscure messages that are important.
I have used a uniform naming convention for these scripts. If the script has "xsyslog" in the name, then it processes a syslog file that has been "expanded" by the first script listed below. If the script only has "syslog" in the name, then it processes a raw syslog file.
- expandSyslog.pl - Probably the most useful script. It expands the "last message repeated n times" messages into "n" copies of the previous message. It also prepends each line with a "-" if it is a unique message found in the file, and a "+" if it is generated from a "last message repeated n times" message. This tool opens up the processing of syslog files to the considerable collection of text processing utilities available on all UNIX platforms - like grep(1), sed(1), etc.
- countXsyslogByHost.pl - Takes an expanded messages file (one that has been processed by expandSyslog.pl) and counts the messages that match a given regular expression by host - i.e. how many times each host has produced a message that matches the search criteria. This kind of thing is very difficult to get a feel for by just looking at the messages file on a large network.
- countSyslogByTime.pl - Takes a raw messages file and creates a histogram based on buckets that are formed by time so that problems occurring at regular intervals may be easily identified. For example, a cron task causing a spike of errors every day at 1PM would show up as a spike in the histogram graph. This script allows one to specify the time quantum, and a regular expression to select the messages to count. countSyslogByTime.gp is an input file for gnuplot that can be used to graph the histogram.
- extractHostXsyslog.pl - Extracts the messages that were generated by a set of hosts given in a file.
- extractMultiHostsSyslog.pl - Extracts the messages that were generated by a set of hosts given in a set of files. The extracted messages are sorted into different files based upon the host groupings specified to the tool. This allows one to break a messages file up into classes based upon host class or group.
- filterSyslog.pl - Extracts all the messages that do NOT match ANY of the regular expressions given to the script. Basically this is a handy way to grep for interesting messages. The filterXsyslog.pl script is similar except that it requires an "expanded" syslog file.
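To show why the expansion step makes the rest of the toolbox possible, here is a minimal Perl sketch of what expandSyslog.pl is described as doing (the real script surely handles more edge cases):

  #!/usr/bin/perl
  # Sketch of the expansion idea (not the real expandSyslog.pl): replace
  # "last message repeated n times" lines with n copies of the previous
  # message, tagging original lines with "-" and generated lines with "+".
  use strict; use warnings;

  my $prev = '';
  while (my $line = <>) {
      if ($line =~ /last message repeated (\d+) times/ && $prev ne '') {
          print "+$prev" for 1 .. $1;
      } else {
          $prev = $line;
          print "-$line";
      }
  }

Once the repeats have been expanded and tagged, ordinary grep(1), sort(1), and uniq(1) one-liners give trustworthy counts again.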
7. Fast filesystem traversal
The traditional way to traverse a file system is to simply use a recursive algorithm.
This algorithm is generally I/O bound; however, the culprit on modern systems is often I/O latency - not bandwidth. This is particularly true with today's transaction based I/O subsystems and network file systems like NFS. One way to alleviate this bottleneck is to have multiple I/O operations simultaneously in flight. Using this technique on a single CPU Linux box with a local file system only produces marginal performance increases, but when dealing with NFS file systems the speedup can be quite significant. Experiments with multi-CPU hosts utilizing gigabit Ethernet with large NFS servers show incredible performance improvements of well over 50x (20 hours cut down to 20 minutes). This set of programs has been used to traverse hundreds of terabytes of storage distributed across more than a billion files and 100 fileservers in just a few hours.
The idea is to first store every directory given on the command line in a linked list. Then a thread pool is created, and the threads pop entries off of that linked list in the order they were placed in the list (FIFO). Each thread then reads all the entries in the directory it popped off the list, performs user defined actions on each entry, and stores any subdirectories at the end of the linked list. This algorithm leads to a roughly depth-first directory traversal. The nature of the algorithm places a heavy load upon the caching systems available in many operating systems. For example, ncsize plays a role in how effective this program is on a Solaris system. Also on Solaris, the number of simultaneous NFS connections dramatically affects performance. Depending on what the optional processing functions are doing, this program can place an incredible load on nscd.
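The real tools are written in C (see below), but purely to illustrate the FIFO-plus-thread-pool idea described above, here is a rough Perl sketch. The thread count, the print statement standing in for the user defined actions, and the error handling are all simplified, and it assumes a Perl built with thread support and a Thread::Queue recent enough to have end():

  #!/usr/bin/perl
  # Rough sketch of the queue/thread-pool traversal idea (the real tools are C).
  use strict; use warnings;
  use threads;
  use threads::shared;
  use Thread::Queue;

  die "usage: $0 dir ...\n" unless @ARGV;

  my $nthreads    = 8;                         # size of the thread pool
  my $queue       = Thread::Queue->new();      # the shared "linked list" of directories
  my $outstanding :shared = 0;                 # directories queued or still being processed

  { lock $outstanding; $outstanding += @ARGV; }
  $queue->enqueue(@ARGV);                      # seed the FIFO with the starting directories

  sub worker {
      while (defined(my $dir = $queue->dequeue())) {
          if (opendir my $dh, $dir) {
              for my $entry (readdir $dh) {
                  next if $entry eq '.' || $entry eq '..';
                  my $path = "$dir/$entry";
                  print "$path\n";             # the "user defined action" goes here
                  if (!-l $path && -d $path) { # subdirectories go on the end of the list
                      lock $outstanding;
                      $outstanding++;
                      $queue->enqueue($path);
                  }
              }
              closedir $dh;
          } else {
              warn "opendir $dir: $!\n";
          }
          lock $outstanding;
          $queue->end() if --$outstanding == 0;   # all done: unblock the other workers
      }
  }

  threads->create(\&worker) for 1 .. $nthreads;
  $_->join() for threads->list();

The sketch only lists what it finds; the point of the design is that several directory reads are in flight at once, which is exactly where the NFS latency savings come from.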
The version of the code linked here is written in C, and makes use of ancient C techniques to provide for tool customization. The C++ version provides a dramatically superior extension and abstraction model, and is much easier to extend. While the C++ version was written at about the same time as the C version, it has seen much less testing in the real world, and I am hesitant to release it into the wild. In addition, an MPI version of both the C and C++ code exists that can spread the work across many hosts in a network. Like the C++ version, I am not comfortable enough with this version to release it.
The code base is designed to be customized so that binaries may be easily produced to do special tasks as the need arises. As an example of this, several compile options exist for the code in the archive that generate different binaries that do very different things. Currently the following examples may be compiled right out of the box:
- du - A very fast version of /bin/du. It has no command line options, and simply displays the output of a 'du -sk'.
- dux - A very fast, extended version of /bin/du that displays much more data about the files traversed including: file sizes, number of blocks, detection of files with holes, and lots of other data.
- own - Prints the names of all files in a directory tree that are owned by a specified list of users.
- age - Produces a report regarding the ages of the files in a directory tree.
- noown - Prints the names of all files in a directory tree that are NOT owned by a specified list of users.
- dirgo - Simply lists the files it finds. This is similar to a 'find ./', only it does an almost depth-first search.