Mitch Richling: UNIX System Admin Tools
Author: Mitch Richling
Updated: 2021-09-22
1. Introduction
Here you will find several simple tools that may be of use to UNIX system administrators. My Guide to UNIX System Programming has a few programs that may be useful as well.
2. fstat: Follow symlinks and get file information
If given a link, or chain of links, fstat.pl will follow them, printing out bits of information along the way, until it finds a real file the link(s) terminate with. It will then print out most of the data available via the stat(1) and file(1) commands provided by most UNIX variants.
Some years ago I came across a virtual forest of symbolic links inside an application tree that required the resolution of as many as 58 links before a real file was to be found!!! While I wrote fstat.pl to help with that contract, it has proven to be so handy over the years that it has found a home in my private bin directory. Aside from the link following capability, the ability of this script to provide access to stat(3) data from the command line in a consistent way on different UNIX variants is probably its most charming feature.
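To make the behavior concrete, here is a minimal Perl sketch of the link-following idea. It is not the real fstat.pl - the hop limit and the output format are just illustrative - but it shows the essential readlink/stat loop, finishing with a call to file(1):

  #!/usr/bin/perl
  # Minimal sketch of the link-following idea (not the real fstat.pl):
  # walk a chain of symlinks, reporting each hop, then stat the final
  # target and hand it to file(1).
  use strict; use warnings;
  use File::Spec;
  use File::Basename qw(dirname);

  my $path = shift or die "usage: $0 <file>\n";
  my $hops = 0;
  while (-l $path) {
      die "too many levels of symbolic links\n" if ++$hops > 64;
      my $target = readlink($path) // die "readlink $path: $!\n";
      print "link: $path -> $target\n";
      # relative link targets are resolved against the directory holding the link
      $path = File::Spec->rel2abs($target, dirname($path));
  }
  my @st = stat($path) or die "stat $path: $!\n";
  printf "file: %s  size: %d  uid: %d  gid: %d  mode: %04o\n",
         $path, $st[7], $st[4], $st[5], $st[2] & 07777;
  system "file", $path;    # let file(1) describe the content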
3. stats.pl: Compute Statistics
stats.pl started life as a bit of one line perl magic, and has grown over the years into what you see here. The idea is simple: bust the text data up into columns of numeric values and then report various statistical information. The stats.pl script is capable of some fairly sophisticated statistical computations. Of course the standard things like mean, max, min, count, standard deviation, variance, regression lines, and histograms are all available. On the more exotic side, the script understands combinatorial (factor) variables - a feature commonly found only in dedicated statistical packages. Another advanced feature, which is quite rare even in advanced statistical software, is the ability to generate weighted histograms. All of the various computations may be performed on the input data or on data computed from the input data - rather like how one sometimes adds a computed column to a spreadsheet. Finally, the format in which all of the statistical computations are reported is quite customizable, allowing a range of formats from machine readable ones like CSV and TSV to human consumable reports using fixed width tables.
Perhaps the most complex, and useful, feature of the stats.pl script is the powerful set of techniques it uses to extract the data in the first place. After all, there is no point in having sophisticated computational capabilities if one can't extract the data and get it into the tool - this is a barrier every working statistician learns about very soon after entering the real world!! This is doubly important for UNIX geeks who tend to deal with numerous oddly formatted text files on a daily basis. Note that the script is not only capable of using the data it extracts, it is also capable of outputting the filtered and scrubbed data in various formats (like CSV). Many people tell me they primarily use the script in this mode, as a sort of general purpose "data extractor and filter" allowing them to feed data into tools like R, SAS, or (goodness forbid) Excel. I know of no other tool that even comes close in terms of flexibility in data extraction.
For simple cases, the script "just works" with the default values; however, more complex examples are easy to find in the day-to-day life of a UNIX system administrator (a small sketch of the underlying extraction idea follows this list):
- How do I extract the data from vmstat? - The output of vmstat is funny in that the second line has the titles while the first and third lines are junk, with the data starting on line four. That sounds painful, but stats.pl makes it easy: -skipLines=3 -headerLine=2
- How do I extract the data from mpstat? - The output of mpstat is another odd one in that the first line and every fourth line consists of column titles. How kooky is that? We note that each title line has the string CPU and none of the data lines do. So we can use something like this: -headerLine -notRegex=CPU
- OK, I got the data from mpstat, but how do I get a summary for each CPU? - The CPU is labeled in the output of mpstat in a column called CPU - the column we used in the previous FAQ entry to delete the title lines. All we need do is tell stats.pl about this column. The following options will do the trick: -headerLine=1 -notRegex=CPU -cats=CPU
- How do I get the data from sar? - The output from sar is more complex. The first three lines are bogus, the fourth line has titles mixed with data, and the last two lines are junk (a blank line and an "Average" line). Still, it isn't too bad telling stats.pl how to get the data. Because this one is so complex, there are different ways to do it. Here are three: -notRegex=Average -goodColCnt=5 or -stopRegex='^$' -skipLines=4 or -notRegex='(^$|Average)' -skipLines=4
- How can I get better titles from sar data? - First, see the previous question about how to get the data. Use one of those option sets, and add the following to the command line: -colNames=time,usr,sys,wio,idle
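To make the option names above concrete, here is a minimal Perl sketch of the extraction idea for the vmstat case. This is not stats.pl itself - the skipLines, headerLine, and notRegex variables simply mirror the flags above, and the little mean report at the end stands in for the real script's statistics:

  #!/usr/bin/perl
  # Sketch of the extraction idea only (not stats.pl): skip junk lines,
  # take column names from a header line, drop lines matching a regex, and
  # split the survivors into whitespace-separated columns.  The settings
  # below mirror the vmstat case above (-skipLines=3 -headerLine=2).
  use strict; use warnings;

  my ($skipLines, $headerLine, $notRegex) = (3, 2, undef);

  my (@names, @rows);
  while (my $line = <>) {
      if ($. == $headerLine) { @names = split ' ', $line; next; }
      next if $. <= $skipLines;
      next if defined $notRegex && $line =~ /$notRegex/;
      push @rows, [ split ' ', $line ];
  }

  # Report a mean for each numeric column; the real script does far more.
  for my $c (0 .. $#names) {
      my @v = grep { /^-?\d+(?:\.\d+)?$/ } map { $_->[$c] // '' } @rows;
      next unless @v;
      my $sum = 0;
      $sum += $_ for @v;
      printf "%-10s n=%-6d mean=%.3f\n", $names[$c], scalar(@v), $sum / @v;
  }

One might feed it with something like vmstat 5 12, assuming the output is piped in on standard input.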
4. newhostid.pl: Change Solaris host IDs
SPARC-based computers running Solaris have a feature known as the "host ID" - a 32-bit integer intended to uniquely identify the host. This host ID is burned into the PROM of older hardware, and is programmed into the NVRAM of newer SPARC platforms. The design of the UNIX operating system is such that software always interacts with the actual hardware via the kernel - thus one may effectively change the host ID by manipulating the running kernel. The kernel in a UNIX system is nothing more than a program, and thus may be manipulated with a debugger. The host ID is stored in the kernel symbol hw_serial. Unfortunately, stuffing a new host ID into hw_serial is rather convoluted:
- Convert the hex host ID into decimal.
- Compute the ASCII code for each decimal digit.
- Compute the hex equivalent of each ASCII code.
- Pack these hex ASCII codes into groups of four (one 32-bit word per group).
- Pad the last word with zeros.
- Place the resulting three 32-bit hex integers into hw_serial, hw_serial+4, and hw_serial+8.
The script newhostid.pl performs the necessary conversions, and then uses the adb debugger to make the changes to the running kernel. BTW, you can change the host ID by patching the binary file /kernel/misc/sysinit on Solaris x64.
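To make the recipe concrete, here is a small Perl illustration using a made-up host ID of 12345678 (hex). It only performs the arithmetic; newhostid.pl does the equivalent and then feeds the resulting words to adb against the live kernel:

  #!/usr/bin/perl
  # Worked example of the hw_serial encoding described above (illustration
  # only; the host ID value here is made up, and newhostid.pl does the real work).
  use strict; use warnings;

  my $hostid  = "12345678";                     # hypothetical hex host ID
  my $decimal = sprintf "%u", hex $hostid;      # step 1: "305419896"
  my $hex     = join "", map { sprintf "%02x", ord }
                         split //, $decimal;    # steps 2-3: hex ASCII code of each digit
  $hex .= "0" x (24 - length $hex);             # steps 4-5: pad out to three 32-bit words
  my @words = unpack "(A8)*", $hex;             #            and split into groups of four bytes
  printf "hw_serial+%d: 0x%s\n", 4 * $_, $words[$_] for 0 .. $#words;

  # Output - the three words the script pokes into the kernel:
  #   hw_serial+0: 0x33303534
  #   hw_serial+4: 0x31393839
  #   hw_serial+8: 0x36000000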
5. Solaris Patch Tools
Solaris system administrators are often required to perform several extremely tedious tasks related to Solaris patches - hopefully the following scripts will help a bit…
- checkpatch.pl - Takes a list of patches on the command line, or a stream of text on STDIN that contains patch IDs, and determines if the current system has the patches, or newer versions, installed. The text stream sent to STDIN can contain text other than the patch IDs, and the application will extract the patch IDs from the text - so one may simply cat a README file or e-mail into the script and let it find the patch numbers.
- diffpatches.pl - Takes two host names as arguments. It then tells you what the differences are between the patches on the two hosts. It even tells you information regarding the versions of the various patches found on both hosts. This script can be invaluable when trying to find out why a program works on host A and doesn't on host B.
- patchmach.sh - Takes a host name, and will tell you what patches need to be installed on the current host to bring it up to the same patch levels as the given host. This is handy when you simply want to bring a particular host up to the level of some reference host on your network.
- kerpatch.sh - This is a template for a script that can install Solaris kernel patches. Kernel patches generally require a reboot and need to be installed at run level 2. This script checks to see if a particular patch is installed on the current host based upon the OS version. If the patch is not installed, it installs it in the correct way for the OS version. It then reboots the host. It logs EVERYTHING. If it fails, it will NOT attempt to install the patch on the next boot until the lock file is removed. This prevents a host from falling into a "reboot loop". This is a simple, but very handy little script that can make the install of a kernel patch on thousands of hosts painless. It can be changed to install other patches as well.
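As an illustration of the kind of work checkpatch.pl does, here is a small Perl sketch (not the real script) that pulls patch IDs out of arbitrary text on STDIN and compares them against the installed revisions; it assumes the usual Solaris showrev -p listing as its source of installed patch information:

  #!/usr/bin/perl
  # Sketch of the checkpatch.pl idea (not the real script): pull Solaris patch
  # IDs (nnnnnn-nn) out of whatever text arrives on STDIN and compare them with
  # the revisions reported by showrev -p on the current host.
  use strict; use warnings;

  my %installed;                                 # highest installed revision per patch ID
  for (qx(showrev -p)) {
      next unless /^Patch:\s+(\d{6})-(\d{2})/;
      $installed{$1} = $2 if !defined $installed{$1} || $2 > $installed{$1};
  }

  while (my $line = <STDIN>) {
      while ($line =~ /\b(\d{6})-(\d{2})\b/g) {  # patch IDs buried in arbitrary text
          my ($id, $rev) = ($1, $2);
          my $have = $installed{$id};
          printf "%s-%s: %s\n", $id, $rev,
                !defined $have ? "NOT installed"
              : $have >= $rev  ? "OK (rev $have installed)"
              :                  "OLDER rev $have installed";
      }
  }

One could cat a patch README or an e-mail into it, just as described above.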
6. Work with syslog message files
The syslogd daemon is a part of most versions of UNIX, ranging from commercial systems like Solaris and HP-UX to free systems like Linux and FreeBSD. This daemon provides a central and uniform mechanism through which all applications on a computer, or set of networked computers, may log messages. Unfortunately, this venerable tool has some quirks that make it difficult to use. These problems include:
- The "
last message repeated n times
" messages make line oriented tools likegrep
less than useful. - It is often difficult to judge the spread of a network wide problem by looking at the messages file for a large network.
- Many times one simply wishes to count the number of errors by host in order to judge the severity of a problem on a large network.
- It is very difficult to find time dependencies in syslog data. For example, it is difficult to see hourly or daily repeating errors.
- All of the "don't care" messages can really get in the way and obscure messages that are important.
I have used a uniform naming convention for these scripts. If the script has "xsyslog" in the name, then it processes a syslog file that has been "expanded" by the first script listed below. If the script only has "syslog" in the name, then it processes a raw syslog file.
- expandSyslog.pl - Probably the most useful script. It expands the "last message repeated n times" messages into "n" copies of the previous message. It also prepends each line with a "-" if it is a unique message found in the file, and a "+" if it is generated from a "last message repeated n times" message. This tool opens up the processing of syslog files to the considerable collection of text processing utilities available on all UNIX platforms - like grep(1), sed(1), etc.
- countXsyslogByHost.pl - Takes an expanded messages file (one that has been processed by expandSyslog.pl) and counts the messages that match a given regular expression by host - i.e. how many times each host has produced a message that matches the search criteria. This kind of thing is very difficult to get a feel for by just looking at the messages file on a large network.
- countSyslogByTime.pl - Takes a raw messages file and creates a histogram based on buckets that are formed by time so that problems occurring at regular intervals may be easily identified. For example, a cron task causing a spike of errors every day at 1PM would show up as a spike in the histogram graph. This script allows one to specify the time quantum, and a regular expression to select the messages to count. countSyslogByTime.gp is an input file for gnuplot that can be used to graph the histogram.
- extractHostXsyslog.pl - Extracts the messages that were generated by a set of hosts given in a file.
- extractMultiHostsSyslog.pl - Extracts the messages that were generated by a set of hosts given in a set of files. The extracted messages are sorted into different files based upon the host groupings specified to the tool. This allows one to break a messages file up into classes based upon host class or group.
- filterSyslog.pl - Extracts all the messages that do NOT match ANY of the regular expressions given to the script. Basically this is a handy way to grep for interesting messages. The filterXsyslog.pl script is similar except that it requires an "expanded" syslog file.
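To show why the expansion step makes the rest of the toolbox possible, here is a minimal Perl sketch of what expandSyslog.pl is described as doing (the real script surely handles more edge cases):

  #!/usr/bin/perl
  # Sketch of the expansion idea (not the real expandSyslog.pl): replace
  # "last message repeated n times" lines with n copies of the previous
  # message, tagging original lines with "-" and generated lines with "+".
  use strict; use warnings;

  my $prev = '';
  while (my $line = <>) {
      if ($line =~ /last message repeated (\d+) times/ && $prev ne '') {
          print "+$prev" for 1 .. $1;
      } else {
          $prev = $line;
          print "-$line";
      }
  }

Once the repeats have been expanded and tagged, ordinary grep(1), sort(1), and uniq(1) one-liners give trustworthy counts again.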
7. Fast filesystem traversal
The traditional way to traverse a file system is to simply use a recursive algorithm.
This algorithm is generally I/O bound; however, the culprit on modern systems is often I/O latency - not bandwidth. This is particularly true with today's transaction based I/O subsystems and network file systems like NFS. One way to alleviate this bottleneck is to have multiple I/O operations simultaneously in flight. Using this technique on a single CPU Linux box with a local file system only produces marginal performance increases, but when dealing with NFS file systems the speedup can be quite significant. Experiments with multi-CPU hosts utilizing gigabit Ethernet with large NFS servers show incredible performance improvements of well over 50x (20 hours cut down to 20 minutes). This set of programs has been used to traverse hundreds of terabytes of storage distributed across more than a billion files and 100 fileservers in just a few hours.
The idea is to first store every directory given on the command line in a linked list. Then a thread pool is created, and the threads pop entries off of that linked list in the order they were placed in the list (FIFO). Each thread then reads all the entries in the directory it popped off the list, performs user defined actions on each entry, and stores any subdirectories at the end of the linked list. This algorithm leads to a roughly depth-first directory traversal. The nature of the algorithm places a heavy load upon the caching systems available in many operating systems. For example, ncsize plays a role in how effective this program is on a Solaris system. Also on Solaris, the number of simultaneous NFS connections dramatically affects performance. Depending on what the optional processing functions are doing, this program can place an incredible load on nscd.
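The real tools are written in C (see below), but purely to illustrate the FIFO-plus-thread-pool idea described above, here is a rough Perl sketch. The thread count, the print statement standing in for the user defined actions, and the error handling are all simplified, and it assumes a Perl built with thread support and a Thread::Queue recent enough to have end():

  #!/usr/bin/perl
  # Rough sketch of the queue/thread-pool traversal idea (the real tools are C).
  use strict; use warnings;
  use threads;
  use threads::shared;
  use Thread::Queue;

  die "usage: $0 dir ...\n" unless @ARGV;

  my $nthreads    = 8;                         # size of the thread pool
  my $queue       = Thread::Queue->new();      # the shared "linked list" of directories
  my $outstanding :shared = 0;                 # directories queued or still being processed

  { lock $outstanding; $outstanding += @ARGV; }
  $queue->enqueue(@ARGV);                      # seed the FIFO with the starting directories

  sub worker {
      while (defined(my $dir = $queue->dequeue())) {
          if (opendir my $dh, $dir) {
              for my $entry (readdir $dh) {
                  next if $entry eq '.' || $entry eq '..';
                  my $path = "$dir/$entry";
                  print "$path\n";             # the "user defined action" goes here
                  if (!-l $path && -d $path) { # subdirectories go on the end of the list
                      lock $outstanding;
                      $outstanding++;
                      $queue->enqueue($path);
                  }
              }
              closedir $dh;
          } else {
              warn "opendir $dir: $!\n";
          }
          lock $outstanding;
          $queue->end() if --$outstanding == 0;   # all done: unblock the other workers
      }
  }

  threads->create(\&worker) for 1 .. $nthreads;
  $_->join() for threads->list();

The sketch only lists what it finds; the point of the design is that several directory reads are in flight at once, which is exactly where the NFS latency savings come from.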
The version of the code linked here is written in C, and makes use of ancient C techniques to provide for tool customization. The C++ version provides a dramatically superior extension and abstraction model, and is much easier to extend. While the C++ version was written at about the same time as the C version, it has seen much less testing in the real world, and I am hesitant to release it into the wild. In addition, an MPI version of both the C and C++ code exists that can spread the work across many hosts in a network. Like the C++ version, I am not comfortable enough with this version to release it.
The code base is designed to be customized so that binaries may be easily produced to do special tasks as the need arises. As an example of this, several compile options exist for the code in the archive that generate different binaries that do very different things. Currently the following examples may be compiled right out of the box:
- du - A very fast version of /bin/du. It has no command line options, and simply displays the output of a 'du -sk'.
- dux - A very fast, extended version of /bin/du that displays much more data about the files traversed including: file sizes, number of blocks, detection of files with holes, and lots of other data.
- own - Prints the names of all files in a directory tree that are owned by a specified list of users.
- age - Produces a report regarding the ages of the files in a directory tree.
- noown - Prints the names of all files in a directory tree that are NOT owned by a specified list of users.
- dirgo - Simply lists the files it finds. This is similar to a 'find ./', only it does an almost depth-first search.