Solaris Resource Monitor

Before SRM (Solaris Resource Manager) became a standard part of Solaris, it was very difficult to set and enforce policies that would effectively manage large, shared Solaris servers (thin client servers in particular). Such systems were plagued by the recurring problem of a single user suddenly consuming the entire system and denying others a fair share. This is the environment tyr (named after the Norse god) is designed to monitor and manage. The program has no fancy user interface, just a simple config file, a reporting tool, and a kernel module. In normal operation, it sits quietly in the background, monitoring activity and acting on violations of the policies contained in its config file. The policies can be quite complex rule sets based upon user ID, overall system resource availability, binary name, binary MD5, user group, physical connection (which X11 terminal or SunRay), time of day, date, and network activity. Some real examples:

  • Use nice on any process owned by a particular student ID that is consuming 15MB/s network bandwidth to a lab X11 terminal when someone in the faculty group is trying to run a Maple simulation.
  • Kill any student process that is consuming more than 15% of total RAM during work hours on a work day, but only send a nasty e-mail for such processes during nights and weekends.
  • Ensure that no user can use more than 10% of total CPU resources, integrated over a one-hour period, during working hours, except for three guys in physics lab one.
  • Kill any process owned by a non-system account that consumes more than 40% of total RAM.
  • Nice any process that consumes an entire CPU for half an hour, and kill the process if it continues to run for more than two hours.
  • If a user has had any process niced within the last hour, and a new process starts that consumes more than 100% of a CPU for more than 5 minutes, then immediately nice the new process.
  • Forbid the department chairman's student aide from running poker at the receptionist desk terminal during his shift.
  • Prevent IM (Instant Messaging) clients from running during labs and testing periods.
  • Limit HTTP, SSH, and FTP download speed for all users during system upgrade downloads and remote boot.
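
To make the "integrated over a one-hour period" rule above concrete, here is a minimal Python sketch of a sliding-window CPU check. It is only an illustration of the idea, not tyr's actual implementation or config syntax, neither of which is shown on this page. The class name CpuWindow, its methods, and the sampling scheme are all hypothetical.

```python
from collections import deque


class CpuWindow:
    """Hypothetical helper: one user's CPU-seconds over a sliding time window."""

    def __init__(self, window_s=3600.0):
        self.window_s = window_s
        self.samples = deque()  # (timestamp, cpu_seconds used since last sample)
        self.total = 0.0        # CPU-seconds currently inside the window

    def add(self, now, cpu_seconds):
        """Record a new sample and drop samples that fell out of the window."""
        self.samples.append((now, cpu_seconds))
        self.total += cpu_seconds
        cutoff = now - self.window_s
        while self.samples and self.samples[0][0] <= cutoff:
            _, old = self.samples.popleft()
            self.total -= old

    def share(self, ncpus):
        """Fraction of the machine's total CPU capacity used in the window."""
        return self.total / (self.window_s * ncpus)


if __name__ == "__main__":
    # Assumed scenario: a 4-CPU server sampled once a minute for an hour,
    # with the user burning 30 CPU-seconds in every one-minute interval.
    w = CpuWindow(window_s=3600.0)
    for minute in range(1, 61):
        w.add(60.0 * minute, 30.0)
    if w.share(ncpus=4) > 0.10:
        print("policy violated: user exceeded 10% of total CPU over the hour")
```

Integrating over a window rather than looking at instantaneous usage means a short burst does not trigger the rule, while sustained heavy use does.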

Because this tool is tied directly into the kernel, it is able to extract much more accurate information about process resource consumption than is available via the ps command (CPU statistics to several digits and memory information to one digit). In addition, the direct tie into the kernel makes it immune to many "root kit" techniques by which one may attempt to hide resource consumption -- I don't know how many e-mails I have received from system administrators who install tyr for the first time only to find out that some critical server they manage has been rooted for months.

Not only is the reporting module of the tyr system a more accurate replacement for ps, it is also capable of much more sophisticated reporting. For example, it can summarize usage by user, terminal (TTY), or user group. One may also create ad-hoc policies and take ad-hoc actions from the reporting tool -- like selecting a process ID and killing it. Finally, the reporting module can be set to continuously display process information as it changes and report any actions that the kernel module is taking to enforce policy.
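
As a rough illustration of what summarizing usage by user, terminal, or group involves, the following Python sketch rolls a list of per-process records up by any chosen field. The record layout (the keys user, group, cpu_pct, rss_mb) and the function name summarize are assumptions made for this example. They do not reflect tyr's real data structures or output format.

```python
from collections import defaultdict


def summarize(procs, key):
    """Aggregate per-process records by the given field (e.g. 'user' or 'group')."""
    totals = defaultdict(lambda: {"cpu_pct": 0.0, "rss_mb": 0.0, "nproc": 0})
    for p in procs:
        t = totals[p[key]]
        t["cpu_pct"] += p["cpu_pct"]
        t["rss_mb"] += p["rss_mb"]
        t["nproc"] += 1
    return dict(totals)


if __name__ == "__main__":
    # Made-up sample data, purely for demonstration.
    procs = [
        {"pid": 101, "user": "alice", "group": "faculty", "cpu_pct": 12.5, "rss_mb": 256.0},
        {"pid": 102, "user": "bob", "group": "student", "cpu_pct": 50.0, "rss_mb": 512.0},
        {"pid": 103, "user": "bob", "group": "student", "cpu_pct": 25.0, "rss_mb": 128.0},
    ]
    for name, t in sorted(summarize(procs, "user").items()):
        print(name, t["nproc"], t["cpu_pct"], t["rss_mb"])
```

The same function covers the per-terminal and per-group views by passing a different key, provided each record carries that field.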

Resource consumption by user
Reporting on processes
Continuous reporting
© 2009 Mitch Richling