HPC CPU Oversubscription
Oversubscription, running multiple "tasks" per "core", carries a risk of CPU overload. Unfortunately, this risk is often only vaguely understood, and vaguely understood risks lead to overly conservative behavior. Translation: missed opportunities and higher compute costs. By transforming this risk into a real-world actuarial model, we can quantify both the risk and the reward, enabling a rational, financially grounded decision-making process.
While a full oversubscription total cost of ownership (TCO) analysis has many components, this paper focuses on one of the most difficult steps:
How to use measured job efficiency data to compute a first-order risk model for CPU overload
Yes, the focus really is on just this single computation. Because the computation can take years to complete if approached directly, much of the material covers the algorithms required to make it tractable.
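To make the shape of the problem concrete, here is a minimal sketch of what such a risk estimate might look like. It is not the paper's algorithm: it swaps in plain Monte Carlo sampling, assumes jobs are independent, and assumes "efficiency" means the fraction of one core a job actually uses. The function name, parameters, and sample data are all hypothetical.

```python
import random


def overload_probability(efficiencies, cores, tasks_per_core,
                         trials=5000, seed=0):
    """Monte Carlo estimate of P(total CPU demand > available cores).

    efficiencies   -- measured per-job CPU efficiencies in [0, 1]
                      (assumed: fraction of one core a job really uses)
    cores          -- physical cores on the host
    tasks_per_core -- oversubscription factor (1 = no oversubscription)

    Assumption: job efficiencies are independent draws from the
    measured sample. Correlated workloads would need a different model.
    """
    rng = random.Random(seed)          # fixed seed for repeatable output
    n_tasks = cores * tasks_per_core
    overloads = 0
    for _ in range(trials):
        # Total demand, in cores, of one randomly drawn mix of tasks.
        demand = sum(rng.choice(efficiencies) for _ in range(n_tasks))
        if demand > cores:
            overloads += 1
    return overloads / trials


# Hypothetical efficiency measurements, for illustration only.
measured = [0.15, 0.30, 0.45, 0.60, 0.95]

for k in (1, 2, 3):
    p = overload_probability(measured, cores=16, tasks_per_core=k)
    print(f"{k} task(s) per core: estimated overload probability {p:.3f}")
```

Sampling like this only gives an estimate, and resolving rare overload events requires many trials. Enumerating every possible mix of jobs exactly grows combinatorially with the number of tasks, which is one way to see why a direct calculation quickly becomes impractical.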
Download PDF: CPU Oversubscription in Compute Clouds