Computing System Reliability
Models and Analysis

Min Xie, Yuan-Shun Dai and Kim-Leng Poh
National University of Singapore, Singapore

KLUWER ACADEMIC PUBLISHERS
NEW YORK, BOSTON, DORDRECHT, LONDON, MOSCOW

eBook ISBN: 0-306-48636-9
Print ISBN: 0-306-48496-X

©2004 Kluwer Academic Publishers, New York, Boston, Dordrecht, London, Moscow
Print ©2004 Kluwer Academic/Plenum Publishers, New York

All rights reserved. No part of this eBook may be reproduced or transmitted in any form or by any means, electronic, mechanical, recording, or otherwise, without written consent from the Publisher.

Created in the United States of America

Visit Kluwer Online at: http://kluweronline.com
and Kluwer's eBookstore at: http://ebooks.kluweronline.com

Preface

Computing systems are widely used today, and in many areas they play a key role in accomplishing highly complicated and safety-critical missions. At the same time, the size and complexity of computing systems have continued to increase, making their performance evaluation more difficult than ever before. The purpose of this book is to provide comprehensive coverage of tools and techniques for computing system reliability modeling and analysis.

Reliability analysis is a useful tool for evaluating the performance of complex systems. Intensive studies have been carried out to improve the likelihood that computing systems perform satisfactorily in operation. Software and hardware are the two major building blocks of computing systems, and they have to work together successfully to complete many critical computing tasks. This book systematically studies the reliability of software, hardware, and integrated software/hardware systems. It also introduces typical models for the reliability analysis of distributed/networked systems, and then develops some new models and analytical tools.
The “grid” computing system has emerged as an important new field, distinguished from conventional distributed computing systems by its focus on large-scale resource sharing, innovative applications, and, in many cases, high-performance orientation. This book also presents general reliability models for the grid and discusses analytical tools for estimating grid reliability as it relates to the resource management system, wide-area network communication, and parallel running programs with multiple shared resources.

Furthermore, this book introduces the basic reliability theories and models for various multi-state systems. Based on these models, some interesting decision problems in system design and resource allocation are further discussed.

This book is organized as follows. Chapter 1 provides an introduction to the field of computing systems and reliability analysis. Simple reliability concepts are also discussed. Chapter 2 provides basic knowledge of reliability analysis and summarizes some common techniques for analyzing computing system reliability. It also introduces the fundamentals of Markov processes and nonhomogeneous Poisson processes (NHPP), which are essential tools used in this book. Chapters 3 and 4 present important models for the reliability analysis of hardware and software systems, respectively. They are useful when hardware and software issues are dealt with separately at the system analysis stage. Chapter 5 discusses models for integrated systems. This is essential in computing system analysis, as software and hardware systems have to work together. Chapter 6 studies the reliability of various distributed computing systems, incorporating network communication into the hardware/software reliability. The distributed computing system is a common and widely-used network