Introduction
Availability measures the fraction of time a system is operational when it is required. In other words, availability is the ratio of time in service to total time. It can be computed as MTTF / (MTTF + MTTR), where MTTF is the Mean Time To Failure and MTTR is the Mean Time To Repair (or Recover). When a user tries to connect to a server and the server does not respond, the server is said to be unavailable [1]. Different systems have different availability requirements, and the larger a system grows, the harder it becomes to keep it highly available [2].
Problems in High Availability:
The following problems can cause a system to go down or become unavailable: software failure, planned downtime, careless mistakes, hardware failure, and the environment in which the system is deployed [3]. Each is described below.
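As a rough illustration of the formula above, the following Python sketch computes availability from MTTF and MTTR and converts it into expected downtime per year. The function names and the example figures (999 hours between failures, 1 hour to repair) are assumptions chosen for illustration and do not come from the cited sources.

```python
HOURS_PER_YEAR = 365 * 24  # ignores leap years for simplicity


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Return availability as MTTF / (MTTF + MTTR)."""
    if mttf_hours < 0 or mttr_hours < 0 or (mttf_hours + mttr_hours) == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


def annual_downtime_hours(avail: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


if __name__ == "__main__":
    a = availability(mttf_hours=999, mttr_hours=1)
    print(f"availability: {a:.3%}")
    print(f"downtime per year: {annual_downtime_hours(a):.2f} hours")
```

With these assumed figures the system is available 99.9% of the time, which still corresponds to roughly 8.76 hours of downtime per year.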
Software Failure:
Programming mistakes introduce faults (bugs) into software. These bugs stay dormant in the software until an input pattern exercises the faulty part of the code [4]. The classic strategy is to find and remove bugs during development and testing, because fixing them in operation is far more costly.
Software bugs fall into two types: Bohrbugs and Heisenbugs. Bohrbugs manifest consistently under the same conditions, so they can be reproduced. Heisenbugs are triggered only when a particular set of events occurs in a particular order. They are hard to reproduce, which is why programmers and testers cannot find them easily [5].
Hardware Failures:
A hardware failure occurs when a physical component of the system stops working. Components such as storage devices, network devices, or CPUs can fail during operation, either one at a time or in combination. Hardware failures mostly originate in hardware design, in manufacturing, or in wear-out [3].
Power Failure:
Software and hardware are not the only causes of unavailability. Power plays an important role in high availability. Without a proper backup power system, a power failure can bring the system down. Loss of power can also stop cooling in the data center, and the resulting heat can cause hardware to stop working.
Maintenance:
A system can be unavailable because of maintenance or operational errors. A poor maintenance plan may make the system unavailable during crucial hours. Maintenance should follow a proper schedule and be carried out when load on the system is minimal.
Human Mistakes:
A system can go down because of a human mistake, whether due to inexperience or poor planning. For example, an administrator who wants to change the system may stop the network services instead of the intended service [6].
Overall Factors:
According to statistics, 40% of total downtime is due to software failures, 30% to planned maintenance or upgrades, 15% to careless mistakes by people, 10% to hardware failure, and 5% to the environment [3].
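To make these proportions concrete, the short sketch below splits a given amount of downtime across the five causes using the percentages quoted above. The 50-hour total in the usage example is an assumed figure, and the function name is hypothetical.

```python
# Share of total downtime per cause, as quoted in the text [3].
DOWNTIME_SHARE = {
    "software failure": 0.40,
    "planned maintenance or upgrades": 0.30,
    "careless mistakes": 0.15,
    "hardware failure": 0.10,
    "environment": 0.05,
}


def downtime_by_cause(total_hours: float) -> dict:
    """Split a total downtime figure across causes using the quoted shares."""
    return {cause: share * total_hours for cause, share in DOWNTIME_SHARE.items()}


if __name__ == "__main__":
    # Assumed example: 50 hours of downtime in a year.
    for cause, hours in downtime_by_cause(50).items():
        print(f"{cause}: {hours:.1f} hours")
```

Under that assumption, software failures alone would account for about 20 of the 50 hours, which is why software is usually the first place to look when improving availability.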
References
[1] F. Piedad, High Availability: Design, Techniques, and Processes. 2001.
[2] J. Gray and D. P. Siewiorek, “High-Availability Computer Systems,” Computer, vol. 24, no. 9, pp. 39-48, 1991.
[3] H. Aziz, “High Availability, Lecture slides in Server Architecture subject.”
[4] J.-C. Laprie, “Dependable Computing and Fault Tolerance: Concepts and Terminology,” in Twenty-Fifth International Symposium on Fault-Tolerant Computing, “Highlights from Twenty-Five Years,” 1995, p. 2.
[5] M. Grottke and K. S. Trivedi, “Fighting Bugs: Remove, Retry, Replicate, and Rejuvenate,” Computer, vol. 40, no. 2, pp. 107-109, 2007.
[6] A. Wood, “Predicting client/server availability,” Computer, vol. 28, no. 4, p. 41, 1995.
Posted on: December 9th, 2011
Introduction
A system is available if, when a user requests a service, the user gets a proper response and the server completes the requested job. Availability is also defined as the ratio of mean time in service to total time [1]. Different systems have different availability requirements, and critical systems have very strict ones. If a user tries to access the system and does not get a proper response, the system is unavailable. Many things can cause this, including software, power, and hardware failures [2].
Solutions in High Availability:
The main causes of system unavailability, and ways to address each of them, are described below.
Software Failure:
Software failure is one of the major causes of system unavailability. Software fails because of unhandled errors in programs [3]. These errors reside in the program and are triggered when an external input reaches the faulty part of the code. Software bugs can be divided into two categories: Bohrbugs and Heisenbugs [4]. Bohrbugs can be reproduced, so developers and testers can detect and remove them. Heisenbugs are hard to reproduce, so they are difficult to find and remove during software development.
Because Heisenbugs behave non-deterministically, repeating the same steps often succeeds, so restarting the application can solve the problem. Restarting can be supported by checkpoints: the system regularly saves a snapshot of its state during execution, and after a restart it restores the most recent saved state.
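The checkpoint-and-restart idea can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the job is a running sum over a list, the snapshot is a small JSON file, and the `fail_at` parameter exists purely to simulate a transient failure. All names here are hypothetical.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_index": 0, "total": 0}


def save_checkpoint(path, state):
    """Write the snapshot to a temporary file, then rename it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # avoids leaving a half-written checkpoint


def run_job(items, path, fail_at=None):
    """Sum the items, checkpointing after each one; resume if a checkpoint exists."""
    state = load_checkpoint(path)
    for i in range(state["next_index"], len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated transient failure")
        state["total"] += items[i]
        state["next_index"] = i + 1
        save_checkpoint(path, state)
    return state["total"]


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "job.ckpt")
    try:
        run_job([1, 2, 3, 4], ckpt, fail_at=2)  # first run crashes midway
    except RuntimeError:
        pass
    print(run_job([1, 2, 3, 4], ckpt))  # restart resumes from the snapshot
```

The restarted run does not redo the first two items. It picks up at the index stored in the snapshot, which is the behavior the text describes.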
Another approach for software is to build redundant components into large-scale applications. A redundant component acts as a backup and takes over if the active component fails. Software redundancy prevents a single component failure from making the system unavailable by detecting the failing component and replacing it, ideally before it fails completely.
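A simple way to picture redundant components is a caller that tries a primary component and falls back to a backup when the primary fails. The sketch below assumes a hypothetical `Replica` class whose failure is signalled by `ConnectionError`. A real system would add health checks and monitoring, which are omitted here.

```python
class Replica:
    """A hypothetical service component that may be healthy or failed."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


def call_with_failover(replicas, request):
    """Try each replica in order and return the first successful response."""
    errors = []
    for replica in replicas:
        try:
            return replica.handle(request)
        except ConnectionError as exc:
            errors.append(str(exc))  # remember the failure and try the next one
    raise RuntimeError("all replicas failed: " + "; ".join(errors))


if __name__ == "__main__":
    replicas = [Replica("primary", healthy=False), Replica("backup")]
    print(call_with_failover(replicas, "GET /status"))
```

The caller sees a normal response even though the primary is down, because the backup takes over transparently.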
Hardware Failure:
A hardware failure occurs when the system goes down because a physical component fails. Hardware redundancy addresses this: it prevents unavailability by detecting a failing component before it actually fails and by bypassing a failure when one does occur. Server-class hardware can be used for this purpose. It monitors the server's components for failure, and when a component fails it notifies the administrator and switches to a redundant component so that the server keeps working through the failure [5].
Another way to limit the impact of hardware failures is to apply fault-tolerant design when designing hardware components. Fault-tolerant design relies on modularity, fail-fast behavior, and independent failure modes. Modularity is the decomposition of the whole system into independent components, so that a failure affects only one module instead of the whole system. Fail-fast means that each module either operates correctly or stops immediately, so a fault is detected at once rather than spreading. Independent failure modes mean that each module works on its own, so that when a single module fails the other components keep working without interruption.
Power Failure:
Proper backup power systems should be installed with the servers so that they take over when the main power fails. A UPS and an alternative power source should be installed to handle such failures.
Maintenance Issues:
A system can be unavailable because of a badly planned maintenance window. For example, if maintenance is carried out during peak hours, the majority of users suffer. There must be a proper maintenance plan: maintenance should be done when load on the system is minimal, and users should be notified in advance so that anyone who planned to use the system during that period can choose another time slot.
Human Mistakes:
A system can be unavailable because of a human mistake. For example, an administrator may stop the wrong service and make the whole system inaccessible. To reduce this risk, administrators need proper training and expertise before they work on critical components of the system [6].
References:
[1] H. Aziz, “High Availability, Lecture slides in Server Architecture subject,” 2011.
[2] J. Gray, “Why Do Computers Stop And What Can Be Done About It?,” 1985.
[3] J.-C. Laprie, “Dependable Computing and Fault Tolerance: Concepts and Terminology,” in Twenty-Fifth International Symposium on Fault-Tolerant Computing, “Highlights from Twenty-Five Years,” 1995, p. 2.
[4] M. Grottke and K. S. Trivedi, “Fighting Bugs: Remove, Retry, Replicate, and Rejuvenate,” Computer, vol. 40, no. 2, pp. 107-109, 2007.
[5] “Preventing Downtime with Redundant Components.” [Online]. Available: http://technet.microsoft.com/en-us/library/cc917700.aspx. [Accessed: 07-May-2011].
[6] A. Wood, “Predicting client/server availability,” Computer, vol. 28, no. 4, p. 41, 1995.
Posted on: December 9th, 2011
Introduction
High Performance Computing (HPC) uses computer clusters to solve large-scale problems. Using multiple computers to solve a single problem raises many problems of its own, and this report discusses how they can be solved. A problem can have more than one solution, and the right choice depends on the situation in which it is applied. This report introduces each solution.
Solutions in High Performance Computing:
Here are the most common problems in High Performance Computing and their solutions.
Scheduling Issues
The most common problem in high performance computing is the scheduling of resources. Many scheduling algorithms can be used to address it. Examples are First Come First Serve, in which the job that arrives first has the highest priority, and Shortest Job First, in which the shortest job has the highest priority. The right choice depends on the environment in which high performance computing is deployed [1].
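The difference between the two policies can be seen by comparing average waiting time on a small batch of jobs. The sketch below assumes all jobs arrive at time zero, run one at a time without preemption, and have known run times. The job lengths are made up for illustration.

```python
def average_wait(run_times):
    """Average time jobs spend waiting when executed in the given order."""
    elapsed = 0
    total_wait = 0
    for run_time in run_times:
        total_wait += elapsed  # this job waited for everything before it
        elapsed += run_time
    return total_wait / len(run_times)


def fcfs_average_wait(run_times):
    """First Come First Serve: run jobs in arrival order."""
    return average_wait(run_times)


def sjf_average_wait(run_times):
    """Shortest Job First: run the shortest job first (non-preemptive)."""
    return average_wait(sorted(run_times))


if __name__ == "__main__":
    jobs = [6, 8, 7, 3]  # assumed run times, in arrival order
    print("FCFS average wait:", fcfs_average_wait(jobs))  # 10.25
    print("SJF average wait:", sjf_average_wait(jobs))    # 7.0
```

Shortest Job First lowers the average wait in this example, but it requires run times to be known in advance and can delay long jobs, which is one reason the choice depends on the environment.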
Load balancing
Workload and resource management are two important functions provided at the service level of grid computing. Load balancing algorithms can be divided into two categories: static and dynamic. In static load balancing the workload is known at the start, so we can calculate the effort required and distribute the workload among the available clusters. Static load balancing performs well on homogeneous clusters with an equal workload, but it runs into problems when the workload is dynamic [2]. In those conditions we use dynamic load balancing. There are many algorithms for dynamic load balancing, such as round-robin or biasing algorithms. It is very difficult to suggest an optimal solution for dynamic load balancing precisely because of its dynamic nature [3].
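The following sketch contrasts round-robin assignment, which ignores how busy each worker is, with a simple least-loaded policy that reacts to the current load. The least-loaded policy is an assumption added here for comparison and is not one of the algorithms named in the cited sources. Task costs are invented.

```python
import heapq


def round_robin(task_costs, n_workers):
    """Assign tasks to workers in turn; return the total load per worker."""
    loads = [0] * n_workers
    for i, cost in enumerate(task_costs):
        loads[i % n_workers] += cost
    return loads


def least_loaded(task_costs, n_workers):
    """Assign each task to the worker with the smallest load so far."""
    heap = [(0, worker) for worker in range(n_workers)]  # (load, worker id)
    heapq.heapify(heap)
    loads = [0] * n_workers
    for cost in task_costs:
        load, worker = heapq.heappop(heap)
        loads[worker] = load + cost
        heapq.heappush(heap, (loads[worker], worker))
    return loads


if __name__ == "__main__":
    tasks = [8, 1, 1, 1, 8, 1]  # assumed task costs, uneven on purpose
    print("round-robin loads: ", round_robin(tasks, 2))   # [17, 3]
    print("least-loaded loads:", least_loaded(tasks, 2))  # [9, 11]
```

With uneven task costs, round-robin leaves one worker with most of the work, while the load-aware policy finishes sooner because the busiest worker has less to do.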
Race conditions
Parallel computing introduces the problem of race conditions. One solution is to ensure that a program has exclusive rights to the resources it requires, which can be achieved with locking. Several locking techniques can be used to avoid race conditions, such as POSIX record locks and mandatory locks based on System V’s mandatory locking scheme [4].
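The exclusive-access idea can be illustrated with threads that update a shared counter. This sketch uses Python's in-process `threading.Lock` as a stand-in. It is not a POSIX record lock or a mandatory file lock, which apply to files shared between processes. The class and function names are hypothetical.

```python
import threading


class Counter:
    """A shared counter whose updates are protected by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Only one thread at a time may run the read-modify-write below.
        with self._lock:
            self.value += 1


def run_workers(counter, n_threads=4, n_increments=10_000):
    """Start several threads that all increment the same counter."""

    def work():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value


if __name__ == "__main__":
    print(run_workers(Counter()))  # 40000
```

Without the lock, two threads could read the same old value and both write back the same new value, losing an update. Holding the lock around the update makes the result independent of thread timing.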
Fault tolerance
Availability is an important aspect of high performance computing. It measures how much of the time the system is available. A system may be unavailable because of a hardware or software failure, and the solution is to make the system fault tolerant. To tolerate hardware failures, we can build fault-tolerant hardware in which the system is decomposed into modules. Failures are then isolated to individual modules and do not trigger further failures. Redundant hardware can also be used. To make software fault tolerant, wrapper and rejuvenation techniques can be used [5].
Programming for parallel computers
High performance computing has a complex architecture, which makes programming more complex. We can address this by introducing new programming models that act as a bridge between the programmer and the hardware. The key when implementing these programming models is the balance between productivity and efficiency [6][7].
References
[1] M. L. Fisher, “Optimal Solution of Scheduling Problems Using Lagrange Multipliers: Part I,” Operations Research, vol. 21, no. 5, pp. 1114-1127, Sep. 1973.
[2] M. Naiouf, L. De Giusti, F. Chichizola, and A. De Giusti, “Dynamic Load Balancing on Non-homogeneous Clusters,” in Frontiers of High Performance Computing and Networking–ISPA 2006 Workshops, 2006, pp. 65–73.
[3] C. Kopparapu, Load Balancing Servers, Firewalls, and Caches. New York: John Wiley & Sons, Inc., 2002.
[4] D. A. Wheeler, “Secure Programming for Linux and Unix HOWTO,” p. 00, 2003.
[5] F. Piedad, High Availability: Design, Techniques, and Processes. 2001.
[6] W. D. Gropp, “Performance driven programming models,” in Massively Parallel Programming Models, 1997. Proceedings. Third Working Conference on, 1997, pp. 61-67.
[7] K. Asanovic et al., The Landscape of Parallel Computing Research: A View from Berkeley. Citeseer, 2006.
Posted on: December 8th, 2011
Introduction
High Performance Computing (HPC) uses computer clusters to solve large-scale problems. A computer cluster is a group of interlinked computers that work together in a way that makes them look like a single computer. The main purpose of high performance computing is to use the parallel processing power of interlinked computers to solve large problems quickly and efficiently. It is normally used for scientific research or for solving large-scale problems.
Problems in High Performance Computing:
Here are the most common problems in High Performance Computing.
Scheduling Issues
The most common problem in high performance computing is the scheduling of resources. Scheduling in parallel computing is a composite decision problem: where and when a process will be executed, which processor will execute it, and in which order. The complexity increases when applications must be scheduled to run in parallel on heterogeneous, geographically dispersed distributed systems [1].
Race conditions
Parallel computing introduces the problem of race conditions. A race condition is a flaw in which processes race to acquire a shared resource first, so the outcome depends on the order in which they run. Proper design techniques help designers recognize and eliminate race conditions before they cause problems.
Security Issues
In traditional computing the programmer protects the system from its users and protects one user's data from other users. In parallel or grid computing we must also protect the application and its data from the system on which it executes, and protect local execution from remote systems. This requires stronger authentication for users and different security policies for each administrative domain [2].
Resource Management
High performance computing involves a large number of resources and many applications. The resources can be heterogeneous and geographically distributed. Managing them requires precise scheduling of the resources and their utilization. Access to and authorization for resources must be properly controlled to avoid deadlock [3].
Load balancing
Workload and resource management are two important functions provided at the service level of grid computing. To achieve high throughput, workloads must be scheduled evenly among the available resources. One of the main problems of high performance computing is balancing the load of different processes across resources to achieve maximum throughput [4].
Software lockout
With the introduction of multiprocessor computers, software lockout became a source of performance degradation, because CPUs sit idle while they wait. Software lockout is the biggest cause of scalability loss in a multiprocessor system, and it limits the maximum useful number of processors.
References
[1] L. W. Dowdy, E. Rosti, G. Serazzi, and E. Smirni, “Scheduling issues in high-performance computing,” ACM SIGMETRICS Performance Evaluation Review, vol. 26, no. 4, pp. 60–69, 1999.
[2] C. Neuman, Security, accounting, and assurance. Morgan Kaufmann, pp. 2\oe48, 1999.
[3] E. Afrash and A. M. Rahmani, “A New Architecture for Better Resource Management in Grid Systems,” in Convergence and Hybrid Information Technology, 2008. ICCIT ’08. Third International Conference on, 2008, vol. 2, pp. 194-198.
[4] B. Yagoubi and Y. Slimani, “Dynamic load balancing strategy for grid computing,” Transactions on Engineering, Computing and Technology, vol. 13, pp. 260–265, 2006.
Posted on: December 8th, 2011