Changping Beida Jade Bird Shares: How Operations Programmers Can Deal with Online Problems Quickly
For most operations programmers, it is essential to watch constantly for potential problems in servers and system programs and to resolve them before they cause damage. Today we look, through case analysis, at how operations programmers can deal with online problems quickly.

Once you fall into a pit, the wise sequence is: jump out of the pit -> fill the pit -> avoid the pit. Online fault handling follows the same process, with priority running from high to low. The objectives of online fault handling are as follows:

Jump out of the pit

Jump out of the pit: quickly restore online service, or reduce the impact on online service to a minimum.

The availability of online services determines the interests of the service provider's customers and affects the company's income. Once the online environment is unavailable and users cannot be served, the company or team suffers economic losses and, more seriously, damage to its reputation. For this reason companies generally set stability and reliability requirements for the online environment, and these become a KPI for the team and even the department. The first task after a production failure is therefore to restore production service. Even if the service cannot be fully restored, we should do our best to minimize the impact on it.

Fill the pit

Fill the pit: find the cause of the problem and solve it fundamentally.

After restoring online service and minimizing the impact on users, the company, and the team, we need to investigate the problem thoroughly, find the root cause of the failure, and solve it fundamentally. Usually, filling the pit and jumping out of the pit happen at the same time, and once the pit is filled the jump has succeeded. In an emergency, however, there are special ways to "jump out of the pit", such as restarting services, or degrading or circuit-breaking services. In those cases the pit has not actually been filled yet; unconventional means are used to get out of it first.
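The emergency "jump out of the pit" by service degradation described above can be sketched as follows. This is only an illustrative sketch, not a method taken from the article: `call_with_degradation`, `fetch_recommendations`, and `default_recommendations` are hypothetical names, and the failing dependency is simulated.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_degradation(primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Return primary() if it succeeds, otherwise the degraded fallback() result."""
    try:
        return primary()
    except Exception:
        # Stopgap only: the root cause is still unfixed (the pit is not filled yet).
        return fallback()


def fetch_recommendations() -> List[str]:
    # Hypothetical dependency, simulated here as failing in production.
    raise TimeoutError("recommendation service timed out")


def default_recommendations() -> List[str]:
    # Hypothetical static fallback that keeps the page usable.
    return ["bestseller-1", "bestseller-2"]


result = call_with_degradation(fetch_recommendations, default_recommendations)
print(result)
```

The degraded path keeps users served while the investigation continues. It does not replace the later root-cause fix.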

Avoid the pit

Avoid the pit: generalize from this incident and eliminate hidden dangers.

After finding the root cause and solving the problem, we need to generalize from the incident and examine the weak points in how it was investigated and handled. Which processes, specifications, or systems need to be improved? Do similar problems exist in other systems or teams? This reflection and self-criticism produces an online incident report, the process is continuously improved so that the team does not fall into the same pit again, and the experience is shared within the team.

Thoughts on online fault handling

Following the goals and priorities above, the first goal of online fault handling is to restore online service or reduce the impact on it, and the key word is "fast". Once "jumping out of the pit" and "filling the pit" are done, attention turns to "avoiding the pit". The steps of online fault handling can therefore be divided into:

Fault discovery

Fault location

Fault elimination

Fault backtracking

Of these, the first three steps belong to "jumping out of the pit", and the last step covers "filling the pit" and "avoiding the pit".

These steps are not meant to be carried out strictly one after another from top to bottom. It is better to run them in parallel and in an orderly way, because after an online fault an emergency fault-handling process is usually started and every role takes part: operations, development, testing, and product. The team divides the work, pools findings as they come in, and quickly troubleshoots and restores the service. The idea is similar to the fork/join design in operating systems, and its aim is to improve efficiency.
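The fork/join idea mentioned above can be sketched with Python's standard `concurrent.futures` module. This is a minimal illustration under stated assumptions: the three check functions and the findings they return are hypothetical placeholders, not data from a real incident.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def check_ops() -> str:
    # Hypothetical finding from the operations role.
    return "ops: CPU and disk look normal"


def check_dev() -> str:
    # Hypothetical finding from the development role.
    return "dev: error rate rose after the latest release"


def check_test() -> str:
    # Hypothetical finding from the testing role.
    return "test: problem reproduces in staging"


def investigate_in_parallel(tasks: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Fork one task per role, then join by collecting every result."""
    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        futures = {role: pool.submit(task) for role, task in tasks.items()}  # fork
        return {role: future.result() for role, future in futures.items()}  # join


findings = investigate_in_parallel(
    {"ops": check_ops, "dev": check_dev, "test": check_test}
)
for role, finding in findings.items():
    print(role, "->", finding)
```

Each role works independently, and the results are merged in one place, which mirrors dividing the work and then summarizing the messages.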

When the cause of the fault cannot be found quickly, we should decisively skip the fault location step and eliminate the fault directly, for example by using service degradation or server expansion to keep the online service at least minimally available. Changping Beida Jade Bird suggests waiting until the online service has been restored, and then taking the time to locate the cause of the failure and solve the problem fundamentally.
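One way to make "decisively skip fault location" concrete is to give the investigation a time budget and fall back to mitigation when it runs out. The sketch below is an assumption for illustration only: `handle_fault`, `locate_until`, `degrade_service`, `apply_fix`, and the 0.05 second budget are all hypothetical, and a real budget would be minutes rather than milliseconds.

```python
import time
from typing import Callable, List, Optional

actions: List[str] = []  # records what was done, for illustration


def locate_until(deadline: float) -> Optional[str]:
    """Hypothetical investigation that finds no cause before the deadline."""
    while time.monotonic() < deadline:
        time.sleep(0.01)
    return None


def degrade_service() -> None:
    # Hypothetical mitigation, e.g. switching off non-critical features.
    actions.append("degraded non-critical features")


def apply_fix(cause: str) -> None:
    # Hypothetical root-cause fix.
    actions.append("fixed: " + cause)


def handle_fault(
    locate: Callable[[float], Optional[str]],
    mitigate: Callable[[], None],
    fix: Callable[[str], None],
    budget_seconds: float,
) -> str:
    """Time-box fault location; mitigate first if no cause is found in time."""
    deadline = time.monotonic() + budget_seconds
    cause = locate(deadline)
    if cause is None:
        mitigate()  # restore service now, locate the root cause afterwards
        return "mitigated"
    fix(cause)
    return "fixed"


outcome = handle_fault(locate_until, degrade_service, apply_fix, budget_seconds=0.05)
print(outcome, actions)
```

When the outcome is "mitigated", fault location still has to be resumed after the service is stable.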