Sudhakar Davuluri's Blog: How I Investigate Node Reboot or Database System Restart in 10 Minutes

Database system restarts, node evictions, and node reboots are day-to-day investigations for a DBA. Because DBAs usually own these machines, they are the first to look for the cause. Here, I am talking about both single-instance databases and Real Application Clusters (RAC) databases. Node reboots happen in both cases, though the reasons can differ.

Here, I will discuss an approach to investigating a database system restart or node reboot. This approach applies to both single-instance databases and RAC.

For every node reboot or database system restart, first check the /var/log/messages file to determine whether this was a self-initiated reboot (a node "suicide") or a node eviction. Let's look at an example in a Linux environment.

System Restart due to Node Eviction: First, let's see what a node eviction message looks like.

Feb 18 17:20:42 db01 kernel: SysRq : Resetting
Feb 18 17:20:44 db01 kernel: printk: 6 messages suppressed.
Feb 18 17:20:44 db01 kernel: type=1701 audit(1392744044.855:28194): auid=4294967295 uid=1000 gid=1001 ses=4294967295 pid=8368 comm="ocssd.bin" sig=6
Feb 18 17:24:26 db01 syslogd 1.4.1: restart.
Feb 18 17:24:26 db01 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Feb 18 17:24:26 db01 kernel: Linux version 2.6.18-238.12.2.0.2.el5 (mockbuild@ca-build9.us.oracle.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-50)) #1 SMP Tue Jun 28 05:21:19 EDT 2011

In the above logs, the important point to notice is "kernel: SysRq : Resetting". Whenever you see a SysRq reset issued just before the restart, it indicates that this node was evicted from the cluster. So you should investigate in the node eviction direction by collecting diagcollection output from the evicted node and from the other nodes. Node eviction is a huge topic, so I will not write about it in this post.
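As a rough illustration of this first check, here is a minimal Python sketch that finds each restart marker in a messages file and looks for a SysRq reset in the lines logged just before it. The function names, the 20-line window, and the idea of keying on the "syslogd ...: restart." line are my assumptions for this example; systems that log through a different syslog daemon write a different startup line, so adjust the pattern to match.

```python
import re

# Startup line written by syslogd after a reboot, as seen in the log above.
# Assumption: other syslog daemons use a different marker.
RESTART = re.compile(r"syslogd .*: restart\.")

def lines_before_restarts(lines, window=20):
    """For each restart marker, return up to `window` lines logged before it."""
    contexts = []
    for i, line in enumerate(lines):
        if RESTART.search(line):
            contexts.append(lines[max(0, i - window):i])
    return contexts

def was_sysrq_reset(context):
    """True if the kernel logged a SysRq reset in the given lines."""
    return any("SysRq : Resetting" in line for line in context)

sample = [
    "Feb 18 17:20:42 db01 kernel: SysRq : Resetting",
    "Feb 18 17:20:44 db01 kernel: printk: 6 messages suppressed.",
    "Feb 18 17:24:26 db01 syslogd 1.4.1: restart.",
]

for ctx in lines_before_restarts(sample):
    if was_sysrq_reset(ctx):
        print("SysRq reset found: investigate node eviction")
    else:
        print("No SysRq reset found: look at the node itself")
```

For a real file, you could pass `open("/var/log/messages").read().splitlines()` instead of `sample`.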

Node Reboot/System Restart due to Memory Pressure: There are cases where the system restarts because of high load. The messages file may look like this in that case.

Feb 12 07:32:42 db02 kernel: Total swap = 25165816kB
Feb 12 07:32:42 db02 kernel: Free swap:        97972kB
Feb 12 07:32:42 db02 kernel: 25690112 pages of RAM
Feb 12 07:32:42 db02 kernel: 978057 reserved pages
Feb 12 07:32:42 db02 kernel: 175190262 pages shared
Feb 12 07:32:42 db02 kernel: 45130 pages swap cached
Feb 12 07:35:49 db02 xinetd[7315]: START: omni pid=8176 from=::ffff:10.77.9.254
Feb 12 07:57:57 db02 syslogd 1.4.1: restart.
Feb 12 07:57:57 db02 kernel: klogd 1.4.1, log source = /proc/kmsg started.

In the above case, no SysRq is reported. This indicates a self-initiated reboot (suicide), so don't look at the cluster side to investigate. There are many possible reasons for such a reboot, but the most common is high load causing memory pressure until the system reboots. In the log above, for example, only 97972 kB of the 25165816 kB of swap was still free. To investigate high load, start by looking at the OS Watcher top output at the system restart time and just before it.
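To put a number on the memory pressure in a dump like the one above, the swap lines can be turned into a usage percentage. This is only a sketch: the function name and the regular expression are my own, and they assume the "Total swap" and "Free swap" lines are formatted as in this particular log.

```python
import re

# Matches "Total swap = 25165816kB" and "Free swap:        97972kB".
# Assumption: the kernel memory dump uses the format shown in the log above.
SWAP = re.compile(r"kernel: (Total swap|Free swap)\s*[:=]\s*(\d+)\s*kB")

def swap_used_percent(lines):
    """Return the percentage of swap in use, or None if the dump is incomplete."""
    values = {}
    for line in lines:
        match = SWAP.search(line)
        if match:
            values[match.group(1)] = int(match.group(2))
    total = values.get("Total swap")
    free = values.get("Free swap")
    if not total or free is None:
        return None
    return 100.0 * (total - free) / total

sample = [
    "Feb 12 07:32:42 db02 kernel: Total swap = 25165816kB",
    "Feb 12 07:32:42 db02 kernel: Free swap:        97972kB",
]

pct = swap_used_percent(sample)
print(f"{pct:.1f}% of swap in use")
```

A swap area that is almost fully used shortly before the restart supports the memory pressure theory, but it is a hint rather than proof, so still confirm it against the OS Watcher data.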

Node Reboot/System Restart due to Linux Kernel Bug: Another reason for a self-initiated reboot could be a Linux kernel bug that causes a kernel panic and reboots the system. The log files may look like this.

---[ end trace 288cce3e7b8bd8ba ]---
Kernel panic - not syncing: Fatal exception
Pid: 6381, comm: Thread-13686 Tainted: G      D    2.6.32-300.4.1.el5uek #1
Call Trace:
[] panic+0xa5/0x162
[] ? native_cpu_up+0x8af/0xa5d
[] ? xen_restore_fl_direct_end+0x0/0x1
[] ? _spin_unlock_irqrestore+0x16/0x18
[] ? release_console_sem+0x194/0x19d
[] ? console_unblank+0x6a/0x6f
[] ? print_oops_end_marker+0x23/0x25
[] oops_end+0xb7/0xc7
[] die+0x5a/0x63
[] do_trap+0x115/0x124
[] do_invalid_op+0x9c/0xa5
[] ? do_exit+0x67d/0x696
[] ? __dequeue_entity+0x33/0x38
[] ? pick_next_task_fair+0xa5/0xb1
[] invalid_op+0x1b/0x20
[] ? do_exit+0x67d/0x696
[] ? do_exit+0x67d/0x696
[] complete_and_exit+0x0/0x23
[] system_call_fastpath+0x16/0x1b
May 30 09:02:57 db008 syslogd 1.4.1: restart.

In a panic situation, you will see the "panic" keyword in the messages file a few lines above the restart. Most likely this is because of a Linux kernel bug, so ask the Linux team for help investigating further.
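The three cases above can be combined into one small triage step. The sketch below is only an illustration of that decision order, not a complete diagnostic: the function name and the returned messages are hypothetical, and the signatures it looks for are just the ones that appear in the logs in this post.

```python
def classify_reboot(context):
    """Suggest where to start, given the lines logged before a restart."""
    text = "\n".join(context).lower()
    if "sysrq : resetting" in text:
        return "node eviction: collect clusterware diagnostics from all nodes"
    if "kernel panic" in text:
        return "kernel panic: involve the Linux team"
    if "free swap" in text:
        # A memory dump before the restart hints at memory pressure.
        return "possible memory pressure: review OS Watcher top output"
    return "unknown: no known signature before the restart"

print(classify_reboot(["Feb 18 17:20:42 db01 kernel: SysRq : Resetting"]))
print(classify_reboot(["Kernel panic - not syncing: Fatal exception"]))
print(classify_reboot(["Feb 12 07:32:42 db02 kernel: Free swap:        97972kB"]))
```

The SysRq check comes first because it is the one that sends the investigation to the cluster side; everything else points back at the node itself.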

I have seen cases where a node restarted due to memory pressure or a kernel panic, but the DBA started the investigation from the cluster alert logs. I hope this post helps readers start their investigation in the right direction after a node reboot or database system restart.

Please share if you have seen other causes of node reboots or system restarts.

Wednesday, July 23, 2014
