In this post, I will describe some useful tips on how to debug under Linux. Linux offers advanced tools suitable for profiling, debugging and crash analysis. Below, I shall describe few of the most common and useful techniques for debugging. Namely dynamic debugging, Oops analysis, strace etc..
Debugging via printing: printk
The printk function, similar to printf for user space programs, writes messages into a circular buffer that is LOG_BUF_LEN (defined in kernel/printk.c) bytes long. It then wakes any process that is waiting for messages, that is, any process that is sleeping in the syslog system call or that is reading /proc/kmsg. These two interfaces{syslog and kmsg} to the logging engine are almost equivalent.
One of the unique features that printk offers is that it lets you classify messages according to their severity by associating different loglevels, or priorities, with the messages. Usually such log-level associate with a macro and an integer value. For example, KERN_INFO, is one of the possible log-levels of the message has an integer value 6. The log-level macro expands to a string, which is concatenated to the message text at compile time; that’s why there is no comma between the priority and the format string in the following examples. Here are two examples of printk commands, a debug message and a critical message:
printk(KERN_DEBUG “Here I am: %s:%i\n”, __FILE__, __LINE_&_);
printk(KERN_CRIT “I’m trashed; giving up on %p\n”, ptr);
There are eight possible log-level strings, defined in the header <linux/kernel.h>. Integers range from 0 to 7, with smaller values representing higher priorities.
A printk statement with no specified priority defaults to DEFAULT_MESSAGE_LOGLEVEL, specified in kernel/printk.c as an integer. The default loglevel value has changed several times during Linux development and is currently 4. Based on the loglevel, the kernel prints the message to current console, be it a text-mode terminal, a serial line printer, or a parallel printer. If the priority is less than the integer variable console_loglevel, the message is displayed. If both klogd and syslogd are running on the system, kernel messages are appended to /var/log/messages , independent of console_loglevel. If klogd is not running, the message won’t reach user space unless you read /proc/kmsg. Messages in /var/log/messages can be read via dmesg utility.
The variable console_loglevel is initialized to DEFAULT_CONSOLE_LOGLEVEL and can be modified. From version 2.1.31 on it is possible to read and modify the console log-level using the text file /proc/sys/kernel/printk. The file hosts four integer values, the current console log-level and the default level for messages. If we do a look at the current console loglevel in 2.6.36 kernel, it defaults to 4. Meaning all messages from KERN_ERR[3] to KERN_EMERG[0] would get printed.
root@arunks-PC1# less /proc/sys/kernel/printk
4 4 1 7
Setting 8 as the current console value ensures, all info and debug messages appear on console. Thus by doing as below:
root@arunks-PC1# echo 8 > /proc/sys/kernel/printk
root@arunks-PC1# less /proc/sys/kernel/printk
8 4 1 7
ensures that all messages with priority lower than 8 get printed to console.
A massive use of printk can slow down the system noticeably, because syslogd
keeps syncing its output files; thus, every line that is printed causes a disk operation.
This is the right implementation from syslogd ’s perspective. It tries to write everything to disk in case the system crashes right after printing the message. To overcome this bottleneck, kernel offers the technique of Dynamic Debugging as described below.
Dynamic Debugging
Even though, debugging via printk is the most common form of debugging kernel code, it comes with its own inefficiencies and complications. Though printk offers a verbose mode for developers to classify the prints as per priority and also to define whether they should hit console or syslogd, yet processing of printk’s involve extensive rendering by kernel slowing down the performance.
Not only performance, there are other issues as well with use is printk in kernel architecture. Existing kernel debug architecture involves use of pr_debug and various other macros. 2.6.29 kernel had 3058 pr_debug() calls, 3158 dev_dbg() calls, 5618 dprintk() calls, 206 DEBUGP() calls, and countless additional debugging statements. Some subsystems have developed sophisticated debug statement frameworks, allowing fine grain control via bit flags and debug levels. As may be noticed, the kernel is sprinkled with debug statements that are only available by individually re-compiling the various subsystems of the kernel. So, the big question is how do we enable debugs in any kernel module without editing and recompiling the kernel?Before 2.6.28, the only way seemed to be to edit and recompile the specific kernel module after enabling the compile time debug flags either in the Makefile or in the headers etc. From kernel 2.8.28 onwards, the concept of Dynamic debug has been introduced. So now we can enable/disable any kernel debug statements without recompiling and that too on the fly. Even subsystems that have implemented complex and more granular logs could be also be controlled via dynamic debug.So does Dynamic debugging help in any other way apart from enabling/disabling debugs on the runtime? Alternatively it seems to be handy too with different version of debug macros and facilities. And the developer is offered great flexibility in terms. Consider this: printk messages have different levels of verbosity attached. with KERN_EMERG being most severe to KERN_DEBUG with least severe. Such severity control where eventually these printks should land up or may be not land at all. But even if a printk/debug messages doesn’t end up either in syslog or on consoled, the kernel needs to render the same. Stats point out that there is more than 86% degradation of performance when enabling all debug statements but not logging these messages to system logs or console. Amazing but in the current kernel debug infrastructure, enabling logs via makefiles etc do incur heavy penalties in terms of kernel size and performance.Dynamic debug addresses these two concerns. i.e ease of enabling/verbosity and performance. The verbosity part is tackled using a unique language which allows fine grain control of debug messages as well as at a coarse level. The performance part is taken care by introduction of a new technique called Bloom filters. Essentially all info regarding debug statements is stored as meta data which then is organized as multiple and efficient hashes to speed up performance and speed. On the flip-side enabling dynamic debug option in kernel results in a penalty of increasing kernel text size by 2%.
Usage and Examples
Dynamic debugging could be enabled under Kernel Hacking -> Enable Dynamic printk support[CONFIG_DYNAMIC_DEBUG]. It provides a debugfs control file: <debugfs>/dynamic_debug/control which can be read to display the complete list of known debug statements, to help guide the programmer.The behaviour of pr_debug() and dev_debug() are controlled by writing to a control file in the debugfs filesystem. Here is how a typical control file looks like
Subsequently, we refer to the control file as: <debugfs> /dynamic_debug/control. In my Ubuntu 10.10 setup, the debugfs is mounted in </sys/kernel/debug>.
Now if we want to enable BT_DBG on line 91 in source file l2cap.c, the command would be like,
root@arunks-PC1: echo ’file l2cap.c line 91 +p’ > /sys/kernel/debug/dynamic_debug/control
root@arunks-PC1:echo -n ‘func svc_process +p’ > /sys/kernel/debug/dynamic_debug/control
Debugging via straceUser space programs can be watched under strace. Strace command is a powerful tool that shows all the system calls issued by a user space program. Not only does it shows the calls, but it can also show arguments to the calls, as well as return values in symbolic form. When a system call fails, both the symbolic value of the error and the corresponding string are displayed. Strace has many command line options too : -T being most notable displaying time spent in each call. By default, strace prints tracing information on stderr.What is more wonderful of strace is that it can trace a program regardless of whether the program was compiled with debugging support[-g option of gcc] or whether such information is stripped or not. Strace could also be attached to a running process pretty much like a debugger. In summary strace program intercepts and records the system calls which are called by a process and the signals which are received by a process. The name of each system call, its arguments and its return value are printed on standard error or to the file specified with the -o option.Consider below dump from bluetoothd.
root@arunks-PC1:/home/arunks/bluez-4.79/src# strace bluetoothdexecve(“/usr/local/sbin/bluetoothd”, ["bluetoothd"], [/* 25 vars */]) = 0brk(0) = 0xb9424000access(“/etc/ld.so.nohwcap”, F_OK) = -1 ENOENT (No such file or directory)
mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7832000
access(“/etc/ld.so.preload”, R_OK) = -1 ENOENT (No such file or directory)
open(“/etc/ld.so.cache”, O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=60803, …}) = 0
mmap2(NULL, 60803, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7823000
close(3) = 0
access(“/etc/ld.so.nohwcap”, F_OK) = -1 ENOENT (No such file or directory)
open(“/usr/local/lib/libbluetooth.so.3″, O_RDONLY) = 3
read(3, “\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\300C\0\0004\0\0\0″…, 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=100888, …}) = 0
mmap2(NULL, 90264, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb780c000
mmap2(0xb7820000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0×13) = 0xb7820000
close(3) = 0
mmap2(NULL, 12408, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb76e2000
mmap2(0xb76e4000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0×1) = 0xb76e4000
close(3) = 0
access(“/etc/ld.so.nohwcap”, F_OK) = -1 ENOENT (No such file or directory)
open(“/lib/librt.so.1″, O_RDONLY) = 3
read(3, “\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\300\30\0\0004\0\0\0″…, 512) = 512
fstat64(3, {st_mode=S_IFREG|0644, st_size=30684, …}) = 0
mmap2(NULL, 33364, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb76d9000
mmap2(0xb76e0000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0×6) = 0xb76e0000
close(3) = 0
access(“/etc/ld.so.nohwcap”, F_OK) = -1 ENOENT (No such file or directory)
open(“/lib/libc.so.6″, O_RDONLY) = 3
read(3, “\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0@n\1\0004\0\0\0″…, 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1421892, …}) = 0
mmap2(NULL, 1431976, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb757b000
mprotect(0xb76d2000, 4096, PROT_NONE) = 0
mmap2(0xb76d3000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0×157) = 0xb76d3000
mmap2(0xb76d6000, 10664, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb76d6000
close(3) = 0
access(“/etc/ld.so.nohwcap”, F_OK) = -1 ENOENT (No such file or directory)
open(“/lib/libpcre.so.3″, O_RDONLY) = 3
read(3, “\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\300\20\0\0004\0\0\0″…, 512) = 512
fstat64(3, {st_mode=S_IFREG|0644, st_size=210244, …}) = 0
mmap2(NULL, 213080, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb7546000
mmap2(0xb7579000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0×32) = 0xb7579000
close(3) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7545000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7544000
set_thread_area({entry_number:-1 -> 6, base_addr:0xb75446c0, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0
mprotect(0xb7579000, 4096, PROT_READ) = 0
mprotect(0xb76d3000, 8192, PROT_READ) = 0
set_robust_list(0xb7544730, 0xc) = 0
futex(0xbfd7a8b0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xbfd7a8b0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, NULL, bfd7a8c0) = -1 EAGAIN (Resource temporarily unavailable)
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL,
= 0
getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM_INFINITY}) = 0
uname({sys=”Linux”, node=”arunks-PC1″, …}) = 0
brk(0) = 0xb9424000
brk(0xb9445000) = 0xb9445000
uname({sys=”Linux”, node=”arunks-PC1″, …}) = 0
gettimeofday({1291871335, 768433}, NULL) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7544728) = 4579
exit_group(0) = ?
this dump has been truncated at places to fit in here but the output is self evident.
As we can see the bluetooth daemon program tried to open series of shared libraries, mmaping them to the process address space of bluetoothd, setting access writes to these memory pages and further. You can also figure the path where it tried to search for specific shared libs. Throughout the trace bluetoothd seems to be locating (“/etc/ld.so.nohwcap) and meeting with successive failures. Sequence of reads/ writes and any memory protection errors etc could be easily tracked and debugged via strace.
Oops Analysis.
An Oops is triggered by some exception and is a dump of CPU state and kernel stack at that point. In short if exceptions such as segmentation fault or invalid memory access happen in kernel code, lead to panics/Oops. Addresses used by processor are virtual and are mapped to physical addresses through a complex structure of so-called page tables. When an invalid pointer is de-referenced, the paging mechanism fails to map the pointer to a physical address and the processor signals a page fault to the operating system. If the address is not valid, the kernel is not able to “page in” the missing address; it generates an oops. This happens when the CPU is in supervisor mode. Sometimes, if the oops makes it into /var/log/messages, the syslogd daemon might disassemble the oops automatically. This is usually handy, unless you don’t have the correct System.map file on the system. Else you may generate the stack trace via a utility called ksymoops feeding it the correct system.map and Oops dump. To run ksymoops, do the following:root@arunks-PC1# ksymoops -K -L -O -m System.map < oops.txt > oops.outwhere the oops data above is in a file oops.txt and the corresponding disassembled oops will be written to the file oops.out. Now lets look at an actual Oops on my x86 machine hosting Ubuntu Linux distribution.Unable to handle kernel NULL pointer dereference at virtual address 00000004printing eip:80289912*pde = 591be001Oops: 0002
Modules linked in: deadman multipath md bonding1 bonding cpqci cpqhealth cpqrom e100 lpfcdd
CPU: 1
EIP: 0010:[<80289912>] Not tainted
EFLAGS: 00010202
eax: 00000000 ebx: f8f37138 ecx: f8b36d98 edx: 00000000
esi: 00619380 edi: f504ec00 ebp: 00000900 esp: d91bdcf8
ds: 0018 es: 0018 ss: 0018
Process spew (pid: 1230, stackpage=d91bd000)
Stack:
f504ec00 f504ed70 f587f370 00bb8000 00000080 00000001 00000050 00000000
80286341 d91bdd56 d91bdd58 00610180 f504ec00 00003a01 daa9c320 00609250
00bb8000 00000020 00000200 f587f200 f5a2e000 00610180 00000600 09001000
Call Trace: [<80286341>] [<80286427>] [<80219fdc>] [<8021a051>]
[<8013b78c>] [<8013b94b>] [<80139b70>] [<8012fd22>] [<80130020>] [<8013008c>][<80130b1292ff>] [<8012919c>] [<801d8425>] [<801d34d9>] [<80137717>][<80106f27>]
Code: 89 42 04 89 10 c7 01 00 00 00 00 c7 41 04 00 00 00 00 8b 03
Decoding Oops message
The first line displays reason for the panic, obviously. There is a null pointer access somewhere. Next the EIP or the Extended Instruction Pointer’s value is printed. An Oops count is also dumped which in this case is 2. Generally one Oops triggers more oops and only the first oops message is reliable.
Next is the EIP displaying code segment and instruction address. Next are the Program status, control, general purpose and more segment registers.
The stack trace is a trail of stack frames of subroutines that the system is executing at any point of time. A snap shot of same on crash time would yield valuable info on the function trail that lead to crash. Often its not the last routine executed that is a source of problem or crash rather some subroutine in middle of stack trace is usually the culprit.
The call trace dump is said to be the trail of return addresses, usually extracted from the stack trace for relevant info.
Below is a an extract after subjecting above oops thru ksymoops with appropriate system.map file. Trace; 80286341 <lvm_map+3b9/490>
Trace; 80286427 <lvm_make_request_fn+f/1c>
Trace; 80219fdc <generic_make_request+11c/12c>
Trace; 8021a051 <submit_bh+65/80>
Trace; 8013b78c <sync_page_buffers+94/ac>
Trace; 8013b94b <try_to_free_buffers+1a7/1d4>
Trace; 80139b70 <try_to_release_page+44/48>
Trace; 8012fd22 <shrink_cache+24a/3c8>
Trace; 80130020 <shrink_caches+78/a8>
Trace; 8013008c <try_to_free_pages_zone+3c/5c>
All it does is translate the hex addresses to equivalent function or symbolic names handy for the developer to correlate.
Now we will walk the call stack and verify function chain.
Now that we have a translated call trace for a particular process, we’ll need to verify the call trace. Unfortunately, the call trace isn’t always accurate. Somehow phantom function calls get inserted in the trace, yet there is no code path to those functions. To follow a call trace, you’ll need a disassembled kernel. The kernel build process will generate a uncompressed kernel image called “vmlinux”, and it will be in the top level kernel directory after the compilation process. To disassemble the vmlinux file:
% objdump -s -d vmlinux > vmlinux.disas
It is usually very helpful to re-compile the kernel with symbols and generate a new vmlinux file with line number information.
So with Ksymoops and call trace with vmlinux, it becomes easier to track the offending function often leading to exact offending instruction. But essentially, there may be incorrect parameters passed to a system call from user space or incorrect processing within kernel which leads to oops.