 |
about me
show you some things you may not have thought about and some techniques you can use
some bits are more complicated than others!
this is the kind of knowledge that separates sysadmins from operators
should be something for everyone
even if it's a sense of scale
|
 |
cover a lot of ground
fastpath through the subject
long time on oopsen
debugging, a bit hard
rest of the time on lockups
quickly talk about making bug reports
|
 |
can't show everything
just some basics to get you started
more than one way to do it!
do others if there is interest
UNIX debugging books
|
 |
to simplify things
oops: tripped an alarm and flashing red lights
lockup: the kernel has tripped itself and can't recover
can be temporary like the wavelan problems we had last time
reboot: catastrophic failure, something very bad happened...
PSU upgrade causing reboots under load, not enough power
Q: user/sysadmin/developer
Q: crash?
Q: capture an oops?
Q: decode it?
Q: report it?
Q: fix it yourself?
|
 |
spend some time on this
quite important
you should report because:
found rare bug
only happens on your hardware config
|
 |
screen / cut and paste
ring buffer / dmesg
capture closer to decoding
decoding closer to fixing
fixing closer to developing
will this sound like yoda? (developing leads to suffering!!)
serial consoles [ IPL - kernel - lilo - null modem ]
|
 |
descriptive huh?
formatting is off to get it on a slide and readable
Q: how many programmers?
Q: who has seen assembler?
|
 |
system may continue running afterwards
undefined behaviour
|
 |
ksymoops late 98
decode by hand so you understand it
use some of the techniques later on in lockups
|
 |
NULL pointer
to a structure
accessing offset 14
Sun offset story
- Solaris development cycle
- crashed randomly all over the place
- corruption in memory
- corruption was always happening at the same offset
- worked out which structures had members at that offset
- worked out the places that altered those members
- just checked those pieces of code
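
A minimal sketch of the offset trick, using Python's ctypes. The structure names and layouts are made up for illustration, and the offset is assumed to be hex 0x14. With a NULL base pointer the faulting address equals the member offset, so you can list which structures have a member starting there.

```python
import ctypes

# Hypothetical structures for illustration; real kernel layouts will differ.
class Inode(ctypes.Structure):
    _fields_ = [("i_ino", ctypes.c_uint32),
                ("i_mode", ctypes.c_uint32),
                ("i_uid", ctypes.c_uint32),
                ("i_gid", ctypes.c_uint32),
                ("i_size", ctypes.c_uint32),
                ("i_flags", ctypes.c_uint32)]  # starts at offset 0x14

class Buffer(ctypes.Structure):
    _fields_ = [("b_data", ctypes.c_uint64),
                ("b_len", ctypes.c_uint32)]    # starts at offset 8

def members_at(offset, structs):
    """Return (struct name, member name) pairs for members starting at offset."""
    hits = []
    for s in structs:
        for name, *_ in s._fields_:
            if getattr(s, name).offset == offset:
                hits.append((s.__name__, name))
    return hits

fault_address = 0x14   # assumed: address reported by the oops
base = 0               # the NULL pointer
print(members_at(fault_address - base, [Inode, Buffer]))
```

The same search is what the Sun story describes: fix the offset, find the structures with a member there, then check only the code that touches those members.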
|
 |
may trigger other oopsen in normally fine code
|
 |
instruction pointer
where it went tits up
come back to this later
|
 |
intel documents
could be useful once the function has been decoded
|
 |
process information..
the stack is general purpose
smashing the kernel stack (8k ?)
limited in size
|
 |
unreliable call trace
pretty useless, needs to be decoded
addresses that look right
return address is in middle of function
code is the next instructions to be executed
|
 |
address on its own is almost useless
kernel addresses are in the 3 - 4 gig range
get the sorted symbol table (System.map or /proc/ksyms)
the oops happened somewhere in a function
so the EIP lies between two addresses
derive the function and the offset into the function
can do that for all the call trace functions
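
The lookup can be sketched in a few lines of Python. The symbol names and addresses below are invented, and the input is assumed to be in the usual `address type name` layout of System.map. The idea is just a binary search for the last symbol that starts at or before the address.

```python
import bisect

# Made-up excerpt in System.map layout: "address type name".
SYSTEM_MAP = """\
c0130000 T example_alloc
c0130400 T example_lookup
c0130900 T example_release
"""

def load_symbols(text):
    """Parse System.map-style text into a sorted list of (address, name)."""
    syms = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3:
            syms.append((int(parts[0], 16), parts[2]))
    syms.sort()
    return syms

def resolve(addr, syms):
    """Return (function name, offset into function), or None if below all symbols."""
    addrs = [a for a, _ in syms]
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return None
    start, name = syms[i]
    return name, addr - start

syms = load_symbols(SYSTEM_MAP)
name, off = resolve(0xc01301c4, syms)
print(f"{name}+{off:#x}")  # example_alloc+0x1c4
```

Note the sketch has no upper bound for the last symbol, so any address past it is attributed to that symbol. Run the same `resolve` over each call trace address to decode the whole trace.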
|
 |
you can see:
tar
mkdir
reiserfs filesystem (no I'm not having a go)
note the mov instruction, we'll come back to that
|
 |
base 10!
this decoded information is useful to developers
you should be able to get to this point easily
|
 |
debugging isn't a recipe you can just follow
there is more than one way to do it
need to adapt (use printk etc.)
understand what is going on
in this case we have an oops dump, so we'll concentrate on it
how you might use this information
don't worry if you get lost in this next bit
over quite quickly
|
 |
find function name in source
grep or cscope
disassemble the .o file (or the kernel)
match offset to get where it failed
|
 |
binutils, every system has it
offset was 0x298
|
 |
tricky!
what it does
how it branches
more than one way to do it!
|
 |
back to the objdump output
failed on the move
about to do a call
the function called had four arguments (push)
|
 |
from:
looking at the branching
counting function calls
bloody obvious comment
|
 |
include -g in CFLAGS
code before preprocessing
shows you exactly where it failed
|
 |
|
 |
it's loading an address into a register before calling it
you have that address at compile time
strange optimisation??
|
 |
tada!
macros are a pain to debug
could run the source through the C preprocessor (-E)
coding style can make things worse; one line to 16k
|
 |
will stop there
a message from our sponsor...
Linus is notorious for being against in-kernel debuggers
- to properly fix problems you need to
- understand the written code
- and what the code is supposed to be doing
|
 |
can relax a bit now
this is a lot easier
|
 |
lockups, hangs
in general lockups are caused by waiting for something that
- isn't going to happen
- isn't going to be released
can lockup one or more processors on a system
like capturing oopsen, the goal is to find out where the code locked up
when you have the location you can answer why
|
 |
bad hardware (like memory)
bad driver
bad microcode....
bad luck?
easier to debug when it's quickly reproducible
the print EIP patch can be used to debug the spontaneous reboot type bugs
|
 |
software problems
test with keyboard lights
hardware hackers can make an NMI board from old ISA cards
find out where it's gone with magic sysreq
|
 |
P show regs / EIP
very useful general purpose tool
slightly dangerous, so most systems turn it off by default
/etc/sysctl.conf on redhat
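
A small sketch for checking whether SysRq is enabled in a sysctl.conf-style file. The sample text is illustrative, not copied from any particular distribution. The setting is the `kernel.sysrq` key, where 0 means disabled and 1 means all functions enabled. The running value can also be read from /proc/sys/kernel/sysrq.

```python
def sysrq_setting(conf_text):
    """Return the last kernel.sysrq value in sysctl.conf-style text, or None."""
    value = None
    for line in conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments (sysctl.conf allows '#' and ';').
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, val = line.partition("=")
        if sep and key.strip() == "kernel.sysrq":
            value = int(val.strip())
    return value

# Illustrative sample, assumed layout of /etc/sysctl.conf.
sample = """\
# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 0
"""
print(sysrq_setting(sample))  # 0
```

To use it on a real system, pass in the contents of /etc/sysctl.conf; a result of 0 or None means the file does not turn SysRq on.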
|
 |
you'll get an interrupt
a-b lock inversion deadlocks on MP systems
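
To make the a-b inversion concrete, here is a toy lock-order checker in Python. It is only a sketch of the idea, not how the kernel does anything, and the class and lock names are hypothetical. It records which lock was held when another was taken and flags any path that takes the same pair in the opposite order.

```python
class LockOrderChecker:
    """Toy model: record lock ordering and report a-b / b-a inversions."""

    def __init__(self):
        self.held = []      # locks currently held, in acquisition order
        self.order = set()  # (held_first, taken_second) pairs seen so far

    def acquire(self, name):
        """Take a lock; return the list of inversions this acquisition causes."""
        inversions = []
        for h in self.held:
            if (name, h) in self.order:
                # Some earlier path took `name` first and then `h`.
                inversions.append((h, name))
            self.order.add((h, name))
        self.held.append(name)
        return inversions

    def release(self, name):
        self.held.remove(name)

checker = LockOrderChecker()

# Code path 1: takes a, then b.
checker.acquire("a")
print(checker.acquire("b"))  # []
checker.release("b")
checker.release("a")

# Code path 2: takes b, then a. On an MP system the two paths can deadlock.
checker.acquire("b")
print(checker.acquire("a"))  # [('b', 'a')]
```

Each path on its own is fine. The deadlock needs two processors: one holding a and waiting for b, the other holding b and waiting for a.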
|
 |
NMI watchdog producing stack dumps
print EIP patch is applicable here
X lockups are common and difficult to debug
on console look like lockups with IRQ disabled
check over the network, eject pccards on notebooks, music carries on playing
can use the SysRq key to take the keyboard out of Raw mode and change VTs
|
 |
NONONO!
read the REPORTING-BUGS file
- oops data
- kernel version
- patches applied
- kernel config file
- hardware
- MOST IMPORTANTLY WHAT YOU WERE DOING AT THE TIME!
|
 |
type
gather
process
what to do with it
debugging basics
hopefully closer to developing code
or realise what's going wrong faster
go further? understand the kernel
learn basic languages (C and ASM)
experiment
|
 |
serial consoles?
|