2013年7月3日 星期三

寫程式容易除錯難

之前在討論22k,還有一些機車的公司用低價僱用年輕的程式設計師,我覺得很多老闆腦袋都有問題,在軟、軔體或系統的程式設計領域上,價值是與解決事情的效率成正比,高學歷、名牌學校,不見得會比較有價值,有些時候程式人人都會寫,除了巧妙差異之外,還有除錯能力上的差異,有些年輕人遇到問題實在解不了,埋起來或是乾脆換工作都大有人在,也有人寫了一個很難維護程式碼,難到連他自己本人都維護不了,出現的鬼問題也非常的藝術,若隱若現的,這也造成市場上有關這方面的工作需求,很多時候不太喜歡用新人,也不太考慮學歷的問題,在104的統計上薪資增長速度與工作的年資比較有關係,學歷差異上到大學之後就沒長進了,軟體上甚至倒退,結果是碩士生拿到比較低的價碼。

最近有個BUG是有史以來遇到最機車的,這隻蟲非常、非常、非常、、、偶爾才會出來搗蛋,可是一旦出現,整個系統就幾乎停擺,剩下中斷還活著,很湊巧是客戶的程式執行的時候會產生,也很不幸的是到客戶手上才被發現,而當認真寫個小程式去複製這個問題的時候卻又碰不到,也就是只有客戶的那隻程式在十分的湊巧才會撞見,碰到的機率要不快、不慢,不是猛呼叫會加速碰到的機率,而他出現的平均時間又很尷尬,五到八小時,我好不容易拿著客戶的原始碼調到兩個時左右,可是我按照客戶的邏輯直接呼叫函式猛叫,反而怎麼跑都出不來。

解決這個死結的問題,靠的是SysRq,利用中斷去倒出當時系統執行了哪些thread、process,以及call stack。

底下就是SysRq拿到的LOG
[ 2676.460000] events/0      R running      0     4      2 0x00000000
[ 2676.460000] Backtrace:
[ 2676.460000] [<c03f52d0>] (schedule+0x0/0x370) from [<c03f571c>] (preempt_schedule_irq+0x48/0x7c)
[ 2676.460000] [<c03f56d4>] (preempt_schedule_irq+0x0/0x7c) from [<c00279a8>] (svc_preempt+0x8/0x20)
[ 2676.460000]  r4:ffffffff
[ 2676.460000] [<bf00e2a0>] (K_UARTTxReadyEx+0x0/0x64 [cadrv]) from [<bf02b780>] (_MAP_RegGetData+0xc0/0x20c [cadrv])
[ 2676.460000] [<bf02b6c0>] (_MAP_RegGetData+0x0/0x20c [cadrv]) from [<bf02bc5c>] (CLR_SetBitData+0x38/0x94 [cadrv])
[ 2676.460000]  r7:c3027ef7 r6:00000001 r5:00000020 r4:00000000
[ 2676.460000] [<bf02bc24>] (CLR_SetBitData+0x0/0x94 [cadrv]) from [<bf0298f0>] (K_BatterySendPowerSwitch+0x30/0x5c [cadrv])
[ 2676.460000]  r7:bf0c79fa r6:bf0c79f5 r5:00000001 r4:00000001
[ 2676.460000] [<bf0298c0>] (K_BatterySendPowerSwitch+0x0/0x5c [cadrv]) from [<bf02a1ec>] (K_Battery_Power_Switch+0x1ec/0x2e0 [cadrv])
[ 2676.460000]  r6:bf0c7a2f r5:00000000 r4:00000002
[ 2676.460000] [<bf02a000>] (K_Battery_Power_Switch+0x0/0x2e0 [cadrv]) from [<c00574f8>] (run_workqueue+0xc4/0x1e4)
[ 2676.460000]  r7:c3027f94 r6:c3001820 r5:c3026000 r4:bf0c7a6c
[ 2676.460000] [<c0057434>] (run_workqueue+0x0/0x1e4) from [<c0058288>] (worker_thread+0x9c/0x120)
[ 2676.460000] [<c00581ec>] (worker_thread+0x0/0x120) from [<c005c228>] (kthread+0x88/0x90)
[ 2676.460000]  r7:00000000 r6:c00581ec r5:c3001820 r4:c301bef0
[ 2676.460000] [<c005c1a0>] (kthread+0x0/0x90) from [<c0042ab8>] (do_exit+0x0/0x2ec)
[ 2676.460000]  r6:00000000 r5:00000000 r4:00000000

[ 2676.460000] AppMan        R running      0   888    759 0x00000002
[ 2676.460000] Backtrace:
[ 2676.460000] [<c002b8f4>] (dump_backtrace+0x0/0x118) from [<c002ba28>] (show_stack+0x1c/0x20)
[ 2676.460000]  r6:00000000 r5:c2b80680 r4:c2b80680
[ 2676.460000] [<c002ba0c>] (show_stack+0x0/0x20) from [<c0036680>] (sched_show_task+0xac/0xe4)
[ 2676.460000] [<c00365d4>] (sched_show_task+0x0/0xe4) from [<c0037d6c>] (show_state_filter+0x6c/0xb8)
[ 2676.460000]  r5:c2b80680 r4:c2b80680
[ 2676.460000] [<c0037d00>] (show_state_filter+0x0/0xb8) from [<c0213e38>] (sysrq_handle_showstate+0x18/0x1c)
[ 2676.460000]  r8:00000074 r7:60000193 r6:c2b82000 r5:00000001 r4:c056e96c
[ 2676.460000] [<c0213e20>] (sysrq_handle_showstate+0x0/0x1c) from [<c0213c78>] (__handle_sysrq+0xa8/0x198)
[ 2676.460000] [<c0213bd0>] (__handle_sysrq+0x0/0x198) from [<c0213de8>] (handle_sysrq+0x34/0x38)
[ 2676.460000] [<c0213db4>] (handle_sysrq+0x0/0x38) from [<c0219100>] (bcm5892uart_rx_chars+0x1dc/0x2ec)
[ 2676.460000]  r5:00010074 r4:c056ec78
[ 2676.460000] [<c0218f24>] (bcm5892uart_rx_chars+0x0/0x2ec) from [<c02197a8>] (bcm5892uart_int+0x94/0xe8)
[ 2676.460000]  r8:00000000 r7:c2b82000 r6:000000ff r5:c056ec78 r4:00000040
[ 2676.460000] [<c0219714>] (bcm5892uart_int+0x0/0xe8) from [<c0076944>] (handle_IRQ_event+0x78/0x1e0)
[ 2676.460000]  r7:00000046 r6:c31b2920 r5:c2b82000 r4:c31b2920
[ 2676.460000] [<c00768cc>] (handle_IRQ_event+0x0/0x1e0) from [<c0079434>] (handle_level_irq+0xa8/0x160)
[ 2676.460000] [<c007938c>] (handle_level_irq+0x0/0x160) from [<c002506c>] (asm_do_IRQ+0x4c/0x94)
[ 2676.460000]  r7:00000002 r6:00000000 r5:c056a094 r4:00000046
[ 2676.460000] [<c0025020>] (asm_do_IRQ+0x0/0x94) from [<c0027904>] (__irq_svc+0x44/0xe0)
[ 2676.460000] Exception stack(0xc2b83d28 to 0xc2b83d70)
[ 2676.460000] 3d20:                   00000001 bf0500f0 00000000 bf050118 bf04c524 00000001
[ 2676.460000] 3d40: bf04c524 00000000 c2b83dd2 c2b82000 40024000 c2b83d9c c2b83d70 c2b83d70
[ 2676.460000] 3d60: bf02b884 bf00e2a4 20000013 ffffffff
[ 2676.460000]  r6:00000040 r5:d102b000 r4:ffffffff
[ 2676.460000] [<bf02b818>] (_MAP_RegGetData+0x158/0x20c [cadrv]) from [<bf02bb0c>] (Page0_ReadREG+0x40/0x90 [cadrv])
[ 2676.460000]  r7:00000000 r6:00000005 r5:00000000 r4:c2b83dd2
[ 2676.460000] [<bf02bacc>] (Page0_ReadREG+0x0/0x90 [cadrv]) from [<bf029c64>] (K_BatteryGetCapacityOne+0x44/0x3e0 [cadrv])
[ 2676.460000] [<bf029c20>] (K_BatteryGetCapacityOne+0x0/0x3e0 [cadrv]) from [<bf02a300>] (K_BatteryGetCapacity+0x20/0x84 [cadrv])
[ 2676.460000]  r7:c337d3a0 r6:c2b83e78 r5:c2b83e17 r4:00000000
[ 2676.460000] [<bf02a2e0>] (K_BatteryGetCapacity+0x0/0x84 [cadrv]) from [<bf02a38c>] (K_BatteryStatus+0x28/0x78 [cadrv])
[ 2676.460000]  r5:c2b83e78 r4:beabac98
[ 2676.460000] [<bf02a364>] (K_BatteryStatus+0x0/0x78 [cadrv]) from [<bf007bd4>] (constructing_function_call+0x80/0x98 [cadrv])
[ 2676.460000]  r4:c2b83f00
[ 2676.460000] [<bf007b58>] (constructing_function_call+0x4/0x98 [cadrv]) from [<bf007c60>] (ioctl_entry+0x74/0x128 [cadrv])
[ 2676.460000]  r6:beabac10 r5:beabac10 r4:beabac1c
[ 2676.460000] [<bf007bec>] (ioctl_entry+0x0/0x128 [cadrv]) from [<bf007808>] (drv_ioctl+0x50/0x88 [cadrv])
[ 2676.460000] [<bf0077b8>] (drv_ioctl+0x0/0x88 [cadrv]) from [<c00d14c0>] (vfs_ioctl+0x90/0xa0)
[ 2676.460000]  r5:c0045306 r4:c337d3a0
[ 2676.460000] [<c00d1430>] (vfs_ioctl+0x0/0xa0) from [<c00d1728>] (do_vfs_ioctl+0x74/0x1ec)
[ 2676.460000]  r6:c0045306 r5:beabac10 r4:beabac10
[ 2676.460000] [<c00d16b4>] (do_vfs_ioctl+0x0/0x1ec) from [<c00d18e4>] (sys_ioctl+0x44/0x6c)
[ 2676.460000]  r4:0000000a
[ 2676.460000] [<c00d18a0>] (sys_ioctl+0x0/0x6c) from [<c0027d60>] (ret_fast_syscall+0x0/0x2c)

[ 2676.460000]  r7:00000036 r6:4056e214 r5:0000000a r4:000000bd

其實從這個LOG來看,最後是卡在UART的txReady上,而兩個行程都訪問到了差不多的點,最後的結論是在等txReady是個死結,要不是有工具就真的找到死,真的不誇張,從User space查到Linux Kernel module超長的路程。

http://www.ibm.com/developerworks/cn/linux/l-cn-sysrq/