Monday, July 23, 2012

Rebuilding RPMDB (RPM Database) - open rpm file handles

As the output below shows, cron.daily has been starting an rpm job since May 9th that never finishes. If you are troubleshooting remotely, the files in the SOSREPORT help with the diagnosis.

[root@BUTWIS01 ~]# ps -ef|grep rpm
root 1225 6420 0 Jul12 ? 00:00:00 /bin/sh /etc/cron.daily/rpm
root 1233 1225 0 Jul12 ? 00:00:00 /usr/lib/rpm/rpmq -q --all --qf %{name}-%{version}-%{release}.%{arch}.rpm\n
root 1353 6255 0 Jun05 ? 00:00:00 /bin/sh /etc/cron.daily/rpm
root 1359 1353 0 Jun05 ? 00:00:00 /usr/lib/rpm/rpmq -q --all --qf %{name}-%{version}-%{release}.%{arch}.rpm\n
root 1997 6451 0 May21 ? 00:00:00 /bin/sh /etc/cron.daily/rpm
root 2000 1997 0 May21 ? 00:00:00 /usr/lib/rpm/rpmq -q --all --qf %{name}-%{version}-%{release}.%{arch}.rpm\n
......
......
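A listing like the one above gets long when the job has been piling up for weeks. As a rough sketch, a small filter can reduce `ps -ef` output to the PID and start time of the rpm-related processes. The function name `list_rpm_procs` is made up for this example, and it assumes the standard `ps -ef` column layout (UID, PID, PPID, C, STIME, TTY, TIME, CMD).

```shell
# Print "PID STIME" for the cron.daily rpm script and its rpmq children.
# Reads `ps -ef` style output on stdin, so it also works on a saved
# ps listing taken from a sosreport.
list_rpm_procs() {
    awk '$8 ~ /\/rpmq$/ || $9 == "/etc/cron.daily/rpm" { print $2, $5 }'
}

# Usage on a live system:
#   ps -ef | list_rpm_procs
```

Processes whose start time is days or weeks old are the ones that never finished.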



The problem appears to have begun on a specific date (let's say May 9th): that is the earliest log entry, and /var/log/rpmpkgs is a 0-byte file created on May 10. Unfortunately, we do not seem to have logs stretching back nearly that far on the server, so determining what happened may not be possible. Does the server itself have any 'messages' files in /var/log other than messages and messages.1?

One thing we can see is that each of the temporary files created by the cron job still has an open file handle:

sort 1360 0 1 unknown /var/log/rpmpkgs.bvNeC1355 lstat: Resource temporarily unavailable) (stat: Resource temporarily unavailable)
sort 2001 0 1 unknown /var/log/rpmpkgs.AqCXY1999 lstat: Resource temporarily unavailable) (stat: Resource temporarily unavailable)
sort 2145 0 1 unknown /var/log/rpmpkgs.CMQBk2143 lstat: Resource temporarily unavailable) (stat: Resource temporarily unavailable)
sort 2580 0 1 unknown /var/log/rpmpkgs.unPLs2578 lstat: Resource temporarily unavailable) (stat: Resource temporarily unavailable)
sort 3485 0 1 unknown /var/log/rpmpkgs.LbxJe3483 lstat: Resource temporarily unavailable) (stat: Resource temporarily unavailable)
sort 4017 0 1 unknown /var/log/rpmpkgs.AiNLk4009 lstat: Resource temporarily unavailable) (stat: Resource temporarily unavailable)
sort 6523 0 1 unknown /var/log/rpmpkgs.BEVAC6518 lstat: Resource temporarily unavailable) (stat: Resource temporarily unavailable)
sort 6584 0 1 unknown /var/log/rpmpkgs.NfVjU6582 lstat: Resource temporarily unavailable) (stat: Resource temporarily unavailable)
sort 7071 0 1 unknown /var/log/rpmpkgs.blYnX7062 lstat: Resource temporarily unavailable) (stat: Resource temporarily unavailable)
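To see how many of these leftover temporary files exist on a server, a short helper can list them. This is only a sketch: `list_rpmpkgs_tmp` is a hypothetical name, and the nine-character suffix pattern is an assumption based on the file names shown above (it is meant to skip /var/log/rpmpkgs itself and rotated copies such as rpmpkgs.1).

```shell
# List leftover rpmpkgs temp files in a directory (default: /var/log).
# Assumes the temp suffix is exactly nine characters, as in the names
# shown in the open-file listing.
list_rpmpkgs_tmp() {
    dir=${1:-/var/log}
    for f in "$dir"/rpmpkgs.?????????; do
        # When nothing matches, the glob stays literal; skip it.
        [ -e "$f" ] || continue
        printf '%s\n' "$f"
    done
}

# Usage:
#   list_rpmpkgs_tmp            # checks /var/log
#   list_rpmpkgs_tmp | wc -l    # number of unfinished runs
```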

ls -hl /var/log

The fact that all the /var/log/messages* files still exist suggests this isn't an issue with your storage or filesystem. Unfortunately, the logs still don't go back far enough to tell us what might have happened. The next thing I like to do is collect some data about what the stuck processes are doing. To that end, please run:

strace -Tttfvo /tmp/strace.out -p15223

Let that run for 15 seconds or so, then press Ctrl+C and send us /tmp/strace.out.
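If nobody is at the terminal to press Ctrl+C, the capture can be time-boxed instead. The helper below is a minimal sketch, not part of the original procedure: `run_for` is a made-up name, and it stops the command with SIGTERM on the assumption that strace detaches cleanly from the traced process when it receives that signal (worth verifying on your strace version before relying on it).

```shell
# Run a command in the background, let it work for N seconds, then stop it.
# Usage: run_for SECONDS COMMAND [ARGS...]
run_for() {
    secs=$1
    shift
    "$@" &
    pid=$!
    sleep "$secs"
    # Ask the command to stop; ignore the error if it already exited.
    kill -TERM "$pid" 2>/dev/null || true
    # Reap the child so no zombie is left behind.
    wait "$pid" 2>/dev/null || true
}

# Example, using the same strace invocation as above:
#   run_for 15 strace -Tttfvo /tmp/strace.out -p15223
```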



The strace output shows that the rpm process is currently stalled in a futex wait:

15223 14:48:53.284527 futex(0x2ae242d4a6cc, FUTEX_WAIT, 1, NULL <unfinished ...>

This usually indicates a problem within the rpmdb, so perform the following steps to rebuild it.

1. Capture the current status.
# cd /var/lib/rpm
# /usr/lib/rpm/rpmdb_stat -CA > /tmp/rpmdb.out

2. Kill the rpm processes by running "killall -9 rpm".

3. Back up the rpmdb, then rebuild it.
# mv /var/lib/rpm/__db.* /tmp
# rpm --rebuilddb

Running "rpm -qa" or "yum check-update" should confirm whether the rpmdb is back in a working state. Read /tmp/rpmdb.out along with the results of the steps above for more insight.
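When the strace capture covers several processes, the futex-wait symptom described above can be picked out with a simple search. This is a sketch under the assumption that the file was written with `-f -o`, so each line begins with the PID; `blocked_futex_pids` is a hypothetical name.

```shell
# Print the PIDs that were left sitting in an unfinished FUTEX_WAIT call.
# $1: path to a strace output file in which each line starts with the PID.
blocked_futex_pids() {
    grep 'futex(.*FUTEX_WAIT.*<unfinished' "$1" | awk '{ print $1 }' | sort -u
}

# Usage:
#   blocked_futex_pids /tmp/strace.out
```

An empty result means no process in the capture was stuck in that particular wait, so the rpmdb rebuild may not be the right fix.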
