[23:59:51] safer than safe enough isn't useful. [00:01:09] it's easy, just use http://docs.python.org/library/os.html#os.setuid [00:01:49] maplebed: hrm, i only fully tested against pdns.controlsocket [00:01:51] hm. I will try that. [00:02:28] of course you'll need to lookup the uid [00:02:37] since it can vary between systems [00:02:46] sure. [00:04:36] that's in another library. pwd, I believe [00:04:52] pwd.getpwnam(name) [00:05:24] pwd.getpwnam(name)[2] <— that specifically [00:05:26] i think you do want the recursor for the particular stats you're getting [00:06:04] Ryan_Lane: thanks. that's the one. [00:06:09] yw [00:12:34] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [00:37:47] New patchset: Bhartshorne; "adding ganglia metrics to pds recursors" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8876 [00:37:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8876 [00:37:48] blast. ryan left. binasher - I added dropping privs and sanitized input from the statefile. Care to look again? [00:46:36] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 294 seconds [00:49:18] New patchset: Bhartshorne; "adding ganglia metrics to pds recursors" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8876 [00:49:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8876 [00:53:14] New patchset: Bhartshorne; "adding ganglia metrics to pds recursors" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8876 [00:53:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8876 [00:58:24] New patchset: Bhartshorne; "adding ganglia metrics to pds recursors" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8876 [00:58:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8876 [01:00:15] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 25 seconds [01:00:51] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 10 seconds [01:16:23] New patchset: Bhartshorne; "adding ganglia metrics to pds recursors" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8876 [01:16:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8876 [01:32:02] New patchset: Bhartshorne; "adding ganglia metrics to pds recursors" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8876 [01:32:23] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8876 [01:42:15] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 302 seconds [01:45:06] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [02:37:41] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [03:05:53] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [03:46:41] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [03:48:38] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours [04:11:43] RECOVERY - MySQL Replication Heartbeat on db1003 is OK: OK replication delay 0 seconds [04:11:52] RECOVERY - MySQL Slave Delay on db1003 is OK: OK replication delay 0 seconds [08:07:09] so quiet this time of the morning... [08:43:39] nothing happen til 2pm CET (or noon GMT) [08:43:41] ++ [08:58:30] back [08:58:37] notpeter: so yeah European morning are really quiet [08:59:11] probably cause there are few staff/contractors in Eu [08:59:20] and most volunteer are at school / work / sleeping [09:02:12] PROBLEM - Puppet freshness on gurvin is CRITICAL: Puppet has not run in the last 10 hours [09:04:09] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [09:04:09] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [09:04:09] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours [09:04:09] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [09:05:12] hashar: gotcha [09:05:34] hashar: it also seems like lots of folk in europe sleep late/work late so that they get some overlap with west coast [09:06:13] we are kind of forced to do it :-( [09:06:51] the 9hours difference is really not helping [09:07:03] as an example, I end my day of work at 9am your time [09:07:11] some of us are just busy coding so we lay low [09:07:16] unless something big is broken [09:07:25] get my daughter, lunch, kiss my wife etc… then resume work at noon SF time when most people get out for lunch [09:07:32] and can work till 2pm (11pm eu) [09:07:59] notpeter: so I mostly use async communication with SF folk. 
Aka the good old email ;-D [09:08:11] and enjoy the morning coding [09:10:02] break [09:10:08] will be back this afternoon [10:03:33] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:05:57] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:07:18] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 7.049 seconds [10:11:03] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [10:13:36] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [10:16:18] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:17:39] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 2.189 seconds [10:18:42] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:20:12] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [10:34:03] seeing some segfaults from page.cgi on kaulen [10:34:28] trying to stop and restart apache over there, just stopping it is taking a very long time [10:35:03] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:35:20] gotta wait it out [10:37:54] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [10:38:00] going to watch load drop for a minute here [10:38:57] PROBLEM - Apache HTTP on kaulen is CRITICAL: Connection refused [10:38:58] mark: ping [10:40:36] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 0.013 seconds [10:42:09] !log restarted apache on kaulen, was seeing page.cgi segfaults in dmesg and he logs, huge cpu wait spikes (why?) [10:42:16] Logged the message, Master [10:47:26] New patchset: ArielGlenn; "weak sync of wmf media from swift/other backend to local filesystem" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/8894 [10:52:36] RECOVERY - Host search13 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [10:52:45] RECOVERY - Host search14 is UP: PING WARNING - Packet loss = 64%, RTA = 0.40 ms [10:52:45] RECOVERY - Host search15 is UP: PING WARNING - Packet loss = 93%, RTA = 0.75 ms [10:55:36] PROBLEM - SSH on search13 is CRITICAL: Connection refused [10:55:54] PROBLEM - SSH on search15 is CRITICAL: Connection refused [10:56:03] PROBLEM - SSH on search14 is CRITICAL: Connection refused [10:56:03] PROBLEM - Lucene disk space on search13 is CRITICAL: Connection refused by host [10:56:03] PROBLEM - Lucene disk space on search15 is CRITICAL: Connection refused by host [10:56:16] paravoid: pong [10:56:57] PROBLEM - Lucene disk space on search14 is CRITICAL: Connection refused by host [10:57:19] hi [10:57:31] so, I have several commits to pybal [10:57:37] ah? [10:57:38] how do I push them [10:57:43] do I need change-ids? [10:57:49] yes [10:57:54] yours didn't have [10:58:01] that's because I didn't push them via gerrit [10:58:05] but you should now [10:58:05] ah! 
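
A minimal sketch of the privilege-dropping approach maplebed and Ryan_Lane work out near the top of the log (around 00:01-00:05): look up the uid with pwd.getpwnam, then call os.setuid before doing any real work. The 'pdns' user name, the rec_control query and the gmetric call are illustrative assumptions about what a collector like the one in gerrit change 8876 might do; this is not the actual change, and the statefile sanitizing maplebed mentions is left out.

    import os
    import pwd
    import subprocess   # subprocess.check_output needs Python 2.7+

    def drop_privileges(username):
        pw = pwd.getpwnam(username)   # raises KeyError if the user is missing
        os.setgid(pw.pw_gid)          # drop the group first, while still root
        os.setuid(pw.pw_uid)          # pw.pw_uid is the pwd.getpwnam(name)[2] from the chat

    def recursor_stat(name):
        # Ask the running pdns recursor for one statistic; take the last token
        # of the reply, which is the bare value.
        out = subprocess.check_output(['rec_control', 'get', name])
        return int(out.split()[-1])

    def send_to_ganglia(metric, value):
        # Hand the sample to ganglia through the stock gmetric CLI.
        subprocess.check_call(['gmetric', '--name', metric,
                               '--value', str(value), '--type', 'uint32'])

    if __name__ == '__main__':
        drop_privileges('pdns')   # assumed run-as user; os.setuid fails unless started as root
        send_to_ganglia('pdns_questions', recursor_stat('questions'))

Note that the order matters: setgid has to happen before setuid, because once the uid is dropped the process no longer has the privilege to change its group.
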
[10:58:12] just push to refs/for/master [10:58:18] RECOVERY - Host search19 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [10:58:18] RECOVERY - Host search16 is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms [10:58:27] RECOVERY - Host search20 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [10:58:30] I need to add the hook first and amend the commits :) [10:58:34] yeah [10:58:34] dammit [10:58:40] or maybe [10:58:44] change-ids are not required in that repo [10:58:45] try it first ;) [10:58:48] not sure what I set [10:59:26] New patchset: Faidon; "Remove stub/placeholder files from debian/" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/8895 [10:59:27] New patchset: Faidon; "Ship bgp.py with pybal for now" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/8896 [10:59:27] New patchset: Faidon; "Change homepage to wikitech" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/8897 [10:59:28] New patchset: Faidon; "Use Python absolute imports" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/8898 [10:59:29] New patchset: Faidon; "Add twisted to setup.py's requires" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/8899 [10:59:29] New patchset: Faidon; "Move main.py to scripts/pybal" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/8900 [10:59:30] New patchset: Faidon; "Modernize Debian packaging" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/8901 [10:59:31] New patchset: Faidon; "Modernize init script" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/8902 [10:59:31] New patchset: Faidon; "Make example configuration less Wikipedia-specific" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/8903 [10:59:32] New patchset: Faidon; "Rewrite debian/copyright" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/8904 [10:59:33] New patchset: Faidon; "Add a debian/changelog entry" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/8905 [10:59:36] yay :) [10:59:47] awesome :) [10:59:48] PROBLEM - Lucene on search15 is CRITICAL: Connection refused [10:59:48] enjoy [10:59:48] reviewing [11:00:31] also, which instance do you use for building packages? 
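
For reference, the workflow being described here (get the hook, amend the existing commits, then push to refs/for/master) is the standard Gerrit one. Roughly, assuming gerrit.wikimedia.org serves SSH on Gerrit's usual port 29418 and with <user> as a placeholder:

    # fetch Gerrit's commit-msg hook so new or amended commits get a Change-Id line
    scp -p -P 29418 <user>@gerrit.wikimedia.org:hooks/commit-msg .git/hooks/
    # amend each existing commit (for instance via an interactive rebase, marking
    # them "reword") so the hook can stamp a Change-Id into the message, then:
    git push origin HEAD:refs/for/master

As mark notes just above ("change-ids are not required in that repo"), a repository can be configured not to require Change-Ids at all, in which case the plain push is enough.
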
[11:00:44] I now use my 'varnish' instance [11:00:47] hah [11:00:52] which has a pybal setup with lucid and precise [11:00:55] er [11:00:56] pbuilder [11:01:04] i'm gonna make a precise instance now, for testing pybal on it [11:01:27] PROBLEM - SSH on search16 is CRITICAL: Connection refused [11:01:27] PROBLEM - SSH on search19 is CRITICAL: Connection refused [11:01:43] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8895 [11:01:51] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8895 [11:01:52] mark: I made one, you can do whatever with it, if you like [11:01:53] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/8895 [11:01:54] PROBLEM - Lucene on search14 is CRITICAL: Connection refused [11:01:54] PROBLEM - Lucene on search13 is CRITICAL: Connection refused [11:02:06] there's a general package builder instance as well [11:02:11] which is supposed to work [11:02:12] PROBLEM - Lucene disk space on search16 is CRITICAL: Connection refused by host [11:02:12] PROBLEM - Lucene disk space on search19 is CRITICAL: Connection refused by host [11:02:14] but I never used it [11:02:26] yeah, I have built on that [11:02:39] we should add a deb lint checker to gerrit [11:03:50] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8896 [11:03:52] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/8896 [11:04:30] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8897 [11:04:32] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/8897 [11:04:52] which one's that? [11:05:05] what do you mean? [11:05:29] the "general package builder instance" [11:05:41] * mark checks [11:05:49] labs-build1? 
[11:05:53] yeah I think so [11:05:57] fits :) [11:06:19] I had to filter project "testlabs" first, heh [11:07:02] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8898 [11:07:11] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8898 [11:07:13] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/8898 [11:07:27] PROBLEM - Lucene on search16 is CRITICAL: Connection refused [11:07:36] PROBLEM - Lucene on search19 is CRITICAL: Connection refused [11:07:47] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8899 [11:07:49] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/8899 [11:08:29] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8900 [11:08:31] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/8900 [11:08:39] RECOVERY - SSH on search13 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [11:09:48] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8901 [11:09:50] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/8901 [11:10:09] RECOVERY - SSH on search14 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [11:10:09] RECOVERY - SSH on search15 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [11:10:12] hm, I was thinking of writing an upstart script for pybal [11:10:24] but do you know what ubuntu (debian) are gonna do with systemd? 
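
mark mentions just above that he is thinking of writing an upstart script for pybal. A minimal upstart job for a foreground daemon might look like the sketch below; the path and start conditions are assumptions, not whatever job eventually shipped, and if pybal forks into the background it would also need an 'expect daemon' stanza.

    # /etc/init/pybal.conf
    description "PyBal LVS monitor"
    start on (local-filesystems and net-device-up IFACE!=lo)
    stop on runlevel [!2345]
    respawn
    exec /usr/sbin/pybal
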
[11:10:51] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8902 [11:10:59] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8902 [11:11:01] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/8902 [11:11:50] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8903 [11:11:52] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/8903 [11:11:57] PROBLEM - Host search19 is DOWN: PING CRITICAL - Packet loss = 100% [11:12:06] PROBLEM - Host search20 is DOWN: PING CRITICAL - Packet loss = 100% [11:12:33] RECOVERY - SSH on search20 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [11:12:42] RECOVERY - Host search20 is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [11:13:00] RECOVERY - SSH on search19 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [11:13:00] RECOVERY - SSH on search16 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [11:13:09] RECOVERY - Host search19 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [11:13:38] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8904 [11:13:40] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/8904 [11:13:54] PROBLEM - NTP on search14 is CRITICAL: NTP CRITICAL: No response from NTP server [11:13:54] PROBLEM - NTP on search15 is CRITICAL: NTP CRITICAL: No response from NTP server [11:13:54] PROBLEM - NTP on search13 is CRITICAL: NTP CRITICAL: No response from NTP server [11:14:35] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8905 [11:14:37] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/8905 [11:14:44] thanks a lot faidon :-) [11:16:00] RECOVERY - Host search17 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [11:19:00] PROBLEM - Lucene disk space on search17 is CRITICAL: Connection refused by host [11:19:00] PROBLEM - SSH on search17 is CRITICAL: Connection refused [11:23:57] PROBLEM - Lucene on search17 is CRITICAL: Connection refused [11:25:54] hm, bug [11:26:04] I just created a 'pybal' project, gave my name as member [11:26:07] but now i'm not listed [11:26:10] and can't do anything with it [11:31:45] RECOVERY - SSH on search17 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [11:32:30] PROBLEM - NTP on search16 is CRITICAL: NTP CRITICAL: No response from NTP server [11:32:41] mark: ubuntu has said that they're not going to switch to systemd [11:32:49] debian has not migrated to either upstart or systemd [11:32:51] for various reasons [11:32:57] ok [11:33:04] technical and social [11:33:24] if Debian chooses systemd, I'd expect Ubuntu to follow [11:33:42] well then I'll continue creating upstart scripts from time to time [11:33:51] PROBLEM - NTP on search19 is CRITICAL: NTP CRITICAL: No response from NTP server [11:33:53] I've never written one [11:34:52] upstart has its problems [11:36:10] tbh, without having a very well educated opinion, I tend to prefer upstart over systemd [11:36:16] not a big fan of do-it-all daemons [11:36:38] yeah [11:36:42] PROBLEM - NTP on search17 is CRITICAL: NTP CRITICAL: No response from NTP server [11:36:58] anyway, as for pybal [11:37:08] maybe you 
should bump versions at some point? :) [11:37:11] yes [11:37:14] it's still at 0.1 :) [11:37:16] gonna do that now [11:37:22] I just didn't want to bother with it before ;) [11:37:25] and I was thinking of switching to git-buildpackage [11:37:30] absolutely [11:37:40] it's going to mess the workflow a bit [11:37:56] you have a master branch with just the upstream tree and a debian branch with the debian changes [11:37:59] I built it with git-buildpackage yesterday [11:38:12] it's a native package now, so everything on one branch [11:38:15] any reason to change that now? [11:39:03] it doesn't make a big difference for us but being a native package is a stopper for a Debian/Ubuntu upload [11:39:10] ok [11:39:12] feel free to change it [11:39:49] okay [11:40:02] later though, I have some pending tasks that I shouldn't postpone more [11:40:08] yeah, no rush [11:40:12] i'm gonna add in ipv6 now [11:40:15] although pybal brought me back to my comfort zone :) [11:40:18] gonna put pybal on an instance and test it there [11:40:21] haha [11:40:36] yeah and I'm happy too, didn't really feel like reading up on all the debian packaging updates [11:42:24] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:43:08] I saw the DSA too, but was waiting for the USN [11:49:02] grr [11:49:06] why doesn't my new instance let me in [11:49:31] did you try to login before you added yourself? [11:49:39] if so, there's a negative cache :) [11:49:45] been there, done that [11:49:59] i would think I've been added upon creation [11:50:10] ok, let me try [11:50:13] * mark checks [11:51:05] the labsconsole interface could really use some improvements ;) [11:51:13] ohrly? :) [11:52:26] negative caching... and then people always wonder why I hate ldap [11:52:27] hm, can't login either [11:52:38] I like ldap, I don't like pam/nss ldap :) [11:53:39] do you have access rights to remove the 'pybal' project? [11:53:48] it has no members so I can't [11:53:50] projects cannot get removed afaik [11:53:56] ever [11:53:56] whut [11:54:48] can't login to pybal-precise either, but I wasn't a sysadmin [11:55:02] added myself now, but perhaps I'm in the negative cache too now [11:55:04] fail :) [11:56:00] hrm, or puppet is still running? 
[11:56:03] might be [11:56:10] it's setup to create pbuilder instances [11:56:11] those take a while [11:56:15] especially when labs I/O is slow [11:56:25] yep, I'm looking at the console [11:56:38] it's creating instances for hardy, lucid and precise by default [11:56:59] i'll try again later [12:04:22] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:38:16] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [12:46:13] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours [12:49:13] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours [12:49:13] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours [12:55:13] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours [12:55:13] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours [12:55:13] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours [12:57:19] PROBLEM - Puppet freshness on dobson is CRITICAL: Puppet has not run in the last 10 hours [13:07:16] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [13:37:28] !log updating drac on search18, shouldnt cause system reboot. [13:37:32] Logged the message, RobH [13:47:19] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [13:49:16] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours [13:50:46] !log palladium has a bad disk, goign to replace it [13:50:46] so, wait, we have rsyslog in production and syslog-ng in labs? [13:50:49] Logged the message, RobH [13:50:52] mark: ^ palladium is a varnish cache [13:50:59] but it appears to be dual disk 1 TB raid 1 [13:51:09] yes it's a bits server [13:51:11] oh? we have caches that are neither cp or sq? [13:51:16] yes [13:51:25] mark: so since its hardware raid1 i should be ok to pull it to swap [13:51:26] dammit, and I thought I got a hang of our naming conventions [13:51:33] but since its varnish i am glad you are about ;] [13:51:39] there are 3 other ones [13:51:41] just shut it down [13:51:51] i would like to try the hot swap just to see if it works =] [13:51:57] we have not hot swapped that many 610s [13:52:17] !log replacing ps2 on mw1017 [13:52:21] Logged the message, Master [13:53:13] yep, still up [13:53:14] huzzah [13:53:28] New patchset: Hashar; "split filebackend conf out of CommonSettings" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/8914 [13:53:35] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/8914 [13:56:50] !log palladium disk replaced [13:56:54] Logged the message, RobH [14:05:29] gah, labs is unusable again [14:11:19] :( [14:36:01] PROBLEM - Lucene disk space on search20 is CRITICAL: Connection refused by host [14:40:19] okay, silly question [14:40:25] why are we naming our backport section "universe" [14:40:29] instead of "backports"? :) [14:41:00] and our patched/custom-built "main", instead of say, "wikimedia", "patched", "custom" or whatnot [14:41:00] "cause that's how it's always been?" ? 
[14:41:05] idk if that's right though [14:41:39] sounds like the kind of thing that maybe hasn't changed in ~5 yrs [14:43:48] paravoid: there's an RT ticket somewhere to change that [14:44:08] Setting up pybal (0.1+r20120524-1) ... [14:44:08] * Starting pybal pybal Traceback (most recent call last): [14:44:08] File "/usr/sbin/pybal", line 10, in [14:44:08] from pybal import pybal [14:44:08] ImportError: No module named pybal [14:44:22] oh? [14:44:26] hehe [14:44:27] that may be me [14:44:31] only the /usr/sbin/pybal binary is included [14:44:31] most probably :) [14:44:32] and configs [14:44:35] not the actual app ;) [14:44:36] eh?! [14:44:40] s/binary/script/ [14:44:45] are you sure? [14:44:52] that's what dpkg -L tells me [14:45:15] and dpkg -c [14:45:15] built on precise? [14:45:17] yes [14:45:24] on pybal-precise [14:45:24] I only built it on Debian [14:45:42] Use of uninitialized value $python_default in substitution (s///) at /usr/share/perl5/Debian/Debhelper/Buildsystem/python_distutils.pm line 121. [14:45:44] a lot of those [14:45:47] that may have something to do with it [14:46:23] * jeremyb wonders if we should start a pool on how long until there's a buildd ;) [14:46:31] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:46:51] do we need a buildd? [14:46:58] # ls [14:47:02] yeah [14:47:02] and waits [14:47:12] gah, argh, grr [14:47:17] people keep wishing for a buildd [14:47:20] and although it would be nice [14:47:25] that's not actually where you spend most time ;) [14:47:30] you still have to test your packages [14:47:35] and thus do manual builds [14:47:48] ...which is pretty much one command anyway [14:48:19] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:48:38] I just built it on pybal-precise [14:48:43] and the deb includes them [14:48:47] so most probably a missing build depends [14:49:11] ah [14:49:15] I presume you built it in a clean pbuilder? [14:49:16] because I just installed twisted and stuff perhaps [14:49:16] mark: i guess the main benefit would be to ensure it builds in a clean, minimal env. has an accurate depends line, etc. [14:49:17] yes [14:49:19] brb [14:49:23] good mark, bad faidon [14:49:29] jeremyb: aka pbuilder, which we've had for years [14:49:39] oh, ok [14:51:01] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 4.509 seconds [14:51:02] i think it would be nice to have a build environment which builds/checks a package on every git checkin [14:51:03] automatically [14:51:08] from gerrit and perhaps jenkins [14:51:26] but it's not essential [14:51:50] sounds like an overkill :) [14:52:10] yeah definitely not something we can spend a lot of time on ;) [14:52:17] Sounds like a task for Jenkins [14:52:30] as I just said ;) [14:52:41] I put it as one of the berlin hackathon topics, but I don't expect much to happen [14:52:45] as we have more important stuff on our plate ;) [14:52:58] IPv6, git-deploy, ... [14:54:48] I think I just added ipv6 support to pybal, but still need to test it [15:00:41] hm, I just build pybal on a sid pbuilder and got the .py too [15:01:22] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:03:18] ah! 
[15:04:17] we're missing a build-dep on python-all [15:04:20] that's at least a problem [15:04:29] hehe [15:05:34] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 7.850 seconds [15:05:43] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [15:08:43] New patchset: Faidon; "Add b-d on python-all and use dh_python2" [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/8922 [15:08:45] mark: ^ [15:09:01] thanks! [15:09:03] I wasn't able to reproduce the issue but I think this will fix it [15:09:08] i'll test it now [15:09:11] *think* :) [15:09:41] New review: Mark Bergsma; "(no comment)" [operations/debs/pybal] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8922 [15:09:43] Change merged: Mark Bergsma; [operations/debs/pybal] (master) - https://gerrit.wikimedia.org/r/8922 [15:09:55] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:10:04] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:10:21] btw, re: ipv6, two questions for you: [15:10:41] a) I think we don't really need precise — and this might make the upgrade riskier and more complicated [15:11:00] a backport of ipvsadm to lucid should be fine, lucid's kernel is old enough to have ipv6 [15:11:10] ipv6 lvs that is [15:11:48] nah I wanna upgrade those boxes anyway [15:11:54] okay :) [15:11:56] since we're doing ipv6 only on nonactive hosts first, I see no risks [15:12:11] could even do it in only one data center [15:12:21] the risk is that if things go wrong we won't be sure if it's precise or ipv6 [15:12:28] but I guess it's a small risk [15:12:43] yeah but we can move some ipv4 traffic over also [15:12:44] for testing [15:12:54] i'm not that worried [15:12:56] okay [15:13:15] if you're not worried, I have no reason to be :) [15:13:33] b) maybe we should run a teredo/miredo relay and maybe even a 6to4 (half) relay [15:13:34] well, if you know where we're coming from [15:13:48] where I was testing pybal changes by live edits in /usr/lib/python files on active load balancer hosts [15:14:01] yes, good one [15:14:01] hahahaha [15:14:04] please make an RT ticket for that [15:14:10] with 2391 as parent [15:14:17] ok, doing that now [15:18:19] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 1.345 seconds [15:18:55] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [15:21:42] buil [15:21:43] ding [15:21:45] one [15:21:46] step [15:21:47] at [15:21:48] a [15:21:49] time [15:22:52] are you aware of bugzilla being slow / unreachable ? 
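
The ImportError earlier in this stretch (only /usr/sbin/pybal and the configs in the package, no pybal module) together with the "Use of uninitialized value $python_default" warnings is the classic symptom of building a Python package in a clean chroot without the Python build dependencies. Going by its summary, change 8922 ("Add b-d on python-all and use dh_python2") most likely boils down to something like the following sketch; the actual diff is not shown in the log:

    # debian/control (excerpt)
    Build-Depends: debhelper (>= 7.0.50~), python-all (>= 2.6.6-3~)

    # debian/rules
    #!/usr/bin/make -f
    %:
    	dh $@ --with python2

With python-all present in the build chroot, dh_python2 installs the pybal module into the proper dist-packages path, which would explain why the module shows up when the package is built on a box that already has the Python tooling installed but goes missing in a clean pbuilder chroot.
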
[15:23:34] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:23:43] yeah kaulen :-) [15:23:57] !log kaulen (bugzilla) unreacheable :-( [15:24:00] Logged the message, Master [15:24:44] mark: hahahahaha [15:26:16] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [15:26:35] paravoid: looks better now :) [15:30:37] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:31:22] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:34:15] !log Power cycled kaulen [15:34:21] Logged the message, Master [15:35:06] mark: yay :) [15:35:53] \O/ [15:35:53] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [15:35:56] brute force ftw [15:36:05] but it may happen again [15:37:05] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 0.002 seconds [15:44:16] happened this morning but I was able to get on and stop/start apache [15:44:25] and it seemed fine for a while after that [15:51:47] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [15:53:51] I feel like I'm cursed to work on an environment where commands respond minutes after you type them [15:54:41] yeah [15:54:50] good excuse to yell hard at ryan ;) [15:58:50] mark: for deployment-prep, we need wikimedia-task-appserver for precise [15:58:53] and I wonder [15:59:00] why do we have that instead of having it all in puppet? [15:59:04] it's just a metapackage, isn't it? [15:59:12] also contains scap scripts [15:59:14] but yeah [15:59:19] I think those have been edited in puppet too [15:59:23] not sure what the current status is [15:59:27] it used to do everything puppet does now [15:59:38] No, some scripts are in puppet and some are in the package [15:59:42] I don't /think/ there's any duplication there [16:00:09] but that's really the only reason for that package nowadays [16:00:20] Specifically, the scripts executed by humans on fenari (such as sync-file and scap) are in puppet, while the scripts that actually run on the nodes (such as scap-1 and friends) are in the package [16:00:32] right [16:00:57] so, keeping the package? [16:01:03] no we can get rid of it [16:01:16] as long as puppet is not gonna deploy some extensive set of scripts/apps anyway [16:01:40] and I also hate it when puppet touches files in /usr/bin or /usr/sbin ;) [16:01:51] should use /usr/local instead [16:01:59] Feel free to puppetize it and put the scripts in /usr/local/{s,}bin [16:02:06] Just don't break the deployment system plz :) [16:02:11] heh [16:02:15] that's the main reason noone's done it yet ;) [16:02:31] paravoid: but likely, just copying that _all package from lucid to precise repo will work [16:02:40] (if you choose not to modify things now) [16:02:45] Also, if git-deploy happens soon, it'll be obsolete anyway [16:02:49] yeah [16:02:59] Well, mostly. 
Stuff like apache-sanity-check is in there too [16:03:06] can get rid of that too [16:03:23] hm, I think I should just reprero copy to precise and wait for git-deploy then [16:03:31] yeah [16:03:41] no reason to mess with things I don't understand if they're about to get replaced :) [16:03:57] ok, and another question for you guys, bear with me :) [16:04:04] it seems we have a patched php5 in lucid-wikimedia [16:04:05] by Ryan [16:04:13] that may no longer be necessary [16:04:22] it did like two things or something [16:04:24] the changelog mentions a) enable cdb, b) enable gdb3 symbols and don't strip [16:04:28] yes [16:04:37] if both of those are now handled by ubuntu [16:04:43] feel free to not use put it in precise [16:04:48] so, how can I test for (a) [16:04:49] all the better [16:04:54] and why php5-dbg does not suffice for (b)? [16:05:01] I think that didn't exist back then [16:05:16] i've heard that (a) is included now [16:05:27] it's just one added --enable option in debian/rules [16:05:35] ok, I'll look at the source [16:05:36] if the current ubuntu php package has that, you should be set [16:07:13] debdiff to the rescue [16:07:23] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [16:08:29] nope, still no cdb in stock ubuntu [16:08:41] and wtf, we're missing dozens of security updates [16:09:52] that's why it would be nice if it were included [16:10:27] * paravoid cries [16:11:20] use after free, remote execution, open_basedir bypass, null pointer dereference, arbitrary code execution [16:11:23] among others [16:11:41] yeah yeah, we know ;) [16:17:34] So I don't want to hijack the conversation, but who's in charge of things like patching in ops? I've been meaning to ask ryan, but since paravoid brought it up... [16:17:55] noone and everyone is in charge [16:19:10] it's the wiki way [16:19:38] Good to know.... if slightly scary. [16:23:46] another one for berlin [16:23:51] actually way more important than the buildd one ;) [16:24:17] gaah, labs is just unusable [16:24:24] yes [16:24:25] I'm gonna stop working on this soon, try again on... tuesday [16:24:40] (monday is day off) [16:24:56] following the U.S. holidays? :) [16:25:02] no, dutch [16:25:29] pentecost [16:25:30] ah, greek holiday too [16:25:43] ascencion day for us [16:25:51] we had that last week [16:26:08] hmm, or not [16:26:26] no, ascension day was yesterday for us, but we don't celebrate it as a day off I guess [16:27:38] ah, Whit Monday, same as you apparently [16:27:45] http://en.wikipedia.org/wiki/Whit_Monday [16:27:49] monday after the pentecost [16:29:26] well, would be good to get that php package in git then, in gbp format [16:30:43] I think they have cdb now, it's just enabled by default [16:30:50] ah good [16:30:52] ./configure --help doesn't have --enable-cdb, just --without-cdb [16:30:57] but I'm not sure, I'm trying to find a way to test that now :) [16:31:03] if that holds for lucid too, then let's switch ASAP [16:37:51] # php5 [16:37:51] [16:37:51] Array [16:37:51] ( [0] => cdb [1] => cdb_make [16:37:56] that's on stock lucid [16:38:17] I should ask Ryan [16:39:43] i don't think ryan knows any more about this than you do now [16:39:48] it just happens that he did the last security rebuild ;) [16:40:35] mediawiki uses cdb for things like the localization cache [16:43:00] I was told that [16:43:21] who should I ask before replacing php from our repo and upgrade servers then? 
:) [16:44:28] I'm nearly positive stock php has cdb now [16:44:46] mark: and if the ldap negative cache is too long, we can shorten it ;) [16:44:54] please do [16:44:59] had to wait for like an hour today [16:45:06] in general, you'll only hit that if you try to log into an instance before you're in the project [16:45:07] paravoid: noone ;) [16:45:09] just test it on one box [16:45:26] test what?? :) [16:45:29] Ryan_Lane: also I managed to create one project (pybal) without members today [16:45:30] otherwise it probably wasn't negative ldap cache [16:45:35] paravoid: stock php package [16:45:42] there's no error checking for membership [16:45:52] no, test mediawiki how? [16:45:55] when creating a project, that is [16:46:04] I just tested dba_handlers() on a stock lucid php package [16:46:08] and says cdb/cdb_make [16:46:14] so, it's /probably/ okay [16:46:36] but I don't know how (or who to ask) to actually test mediawiki & l10n cache [16:46:50] roan maybe? [16:46:58] o [16:47:02] ok. off to the office [16:51:52] ... lucid? [16:51:57] * rcoli shudders slightly [16:52:10] what? [16:52:49] * rcoli comes from a shop where they were still running lucid in 2010, is familiar with the sort of problems one might have doing so [16:53:00] er [16:53:05] lucid came out in 2010 [16:53:06] precise was released like two weeks ago [16:53:13] err, lol [16:53:16] *lenny* [16:53:19] hash fail lookup [16:53:40] lenny was released in 2010 too [16:53:42] :) [16:53:43] if only it were named "exploding cow" [16:53:58] sorry, 2009 [16:54:40] and the last point release was released like 2 months ago [16:54:42] it's not /that/ old [16:54:47] no [16:54:48] I suppose you're right [16:54:51] we have a few hardy boxes left still [16:54:55] even those are doing fine [16:55:06] they'll be gone soon, but [16:55:09] yeah, it's just integration awkwardness etc [16:55:09] it's not like they're really in our way [16:57:28] so, mark, shall I send a mail to ops? [16:57:48] ask jeff [16:57:51] he tested this for fundraising [16:57:52] with the hope than Roan or someone else who reads it and knows MW will help me? [16:58:08] confirm that he has the stock php package working with an l10n cache install there, and afaic, you can just roll it then [16:58:19] he's also claimed the stock package was fine now [17:20:59] New review: Aaron Schulz; "Can you move $wgDefaultUserOptions to a proper place before being making this change?" [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/8914 [17:25:00] !log resetting gurvin, load spiking at 370+, SSH unreachable, 214 days of uptime [17:25:03] Logged the message, Master [17:27:49] yvon will go that way soon, it's also at 214 [17:29:57] apergos: good catch [17:30:02] I'm upgrading and will reboot [17:30:05] cool [17:30:08] I should depool first though [17:30:14] (i've gone off the clock already) [17:30:42] I need to schedule the snapshot hosts soon (not instantly but "soon") [17:33:16] is loudon the same case? [17:33:21] New patchset: Bhartshorne; "adding ganglia metrics to pds recursors" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8876 [17:33:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8876 [17:34:45] woosters: ganglia shows it as been dead for weeks, if not months [17:35:22] Jeff_Green, hi - is my cluster access supposed to work now? [17:36:18] MaxSem: in theory yeah [17:36:29] is it full of fail? 
[17:36:37] Jeff_Green, I couldn't log in [17:36:44] where did you try? [17:37:22] host: fenari.wikimedia.org, account name: maxsem, using the key I sent you [17:37:41] looking [17:38:19] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8876 [17:38:22] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8876 [17:40:34] hm, shouldn't I able to find gurvin in fenari's /home/wikipedia/conf/pybal? [17:41:13] oh how nice, gurvin is not getting any traffic or has nginx installed [17:41:55] MaxSem: can you try again? [17:42:06] !log rebooting gurvin & yvon with new kernel [17:42:10] Logged the message, Master [17:43:14] MaxSem: ah, key_read error. looking [17:46:15] MaxSem: try now? [17:46:29] whee [17:46:51] thanks, works now [17:47:00] you may run into the same issue on other hosts, let me double-check puppet config [17:48:30] New patchset: Bhartshorne; "yay typos" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8932 [17:48:37] somehow 'bqF' got mangled to '> ' :-( [17:48:50] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8932 [17:48:50] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8932 [17:49:45] New patchset: Jgreen; "fix maxsem's ssh public key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8933 [17:50:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8933 [17:50:27] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8933 [17:50:29] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8933 [17:51:46] MaxSem: ok fixed in puppet too, it will take a while to propagate, certainly should be out within an hour though [17:52:23] cool [17:52:33] sorry about that. [18:04:01] RECOVERY - Puppet freshness on gurvin is OK: puppet ran at Fri May 25 18:03:44 UTC 2012 [18:25:46] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:19] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:29:49] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 8.757 seconds [18:30:25] RECOVERY - Puppet freshness on dobson is OK: puppet ran at Fri May 25 18:30:13 UTC 2012 [18:34:10] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:39:52] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 1.591 seconds [18:44:33] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:48:31] is someone actually looking at kaulen? [18:49:58] woosters: bugzilla (kaulen) is totally unreachable again [18:50:44] robla - someone will be [18:50:54] thanks! [18:51:45] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 2.633 seconds [18:52:41] robla - robh is investigating [18:52:54] ? [18:52:58] i am? [18:52:59] ok.. [18:53:33] ssh is borked, going in via serial [18:53:50] drac is borked, resettting drac [18:54:38] drac reset is slow. 
[18:55:09] !log kaulen serial console unresponsive, rebooting [18:55:13] Logged the message, RobH [18:56:02] its posting, im monitoring the boot [18:57:00] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [18:57:25] !log kaulen is rebooted, it may have had a runaway process or a memory leak, not sure yet, but it was locked up from access [18:57:29] Logged the message, RobH [18:57:32] !log bugzilla appears back online [18:57:35] Logged the message, RobH [18:57:37] robla: ^ [18:57:47] looks like it's back, but since this is the second reboot today, probably more investigation is in order [18:57:49] im going to grep the logs, but half th etime this kind of lockup shows little to nothing [18:58:03] but since it just happened, hopefully there is something [18:58:10] thanks [18:58:30] * robla disappears into interview [18:59:08] btw, kaulen in hokkien means mischevious [18:59:14] so it is acting up [18:59:16] ;-P [19:01:25] alright, from now on, all machines will be named for synonyms of "obedient" [19:01:32] i see a load spike around the last crash and this one [19:02:59] the cpuwait time skyrockets and the machine locks up [19:04:48] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [19:04:48] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [19:04:48] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours [19:04:48] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [19:09:57] !log disabled the outdated /etc/init.d/gmond on spence. use ganglia-monitor instead. [19:10:00] Logged the message, Master [19:11:14] RobH: the spikes correlate with something blowing its memory and pushing the machine into swap death [19:11:17] (hence the iowait) [19:11:52] yea, im just not sure what process is doing it [19:20:57] hexmode: are you running this? https://www.mediawiki.org/wiki/Special:Code/MediaWiki/115435 [19:41:34] robla: sumanah asked me to check in the stuff I have. So I did. [19:42:19] hexmode: sure, thanks for that! I'm just wondering if you're actually running it, or if we need to look elsewhere for causes of bugzilla freaking out and dying [19:42:45] robla: none of that stuff is running. [19:42:56] at least, not on a regular basis [19:43:07] did you run it a couple of hours ago? [19:44:24] "it"? 
no [19:44:36] I didn't take down bz :) [19:46:06] I was thinking that Sam was working on the extension I asked for, but wasn't aware of a problem till Sumanah told me [19:46:47] k...thanks :) [19:47:42] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours [20:14:14] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [20:47:50] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:48:17] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:49:11] (so people know about the bz prob) [20:49:38] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 5.203 seconds [20:53:59] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:56:15] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 3.908 seconds [21:07:03] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [22:30:30] New patchset: Sara; "Ensure only one of the gmond and ganglia-monitor init scripts is in production." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8980 [22:30:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8980 [22:32:50] the above change is analogous to https://gerrit.wikimedia.org/r/8977 which i just pushed to labs. i don't think it should break anything, but i can also hold off until next week. [22:35:42] PROBLEM - swift-object-replicator on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [22:35:42] PROBLEM - swift-object-server on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [22:36:00] PROBLEM - swift-account-server on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [22:36:27] PROBLEM - swift-container-server on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [22:39:35] ^^^^^ that's me. 
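
RobH and maplebed concluded earlier (around 19:01-19:11) that something on kaulen keeps blowing its memory and pushing the box into swap death, but the culprit process is unknown. One low-tech way to catch it before the next lockup is to snapshot the largest resident-set sizes every minute from cron; a sketch of such a script, not anything that was actually deployed:

    import os
    import time

    def rss_kb(pid):
        # VmRSS is reported in kB; kernel threads have no VmRSS line at all.
        with open('/proc/%s/status' % pid) as f:
            for line in f:
                if line.startswith('VmRSS:'):
                    return int(line.split()[1])
        return 0

    def top_memory(n=5):
        procs = []
        for pid in os.listdir('/proc'):
            if not pid.isdigit():
                continue
            try:
                with open('/proc/%s/cmdline' % pid) as f:
                    cmd = f.read().replace('\0', ' ').strip() or '[pid %s]' % pid
                procs.append((rss_kb(pid), pid, cmd))
            except (IOError, OSError):
                continue   # the process exited while we were reading it
        return sorted(procs, reverse=True)[:n]

    if __name__ == '__main__':
        print(time.strftime('%Y-%m-%d %H:%M:%S'))
        for kb, pid, cmd in top_memory():
            print('%10d kB  pid %-6s  %s' % (kb, pid, cmd))

Appending that output to a file from a one-minute cron job leaves a trail to read after the next freeze; maplebed later estimates roughly a ten-minute window between the memory starting to climb and the machine becoming unreachable, so a one-minute interval is plenty.
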
[22:39:36] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [22:41:15] PROBLEM - swift-account-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [22:41:15] PROBLEM - swift-object-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [22:41:15] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:41:33] PROBLEM - swift-container-replicator on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [22:41:33] PROBLEM - swift-account-reaper on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [22:42:00] PROBLEM - swift-account-replicator on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [22:42:18] PROBLEM - swift-container-updater on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [22:42:18] PROBLEM - swift-object-updater on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [22:42:27] yup. still me ^^^^ [22:42:52] (there's no user visible impact; I stopped the swift processes on one backend node for a moment to reformat one of the disks) [22:43:03] RECOVERY - swift-container-replicator on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [22:43:03] RECOVERY - swift-object-replicator on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [22:43:03] RECOVERY - swift-account-reaper on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [22:43:21] RECOVERY - swift-account-replicator on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [22:43:21] RECOVERY - swift-container-server on ms-be2 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [22:43:21] RECOVERY - swift-object-server on ms-be2 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [22:43:26] grumble. 
[22:43:48] RECOVERY - swift-object-updater on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [22:43:48] RECOVERY - swift-container-updater on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [22:45:45] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:46:48] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:47:15] PROBLEM - swift-container-replicator on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [22:47:15] PROBLEM - swift-object-replicator on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [22:47:15] PROBLEM - swift-account-reaper on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [22:47:24] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours [22:47:33] PROBLEM - swift-account-replicator on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [22:47:33] PROBLEM - swift-container-server on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [22:48:00] RECOVERY - swift-account-server on ms-be2 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [22:48:00] PROBLEM - swift-object-server on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [22:48:00] PROBLEM - swift-object-updater on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [22:48:09] PROBLEM - swift-container-updater on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [22:48:18] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:48:18] RECOVERY - swift-object-auditor on ms-be2 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [22:48:18] RECOVERY - swift-account-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [22:48:36] RECOVERY - swift-container-replicator on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [22:48:36] RECOVERY - swift-account-reaper on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [22:48:36] RECOVERY - swift-object-replicator on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [22:48:54] RECOVERY - swift-container-server on ms-be2 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [22:48:54] RECOVERY - swift-account-replicator on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [22:49:21] RECOVERY - swift-object-server on ms-be2 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [22:49:21] RECOVERY - swift-object-updater on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [22:49:30] RECOVERY - swift-container-updater on ms-be2 is OK: 
PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [22:49:39] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 0.018 seconds [22:50:24] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours [22:50:24] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours [22:50:48] maplebed: in amongst the nagios spam, bugzilla/kaulen just went down again [22:51:19] chrismcmahon: I didn't know it went down before> [22:51:21] :D [22:51:26] heh [22:51:44] (04:45:45 PM) nagios-wm: PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:51:59] whee!!!! [22:51:59] http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20pmtpa&h=kaulen.wikimedia.org&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [22:52:12] someone has a runaway process! [22:52:24] 3rd time today [22:52:40] I think it was RobH handled it before? [22:52:43] your only choices are to wait for it to recover or pull the plug. [22:53:03] pulling the plug, of course, brings along with it all the wonderful possibilities of data corruption. [22:53:11] we've power cycled kaulen twice today I think [22:53:22] rough. [22:53:49] bad news late on Memorial Day Friday :( [22:53:52] this looks like the fourth time it's happened in the last 14 hours. [22:54:09] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:54:17] I'll cycle it again, I suppose. [22:54:28] one of these times it won't come back and folk'll be pissed... [22:54:52] yeah, it's a matter of a few hours between cycles. it would suck to have bugzilla down for 3 days [22:56:24] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours [22:56:24] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours [22:56:24] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours [22:57:09] ok, power cycled. [22:57:19] give it a few minutes to boot. [22:57:45] !log powercycled kaulen on the mgmt interface [22:57:49] Logged the message, Master [22:58:30] ping is back [22:59:06] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [23:00:00] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 0.006 seconds [23:05:06] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:05:39] I am SO ready to ditch Comcast. maplebed if you replied I may have missed it. [23:06:20] chrismcmahon: set up a relay? [23:06:44] anyway, I powercycled kaulen and it should be back now. [23:07:28] maplebed: my worry is that kaulen will die again in 4 or 5 hours [23:07:48] reasonable worry. [23:08:04] put in a cronjob that will kill anything that takes too much memory? [23:08:24] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [23:08:25] but I don't know how to identify the runaway process there [23:09:03] according to ganglia there's about a 10minute window to catch it before things go boom [23:09:29] I may be able to set up a nagios alert to fire when memory exceeds 1G or something, [23:09:54] after which there'll be about 3 minutes to kill the process (assuming 5 minutes to catch it and 2 minutes to respond) [23:10:07] maplebed: that'd be awesome. whatever it is, it's fairly new [23:10:15] do you know who (besides rotos) have shell on kaulen? 
[23:10:26] no idea. pretty sure I don't [23:13:39] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:31:05] New patchset: Bhartshorne; "adding check for memory usage and retabbing the file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8985 [23:31:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8985 [23:32:00] anybody want to review ^^^? [23:32:21] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8985 [23:32:23] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8985 [23:35:23] thanks maplebed [23:39:47] chrismcmahon: I've put in what I think will be an alert at 2.5G used on kaulen. we'll see if it works. [23:41:54] maplebed: good stuff, thanks [23:42:44] * chrismcmahon hopes that will keep bugzilla up for the weekend [23:48:28] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [23:50:34] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours