[00:00:20] (AaronSchulz, thanks) [00:03:00] > var_dump( $redis->sRandMember( 'testset' ) ); [00:03:02] string(1) "e" [00:03:03] Segmentation fault (core dumped) [00:03:05] aaron@aaron-HP-HDX18-Notebook-PC:/var/www/DevWiki/core$ [00:03:06] TimStarling: ;) [00:03:56] it works fine until I switch from using nothing to using php unserialize to unserialize [00:04:02] then it works maybe the first time and segfaults [00:04:36] maybe it's just sRandMember [00:05:32] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [00:05:57] TimStarling: maybe we can just deploy 61927 and investigate later [00:06:11] sure [00:06:20] I'm in a meeting now [00:07:52] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 00:07:51 UTC 2013 [00:08:15] hmm, I should go home [00:08:32] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [00:09:02] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 00:08:57 UTC 2013 [00:09:32] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [00:10:02] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 00:09:59 UTC 2013 [00:10:33] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [00:11:02] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 00:10:57 UTC 2013 [00:11:32] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [00:11:52] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 00:11:47 UTC 2013 [00:12:32] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [00:12:42] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 00:12:32 UTC 2013 [00:13:32] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [00:13:42] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 
00:13:38 UTC 2013 [00:14:32] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [00:15:39] !log @stafford:~# puppetstoredconfigclean.rb db10.pmtpa.wmnet [00:15:46] Logged the message, Master [00:28:46] hey - did anyone get the urgent ticket ? [00:30:46] Leslie, I can do it if there's a HOWTO somewhere. I have all the bits, not the know-how [00:31:14] LeslieCarr: Works for me. [00:31:53] do we have any problems with traffic right now? [00:33:44] hm, I guess we don't. For some reason blog.wm.org's not been responding for me for a while. [00:35:02] PROBLEM - Puppet freshness on cp1031 is CRITICAL: No successful Puppet run in the last 10 hours [00:39:07] New patchset: awjrichards; "Override CentralAuth cookie domains for commons/meta to work with mobile" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61941 [00:47:05] paravoid: ping [01:03:15] New patchset: Aaron Schulz; "Keep the GettingStarting redis objects using no serialization." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61927 [01:05:38] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [01:09:26] is it possible to run tests against production from private IPs that are part of our network? I need a way to simulate different IPs, one per ZERO provider, hitting prod servers [01:10:00] a test would verify that any request coming from an IP 1 gets mapped to provider 1, 2 => 2, etc [01:10:18] and it would verify that all responses are correct for that provider [01:11:45] ideally I wouldn't want to have that many machines, so if one machine can take up 500+ private ips and issue calls from them, that should solve the testing needs [01:15:40] New review: Mattflaschen; "I think we should do this in the extension proper: https://gerrit.wikimedia.org/r/#/c/61943/ . I al..." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61927 [01:17:05] !log mwalker synchronized php-1.22wmf2/extensions/CentralNotice/modules/ext.centralNotice.bannerController/bannerController.js 'Poking bits to try and get the new banner controller deployed for CentralNotice' [01:17:13] Logged the message, Master [01:20:01] !log mwalker synchronized php-1.22wmf3/extensions/CentralNotice/modules/ext.centralNotice.bannerController/bannerController.js 'Poking bits to try and get the new banner controller deployed for CentralNotice' [01:20:09] Logged the message, Master [01:21:44] mutante: *waves* [01:31:48] hallo? [01:46:20] Hi. [01:57:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:58:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [02:06:22] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [02:13:52] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours [02:18:12] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 190 seconds [02:20:12] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 5 seconds [02:20:41] !log kaldari Started syncing Wikimedia installation... 
: [02:20:45] !log on wikibugs-l: disabled bounce processing and re-enabled mail delivery to wikibugs-irc (was disabled due to excessive bounces) [02:20:49] Logged the message, Master [02:20:57] Logged the message, Master [02:21:30] !log LocalisationUpdate completed (1.22wmf3) at Thu May 2 02:21:30 UTC 2013 [02:21:38] Logged the message, Master [02:32:10] !log LocalisationUpdate completed (1.22wmf2) at Thu May 2 02:32:10 UTC 2013 [02:32:18] Logged the message, Master [02:43:19] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 185 seconds [02:45:18] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 9 seconds [02:45:31] !log kaldari Finished syncing Wikimedia installation... : [02:45:38] Logged the message, Master [02:46:58] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [02:46:58] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [02:48:18] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 189 seconds [02:50:18] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 10 seconds [02:50:39] PROBLEM - Host upload-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [02:50:48] PROBLEM - Host wiktionary-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [02:50:50] PROBLEM - Host wikisource-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [02:50:53] PROBLEM - Host wikiversity-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [02:51:18] RECOVERY - Host wikiversity-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 85.04 ms [02:51:20] RECOVERY - Host wikisource-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 86.37 ms [02:51:29] RECOVERY - Host upload-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 88.08 ms [02:51:38] RECOVERY - Host 
wiktionary-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 83.72 ms [02:58:19] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 189 seconds [03:00:18] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 10 seconds [03:05:58] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [03:08:23] New review: Ori.livneh; "> Paths seem to be working properly now. This still breaks the apache restart, though, so the wiki d..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61816 [03:19:18] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 224 seconds [03:20:18] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 10 seconds [03:32:48] TimStarling: I can do the submodule bump and sync [03:33:05] I'm already doing it [03:33:47] Thanks. I didn't want to give it a meaningless +2, and I hadn't yet followed through Aaron's explanation. [03:34:59] Aaron reproduced it, he didn't isolate it [03:35:55] I'm going to try it on test.wikipedia.org first [03:38:26] OK. You might not see entries under all three task types, but that's normal. We're not diligent about making sure all three are prepopulated for testwiki. [03:39:36] !log tstarling synchronized php-1.22wmf3/extensions/GettingStarted [03:39:45] Logged the message, Master [03:39:55] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [03:40:01] looks OK [03:41:11] nothing suspicious in the logs on fluorine [03:41:20] nothing recent, at least [03:41:58] New patchset: Tim Starling; "Remove IRC link from error message" [operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/61950 [03:46:32] !log tstarling synchronized php-1.22wmf2/extensions/GettingStarted [03:46:39] Logged the message, Master [03:46:57] looks OK too [03:47:36] why only three pages per task type? 
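The segfault at the top of this log was in the phpredis extension's serializer path under sRandMember; the fix deployed here (Gerrit change 61927, later superseded) kept the GettingStarted keys unserialized and moved (de)serialization into application code. A minimal Python sketch of that pattern, with a plain dict standing in for the Redis server and JSON as an illustrative encoding (neither is the actual GettingStarted code; the crash was in the C extension, not in Redis itself):

```python
# Sketch of the workaround: instead of letting the Redis client library
# serialize values (the code path that was segfaulting in the phpredis
# C extension), serialize and unserialize in application code and store
# plain strings. A dict stands in for the Redis server; JSON is an
# illustrative encoding only.
import json
import random

store = {}  # key -> set of raw strings, standing in for Redis

def sadd(key, value):
    # application-level serialization before storing
    store.setdefault(key, set()).add(json.dumps(value, sort_keys=True))

def srandmember(key):
    # fetch a raw string, then unserialize at the application level
    raw = random.choice(list(store[key]))
    return json.loads(raw)

sadd("testset", {"title": "Example", "task": "copyedit"})
print(srandmember("testset"))
```

With a single member in the set, srandmember deterministically returns the stored value; the point is only that the client library never sees anything but plain strings.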
[03:47:45] it doesn't seem like enough [03:50:49] I'm not sure. I'm satisfied that Steven et al are studying it carefully and mostly just implement what they tell me to. There's a different interface that we're trying out in https://gerrit.wikimedia.org/r/#/c/59575/ [03:51:14] New review: Andrew Bogott; "There's really nothing to show -- the problem is that Apache didn't pick up the change, so the http:..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61816 [03:52:20] Whenever I express opinion about the interface of anything I end up eating sand for it, so meh. [03:52:51] I don't put too much stock in my intuitions anyway [03:53:43] New patchset: Tim Starling; "Use a password for job queue and session redis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61734 [03:53:48] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61734 [03:54:00] New review: MZMcBride; "Related to bug 16043." [operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/61950 [03:55:12] It's E3, get it? [03:56:43] !log tstarling synchronized wmf-config/jobqueue-eqiad.php [03:56:50] Logged the message, Master [03:56:57] https://bugzilla.wikimedia.org/show_bug.cgi?id=20079 [03:58:15] New review: Ori.livneh; "OK. Let's restore the previous behavior for now. I'll update the patch." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61816 [03:59:00] !log tstarling synchronized wmf-config/CommonSettings.php [03:59:07] Logged the message, Master [04:03:14] New patchset: Ori.livneh; "Improvements to mediawiki_singlenode" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61816 [04:03:24] !log tstarling synchronized wmf-config/CommonSettings.php [04:03:31] Logged the message, Master [04:05:08] New review: Ori.livneh; "Note though that this will restart Apache every single run. The optimal way to use Puppet is to defi..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/61816 [04:06:46] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [04:07:56] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 04:07:51 UTC 2013 [04:08:46] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [04:09:06] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 04:08:57 UTC 2013 [04:09:18] New patchset: Tim Starling; "Respect GettingStarted default options" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61952 [04:09:46] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [04:09:53] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61952 [04:10:06] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 04:09:59 UTC 2013 [04:10:19] Change abandoned: Tim Starling; "Superseded" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61927 [04:10:46] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [04:11:06] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 04:10:56 UTC 2013 [04:11:46] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [04:11:56] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 04:11:46 UTC 2013 [04:12:46] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [04:13:06] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 04:13:03 UTC 2013 [04:13:46] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [04:13:52] New patchset: Tim Starling; "Require a password for Redis" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61740 [04:15:15] Change merged: Tim Starling; [operations/puppet] 
(production) - https://gerrit.wikimedia.org/r/61740 [04:18:16] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 190 seconds [04:18:56] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours [04:20:16] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 9 seconds [04:34:16] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 218 seconds [04:35:19] PROBLEM - search indices - check lucene status page on search18 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 55856 bytes in 0.113 second response time [04:35:20] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 10 seconds [04:39:49] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu May 2 04:39:48 UTC 2013 [04:39:56] Logged the message, Master [04:44:05] !log on mc1-16 and mc1001-1016, setting requirepass and masterauth to the new password in soft state [04:44:13] Logged the message, Master [04:44:19] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 190 seconds [04:45:19] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 10 seconds [04:46:28] !log on rdb1001-1002, set requirepass and masterauth in soft state [04:46:36] Logged the message, Master [04:54:19] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 183 seconds [04:55:19] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 10 seconds [05:02:48] New patchset: Tim Starling; "Added a couple of missing passwords" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61956 [05:03:21] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61956 [05:05:09] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [05:07:45] !log tstarling synchronized wmf-config/jobqueue-eqiad.php [05:07:54] Logged the message, Master [05:11:18] !log tstarling 
synchronized wmf-config/jobqueue-pmtpa.php [05:11:32] Logged the message, Master [05:14:27] !log professor is down, no response on serial console, rebooting [05:14:35] Logged the message, Master [05:19:20] RECOVERY - Host professor is UP: PING OK - Packet loss = 0%, RTA = 27.29 ms [05:19:29] RECOVERY - carbon-cache.py on professor is OK: PROCS OK: 1 process with args carbon-cache.py [05:27:27] !log on professor: manually started carbon-cache.py [05:27:29] PROBLEM - RAID on snapshot1003 is CRITICAL: Timeout while attempting connection [05:27:35] Logged the message, Master [05:27:39] !log snapshot1003 powercycle, upgrading to precise [05:27:46] Logged the message, Master [05:29:09] PROBLEM - Host snapshot1003 is DOWN: PING CRITICAL - Packet loss = 100% [05:29:29] PROBLEM - carbon-cache.py on professor is CRITICAL: PROCS CRITICAL: 2 processes with args carbon-cache.py [05:33:19] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 189 seconds [05:34:20] RECOVERY - Host snapshot1003 is UP: PING OK - Packet loss = 0%, RTA = 1.79 ms [05:35:19] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 10 seconds [05:36:20] PROBLEM - SSH on snapshot1003 is CRITICAL: Connection refused [05:36:29] PROBLEM - Disk space on snapshot1003 is CRITICAL: Connection refused by host [05:36:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:36:49] PROBLEM - DPKG on snapshot1003 is CRITICAL: Connection refused by host [05:37:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [05:37:49] PROBLEM - Host professor is DOWN: PING CRITICAL - Packet loss = 100% [05:43:19] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 190 seconds [05:44:20] RECOVERY - SSH on snapshot1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [05:44:39] New patchset: Tim Starling; "Short client timeout for
graphite event logs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61958 [05:45:19] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 10 seconds [05:45:41] !log powercycle snapshot1004, upgrade to precise [05:45:48] Logged the message, Master [05:46:09] PROBLEM - RAID on snapshot1004 is CRITICAL: Timeout while attempting connection [05:46:39] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61958 [05:48:29] PROBLEM - NTP on snapshot1003 is CRITICAL: NTP CRITICAL: No response from NTP server [05:50:09] PROBLEM - SSH on snapshot1004 is CRITICAL: Connection refused [05:50:09] PROBLEM - Disk space on snapshot1004 is CRITICAL: Connection refused by host [05:50:20] PROBLEM - DPKG on snapshot1004 is CRITICAL: Connection refused by host [06:02:29] PROBLEM - NTP on snapshot1004 is CRITICAL: NTP CRITICAL: No response from NTP server [06:03:09] RECOVERY - SSH on snapshot1004 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [06:03:52] !log powercycle snapshot1001, upgrade to precise [06:04:01] Logged the message, Master [06:05:27] PROBLEM - Host snapshot1001 is DOWN: PING CRITICAL - Packet loss = 100% [06:05:56] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [06:15:56] RECOVERY - Host snapshot1001 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [06:18:06] PROBLEM - DPKG on snapshot1001 is CRITICAL: Connection refused by host [06:18:16] PROBLEM - Disk space on snapshot1001 is CRITICAL: Connection refused by host [06:18:16] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 190 seconds [06:18:16] PROBLEM - SSH on snapshot1001 is CRITICAL: Connection refused [06:18:26] PROBLEM - RAID on snapshot1001 is CRITICAL: Connection refused by host [06:19:16] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 18 seconds [06:28:16] RECOVERY - SSH on snapshot1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 
(protocol 2.0) [06:30:16] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 06:30:07 UTC 2013 [06:30:56] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [06:30:56] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 06:30:52 UTC 2013 [06:31:56] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [06:32:06] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 06:32:00 UTC 2013 [06:32:56] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [06:33:20] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 186 seconds [06:35:16] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 9 seconds [06:38:14] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 189 seconds [06:40:14] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 8 seconds [06:42:44] PROBLEM - NTP on snapshot1001 is CRITICAL: NTP CRITICAL: No response from NTP server [06:44:41] TimStarling: i was pretty diligent about testing those two patches to filter logmsgbot connections, btw, if you feel like merging them [07:02:01] New patchset: ArielGlenn; "on precise use mysql client 5.5 for snapshot hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61959 [07:03:33] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61959 [07:06:07] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [07:08:26] RECOVERY - NTP on snapshot1003 is OK: NTP OK: Offset -0.0105394125 secs [07:19:16] RECOVERY - Disk space on snapshot1003 is OK: DISK OK [07:19:36] RECOVERY - RAID on snapshot1003 is OK: OK: no RAID installed [07:19:46] RECOVERY - DPKG on snapshot1003 is OK: All packages OK [07:33:13] New review: Hashar; "Can't you make cidr to be an array ? 
Maybe keeping the string form whenever one only want to pass on..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/61920 [07:34:22] hello [07:38:46] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [07:38:46] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [07:38:46] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [07:38:46] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [07:46:09] New review: Hashar; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61425 [07:46:19] New patchset: Hashar; "multiversion: ability to destroy singleton" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61425 [07:52:03] New review: Hashar; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61428 [07:55:06] RECOVERY - NTP on snapshot1004 is OK: NTP OK: Offset -0.01800429821 secs [08:05:08] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [08:05:24] morning hashar [08:06:59] i spent the last day fighting with the most infuriating rubygems problem and i think i finally figured it out [08:07:48] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 08:07:42 UTC 2013 [08:07:50] ori-l: rewrote the script to python ? [08:07:54] err [08:07:58] ported the script to python? [08:08:08] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [08:08:13] i wish [08:08:28] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 08:08:22 UTC 2013 [08:08:38] 'bundle install' for the qa/browsertests repo would work whenever i ran it but not through puppet [08:09:06] and i kept digging in the wrong direction -- is it the fact that i'm not root? that i'm running in a login shell? 
some environment variable? a dotfile? [08:09:08] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [08:09:59] turns out 'bundle install' just eats up a lot of memory and so does puppet [08:10:26] and the compilation of the native extension is just aborted when it runs out of memory without an error message indicating what happened [08:10:35] ah that is very useful :D [08:10:52] ideally the gems should be packaged [08:11:16] yes, some but not all are available in apt [08:11:37] I had the same issue with the Zuul gateway, I had to package a couple python modules [08:12:19] in general i don't like the ruby attitude to packaging which is a little, oh, "après moi, le déluge" [08:12:38] I love the quote [08:12:38] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 08:12:29 UTC 2013 [08:12:51] zeljkof might come to the rescue [08:12:56] he knows ruby [08:13:08] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [08:13:09] also maybe one of the gems is wrong and has a huge mem leak [08:13:17] it's ffi, specifically [08:13:20] i think i have a workaround [08:13:23] hashar: what is the problem? [08:13:33] 'bundle install' is a monster, but 'gem install ffi' followed by 'bundle install' works [08:13:38] zeljkof: ori-l having some problems installing the gems for qa/browsertests [08:13:52] ori-l: what is the problem? [08:13:53] zeljkof: gems bundle install dies with an out-of-memory error [08:14:01] zeljkof: see scrollback [08:14:29] i think i can work around it, but hashar's point about packaging is important [08:15:18] i looked at the gemfile and i suspect some of the version choices (esp. 
when they're greater than what has been packaged for debian) were not principled but just based on whatever was newest at the time [08:16:28] i've been reading advice from ruby people online tonight and it's distressingly pretty consistent: just ditch debian packages entirely and move to rvm + gemsets [08:16:38] ori-l: yes, we usually use the latest versions of everything [08:17:14] ori-l: I am not sure what is the best way to go [08:17:19] i don't know that this is a good choice for ruby, but if that's the choice the community made, then it would help to have a really good puppet manifest for setting up rvm in some controlled way (i.e., not writing stuff all over the filesystem but more or less sandboxed somewhere) [08:17:55] ori-l: rvm is also not the only choice [08:17:58] a lot of the rvm installation guides are: "just run `curl some.domain.com/install-rvm | sh`" [08:18:14] there is at least one more major player there, maybe it behaves better [08:18:26] ported the script to python? [08:18:27] http://rbenv.org/ [08:18:32] oh [08:18:44] :) that last part was just me trolling :P [08:18:58] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [08:20:19] hrm, rbenv looks interesting [08:20:49] ori-l: I am pretty sure there are others, but as far as I know, rvm and rbenv are the two big players [08:20:58] they specifically talk about compatibility with configuration management software (chef specifically) as a selling point [08:23:32] ori-l: I am open to change :) [08:23:47] zeljkof: which one do you use personally? 
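The diagnosis a few lines up (the native-extension compile inside 'bundle install' dies when memory runs out, with no error message saying so) can be illustrated with a small sketch: a child process under an address-space cap fails on an allocation that succeeds uncapped. The child is a Python snippet standing in for the gem build, and the cap and allocation sizes are arbitrary illustrative numbers:

```python
# Illustration of the failure mode described above: a memory-capped
# child process (standing in for the native-extension build running
# under puppet) fails on a large allocation, while the same work
# succeeds without the cap. Sizes are illustrative only.
import resource
import subprocess
import sys

def run_alloc(megabytes, cap_megabytes=None):
    """Run a child that allocates `megabytes`; optionally cap its
    address space first. Returns the child's exit status."""
    def set_cap():
        cap = cap_megabytes * 1024 * 1024
        resource.setrlimit(resource.RLIMIT_AS, (cap, cap))
    code = "x = bytearray({} * 1024 * 1024)".format(megabytes)
    proc = subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=set_cap if cap_megabytes else None,
        capture_output=True,
    )
    return proc.returncode

# A 300 MB allocation fails under a 256 MB cap but succeeds uncapped.
assert run_alloc(300, cap_megabytes=256) != 0
assert run_alloc(300) == 0
```

The capped child dies with a nonzero exit status and little context, which is roughly what the gem build looked like from puppet's side.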
[08:23:55] ori-l: rvm [08:24:04] but that is for historical reasons [08:24:23] rbenv was not there when I was picking a tool [08:24:35] I think rvm was the only choice [09:06:25] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [09:08:41] zeljkof: http://dpaste.de/B8Uw0/raw/ [09:08:48] i can't tell you how many times i've seen that trace today [09:09:00] retiring for the night, will try again tomorrow [09:09:13] ori-l: good night :) [09:09:24] good night [09:09:37] "Failed to build gem native extension" usually means dev tools are not installed [09:12:45] zeljkof: http://dpaste.de/niYDi/raw/ [09:12:50] everything installed [09:13:20] strange [09:13:31] you still think it is a memory problem? [09:13:55] well, if i start a login shell, chdir to the directory, and run 'bundle install', it works [09:14:09] if i tell puppet to do the exact same thing, it fails with that error [09:14:21] strange [09:14:39] but fortunately, if you want to help, you can reproduce this rather easily :) [09:14:52] I will try to reproduce it today [09:14:54] just pull the patch into your vagrant dir and 'vagrant up' and off you go [09:15:16] awesome, let me know if you discover something [09:15:27] bye for now [09:15:37] good night [09:24:00] New patchset: ArielGlenn; "try to fix issue on fresh installs where nrpe starts with weird uid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61963 [09:25:10] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61963 [09:29:26] New patchset: Hashar; "udp2log: let daemon recreate files after logrotate" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61964 [09:30:15] RECOVERY - Disk space on snapshot1004 is OK: DISK OK [09:30:15] RECOVERY - DPKG on snapshot1004 is OK: All packages OK [09:30:48] New review: Hashar; "I have added as reviewers Tim, Ori and Andrew Otto who have some knowledge about udp2log daemon :-]" [operations/puppet] 
(production) - https://gerrit.wikimedia.org/r/61964 [09:30:55] RECOVERY - RAID on snapshot1004 is OK: OK: no RAID installed [09:39:41] !log maxsem synchronized php-1.22wmf3/extensions/GeoData/GeoData.body.php 'https://gerrit.wikimedia.org/r/#/c/61962/' [09:39:49] Logged the message, Master [09:43:49] New patchset: ArielGlenn; "include the new nrpe::user class in nrpe" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61967 [09:45:17] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61967 [09:50:08] snapshot1001: @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! [09:50:11] !log maxsem synchronized php-1.22wmf3/extensions/ZeroRatedMobileAccess/ZeroRatedMobileAccess.body.php 'https://gerrit.wikimedia.org/r/#/c/61926/' [09:50:18] Logged the message, Master [09:51:00] MaxSem: you can update the known host from fenari [09:51:39] MaxSem: scp fenari.wikimedia.org:/etc/ssh/ssh_known_hosts ~/.ssh/known_hosts-wmf [09:51:40] Then in your ~/.ssh/config: UserKnownHostsFile ~/.ssh/known_hosts-wmf [09:52:00] the UserKnownHostsFile setting should be applied to your Host *.wmnet and Host *.wikimedia.org entries [09:52:12] if you make that scp a shell function, you can update it manually from time to time [09:52:21] as long as you trust fenari's fingerprint, you will be fine [09:52:35] note that the known_hosts-wmf is generated by puppet [09:52:58] !log maxsem synchronized php-1.22wmf3/extensions/ZeroRatedMobileAccess/ZeroRatedMobileAccess.body.php 'https://gerrit.wikimedia.org/r/#/c/61926/' [09:53:05] Logged the message, Master [09:54:22] !log maxsem synchronized php-1.22wmf2/extensions/ZeroRatedMobileAccess/ZeroRatedMobileAccess.body.php 'https://gerrit.wikimedia.org/r/#/c/61926/' [09:54:29] Logged the message, Master [10:21:16] RECOVERY - NTP on snapshot1001 is OK: NTP OK: Offset -0.01322698593 secs [10:28:36] RECOVERY - DPKG on snapshot1001 is OK: All packages OK [10:28:46] RECOVERY - RAID on snapshot1001 is OK: OK: no RAID 
installed [10:29:17] RECOVERY - Disk space on snapshot1001 is OK: DISK OK [10:31:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:32:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [10:35:06] PROBLEM - Puppet freshness on cp1031 is CRITICAL: No successful Puppet run in the last 10 hours [11:03:34] New patchset: ArielGlenn; "second try at getting icinga user its nagios group, thanks hashar" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61970 [11:06:34] New patchset: ArielGlenn; "second try at getting icinga user its nagios group, thanks hashar" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61970 [11:06:50] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [11:07:55] this wasn't fixed yet? [11:08:01] no [11:08:26] I still don't get why we need an icinga user in the first place [11:08:32] the nrpe package just uses a nagios user [11:08:37] but this will do for now I guess [11:08:52] since you're here do you want to look at this before it goes out? [11:11:26] nah, just push it [11:11:35] what could possibly go wrong :) [11:11:37] hahaha just when I added hashar as a reviewer [11:11:53] well I could break puppet on all hosts. 
already did that today on an earlier version of his change :-P [11:12:11] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61970 [11:18:00] PROBLEM - DPKG on snapshot1001 is CRITICAL: Connection refused by host [11:18:10] PROBLEM - RAID on snapshot1001 is CRITICAL: Connection refused by host [11:18:20] PROBLEM - Disk space on snapshot1001 is CRITICAL: Connection refused by host [11:18:57] New patchset: ArielGlenn; "icinga extra groups shouldn't have primary group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61971 [11:19:07] that's me on snapshot1001 for testing [11:19:59] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61971 [11:27:20] RECOVERY - Disk space on snapshot1001 is OK: DISK OK [11:28:00] RECOVERY - DPKG on snapshot1001 is OK: All packages OK [11:32:18] New patchset: ArielGlenn; "just one require in icinga user stanza" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61972 [11:33:05] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61972 [11:34:10] PROBLEM - RAID on snapshot1001 is CRITICAL: Connection refused by host [11:35:06] still me [11:36:20] PROBLEM - Disk space on snapshot1001 is CRITICAL: Connection refused by host [11:37:00] PROBLEM - DPKG on snapshot1001 is CRITICAL: Connection refused by host [11:37:12] New patchset: ArielGlenn; "don't require a group the user type will create for you" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61973 [11:37:53] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61973 [11:40:08] New patchset: ArielGlenn; "someone else will get to figure out where the 'dialout' group comes from" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61974 [11:40:59] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61974 [11:41:30] RECOVERY - 
DPKG on mw98 is OK: All packages OK [11:43:00] RECOVERY - DPKG on snapshot1001 is OK: All packages OK [11:43:20] RECOVERY - Disk space on snapshot1001 is OK: DISK OK [11:43:39] New patchset: Aude; "(bug 47610) Update Wikidata test settings to use $wgWBClientSettings and $wgWBRepoSettings" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61975 [11:43:59] apergos: when did you push new rings? [11:44:03] it's already halfway there [11:44:10] a couple days ago [11:44:19] that can't be [11:44:21] ms-be2? [11:45:15] ~9am yesterday? [11:46:43] New patchset: ArielGlenn; "must... define.. icinga group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61976 [11:46:48] whatever day it was [11:46:53] yesterday? day before? [11:47:03] 2013-05-01T09:18:00+00:00 [11:47:09] no earlier than a couple days anyways [11:47:25] yeah may1, that's right [11:47:48] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61976 [11:48:14] wonder if I got something wrong then [11:48:22] why? [11:48:26] I triple and quadruple checked it :-( [11:48:34] because every other run has taken much longer than 4 days [11:48:40] and we're doing more in this run [11:48:45] 20% of partitions reshuffled [11:49:03] you're not accounting for h310 vs. h710 [11:49:19] I told you you should have bumped it more :-) [11:49:29] I am; there are h310s still in the pile that are either moving or getting data from those reshuffled partitions [11:49:32] I'll try to login on Sat or so to bump it some more [11:49:38] when it's done [11:49:51] if it's not done by tomorrow that is [11:49:59] also remove ms-be11's sdh [11:50:23] I'll be on the road part of tomorrow but I can check on it in the evening [11:50:54] no worries [11:51:02] I'll be home :) [11:51:18] not going out of town? [11:52:11] nah [11:52:55] friends are visiting from France but the rendezvous point is three hours from here so... 
[11:53:10] PROBLEM - RAID on snapshot1001 is CRITICAL: Connection refused by host [11:53:35] fricking finallllly [11:53:42] that was a ginrmous timesink [11:53:54] root@snapshot1001:~# ps axuww | grep nrpe [11:53:54] icinga 13391 0.0 0.0 25476 1164 ? Ss 11:53 0:00 /usr/sbin/nrpe -c /etc/icinga/nrpe.cfg -d [11:54:10] RECOVERY - RAID on snapshot1001 is OK: OK: no RAID installed [11:54:22] three hours? [11:54:25] more [11:54:26] volos? [11:54:29] oh [11:54:40] heh I thought you meant about puppet which was also that long :-P [11:54:43] no, think south [11:55:13] I had an invite for larisa but put it off [11:57:10] break now, forgot to eat yesterday til 11:30 pm, today I must do better [11:58:00] RECOVERY - DPKG on mw2 is OK: All packages OK [12:07:52] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 12:07:48 UTC 2013 [12:08:22] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [12:09:02] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 12:08:55 UTC 2013 [12:09:22] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [12:09:52] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 12:09:49 UTC 2013 [12:10:22] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [12:10:52] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 12:10:44 UTC 2013 [12:11:22] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [12:11:32] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 12:11:27 UTC 2013 [12:12:22] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [12:12:42] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 12:12:32 UTC 2013 [12:13:22] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [12:14:02] PROBLEM - Puppet freshness on db45 
is CRITICAL: No successful Puppet run in the last 10 hours [12:23:58] New review: Hydriz; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61428 [12:27:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:28:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [12:32:50] New patchset: Mark Bergsma; "Migrate amslvs3/4 PyBal BGP peerings from csw2-esams to cr2-knams" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61977 [12:42:16] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61977 [12:45:45] re [12:47:50] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [12:47:50] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [12:48:33] paravoid: you bad, bad guy. [12:48:40] huh? [12:48:58] didn't you get my PM yesterday? [12:49:04] I was off yesterday :) [12:49:07] what's up? 
[12:52:00] apergos: the icinga stuff works for me in labs \O/ [12:56:30] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61941 [12:57:10] thanks for updating that andre__ I forgot to do it yesterday [12:57:59] Thehelpfulone, hmm, don't know which bug you refer to, but you're welcome :P [12:58:08] lol, wikimania2014 wiki [12:58:41] !log Migrated amslvs3 and amslvs4 PyBal BGP peerings from csw2-esams to cr2-knams [12:58:49] Logged the message, Master [13:00:57] !log maxsem synchronized wmf-config/mobile.php 'https://gerrit.wikimedia.org/r/#/c/61941/' [13:01:05] Logged the message, Master [13:02:55] hashar: yeah I tested it on my last host for reinstall [13:03:04] it only took 20 tries :-/ [13:04:15] New review: Andrew Bogott; "A couple of questions, inline" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/61816 [13:05:39] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [13:06:05] New review: Andrew Bogott; "This looks OK... you've tested it, I trust?" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/61975 [13:09:44] New review: Aude; "yes, tried it and using these settings on our dev "test" wikis" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61975 [13:09:47] New patchset: Andrew Bogott; "Remove morebots classes." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/58922 [13:11:21] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61975 [13:11:51] New patchset: Hashar; "beta: Echo uses the local wiki db" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61980 [13:12:35] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61980 [13:40:12] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [14:00:33] New review: Deyan; "It's true that Mojolicious is a bit volatile at the moment and is not the most comfy web framework f..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61767 [14:06:25] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [14:08:25] !log upgrading Jenkins (unscheduled maintenance). [14:08:33] Logged the message, Master [14:08:42] * hashar is shutting down Jenkins for a while. NO ETA [14:09:11] New review: Ottomata; "> I am not sure why there is a hadoop::defaults class" [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/61710 [14:09:26] !log shutting down Zuul [14:09:33] Logged the message, Master [14:11:35] !log Upgraded Jenkins from 1.480.3 to LTS 1.509.1. Restarted it. [14:11:42] Logged the message, Master [14:11:44] PROBLEM - zuul_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/local/bin/zuul-server [14:12:54] PROBLEM - jenkins_service_running on gallium is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [14:16:54] RECOVERY - jenkins_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [14:17:14] New review: Ottomata; "So for deps that don't already exist in debian/ubuntu apt, what should I do?" 
[operations/debs/kafka] (master) - https://gerrit.wikimedia.org/r/53170 [14:19:18] ottomata: which ones? [14:19:27] I don't mind packaging-wise [14:19:34] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours [14:19:41] but if they're GPL, it's a copyright violation for us to distribute jars without the source [14:20:04] (this is another reason why downloading them at build-time is wrong) [14:20:52] paravoid: I don't know i think, i haven't looked into it much, and probably won't have time this week (and I'm going on vaca next week) [14:21:07] i'm just trying to understand what the requirement is [14:21:39] i understand not downloading and installing deps at install time, but I don't quite understand how this is bad to do at build time [14:21:50] doing it at build time gets the deps frozen in the .deb you are building [14:21:54] New review: Faidon; "I don't mind packaging-wise, but if they're licensed under a copyleft license (GPL) it might be a co..." [operations/debs/kafka] (master) - https://gerrit.wikimedia.org/r/53170 [14:22:03] you're fetching binaries [14:22:13] not that it's okay to fetch sources [14:22:24] but fetching binaries and embedding them in the jar is also a copyright violation [14:22:34] (depending on the license) [14:22:48] also, you're fetching unsigned binaries, what makes you think they're not backdoored? [14:24:14] plus, do we really want to e.g. rebuild kafka and get a newer zookeeper or libfoo-java than what we run in production? [14:25:06] my phone just notified me that I am invited into a meeting with you and drdee in 5'? [14:27:13] oh? [14:27:19] hmmm [14:27:27] drdee likes to make sneaky meetings! 
[14:28:14] RECOVERY - RAID on ms-fe1004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [14:28:25] paravoid, i'm for seeing if we can satisfy deps using .debs that already exist [14:28:28] i will try that fo sho [14:28:33] i'm talking just more generally [14:28:38] about what the right thing to do is [14:28:53] in Debian the right thing to do would be to package each jar individually [14:29:01] every dep? [14:29:01] I don't have as high standards for wikimedia [14:29:28] so embedding a few jars would be okay with me, but be careful of copyright violations [14:29:49] yes, every dep [14:30:05] ok, example: let's say that kafka has a dep that it needs from github or whatever [14:30:41] for wmf, you are saying that depending on the license, it is ok for us to include the binary for that dep in this kafka debianization repository [14:30:47] so that we don't have to dl it at build time [14:31:02] it's not great, but I guess it's a reasonable compromise [14:31:20] ok. that's fine, but i'm going to argue against myself for a second [14:31:58] security wise, is that really any different than dling at build time? [14:32:01] feel free to argue with me too, I won't blame you :-) [14:32:14] the difference is that this won't change under your feet [14:32:17] next time you try to rebuild [14:32:30] it's also going to be recorded in git [14:32:36] what if the build-dep was for a pre-packaged .deb? [14:32:42] ? [14:32:49] debian might upgrade the version in apt or something [14:32:55] that would change under your feet for the next person [14:33:04] yes, but then it goes through Debian, with a version and a corresponding source [14:33:15] hm. 
[14:33:26] and signed by the developer, so there's a track record [14:33:40] we don't have the resources to security audit every software we use, so we rely on third parties [14:34:48] I even gave you a plausible attack scenario :) [14:34:54] our address space is well known [14:35:03] several people use labs for building packages [14:35:55] oh totally, i understand the attack scenario, i'm just trying to understand philosophically the difference, i guess i get it. it kinda seems like a line in the sand to me, but it is a line [14:36:16] as I said, embedding the jar is still a compromise [14:36:24] ideally we'd build all of them from source [14:36:29] yeah [14:36:30] and have the source in git [14:36:41] ok, well i haven't even looked into what these kafka deps are yet [14:36:48] that's what I'd do, but if I ask you to you or drdee are going to kill me :-) [14:36:58] haha [14:36:58] yup [14:37:15] i will do the best I can with what is there, I'll come back for more debate later :) [14:37:25] most of the deps existed in Debian [14:37:28] ok cool [14:37:29] if not all [14:38:09] yeah, i'll check it out the week of the 12th when I'm back from vaca, and see if I can satisfy them all [14:38:18] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61892 [14:38:36] have fun :-) [14:38:37] aside from that, we have a few more .debs to do coming up (storm, jzmq,…uhhh maybe something else?) [14:38:43] going anywhere interesting? [14:38:44] you're going to amsterdam, right? :) [14:39:08] 496 git fetch origin [14:39:08] 497 git fetch origin [14:39:08] 498 git diff origin [14:39:08] 499 git merge --ff-only origin [14:39:08] yeah, next week: 2 days biking in virginia, and then 4 days canoeing back down a river [14:39:14] someone's been naughty [14:39:20] and didn't use puppet-merge [14:39:31] hehe [14:41:55] New review: Faidon; "I'm all for PEP8, but splitting the regexp strings into multiple lines of 79 cols is just bad for re..." 
[operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/61889 [14:43:09] andrewbogott: keep them coming :) [14:44:12] paravoid. that's me not using puppet merge. Did I miss a memo? [14:44:41] you did :) [14:45:02] ottomata created "puppet-merge", you just run that, even without cd'ing into ~/puppet [14:45:09] and it fetches, diffs and prompts you to merge [14:46:01] OK, that seems simple enough :) [14:50:50] New review: Andrew Bogott; "I pretty much agree -- the line-length constraint in pep8 is a constant plague, often requiring code..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61889 [14:51:16] New patchset: Ottomata; "Sending bits esams EventLogging traffic to gadolinium for vandium relay." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61985 [14:51:17] New patchset: Ottomata; "Sending varnishncsa traffic to gadolinium instead of oxygen for multicast relay." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61986 [15:00:32] New patchset: Andrew Bogott; "Pep8 cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61889 [15:01:02] New review: Andrew Bogott; "# noqa turns off the line-length check for a given line (and also makes the line even longer.)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61889 [15:04:22] eww :) [15:05:08] we could just ask hashar to turn off line-length checks altogether. [15:05:20] At this point I'm habituated to the 80-char limit but I'm not emotionally attached :) [15:05:27] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [15:05:31] yeah that could be done in the rakefile [15:05:36] need to find out the name of the check [15:05:40] E501 [15:05:44] ah for pepe8 [15:05:50] edit the .pep8 file at the root [15:05:57] ignore = E501 [15:06:00] (comma separated) [15:06:06] that should do it [15:06:16] Yeah… I'm not sure if it's the right call or not. 
It would be nice if we could have a limit but have it be 120 instead of 80 [15:06:17] there is no .pep8 oh my [15:06:27] Avoiding crazy run-on lines seems generally good. [15:06:28] I'm usually okay with 80-chars, but in this case it's just wrong [15:06:31] the standard says 80 :D [15:06:47] OK. Well, let's leave it on for now, and see if we wind up with hundreds of noqas. [15:06:48] so either you ignore the standard or apply it but there is no gray area *grin* [15:07:14] aeah [15:07:23] then we ignore the standard [15:07:23] I think I solved jenkins slow start [15:07:28] The pep8 standard is all about maximizing code readability, and also it is 1986. [15:07:39] i'm sure 80 chars made sense in 1986 [15:07:47] 80 chars is great for me. [15:08:09] 80 char line lengths is what makes people choose 2-space indentation :P [15:08:12] I do all my edits in terminals, and that is nice when reviewing git patches [15:08:13] so, the case here is [15:08:14] match = re.match(r'^/(?P[^/]+)/(?P[^/]+)/((?Ptranscoded|thumb|temp)/)?(?P((temp|archive)/)?[0-9a-f]/(?P[0-9a-f]{2})/.+)$', req.path) [15:08:20] splitting the regexp is just wrong [15:08:23] very wrong [15:08:27] I disagree [15:08:31] the regex itself is wrong [15:08:42] it is too long and hard to understand / debug / edit [15:08:45] not at all [15:08:54] it's very easy and readable [15:09:14] this is basically the purpose of this script mind you [15:09:20] and we reuse those regexps in squid and varnish as well [15:09:36] I don't use regexps at all, but I consider myself regexdyslexic. I assume that everyone else can immediately tell what they do. [15:10:06] re.X for the win! [15:10:34] Anyway, we don't need to argue about this thanks to # noqa! [15:10:47] re.match( r'my super long [15:10:47] regex # explanation [15:10:48] regexbit again # some other explain [15:10:49] ', re.X ) [15:10:59] that disables the pep8 check entirely? 
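hashar's fragmentary `re.X` suggestion above can be made concrete. A minimal sketch in Python: note that the group names used here (`proj`, `lang`, `zone`, `path`, `shard`) are hypothetical, because the regexp pasted into the channel lost its real group names, so this illustrates the `re.X` technique rather than the actual rewrite.py rule:

```python
import re

# A re.X (verbose) version of the long one-line regexp discussed above.
# Group names below are ASSUMED for illustration; the pasted line lost
# the originals. re.X ignores unescaped whitespace and allows comments,
# so each component can be documented in place.
UPLOAD_PATH = re.compile(r"""
    ^/
    (?P<proj>[^/]+) /                         # e.g. wikipedia
    (?P<lang>[^/]+) /                         # e.g. commons
    (?: (?P<zone>transcoded|thumb|temp) / )?  # optional zone prefix
    (?P<path>                                 # container-relative path
        (?: (?:temp|archive) / )?
        [0-9a-f] /                            # first hash character
        (?P<shard>[0-9a-f]{2}) /              # two-character shard
        .+                                    # the object name itself
    ) $
""", re.X)

m = UPLOAD_PATH.match("/wikipedia/commons/thumb/a/ab/Foo.jpg/200px-Foo.jpg")
```

This is what `re.X` buys over splitting a plain string across 79-column lines: the pattern stays one expression, but every alternative carries its own comment.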
[15:11:15] !log restarted Zuul [15:11:20] mark: Appended to a single line, it disables certain checks for that line ony. [15:11:22] only. [15:11:23] Logged the message, Master [15:11:30] yuck [15:11:41] so you add a comment to a long line to have your checker not complain about that line [15:11:44] that's just wrong [15:11:46] 18:04 < paravoid> eww :) [15:11:47] RECOVERY - zuul_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/local/bin/zuul-server [15:11:51] 18:11 < mark> yuck [15:11:54] heh [15:12:11] just disable that stupid check [15:12:18] I don't need a program like that impose rules on me [15:12:37] !log Jenkins made Jenkins to instantly restart (was {{bug|47120}}) ) by deleting the downstream-buildview plugin. [15:12:45] Logged the message, Master [15:13:23] pep8 doesn't seem to attempt to read a .pep8 [15:13:57] .config/pep8 [15:14:02] unless I'm using an old pep8 [15:14:22] Hm, or supposedly .pep8 in the dir with the code, I guess. [15:15:11] Marked in a section with [pep8] <- just realized I am quoting the docs which paravoid is surely also reading right now [15:15:34] yeah, that's a newer pep8 [15:15:38] than what squeeze has :) [15:17:15] and gallium has an old pep8 too :( [15:17:24] 1.3.3 [15:17:31] I will have to backport the one from raring I guess [15:21:08] New patchset: Reedy; "Move all deployment path vars to using /usr/local/apache/common-local" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60434 [15:24:07] New patchset: Faidon; "Swift: pep8 clean rewrite.py" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61889 [15:24:35] andrewbogott: that's as far as I'm willing to go, and pep8 runs cleanly here [15:26:09] * andrewbogott notes that long lines are a pain to read in Gerrit [15:27:54] paravoid: if Jenkins is happy then I'm happy [15:28:12] jenkins doesn't seem to run there [15:29:58] It should eventually… it did for previous patches [15:31:24] Anybody in to check an 
issue with a specific ogv file not updating on Commons? See https://bugzilla.wikimedia.org/show_bug.cgi?id=48004 [15:34:24] andre__: I can confirm that [15:35:15] New patchset: Reedy; "Add a sqldump script wrapper around mysqldump" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43844 [15:35:58] LeslieCarr: ^^ You took a look last time on purging issues... I'm wondering whether CC'ing you by default on purging issues is okay, or better not? [15:37:03] New patchset: Mark Bergsma; "Migrate amslvs1/2 PyBal BGP peerings from csw1-esams to cr1-esams" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61990 [15:39:17] need bigger routers [15:40:06] ? [15:40:17] running out of ports ;) [15:42:44] mark: I heard you like routers, so I put... [15:43:34] mark: why aren't we replacing HTCP with some more reliable transport btw? [15:43:43] like 0mq or similar? [15:44:36] because I like packet loss [15:48:28] ...? :) [15:58:24] New patchset: BBlack; "Work-In-Progress vhtcpd code." 
[operations/software/varnish/vhtcpd] (master) - https://gerrit.wikimedia.org/r/60390 [16:03:19] New patchset: Reedy; "wikimania2014.wikimedia.org config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61991 [16:07:57] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [16:08:08] New patchset: Jgreen; "switch db1013 in for db1025" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61993 [16:08:58] New patchset: Reedy; "wikimania2014.wikimedia.org config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61991 [16:12:17] andre__: don't just cc me by default for example i'll be gone for a few weeks [16:12:24] andre__: email the ops list instead [16:12:49] andre__: or like right now where i'm in a class [16:14:06] Reedy, for 'wmgCentralAuthLoginIcon' => array( [16:14:06] 'wikimania2013wiki' => '/usr/local/apache/common/images/sul/wikimania.png', [16:14:06] seems to be only wikimania wiki where that has been added - do you know why, and do we need it for wikimania2014wiki? [16:14:22] Because it's the current active wiki [16:14:41] There's almost no reason for 99.9% of users to be logged into the old wikimania wikis [16:15:07] When wikimania 2013 is over, that entry should probably be removed [16:15:09] ah, yeah the old wikis are locked - so you mean the logo links to that? [16:15:17] Yeah for autologin [16:15:32] The question is when we add wikimania2014, certainly at this point IMHO it's too early [16:15:55] indeed, after Wikimania 2013 is over I'd imagine + a month or two for people to update submission pages [16:16:16] Not an urgent task, but shouldn't sit around for ever [16:17:16] LeslieCarr, true, totally makes sense. Thanks! [16:17:56] New patchset: Andrew Bogott; "Don't override the logo if it has already been customized." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/61994 [16:21:38] New patchset: Andrew Bogott; "Don't override the logo if it has already been customized." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61994 [16:22:02] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61993 [16:23:31] aww, gerrit-wm didn't even report my review - does it only report +2s? [16:23:47] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 218 seconds [16:24:47] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 22 seconds [16:25:36] dang, Jenkins is ignoring that .pep8 file :( hashar, any suggestions? https://gerrit.wikimedia.org/r/#/c/61889/ [16:28:04] andrewbogott: pep8 is run from the root of the repository [16:28:13] andrewbogott: that is where it looks for the .pep8 file [16:28:23] hashar, yeah, but it should pick up local files [16:28:24] andrewbogott: that also mean the ignore will be applied to all .py :) [16:28:36] At least, I'm pretty sure I've used versions that allow selective rule changes [16:28:40] apparently it does not :/ [16:28:46] i got v1.3.3 on gallium [16:28:57] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection timed out [16:29:07] PROBLEM - Apache HTTP on mw1153 is CRITICAL: Connection timed out [16:29:17] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:17] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:27] andrewbogott: seems to work with pep8 1.4.5 on my laptop [16:29:27] PROBLEM - Apache HTTP on mw1155 is CRITICAL: Connection timed out [16:29:27] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection timed out [16:29:27] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection timed out [16:29:28] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection timed out [16:29:28] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection 
timed out [16:29:58] hashar: Well… ok, I guess I'll go ahead and turn off those warnings everywhere then. It's that or bikeshed for eternity :) [16:30:07] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 1.268 second response time [16:30:21] andrewbogott: or upgrade pep8 on gallium? :D that would need a backport of the package from raring [16:30:45] hashar: Presuming that the new version manages per-directory settings. Let me verify. [16:30:50] test breaking [16:31:03] bah we already did backport it : / [16:31:07] wth [16:31:14] raring as v1.3.3 http://packages.ubuntu.com/search?keywords=pep8 [16:32:02] test breaking [16:32:07] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 3.526 second response time [16:32:10] so it couldn't restart [16:32:11] yay it's working [16:32:21] now let's fix rendering.svc.eqiad.wmnet [16:32:37] it's not rendering, it's swift [16:32:42] ah [16:32:46] goddamn infrastructure loop [16:33:06] it's recovering now, but still looking on what happened [16:34:27] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 66160 bytes in 0.151 second response time [16:34:39] hashar: Confirmed, 1.4.5 seems to observe that local file just fine. [16:34:57] wtf? I can't see our ipv6 routes [16:35:05] andrewbogott: so we need to find out where the debian package is maintained and get it up to 1.4.5 :D [16:35:09] maybe it's my isp [16:35:15] no v6 here sorry :( [16:35:28] hashar: Would also be nice to verify that the version running on integration /doesn't/ work on the cmdline [16:35:51] paravoid: what's your as ? [16:36:02] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.649 second response time [16:36:04] hashar, are you equipped to do that? [16:36:22] andrewbogott: well that is what jenkins does. 
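For reference, the configuration file being discussed would look something like the sketch below: `E501` is the line-length check named earlier, the `ignore` list is comma-separated, and (per the chat) only newer pep8 releases such as 1.4.5 pick up a `.pep8` next to the code or a `.config/pep8`, while the 1.3.3 on gallium reads a single config per invocation:

```ini
; .pep8 -- a sketch of the file discussed above, placed at the
; repository root (or per directory with a sufficiently new pep8).
[pep8]
ignore = E501
```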
[16:36:22] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 8.047 second response time [16:36:33] hashar: nevertheless... [16:37:00] hashar: If you want to do the backport based on faith I won't stand in your way :) [16:37:02] 5408 [16:37:02] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.051 second response time [16:37:05] and yet it's imagescalers [16:37:32] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection timed out [16:38:02] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 3.631 second response time [16:38:22] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 66160 bytes in 0.166 second response time [16:38:34] andrewbogott: too many things to handle right now. Maybe ping the ubuntu people at https://launchpad.net/pep8 They have to allow 1.4.4 in raring. [16:39:12] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.056 second response time [16:39:52] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.066 second response time [16:40:05] andrewbogott: or if you are brave try back porting the v1.4.4 from debian to our apt.wm.o http://packages.debian.org/source/unstable/pep8 might have dependencies issues sthough [16:40:53] it will [16:45:23] New patchset: Krinkle; "contint: Move apache logs to readable place for localhost testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61997 [16:45:38] damn, was hoping they were on ams-ix [16:46:51] i'll email their upstream and see if they'll peer [16:46:56] ? [16:47:08] New patchset: Ottomata; "- Sending varnishncsa traffic to gadolinium instead of oxygen for multicast relay. - Removing locke varnishncsa instance. locke is no longer used." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/61986 [16:47:38] your AS isn't on amsix directly but all the routes to it are via https://www.peeringdb.com/view.php?asn=20965 (that we're seeing at ams) [16:47:41] New review: Krinkle; "Why vhost combined instead of combined?" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61195 [16:47:57] and it's all a few hops away [16:48:26] my AS is where I previously worked [16:48:28] hashar: I'm wrong, 1.4.5 doesn't work right either if you run it from the root dir. [16:48:46] on the network operations center :) [16:49:01] both grnet and geant have open looking glasses [16:49:03] looking [16:50:42] or on cr2-knams you can look at show route aspath-regex ".* 5408 .*" [16:50:53] so many hops! [16:50:57] hm, it's local to them it seems [16:51:01] tell me about it... :) [16:51:06] I'll notify them [16:52:03] !log nikerabbit synchronized php-1.22wmf3/extensions/UniversalLanguageSelector/ 'ULS to master' [16:52:10] Logged the message, Master [16:52:26] New review: Physikerwelt; "According to 1)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61767 [16:53:46] New review: Thehelpfulone; "Testing gerrit-wm" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61991 [16:53:57] ah there we go, something was wrong with it the first time I reviewed [16:57:51] mark: ping! [16:58:05] yes? [16:58:30] I was wondering -- would it be possible to serve device detection along with the GeoIP stuff from varnish in one call? [16:59:05] !log Zuul is somehow having trouble kicking off Jenkins jobs (less than 1 event processed per minute). Jenkins shows that 10/10 executors are idle. Investigating... [16:59:08] with a parameter maybe? [16:59:12] Logged the message, Master [16:59:22] !log Jenkins is nearing 100% CPU on gallium, what is Jenkins doing? [16:59:30] Logged the message, Master [17:00:10] we couldn't handle every call being a device lookup? 
(I'm looking at this in context of centralnotice) [17:00:23] I could set a cookie/localstorage object [17:00:53] we could but why would we want to do it on every call? [17:00:54] hashar: Can we change the Jenkins line to find . -name "*.py" -exec /usr/bin/pep8 {} \; [17:00:55] ? [17:00:57] not every call needs it [17:01:37] also, can't it be done client side? :) [17:02:24] it can be done in JS -- https://gerrit.wikimedia.org/r/#/c/61988/ -- but if it resides in multiple places its just multiple places to update [17:03:00] well I don't like unnecessary processing on the varnish layer [17:03:34] interesting, another person, going via geant to tele2 having some issues [17:03:58] mark: makes sense -- but you aren't totally opposed to having varnish do it so long as I can cache the result? [17:04:19] we can do it but let's make it optional [17:04:37] kk -- I'll investigate some other options [17:04:43] thanks :) [17:05:06] and in general i'd like to keep our http caching layer as close to transparent as possible [17:05:12] andrewbogott: https://github.com/wikimedia/integration-jenkins-job-builder-config/blob/master/operations-puppet.yaml#L63-L72 https://github.com/wikimedia/integration-jenkins-job-builder-config/blob/master/python-jobs.yaml#L1-L12 https://github.com/wikimedia/integration-jenkins-job-builder-config/blob/master/macro.yaml#L224-L239 [17:05:31] the more features we put there, the harder it becomes to manage that, migrate to other solutions, the less efficient, etc ;) [17:05:40] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [17:05:45] mark: maybe after we switch 404 handler to varnish we should do chash for image scalers [17:05:55] Krinkle, you are encouraging me to be bold? [17:06:02] paravoid: what's the point? 
[17:06:33] andrewbogott: don't self-merge, but suit yourself yes :) I don't contribute to any python projects, so I wouldn't be sure how to verify it [17:06:38] to only overload one server instead of all on weird originals [17:06:45] ah, right [17:06:51] overloads should be handled by limits & cgroups, but sometimes these fail [17:06:51] yes [17:06:59] Krinkle: Hm… not obvious to me how we can run a find command and still capture errors... [17:07:03] andrewbogott: What is the problem exactly? [17:07:05] I think that's what happened now, I found a 3M gif on several of those servers [17:07:22] poolcounter? ;) [17:07:24] andrewbogott: Doesn't pep8 skip non-python files? [17:07:39] heh, maybe that too [17:07:49] Krinkle: For each run of pep8 it reads the .pep8 config once and only once. I want it to read one per-directory instead. [17:08:12] Krinkle: So, a recursive 'run this test in each directory' would work fine, if Jenkins has syntax for that already [17:08:37] andrewbogott: it's all bash, so everything is possible [17:08:39] anyway [17:08:40] ttyl [17:08:50] andrewbogott: however, you'd need to aggregate the errors in one valid package [17:09:07] one pep8-report package that is [17:09:21] andrewbogott: seems like a feature request for pep8 [17:09:44] andrewbogott: I know jshint supports it. It recurses the directory and for each file it uses the closest config file [17:09:53] Krinkle, maybe we should be using git-changed-in-head anyway [17:10:21] andrewbogott: If you can make pep8 do what you want in a single pep8 command we can do that [17:10:23] otherwise not [17:10:36] if you call it once for each it will generate separate reports [17:10:59] which are probably a bitch to interpret and aggregate validly [17:11:03] Does git-changed-in-head run once per file, or run once with a list of files? [17:11:17] it returns a list of file names [17:11:41] Oh… not so useful. [17:11:45] OK.
[17:12:12] andrewbogott: does pep8 find a config file for each if you pass it a variadic list of arguments? [17:12:25] Dunno. I'll check. [17:12:26] or only once per overall invocation? [17:13:03] nope [17:13:05] Only once per [17:13:06] andrewbogott: anyway, I'd recommend filing a request upstream. They'll either implement it, or give a way to do it already I guess. [17:13:24] Well, we can just apply the rule exceptions universally... [17:13:41] Or fix the warnings ;-) [17:13:49] We might wind up doing that piece by piece anyway, since people (mark, paravoid) seem to hate these rules anyway… and I'm not in love with them either. [17:14:23] If we adopt a different style then it should be consistent, so yeah, doing it universally would be justified [17:14:27] Krinkle, in this case there are good reasons to ignore the warnings… the code would have to be obfuscated to work around it. [17:14:40] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 218 seconds [17:14:46] sure [17:15:06] so adopt it as a coding style (instead of a local exception) in general [17:15:40] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 14 seconds [17:22:31] New patchset: Andrew Bogott; "Turn off pep8 rules about line width and operator spacing." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61999 [17:23:38] New patchset: Andrew Bogott; "Swift: pep8 clean rewrite.py" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61889 [17:24:09] New patchset: Aaron Schulz; "Removed job queue migration config." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62000 [17:25:14] * andrewbogott -> lunch [17:31:44] New review: Demon; "Not sure I agree with ignoring operator spacing." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61999 [17:32:56] New patchset: Ottomata; "Adding Ram on analytics nodes."
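The per-file "closest config" lookup Krinkle describes for jshint (and which pep8 lacked at the time) amounts to walking up the directory tree from each file until a config is found. A minimal Python sketch, assuming the `.pep8` config filename mentioned above; `closest_config` is a hypothetical helper, not part of pep8 itself:

```python
import os

def closest_config(start_dir, name=".pep8"):
    """Walk upward from start_dir until a config file called `name`
    is found; return its path, or None once the filesystem root is
    reached. Mirrors jshint's per-file closest-config lookup that the
    discussion above wants pep8 to grow."""
    d = os.path.abspath(start_dir)
    while True:
        candidate = os.path.join(d, name)
        if os.path.isfile(candidate):
            return candidate
        parent = os.path.dirname(d)
        if parent == d:  # reached the root without finding one
            return None
        d = parent
```

A pep8 wrapper built on this would group files by whichever config each resolves to and run pep8 once per group, which keeps the reports aggregatable in a single invocation, as Krinkle asks for.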
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/62001 [17:33:32] New review: Ottomata; "This should not be merged until May 6." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/62001 [17:36:56] !log Jenkins keeps clogging up. Starting an emergency restart. [17:37:03] Logged the message, Master [17:38:28] Krinkle: Know the cause? [17:39:02] marktraceur: immediate cause is Jenkins having 100% CPU while it appears to be completely idle. [17:39:19] Hrm. [17:39:19] so whatever it is the thing that caused it is no longer active [17:39:28] Krinkle: I think hashar was seeing that problem before [17:39:29] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [17:39:29] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [17:39:29] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [17:39:29] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [17:39:56] All executors are idle and yet it is 98% CPU, as a result Zuul has almost no response time from Jenkins to queue new jobs [17:40:02] it is progressing but much too slow [17:40:24] Queue has multiplied over the last hour from 10 to 50 [17:40:33] 72 events now [17:41:00] marktraceur: Expect false positives in Gerrit (job "LOST") [17:41:02] That's fine [17:41:40] Since zuul unexpectedly lost connection with Jenkins (it doesn't know to detect a restart and pick up later) [17:46:30] New patchset: Reedy; "Making $wgAllowUserJs, $wgAllowUserCss, $wgSecureLogin configurable per wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62002 [17:46:52] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62002 [17:47:27] !log reedy synchronized wmf-config/ [17:47:35] Logged the message, Master [17:47:49] New patchset: Reedy; "wikimania2014.wikimedia.org 
config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61991 [17:48:32] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61991 [17:49:08] New review: Andrew Bogott; "As usual, I don't care so much what the standard is as that there /be/ a standard. I think Faidon w..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61999 [17:52:34] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61994 [17:53:25] New review: Demon; "The main problem wasn't about returning it to the users or not, but rather the logging (we got a *to..." [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/60860 [17:53:45] !log reedy synchronized wmf-config/InitialiseSettings.php [17:53:53] Logged the message, Master [17:55:55] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: [17:56:03] Logged the message, Master [17:59:20] !log Jenkins restart complete. No visible improvement. Jenkins is still idling most of the time while Zuul is still halted by an unknown factor on spawning jobs. [17:59:28] Logged the message, Master [17:59:41] Krinkle: that makes me sad ;_; [18:00:31] New review: Demon; "This is for MediaWiki (& extensions), but we tend to discourage vertical alignment: https://www.medi..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61999 [18:02:28] thanks Reedy, can you +crat me? [18:03:08] Where do you want +crat, Thehelpfulone? [18:03:22] wm14, I assume [18:03:36] Krenair, wikimania2014 wiki, to set up the wiki pages like I did last year and I'm helping the bid team [18:04:22] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [18:04:37] Thehelpfulone: Reedy: and you already merged it:) [18:04:56] need anything else? 
the docroot is there [18:06:03] mutante, there's that iegcom wiki too [18:08:07] ack [18:11:36] New patchset: Ottomata; "Changing Erik Bernhardson's ssh key." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62006 [18:11:47] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62006 [18:12:35] Thehelpfulone: Look in your preferences, can you confirm whether your email address is marked as confirmed or not? [18:12:53] yep [18:12:54] Same for anyone else that might be watching [18:12:56] from 2007 [18:13:01] Right, good [18:19:12] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [18:21:46] Can someone set up the search indexes for wikimania2014wiki please? https://wikitech.wikimedia.org/wiki/Lucene#Adding_new_wikis [18:22:57] New review: Ottomata; "I just double checked, udp2log will create its the file when it is SIGHUPed." [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/61964 [18:24:20] binasher: gdash seems down [18:24:33] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61985 [18:27:28] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61986 [18:27:56] New patchset: Ottomata; "Removing bits locke varnish logging instance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62008 [18:28:17] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62008 [18:29:45] !log Installed Monitoring plugin from Jenkins control panel [18:29:52] Logged the message, Master [18:31:28] Aaron|home: mwscript seems to be eating arguments [18:31:29] reedy@fenari:/home/wikipedia/common/php-1.22wmf3$ mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=wikimania2014wiki translate [18:31:29] This script is not configured to create tables for [18:34:54] New patchset: Reedy; "Defining BINDIR before you use it is 
helpful" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62010 [18:35:09] dinner [18:35:18] Krinkle: do you have a reverse proxy in front of jenkins? [18:35:38] New patchset: Ottomata; "Just in case escapes are different with single vs double quotes, I'm leaving this as it was before." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62011 [18:35:52] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62011 [18:36:06] New patchset: Vogone; "(bug 48013) Creating a flood user group for wikidatawiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62012 [18:36:52] Krinkle|detached: (whenever you get back) i'm wondering if you could check the timestamps for when zuul hit the URI job/mwext-MobileFrontend-lint/buildWithParameters, to see how many times and if there were failures, and what the timestamps were [18:37:28] New patchset: Ottomata; "Keeping double quotes just in case single quote escaping is different" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62013 [18:37:38] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62013 [18:37:40] Krinkle|detached: This job https://integration.wikimedia.org/ci/job/mwext-MobileFrontend-lint/2818/console started at 18:09:47, but zuul started trying to launch it at 18:08:50 [18:40:44] !log varnishncsa now sends traffic to gadolinium instead of oxygen for multicast relay [18:40:52] Logged the message, Master [18:41:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [18:43:40] New patchset: Vogone; "(bug 48013) Creating a flood user group for wikidatawiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62012 [18:44:08] Reedy, interwiki links are also broken too, 
not sure if that's related to the search index [18:44:17] IW cache will need rebuilding [18:44:37] New review: Hashar; "(1 comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/61999 [18:44:40] hey dudes, what's the easiest way to stop and remove a varnishncsa instance from all varnishes? :) [18:44:43] dsh somehow? [18:45:18] notpeter, any thoughts?^ [18:46:03] oh i've done this before, i created a dsh group for varnish-all [18:46:03] hmm [18:46:48] !log reedy synchronized wmf-config/interwiki.cdb 'Updating interwiki cache' [18:46:51] Logged the message, Master [18:47:59] Anyone know why updateinterwikicache is missing from /usr/local/bin on fenari? [18:48:14] it's in deployment.pp.. [18:49:30] And physically there at files/misc/scripts/updateinterwikicache [18:51:43] !log removed varnishcsa-locke instance from varnish hosts: (dsh -c -g varnishncsa-all 'test -f /etc/init.d/varnishncsa-locke && service varnishncsa-locke stop && update-rc.d -f varnishncsa-locke remove && rm -v /etc/init.d/varnishncsa-locke') [18:51:50] Logged the message, Master [18:52:34] jeblair: Yes, Jenkins is running behind Apache [18:52:49] jeblair: Through the /ci path on port 80 [18:53:04] PROBLEM - Varnish traffic logger on cp3004 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:14] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:14] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:14] PROBLEM - Varnish traffic logger on cp1022 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:14] PROBLEM - Varnish traffic logger on cp1032 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:14] PROBLEM - Varnish traffic logger on cp3010 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:14] PROBLEM - Varnish traffic logger on 
cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:14] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:24] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:24] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:24] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:24] PROBLEM - Varnish traffic logger on cp1031 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:24] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:24] PROBLEM - Varnish traffic logger on cp1034 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:24] PROBLEM - Varnish traffic logger on cp1026 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:34] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:34] PROBLEM - Varnish traffic logger on cp1029 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:34] PROBLEM - Varnish traffic logger on cp1036 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:34] PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:34] PROBLEM - Varnish traffic logger on cp3009 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:34] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:35] jeblair: I don't have access to access logs on that machine, I'll have to ask someone else [18:53:48] ottomata: hehe, might want to fix the check as well :) 
[18:53:54] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:54:04] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:54:21] Anyone in ops with sudo rights on gallium available to do some apache access log checks for me? [18:54:48] I'd like to know hits to job/mwext-MobileFrontend-lint/buildWithParameters for integration.wikimedia.org in the past few hours. [18:55:00] (timestamps and full url) [18:55:33] I am going to downgrade jenkins I guess [18:55:47] daoohh [18:55:47] i am p***ed off [18:55:51] on it [18:55:53] LeslieCarr: thanks [18:56:09] ottomata: was 'on it' to me? [18:56:26] hashar: can you check access logs? [18:56:31] no [18:56:36] you have sudo there [18:56:38] I am going to get Jenkins downgraded [18:56:41] k [18:56:41] LeslieCarr: nrpe_command => "/usr/lib/nagios/plugins/check_procs -w 3:3 -c 3:6 -C varnishncsa" [18:56:48] not sure what to change that to [18:56:50] looking... [18:56:55] but there should be only 2 varnishncsa procs [18:56:58] it simply does not work and I get the root cause [18:57:10] so I guess we are going to have yet another stupid 1 hour downtime [18:57:11] OR [18:57:11] hashar: It does work, https://integration.wikimedia.org/zuul/ [18:57:12] i think 2:2 [18:57:20] 2:2 -c ? [18:57:47] Krinkle: the problem seems to be that now on EACH Jenkins API call, the stupid Jenkins backend attempts to reparse the full history [18:57:58] ah that's critical [18:58:00] i see ok cool [18:58:01] i get it [18:58:27] hashar: https://integration.wikimedia.org/ci/monitoring [18:58:35] hashar: There are over 10,000 missing class warnings every few minutes [18:58:43] WARNING: Failed to resolve class [18:58:49] over 200,000 in the last hour [18:58:53] could be unrelated though [18:58:59] http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT [18:59:03] Krinkle: is that melody thing a new thing?
Never heard of that before [18:59:13] hashar: i installed the plugin an hour ago [18:59:14] Krinkle: I guess it is recent given the graph history :) [18:59:16] see ops log [18:59:21] New patchset: Ottomata; "Fixing varnishncsa process check now that there are fewer varnishncsa processes." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62017 [18:59:26] have you upgraded any plugin? [18:59:30] no [18:59:43] I installed it *after* things escalated [18:59:44] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 183 seconds [18:59:57] ah you restarted it [19:00:06] that too [19:00:19] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62017 [19:00:20] though that was one hour ago [19:01:30] Krinkle: so here is my long theory. On start up jenkins used to parse all the build.xml files. With the new version that is Lazy loaded so the startup is really fast [19:01:44] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [19:01:47] Krinkle: but as soon as something requires information from a job build history, that kicks a parse of all the build.xml files. [19:01:50] hashar: startup wasn't really fast when I restarted it earlier but ok [19:01:59] ottomata: sorry, didn't see that. but looks like it worked! [19:02:05] ja think so! [19:02:05] cool [19:02:11] notpeter, i'm about to try to deploy changes to squids [19:02:11] hashar: I'm aware of it being improved in this version [19:02:12] Krinkle: and the Jenkins API does kick the build.xml parsing. [19:02:14] heheheh [19:02:19] think I can do it!? [19:02:20] eh!? [19:02:24] hashar: and it doesn't cache it like it used to [19:02:33] ottomata: definitely! [19:02:34] (that's your theory, right?) [19:02:38] ehy! [19:02:44] there are uncommitted frontend.conf.php changes in here [19:02:44] Krinkle: in theory (and hopefully) once the history of a job has been loaded up, it will be kept in a cache.
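For reference, the `-w 3:3 -c 3:6` and `2:2` values being tweaked in the check_procs discussion above follow the Nagios plugin threshold grammar linked there. A small illustrative sketch of how such a range is evaluated (this is not the actual check_procs source, just the range semantics from the developer guidelines):

```python
def violates(value, spec):
    """Return True if `value` breaches a Nagios threshold range, like
    the '3:3' / '3:6' arguments to check_procs discussed above.
    Grammar (per the nagios-plugins developer guidelines):
      '10'     alert if value < 0 or value > 10
      '10:'    alert if value < 10
      '~:10'   alert if value > 10
      '10:20'  alert if value < 10 or value > 20
      '@10:20' alert if 10 <= value <= 20 (inverted range)
    """
    inverted = spec.startswith("@")
    if inverted:
        spec = spec[1:]
    if ":" in spec:
        lo_s, hi_s = spec.split(":", 1)
    else:
        lo_s, hi_s = "", spec  # a bare '10' means the range 0..10
    lo = float("-inf") if lo_s == "~" else (float(lo_s) if lo_s else 0.0)
    hi = float("inf") if hi_s == "" else float(hi_s)
    out_of_range = value < lo or value > hi
    return (not out_of_range) if inverted else out_of_range
```

Under this grammar the fix in the chat makes sense: with exactly 2 varnishncsa processes expected, `-w 3:3` flags 2 as a warning, while `2:2` accepts it.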
[19:02:45] hmmmmm [19:02:57] +acl badbadip src 54.244.96.173 [19:03:09] hashar: so what's your plan? [19:03:16] Krinkle: I did kick that constantly during the afternoon. [19:03:16] RECOVERY - Varnish traffic logger on cp1022 is OK: PROCS OK: 2 processes with command name varnishncsa [19:03:19] Krinkle: my plan is to get rid of jenkins :-] [19:03:31] hashar: short term [19:03:47] In the long run we'll be in heaven [19:04:02] laughing at humanity [19:04:04] Krinkle: right now confirm that the build history is kept in cache and that a second query will not kick a second parse of the build history [19:04:07] but for the short term :) [19:04:23] hashar: okay, so we're going to let it run for now? [19:04:26] Krinkle: if that does not hold, I will get Jenkins downgraded, restart it (bam 1 hour outage) and complain at upstream [19:04:29] Krinkle: yup [19:04:35] queue is exponentially rising [19:04:40] up from 50 to 109 [19:05:00] 104 events in queue and 0 jobs in jenkins [19:05:00] because Zuul is waiting for information from Jenkins [19:05:03] what is it doing? [19:05:10] and Jenkins is busy parsing the thousands of build.xml files I guess [19:05:12] I know (I read the logs too) [19:05:16] it seems to be losing jobs [19:05:21] notpeter are you around atm? you available for me to come running if I break things? [19:05:28] right now it is busy parsing mwext-VisualEditor-merge history [19:05:30] oh metrics meeting just ended [19:05:45] hashar: I see various entries where it is blocked for a minute on "Launching jobs" [19:05:58] and then to see that the job was lost [19:06:26] ottomata: sure [19:06:35] hashar: https://gist.github.com/Krinkle/5504150#file-zuul-log-L99-L104 [19:07:01] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [19:07:13] Krinkle: can you add the monitoring link https://integration.wikimedia.org/ci/monitoring on the portal ? :D [19:07:33] hashar: Doesn't jenkins link to it from the tools menu?
[19:07:34] k danke [19:07:40] There's more plugins with a sub page [19:07:51] PROBLEM - Host db1025 is DOWN: PING CRITICAL - Packet loss = 100% [19:08:30] !log kaldari synchronized php-1.22wmf3/extensions/Echo 'sync Echo ext' [19:08:37] notpeter: running ./deploy frontend [19:08:38] Logged the message, Master [19:08:38] ... [19:08:51] ottomata: kk [19:09:10] site's not down, so you're probably fine :) [19:09:17] looking good so far! [19:09:24] oh, it's really clear when it's not ok [19:09:57] ssh: connect to host sq33.wikimedia.org port 22: Connection timed out [19:10:48] I think it's dead/down/decom [19:11:03] !log kaldari synchronized php-1.22wmf3/extensions/Echo 'sync Echo ext' [19:11:10] Logged the message, Master [19:11:21] Krinkle: and about the queue, that is merely because there are too many patch sets sent on mediawiki/core [19:11:32] Krinkle: they all are locked by the parser tests that take a looong time to run. [19:11:38] hashar: No [19:11:47] hashar: When they are locked they are queued in Jenkins normally [19:11:50] this is not happening [19:11:58] they aren't even processed into Jenkins. [19:12:06] mk [19:12:10] yeah that's the only weirdness [19:12:11] looking good! [19:12:17] It can't be blocked because Zuul hasn't even figured out what jobs will be spawned for those events [19:12:53] this is because Jenkins can do concurrency (configurable), Zuul has no reason to wait [19:12:54] New patchset: Reedy; "Initial config for login.wikimedia.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62020 [19:13:34] !log deployed squid frontend.conf.php changes to remove locke and send logs directly to gadolinium for multicast relay [19:13:43] Logged the message, Master [19:13:51] !log kaldari synchronized php-1.22wmf2/extensions/Echo 'sync Echo ext for en.wiki' [19:13:59] Logged the message, Master [19:14:37] New review: Krinkle; "This looks like an outdated template. 404.html seems rather.. old. Is that still in use? https://en...."
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62020 [19:15:22] New patchset: Reedy; "Add initial apache config for login.wikimedia.org" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/62021 [19:16:10] !log moving db1025 into frack-fundraising1-c-eqiad [19:16:18] Logged the message, Master [19:16:23] New review: Krinkle; "I don't see a symlinked "robots.txt" in other docroots." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62020 [19:16:33] New review: Reedy; "I guess that means skel-1.5 is also out of date and should be fixed first" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62020 [19:17:30] New review: Reedy; "Look harder?" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62020 [19:18:53] New patchset: Reedy; "Add initial apache config for login.wikimedia.org" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/62021 [19:18:56] New review: Krinkle; "I see robots.txt in transitionteam, but not in wikivoyage.org or wikipedia.org. Those use a rewrite ..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62020 [19:19:38] New review: Krinkle; "(1 comment)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/62021 [19:20:03] New patchset: Reedy; "Initial config for login.wikimedia.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62020 [19:20:05] RobH: ping [19:20:14] binasher: sup? [19:20:50] RobH: can you attempt to get professor back up? [19:21:03] sure, lemme take a gander at it now [19:21:27] it died last night and tim did something, but it's down again. 
my sun ilom foo has been forgotten [19:22:12] hrmm, serial console is unresponsive (ilom is working os just seems completely crashed) [19:22:18] !log rebooting professor [19:22:26] Logged the message, RobH [19:22:34] which makes sense as ssh and ping arent working [19:22:48] !log all webrequest udp2log loggers (squid and varnish) now send to gadolinium for socat unicast -> multicast relay [19:22:56] Logged the message, Master [19:22:59] binasher: uh oh. [19:23:14] reset sys failed... hrmm [19:23:25] ...due to power state being off [19:23:30] someone shut this down before i got to it? [19:23:31] PROBLEM - Varnish traffic logger on dysprosium is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:23:46] RobH: tim attempted to do something with it last night but not sure what [19:23:50] !log professor was already powered down (why?) starting it back up now [19:24:00] Logged the message, RobH [19:24:16] im babysitting its boot process now [19:25:45] New patchset: Ottomata; "Removing now unused manual socat relay confs." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/62022 [19:26:12] New patchset: Reedy; "Initial config for login.wikimedia.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62020 [19:26:14] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62022 [19:27:31] RECOVERY - Host professor is UP: PING OK - Packet loss = 0%, RTA = 27.05 ms [19:27:41] RECOVERY - RAID on professor is OK: OK: 1 logical device(s) checked [19:28:19] New patchset: Reedy; "Initial config for login.wikimedia.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62020 [19:28:20] New review: Anomie; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62020 [19:29:52] jeff_green: db1025 has been moved [19:30:11] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 2 processes with command name varnishncsa [19:30:17] cmjohnson1: cool. what port did it end up on? [19:30:31] RECOVERY - Varnish traffic logger on dysprosium is OK: PROCS OK: 2 processes with command name varnishncsa [19:31:26] jeff_green: 11/0/6 [19:31:27] pfw2 [19:31:36] RobH: thanks! any sign of what happened to it before it was powered off? [19:31:52] cmjohnson1: cool, thank you [19:32:04] yw [19:32:41] binasher: it looks like it went offline at 05:35 GMT [19:32:43] !log re-enabling puppet on cp1031, it was administratively disabled. running puppet there. [19:32:51] Logged the message, Master [19:32:56] May 2 05:35:47 professor kernel: Kernel logging (proc) stopped. [19:33:14] it ran a puppet run ten minutes before, and then that and nothing [19:34:28] New patchset: Reedy; "Initial config for login.wikimedia.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62020 [19:34:36] it looks like it was rebooted at 518 [19:35:06] binasher: no clue really, still glancing around but im not sure. 
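The gadolinium work being !logged above is a unicast-to-multicast UDP relay: squid and varnish udp2log senders emit unicast datagrams to one host, which re-emits them onto a multicast group (done in production with socat). A rough Python sketch of the same datagram-forwarding idea; the addresses, ports, and group here are made up, since the real ones are not in the log:

```python
import socket

def open_relay(listen_addr, ttl=8):
    """Sockets for one hop of the unicast -> multicast relay pattern
    from the !log entries above (udp2log senders -> gadolinium ->
    multicast group). rx receives unicast datagrams; tx re-emits them.
    The TTL option only matters when the destination is a multicast
    group (e.g. a 239.x.x.x address, chosen here as an assumption)."""
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(listen_addr)
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    tx.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return rx, tx

def relay_once(rx, tx, dest_addr, bufsize=65535):
    """Forward a single datagram from rx to dest_addr unchanged;
    returns the number of bytes forwarded."""
    data, _ = rx.recvfrom(bufsize)
    return tx.sendto(data, dest_addr)
```

A production loop would simply call `relay_once(rx, tx, ("239.128.0.112", 8420))` forever (group and port hypothetical); the payload is passed through byte-for-byte, which is why the downstream udp2log consumers need no changes.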
[19:35:22] i can see it booted ok when it was restarted then [19:35:26] but then nothing shortly after [19:37:38] Thehelpfulone: Someone should just fix this global userpage thing already [19:37:45] heh [19:38:44] 29/39 edits are userspace.. [19:39:37] that lot are usually the stewards/SWMTers [19:39:51] !log rebooting oxygen [19:39:55] that handle cross-wiki vandalism etc [19:39:58] Logged the message, Master [19:40:14] Reedy, do I need to create an RT ticket to get the interwiki cache rebuilt? [19:40:39] No [19:40:41] I did it already [19:40:45] hmm, [[wm2014:]] doesn't work yet [19:40:54] [19:46:43] !log reedy synchronized wmf-config/interwiki.cdb 'Updating interwiki cache' [19:41:03] That probably needs adding to the IW map on meta first [19:41:12] PROBLEM - Host oxygen is DOWN: CRITICAL - Host Unreachable (208.80.154.15) [19:41:17] oh sorry, yeah I thought it was one of the automatic ones [19:41:18] https://meta.wikimedia.org/wiki/Interwiki_map [19:41:25] Wm2012 //wikimania2012.wikimedia.org/wiki/$1 [19:41:26] Wm2013 //wikimania2013.wikimedia.org/wiki/$1 [19:41:29] I suspect not ;) [19:41:38] RECOVERY - Host oxygen is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [19:41:54] yep, added, can you rebuild it again please? [19:46:36] !log reedy synchronized wmf-config/interwiki.cdb 'Updating interwiki cache' [19:46:44] Logged the message, Master [19:47:09] New review: Andrew Bogott; "E222: multiple space after operator doesn't need to be disabled in order align args; one or the othe..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/61999 [19:50:56] !log authdns-update to move db1025 to frack.eqiad.wmnet [19:51:04] Logged the message, Master [19:52:03] Parsoid update ahead, please ignore related alerts in the next minutes [19:57:46] Parsoid update is done [20:07:39] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [20:08:19] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 20:08:09 UTC 2013 [20:08:39] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [20:10:19] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 20:10:10 UTC 2013 [20:10:31] New patchset: Brion VIBBER; "Update FirefoxOS Wikipedia app to current master" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62030 [20:10:39] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [20:10:57] New patchset: Reedy; "multiversion: hostname to dbname basic tests" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61426 [20:11:11] New patchset: Jeremyb; "Adding Ram on analytics nodes." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62001 [20:11:30] New review: Jeremyb; "carry forward -1" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/62001 [20:12:09] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 20:12:07 UTC 2013 [20:12:10] Anybody mind deploying https://gerrit.wikimedia.org/r/#/c/62030/ ? 
Updates for FirefoxOS Wikipedia app, won't affect anything else [20:12:39] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [20:13:59] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62030 [20:13:59] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 20:13:56 UTC 2013 [20:14:05] Scary ;) [20:14:20] \o/ [20:14:35] do those get copied out automatically or does it need a push? [20:14:39] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [20:15:21] !log reedy synchronized docroot/bits/WikipediaMobileFirefoxOS/ [20:15:28] Logged the message, Master [20:15:39] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 20:15:37 UTC 2013 [20:15:40] brion: brion ^ [20:16:10] New review: preilly; ":-)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62030 [20:16:39] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [20:17:04] whee thanks Reedy :D [20:17:19] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 20:17:16 UTC 2013 [20:17:39] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [20:18:07] * brion digs out actual phone to confirm update works [20:18:49] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 20:18:45 UTC 2013 [20:19:39] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [20:20:19] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 20:20:09 UTC 2013 [20:20:31] oh, why did i say bri on twice? 
whoops [20:20:48] icinga-wm: quiet [20:20:50] !log reedy synchronized php-1.22wmf3/includes/ [20:20:59] Logged the message, Master [20:22:27] New patchset: Dzahn; "(re?)-add misc::deployment::common_scripts to fenari" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62032 [20:23:39] !change 62032 | Reedy [20:23:39] Reedy: https://gerrit.wikimedia.org/r/#q,62032,n,z [20:26:49] RECOVERY - Puppet freshness on cp1031 is OK: puppet ran at Thu May 2 20:26:43 UTC 2013 
[20:37:11] New patchset: Krinkle; "contint: Split up apache logs by vhost" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61195 [20:37:56] hashar: opinion on https://gerrit.wikimedia.org/r/#/c/61720/ ? [20:38:12] New review: Hashar; "Ah nice alignments... Sorry i was not paying attention :-] So just add the error description as co..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61999 [20:38:45] Krinkle: still somewhere in my review queue :-] [20:39:06] Krinkle: sorry :( [20:39:51] New review: Dzahn; "could you do something similar for the puppet lint check and make it ignore just the "there are tabs..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61999 [20:40:28] Krinkle: will look at it next week I guess [20:42:27] New patchset: Ottomata; "Setting up stat1002 for hosting private webrequest access logs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62035 [20:43:20] paravoid, are you at all convinced by this? https://www.mediawiki.org/wiki/Manual:Coding_conventions#Vertical_alignment [20:43:33] hm, oops, sleeping [20:43:52] New patchset: Ottomata; "Setting up stat1002 for hosting private webrequest access logs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62035 [20:44:02] andrewbogott: i am not convinced by that at all [20:44:13] ottomata, so you're pro-alignment? [20:44:17] yup [20:44:18] by space [20:44:19] not tab [20:44:32] puppet-lint is pro alignment by space too :) [20:44:45] (if you are talking about puppet) [20:44:53] (but i'm in general pro-alignment too) [20:44:53] python... [20:44:56] same [20:45:00] New review: Hashar; "For puppet-lint, you can have a look at /rakefile it has an array of disabled_checks. Note that the..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/61999 [20:46:02] New review: Hashar; "Go ahead :-]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61964 [20:46:09] ottomata: you can get that udp2log merged [20:46:43] k [20:46:48] ottomata: maybe want someone to check that it ran properly whenever it is rotating :-]  Maybe Ariel [20:46:55] New patchset: Ottomata; "Setting up stat1002 for hosting private webrequest access logs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62035 [20:47:03] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61964 [20:47:07] but yeah that would work since udp2log creates the files when they do not exist [20:47:12] thx for the double check [20:47:33] merged :) [20:48:06] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62035 [20:48:54] New patchset: Andrew Bogott; "Turn off pep8 rules about line width and operator spacing." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61999 [20:49:33] New review: Hashar; "awesome :-)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/61999 [20:49:43] andrewbogott: good to me :D [20:49:44] New patchset: Ottomata; "Missing comma fix" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62068 [20:50:03] and I am no off [20:50:09] now [20:50:21] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62068 [20:51:52] New patchset: Ottomata; "admins::globaldev, not accounts::globaldev :p" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62080 [20:52:50] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62080 [20:53:26] so, LeslieCarr is at class, who else should I ping re a possible network/dns issue right now? [20:53:33] (and faidon isn't online) [20:54:04] what's the dns issue? 
[20:54:11] I can't help with network, but can with dns [20:54:30] Ryan_Lane: over in -tech, LeslieCarr's there now [20:54:36] all network [20:54:49] it's all people going via geant to cogent to tele2 [20:54:55] i really want to just peer with geant and fix the problem [20:55:00] since this is like 3 people today [20:55:10] heh [20:55:12] hashar: 220+ events. Still rising.... [20:55:21] Krinkle: they are raw events [20:55:29] hashar: I know [20:55:34] Krinkle: and l10n bot is active at that time of the day [20:55:40] Hm.. [20:55:46] ok [20:56:11] I am still not sure why Zuul does not trigger more tests [20:56:40] If there are 226 events, surely the Zuul queue screen should be full of all sorts of stuff [20:56:45] I would expect it to fill the Jenkins execution slots as fast as possible [20:56:54] RoanKattouw: No, quite the opposite, but I know what you mean [20:57:02] RoanKattouw: If the screen is full, there are 0 events pending. [20:57:16] Krinkle: Surely not [20:57:24] right now there are 229 events pending and no job running ;-D [20:57:25] If the screen is full, it's working hard [20:57:34] RoanKattouw: the numbers up there are raw events not yet processed into the jenkins queue [20:57:42] There might be even more jobs queued, but at least it should be working as hard as it can [20:57:44] Oh I see [20:57:49] I see what you're saying [20:58:00] they could be events not needing jenkins jobs (e.g. regular comments), they could be duplicates etc. It's like a job queue [20:58:32] but yes, it isn't working as hard as it should [20:58:39] Jenkins is idling most executor slots [20:58:59] this is a new phenomenon as of today. Possibly related to the Jenkins upgrade breaking something [20:59:31] New phenomenon as of today?! I thought we'd been seeing this behavior for more than a week now [20:59:33] it usually overloads Jenkins 10/10 slots with buffer. 
Now it is barely using more than 1/10 executors [21:12:16] New patchset: Ottomata; "More stuff to set up stat1002 as private data host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62084 [21:13:33] !log adding GEANT via fiberring to avoid-paths [21:13:42] Logged the message, Mistress of the network gear. [21:14:22] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [21:15:01] !log spage synchronized php-1.22wmf2/extensions/ConfirmEdit 'update 1.22wmf2 to wmf3 version of ConfirmEdit' [21:15:07] Logged the message, Master [21:17:07] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62084 [21:23:01] New patchset: Ottomata; "Adding ro NFS on dataset to from stat1002" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62086 [21:23:09] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62086 [21:24:18] hashar, Krinkle, is "raw events not yet processed into the jenkins queue" related to jenkins-bot taking 12 minutes or more to notice a +2 and start a gate-and-submit? e.g. https://gerrit.wikimedia.org/r/#/c/62033/ [21:25:05] spagewmf: yup :/ [21:25:37] spagewmf: The entire ci process seems hit by a slug plague. Everything is slow today. [21:27:45] !log mflaschen synchronized php-1.22wmf3/skins/common/shared.css 'Sync font-size change for edit section links' [21:27:52] Logged the message, Master [21:28:50] New patchset: Ottomata; "Fixing exports" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62087 [21:29:05] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62087 [21:29:50] hashar, Krinkle: can you paste the zuul debug log between 21:25 and present? [21:30:08] sue [21:30:09] sure [21:30:35] like the last 5 minutes? [21:30:42] hashar: yes [21:31:11] just saw your message about waiting for a response... where do you see that? 
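[Editor's note: the exchange above distinguishes Zuul's "raw events" counter from jobs actually queued in Jenkins: raw Gerrit events must first be triaged, and many of them (plain comments, duplicates) never become jobs. The sketch below is an illustrative model of that two-stage pipeline only; the event fields and filter rules are assumptions, not Zuul's actual code.]

```python
from collections import deque

# Raw Gerrit events arrive faster than they are triaged; only some of
# them should become Jenkins jobs (comment events and duplicates do not).
def triage(events):
    """Filter raw events down to the (change, patchset) jobs worth enqueueing."""
    seen = set()
    jobs = []
    for ev in events:
        if ev["type"] != "patchset-created":   # e.g. plain comments need no job
            continue
        key = (ev["change"], ev["patchset"])
        if key in seen:                        # duplicate event for same patchset
            continue
        seen.add(key)
        jobs.append(key)
    return jobs

raw = deque([
    {"type": "patchset-created", "change": 62030, "patchset": 1},
    {"type": "comment-added",    "change": 62030, "patchset": 1},
    {"type": "patchset-created", "change": 62030, "patchset": 1},
    {"type": "patchset-created", "change": 61195, "patchset": 7},
])

jobs = triage(raw)
print(len(raw), "raw events ->", len(jobs), "jobs")   # 4 raw events -> 2 jobs
```

This is why a high raw-event count does not by itself mean the executors should be full: the pending counter is upstream of the filter.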
[21:33:16] jeblair: http://noc.wikimedia.org/~hashar/zuul_20130502.log [21:33:33] jeblair: it has the Gerrit received data though. not very helpful [21:34:38] !log restarting search indexers on searchidx2, searchidx1001 to make sure the indexer knows about new wiki [21:34:45] Logged the message, Master [21:34:47] the scheduler is waiting for a build to complete before processing some more [21:35:32] hashar: ah yeah, i see in your layout.yaml it's a couple layers deep [21:35:42] jeblair: and the waiting time comes from jenkins thread dump https://integration.wikimedia.org/ci/threadDump . Some thread ' GET /ci/job/jobname' will show a stack trace that reads some files on disk, that is most of the time the build.xml [21:36:35] jeblair: ah the debug log is not going to be very helpful. Zuul was waiting to reload. [21:38:03] !log importing wikimania2014wiki into search indexers [21:38:10] Logged the message, Master [21:41:28] !log mflaschen synchronized php-1.22wmf2/extensions/GuidedTour/ 'Sync GuidedTour to 1.22wmf2 for E3 deployment' [21:41:35] Logged the message, Master [21:43:03] !log mflaschen synchronized php-1.22wmf3/extensions/GuidedTour/ 'Sync GuidedTour to 1.22wmf3 for E3 deployment' [21:43:10] Logged the message, Master [21:47:58] New patchset: Hashar; "contint: Split up apache logs by vhost" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61195 [21:48:44] E3 is done deploying [21:52:11] varnish question - "opera_mini" acl is declared in wikimedia.vcl.erb, but there is a second check in mobile-frontend.inc.vcl.erb [21:52:36] would it be safe to remove the second check (against ACL), and just see if XFF header is set? [21:58:06] LeslieCarr, do you know this by any chance? Not sure whom to bug [21:58:54] LeslieCarr, https://wikitech.wikimedia.org/wiki/How_to_deploy_code says `ssh -A fenari`, but AIUI that will get us killed. 
[21:59:51] New patchset: Asher; "building db1059 - to be the new s4 master which switching to mariadb, upgrading db1020" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62093 [22:04:06] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [22:04:16] !log depooling search1015 in pybal [22:04:23] Logged the message, Master [22:08:41] yurik: what do you mean? [22:09:48] binasher, i am trying to figure out if mobile-frontend file's logic always follows the wikimedia.vcl.erb [22:10:04] yes, it does [22:10:50] binasher, thx, wasn't sure about what file includes what [22:11:17] see the last line in wikimedia.vcl.erb [22:12:39] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62093 [22:12:56] !log Deploying a workaround on Zuul to make it stop querying the Jenkins API when it just want to check whether a job exist. {{gerrit|62095}} [22:13:04] Logged the message, Master [22:14:29] binasher, but where does that "vcl" is getting set? [22:14:46] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours [22:14:53] by the caller [22:15:10] !log restarting Zuul [22:15:18] Logged the message, Master [22:22:08] New patchset: Diederik; "Add s6 and s7 to user-metrics api." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62099 [22:22:24] !log Zuul restarted. The bug about slowness is {{bug|48025}} [22:22:32] Logged the message, Master [22:23:23] binasher, another quick question - xff_sources -- do i need to test (in ruby) for that var inside mobile-frontend, or can i assume that "allow_xff" is always set? 
[22:23:51] wikimedia.vcl.erb has inclusion check: <% if has_variable?("xff_sources") and xff_sources.length > 0 -%> [22:24:18] before doing any XFF manipulations [22:27:14] i'm not sure what you mean [22:28:27] the allow_xff acl will always be populated in production [22:29:10] !log repooling search1015 [22:29:17] Logged the message, Master [22:29:30] binasher, it just that in wikimedia.vcl.erb, every use of the "allow_xff" acl is wrapped in a template check [22:29:39] New patchset: Asher; "pulling db1020 for upgrade" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62102 [22:29:42] Reedy> Can someone set up the search indexes for wikimania2014wiki please? https://wikitech.wikimedia.org/wiki/Lucene#Adding_new_wikis [22:29:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:29:50] ah, it wasn't lucene's fault for once [22:30:17] binasher, causing me to assume that allow_xff might not exist in some cases [22:30:21] yurik: indeed. all i can tell you is that it's always set for mobile varnishes in production. [22:30:32] Nemo_bis: that's what i'm doing [22:30:34] binasher, cool, thx [22:30:36] it's such a relief having meaningful error messages, I thank Ram and Chad every day [22:30:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [22:30:37] it may not be applicable / probably isn't in other envs [22:30:40] mutante: wonderful :) [22:31:24] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62102 [22:31:26] binasher, so i can safely assume that mobile-frontend vcl file is only used for production env, which is good enough for me :) [22:31:47] !log depooling search1016, restarting lucene, etc.. (Search#Adding_new_wikis) [22:31:55] Logged the message, Master [22:31:57] yurik: it will probably be used in beta too [22:32:17] binasher, does beta define "allow_xff" acl? 
[22:32:23] no idea [22:32:43] * binasher ignores beta as much as possible ;) [22:32:46] sigh... i guess if it breaks, ppl will complain :) [22:34:17] !log asher synchronized wmf-config/db-eqiad.php 'pulling db1020 for upgrade' [22:34:24] Logged the message, Master [22:34:56] PROBLEM - search indices - check lucene status page on search1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:36:30] New review: Dzahn; "manual verify while zuul is being worked on" [operations/puppet] (production); V: 2 - https://gerrit.wikimedia.org/r/61195 [22:36:31] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61195 [22:38:44] PROBLEM - mysqld processes on db1020 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [22:39:16] New patchset: Yurik; "Allow XFF spoofing from the trusted IPs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62103 [22:39:37] mutante: thanks :-) [22:39:44] PROBLEM - MySQL disk space on db1059 is CRITICAL: NRPE: Command check_disk_6_3 not defined [22:39:44] PROBLEM - Full LVS Snapshot on db1059 is CRITICAL: NRPE: Command check_lvs not defined [22:39:54] PROBLEM - MySQL Idle Transactions on db1059 is CRITICAL: NRPE: Command check_mysql_idle_transactions not defined [22:39:54] PROBLEM - mysqld processes on db1059 is CRITICAL: NRPE: Command check_mysqld not defined [22:40:04] PROBLEM - MySQL Recent Restart on db1059 is CRITICAL: NRPE: Command check_mysql_recent_restart not defined [22:40:10] !log gallium, run puppet, graceful Apache to deploy split log files [22:40:14] PROBLEM - MySQL Replication Heartbeat on db1059 is CRITICAL: NRPE: Command check_mysql_slave_heartbeat not defined [22:40:18] Logged the message, Master [22:40:21] New patchset: Pyoungmeister; "adding db74 to pmtpa s5 until" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62104 [22:40:24] PROBLEM - MySQL Slave Delay on db1059 is CRITICAL: NRPE: Command check_mysql_slave_delay not defined [22:40:34] 
PROBLEM - MySQL Slave Running on db1059 is CRITICAL: NRPE: Command check_mysql_slave_running not defined [22:40:41] yurik: is that needed to allow zero access via ssl? [22:41:05] binasher, no, this is to allow us to automate zero testing [22:41:38] we have had tons of zero issues because there are not tests [22:41:52] with this, we can spoof ips, pretending to be carriers [22:41:54] PROBLEM - DPKG on db1020 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:42:20] debug yourself is always the answer ^^ [22:43:13] Nemo_bis, this is what TimStarling suggested [22:43:23] besides, we always test everything in production ;) [22:43:33] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62099 [22:44:23] yurik asked me how we could set up a labs instance with thousands of IP addresses for testing Zero [22:44:26] New patchset: Pyoungmeister; "adding db74 to pmtpa s5 until" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62104 [22:44:40] would have been fun too :) [22:44:41] I said don't do that, just fake the IP [22:45:17] i was wondering about that, but if you fake it you dont see the reply [22:45:18] i really don't think we want to modify client.ip every time a request comes in via ssl [22:45:47] mutante: fake it using a special HTTP header [22:45:52] not by spoofing it or something [22:46:00] ah [22:46:35] that lets you test all of the WMF-specific code [22:46:50] it just doesn't cover the linux kernel and networking infrastructure and what not [22:47:54] RECOVERY - DPKG on db1020 is OK: All packages OK [22:48:34] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [22:48:34] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [22:48:44] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62104 [22:49:54] PROBLEM - Host db1020 is DOWN: PING CRITICAL - Packet loss = 100% 
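[Editor's note: TimStarling's suggestion above is to test source-IP-dependent code by faking the client IP with an HTTP header rather than spoofing at the network layer. The sketch below illustrates that idea with a throwaway local echo server; the server, the trusted-header assumption, and the carrier IP 203.0.113.42 are all hypothetical stand-ins for the real varnish/Zero setup.]

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Tiny stand-in for the frontend: it reports which client IP it would
# act on, trusting X-Forwarded-For the way a trusted-proxy check would.
class EchoXFF(BaseHTTPRequestHandler):
    def do_GET(self):
        ip = self.headers.get("X-Forwarded-For", self.client_address[0])
        body = ip.encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep test output quiet
        pass

server = HTTPServer(("127.0.0.1", 0), EchoXFF)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Pretend to be a request from a (hypothetical) carrier IP.
req = urllib.request.Request(
    "http://127.0.0.1:%d/" % server.server_port,
    headers={"X-Forwarded-For": "203.0.113.42"},
)
seen_ip = urllib.request.urlopen(req).read().decode()
server.shutdown()
print(seen_ip)   # 203.0.113.42
```

This exercises all of the application-level IP handling without touching the kernel or network infrastructure, which matches the trade-off Tim describes.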
[22:50:44] RECOVERY - search indices - check lucene status page on search1016 is OK: HTTP OK: HTTP/1.1 200 OK - 53055 bytes in 0.063 second response time [22:51:37] ori-l: so yeah hmm thanks for the offer to investigate the jenkins issue :d [22:51:52] the patch didn't work? [22:51:54] ori-l: I got a good workaround, seems some thread is stalled so I will just restart it again :-D [22:52:04] the patch is a workaround to avoid hitting a very slow query in jenkins [22:52:12] that triggers a reparse of all the build history [22:52:24] now it still takes it 8 seconds to update a job description [22:52:44] I hate java :-D [22:52:46] still meaning it took 8 seconds before, or it's much faster but still not fast enough? [22:52:57] oh sorry [22:52:58] hmm [22:53:14] so the previous URL took 2 to 5 minutes depending on the job history length [22:53:34] now it takes 8 seconds to update the build description, something which is done several times per build [22:53:43] I think it is a thread which is wild [22:55:19] at least I learned a few commands today: jstack to dump a stack trace of each thread run by a java process [22:55:23] and H in top to show threads :-D [22:58:26] !log repooling search1016 [22:58:34] Logged the message, Master [22:58:45] !log restarting jenkins… it got a few threads blocked and the main process is at 100% usage for no reason [23:05:22] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [23:07:32] PROBLEM - mysqld processes on db74 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [23:07:54] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62000 [23:10:42] !log aaron synchronized wmf-config/jobqueue-eqiad.php [23:10:50] Logged the message, Master [23:13:28] New patchset: Dzahn; "add favicons for doc.wm and integration.wm" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62108 [23:17:16] New review: Hashar; "the integration 
websites are maintained outside of puppet in integration/docroot.git :-)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/62108 [23:17:44] New patchset: Ryan Lane; "Remove duplicate definition issue with labs ganglia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62109 [23:17:51] !log yeah after hours and hours of fighting, Jenkins is finally working again. [23:17:59] Logged the message, Master [23:18:34] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62109 [23:19:54] !log restarting lucene on search1021, search1022, search1017, search1018 (with some waiting in between) [23:20:02] Logged the message, Master [23:20:24] hashar: thanks!! jenkins fix..wee [23:22:34] New patchset: Ryan Lane; "Ganglia: Only define a 443 vhost if a cert is set" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62111 [23:23:27] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62111 [23:25:19] mutante: jenkins still has to flush its queue :( [23:32:33] New patchset: Ryan Lane; "Add custom init script for multiple aggregators for labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62113 [23:33:33] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62010 [23:33:33] New review: Faidon; "Do we have to have a single .pep8 for the whole repository? 
" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61999 [23:34:08] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62113 [23:34:12] PROBLEM - search indices - check lucene status page on search1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 60120 bytes in 0.024 second response time [23:34:25] back [23:34:36] welcome back [23:34:37] LeslieCarr: I chatted with grnet network folks, they've notified GEANT already [23:39:02] New review: Dzahn; "i like it, but also see bug 48020" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/61244 [23:39:02] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61244 [23:41:08] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [23:41:30] Thehelpfulone: wanna try search on wikimania2014? [23:41:52] arg, already got an error [23:46:19] TimStarling, could you +1 https://gerrit.wikimedia.org/r/#/c/62103/ pls so that your comments are not lost in IRC [23:46:38] or anyone could just +2 it :) [23:47:28] PROBLEM - SSH on gadolinium is CRITICAL: Server answer: [23:47:38] PROBLEM - SSH on caesium is CRITICAL: Server answer: [23:48:28] RECOVERY - SSH on gadolinium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:50:28] LeslieCarr: oh it seems to be okay now [23:50:38] RECOVERY - SSH on caesium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:55:27] New review: Dzahn; "please fix path conflict" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61997 [23:56:53] New review: Tim Starling; "I think this is a sensible way to test source IP dependent code, but maybe I'm biased, since I sugge..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/62103
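[Editor's note: earlier in the log hashar debugs the stalled Jenkins with `jstack` (one stack trace per Java thread) and `H` in top (per-thread CPU). For illustration only, Python has a rough analogue of a jstack-style dump via `sys._current_frames()`; the "stalled-worker" thread below is a contrived stand-in for a blocked Jenkins thread.]

```python
import sys
import threading
import time
import traceback

# A worker that blocks forever on an event, like a stalled Jenkins thread.
stop = threading.Event()
worker = threading.Thread(target=stop.wait, name="stalled-worker")
worker.start()
time.sleep(0.1)  # give the worker time to reach its wait()

# Rough Python analogue of `jstack <pid>`: one stack trace per live thread,
# which is how you spot the thread that is wedged.
names = {t.ident: t.name for t in threading.enumerate()}
dump = []
for ident, frame in sys._current_frames().items():
    header = "Thread %s (%s)" % (names.get(ident, "?"), ident)
    dump.append(header + "\n" + "".join(traceback.format_stack(frame)))

stop.set()
worker.join()
print("\n".join(dump))  # the stalled-worker's stack ends inside Event.wait
```

The same diagnosis workflow applies: take the dump, find the thread whose stack never changes between two dumps, and that is your wedged thread.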