[00:00:20] (AaronSchulz, thanks) [00:03:00] > var_dump( $redis->sRandMember( 'testset' ) ); [00:03:02] string(1) "e" [00:03:03] Segmentation fault (core dumped) [00:03:05] aaron@aaron-HP-HDX18-Notebook-PC:/var/www/DevWiki/core$ [00:03:06] TimStarling: ;) [00:03:56] it works fine until I switch from using nothing to using php unserialize to unserialize [00:04:02] then it works maybe the first time and segfaults [00:04:36] maybe it's just sRandMember [00:05:32] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [00:05:57] TimStarling: maybe we can just deploy 61927 and investigate later [00:06:11] sure [00:06:20] I'm in a meeting now [00:07:52] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 00:07:51 UTC 2013 [00:08:15] hmm, I should go home [00:08:32] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [00:09:02] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 00:08:57 UTC 2013 [00:09:32] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [00:10:02] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 00:09:59 UTC 2013 [00:10:33] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [00:11:02] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 00:10:57 UTC 2013 [00:11:32] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [00:11:52] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 00:11:47 UTC 2013 [00:12:32] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [00:12:42] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 00:12:32 UTC 2013 [00:13:32] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [00:13:42] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 
00:13:38 UTC 2013 [00:14:32] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [00:15:39] !log @stafford:~# puppetstoredconfigclean.rb db10.pmtpa.wmnet [00:15:46] Logged the message, Master [00:28:46] hey - did anyone get the urgent ticket ? [00:30:46] Leslie, I can do it if there's a HOWTO somewhere. I have all the bits, not the know-how [00:31:14] LeslieCarr: Works for me. [00:31:53] do we have any problems with traffic right now? [00:33:44] hm, I guess we don't. For some reason blog.wm.org's not been responding for me for a while. [00:35:02] PROBLEM - Puppet freshness on cp1031 is CRITICAL: No successful Puppet run in the last 10 hours [00:39:07] New patchset: awjrichards; "Override CentralAuth cookie domains for commons/meta to work with mobile" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61941 [00:47:05] paravoid: ping [01:03:15] New patchset: Aaron Schulz; "Keep the GettingStarting redis objects using no serialization." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61927 [01:05:38] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [01:09:26] is it possible to run tests against production from private IPs that are part of our network? I need a way to simulate different IPs, one per ZERO provider, hitting prod servers [01:10:00] a test would verify that any request coming from an IP 1 gets mapped to provider 1, 2 => 2, etc [01:10:18] and it would verify that all responses are correct for that provider [01:11:45] ideally I wouldn't want to have that many machines, so if one machine can take up 500+ private ips and issue calls from them, that should solve the testing needs [01:15:40] New review: Mattflaschen; "I think we should do this in the extension proper: https://gerrit.wikimedia.org/r/#/c/61943/ . I al..." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61927 [01:17:05] !log mwalker synchronized php-1.22wmf2/extensions/CentralNotice/modules/ext.centralNotice.bannerController/bannerController.js 'Poking bits to try and get the new banner controller deployed for CentralNotice' [01:17:13] Logged the message, Master [01:20:01] !log mwalker synchronized php-1.22wmf3/extensions/CentralNotice/modules/ext.centralNotice.bannerController/bannerController.js 'Poking bits to try and get the new banner controller deployed for CentralNotice' [01:20:09] Logged the message, Master [01:21:44] mutante: *waves* [01:31:48] hallo? [01:46:20] Hi. [01:57:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:58:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [02:06:22] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [02:13:52] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours [02:18:12] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 190 seconds [02:20:12] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 5 seconds [02:20:41] !log kaldari Started syncing Wikimedia installation... 
: [02:20:45] !log on wikibugs-l: disabled bounce processing and re-enabled mail delivery to wikibugs-irc (was disabled due to excessive bounces) [02:20:49] Logged the message, Master [02:20:57] Logged the message, Master [02:21:30] !log LocalisationUpdate completed (1.22wmf3) at Thu May 2 02:21:30 UTC 2013 [02:21:38] Logged the message, Master [02:32:10] !log LocalisationUpdate completed (1.22wmf2) at Thu May 2 02:32:10 UTC 2013 [02:32:18] Logged the message, Master [02:43:19] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 185 seconds [02:45:18] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 9 seconds [02:45:31] !log kaldari Finished syncing Wikimedia installation... : [02:45:38] Logged the message, Master [02:46:58] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [02:46:58] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [02:48:18] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 189 seconds [02:50:18] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 10 seconds [02:50:39] PROBLEM - Host upload-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [02:50:48] PROBLEM - Host wiktionary-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [02:50:50] PROBLEM - Host wikisource-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [02:50:53] PROBLEM - Host wikiversity-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [02:51:18] RECOVERY - Host wikiversity-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 85.04 ms [02:51:20] RECOVERY - Host wikisource-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 86.37 ms [02:51:29] RECOVERY - Host upload-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 88.08 ms [02:51:38] RECOVERY - Host 
wiktionary-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 83.72 ms [02:58:19] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 189 seconds [03:00:18] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 10 seconds [03:05:58] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [03:08:23] New review: Ori.livneh; "> Paths seem to be working properly now. This still breaks the apache restart, though, so the wiki d..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61816 [03:19:18] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 224 seconds [03:20:18] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 10 seconds [03:32:48] TimStarling: I can do the submodule bump and sync [03:33:05] I'm already doing it [03:33:47] Thanks. I didn't want to give it a meaningless +2, and I hadn't yet followed through Aaron's explanation. [03:34:59] Aaron reproduced it, he didn't isolate it [03:35:55] I'm going to try it on test.wikipedia.org first [03:38:26] OK. You might not see entries under all three task types, but that's normal. We're not diligent about making sure all three are prepopulated for testwiki. [03:39:36] !log tstarling synchronized php-1.22wmf3/extensions/GettingStarted [03:39:45] Logged the message, Master [03:39:55] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [03:40:01] looks OK [03:41:11] nothing suspicious in the logs on fluorine [03:41:20] nothing recent, at least [03:41:58] New patchset: Tim Starling; "Remove IRC link from error message" [operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/61950 [03:46:32] !log tstarling synchronized php-1.22wmf2/extensions/GettingStarted [03:46:39] Logged the message, Master [03:46:57] looks OK too [03:47:36] why only three pages per task type? 
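The segfault at the top of this log was in the phpredis extension's serializer path under sRandMember; the fix deployed here (Gerrit change 61927, later superseded) kept the GettingStarted keys unserialized and moved (de)serialization into application code. A minimal Python sketch of that pattern, with a plain dict standing in for the Redis server and JSON as an illustrative encoding (neither is the actual GettingStarted code; the crash was in the C extension, not in Redis itself):

```python
# Sketch of the workaround: instead of letting the Redis client library
# serialize values (the code path that was segfaulting in the phpredis
# C extension), serialize and unserialize in application code and store
# plain strings. A dict stands in for the Redis server; JSON is an
# illustrative encoding only.
import json
import random

store = {}  # key -> set of raw strings, standing in for Redis

def sadd(key, value):
    # application-level serialization before storing
    store.setdefault(key, set()).add(json.dumps(value, sort_keys=True))

def srandmember(key):
    # fetch a raw string, then unserialize at the application level
    raw = random.choice(list(store[key]))
    return json.loads(raw)

sadd("testset", {"title": "Example", "task": "copyedit"})
print(srandmember("testset"))
```

With a single member in the set, srandmember deterministically returns the stored value; the point is only that the client library never sees anything but plain strings.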
[03:47:45] it doesn't seem like enough [03:50:49] I'm not sure. I'm satisfied that Steven et al are studying it carefully and mostly just implement what they tell me to. There's a different interface that we're trying out in https://gerrit.wikimedia.org/r/#/c/59575/ [03:51:14] New review: Andrew Bogott; "There's really nothing to show -- the problem is that Apache didn't pick up the change, so the http:..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61816 [03:52:20] Whenever I express opinion about the interface of anything I end up eating sand for it, so meh. [03:52:51] I don't put too much stock in my intuitions anyway [03:53:43] New patchset: Tim Starling; "Use a password for job queue and session redis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61734 [03:53:48] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61734 [03:54:00] New review: MZMcBride; "Related to bug 16043." [operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/61950 [03:55:12] It's E3, get it? [03:56:43] !log tstarling synchronized wmf-config/jobqueue-eqiad.php [03:56:50] Logged the message, Master [03:56:57] https://bugzilla.wikimedia.org/show_bug.cgi?id=20079 [03:58:15] New review: Ori.livneh; "OK. Let's restore the previous behavior for now. I'll update the patch." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61816 [03:59:00] !log tstarling synchronized wmf-config/CommonSettings.php [03:59:07] Logged the message, Master [04:03:14] New patchset: Ori.livneh; "Improvements to mediawiki_singlenode" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61816 [04:03:24] !log tstarling synchronized wmf-config/CommonSettings.php [04:03:31] Logged the message, Master [04:05:08] New review: Ori.livneh; "Note though that this will restart Apache every single run. The optimal way to use Puppet is to defi..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/61816 [04:06:46] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [04:07:56] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 04:07:51 UTC 2013 [04:08:46] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [04:09:06] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 04:08:57 UTC 2013 [04:09:18] New patchset: Tim Starling; "Respect GettingStarted default options" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61952 [04:09:46] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [04:09:53] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61952 [04:10:06] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 04:09:59 UTC 2013 [04:10:19] Change abandoned: Tim Starling; "Superseded" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61927 [04:10:46] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [04:11:06] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 04:10:56 UTC 2013 [04:11:46] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [04:11:56] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 04:11:46 UTC 2013 [04:12:46] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [04:13:06] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 04:13:03 UTC 2013 [04:13:46] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [04:13:52] New patchset: Tim Starling; "Require a password for Redis" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61740 [04:15:15] Change merged: Tim Starling; [operations/puppet] 
(production) - https://gerrit.wikimedia.org/r/61740 [04:18:16] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 190 seconds [04:18:56] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours [04:20:16] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 9 seconds [04:34:16] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 218 seconds [04:35:19] PROBLEM - search indices - check lucene status page on search18 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 55856 bytes in 0.113 second response time [04:35:20] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 10 seconds [04:39:49] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu May 2 04:39:48 UTC 2013 [04:39:56] Logged the message, Master [04:44:05] !log on mc1-16 and mc1001-1016, setting requirepass and masterauth to the new password in soft state [04:44:13] Logged the message, Master [04:44:19] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 190 seconds [04:45:19] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 10 seconds [04:46:28] !log on rdb1001-1002, set requirepass and masterauth in soft state [04:46:36] Logged the message, Master [04:54:19] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 183 seconds [04:55:19] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 10 seconds [05:02:48] New patchset: Tim Starling; "Added a couple of missing passwords" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61956 [05:03:21] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61956 [05:05:09] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [05:07:45] !log tstarling synchronized wmf-config/jobqueue-eqiad.php [05:07:54] Logged the message, Master [05:11:18] !log tstarling 
synchronized wmf-config/jobqueue-pmtpa.php [05:11:32] Logged the message, Master [05:14:27] !log professor is down, no response on serial console, rebooting [05:14:35] Logged the message, Master [05:19:20] RECOVERY - Host professor is UP: PING OK - Packet loss = 0%, RTA = 27.29 ms [05:19:29] RECOVERY - carbon-cache.py on professor is OK: PROCS OK: 1 process with args carbon-cache.py [05:27:27] !log on professor: manually started carbon-cache.py [05:27:29] PROBLEM - RAID on snapshot1003 is CRITICAL: Timeout while attempting connection [05:27:35] Logged the message, Master [05:27:39] !log snapshot1003 powercycle, upgrading to precise [05:27:46] Logged the message, Master [05:29:09] PROBLEM - Host snapshot1003 is DOWN: PING CRITICAL - Packet loss = 100% [05:29:29] PROBLEM - carbon-cache.py on professor is CRITICAL: PROCS CRITICAL: 2 processes with args carbon-cache.py [05:33:19] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 189 seconds [05:34:20] RECOVERY - Host snapshot1003 is UP: PING OK - Packet loss = 0%, RTA = 1.79 ms [05:35:19] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 10 seconds [05:36:20] PROBLEM - SSH on snapshot1003 is CRITICAL: Connection refused [05:36:29] PROBLEM - Disk space on snapshot1003 is CRITICAL: Connection refused by host [05:36:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:36:49] PROBLEM - DPKG on snapshot1003 is CRITICAL: Connection refused by host [05:37:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [05:37:49] PROBLEM - Host professor is DOWN: PING CRITICAL - Packet loss = 100% [05:43:19] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 190 seconds [05:44:20] RECOVERY - SSH on snapshot1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [05:44:39] New patchset: Tim Starling; "Short client timeout for
graphite event logs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61958 [05:45:19] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 10 seconds [05:45:41] !log powercycle snapshot1004, upgrade to precise [05:45:48] Logged the message, Master [05:46:09] PROBLEM - RAID on snapshot1004 is CRITICAL: Timeout while attempting connection [05:46:39] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61958 [05:48:29] PROBLEM - NTP on snapshot1003 is CRITICAL: NTP CRITICAL: No response from NTP server [05:50:09] PROBLEM - SSH on snapshot1004 is CRITICAL: Connection refused [05:50:09] PROBLEM - Disk space on snapshot1004 is CRITICAL: Connection refused by host [05:50:20] PROBLEM - DPKG on snapshot1004 is CRITICAL: Connection refused by host [06:02:29] PROBLEM - NTP on snapshot1004 is CRITICAL: NTP CRITICAL: No response from NTP server [06:03:09] RECOVERY - SSH on snapshot1004 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [06:03:52] !log powercycle snapshot1001, upgrade to precise [06:04:01] Logged the message, Master [06:05:27] PROBLEM - Host snapshot1001 is DOWN: PING CRITICAL - Packet loss = 100% [06:05:56] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [06:15:56] RECOVERY - Host snapshot1001 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [06:18:06] PROBLEM - DPKG on snapshot1001 is CRITICAL: Connection refused by host [06:18:16] PROBLEM - Disk space on snapshot1001 is CRITICAL: Connection refused by host [06:18:16] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 190 seconds [06:18:16] PROBLEM - SSH on snapshot1001 is CRITICAL: Connection refused [06:18:26] PROBLEM - RAID on snapshot1001 is CRITICAL: Connection refused by host [06:19:16] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 18 seconds [06:28:16] RECOVERY - SSH on snapshot1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 
(protocol 2.0) [06:30:16] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 06:30:07 UTC 2013 [06:30:56] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [06:30:56] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 06:30:52 UTC 2013 [06:31:56] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [06:32:06] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 06:32:00 UTC 2013 [06:32:56] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [06:33:20] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 186 seconds [06:35:16] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 9 seconds [06:38:14] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 189 seconds [06:40:14] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 8 seconds [06:42:44] PROBLEM - NTP on snapshot1001 is CRITICAL: NTP CRITICAL: No response from NTP server [06:44:41] TimStarling: i was pretty diligent about testing those two patches to filter logmsgbot connections, btw, if you feel like merging them [07:02:01] New patchset: ArielGlenn; "on precise use mysql client 5.5 for snapshot hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61959 [07:03:33] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61959 [07:06:07] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [07:08:26] RECOVERY - NTP on snapshot1003 is OK: NTP OK: Offset -0.0105394125 secs [07:19:16] RECOVERY - Disk space on snapshot1003 is OK: DISK OK [07:19:36] RECOVERY - RAID on snapshot1003 is OK: OK: no RAID installed [07:19:46] RECOVERY - DPKG on snapshot1003 is OK: All packages OK [07:33:13] New review: Hashar; "Can't you make cidr to be an array ? 
Maybe keeping the string form whenever one only want to pass on..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/61920 [07:34:22] hello [07:38:46] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [07:38:46] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [07:38:46] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [07:38:46] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [07:46:09] New review: Hashar; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61425 [07:46:19] New patchset: Hashar; "multiversion: ability to destroy singleton" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61425 [07:52:03] New review: Hashar; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61428 [07:55:06] RECOVERY - NTP on snapshot1004 is OK: NTP OK: Offset -0.01800429821 secs [08:05:08] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [08:05:24] morning hashar [08:06:59] i spent the last day fighting with the most infuriating rubygems problem and i think i finally figured it out [08:07:48] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 08:07:42 UTC 2013 [08:07:50] ori-l: rewrote the script to python ? [08:07:54] err [08:07:58] ported the script to python? [08:08:08] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [08:08:13] i wish [08:08:28] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 08:08:22 UTC 2013 [08:08:38] 'bundle install' for the qa/browsertests repo would work whenever i ran it but not through puppet [08:09:06] and i kept digging in the wrong direction -- is it the fact that i'm not root? that i'm running in a login shell? 
some environment variable? a dotfile? [08:09:08] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [08:09:59] turns out 'bundle install' just eats up a lot of memory and so does puppet [08:10:26] and the compilation of the native extension is just aborted when it runs out of memory without an error message indicating what happened [08:10:35] ah that is very useful :D [08:10:52] ideally the gems should be packaged [08:11:16] yes, some but not all are available in apt [08:11:37] I had the same issue with the Zuul gateway, I had to package a couple python modules [08:12:19] in general i don't like the ruby attitude to packaging which is a little, oh, "après moi, le déluge" [08:12:38] I love the quote [08:12:38] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 08:12:29 UTC 2013 [08:12:51] zeljkof might come to the rescue [08:12:56] he knows ruby [08:13:08] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [08:13:09] also maybe one of the gems is wrong and has a huge mem leak [08:13:17] it's ffi, specifically [08:13:20] i think i have a workaround [08:13:23] hashar: what is the problem? [08:13:33] 'bundle install' is a monster, but 'gem install ffi' followed by 'bundle install' works [08:13:38] zeljkof: ori-l having some problems installing the gems for qa/browsertests [08:13:52] ori-l: what is the problem? [08:13:53] zeljkof: gems bundle install dies with an out-of-memory error [08:14:01] zeljkof: see scrollback [08:14:29] i think i can work around it, but hashar's point about packaging is important [08:15:18] i looked at the gemfile and i suspect some of the version choices (esp. 
when they're greater than what has been packaged for debian) were not principled but just based on whatever was newest at the time [08:16:28] i've been reading advice from ruby people online tonight and it's distressingly pretty consistent: just ditch debian packages entirely and move to rvm + gemsets [08:16:38] ori-l: yes, we usually use the latest versions of everything [08:17:14] ori-l: I am not sure what is the best way to go [08:17:19] i don't know that this is a good choice for ruby, but if that's the choice the community made, then it would help to have a really good puppet manifest for setting up rvm in some controlled way (i.e., not writing stuff all over the filesystem but more or less sandboxed somewhere) [08:17:55] ori-l: rvm is also not the only choice [08:17:58] a lot of the rvm installation guides are: "just run `curl some.domain.com/install-rvm | sh`" [08:18:14] there is at least one more major player there, maybe it behaves better [08:18:26] ported the script to python? [08:18:27] http://rbenv.org/ [08:18:32] oh [08:18:44] :) that last part was just me trolling :P [08:18:58] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [08:20:19] hrm, rbenv looks interesting [08:20:49] ori-l: I am pretty sure there are others, but as far as I know, rvm and rbenv are the two big players [08:20:58] they specifically talk about compatibility with configuration management software (chef specifically) as a selling point [08:23:32] ori-l: I am open to change :) [08:23:47] zeljkof: which one do you use personally? 
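The diagnosis a few lines up (the native-extension compile inside 'bundle install' dies when memory runs out, with no error message saying so) can be illustrated with a small sketch: a child process under an address-space cap fails on an allocation that succeeds uncapped. The child is a Python snippet standing in for the gem build, and the cap and allocation sizes are arbitrary illustrative numbers:

```python
# Illustration of the failure mode described above: a memory-capped
# child process (standing in for the native-extension build running
# under puppet) fails on a large allocation, while the same work
# succeeds without the cap. Sizes are illustrative only.
import resource
import subprocess
import sys

def run_alloc(megabytes, cap_megabytes=None):
    """Run a child that allocates `megabytes`; optionally cap its
    address space first. Returns the child's exit status."""
    def set_cap():
        cap = cap_megabytes * 1024 * 1024
        resource.setrlimit(resource.RLIMIT_AS, (cap, cap))
    code = "x = bytearray({} * 1024 * 1024)".format(megabytes)
    proc = subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=set_cap if cap_megabytes else None,
        capture_output=True,
    )
    return proc.returncode

# A 300 MB allocation fails under a 256 MB cap but succeeds uncapped.
assert run_alloc(300, cap_megabytes=256) != 0
assert run_alloc(300) == 0
```

The capped child dies with a nonzero exit status and little context, which is roughly what the gem build looked like from puppet's side.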
[08:23:55] ori-l: rvm [08:24:04] but that is for historical reasons [08:24:23] rbenv was not there when I was picking a tool [08:24:35] I think rvm was the only choice [09:06:25] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [09:08:41] zeljkof: http://dpaste.de/B8Uw0/raw/ [09:08:48] i can't tell you how many times i've seen that trace today [09:09:00] retiring for the night, will try again tomorrow [09:09:13] ori-l: good night :) [09:09:24] good night [09:09:37] "Failed to build gem native extension" usually means dev tools are not installed [09:12:45] zeljkof: http://dpaste.de/niYDi/raw/ [09:12:50] everything installed [09:13:20] strange [09:13:31] you still think it is a memory problem? [09:13:55] well, if i start a login shell, chdir to the directory, and run 'bundle install', it works [09:14:09] if i tell puppet to do the exact same thing, it fails with that error [09:14:21] strange [09:14:39] but fortunately, if you want to help, you can reproduce this rather easily :) [09:14:52] I will try to reproduce it today [09:14:54] just pull the patch into your vagrant dir and 'vagrant up' and off you go [09:15:16] awesome, let me know if you discover something [09:15:27] bye for now [09:15:37] good night [09:24:00] New patchset: ArielGlenn; "try to fix issue on fresh installs where nrpe starts with weird uid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61963 [09:25:10] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61963 [09:29:26] New patchset: Hashar; "udp2log: let daemon recreate files after logrotate" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61964 [09:30:15] RECOVERY - Disk space on snapshot1004 is OK: DISK OK [09:30:15] RECOVERY - DPKG on snapshot1004 is OK: All packages OK [09:30:48] New review: Hashar; "I have added as reviewers Tim, Ori and Andrew Otto who have some knowledge about udp2log daemon :-]" [operations/puppet] 
(production) - https://gerrit.wikimedia.org/r/61964 [09:30:55] RECOVERY - RAID on snapshot1004 is OK: OK: no RAID installed [09:39:41] !log maxsem synchronized php-1.22wmf3/extensions/GeoData/GeoData.body.php 'https://gerrit.wikimedia.org/r/#/c/61962/' [09:39:49] Logged the message, Master [09:43:49] New patchset: ArielGlenn; "include the new nrpe::user class in nrpe" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61967 [09:45:17] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61967 [09:50:08] snapshot1001: @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! [09:50:11] !log maxsem synchronized php-1.22wmf3/extensions/ZeroRatedMobileAccess/ZeroRatedMobileAccess.body.php 'https://gerrit.wikimedia.org/r/#/c/61926/' [09:50:18] Logged the message, Master [09:51:00] MaxSem: you can update the known host from fenari [09:51:39] MaxSem: scp fenari.wikimedia.org:/etc/ssh/ssh_known_hosts ~/.ssh/known_hosts-wmf [09:51:40] Then in your ~/.ssh/config: UserKnownHostsFile ~/.ssh/known_hosts-wmf [09:52:00] the UserKnownHostsFile setting should be applied to your Host *.wmnet and Host *.wikimedia.org entries [09:52:12] if you make that scp a shell function, you can update it manually from time to time [09:52:21] as long as you trust fenari's fingerprint, you will be fine [09:52:35] note that the known_hosts-wmf is generated by puppet [09:52:58] !log maxsem synchronized php-1.22wmf3/extensions/ZeroRatedMobileAccess/ZeroRatedMobileAccess.body.php 'https://gerrit.wikimedia.org/r/#/c/61926/' [09:53:05] Logged the message, Master [09:54:22] !log maxsem synchronized php-1.22wmf2/extensions/ZeroRatedMobileAccess/ZeroRatedMobileAccess.body.php 'https://gerrit.wikimedia.org/r/#/c/61926/' [09:54:29] Logged the message, Master [10:21:16] RECOVERY - NTP on snapshot1001 is OK: NTP OK: Offset -0.01322698593 secs [10:28:36] RECOVERY - DPKG on snapshot1001 is OK: All packages OK [10:28:46] RECOVERY - RAID on snapshot1001 is OK: OK: no RAID 
installed [10:29:17] RECOVERY - Disk space on snapshot1001 is OK: DISK OK [10:31:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:32:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [10:35:06] PROBLEM - Puppet freshness on cp1031 is CRITICAL: No successful Puppet run in the last 10 hours [11:03:34] New patchset: ArielGlenn; "second try at getting icinga user its nagios group, thanks hashar" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61970 [11:06:34] New patchset: ArielGlenn; "second try at getting icinga user its nagios group, thanks hashar" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61970 [11:06:50] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [11:07:55] this wasn't fixed yet? [11:08:01] no [11:08:26] I still don't get why we need an icinga user in the first place [11:08:32] the nrpe package just uses a nagios user [11:08:37] but this will do for now I guess [11:08:52] since you're here do you want to look at this before it goes out? [11:11:26] nah, just push it [11:11:35] what could possibly go wrong :) [11:11:37] hahaha just when I added hashar as a reviewer [11:11:53] well I could break puppet on all hosts. 
already did that today on an earlier version of his change :-P [11:12:11] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61970 [11:18:00] PROBLEM - DPKG on snapshot1001 is CRITICAL: Connection refused by host [11:18:10] PROBLEM - RAID on snapshot1001 is CRITICAL: Connection refused by host [11:18:20] PROBLEM - Disk space on snapshot1001 is CRITICAL: Connection refused by host [11:18:57] New patchset: ArielGlenn; "icinga extra groups shouldn't have primary group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61971 [11:19:07] that's me on snapshot1001 for testing [11:19:59] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61971 [11:27:20] RECOVERY - Disk space on snapshot1001 is OK: DISK OK [11:28:00] RECOVERY - DPKG on snapshot1001 is OK: All packages OK [11:32:18] New patchset: ArielGlenn; "just one require in icinga user stanza" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61972 [11:33:05] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61972 [11:34:10] PROBLEM - RAID on snapshot1001 is CRITICAL: Connection refused by host [11:35:06] still me [11:36:20] PROBLEM - Disk space on snapshot1001 is CRITICAL: Connection refused by host [11:37:00] PROBLEM - DPKG on snapshot1001 is CRITICAL: Connection refused by host [11:37:12] New patchset: ArielGlenn; "don't require a group the user type will create for you" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61973 [11:37:53] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61973 [11:40:08] New patchset: ArielGlenn; "someone else will get to figure out where the 'dialout' group comes from" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61974 [11:40:59] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61974 [11:41:30] RECOVERY - 
DPKG on mw98 is OK: All packages OK [11:43:00] RECOVERY - DPKG on snapshot1001 is OK: All packages OK [11:43:20] RECOVERY - Disk space on snapshot1001 is OK: DISK OK [11:43:39] New patchset: Aude; "(bug 47610) Update Wikidata test settings to use $wgWBClientSettings and $wgWBRepoSettings" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61975 [11:43:59] apergos: when did you push new rings? [11:44:03] it's already halfway there [11:44:10] a couple days ago [11:44:19] that can't be [11:44:21] ms-be2? [11:45:15] ~9am yesterday? [11:46:43] New patchset: ArielGlenn; "must... define.. icinga group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61976 [11:46:48] whatever day it was [11:46:53] yesterday? day before? [11:47:03] 2013-05-01T09:18:00+00:00 [11:47:09] no earlier than a couple days anyways [11:47:25] yeah may1, that's right [11:47:48] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61976 [11:48:14] wonder if I got something wrong then [11:48:22] why? [11:48:26] I triple and quadruple checked it :-( [11:48:34] because every other run has taken much longer than 4 days [11:48:40] and we're doing more in this run [11:48:45] 20% of partitions reshuffled [11:49:03] you're not accounting for h310 vs. h710 [11:49:19] I told you you should have bumped it more :-) [11:49:29] I am; there are h310s still in the pile that are either moving or getting data from those reshuffled partitions [11:49:32] I'll try to login on Sat or so to bump it some more [11:49:38] when it's done [11:49:51] if it's not done by tomorrow that is [11:49:59] also remove ms-be11's sdh [11:50:23] I'll be on the road part of tomorrow but I can check on it in the evening [11:50:54] no worries [11:51:02] I'll be home :) [11:51:18] not going out of town? [11:52:11] nah [11:52:55] friends are visiting from France but the rendezvous point is three hours from here so... 
[11:53:10] PROBLEM - RAID on snapshot1001 is CRITICAL: Connection refused by host [11:53:35] fricking finallllly [11:53:42] that was a ginrmous timesink [11:53:54] root@snapshot1001:~# ps axuww | grep nrpe [11:53:54] icinga 13391 0.0 0.0 25476 1164 ? Ss 11:53 0:00 /usr/sbin/nrpe -c /etc/icinga/nrpe.cfg -d [11:54:10] RECOVERY - RAID on snapshot1001 is OK: OK: no RAID installed [11:54:22] three hours? [11:54:25] more [11:54:26] volos? [11:54:29] oh [11:54:40] heh I thought you meant about puppet which was also that long :-P [11:54:43] no, think south [11:55:13] I had an invite for larisa but put it off [11:57:10] break now, forgot to eat yesterday til 11:30 pm, today I must do better [11:58:00] RECOVERY - DPKG on mw2 is OK: All packages OK [12:07:52] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 12:07:48 UTC 2013 [12:08:22] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [12:09:02] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 12:08:55 UTC 2013 [12:09:22] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [12:09:52] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 12:09:49 UTC 2013 [12:10:22] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [12:10:52] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 12:10:44 UTC 2013 [12:11:22] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [12:11:32] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 12:11:27 UTC 2013 [12:12:22] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [12:12:42] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 12:12:32 UTC 2013 [12:13:22] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [12:14:02] PROBLEM - Puppet freshness on db45 
is CRITICAL: No successful Puppet run in the last 10 hours [12:23:58] New review: Hydriz; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61428 [12:27:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:28:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [12:32:50] New patchset: Mark Bergsma; "Migrate amslvs3/4 PyBal BGP peerings from csw2-esams to cr2-knams" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61977 [12:42:16] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61977 [12:45:45] re [12:47:50] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [12:47:50] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [12:48:33] paravoid: you bad, bad guy. [12:48:40] huh? [12:48:58] didn't you get my PM yesterday? [12:49:04] I was off yesterday :) [12:49:07] what's up? 
[12:52:00] apergos: the icinga stuff works for me in labs \O/ [12:56:30] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61941 [12:57:10] thanks for updating that andre__ I forgot to do it yesterday [12:57:59] Thehelpfulone, hmm, don't know which bug you refer to, but you're welcome :P [12:58:08] lol, wikimania2014 wiki [12:58:41] !log Migrated amslvs3 and amslvs4 PyBal BGP peerings from csw2-esams to cr2-knams [12:58:49] Logged the message, Master [13:00:57] !log maxsem synchronized wmf-config/mobile.php 'https://gerrit.wikimedia.org/r/#/c/61941/' [13:01:05] Logged the message, Master [13:02:55] hashar: yeah I tested it on my last host for reinstall [13:03:04] it only took 20 tries :-/ [13:04:15] New review: Andrew Bogott; "A couple of questions, inline" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/61816 [13:05:39] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [13:06:05] New review: Andrew Bogott; "This looks OK... you've tested it, I trust?" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/61975 [13:09:44] New review: Aude; "yes, tried it and using these settings on our dev "test" wikis" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61975 [13:09:47] New patchset: Andrew Bogott; "Remove morebots classes." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/58922 [13:11:21] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61975 [13:11:51] New patchset: Hashar; "beta: Echo uses the local wiki db" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61980 [13:12:35] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61980 [13:40:12] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [14:00:33] New review: Deyan; "It's true that Mojolicious is a bit volatile at the moment and is not the most comfy web framework f..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61767 [14:06:25] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [14:08:25] !log upgrading Jenkins (unscheduled maintenance). [14:08:33] Logged the message, Master [14:08:42] * hashar is shutting down Jenkins for a while. NO ETA [14:09:11] New review: Ottomata; "> I am not sure why there is a hadoop::defaults class" [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/61710 [14:09:26] !log shutting down Zuul [14:09:33] Logged the message, Master [14:11:35] !log Upgraded Jenkins from 1.480.3 to LTS 1.509.1. Restarted it. [14:11:42] Logged the message, Master [14:11:44] PROBLEM - zuul_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/local/bin/zuul-server [14:12:54] PROBLEM - jenkins_service_running on gallium is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [14:16:54] RECOVERY - jenkins_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [14:17:14] New review: Ottomata; "So for deps that don't already exist in debian/ubuntu apt, what should I do?" 
[operations/debs/kafka] (master) - https://gerrit.wikimedia.org/r/53170 [14:19:18] ottomata: which ones? [14:19:27] I don't mind packaging-wise [14:19:34] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours [14:19:41] but if they're GPL, it's a copyright violation for us to distribute jars without the source [14:20:04] (this is another reason why downloading them at build-time is wrong) [14:20:52] paravoid: I don't know i think, i haven't looked into it much, and probably won't have time this week (and I'm going on vaca next week) [14:21:07] i'm just trying to understand what the requirement is [14:21:39] i understand not downloading and installing deps at install time, but I don't quite understand how this is bad to do at build time [14:21:50] doing it at build time gets the deps frozen in the .deb you are building [14:21:54] New review: Faidon; "I don't mind packaging-wise, but if they're licensed under a copyleft license (GPL) it might be a co..." [operations/debs/kafka] (master) - https://gerrit.wikimedia.org/r/53170 [14:22:03] you're fetching binaries [14:22:13] not that it's okay to fetch sources [14:22:24] but fetching binaries and embedding them in the jar is also a copyright violation [14:22:34] (depending on the license) [14:22:48] also, you're fetching unsigned binaries, what makes you think they're not backdoored? [14:24:14] plus, do we really want to e.g. rebuild kafka and get a newer zookeeper or libfoo-java than what we run in production? [14:25:06] my phone just notified me that I am invited into a meeting with you and drdee in 5'? [14:27:13] oh? [14:27:19] hmmm [14:27:27] drdee likes to make sneaky meetings! 
[14:28:14] RECOVERY - RAID on ms-fe1004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [14:28:25] paravoid, i'm for seeing if we can satisfy deps using .debs that already exist [14:28:28] i will try that fo sho [14:28:33] i'm talking just more generally [14:28:38] about what the right thing to do is [14:28:53] in Debian the right thing to do would be to package each jar individually [14:29:01] every dep? [14:29:01] I don't have as high standards for wikimedia [14:29:28] so embedding a few jars would be okay with me, but be careful of copyright violations [14:29:49] yes, every dep [14:30:05] ok, example: let's say that kafka has a dep that it needs from github or whatever [14:30:41] for wmf, you are saying that depending on the license, it is ok for us to include the binary for that dep in this kafka debianization repository [14:30:47] so that we don't have to dl it at build time [14:31:02] it's not great, but I guess it's a reasonable compromise [14:31:20] ok. that's fine, but i'm going to argue against myself for a second [14:31:58] security wise, is that really any different than dling at build time? [14:32:01] feel free to argue with me too, I won't blame you :-) [14:32:14] the difference is that this won't change under your feet [14:32:17] next time you try to rebuild [14:32:30] it's also going to be recorded in git [14:32:36] what if the build-dep was for a pre-packaged .deb? [14:32:42] ? [14:32:49] debian might upgrade the version in apt or something [14:32:55] that would change under your feet for the next person [14:33:04] yes, but then it goes through Debian, with a version and a corresponding source [14:33:15] hm. 
[14:33:26] and signed by the developer, so there's a track record [14:33:40] we don't have the resources to security audit every software we use, so we rely on third parties [14:34:48] I even gave you a plausible attack scenario :) [14:34:54] our address space is well known [14:35:03] several people use labs for building packages [14:35:55] oh totally, i understand the attack scenario, i'm just trying to understand philosophically the difference, i guess i get it. it kinda seems like a line in the sand to me, but it is a line [14:36:16] as I said, embedding the jar is still a compromise [14:36:24] ideally we'd build all of them from source [14:36:29] yeah [14:36:30] and have the source in git [14:36:41] ok, well i haven't even looked into what these kafka deps are yet [14:36:48] that's what I'd do, but if I ask you to you or drdee are going to kill me :-) [14:36:58] haha [14:36:58] yup [14:37:15] i will do the best I can with what is there, I'll come back for more debate later :) [14:37:25] most of the deps existed in Debian [14:37:28] ok cool [14:37:29] if not all [14:38:09] yeah, i'll check it out the week of the 12th when I'm back from vaca, and see if I can satisfy them all [14:38:18] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61892 [14:38:36] have fun :-) [14:38:37] aside from that, we have a few more .debs to do coming up (storm, jzmq,…uhhh maybe something else?) [14:38:43] going anywhere interesting? [14:38:44] you're going to amsterdam, right? :) [14:39:08] 496 git fetch origin [14:39:08] 497 git fetch origin [14:39:08] 498 git diff origin [14:39:08] 499 git merge --ff-only origin [14:39:08] yeah, next week: 2 days biking in virginia, and then 4 days canoeing back down a river [14:39:14] someone's been naughty [14:39:20] and didn't use puppet-merge [14:39:31] hehe [14:41:55] New review: Faidon; "I'm all for PEP8, but splitting the regexp strings into multiple lines of 79 cols is just bad for re..." 
[operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/61889 [14:43:09] andrewbogott: keep them coming :) [14:44:12] paravoid. that's me not using puppet merge. Did I miss a memo? [14:44:41] you did :) [14:45:02] ottomata created "puppet-merge", you just run that, even without cd'ing into ~/puppet [14:45:09] and it fetches, diffs and prompts you to merge [14:46:01] OK, that seems simple enough :) [14:50:50] New review: Andrew Bogott; "I pretty much agree -- the line-length constraint in pep8 is a constant plague, often requiring code..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61889 [14:51:16] New patchset: Ottomata; "Sending bits esams EventLogging traffic to gadolinium for vandium relay." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61985 [14:51:17] New patchset: Ottomata; "Sending varnishncsa traffic to gadolinium instead of oxygen for multicast relay." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61986 [15:00:32] New patchset: Andrew Bogott; "Pep8 cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61889 [15:01:02] New review: Andrew Bogott; "# noqa turns off the line-length check for a given line (and also makes the line even longer.)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61889 [15:04:22] eww :) [15:05:08] we could just ask hashar to turn off line-length checks altogether. [15:05:20] At this point I'm habituated to the 80-char limit but I'm not emotionally attached :) [15:05:27] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [15:05:31] yeah that could be done in the rakefile [15:05:36] need to find out the name of the check [15:05:40] E501 [15:05:44] ah for pepe8 [15:05:50] edit the .pep8 file at the root [15:05:57] ignore = E501 [15:06:00] (comma separated) [15:06:06] that should do it [15:06:16] Yeah… I'm not sure if it's the right call or not. 
It would be nice if we could have a limit but have it be 120 instead of 80 [15:06:17] there is no .pep8 oh my [15:06:27] Avoiding crazy run-on lines seems generally good. [15:06:28] I'm usually okay with 80-chars, but in this case it's just wrong [15:06:31] the standard says 80 :D [15:06:47] OK. Well, let's leave it on for now, and see if we wind up with hundreds of noqas. [15:06:48] so either you ignore the standard or apply it but there is no gray area *grin* [15:07:14] aeah [15:07:23] then we ignore the standard [15:07:23] I think I solved jenkins slow start [15:07:28] The pep8 standard is all about maximizing code readability, and also it is 1986. [15:07:39] i'm sure 80 chars made sense in 1986 [15:07:47] 80 chars is great for me. [15:08:09] 80 char line lengths is what makes people choose 2-space indentation :P [15:08:12] I do all my edits in terminals, and that is nice when reviewing git patches [15:08:13] so, the case here is [15:08:14] match = re.match(r'^/(?P[^/]+)/(?P[^/]+)/((?Ptranscoded|thumb|temp)/)?(?P((temp|archive)/)?[0-9a-f]/(?P[0-9a-f]{2})/.+)$', req.path) [15:08:20] splitting the regexp is just wrong [15:08:23] very wrong [15:08:27] I disagree [15:08:31] the regex itself is wrong [15:08:42] it is too long and hard to understand / debug / edit [15:08:45] not at all [15:08:54] it's very easy and readable [15:09:14] this is basically the purpose of this script mind you [15:09:20] and we reuse those regexps in squid and varnish as well [15:09:36] I don't use regexps at all, but I consider myself regexdyslexic. I assume that everyone else can immediately tell what they do. [15:10:06] re.X for the win! [15:10:34] Anyway, we don't need to argue about this thanks to # noqa! [15:10:47] re.match( r'my super long [15:10:47] regex # explanation [15:10:48] regexbit again # some other explain [15:10:49] ', re.X ) [15:10:59] that disables the pep8 check entirely? 
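hashar's fragmentary `re.X` suggestion above can be made concrete. A minimal sketch in Python: note that the group names used here (`proj`, `lang`, `zone`, `path`, `shard`) are hypothetical, because the regexp pasted into the channel lost its real group names, so this illustrates the `re.X` technique rather than the actual rewrite.py rule:

```python
import re

# A re.X (verbose) version of the long one-line regexp discussed above.
# Group names below are ASSUMED for illustration; the pasted line lost
# the originals. re.X ignores unescaped whitespace and allows comments,
# so each component can be documented in place.
UPLOAD_PATH = re.compile(r"""
    ^/
    (?P<proj>[^/]+) /                         # e.g. wikipedia
    (?P<lang>[^/]+) /                         # e.g. commons
    (?: (?P<zone>transcoded|thumb|temp) / )?  # optional zone prefix
    (?P<path>                                 # container-relative path
        (?: (?:temp|archive) / )?
        [0-9a-f] /                            # first hash character
        (?P<shard>[0-9a-f]{2}) /              # two-character shard
        .+                                    # the object name itself
    ) $
""", re.X)

m = UPLOAD_PATH.match("/wikipedia/commons/thumb/a/ab/Foo.jpg/200px-Foo.jpg")
```

This is what `re.X` buys over splitting a plain string across 79-column lines: the pattern stays one expression, but every alternative carries its own comment.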
[15:11:15] !log restarted Zuul [15:11:20] mark: Appended to a single line, it disables certain checks for that line ony. [15:11:22] only. [15:11:23] Logged the message, Master [15:11:30] yuck [15:11:41] so you add a comment to a long line to have your checker not complain about that line [15:11:44] that's just wrong [15:11:46] 18:04 < paravoid> eww :) [15:11:47] RECOVERY - zuul_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/local/bin/zuul-server [15:11:51] 18:11 < mark> yuck [15:11:54] heh [15:12:11] just disable that stupid check [15:12:18] I don't need a program like that impose rules on me [15:12:37] !log Jenkins made Jenkins to instantly restart (was {{bug|47120}}) ) by deleting the downstream-buildview plugin. [15:12:45] Logged the message, Master [15:13:23] pep8 doesn't seem to attempt to read a .pep8 [15:13:57] .config/pep8 [15:14:02] unless I'm using an old pep8 [15:14:22] Hm, or supposedly .pep8 in the dir with the code, I guess. [15:15:11] Marked in a section with [pep8] <- just realized I am quoting the docs which paravoid is surely also reading right now [15:15:34] yeah, that's a newer pep8 [15:15:38] than what squeeze has :) [15:17:15] and gallium has an old pep8 too :( [15:17:24] 1.3.3 [15:17:31] I will have to backport the one from raring I guess [15:21:08] New patchset: Reedy; "Move all deployment path vars to using /usr/local/apache/common-local" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60434 [15:24:07] New patchset: Faidon; "Swift: pep8 clean rewrite.py" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61889 [15:24:35] andrewbogott: that's as far as I'm willing to go, and pep8 runs cleanly here [15:26:09] * andrewbogott notes that long lines are a pain to read in Gerrit [15:27:54] paravoid: if Jenkins is happy then I'm happy [15:28:12] jenkins doesn't seem to run there [15:29:58] It should eventually… it did for previous patches [15:31:24] Anybody in to check an 
issue with a specific ogv file not updating on Commons? See https://bugzilla.wikimedia.org/show_bug.cgi?id=48004 [15:34:24] andre__: I can confirm that [15:35:15] New patchset: Reedy; "Add a sqldump script wrapper around mysqldump" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43844 [15:35:58] LeslieCarr: ^^ You took a look last time on purging issues... I'm wondering whether CC'ing you by default on purging issues is okay, or better not? [15:37:03] New patchset: Mark Bergsma; "Migrate amslvs1/2 PyBal BGP peerings from csw1-esams to cr1-esams" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61990 [15:39:17] need bigger routers [15:40:06] ? [15:40:17] running out of ports ;) [15:42:44] mark: I heard you like routers, so I put... [15:43:34] mark: why aren't we replacing HTCP with some more reliable transport btw? [15:43:43] like 0mq or similar? [15:44:36] because I like packet loss [15:48:28] ...? :) [15:58:24] New patchset: BBlack; "Work-In-Progress vhtcpd code." 
[operations/software/varnish/vhtcpd] (master) - https://gerrit.wikimedia.org/r/60390 [16:03:19] New patchset: Reedy; "wikimania2014.wikimedia.org config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61991 [16:07:57] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [16:08:08] New patchset: Jgreen; "switch db1013 in for db1025" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61993 [16:08:58] New patchset: Reedy; "wikimania2014.wikimedia.org config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61991 [16:12:17] andre__: don't just cc me by default for example i'll be gone for a few weeks [16:12:24] andre__: email the ops list instead [16:12:49] andre__: or like right now where i'm in a class [16:14:06] Reedy, for 'wmgCentralAuthLoginIcon' => array( [16:14:06] 'wikimania2013wiki' => '/usr/local/apache/common/images/sul/wikimania.png', [16:14:06] seems to be only wikimania wiki where that has been added - do you know why, and do we need it for wikimania2014wiki? [16:14:22] Because it's the current active wiki [16:14:41] There's almost no reason for 99.9% of users to be logged into the old wikimania wikis [16:15:07] When wikimania 2013 is over, that entry should probably be removed [16:15:09] ah, yeah the old wikis are locked - so you mean the logo links to that? [16:15:17] Yeah for autologin [16:15:32] The question is when we add wikimania2014, certainly at this point IMHO it's too early [16:15:55] indeed, after Wikimania 2013 is over I'd imagine + a month or two for people to update submission pages [16:16:16] Not an urgent task, but shouldn't sit around for ever [16:17:16] LeslieCarr, true, totally makes sense. Thanks! [16:17:56] New patchset: Andrew Bogott; "Don't override the logo if it has already been customized." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/61994 [16:21:38] New patchset: Andrew Bogott; "Don't override the logo if it has already been customized." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61994 [16:22:02] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61993 [16:23:31] aww, gerrit-wm didn't even report my review - does it only report +2s? [16:23:47] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 218 seconds [16:24:47] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 22 seconds [16:25:36] dang, Jenkins is ignoring that .pep8 file :( hashar, any suggestions? https://gerrit.wikimedia.org/r/#/c/61889/ [16:28:04] andrewbogott: pep8 is run from the root of the repository [16:28:13] andrewbogott: that is where it looks for the .pep8 file [16:28:23] hashar, yeah, but it should pick up local files [16:28:24] andrewbogott: that also mean the ignore will be applied to all .py :) [16:28:36] At least, I'm pretty sure I've used versions that allow selective rule changes [16:28:40] apparently it does not :/ [16:28:46] i got v1.3.3 on gallium [16:28:57] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection timed out [16:29:07] PROBLEM - Apache HTTP on mw1153 is CRITICAL: Connection timed out [16:29:17] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:17] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:27] andrewbogott: seems to work with pep8 1.4.5 on my laptop [16:29:27] PROBLEM - Apache HTTP on mw1155 is CRITICAL: Connection timed out [16:29:27] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection timed out [16:29:27] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection timed out [16:29:28] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection timed out [16:29:28] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection 
timed out [16:29:58] hashar: Well… ok, I guess I'll go ahead and turn off those warnings everywhere then. It's that or bikeshed for eternity :) [16:30:07] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 1.268 second response time [16:30:21] andrewbogott: or upgrade pep8 on gallium? :D that would need a backport of the package from raring [16:30:45] hashar: Presuming that the new version manages per-directory settings. Let me verify. [16:30:50] test breaking [16:31:03] bah we already did backport it : / [16:31:07] wth [16:31:14] raring as v1.3.3 http://packages.ubuntu.com/search?keywords=pep8 [16:32:02] test breaking [16:32:07] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 3.526 second response time [16:32:10] so it couldn't restart [16:32:11] yay it's working [16:32:21] now let's fix rendering.svc.eqiad.wmnet [16:32:37] it's not rendering, it's swift [16:32:42] ah [16:32:46] goddamn infrastructure loop [16:33:06] it's recovering now, but still looking on what happened [16:34:27] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 66160 bytes in 0.151 second response time [16:34:39] hashar: Confirmed, 1.4.5 seems to observe that local file just fine. [16:34:57] wtf? I can't see our ipv6 routes [16:35:05] andrewbogott: so we need to find out where the debian package is maintained and get it up to 1.4.5 :D [16:35:09] maybe it's my isp [16:35:15] no v6 here sorry :( [16:35:28] hashar: Would also be nice to verify that the version running on integration /doesn't/ work on the cmdline [16:35:51] paravoid: what's your as ? [16:36:02] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.649 second response time [16:36:04] hashar, are you equipped to do that? [16:36:22] andrewbogott: well that is what jenkins does. 
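For reference, the configuration file being discussed would look something like the sketch below: `E501` is the line-length check named earlier, the `ignore` list is comma-separated, and (per the chat) only newer pep8 releases such as 1.4.5 pick up a `.pep8` next to the code or a `.config/pep8`, while the 1.3.3 on gallium reads a single config per invocation:

```ini
; .pep8 -- a sketch of the file discussed above, placed at the
; repository root (or per directory with a sufficiently new pep8).
[pep8]
ignore = E501
```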
[16:36:22] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 8.047 second response time [16:36:33] hashar: nevertheless... [16:37:00] hashar: If you want to do the backport based on faith I won't stand in your way :) [16:37:02] 5408 [16:37:02] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.051 second response time [16:37:05] and yet it's imagescalers [16:37:32] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection timed out [16:38:02] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 3.631 second response time [16:38:22] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 66160 bytes in 0.166 second response time [16:38:34] andrewbogott: too many things to handle right now. Maybe ping the ubuntu people at https://launchpad.net/pep8 They have to allow 1.4.4 in raring. [16:39:12] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.056 second response time [16:39:52] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.066 second response time [16:40:05] andrewbogott: or if you are brave try back porting the v1.4.4 from debian to our apt.wm.o http://packages.debian.org/source/unstable/pep8 might have dependencies issues sthough [16:40:53] it will [16:45:23] New patchset: Krinkle; "contint: Move apache logs to readable place for localhost testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61997 [16:45:38] damn, was hoping they were on ams-ix [16:46:51] i'll email their upstream and see if they'll peer [16:46:56] ? [16:47:08] New patchset: Ottomata; "- Sending varnishncsa traffic to gadolinium instead of oxygen for multicast relay. - Removing locke varnishncsa instance. locke is no longer used." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/61986 [16:47:38] your AS isn't on amsix directly but all the routes to it are via https://www.peeringdb.com/view.php?asn=20965 (that we're seeing at ams) [16:47:41] New review: Krinkle; "Why vhost combined instead of combined?" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61195 [16:47:57] and it's all a few hops away [16:48:26] my AS is where I previously worked [16:48:28] hashar: I'm wrong, 1.4.5 doesn't work right either if you run it from the root dir. [16:48:46] on the network operations center :) [16:49:01] both grnet and geant have open looking glasses [16:49:03] looking [16:50:42] or on cr2-knams you can look at show route aspath-regex ".* 5408 .*" [16:50:53] so many hops! [16:50:57] hm, it's local to them it seems [16:51:01] tell me about it... :) [16:51:06] I'll notify them [16:52:03] !log nikerabbit synchronized php-1.22wmf3/extensions/UniversalLanguageSelector/ 'ULS to master' [16:52:10] Logged the message, Master [16:52:26] New review: Physikerwelt; "According to 1)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61767 [16:53:46] New review: Thehelpfulone; "Testing gerrit-wm" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61991 [16:53:57] ah there we go, something was wrong with it the first time I reviewed [16:57:51] mark: ping! [16:58:05] yes? [16:58:30] I was wondering -- would it be possible to serve device detection along with the GeoIP stuff from varnish in one call? [16:59:05] !log Zuul is somehow having trouble kicking off Jenkins jobs (less than 1 event processed per minute). Jenkins shows that 10/10 executors are idle. Investigating... [16:59:08] with a parameter maybe? [16:59:12] Logged the message, Master [16:59:22] !log Jenkins is nearing 100% CPU on gallium, what is Jenkins doing? [16:59:30] Logged the message, Master [17:00:10] we couldn't handle every call being a device lookup? 
(I'm looking at this in context of centralnotice) [17:00:23] I could set a cookie/localstorage object [17:00:53] we could but why would we want to do it on every call? [17:00:54] hashar: Can we change the Jenkins line to find . -name "*.py" -exec /usr/bin/pep8 {} \; [17:00:55] ? [17:00:57] not every call needs it [17:01:37] also, can't it be done client side? :) [17:02:24] it can be done in JS -- https://gerrit.wikimedia.org/r/#/c/61988/ -- but if it resides in multiple places its just multiple places to update [17:03:00] well I don't like unnecessary processing on the varnish layer [17:03:34] interesting, another person, going via geant to tele2 having some issues [17:03:58] mark: makes sense -- but you aren't totally opposed to having varnish do it so long as I can cache the result? [17:04:19] we can do it but let's make it optional [17:04:37] kk -- I'll investigate some other options [17:04:43] thanks :) [17:05:06] and in general i'd like to keep our http caching layer as close to transparent as possible [17:05:12] andrewbogott: https://github.com/wikimedia/integration-jenkins-job-builder-config/blob/master/operations-puppet.yaml#L63-L72 https://github.com/wikimedia/integration-jenkins-job-builder-config/blob/master/python-jobs.yaml#L1-L12 https://github.com/wikimedia/integration-jenkins-job-builder-config/blob/master/macro.yaml#L224-L239 [17:05:31] the more features we put there, the harder it becomes to manage that, migrate to other solutions, the less efficient, etc ;) [17:05:40] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [17:05:45] mark: maybe after we switch 404 handler to varnish we should do chash for image scalers [17:05:55] Krinkle, you are encouraging me to be bold? [17:06:02] paravoid: what's the point? 
[17:06:33] andrewbogott: don't self-merge, but suit yourself yes :) I don't contribute to any python projects, so I wouldn't be sure how to verify it [17:06:38] to only overload one server instead of all on weird originals [17:06:45] ah, right [17:06:51] overloads should be handled by limits & cgroups, but sometimes these fail [17:06:51] yes [17:06:59] Krinkle: Hm… not obvious to me how we can run a find command and still capture errors... [17:07:03] andrewbogott: What is the problem exactly? [17:07:05] I think that's what happened now, I found a 3M gif on several of those servers [17:07:22] poolcounter? ;) [17:07:24] andrewbogott: Doesn't pep8 skip non-python files? [17:07:39] heh, maybe that too [17:07:49] Krinkle: For each run of pep8 it reads the .pep8 config once and only once. I want it to read one per-directory instead. [17:08:12] Krinkle: So, a recursive 'run this test in each directory' would work fine, if Jenkins has syntax for that already [17:08:37] andrewbogott: it's all bash, so everything is possible [17:08:39] anyway [17:08:40] ttyl [17:08:50] andrewbogott: however, you'd need to aggregate the errors in one valid package [17:09:07] one pep8-report package that is [17:09:21] andrewbogott: seems like a feature request for pep8 [17:09:44] andrewbogott: I know jshint supports it. It recurses the directory and for each file it uses the closest config file [17:09:53] Krinkle, maybe we should be using git-changed-in-head anyway [17:10:21] andrewbogott: If you can make pep8 do what you want in a single pep8 command we can do that [17:10:23] otherwise not [17:10:36] if you call it once for each it will generate separate reports [17:10:59] which are probably a bitch to interpret and aggregate validly [17:11:03] Does git-changed-in-head run once per file, or run once with a list of files? [17:11:17] it returns a list of file names [17:11:41] Oh… not so useful. [17:11:45] OK.
[17:12:12] andrewbogott: does pep8 find a config file for each if you pass it a variadic list of arguments? [17:12:25] Dunno. I'll check. [17:12:26] or only once per overall invocation? [17:13:03] nope [17:13:05] Only once per [17:13:06] andrewbogott: anyway, I'd recommend filing a request upstream. They'll either implement it, or give a way to do it already I guess. [17:13:24] Well, we can just apply the rule exceptions universally... [17:13:41] Or fix the warnings ;-) [17:13:49] We might wind up doing that piece by piece anyway, since people (mark, paravoid) seem to hate these rules anyway… and I'm not in love with them either. [17:14:23] If we adopt a different style then it should be consistent, so yeah, doing it universally would be justified [17:14:27] Krinkle, in this case there are good reasons to ignore the warnings… the code would have to be obfuscated to work around it. [17:14:40] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 218 seconds [17:14:46] sure [17:15:06] so adopt it as a coding style (instead of a local exception) in general [17:15:40] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 14 seconds [17:22:31] New patchset: Andrew Bogott; "Turn off pep8 rules about line width and operator spacing." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61999 [17:23:38] New patchset: Andrew Bogott; "Swift: pep8 clean rewrite.py" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61889 [17:24:09] New patchset: Aaron Schulz; "Removed job queue migration config." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62000 [17:25:14] * andrewbogott -> lunch [17:31:44] New review: Demon; "Not sure I agree with ignoring operator spacing." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61999 [17:32:56] New patchset: Ottomata; "Adding Ram on analytics nodes."
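The per-file "closest config" lookup Krinkle describes for jshint (and which pep8 lacked at the time) amounts to walking up the directory tree from each file until a config is found. A minimal Python sketch, assuming the `.pep8` config filename mentioned above; `closest_config` is a hypothetical helper, not part of pep8 itself:

```python
import os

def closest_config(start_dir, name=".pep8"):
    """Walk upward from start_dir until a config file called `name`
    is found; return its path, or None once the filesystem root is
    reached. Mirrors jshint's per-file closest-config lookup that the
    discussion above wants pep8 to grow."""
    d = os.path.abspath(start_dir)
    while True:
        candidate = os.path.join(d, name)
        if os.path.isfile(candidate):
            return candidate
        parent = os.path.dirname(d)
        if parent == d:  # reached the root without finding one
            return None
        d = parent
```

A pep8 wrapper built on this would group files by whichever config each resolves to and run pep8 once per group, which keeps the reports aggregatable in a single invocation, as Krinkle asks for.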
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/62001 [17:33:32] New review: Ottomata; "This should not be merged until May 6." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/62001 [17:36:56] !log Jenkins keeps clogging up. Starting an emergency restart. [17:37:03] Logged the message, Master [17:38:28] Krinkle: Know the cause? [17:39:02] marktraceur: immediate cause is Jenkins having 100% CPU while it appears to be completely idle. [17:39:19] Hrm. [17:39:19] so whatever it is the thing that caused it is no longer active [17:39:28] Krinkle: I think hashar was seeing that problem before [17:39:29] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [17:39:29] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [17:39:29] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [17:39:29] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [17:39:56] All executors are idle and yet it is 98% CPU, as a result Zuul has almost no response time from Jenkins to queue new jobs [17:40:02] it is progressing but much too slow [17:40:24] Queue has multiplied over the last hour from 10 to 50 [17:40:33] 72 events now [17:41:00] marktraceur: Expect false positives in Gerrit (job "LOST") [17:41:02] That's fine [17:41:40] Since zuul unexpectedly lost connection with Jenkins (it doesn't know to detect a restart and pick up later) [17:46:30] New patchset: Reedy; "Making $wgAllowUserJs, $wgAllowUserCss, $wgSecureLogin configurable per wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62002 [17:46:52] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62002 [17:47:27] !log reedy synchronized wmf-config/ [17:47:35] Logged the message, Master [17:47:49] New patchset: Reedy; "wikimania2014.wikimedia.org 
config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61991 [17:48:32] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61991 [17:49:08] New review: Andrew Bogott; "As usual, I don't care so much what the standard is as that there /be/ a standard. I think Faidon w..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61999 [17:52:34] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61994 [17:53:25] New review: Demon; "The main problem wasn't about returning it to the users or not, but rather the logging (we got a *to..." [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/60860 [17:53:45] !log reedy synchronized wmf-config/InitialiseSettings.php [17:53:53] Logged the message, Master [17:55:55] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: [17:56:03] Logged the message, Master [17:59:20] !log Jenkins restart complete. No visible improvement. Jenkins is still idling most of the time while Zuul is still halted by an unknown factor on spawning jobs. [17:59:28] Logged the message, Master [17:59:41] Krinkle: that makes me sad ;_; [18:00:31] New review: Demon; "This is for MediaWiki (& extensions), but we tend to discourage vertical alignment: https://www.medi..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61999 [18:02:28] thanks Reedy, can you +crat me? [18:03:08] Where do you want +crat, Thehelpfulone? [18:03:22] wm14, I assume [18:03:36] Krenair, wikimania2014 wiki, to set up the wiki pages like I did last year and I'm helping the bid team [18:04:22] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [18:04:37] Thehelpfulone: Reedy: and you already merged it:) [18:04:56] need anything else? 
the docroot is there [18:06:03] mutante, there's that iegcom wiki too [18:08:07] ack [18:11:36] New patchset: Ottomata; "Changing Erik Bernhardson's ssh key." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62006 [18:11:47] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62006 [18:12:35] Thehelpfulone: Look in your preferences, can you confirm whether your email address is marked as confirmed or not? [18:12:53] yep [18:12:54] Same for anyone else that might be watching [18:12:56] from 2007 [18:13:01] Right, good [18:19:12] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [18:21:46] Can someone set up the search indexes for wikimania2014wiki please? https://wikitech.wikimedia.org/wiki/Lucene#Adding_new_wikis [18:22:57] New review: Ottomata; "I just double checked, udp2log will create its the file when it is SIGHUPed." [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/61964 [18:24:20] binasher: gdash seems down [18:24:33] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61985 [18:27:28] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61986 [18:27:56] New patchset: Ottomata; "Removing bits locke varnish logging instance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62008 [18:28:17] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62008 [18:29:45] !log Installed Monitoring plugin from Jenkins control panel [18:29:52] Logged the message, Master [18:31:28] Aaron|home: mwscript seems to be eating arguments [18:31:29] reedy@fenari:/home/wikipedia/common/php-1.22wmf3$ mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=wikimania2014wiki translate [18:31:29] This script is not configured to create tables for [18:34:54] New patchset: Reedy; "Defining BINDIR before you use it is 
helpful" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62010 [18:35:09] dinner [18:35:18] Krinkle: do you have a reverse proxy in front of jenkins? [18:35:38] New patchset: Ottomata; "Just in case escapes are different with single vs double quotes, I'm leaving this as it was before." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62011 [18:35:52] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62011 [18:36:06] New patchset: Vogone; "(bug 48013) Creating a flood user group for wikidatawiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62012 [18:36:52] Krinkle|detached: (whenever you get back) i'm wondering if you could check the timestamps for when zuul hit the URI job/mwext-MobileFrontend-lint/buildWithParameters, to see how many times and if there were failures, and what the timestamps were [18:37:28] New patchset: Ottomata; "Keeping double quotes just in case single quote escaping is different" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62013 [18:37:38] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62013 [18:37:40] Krinkle|detached: This job https://integration.wikimedia.org/ci/job/mwext-MobileFrontend-lint/2818/console started at 18:09:47, but zuul started trying to launch it at 18:08:50 [18:40:44] !log varnishncsa now sends traffic to gadolinium instead of oxygen for multicast relay [18:40:52] Logged the message, Master [18:41:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [18:43:40] New patchset: Vogone; "(bug 48013) Creating a flood user group for wikidatawiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62012 [18:44:08] Reedy, interwiki links are also broken too, 
not sure if that's related to the search index [18:44:17] IW cache will need rebuilding [18:44:37] New review: Hashar; "(1 comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/61999 [18:44:40] hey dudes, what's the easiest way to stop and remove a varnishncsa instance from all varnishes? :) [18:44:43] dsh somehow? [18:45:18] notpeter, any thoughts?^ [18:46:03] oh i've done this before, i created a dsh group for varnish-all [18:46:03] hmm [18:46:48] !log reedy synchronized wmf-config/interwiki.cdb 'Updating interwiki cache' [18:46:51] Logged the message, Master [18:47:59] Anyone know why updateinterwikicache is missing from /usr/local/bin on fenari? [18:48:14] it's in deployment.pp.. [18:49:30] And physically there at files/misc/scripts/updateinterwikicache [18:51:43] !log removed varnishcsa-locke instance from varnish hosts: (dsh -c -g varnishncsa-all 'test -f /etc/init.d/varnishncsa-locke && service varnishncsa-locke stop && update-rc.d -f varnishncsa-locke remove && rm -v /etc/init.d/varnishncsa-locke') [18:51:50] Logged the message, Master [18:52:34] jeblair: Yes, Jenkins is running behind Apache [18:52:49] jeblair: Through the /ci path on port 80 [18:53:04] PROBLEM - Varnish traffic logger on cp3004 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:14] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:14] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:14] PROBLEM - Varnish traffic logger on cp1022 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:14] PROBLEM - Varnish traffic logger on cp1032 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:14] PROBLEM - Varnish traffic logger on cp3010 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:14] PROBLEM - Varnish traffic logger on 
cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:14] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:24] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:24] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:24] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:24] PROBLEM - Varnish traffic logger on cp1031 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:24] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:24] PROBLEM - Varnish traffic logger on cp1034 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:24] PROBLEM - Varnish traffic logger on cp1026 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:34] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:34] PROBLEM - Varnish traffic logger on cp1029 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:34] PROBLEM - Varnish traffic logger on cp1036 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:34] PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:34] PROBLEM - Varnish traffic logger on cp3009 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:34] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:35] jeblair: I don't have access to access logs on that machine, I'll have to ask someone else [18:53:48] ottomata: hehe, might want to fix the check as well :) 
[18:53:54] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:54:04] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:54:21] Anyone in ops with sudo rights on gallium available to do some apache access log checks for me? [18:54:48] I'd like to know hits to job/mwext-MobileFrontend-lint/buildWithParameters for integration.wikimedia.org in the past few hours. [18:55:00] (timestamps and full url) [18:55:33] I am going to downgrade jenkins I guess [18:55:47] daoohh [18:55:47] i am p***ed off [18:55:51] on it [18:55:53] LeslieCarr: thanks [18:56:09] ottomata: was 'on it' to me? [18:56:26] hashar: can you check access logs? [18:56:31] no [18:56:36] you have sudo there [18:56:38] I am going to get Jenkins downgraded [18:56:41] k [18:56:41] LeslieCarr: nrpe_command => "/usr/lib/nagios/plugins/check_procs -w 3:3 -c 3:6 -C varnishncsa" [18:56:48] not sure what to change that to [18:56:50] looking... [18:56:55] but there should be only 2 varnishncsa procs [18:56:58] it simply does not work and I get the root cause [18:57:10] so I guess we are going to have yet another stupid 1 hour downtime [18:57:11] OR [18:57:11] hashar: It does work, https://integration.wikimedia.org/zuul/ [18:57:12] i think 2:2 [18:57:20] 2:2 -c ? [18:57:47] Krinkle: the problem seems to be that now on EACH Jenkins API call, the stupid Jenkins backend attempts to reparse the full history [18:57:58] ah that's critical [18:58:00] i see ok cool [18:58:01] i get it [18:58:27] hashar: https://integration.wikimedia.org/ci/monitoring [18:58:35] hashar: There are over 10,000 missing class warnings every few minutes [18:58:43] WARNING: Failed to resolve class [18:58:49] over 200,000 in the last hour [18:58:53] could be unrelated though [18:58:59] http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT [18:59:03] Krinkle: is that melody thing a new thing?
Never heard of that before [18:59:13] hashar: i installed the plugin an hour ago [18:59:14] Krinkle: I guess it is recent given the graph history :) [18:59:16] see ops log [18:59:21] New patchset: Ottomata; "Fixing varnishncsa process check now that there are fewer varnishncsa processes." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62017 [18:59:26] have you upgraded any plugin? [18:59:30] no [18:59:43] I installed it *after* things escalated [18:59:44] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 183 seconds [18:59:57] ah you restarted it [19:00:06] that too [19:00:19] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62017 [19:00:20] though that was one hour ago [19:01:30] Krinkle: so here is my long theory. On start up jenkins used to parse all the build.xml files. With the new version that is Lazy loaded so the startup is really fast [19:01:44] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [19:01:47] Krinkle: but as soon as something requires information from a job build history, that kicks a parse of all the build.xml files. [19:01:50] hashar: startup wasn't really fast when I restarted it earlier but ok [19:01:59] ottomata: sorry, didn't see that. but looks like it worked! [19:02:05] ja think so! [19:02:05] cool [19:02:11] notpeter, i'm about to try to deploy changes to squids [19:02:11] hashar: I'm aware of it being improved in this version [19:02:12] Krinkle: and the Jenkins API does kick the build.xml parsing. [19:02:14] heheheh [19:02:19] think I can do it!? [19:02:20] eh!? [19:02:24] hashar: and it doesn't cache it like it used to [19:02:33] ottomata: definitely! [19:02:34] (that's your theory, right?) [19:02:38] ehy! [19:02:44] there are uncommitted frontend.conf.php changes in here [19:02:44] Krinkle: in theory (and hopefully) once the history of a job has been loaded up, it will be kept in a cache.
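For reference, the `-w 3:3 -c 3:6` and `2:2` values being tweaked in the check_procs discussion above follow the Nagios plugin threshold grammar linked there. A small illustrative sketch of how such a range is evaluated (this is not the actual check_procs source, just the range semantics from the developer guidelines):

```python
def violates(value, spec):
    """Return True if `value` breaches a Nagios threshold range, like
    the '3:3' / '3:6' arguments to check_procs discussed above.
    Grammar (per the nagios-plugins developer guidelines):
      '10'     alert if value < 0 or value > 10
      '10:'    alert if value < 10
      '~:10'   alert if value > 10
      '10:20'  alert if value < 10 or value > 20
      '@10:20' alert if 10 <= value <= 20 (inverted range)
    """
    inverted = spec.startswith("@")
    if inverted:
        spec = spec[1:]
    if ":" in spec:
        lo_s, hi_s = spec.split(":", 1)
    else:
        lo_s, hi_s = "", spec  # a bare '10' means the range 0..10
    lo = float("-inf") if lo_s == "~" else (float(lo_s) if lo_s else 0.0)
    hi = float("inf") if hi_s == "" else float(hi_s)
    out_of_range = value < lo or value > hi
    return (not out_of_range) if inverted else out_of_range
```

Under this grammar the fix in the chat makes sense: with exactly 2 varnishncsa processes expected, `-w 3:3` flags 2 as a warning, while `2:2` accepts it.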
[19:02:45] hmmmmm [19:02:57] +acl badbadip src 54.244.96.173 [19:03:09] hashar: so what's your plan? [19:03:16] Krinkle: I did kick that constantly during the afternoon. [19:03:16] RECOVERY - Varnish traffic logger on cp1022 is OK: PROCS OK: 2 processes with command name varnishncsa [19:03:19] Krinkle: my plan is to get rid of jenkins :-] [19:03:31] hashar: short term [19:03:47] In the long run we'll be in heaven [19:04:02] laughing at humanity [19:04:04] Krinkle: right now confirm that the build history is kept in cache and that a second query will not kick a second parse of the build history [19:04:07] but for the short term :) [19:04:23] hashar: okay, so we're going to let it run for now? [19:04:26] Krinkle: if that does not hold, I will get Jenkins downgraded, restart it (bam 1 hour outage) and complain at upstream [19:04:29] Krinkle: yup [19:04:35] queue is exponentially rising [19:04:40] up from 50 to 109 [19:05:00] 104 events in queue and 0 jobs in jenkins [19:05:00] because Zuul is waiting for information from Jenkins [19:05:03] what is it doing? [19:05:10] and Jenkins is busy parsing the thousands of build.xml files I guess [19:05:12] I know (I read the logs too) [19:05:16] it seems to be losing jobs [19:05:21] notpeter are you around atm? you available for me to come running if I break things? [19:05:28] right now it is busy parsing mwext-VisualEditor-merge history [19:05:30] oh metrics meeting just ended [19:05:45] hashar: I see various entries where it is blocked for a minute on "Launching jobs" [19:05:58] and then to see that the job was lost [19:06:26] ottomata: sure [19:06:35] hashar: https://gist.github.com/Krinkle/5504150#file-zuul-log-L99-L104 [19:07:01] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [19:07:13] Krinkle: can you add the monitoring link https://integration.wikimedia.org/ci/monitoring on the portal ? :D [19:07:33] hashar: Doesn't jenkins link to it from the tools menu?
[19:07:34] k danke [19:07:40] There's more plugins with a sub page [19:07:51] PROBLEM - Host db1025 is DOWN: PING CRITICAL - Packet loss = 100% [19:08:30] !log kaldari synchronized php-1.22wmf3/extensions/Echo 'sync Echo ext' [19:08:37] notpeter: running ./deploy frontend [19:08:38] Logged the message, Master [19:08:38] ... [19:08:51] ottomata: kk [19:09:10] site's not down, so you're probably fine :) [19:09:17] looking good so far! [19:09:24] oh, it's really clear when it's not ok [19:09:57] ssh: connect to host sq33.wikimedia.org port 22: Connection timed out [19:10:48] I think it's dead/down/decom [19:11:03] !log kaldari synchronized php-1.22wmf3/extensions/Echo 'sync Echo ext' [19:11:10] Logged the message, Master [19:11:21] Krinkle: and about the queue, that is merely because there are too many patch sets sent on mediawiki/core [19:11:32] Krinkle: they all are locked by the parser tests that take a looong time to run. [19:11:38] hashar: No [19:11:47] hashar: When they are locked they are queued in Jenkins normally [19:11:50] this is not happening [19:11:58] they aren't even processed into Jenkins. [19:12:06] mk [19:12:10] yeah that's the only weirdness [19:12:11] looking good! [19:12:17] It can't be blocked because Zuul hasn't even figured out what jobs will be spawned for those events [19:12:53] this is because Jenkins can do concurrency (configurable), Zuul has no reason to wait [19:12:54] New patchset: Reedy; "Initial config for login.wikimedia.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62020 [19:13:34] !log deployed squid frontend.conf.php changes to remove locke and send logs directly to gadolinium for multicast relay [19:13:43] Logged the message, Master [19:13:51] !log kaldari synchronized php-1.22wmf2/extensions/Echo 'sync Echo ext for en.wiki' [19:13:59] Logged the message, Master [19:14:37] New review: Krinkle; "This looks like an outdated template. 404.html seems rather.. old. Is that still in use? https://en...."
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62020 [19:15:22] New patchset: Reedy; "Add initial apache config for login.wikimedia.org" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/62021 [19:16:10] !log moving db1025 into frack-fundraising1-c-eqiad [19:16:18] Logged the message, Master [19:16:23] New review: Krinkle; "I don't see a symlinked "robots.txt" in other docroots." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62020 [19:16:33] New review: Reedy; "I guess that means skel-1.5 is also out of date and should be fixed first" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62020 [19:17:30] New review: Reedy; "Look harder?" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62020 [19:18:53] New patchset: Reedy; "Add initial apache config for login.wikimedia.org" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/62021 [19:18:56] New review: Krinkle; "I see robots.txt in transitionteam, but not in wikivoyage.org or wikipedia.org. Those use a rewrite ..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62020 [19:19:38] New review: Krinkle; "(1 comment)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/62021 [19:20:03] New patchset: Reedy; "Initial config for login.wikimedia.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62020 [19:20:05] RobH: ping [19:20:14] binasher: sup? [19:20:50] RobH: can you attempt to get professor back up? [19:21:03] sure, lemme take a gander at it now [19:21:27] it died last night and tim did something, but it's down again. 
my sun ilom foo has been forgotten [19:22:12] hrmm, serial console is unresponsive (ilom is working os just seems completely crashed) [19:22:18] !log rebooting professor [19:22:26] Logged the message, RobH [19:22:34] which makes sense as ssh and ping arent working [19:22:48] !log all webrequest udp2log loggers (squid and varnish) now send to gadolinium for socat unicast -> multicast relay [19:22:56] Logged the message, Master [19:22:59] binasher: uh oh. [19:23:14] reset sys failed... hrmm [19:23:25] ...due to power state being off [19:23:30] someone shut this down before i got to it? [19:23:31] PROBLEM - Varnish traffic logger on dysprosium is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:23:46] RobH: tim attempted to do something with it last night but not sure what [19:23:50] !log professor was already powered down (why?) starting it back up now [19:24:00] Logged the message, RobH [19:24:16] im babysitting its boot process now [19:25:45] New patchset: Ottomata; "Removing now unused manual socat relay confs." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/62022 [19:26:12] New patchset: Reedy; "Initial config for login.wikimedia.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62020 [19:26:14] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62022 [19:27:31] RECOVERY - Host professor is UP: PING OK - Packet loss = 0%, RTA = 27.05 ms [19:27:41] RECOVERY - RAID on professor is OK: OK: 1 logical device(s) checked [19:28:19] New patchset: Reedy; "Initial config for login.wikimedia.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62020 [19:28:20] New review: Anomie; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62020 [19:29:52] jeff_green: db1025 has been moved [19:30:11] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 2 processes with command name varnishncsa [19:30:17] cmjohnson1: cool. what port did it end up on? [19:30:31] RECOVERY - Varnish traffic logger on dysprosium is OK: PROCS OK: 2 processes with command name varnishncsa [19:31:26] jeff_green: 11/0/6 [19:31:27] pfw2 [19:31:36] RobH: thanks! any sign of what happened to it before it was powered off? [19:31:52] cmjohnson1: cool, thank you [19:32:04] yw [19:32:41] binasher: it looks like it went offline at 05:35 GMT [19:32:43] !log re-enabling puppet on cp1031, it was administratively disabled. running puppet there. [19:32:51] Logged the message, Master [19:32:56] May 2 05:35:47 professor kernel: Kernel logging (proc) stopped. [19:33:14] it ran a puppet run ten minutes before, and then that and nothing [19:34:28] New patchset: Reedy; "Initial config for login.wikimedia.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62020 [19:34:36] it looks like it was rebooted at 518 [19:35:06] binasher: no clue really, still glancing around but im not sure. 
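The gadolinium work being !logged above is a unicast-to-multicast UDP relay: squid and varnish udp2log senders emit unicast datagrams to one host, which re-emits them onto a multicast group (done in production with socat). A rough Python sketch of the same datagram-forwarding idea; the addresses, ports, and group here are made up, since the real ones are not in the log:

```python
import socket

def open_relay(listen_addr, ttl=8):
    """Sockets for one hop of the unicast -> multicast relay pattern
    from the !log entries above (udp2log senders -> gadolinium ->
    multicast group). rx receives unicast datagrams; tx re-emits them.
    The TTL option only matters when the destination is a multicast
    group (e.g. a 239.x.x.x address, chosen here as an assumption)."""
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(listen_addr)
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    tx.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return rx, tx

def relay_once(rx, tx, dest_addr, bufsize=65535):
    """Forward a single datagram from rx to dest_addr unchanged;
    returns the number of bytes forwarded."""
    data, _ = rx.recvfrom(bufsize)
    return tx.sendto(data, dest_addr)
```

A production loop would simply call `relay_once(rx, tx, ("239.128.0.112", 8420))` forever (group and port hypothetical); the payload is passed through byte-for-byte, which is why the downstream udp2log consumers need no changes.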
[19:35:22] i can see it booted ok when it was restarted then [19:35:26] but then nothing shortly after [19:37:38] Thehelpfulone: Someone should just fix this global userpage thing already [19:37:45] heh [19:38:44] 29/39 edits are userspace.. [19:39:37] that lot are usually the stewards/SWMTers [19:39:51] !log rebooting oxygen [19:39:55] that handle cross-wiki vandalism etc [19:39:58] Logged the message, Master [19:40:14] Reedy, do I need to create an RT ticket to get the interwiki cache rebuilt? [19:40:39] No [19:40:41] I did it already [19:40:45] hmm, [[wm2014:]] doesn't work yet [19:40:54] [19:46:43] !log reedy synchronized wmf-config/interwiki.cdb 'Updating interwiki cache' [19:41:03] That probably needs adding to the IW map on meta first [19:41:12] PROBLEM - Host oxygen is DOWN: CRITICAL - Host Unreachable (208.80.154.15) [19:41:17] oh sorry, yeah I thought it was one of the automatic ones [19:41:18] https://meta.wikimedia.org/wiki/Interwiki_map [19:41:25] Wm2012 //wikimania2012.wikimedia.org/wiki/$1 [19:41:26] Wm2013 //wikimania2013.wikimedia.org/wiki/$1 [19:41:29] I suspect not ;) [19:41:38] RECOVERY - Host oxygen is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [19:41:54] yep, added, can you rebuild it again please? [19:46:36] !log reedy synchronized wmf-config/interwiki.cdb 'Updating interwiki cache' [19:46:44] Logged the message, Master [19:47:09] New review: Andrew Bogott; "E222: multiple space after operator doesn't need to be disabled in order align args; one or the othe..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/61999 [19:50:56] !log authdns-update to move db1025 to frack.eqiad.wmnet [19:51:04] Logged the message, Master [19:52:03] Parsoid update ahead, please ignore related alerts in the next minutes [19:57:46] Parsoid update is done [20:07:39] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [20:08:19] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 20:08:09 UTC 2013 [20:08:39] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [20:10:19] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 20:10:10 UTC 2013 [20:10:31] New patchset: Brion VIBBER; "Update FirefoxOS Wikipedia app to current master" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62030 [20:10:39] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [20:10:57] New patchset: Reedy; "multiversion: hostname to dbname basic tests" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61426 [20:11:11] New patchset: Jeremyb; "Adding Ram on analytics nodes." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62001 [20:11:30] New review: Jeremyb; "carry forward -1" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/62001 [20:12:09] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 20:12:07 UTC 2013 [20:12:10] Anybody mind deploying https://gerrit.wikimedia.org/r/#/c/62030/ ? 
Updates for FirefoxOS Wikipedia app, won't affect anything else [20:12:39] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [20:13:59] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62030 [20:13:59] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 20:13:56 UTC 2013 [20:14:05] Scary ;) [20:14:20] \o/ [20:14:35] do those get copied out automatically or does it need a push? [20:14:39] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [20:15:21] !log reedy synchronized docroot/bits/WikipediaMobileFirefoxOS/ [20:15:28] Logged the message, Master [20:15:39] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 20:15:37 UTC 2013 [20:15:40] brion: brion ^ [20:16:10] New review: preilly; ":-)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62030 [20:16:39] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [20:17:04] whee thanks Reedy :D [20:17:19] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 20:17:16 UTC 2013 [20:17:39] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [20:18:07] * brion digs out actual phone to confirm update works [20:18:49] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 20:18:45 UTC 2013 [20:19:39] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [20:20:19] RECOVERY - Puppet freshness on db10 is OK: puppet ran at Thu May 2 20:20:09 UTC 2013 [20:20:31] oh, why did i say bri on twice? 
whoops [20:20:48] icinga-wm: quiet [20:20:50] !log reedy synchronized php-1.22wmf3/includes/ [20:20:59] Logged the message, Master [20:22:27] New patchset: Dzahn; "(re?)-add misc::deployment::common_scripts to fenari" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62032 [20:23:39] !change 62032 | Reedy [20:23:39] Reedy: https://gerrit.wikimedia.org/r/#q,62032,n,z [20:26:49] RECOVERY - Puppet freshness on cp1031 is OK: puppet ran at Thu May 2 20:26:43 UTC 2013 
[20:37:11] New patchset: Krinkle; "contint: Split up apache logs by vhost" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61195 [20:37:56] hashar: opinion on https://gerrit.wikimedia.org/r/#/c/61720/ ? [20:38:12] New review: Hashar; "Ah nice alignments... Sorry i was not paying attention :-] So just add the error description as co..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61999 [20:38:45] Krinkle: still somewhere in my review queue :-] [20:39:06] Krinkle: sorry :( [20:39:51] New review: Dzahn; "could you do something similar for the puppet lint check and make it ignore just the "there are tabs..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61999 [20:40:28] Krinkle: will look at it next week I guess [20:42:27] New patchset: Ottomata; "Setting up stat1002 for hosting private webrequest access logs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62035 [20:43:20] paravoid, are you at all convinced by this? https://www.mediawiki.org/wiki/Manual:Coding_conventions#Vertical_alignment [20:43:33] hm, oops, sleeping [20:43:52] New patchset: Ottomata; "Setting up stat1002 for hosting private webrequest access logs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62035 [20:44:02] andrewbogott: i am not convinced by that at all [20:44:13] ottomata, so you're pro-alignment? [20:44:17] yup [20:44:18] by space [20:44:19] not tab [20:44:32] puppet-lint is pro alignment by space too :) [20:44:45] (if you are talking about puppet) [20:44:53] (but i'm in general pro-alignment too) [20:44:53] python... [20:44:56] same [20:45:00] New review: Hashar; "For puppet-lint, you can have a look at /rakefile it has an array of disabled_checks. Note that the..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/61999 [20:46:02] New review: Hashar; "Go ahead :-]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61964 [20:46:09] ottomata: you can get that udp2log merged [20:46:43] k [20:46:48] ottomata: maybe want someone to check that it ran properly whenever it is rotating :-]  Maybe Ariel [20:46:55] New patchset: Ottomata; "Setting up stat1002 for hosting private webrequest access logs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62035 [20:47:03] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61964 [20:47:07] but yeah that would work since udp2log creates the files when they do not exist [20:47:12] thx for the double check [20:47:33] merged :) [20:48:06] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62035 [20:48:54] New patchset: Andrew Bogott; "Turn off pep8 rules about line width and operator spacing." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61999 [20:49:33] New review: Hashar; "awesome :-)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/61999 [20:49:43] andrewbogott: good to me :D [20:49:44] New patchset: Ottomata; "Missing comma fix" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62068 [20:50:03] and I am no off [20:50:09] now [20:50:21] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62068 [20:51:52] New patchset: Ottomata; "admins::globaldev, not accounts::globaldev :p" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62080 [20:52:50] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62080 [20:53:26] so, LeslieCarr is at class, who else should I ping re a possible network/dns issue right now? [20:53:33] (and faidon isn't online) [20:54:04] what's the dns issue? 
[20:54:11] I can't help with network, but can with dns [20:54:30] Ryan_Lane: over in -tech, LeslieCarr's there now [20:54:36] all network [20:54:49] it's all people going via geant to cogent to tele2 [20:54:55] i really want to just peer with geant and fix the problem [20:55:00] since this is like 3 people today [20:55:10] heh [20:55:12] hashar: 220+ events. Still rising.... [20:55:21] Krinkle: they are raw events [20:55:29] hashar: I know [20:55:34] Krinkle: and l10n bot is active at that time of the day [20:55:40] Hm.. [20:55:46] ok [20:56:11] I am still not sure why Zuul does not trigger more tests [20:56:40] If there are 226 events, surely the Zuul queue screen should be full of all sorts of stuff [20:56:45] I would expect it to fill the Jenkins execution slots as fast as possible [20:56:54] RoanKattouw: No, quite the opposite, but I know what you mean [20:57:02] RoanKattouw: If the screen is full, there are 0 events pending. [20:57:16] Krinkle: Surely not [20:57:24] right now there are 229 events pending and no job running ;-D [20:57:25] If the screen is full, it's working hard [20:57:34] RoanKattouw: the numbers up there are raw events not yet processed into the jenkins queue [20:57:42] There might be even more jobs queued, but at least it should be working as hard as it can [20:57:44] Oh I see [20:57:49] I see what you're saying [20:58:00] they could be events not needing jenkins jobs (e.g. regular comments), they could be duplicates etc. It's like a job queue [20:58:32] but yes, it isn't working as hard as it should [20:58:39] Jenkins is idling most executor slots [20:58:59] this is a new phenomenon as of today. Possibly related to the Jenkins upgrade breaking something [20:59:31] New phenomenon as of today?! I thought we'd been seeing this behavior for more than a week now [20:59:33] it usually overloads Jenkins 10/10 slots with buffer. 
Now it is barely using more than 1/10 executors [21:12:16] New patchset: Ottomata; "More stuff to set up stat1002 as private data host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62084 [21:13:33] !log adding GEANT via fiberring to avoid-paths [21:13:42] Logged the message, Mistress of the network gear. [21:14:22] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [21:15:01] !log spage synchronized php-1.22wmf2/extensions/ConfirmEdit 'update 1.22wmf2 to wmf3 version of ConfirmEdit' [21:15:07] Logged the message, Master [21:17:07] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62084 [21:23:01] New patchset: Ottomata; "Adding ro NFS on dataset to from stat1002" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62086 [21:23:09] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62086 [21:24:18] hashar, Krinkle, is "raw events not yet processed into the jenkins queue" related to jenkins-bot taking 12 minutes or more to notice a +2 and start a gate-and-submit? e.g. https://gerrit.wikimedia.org/r/#/c/62033/ [21:25:05] spagewmf: yup :/ [21:25:37] spagewmf: The entire ci process seems hit by a slug plague. Everything is slow today. [21:27:45] !log mflaschen synchronized php-1.22wmf3/skins/common/shared.css 'Sync font-size change for edit section links' [21:27:52] Logged the message, Master [21:28:50] New patchset: Ottomata; "Fixing exports" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62087 [21:29:05] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62087 [21:29:50] hashar, Krinkle: can you paste the zuul debug log between 21:25 and present? [21:30:08] sue [21:30:09] sure [21:30:35] like the last 5 minutes? [21:30:42] hashar: yes [21:31:11] just saw your message about waiting for a response... where do you see that? 
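[Editor's note: the exchange above distinguishes Zuul's "raw events" counter from jobs actually queued in Jenkins: raw Gerrit events must first be triaged, and many of them (plain comments, duplicates) never become jobs. The sketch below is an illustrative model of that two-stage pipeline only; the event fields and filter rules are assumptions, not Zuul's actual code.]

```python
from collections import deque

# Raw Gerrit events arrive faster than they are triaged; only some of
# them should become Jenkins jobs (comment events and duplicates do not).
def triage(events):
    """Filter raw events down to the (change, patchset) jobs worth enqueueing."""
    seen = set()
    jobs = []
    for ev in events:
        if ev["type"] != "patchset-created":   # e.g. plain comments need no job
            continue
        key = (ev["change"], ev["patchset"])
        if key in seen:                        # duplicate event for same patchset
            continue
        seen.add(key)
        jobs.append(key)
    return jobs

raw = deque([
    {"type": "patchset-created", "change": 62030, "patchset": 1},
    {"type": "comment-added",    "change": 62030, "patchset": 1},
    {"type": "patchset-created", "change": 62030, "patchset": 1},
    {"type": "patchset-created", "change": 61195, "patchset": 7},
])

jobs = triage(raw)
print(len(raw), "raw events ->", len(jobs), "jobs")   # 4 raw events -> 2 jobs
```

This is why a high raw-event count does not by itself mean the executors should be full: the pending counter is upstream of the filter.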
[21:33:16] jeblair: http://noc.wikimedia.org/~hashar/zuul_20130502.log [21:33:33] jeblair: it has the Gerrit received data though. not very helpful [21:34:38] !log restarting search indexers on searchidx2, searchidx1001 to make sure the indexer knows about new wiki [21:34:45] Logged the message, Master [21:34:47] the scheduler is waiting for a build to complete before processing some more [21:35:32] hashar: ah yeah, i see in your layout.yaml it's a couple layers deep [21:35:42] jeblair: and the waiting time comes from jenkins thread dump https://integration.wikimedia.org/ci/threadDump . Some thread ' GET /ci/job/jobname' will show a stack trace that reads some files on disk, that is most of the time the build.xml [21:36:35] jeblair: ah the debug log is not going to be very helpful. Zuul was waiting to reload. [21:38:03] !log importing wikimania2014wiki into search indexers [21:38:10] Logged the message, Master [21:41:28] !log mflaschen synchronized php-1.22wmf2/extensions/GuidedTour/ 'Sync GuidedTour to 1.22wmf2 for E3 deployment' [21:41:35] Logged the message, Master [21:43:03] !log mflaschen synchronized php-1.22wmf3/extensions/GuidedTour/ 'Sync GuidedTour to 1.22wmf3 for E3 deployment' [21:43:10] Logged the message, Master [21:47:58] New patchset: Hashar; "contint: Split up apache logs by vhost" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61195 [21:48:44] E3 is done deploying [21:52:11] varnish question - "opera_mini" acl is declared in wikimedia.vcl.erb, but there is a second check in mobile-frontend.inc.vcl.erb [21:52:36] would it be safe to remove the second check (against ACL), and just see if XFF header is set? [21:58:06] LeslieCarr, do you know this by any chance? Not sure whom to bug [21:58:54] LeslieCarr, https://wikitech.wikimedia.org/wiki/How_to_deploy_code says `ssh -A fenari`, but AIUI that will get us killed. 
[21:59:51] New patchset: Asher; "building db1059 - to be the new s4 master which switching to mariadb, upgrading db1020" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62093 [22:04:06] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [22:04:16] !log depooling search1015 in pybal [22:04:23] Logged the message, Master [22:08:41] yurik: what do you mean? [22:09:48] binasher, i am trying to figure out if mobile-frontend file's logic always follows the wikimedia.vcl.erb [22:10:04] yes, it does [22:10:50] binasher, thx, wasn't sure about what file includes what [22:11:17] see the last line in wikimedia.vcl.erb [22:12:39] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62093 [22:12:56] !log Deploying a workaround on Zuul to make it stop querying the Jenkins API when it just want to check whether a job exist. {{gerrit|62095}} [22:13:04] Logged the message, Master [22:14:29] binasher, but where does that "vcl" is getting set? [22:14:46] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours [22:14:53] by the caller [22:15:10] !log restarting Zuul [22:15:18] Logged the message, Master [22:22:08] New patchset: Diederik; "Add s6 and s7 to user-metrics api." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62099 [22:22:24] !log Zuul restarted. The bug about slowness is {{bug|48025}} [22:22:32] Logged the message, Master [22:23:23] binasher, another quick question - xff_sources -- do i need to test (in ruby) for that var inside mobile-frontend, or can i assume that "allow_xff" is always set? 
[22:23:51] wikimedia.vcl.erb has inclusion check: <% if has_variable?("xff_sources") and xff_sources.length > 0 -%> [22:24:18] before doing any XFF manipulations [22:27:14] i'm not sure what you mean [22:28:27] the allow_xff acl will always be populated in production [22:29:10] !log repooling search1015 [22:29:17] Logged the message, Master [22:29:30] binasher, it just that in wikimedia.vcl.erb, every use of the "allow_xff" acl is wrapped in a template check [22:29:39] New patchset: Asher; "pulling db1020 for upgrade" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62102 [22:29:42] Reedy> Can someone set up the search indexes for wikimania2014wiki please? https://wikitech.wikimedia.org/wiki/Lucene#Adding_new_wikis [22:29:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:29:50] ah, it wasn't lucene's fault for once [22:30:17] binasher, causing me to assume that allow_xff might not exist in some cases [22:30:21] yurik: indeed. all i can tell you is that it's always set for mobile varnishes in production. [22:30:32] Nemo_bis: that's what i'm doing [22:30:34] binasher, cool, thx [22:30:36] it's such a relief having meaningful error messages, I thank Ram and Chad every day [22:30:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [22:30:37] it may not be applicable / probably isn't in other envs [22:30:40] mutante: wonderful :) [22:31:24] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62102 [22:31:26] binasher, so i can safely assume that mobile-frontend vcl file is only used for production env, which is good enough for me :) [22:31:47] !log depooling search1016, restarting lucene, etc.. (Search#Adding_new_wikis) [22:31:55] Logged the message, Master [22:31:57] yurik: it will probably be used in beta too [22:32:17] binasher, does beta define "allow_xff" acl? 
[22:32:23] no idea [22:32:43] * binasher ignores beta as much as possible ;) [22:32:46] sigh... i guess if it breaks, ppl will complain :) [22:34:17] !log asher synchronized wmf-config/db-eqiad.php 'pulling db1020 for upgrade' [22:34:24] Logged the message, Master [22:34:56] PROBLEM - search indices - check lucene status page on search1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:36:30] New review: Dzahn; "manual verify while zuul is being worked on" [operations/puppet] (production); V: 2 - https://gerrit.wikimedia.org/r/61195 [22:36:31] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61195 [22:38:44] PROBLEM - mysqld processes on db1020 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [22:39:16] New patchset: Yurik; "Allow XFF spoofing from the trusted IPs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62103 [22:39:37] mutante: thanks :-) [22:39:44] PROBLEM - MySQL disk space on db1059 is CRITICAL: NRPE: Command check_disk_6_3 not defined [22:39:44] PROBLEM - Full LVS Snapshot on db1059 is CRITICAL: NRPE: Command check_lvs not defined [22:39:54] PROBLEM - MySQL Idle Transactions on db1059 is CRITICAL: NRPE: Command check_mysql_idle_transactions not defined [22:39:54] PROBLEM - mysqld processes on db1059 is CRITICAL: NRPE: Command check_mysqld not defined [22:40:04] PROBLEM - MySQL Recent Restart on db1059 is CRITICAL: NRPE: Command check_mysql_recent_restart not defined [22:40:10] !log gallium, run puppet, graceful Apache to deploy split log files [22:40:14] PROBLEM - MySQL Replication Heartbeat on db1059 is CRITICAL: NRPE: Command check_mysql_slave_heartbeat not defined [22:40:18] Logged the message, Master [22:40:21] New patchset: Pyoungmeister; "adding db74 to pmtpa s5 until" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62104 [22:40:24] PROBLEM - MySQL Slave Delay on db1059 is CRITICAL: NRPE: Command check_mysql_slave_delay not defined [22:40:34] 
PROBLEM - MySQL Slave Running on db1059 is CRITICAL: NRPE: Command check_mysql_slave_running not defined [22:40:41] yurik: is that needed to allow zero access via ssl? [22:41:05] binasher, no, this is to allow us to automate zero testing [22:41:38] we have had tons of zero issues because there are not tests [22:41:52] with this, we can spoof ips, pretending to be carriers [22:41:54] PROBLEM - DPKG on db1020 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:42:20] debug yourself is always the answer ^^ [22:43:13] Nemo_bis, this is what TimStarling suggested [22:43:23] besides, we always test everything in production ;) [22:43:33] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62099 [22:44:23] yurik asked me how we could set up a labs instance with thousands of IP addresses for testing Zero [22:44:26] New patchset: Pyoungmeister; "adding db74 to pmtpa s5 until" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62104 [22:44:40] would have been fun too :) [22:44:41] I said don't do that, just fake the IP [22:45:17] i was wondering about that, but if you fake it you dont see the reply [22:45:18] i really don't think we want to modify client.ip every time a request comes in via ssl [22:45:47] mutante: fake it using a special HTTP header [22:45:52] not by spoofing it or something [22:46:00] ah [22:46:35] that lets you test all of the WMF-specific code [22:46:50] it just doesn't cover the linux kernel and networking infrastructure and what not [22:47:54] RECOVERY - DPKG on db1020 is OK: All packages OK [22:48:34] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [22:48:34] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [22:48:44] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62104 [22:49:54] PROBLEM - Host db1020 is DOWN: PING CRITICAL - Packet loss = 100% 
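[Editor's note: TimStarling's suggestion above is to test source-IP-dependent code by faking the client IP with an HTTP header rather than spoofing at the network layer. The sketch below illustrates that idea with a throwaway local echo server; the server, the trusted-header assumption, and the carrier IP 203.0.113.42 are all hypothetical stand-ins for the real varnish/Zero setup.]

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Tiny stand-in for the frontend: it reports which client IP it would
# act on, trusting X-Forwarded-For the way a trusted-proxy check would.
class EchoXFF(BaseHTTPRequestHandler):
    def do_GET(self):
        ip = self.headers.get("X-Forwarded-For", self.client_address[0])
        body = ip.encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep test output quiet
        pass

server = HTTPServer(("127.0.0.1", 0), EchoXFF)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Pretend to be a request from a (hypothetical) carrier IP.
req = urllib.request.Request(
    "http://127.0.0.1:%d/" % server.server_port,
    headers={"X-Forwarded-For": "203.0.113.42"},
)
seen_ip = urllib.request.urlopen(req).read().decode()
server.shutdown()
print(seen_ip)   # 203.0.113.42
```

This exercises all of the application-level IP handling without touching the kernel or network infrastructure, which matches the trade-off Tim describes.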
[22:50:44] RECOVERY - search indices - check lucene status page on search1016 is OK: HTTP OK: HTTP/1.1 200 OK - 53055 bytes in 0.063 second response time [22:51:37] ori-l: so yeah hmm thanks for the offer to investigate the jenkins issue :d [22:51:52] the patch didn't work? [22:51:54] ori-l: I got a good workaround, seems some thread is stalled so I will just restart it again :-D [22:52:04] the patch is a workaround to avoid hitting a very slow query in jenkins [22:52:12] that triggers a reparse of all the build history [22:52:24] now it still takes it 8 seconds to update a job description [22:52:44] I hate java :-D [22:52:46] still meaning it took 8 seconds before, or it's much faster but still not fast enough? [22:52:57] oh sorry [22:52:58] hmm [22:53:14] so the previous URL took 2 to 5 minutes depending on the job history length [22:53:34] now it takes 8 seconds to update the build description, something which is done several times per build [22:53:43] I think it is a thread which is wild [22:55:19] at least I learned a few commands today: jstack to dump a stack trace of each thread run by a java process [22:55:23] and H in top to show threads :-D [22:58:26] !log repooling search1016 [22:58:34] Logged the message, Master [22:58:45] !log restarting jenkins… it got a few threads blocked and the main process is at 100% usage for no reason [23:05:22] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [23:07:32] PROBLEM - mysqld processes on db74 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [23:07:54] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62000 [23:10:42] !log aaron synchronized wmf-config/jobqueue-eqiad.php [23:10:50] Logged the message, Master [23:13:28] New patchset: Dzahn; "add favicons for doc.wm and integration.wm" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62108 [23:17:16] New review: Hashar; "the integration 
websites are maintained outside of puppet in integration/docroot.git :-)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/62108 [23:17:44] New patchset: Ryan Lane; "Remove duplicate definition issue with labs ganglia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62109 [23:17:51] !log yeah after hours and hours of fighting, Jenkins is finally working again. [23:17:59] Logged the message, Master [23:18:34] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62109 [23:19:54] !log restarting lucene on search1021, search1022, search1017, search1018 (with some waiting in between) [23:20:02] Logged the message, Master [23:20:24] hashar: thanks!! jenkins fix..wee [23:22:34] New patchset: Ryan Lane; "Ganglia: Only define a 443 vhost if a cert is set" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62111 [23:23:27] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62111 [23:25:19] mutante: jenkins still has to flush its queue :( [23:32:33] New patchset: Ryan Lane; "Add custom init script for multiple aggregators for labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62113 [23:33:33] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62010 [23:33:33] New review: Faidon; "Do we have to have a single .pep8 for the whole repository? 
" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61999 [23:34:08] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62113 [23:34:12] PROBLEM - search indices - check lucene status page on search1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 60120 bytes in 0.024 second response time [23:34:25] back [23:34:36] welcome back [23:34:37] LeslieCarr: I chatted with grnet network folks, they've notified GEANT already [23:39:02] New review: Dzahn; "i like it, but also see bug 48020" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/61244 [23:39:02] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61244 [23:41:08] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [23:41:30] Thehelpfulone: wanna try search on wikimania2014? [23:41:52] arg, already got an error [23:46:19] TimStarling, could you +1 https://gerrit.wikimedia.org/r/#/c/62103/ pls so that your comments are not lost in IRC [23:46:38] or anyone could just +2 it :) [23:47:28] PROBLEM - SSH on gadolinium is CRITICAL: Server answer: [23:47:38] PROBLEM - SSH on caesium is CRITICAL: Server answer: [23:48:28] RECOVERY - SSH on gadolinium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:50:28] LeslieCarr: oh it seems to be okay now [23:50:38] RECOVERY - SSH on caesium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:55:27] New review: Dzahn; "please fix path conflict" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61997 [23:56:53] New review: Tim Starling; "I think this is a sensible way to test source IP dependent code, but maybe I'm biased, since I sugge..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/62103
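[Editor's note: earlier in the log hashar debugs the stalled Jenkins with `jstack` (one stack trace per Java thread) and `H` in top (per-thread CPU). For illustration only, Python has a rough analogue of a jstack-style dump via `sys._current_frames()`; the "stalled-worker" thread below is a contrived stand-in for a blocked Jenkins thread.]

```python
import sys
import threading
import time
import traceback

# A worker that blocks forever on an event, like a stalled Jenkins thread.
stop = threading.Event()
worker = threading.Thread(target=stop.wait, name="stalled-worker")
worker.start()
time.sleep(0.1)  # give the worker time to reach its wait()

# Rough Python analogue of `jstack <pid>`: one stack trace per live thread,
# which is how you spot the thread that is wedged.
names = {t.ident: t.name for t in threading.enumerate()}
dump = []
for ident, frame in sys._current_frames().items():
    header = "Thread %s (%s)" % (names.get(ident, "?"), ident)
    dump.append(header + "\n" + "".join(traceback.format_stack(frame)))

stop.set()
worker.join()
print("\n".join(dump))  # the stalled-worker's stack ends inside Event.wait
```

The same diagnosis workflow applies: take the dump, find the thread whose stack never changes between two dumps, and that is your wedged thread.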