[00:07:35] (03CR) 10Chad: [C: 032] Cirrus: Remove commented officewiki, cawiki to primary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87157 (owner: 10Chad)
[00:07:44] (03Merged) 10jenkins-bot: Cirrus: Remove commented officewiki, cawiki to primary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87157 (owner: 10Chad)
[00:08:17] (03PS1) 10Chad: Revert "Cirrus: Remove commented officewiki, cawiki to primary" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87295
[00:08:26] (03CR) 10Chad: [C: 032 V: 032] Revert "Cirrus: Remove commented officewiki, cawiki to primary" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87295 (owner: 10Chad)
[00:11:16] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[00:11:37] !log rebuilding search indices on english wikis after CirrusSearch deploy that updated English configuration
[00:11:52] Logged the message, Master
[00:34:46] !log all english indices rebuilt except enwikisource. that one will take a while.
[00:34:56] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 00:34:53 UTC 2013
[00:35:00] Logged the message, Master
[00:35:16] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[01:03:46] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 01:03:42 UTC 2013
[01:04:16] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[01:33:57] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 01:33:53 UTC 2013
[01:34:07] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[01:39:14] PROBLEM - RAID on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:39:24] PROBLEM - twemproxy process on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:40:04] RECOVERY - RAID on mw1125 is OK: OK: no RAID installed
[01:41:14] RECOVERY - twemproxy process on mw1125 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker
[01:42:34] PROBLEM - Disk space on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:42:44] PROBLEM - DPKG on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:43:34] RECOVERY - DPKG on mw1125 is OK: All packages OK
[01:46:44] PROBLEM - DPKG on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:47:24] PROBLEM - twemproxy process on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:47:25] RECOVERY - Disk space on mw1125 is OK: DISK OK
[01:47:34] RECOVERY - DPKG on mw1125 is OK: All packages OK
[01:48:14] PROBLEM - RAID on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:50:04] RECOVERY - RAID on mw1125 is OK: OK: no RAID installed
[01:50:14] RECOVERY - twemproxy process on mw1125 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker
[01:52:34] PROBLEM - Disk space on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:52:44] PROBLEM - DPKG on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:53:14] PROBLEM - RAID on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:53:25] RECOVERY - Disk space on mw1125 is OK: DISK OK
[01:53:34] RECOVERY - DPKG on mw1125 is OK: All packages OK
[01:54:14] RECOVERY - RAID on mw1125 is OK: OK: no RAID installed
[01:56:44] PROBLEM - DPKG on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:57:34] RECOVERY - DPKG on mw1125 is OK: All packages OK
[01:58:15] PROBLEM - RAID on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:59:04] RECOVERY - RAID on mw1125 is OK: OK: no RAID installed
[01:59:24] PROBLEM - twemproxy process on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:00:34] PROBLEM - Disk space on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:00:44] PROBLEM - DPKG on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:01:14] RECOVERY - twemproxy process on mw1125 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker
[02:01:25] RECOVERY - Disk space on mw1125 is OK: DISK OK
[02:01:34] RECOVERY - DPKG on mw1125 is OK: All packages OK
[02:04:14] PROBLEM - RAID on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:05:04] RECOVERY - RAID on mw1125 is OK: OK: no RAID installed
[02:05:24] PROBLEM - twemproxy process on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:07:34] PROBLEM - Disk space on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:08:44] PROBLEM - DPKG on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:10:14] RECOVERY - twemproxy process on mw1125 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker
[02:10:44] RECOVERY - DPKG on mw1125 is OK: All packages OK
[02:11:06] !log on mw1125: server very slow due to disk issues, stopped apache, will shut down
[02:11:14] PROBLEM - RAID on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:11:25] Logged the message, Master
[02:12:44] PROBLEM - Apache HTTP on mw1125 is CRITICAL: Connection refused
[02:13:24] PROBLEM - twemproxy process on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:14:15] RECOVERY - twemproxy process on mw1125 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker
[02:15:44] PROBLEM - DPKG on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:16:04] RECOVERY - RAID on mw1125 is OK: OK: no RAID installed
[02:16:25] RECOVERY - Disk space on mw1125 is OK: DISK OK
[02:16:34] RECOVERY - DPKG on mw1125 is OK: All packages OK
[02:20:48] !log LocalisationUpdate completed (1.22wmf19) at Thu Oct 3 02:20:48 UTC 2013
[02:21:01] Logged the message, Master
[02:23:44] PROBLEM - DPKG on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:25:34] RECOVERY - DPKG on mw1125 is OK: All packages OK
[02:30:14] PROBLEM - RAID on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:32:24] PROBLEM - twemproxy process on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:33:04] RECOVERY - RAID on mw1125 is OK: OK: no RAID installed
[02:34:14] RECOVERY - twemproxy process on mw1125 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker
[02:35:56] !log LocalisationUpdate completed (1.22wmf18) at Thu Oct 3 02:35:56 UTC 2013
[02:36:08] PROBLEM - Host mw1125 is DOWN: PING CRITICAL - Packet loss = 100%
[02:36:09] Logged the message, Master
[02:59:41] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Oct 3 02:59:41 UTC 2013
[02:59:57] Logged the message, Master
[03:06:22] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[03:20:23] <^d> TimStarling: Have you seen an error like http://paste.tstarling.com/p/dchlft.html before? Google seems to suggest it might happen if session.save_path isn't writeable, but it should most certainly be. It's happening to me at the end of running the full MW phpunit suite, no other times that I see.
[03:20:32] <^d> (This is happening on zend, not hhvm)
[03:24:49] ^d: I haven't seen it before
[03:25:19] you could check the source that generates those error messages
[03:25:58] <^d> I'll do that.
[03:27:32] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000
[03:30:42] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:35:02] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 03:34:54 UTC 2013
[03:35:22] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[04:06:40] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[04:34:30] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 04:34:23 UTC 2013
[04:34:40] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[04:57:14] !log csteipp Started syncing Wikimedia installation... :
[04:57:32] Logged the message, Master
[05:06:19] PROBLEM - search indices - check lucene status page on search20 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 60051 bytes in 0.121 second response time
[05:07:43] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[05:12:13] PROBLEM - search indices - check lucene status page on search19 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 60051 bytes in 0.112 second response time
[05:34:13] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 05:34:03 UTC 2013
[05:34:43] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[05:36:42] !log on searchidx2: switched java procs to "idle" ionice class, to improve scap time
[05:36:59] Logged the message, Master
[05:40:16] !log olivneh synchronized wmf-config/Bug54847.php
[05:40:26] Logged the message, Master
[05:41:48] !log csteipp Finished syncing Wikimedia installation... :
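The searchidx2 !log above mentions switching the Java processes to the "idle" I/O scheduling class so that scap traffic is not starved for disk. A minimal sketch of that kind of change, using the util-linux `ionice` tool (class 3 = idle) from Python; matching processes by the string "java" is an assumption for illustration, not the exact command that was run on searchidx2.

```python
"""Hedged sketch: move matching processes into the 'idle' I/O class
(ionice -c 3), as described in the searchidx2 !log above. Run as root;
the "java" pattern is an assumed example, not the actual invocation."""
import subprocess


def pids_of(pattern):
    # pgrep -f prints one PID per line; exit status 1 / empty output means no match
    out = subprocess.run(["pgrep", "-f", pattern],
                         capture_output=True, text=True).stdout
    return [int(p) for p in out.split()]


def set_idle_ionice(pattern="java"):
    for pid in pids_of(pattern):
        # class 3 (idle): the process only gets disk time when nothing else wants it
        subprocess.run(["ionice", "-c", "3", "-p", str(pid)], check=True)


if __name__ == "__main__":
    set_idle_ionice()
```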
[05:41:57] Logged the message, Master
[05:47:36] !log olivneh synchronized wmf-config/InitialiseSettings.php 'wmgBug54847: default => true, private => false'
[05:47:49] Logged the message, Master
[05:49:07] !log olivneh synchronized wmf-config/CommonSettings.php 'if ( $wmgBug54847 && $wmgUseCentralAuth )'
[05:49:19] Logged the message, Master
[05:51:56] PROBLEM - Puppet freshness on virt1000 is CRITICAL: No successful Puppet run in the last 10 hours
[06:06:33] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[06:14:06] !log olivneh synchronized wmf-config/Bug54847.php
[06:14:17] Logged the message, Master
[06:26:42] springle: is there someone else who should look at https://gerrit.wikimedia.org/r/#/c/87168/2 ?
[06:29:55] Aaron|home: don't know? should I be +2'ing mediawiki core stuff?
[06:30:19] you did with the last one :)
[06:30:36] lol did I
[06:30:39] ok
[06:31:37] springle: did you get a chance to look at that UNION query?
[06:32:28] Aaron|home: not yet. will respond shortly
[06:33:01] no rush
[06:35:43] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 06:35:40 UTC 2013
[06:36:33] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[06:41:06] (03PS10) 10Ori.livneh: Hooks to force password reset for some users [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87285 (owner: 10CSteipp)
[06:41:07] (03PS1) 10Ori.livneh: Enable $wmgBug54847 for default & private, but gate also on $wmgUseCentralAuth [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87318
[06:41:08] (03PS1) 10Ori.livneh: Do not show password reset interface unless password is correct [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87319
[06:41:16] ^ TimStarling
[06:41:20] FYI
[06:41:33] all of these are already on tin & synced, so I just self-merge them, yeah?
[06:42:08] (03CR) 10Tim Starling: [C: 031] Do not show password reset interface unless password is correct [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87319 (owner: 10Ori.livneh)
[06:42:33] i'll do that from home
[06:47:12] ori-l: yes
[06:47:25] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: No successful Puppet run in the last 10 hours
[06:47:25] PROBLEM - Puppet freshness on ms-be1012 is CRITICAL: No successful Puppet run in the last 10 hours
[06:54:25] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: No successful Puppet run in the last 10 hours
[06:54:25] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: No successful Puppet run in the last 10 hours
[06:59:25] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: No successful Puppet run in the last 10 hours
[06:59:25] PROBLEM - Puppet freshness on ms-be1004 is CRITICAL: No successful Puppet run in the last 10 hours
[07:00:25] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: No successful Puppet run in the last 10 hours
[07:03:25] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: No successful Puppet run in the last 10 hours
[07:05:25] PROBLEM - Puppet freshness on ms-be1008 is CRITICAL: No successful Puppet run in the last 10 hours
[07:06:25] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: No successful Puppet run in the last 10 hours
[07:10:13] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[07:10:23] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: No successful Puppet run in the last 10 hours
[07:34:33] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 07:34:30 UTC 2013
[07:35:13] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[08:02:06] !log Bug 54847: e-mail script completed successfully
[08:02:23] Logged the message, Master
[08:09:52] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[08:34:52] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 08:34:50 UTC 2013
[08:35:52] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[10:07:41] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[10:10:31] mark ping
[10:12:56] mark, i didn't have a good connection yesterday until late, let me know if you want to try ESI in the next few hrs
[10:14:08] (03PS1) 10Mark Bergsma: Add ulsfo IPv6 transfer nets [operations/dns] - 10https://gerrit.wikimedia.org/r/87326
[10:14:47] (03CR) 10Mark Bergsma: [C: 032] Add ulsfo IPv6 transfer nets [operations/dns] - 10https://gerrit.wikimedia.org/r/87326 (owner: 10Mark Bergsma)
[10:15:01] yeah we can do that
[10:15:25] in about 30 mins I think
[10:16:31] mark, ok, ping me here
[10:21:41] PROBLEM - MySQL Idle Transactions on db1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:23:31] RECOVERY - MySQL Idle Transactions on db1016 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[10:26:56] (03PS2) 10Mark Bergsma: Enable ESI processing for the Testing carrier range [operations/puppet] - 10https://gerrit.wikimedia.org/r/86258
[10:29:34] (03PS3) 10Mark Bergsma: Enable ESI processing for the Testing carrier range [operations/puppet] - 10https://gerrit.wikimedia.org/r/86258
[10:30:20] yurik_: is PS3 to your liking?
[10:30:35] mark, link?
[10:30:40] we can put the FORCE-ESI header in zero, but there's no if clause for -TEST there now, and I wasn't sure how it would intermingle with your script
[10:30:43] we can also add it later
[10:30:54] 2 lines above :P
[10:30:56] checking...
[10:34:01] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 10:33:56 UTC 2013
[10:34:41] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[10:35:20] mark, yes, looks ok
[10:35:23] go ahead
[10:35:41] (03CR) 10Mark Bergsma: [C: 032] Enable ESI processing for the Testing carrier range [operations/puppet] - 10https://gerrit.wikimedia.org/r/86258 (owner: 10Mark Bergsma)
[10:39:35] mark, any way to force puppet run?
[10:39:41] or should i check in a half an hour?
[10:39:50] i'm waiting on puppet as we speak
[10:40:06] if it works we should change make more carriers work with it
[10:42:34] it ran on cp1046
[10:42:42] of course now we need requests from the office ;)
[10:43:45] mark, sec
[10:47:21] (03PS1) 10Mark Bergsma: Disable X-FORCE-ESI for now [operations/puppet] - 10https://gerrit.wikimedia.org/r/87327
[10:48:13] (03CR) 10Mark Bergsma: [C: 032] Disable X-FORCE-ESI for now [operations/puppet] - 10https://gerrit.wikimedia.org/r/87327 (owner: 10Mark Bergsma)
[10:48:53] i've disabled it again
[10:49:07] I realised that we have a problem if ESI is not enabled on all frontends yet
[10:57:07] mark, not sure why we would have a problem?
[10:57:30] especially if we only do it for TEST
[10:57:31] one frontend sets X-FORCE-ESI, requests through a backend
[10:57:49] another frontend without ESI gets it and passes it on unprocessed
[10:59:28] ok, but that's fine as long as we use it for testing - I will see it as either tag or a proper banner, and in both cases i will at least know that the php gave the right result
[10:59:42] and in the mean time you can get it working on all frontends?
[11:00:03] if something has ESI without a proper Vary on X-CS, that's bad
[11:00:23] why wouldn't it have vary on X-CS?
[11:00:32] i haven't removed that part yet
[11:00:34] perhaps you have a bug somewhere?
[11:00:57] well, its always possible of course, but in the worst case we simply won't show a banner
[11:01:13] has anyone reported seeing anything like that?
[11:01:20] no
[11:01:22] i mean - lack of vary on X-CS?
[11:01:24] but waiting half an hour prevents it
[11:01:34] oh, got it
[11:01:37] i'm in no rush with this
[11:01:45] i thought you wanted to postpone entirely
[11:01:57] i propose we reenable it within an hour
[11:02:02] then we can test it today in the office
[11:02:09] if all is well, we can open it up some tomorrow or so
[11:02:16] btw, if we have no vary on X-CS, we have a much bigger problem - showing banners from one carrier to another carrier's users
[11:02:24] yes
[11:02:48] that's why the problem with the mistagging happened wasn't it
[11:03:02] some stuff got cached missing a vary header
[11:03:12] weird... that really ought to be fixed
[11:03:29] although the best solution would be to get rid of vary on X-CS entirely
[11:03:37] which is what we've been working on
[11:03:52] sure
[11:07:48] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000
[11:10:58] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:16:01] puppet is terribly slow
[11:17:33] isn't it always...
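A small sanity check for the ESI discussion above, sketching the two things being debated: that zero-rated mobile responses keep `Vary: X-CS` (so one carrier's banner can never be served from cache to another carrier's users), and whether the backend emits ESI markup at all when the X-Force-ESI trigger header is set. Only the header names (X-CS, X-Force-ESI, Vary) come from the conversation; the URL and the spoofed values are placeholders, not the real test-carrier range.

```python
"""Hedged sketch of the cache-safety checks discussed above.
The URL and spoofed X-Forwarded-For / X-CS values are placeholders."""
import requests

URL = "https://en.m.wikipedia.org/wiki/Special:Random"  # placeholder target
SPOOFED_XFF = "192.0.2.1"                               # placeholder "test carrier" address (TEST-NET)


def check_vary_and_esi():
    resp = requests.get(URL, headers={
        "X-Forwarded-For": SPOOFED_XFF,   # pretend to come from the test range
        "X-CS": "TEST",                   # placeholder carrier code
        "X-Force-ESI": "1",               # the trigger header named in the discussion
    }, timeout=10)

    vary = [v.strip().lower() for v in resp.headers.get("Vary", "").split(",")]
    if "x-cs" not in vary:
        print("BAD: no Vary: X-CS - the response could be cached across carriers")

    # If ESI markup appears, every frontend in the path must be able to process it,
    # otherwise it gets passed through to clients unprocessed.
    if "<esi:" in resp.text:
        print("response contains ESI markup; all frontends must have ESI enabled")


if __name__ == "__main__":
    check_vary_and_esi()
```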
[11:20:42] (03PS1) 10Mark Bergsma: Reenable ESI now all frontends are prepared for it [operations/puppet] - 10https://gerrit.wikimedia.org/r/87328
[11:21:06] (03CR) 10Mark Bergsma: [C: 032] Reenable ESI now all frontends are prepared for it [operations/puppet] - 10https://gerrit.wikimedia.org/r/87328 (owner: 10Mark Bergsma)
[11:30:04] bblack: around?
[11:41:55] mark or paravoid, I think there is a bug in VCL code in zero.inc file -- in case I am browsing from the office, but don't have spoofed XFF hdr, it does not identify me as a carrier
[11:42:13] do you know if the XFF is always present?
[11:42:18] looking at line 24
[11:45:13] also, mark, I don't have any way right now to see if ESI actually worked or not - the banner is there, but no internal headers were kept
[11:56:11] Request: GET http://en.wikipedia.org/wiki/Special:Random, from 93.187.161.106 via cp3012 frontend ([91.198.174.236]:80), Varnish XID 788340603
[11:56:12] Forwarded for: 67.243.51.60, 93.187.161.106
[11:56:15] Error: 503, Service Unavailable at Thu, 03 Oct 2013 11:53:03 GMT
[11:58:14] mark, something weird is going on
[11:58:42] the moment i spoof XFF
[11:59:44] yep, it seems ESI has been deployed and its a complete failure :(
[12:10:42] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[12:14:07] !log enwikisource's search index finished building late last night. all english indices have been updated to include kstem.
[12:14:17] Logged the message, Master
[12:19:18] yurik_: so only for the test range
[12:19:20] not affecting others, right?
[12:22:31] mark correct
[12:22:42] so no real need to rollback I guess
[12:23:40] mark: current results - total site failure when setting XFF from test range
[12:23:44] right
[12:24:08] i either get 503 message, or "no data received" browser page in chrome
[12:24:35] i need to go, so I will just roll it back anyway
[12:24:46] no no
[12:24:49] well,
[12:25:10] i am hoping someone with root will be able to debug it
[12:25:20] without waiting for an hour for the puppet run
[12:25:31] let it sit there for a bit, maybe its some caching issue
[12:25:38] as its not affecting anyone
[12:25:39] yeah right
[12:26:37] mark, current status: spoofing XFF or X-CS from a test range works fine unless the XFF is set to the IP of the test range
[12:28:02] (03PS1) 10Mark Bergsma: Revert "Reenable ESI now all frontends are prepared for it" [operations/puppet] - 10https://gerrit.wikimedia.org/r/87331
[12:28:30] (03CR) 10Mark Bergsma: [C: 032 V: 032] Revert "Reenable ESI now all frontends are prepared for it" [operations/puppet] - 10https://gerrit.wikimedia.org/r/87331 (owner: 10Mark Bergsma)
[12:30:17] i didn't see an Enable-ESI header in a quick test against mediawiki
[12:30:53] what does mediawiki check to use esi?
[12:30:57] just that header or other things as well?
[12:31:08] x-force-esi
[12:31:18] yeah
[12:31:19] that's the header that triggers it to output tag
[12:31:19] just that?
[12:31:22] yep
[12:31:34] i mean - there is a global setting also, but its false atm
[12:32:08] btw, i'm not sure how you tested, i will send an email describing everything
[12:32:15] ok
[12:32:20] then I'll check later
[12:32:26] i need to go, bbl
[12:34:32] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 12:34:28 UTC 2013
[12:34:42] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[12:37:42] PROBLEM - MySQL Replication Heartbeat on db1002 is CRITICAL: Timeout while attempting connection
[12:38:32] RECOVERY - MySQL Replication Heartbeat on db1002 is OK: OK replication delay 0 seconds
[12:49:53] (03PS1) 10Matanya: Drac: Convert into a module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/87332
[12:50:19] (03CR) 10jenkins-bot: [V: 04-1] Drac: Convert into a module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/87332 (owner: 10Matanya)
[12:52:05] (03PS2) 10Matanya: Drac: Convert into a module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/87332
[12:52:28] (03CR) 10jenkins-bot: [V: 04-1] Drac: Convert into a module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/87332 (owner: 10Matanya)
[12:52:58] moorooronging, mark, you around? if so, would appreciate a look at that puppet error
[12:53:03] i haven't looked at it in a day or two
[12:59:43] (03PS3) 10Matanya: Drac: Convert into a module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/87332
[13:07:34] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[13:10:54] paravoid: mind continuing to review my change? I learned a lot from yesterday, and really appreciate your comments
[13:24:01] (03CR) 10Faidon Liambotis: [C: 04-1] "(7 comments)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/87332 (owner: 10Matanya)
[13:27:16] (03PS4) 10Matanya: Drac: Convert into a module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/87332
[13:29:53] you didn't fix the commit message :)
[13:30:03] also puppet:///modules/drac/files/drac/drac.py'
[13:30:10] there's an extra "drac/" there
[13:30:23] (the file is "modules/drac/files/drac.py")
[13:32:50] paravoid: what is wrong with the commit message?
[13:33:02] I left a review
[13:33:19] thanks for comments :)
[13:33:29] no trailing dot and no initial capitalization; the "convert" capitalization is not wrong per se, but I wouldn't do that either
[13:33:51] miredo was similar but I fixed it myself, it was too trivial to comment about it
[13:33:54] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 13:33:52 UTC 2013
[13:34:13] but since you're proceeding with more changes I thought to tell you this time so that you'll get it right for the next ones too :)
[13:34:31] (03PS1) 10Cmjohnson: removing mw1125 from dsh files [operations/puppet] - 10https://gerrit.wikimedia.org/r/87333
[13:34:34] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[13:35:48] (03PS5) 10Matanya: drac: convert into a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/87332
[13:36:33] paravoid: I just hope my learning curve isn't too sharp :)
[13:40:30] (03CR) 10Cmjohnson: [C: 032] removing mw1125 from dsh files [operations/puppet] - 10https://gerrit.wikimedia.org/r/87333 (owner: 10Cmjohnson)
[13:45:26] (03CR) 10Hashar: [C: 031] "Change is good but I have no idea what are the impacts of changing the timezone :-/" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86379 (owner: 10Wizardist)
[13:45:39] PROBLEM - Host barium is DOWN: PING CRITICAL - Packet loss = 100%
[13:47:21] !log c state disabled on barium, ref. RT 5555
[13:47:36] Logged the message, Master
[13:50:19] RECOVERY - Host barium is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms
[13:57:16] (03CR) 10Faidon Liambotis: [C: 031] "But let's wait for Ryan/RobH." [operations/puppet] - 10https://gerrit.wikimedia.org/r/87332 (owner: 10Matanya)
[14:06:53] manybubbles: oh definitely not because some blog said it
[14:06:58] (03CR) 10Jgreen: "Please just remove all jenkins-related puppet config from fundraising.pp. We're migrating off of aluminium/grosley onto hosts in frack, an" [operations/puppet] - 10https://gerrit.wikimedia.org/r/86818 (owner: 10Matanya)
[14:07:23] paravoid: a lot of blogs say it, but they don't give good reasoning. the ES docs mention it as something you can do.
[14:07:40] but mostly talk about the cpu benefits
[14:07:40] I'm mentioning it so we can evaluate it and understand it, as you pointed out
[14:07:45] yeah
[14:07:57] I really wish I had more information.
[14:08:18] I think I'd prefer to do this in two phases if possible
[14:08:31] so we can do some more juggling as we learn more
[14:08:44] that makes sense to me
[14:09:27] I'm writing another update about disk saturation. I was able to do it last night.
[14:11:05] in a semi-unrelated question, have you seen the health monitoring api at all?
[14:11:21] so, ^d submitted some patches that were subsequently merged
[14:11:25] to put ES behind LVS
[14:11:43] we have a monitor for automatically depooling bad servers in LVS, it's called "pybal"
[14:12:05] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[14:12:11] that has multiple different modules for checking health, the one we currently use is fetch http://localhost:9200/ and expect a 200
[14:12:30] I'm worried there might be situations where a node may be unhealthy but ES would respond fine on /
[14:12:48] and if the node is also a data node, this increases the chances, doesn't it?
[14:14:15] ottomata: puppet error where?
[14:14:32] on lvs4001, its the same one from before, even after we thought we fixed it
[14:14:33] paravoid: I wouldn't be surprised if there were some situations, no. let me add it to my list of things to research
[14:14:38] the pybal.conf :undef one
[14:14:40] ok
[14:14:52] paravoid: the health monitoring that we have is mostly nagios level stuff
[14:15:14] manybubbles: I remember reading about an ES health API, maybe we can use that in pybal
[14:15:17] and we don't want to depool a machine when it reports red health - that'd depool everything at the same time
[14:15:26] paravoid: the health api is a cluster health api, mostly
[14:15:32] I was fearing that...
[14:15:38] paravoid: but there might be some gem in there about node health
[14:18:45] (03PS2) 10Matanya: fundrising: remove jenkins. [operations/puppet] - 10https://gerrit.wikimedia.org/r/86818
[14:20:03] s/rising/raising/
[14:20:50] (03PS1) 10Mark Bergsma: Reverse order of class / sites check [operations/puppet] - 10https://gerrit.wikimedia.org/r/87337
[14:20:54] (03PS3) 10Matanya: fundraising: remove jenkins. [operations/puppet] - 10https://gerrit.wikimedia.org/r/86818
[14:21:02] Jeff_Green: ^
[14:21:06] looking
[14:21:24] (03CR) 10Mark Bergsma: [C: 032] Reverse order of class / sites check [operations/puppet] - 10https://gerrit.wikimedia.org/r/87337 (owner: 10Mark Bergsma)
[14:22:30] (03CR) 10Jgreen: [C: 031] fundraising: remove jenkins. [operations/puppet] - 10https://gerrit.wikimedia.org/r/86818 (owner: 10Matanya)
[14:22:49] (03CR) 10Jgreen: [V: 031] fundraising: remove jenkins. [operations/puppet] - 10https://gerrit.wikimedia.org/r/86818 (owner: 10Matanya)
[14:23:09] matanya: +1's applied
[14:23:41] thanks Jeff_Green. maybe it was better if I set some stuff to absent before, but meh
[14:24:08] no, this is far better
[14:24:27] paravoid: found a webinar I should have watched before I started this: http://www.elasticsearch.org/webinars/elasticsearch-pre-flight-checklist
[14:24:32] I don't want jenkins ripped off of aluminium by any means!
[14:24:51] ok, then
[14:25:04] ottomata: fixed
[14:26:16] heh
[14:28:30] (03PS1) 10Jgreen: adjust SA score for DEAR_SOMETHING test to 1.500 [operations/puppet] - 10https://gerrit.wikimedia.org/r/87339
[14:29:25] (03CR) 10Jgreen: [C: 032 V: 031] adjust SA score for DEAR_SOMETHING test to 1.500 [operations/puppet] - 10https://gerrit.wikimedia.org/r/87339 (owner: 10Jgreen)
[14:33:55] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 14:33:53 UTC 2013
[14:34:05] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[14:48:50] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100%
[14:49:30] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms
[14:53:35] (03CR) 10Akosiaris: [C: 04-1] "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/87332 (owner: 10Matanya)
[14:54:57] akosiaris1: hiiii
[14:55:02] what's up with the snappy stuff?
[14:55:05] libsnappyjava?
[14:56:09] ottomata: not much. It's on my TODO list. I can probably context switch tomorrow if you want it soon.
[14:56:25] i think we'll want it fairly soon, within the next week or two
[14:56:33] cool. Will do then
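Returning to the pybal/Elasticsearch exchange above: a sketch of a slightly stronger per-node probe than "GET / and expect a 200", using Elasticsearch's cluster health API but, as manybubbles cautions, refusing to depool on a red cluster status alone, since that is cluster-wide and would take every node out of rotation at once. Host, port, timeout and the depooling policy are assumptions for illustration, not what pybal was actually configured to do.

```python
"""Hedged sketch of a pybal-style health probe for an Elasticsearch node.
/_cluster/health is the documented cluster health endpoint; a node is
treated as poolable if it answers quickly and believes it is in a cluster.
'red' status alone does NOT depool it, per the caveat in the discussion."""
import requests

ES = "http://localhost:9200"   # assumption: probe the node locally on the default port


def node_is_poolable(timeout=2.0):
    try:
        health = requests.get(ES + "/_cluster/health", timeout=timeout).json()
    except requests.RequestException:
        return False                      # unreachable or too slow: depool
    if health.get("number_of_nodes", 0) < 1:
        return False                      # node thinks it is alone / split off
    if health.get("status") == "red":
        # red means unassigned shards somewhere in the cluster; depooling on it
        # would drop every node simultaneously, so only report it
        print("cluster status is red; not depooling on that alone")
    return True


if __name__ == "__main__":
    print("pool" if node_is_poolable() else "depool")
```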
[14:56:35] the varnishkafka and test broker setup is looking good
[14:56:40] (03PS6) 10Matanya: drac: convert into a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/87332
[14:56:51] (03PS1) 10BBlack: assert(ip) is stupid, fails on 0.0.0.0 [operations/software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/87342
[14:56:58] would like to start using official packages and puppetization soon
[14:57:04] akosiaris1: ^
[14:57:05] in order to soon actually install this on mobile hosts
[14:57:34] matanya: cool thanks!!!
[14:57:44] ottomata: official packages ?
[14:57:48] :)
[14:57:53] uhhh
[14:57:57] 'official' meaning from apt :p
[14:57:58] please tell me you mean our kafka package :-D
[14:58:01] our apt
[14:58:11] rather than me dpkg -i ing stuff
[14:58:36] ok ok ... I finally have in my mind what needs to be done so I will have a look tomorrow
[14:58:59] k danke
[15:00:33] (03CR) 10BBlack: [C: 032] assert(ip) is stupid, fails on 0.0.0.0 [operations/software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/87342 (owner: 10BBlack)
[15:00:41] (03CR) 10BBlack: [V: 032] assert(ip) is stupid, fails on 0.0.0.0 [operations/software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/87342 (owner: 10BBlack)
[15:01:26] ottomata: I exchanged a bunch of mails with the jq maintainer
[15:01:37] he asked for sponsorship, I did a thorough review on his package
[15:01:42] I'm about to upload 1.3-1 to Debian
[15:01:47] awesome!
[15:01:50] thanks
[15:01:53] jq ?
[15:02:08] http://stedolan.github.io/jq/
[15:02:49] paravoid: did I hear something from magnus about you playing with varnishkafka bson? :p
[15:02:58] yes
[15:03:08] that was a while ago
[15:11:53] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[15:20:43] (03PS1) 10BBlack: bump netmapper patch to netmapper:b62b6c7a for assert(ip) fix [operations/debs/varnish] (testing/3.0.3plus-rc1) - 10https://gerrit.wikimedia.org/r/87347
[15:20:44] (03PS1) 10BBlack: varnish (3.0.3plus~rc1-wm17) precise; urgency=low [operations/debs/varnish] (testing/3.0.3plus-rc1) - 10https://gerrit.wikimedia.org/r/87348
[15:21:08] (03CR) 10BBlack: [C: 032 V: 032] bump netmapper patch to netmapper:b62b6c7a for assert(ip) fix [operations/debs/varnish] (testing/3.0.3plus-rc1) - 10https://gerrit.wikimedia.org/r/87347 (owner: 10BBlack)
[15:21:27] (03CR) 10BBlack: [C: 032 V: 032] varnish (3.0.3plus~rc1-wm17) precise; urgency=low [operations/debs/varnish] (testing/3.0.3plus-rc1) - 10https://gerrit.wikimedia.org/r/87348 (owner: 10BBlack)
[15:34:53] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 15:34:44 UTC 2013
[15:35:53] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[15:38:53] (03CR) 10Andrew Bogott: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/86762 (owner: 10Ryan Lane)
[15:52:53] PROBLEM - Puppet freshness on virt1000 is CRITICAL: No successful Puppet run in the last 10 hours
[15:53:49] ROFL
[15:53:52] Invalid language code "//bits.wikimedia.org/static-1.22wmf18/extensions/TimedMediaHandler/MwEmbedModules/EmbedPlayer"
[16:05:04] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 16:05:01 UTC 2013
[16:05:53] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[16:14:33] (03PS1) 10Reedy: Add symlink stuff [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87350
[16:14:34] !log updated varnish to -wm17 on active mobile caches (cp1046, cp1047, cp1059, cp1060, cp3011, cp3012), fixes 0.0.0.0 assertfail in netmapper
[16:14:45] Logged the message, Master
[16:17:22] (03CR) 10Ryan Lane: [C: 04-1] "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/86762 (owner: 10Ryan Lane)
[16:17:47] PROBLEM - Host ms-be1011 is DOWN: PING CRITICAL - Packet loss = 100%
[16:18:03] (03PS2) 10Reedy: Do not show password reset interface unless password is correct [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87319 (owner: 10Ori.livneh)
[16:18:18] (03CR) 10Reedy: [C: 032] Do not show password reset interface unless password is correct [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87319 (owner: 10Ori.livneh)
[16:19:35] hi – https://bugzilla.wikimedia.org/show_bug.cgi?id=54847 is the leak bug and it's fixed already, right? can it be made public?
[16:19:50] (03PS11) 10Reedy: Hooks to force password reset for some users [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87285 (owner: 10CSteipp)
[16:19:57] (03CR) 10Reedy: [C: 032] Hooks to force password reset for some users [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87285 (owner: 10CSteipp)
[16:20:08] (03Merged) 10jenkins-bot: Hooks to force password reset for some users [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87285 (owner: 10CSteipp)
[16:20:41] (03PS2) 10Reedy: Enable $wmgBug54847 for default & private, but gate also on $wmgUseCentralAuth [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87318 (owner: 10Ori.livneh)
[16:20:58] (03CR) 10Reedy: [C: 032] Enable $wmgBug54847 for default & private, but gate also on $wmgUseCentralAuth [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87318 (owner: 10Ori.livneh)
[16:21:01] (03PS8) 10Ottomata: gerrit: linkify references to Analytics Mingle projects. [operations/puppet] - 10https://gerrit.wikimedia.org/r/84338 (owner: 10Diederik)
[16:21:05] (03PS3) 10Reedy: Do not show password reset interface unless password is correct [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87319 (owner: 10Ori.livneh)
[16:21:06] okay, judging by the activity, i assume it's not fully fixed yet. :P
[16:21:10] thanks mark, what was the puppet problem?
[16:21:12] (03Merged) 10jenkins-bot: Enable $wmgBug54847 for default & private, but gate also on $wmgUseCentralAuth [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87318 (owner: 10Ori.livneh)
[16:21:22] (03CR) 10Ottomata: [C: 032 V: 032] gerrit: linkify references to Analytics Mingle projects. [operations/puppet] - 10https://gerrit.wikimedia.org/r/84338 (owner: 10Diederik)
[16:21:38] (03CR) 10Reedy: [C: 032] Do not show password reset interface unless password is correct [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87319 (owner: 10Ori.livneh)
[16:21:45] (03PS2) 10Reedy: Add symlink stuff [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87350
[16:21:50] (03CR) 10Reedy: [C: 032] Add symlink stuff [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87350 (owner: 10Reedy)
[16:22:57] RECOVERY - Host ms-be1011 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[16:24:57] PROBLEM - SSH on ms-be1011 is CRITICAL: Connection refused
[16:29:55] ottomata: the fact that not all lvs classes handled ulsfo
[16:30:02] by reversing the check in the template, that's no longer needed
[16:30:12] hm, k, i see the commit, thank you
[16:30:13] so
[16:30:15] now, what?
[16:30:35] bits varnishes should be up, pybal is up on lvs*,
[16:30:35] pybal bgp setup...
[16:30:37] ssl
[16:30:47] the bgp is outside of puppet?
[16:30:50] i assume?
[16:30:57] yes, it's router config
[16:30:59] aye
[16:30:59] k
[16:31:24] so i'm back to working on some analytics stuff, can I leave that with you then? or is there more I can/should do?
[16:31:25] although it looks like I'm going to esams tomorrow, for urgent dc work
[16:31:33] i can take care of it yes
[16:31:39] but it'll be next week
[16:31:40] ok cool, great, danke
[16:31:42] that's fine,
[16:31:44] (fine with me though)
[16:31:49] i might get around to puppetizing more of the varnishes
[16:31:53] i think i just did bits
[16:31:58] now I understand the layout more so I can do that
[16:32:02] yes, if you want to proceed you can go on with upload/mobile
[16:32:05] k
[16:32:08] let's see why we don't have enough boxes btw
[16:32:13] aye
[16:32:16] did we assume 2 boxes for bits and mobile perhaps?
[16:32:17] * mark checks the ticket
[16:33:27] oh damn
[16:33:38] Quote 651613720 is for 20 new varnish servers with dual Intel S3700 SSDs.
[16:33:45] Quote 651618910 is for 8 bits varnish servers (no SSDs).
[16:33:50] those 8 means: 4 lvs, 4 bits
[16:33:56] 8? this sounds excessive
[16:33:56] oh
[16:34:03] but anyway, if you picked 4 out of those 20 with SSDs for bits, that's the wrong boxes
[16:34:07] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 16:33:57 UTC 2013
[16:34:12] bits doesn't need SSDs, it's all in-memory
[16:34:47] PROBLEM - Host ms-be1011 is DOWN: PING CRITICAL - Packet loss = 100%
[16:34:47] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[16:34:50] although...
[16:34:54] later in the ticket it says:
[16:35:04] With more memory we can probably reduce the amount of servers. I think 4 would be fine, but to be safe, let's do 6 upload servers, and 6 text servers. We probably don't need 4 mobile servers either.
[16:35:04] So let's reduce 20 to 16 - that will also cover for the increased cost for the memory.
[16:35:07] argh, I totally forgot that
[16:35:39] anyway, if those bits servers have the same hw configuration as lvs, that's good
[16:36:46] it seems we bought 16 servers with SSDs
[16:36:51] 4 mobile, 6 text, 6 upload
[16:39:55] those bits boxes seem to have hard drives, good
[16:40:22] !log reedy synchronized php-1.22wmf20 'Staging'
[16:40:35] Logged the message, Master
[16:41:00] !log reedy synchronized docroot and w
[16:41:02] ottomata: so to proceed, I'd go forward with upload (6 boxes) and mobile (4 boxes)
[16:41:09] but it can wait also
[16:41:12] Logged the message, Master
[16:42:43] ahhh rats ok mark
[16:42:48] still readding...
[16:43:05] RECOVERY - SSH on ms-be1011 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[16:43:15] RECOVERY - Host ms-be1011 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms
[16:43:21] mark, but what to do with existing bits? redo them?
[16:43:28] which nodes should be for bits?
[16:43:31] i believe they're fine
[16:43:33] oh
[16:43:38] i believe they are the ones with hard drives
[16:43:43] perhaps not all(?) I checked two
[16:43:49] oh you say that they have hdds
[16:43:49] ok
[16:44:15] the SSDs identify themselves as intel and are 400 GB
[16:44:19] these HDs are 250GB
[16:44:34] so I think bits is fine
[16:44:54] ja 250
[16:44:55] k
[16:44:56] good
[16:44:57] :)
[16:45:19] so there are 8 nodes with ssds?
[16:45:23] and 4 have been used for lvs?
[16:45:24] so far?
[16:45:29] 16 with ssds, 8 without
[16:45:35] oh sorry
[16:45:35] no
[16:45:36] backwards
[16:45:38] got it
[16:45:41] the 8 without are 4 for lvs, 4 for bits
[16:45:44] 8 with hdds, 4 for bits, 4 for lvs
[16:45:52] k cool
[16:45:56] great
[16:48:03] !log reedy Started syncing Wikimedia installation... : testwiki to 1.22wmf20 and build l10n cache
[16:48:05] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: No successful Puppet run in the last 10 hours
[16:48:05] PROBLEM - Puppet freshness on ms-be1012 is CRITICAL: No successful Puppet run in the last 10 hours
[16:48:15] Logged the message, Master
[16:48:33] ok, more qs for you mark
[16:48:46] yesterday i was asking about ganglia and you mentioned that the aggregators in the same cluster should report the same data, right?
[16:49:19] yes
[16:49:26] root@nickel:~# netcat analytics1009.eqiad.wmnet 8649 | grep kafka | wc -l
[16:49:26] 0
[16:49:26] root@nickel:~# netcat analytics1011.eqiad.wmnet 8649 | grep kafka | wc -l
[16:49:26] 4751
[16:49:40] previously, analytics1003 was an aggregator, instead of 1009
[16:49:53] and then it worked?
[16:49:54] (03PS1) 10Faidon Liambotis: partman: add a new layout for ms-be @ eqiad [operations/puppet] - 10https://gerrit.wikimedia.org/r/87354
[16:49:55] (03PS1) 10Faidon Liambotis: partman: cleanup external store profiles [operations/puppet] - 10https://gerrit.wikimedia.org/r/87355
[16:50:00] i changed this on sept 24 (i think), and since then I haven't had any of the non base ganglia data from the analytics cluster
[16:50:04] yes, it worked before that
[16:50:06] is gerrit down for everyone?
[16:50:25] nope
[16:50:43] ottomata: so are 1009 and 1011 in different subnets?
[16:50:46] yes
[16:51:00] sounds like multicast routing is broken
[16:51:50] ah
[16:51:52] row b and c I bet
[16:52:03] i have a feeling if I change vrrp prio of row c back to cr2-eqiad it will work again
[16:52:06] fscking multicast
[16:52:49] ha, did that change recently too? is 1009 in a different row than 1003?
[16:52:58] yes, like last week
[16:52:59] Coren: ok, give your script a shot. the sanitarium instances that hold the affected databases in your original list have finished filtering. I'm running the same scripts on the other instances too, but that's enwiki/frwiki, etc which may take a day or so to finish the table scans on `revision` which has one redacted field
[16:54:43] YuviPanda: should I be able to wget http://proxy-dammit:5000/v1/testproject/mapping ?
[16:54:57] andrewbogott: shouldn't
[16:55:02] andrewbogott: well, internally - yes
[16:55:05] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: No successful Puppet run in the last 10 hours
[16:55:05] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: No successful Puppet run in the last 10 hours
[16:55:06] andrewbogott: externally no
[16:55:11] oh, nevermind. that's an internal url
[16:55:13] yes you should
[16:55:25] RECOVERY - Puppet freshness on ms-be1011 is OK: puppet ran at Thu Oct 3 16:55:18 UTC 2013
[16:55:44] 'Connection refused'
[16:55:46] (03CR) 10Faidon Liambotis: [C: 032] partman: add a new layout for ms-be @ eqiad [operations/puppet] - 10https://gerrit.wikimedia.org/r/87354 (owner: 10Faidon Liambotis)
[16:56:03] looking
[16:56:31] andrewbogott: bah, just found out that it is listening on 127.0.0.1
[16:56:36] (03CR) 10Faidon Liambotis: [C: 032] "Diff is dirty, it's basically "git rm db.cfg; git mv es.cfg db.cfg"." [operations/puppet] - 10https://gerrit.wikimedia.org/r/87355 (owner: 10Faidon Liambotis)
[16:56:41] That would explain it!
[16:56:48] andrewbogott: hmm, i need to have it listen on only the internal IP
[16:56:51] (03PS2) 10Faidon Liambotis: Remove jfsutils from base::standard-packages [operations/puppet] - 10https://gerrit.wikimedia.org/r/86252 (owner: 10Akosiaris)
[16:58:32] andrewbogott: try now?
[16:58:53] 400 BAD REQUEST
[16:59:04] Which means that I'm talking to the API now, but… would expect it to respond with something?
[16:59:24] andrewbogott: what was your request?
[16:59:37] andrewbogott: try visualeditor
[16:59:39] as a project
[16:59:53] (03CR) 10Faidon Liambotis: [C: 032] "Removed apache.cfg, I give up on snapshot :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/86252 (owner: 10Akosiaris)
[16:59:59] Ah, that gets me something.
[17:00:05] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: No successful Puppet run in the last 10 hours
[17:00:05] PROBLEM - Puppet freshness on ms-be1004 is CRITICAL: No successful Puppet run in the last 10 hours
[17:00:07] So, a 400 if I pass in a project with no proxies?
[17:00:10] Maybe that's OK
[17:00:27] i should ideally make that a 404
[17:00:57] (03PS1) 10Faidon Liambotis: swift: change eqiad to the new partition layout [operations/puppet] - 10https://gerrit.wikimedia.org/r/87357
[17:01:05] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: No successful Puppet run in the last 10 hours
[17:01:54] (03PS2) 10Faidon Liambotis: swift: change eqiad to the new partition layout [operations/puppet] - 10https://gerrit.wikimedia.org/r/87357
[17:02:02] (03CR) 10Faidon Liambotis: [C: 032 V: 032] swift: change eqiad to the new partition layout [operations/puppet] - 10https://gerrit.wikimedia.org/r/87357 (owner: 10Faidon Liambotis)
[17:03:40] YuviPanda: I don't really understand the json I'm getting back from that query… is it serving up dummy responses?
[17:03:51] andrewbogott: pastebin?
[17:04:02] andrewbogott: no, not dummy responses
[17:04:05] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: No successful Puppet run in the last 10 hours
[17:04:08] https://dpaste.de/Ee8M
[17:04:33] no backends
[17:04:54] andrewbogott: ugh, okay that's weird
[17:05:11] andrewbogott: oh wait
[17:05:13] andrewbogott: what was the URL?
[17:05:17] so mark, anything I can do or do you just have to press some buttons somewhere?
[17:05:17] you used?
[17:05:21] for ganglia problems
[17:05:28] http://proxy-dammit:5000/v1/visualeditor/mapping
[17:05:39] ottomata: not really
[17:05:41] YuviPanda: are you expecting that I'll re-query each individual domain for the backends?
[17:05:44] it needs some network architecture changes
[17:05:49] That's OK, just unclear from the docs.
[17:05:51] andrewbogott: at the moment, yeah
[17:05:53] which i'm planning to do soon but obviously can't do right now
[17:05:58] (And maybe inefficient)
[17:05:59] andrewbogott: I can change that if you want
[17:06:04] andrewbogott: yeah, definitely sounds inefficient
[17:06:05] PROBLEM - Puppet freshness on ms-be1008 is CRITICAL: No successful Puppet run in the last 10 hours
[17:06:06] andrewbogott: let me change that
[17:06:13] Yeah, I think it's better if I can get it all in a lump.
[17:06:20] thx
[17:06:26] oof, ok, so bigger than just some buttons?
[17:06:35] should I just move the aggregator elsewhere for now?
[17:06:55] well
[17:07:05] basically there's no reliable multicast routing between the subnets now
[17:07:05] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: No successful Puppet run in the last 10 hours
[17:07:07] (03PS1) 10Reedy: Only enable Bug 54847 code if on production [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87359
[17:07:11] so they're two islands
[17:07:25] moving the aggregators around doesn't help much
[17:09:27] !log reedy Finished syncing Wikimedia installation... : testwiki to 1.22wmf20 and build l10n cache
[17:09:37] Logged the message, Master
[17:11:05] RECOVERY - Disk space on ms-be1011 is OK: DISK OK
[17:11:10] (03PS1) 10Reedy: Disable wmgBug54847 on labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87361
[17:11:15] RECOVERY - DPKG on ms-be1011 is OK: All packages OK
[17:11:22] (03Abandoned) 10Reedy: Only enable Bug 54847 code if on production [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87359 (owner: 10Reedy)
[17:11:45] RECOVERY - RAID on ms-be1011 is OK: OK: State is Optimal, checked 1 logical device(s)
[17:13:51] !log reedy synchronized php-1.22wmf20/extensions/VisualEditor
[17:13:55] PROBLEM - Host ms-be1002 is DOWN: PING CRITICAL - Packet loss = 100%
[17:13:55] PROBLEM - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100%
[17:13:55] PROBLEM - Host ms-be1007 is DOWN: PING CRITICAL - Packet loss = 100%
[17:13:55] PROBLEM - Host ms-be1006 is DOWN: PING CRITICAL - Packet loss = 100%
[17:13:55] PROBLEM - Host ms-be1009 is DOWN: PING CRITICAL - Packet loss = 100%
[17:13:56] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100%
[17:13:56] PROBLEM - Host ms-be1008 is DOWN: PING CRITICAL - Packet loss = 100%
[17:14:04] (03CR) 10Reedy: [C: 032] Disable wmgBug54847 on labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87361 (owner: 10Reedy)
[17:14:04] Logged the message, Master
[17:14:12] (03Merged) 10jenkins-bot: Disable wmgBug54847 on labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87361 (owner: 10Reedy)
[17:14:15] PROBLEM - Host ms-be1004 is DOWN: PING CRITICAL - Packet loss = 100%
[17:14:35] PROBLEM - Host ms-be1010 is DOWN: PING CRITICAL - Packet loss = 100%
[17:16:05] RECOVERY - NTP on ms-be1011 is OK: NTP OK: Offset -0.01930582523 secs
[17:18:32] YuviPanda: https://github.com/google/lmctfy/
[17:18:51] let me container that for you?
[17:18:58] yes, i laughed
[17:19:05] :D
[17:19:05] RECOVERY - Host ms-be1002 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[17:19:06] RECOVERY - Host ms-be1006 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms
[17:19:06] RECOVERY - Host ms-be1005 is UP: PING OK - Packet loss = 0%, RTA = 1.39 ms
[17:19:06] RECOVERY - Host ms-be1007 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms
[17:19:06] RECOVERY - Host ms-be1009 is UP: PING OK - Packet loss = 0%, RTA = 2.62 ms
[17:19:06] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[17:19:06] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms
[17:19:08] me too!
[17:19:25] RECOVERY - Host ms-be1004 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[17:19:31] there's a C++ library, so i guess we can write bindings
[17:19:34] Oh, contain
[17:19:38] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: testwiki back to 1.22wmf19 till deploy time
[17:19:55] Logged the message, Master
[17:21:05] PROBLEM - SSH on ms-be1008 is CRITICAL: Connection refused
[17:21:15] PROBLEM - SSH on ms-be1005 is CRITICAL: Connection refused
[17:21:25] PROBLEM - SSH on ms-be1009 is CRITICAL: Connection refused
[17:21:26] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100%
[17:21:35] PROBLEM - SSH on ms-be1006 is CRITICAL: Connection refused
[17:21:35] PROBLEM - SSH on ms-be1003 is CRITICAL: Connection refused
[17:21:45] PROBLEM - SSH on ms-be1002 is CRITICAL: Connection refused
[17:22:05] PROBLEM - SSH on ms-be1007 is CRITICAL: Connection refused
[17:22:15] PROBLEM - SSH on ms-be1004 is CRITICAL: Connection refused
[17:22:16] these are all me, obviously
[17:23:12] paravoid: Don't worry -- when things break we always think of you first. :-)
[17:23:25] they didn't break, I'm just reformatting en masse
[17:23:31] megacli en masse too
[17:23:44] for a moment I was thinking of hooking megacli to d-i
[17:23:55] PROBLEM - Host ms-be1012 is DOWN: PING CRITICAL - Packet loss = 100%
[17:24:03] but they're both completely undebuggable, so the combination sounds very scary
[17:24:58] (03PS1) 10Reedy: Wikipedias to 1.22wmf19 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87363
[17:24:59] (03PS1) 10Reedy: testwiki, test2wiki, testwikidatawiki, loginwiki and mediawikiwiki to 1.22wmf20 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87364
[17:26:35] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[17:26:51] ok, more for you mark :)
[17:26:53] https://gerrit.wikimedia.org/r/#/c/86894/4
[17:26:54] no hurry
[17:28:15] RECOVERY - SSH on ms-be1005 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[17:28:35] RECOVERY - SSH on ms-be1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[17:28:35] RECOVERY - SSH on ms-be1006 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[17:28:35] PROBLEM - swift-object-server on ms-be1001 is CRITICAL: Connection refused by host
[17:28:45] RECOVERY - SSH on ms-be1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[17:28:45] PROBLEM - swift-container-replicator on ms-be1001 is CRITICAL: Connection refused by host
[17:28:55] PROBLEM - SSH on ms-be1001 is CRITICAL: Connection refused
[17:28:55] PROBLEM - swift-object-updater on ms-be1001 is CRITICAL: Connection refused by host
[17:28:55] PROBLEM - swift-object-replicator on ms-be1001 is CRITICAL: Connection refused by host
[17:29:06] RECOVERY - SSH on ms-be1007 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[17:29:06] PROBLEM - Disk space on ms-be1001 is CRITICAL: Connection refused by host
[17:29:06] PROBLEM - swift-object-auditor on ms-be1001 is CRITICAL: Connection refused by host
[17:29:06] RECOVERY - SSH on ms-be1008 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[17:29:15] PROBLEM - swift-account-server on ms-be1001 is CRITICAL: Connection refused by host
[17:29:15] RECOVERY - SSH on ms-be1004 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[17:29:15] PROBLEM - RAID on ms-be1001 is CRITICAL: Connection refused by host
[17:29:15] PROBLEM - swift-account-reaper on ms-be1001 is CRITICAL: Connection refused by host
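The `netcat <aggregator> 8649 | grep kafka | wc -l` check from the ganglia discussion above, as a small Python equivalent that can be pointed at each aggregator of a cluster: gmond serves its full metric tree as XML on TCP port 8649, so two aggregators for the same cluster should report roughly the same kafka metric count, and a zero on one of them is a hint that the multicast traffic is not reaching it (the "two islands" problem described earlier). The hostnames are the ones from the conversation and are only examples.

```python
"""Python equivalent of `netcat <aggregator> 8649 | grep kafka | wc -l`:
read gmond's XML dump from each aggregator and count kafka metric lines."""
import socket

AGGREGATORS = ["analytics1009.eqiad.wmnet", "analytics1011.eqiad.wmnet"]  # examples from the log


def kafka_metric_count(host, port=8649):
    chunks = []
    # gmond dumps the whole cluster XML and then closes the connection
    with socket.create_connection((host, port), timeout=5) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    xml = b"".join(chunks).decode("utf-8", "replace")
    return sum(1 for line in xml.splitlines() if "kafka" in line)


if __name__ == "__main__":
    for host in AGGREGATORS:
        print(host, kafka_metric_count(host))
```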
[17:29:16] PROBLEM - swift-container-updater on ms-be1001 is CRITICAL: Connection refused by host
[17:29:25] PROBLEM - swift-container-auditor on ms-be1001 is CRITICAL: Connection refused by host
[17:29:25] RECOVERY - SSH on ms-be1009 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[17:29:25] PROBLEM - swift-account-replicator on ms-be1001 is CRITICAL: Connection refused by host
[17:29:25] PROBLEM - swift-container-server on ms-be1001 is CRITICAL: Connection refused by host
[17:29:25] PROBLEM - swift-account-auditor on ms-be1001 is CRITICAL: Connection refused by host
[17:29:35] PROBLEM - DPKG on ms-be1001 is CRITICAL: Connection refused by host
[17:40:56] PROBLEM - NTP on ms-be1001 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:41:08] i see multicast packets from anl1011 and others on 1019
[17:41:11] er on 1009
[17:41:18] so it's probably not multicast routing that is broken
[17:42:16] PROBLEM - swift-account-reaper on ms-be1011 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[17:44:16] (03PS1) 10Chad: Cirrus to default cawiki, remove commented officewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87369
[17:44:16] RECOVERY - swift-account-reaper on ms-be1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[17:44:33] (03CR) 10Chad: [C: 032] Cirrus to default cawiki, remove commented officewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87369 (owner: 10Chad)
[17:44:42] (03Merged) 10jenkins-bot: Cirrus to default cawiki, remove commented officewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87369 (owner: 10Chad)
[17:45:19] !log demon synchronized wmf-config/InitialiseSettings.php
[17:45:30] Logged the message, Master
[17:45:46] PROBLEM - SSH on maerlant is CRITICAL: Server answer:
[18:02:03] can someone touch fluorine:/a/mw-log/Bug54847.log ?
[18:03:08] Coren: ^
[18:03:50] ori-l: Touched.
[18:03:56] thanks
[18:04:37] Coren: erm, and chown it to udp2log.udp2log?
[18:05:03] ori-l: Done.
[18:05:04] thanks
[18:09:46] RECOVERY - SSH on ms-be1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[18:11:46] RECOVERY - Host ms-be1012 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[18:13:00] RECOVERY - search indices - check lucene status page on search20 is OK: HTTP OK: HTTP/1.1 200 OK - 60075 bytes in 0.111 second response time
[18:13:30] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[18:14:10] PROBLEM - SSH on ms-be1012 is CRITICAL: Connection refused
[18:25:22] (03PS2) 10Reedy: Wikipedias to 1.22wmf19 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87363
[18:25:28] (03CR) 10Reedy: [C: 032] Wikipedias to 1.22wmf19 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87363 (owner: 10Reedy)
[18:27:19] (03CR) 10Reedy: "recheck" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87363 (owner: 10Reedy)
[18:28:01] (03CR) 10Reedy: [V: 032] Wikipedias to 1.22wmf19 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87363 (owner: 10Reedy)
[18:28:40] PROBLEM - Host ms-be1012 is DOWN: PING CRITICAL - Packet loss = 100%
[18:28:43] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikipedias to 1.22wmf19
[18:28:57] Logged the message, Master
[18:31:36] akosiaris: etherpad.wikimedia.org is having troubles - HaeB has been bugging me because there's an event in 20 minutes that's using it
[18:31:37] (03PS2) 10Reedy: testwiki, test2wiki, testwikidatawiki, loginwiki and mediawikiwiki to 1.22wmf20 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87364
[18:31:42] (03CR) 10Reedy: [C: 032] testwiki, test2wiki, testwikidatawiki, loginwiki and mediawikiwiki to 1.22wmf20 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87364 (owner: 10Reedy)
[18:32:30] marktraceur: what does having troubles mean ?
[18:32:57] akosiaris: Apparently they can edit a pad, but it doesn't get saved to the server
[18:33:10] ??????
[18:33:14] I'll let him explain
[18:33:27] (03CR) 10Reedy: [V: 032] testwiki, test2wiki, testwikidatawiki, loginwiki and mediawikiwiki to 1.22wmf20 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87364 (owner: 10Reedy)
[18:33:50] RECOVERY - Host ms-be1012 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[18:33:53] Maybe not - he's not used to dvorak
[18:34:22] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: testwiki, test2wiki, mediawikiwiki, loginwiki and testwikidatawiki to 1.22wmf20
[18:34:30] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 18:34:28 UTC 2013
[18:34:32] Logged the message, Master
[18:34:47] akosiaris: He's sending you an email explaining the issue now
[18:34:57] ok
[18:35:30] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[18:37:10] if a log file on fluorine was not writable by udp2log on creation, is it possible that the udp2log daemon needs to be SIGHUPed to start streaming log data into it?
[18:37:38] ori-l: I've never needed to HUP it
[18:37:53] * Reedy kicks APC
[18:37:53] Perhaps you could try dropping and recreating the file?
[18:39:02] !log reedy synchronized php-1.22wmf20/includes/DefaultSettings.php
[18:39:08] RoanKattouw: can't
[18:39:16] Logged the message, Master
[18:40:04] ori-l: Would you like me to?
[18:40:43] RoanKattouw: that would be great -- thank you
[18:40:52] Path?
[18:41:04] fluorine:/a/mw-log/Bug54847.log [18:42:02] Done [18:42:14] I don't know if that'll work, udp2log is sort of a mystery to me [18:42:34] I suppose you could try asking someone who's familiar with the code, but I don't know who that would be [18:42:43] hi akosiaris [18:42:52] (i'm tilman ;) [18:42:59] highly-performant, bug-free but mostly inscrutable C++ code? [18:43:10] RECOVERY - SSH on ms-be1012 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [18:43:11] hey HaeB I just read your email and I am able to reproduce... [18:43:12] i think that narrows it down to gwicke and tim :P [18:43:26] haha [18:44:00] RoanKattouw: ottomata might be able to help you, he knows a lot about using it [18:44:09] that being said... no logs or anything.... looking into it [18:44:42] thanks ;) [18:47:54] whaasssup? [18:48:54] ottomata: ori-l is having some problems with getting udp2log to write to a new log file [18:50:05] oh, I haven't done anything that would have definitively generated a log message in the last 5-10 mins [18:50:09] if you recreated the file, RoanKattouw [18:50:15] I did [18:50:43] kk [18:52:13] nope [18:52:23] did wfDebugLog( "Bug54847", 'ori-l testing' ); on eval.php on fenari [18:54:30] RECOVERY - Puppet freshness on ms-be1012 is OK: puppet ran at Thu Oct 3 18:54:25 UTC 2013 [18:54:34] * ori-l moves on to another bug [18:55:50] RECOVERY - Host ms-be1010 is UP: PING WARNING - Packet loss = 37%, RTA = 0.34 ms [18:56:07] (03PS1) 10Odder: (bug 54922) Add an accountcreator user group on svwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87432 [18:58:03] im trying to import https://en.wikipedia.org/wiki/San_Francisco and associated templates to test2wiki since mobile web's automated tests use that article - however, i get the error 'import failed: could not open import file' [18:58:22] (via test2.wikipedia.org/wiki/Special:Import) [18:58:42] i tried a much smaller article, which resulted in ERR_READ_TIMEOUT :| [18:58:50] PROBLEM - SSH on ms-be1010 is CRITICAL: Connection refused [18:58:57] akosiaris: any insights yet? [18:58:59] (just say it's complicated ;) [18:59:17] its facebooky-like complicated [18:59:31] seems to be something either in the content [18:59:31] anybody know what might be going on and/or able to help get the San Francisco article imported? ^^ [18:59:35] or the length... [18:59:45] because i can reproduce with just c/p to another pad [18:59:52] but etherpad logs no error [19:00:00] RECOVERY - Disk space on ms-be1012 is OK: DISK OK [19:00:01] RECOVERY - DPKG on ms-be1012 is OK: All packages OK [19:00:20] RECOVERY - RAID on ms-be1012 is OK: OK: State is Optimal, checked 1 logical device(s) [19:00:50] RECOVERY - NTP on ms-be1012 is OK: NTP OK: Offset -0.06885778904 secs [19:02:07] i also noted that when i tried to paste in a larger chunk of text this morning (10kb), the pasted text showed up in a differen font (in a serif font like in the text editor i was pasting front, rathter than the standard sans serif font etherpad is using) [19:04:15] what text editor did you copy it from? [19:04:28] Reedy: any idea? ^^^ [19:04:35] awjr: are you importing all revisions? 
[19:04:40] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 19:04:32 UTC 2013 [19:04:51] and are you using transwiki import or importupload [19:05:12] Nemo_bis: yeah, all revisions but probably don't need to - and transwiki [19:05:23] SF must have thousands revisions, if you also imported transcluded templates that can amount to GB of stuff [19:05:24] lemme try w/o revisions [19:05:30] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [19:05:43] not that the API would even let it do it [19:06:57] Nemo_bis: w/o revisions, im getting ERR_READ_TIMEOUT [19:07:03] Request: POST http://test2.wikipedia.org/w/index.php?title=Special:Import&action=submit, from 208.80.154.75 via cp1007.eqiad.wmnet (squid/2.7.STABLE9) to 10.64.0.141 (10.64.0.141) [19:07:03] Error: ERR_READ_TIMEOUT, errno [No Error] at Thu, 03 Oct 2013 19:06:19 GMT [19:07:30] HaeB_: seems like it is fixed [19:07:45] awjr: with or without templates? [19:07:50] with templates [19:07:58] do you really need them? [19:08:31] Nemo_bis: seems to work w/o them - we might for the automated tests [19:08:32] lemme dig [19:08:39] that being said I did not find the reason it happened and I bet it will show up again. [19:08:50] RECOVERY - SSH on ms-be1010 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [19:08:50] cause all i did was restart etherpad [19:08:59] Nemo_bis: is there a way to use importupload on test2wiki? [19:09:20] sure [19:09:22] btw https://bugzilla.wikimedia.org/show_bug.cgi?id=15000#c17 [19:09:42] yeah, we need templates :( [19:10:19] Nemo_bis: thanks - i'd like to try importupload - how do i do that on test2wiki? [19:10:25] you can ask anyone on #wikimedia-stewards [19:11:13] akosiaris: oh.,.. just started to implement plan b.. [19:11:52] HaeB_: you may want to have a plan B ... [19:11:56] how sure are we that it's fixed? [19:12:09] we are not [19:12:22] !log CentralAuth: UPDATE `bug_54847_password_resets` SET `r_logged_out` = 1 WHERE `r_reset` IS NOT NULL; [19:12:37] Logged the message, Master [19:13:23] I expect it will show up again. If it does ping me. I wanna gather more info next time. [19:13:58] ok, thanks! 
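Since the interactive transwiki import keeps timing out behind the proxy, one workaround in the spirit of the importupload suggestion above is to pull an XML dump locally with Special:Export (current revision only, templates included, to keep it small) and upload that instead. A rough sketch, assuming Special:Export still accepts the pages/templates/curonly parameters; the output file name is arbitrary:

```python
import urllib.parse
import urllib.request

EXPORT_URL = "https://en.wikipedia.org/wiki/Special:Export"

# Current revision only, with transcluded templates. Parameter names are an
# assumption about what Special:Export accepts.
params = {
    "pages": "San Francisco",
    "templates": "1",
    "curonly": "1",
}
data = urllib.parse.urlencode(params).encode("utf-8")

req = urllib.request.Request(EXPORT_URL, data=data,
                             headers={"User-Agent": "import-helper-sketch/0.1"})
with urllib.request.urlopen(req, timeout=120) as resp:
    xml = resp.read()

with open("San_Francisco.xml", "wb") as out:
    out.write(xml)
print("wrote %d bytes of export XML" % len(xml))
```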
[19:25:40] PROBLEM - SSH on stat1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:26:30] RECOVERY - SSH on stat1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [19:27:20] RECOVERY - Puppet freshness on ms-be1002 is OK: puppet ran at Thu Oct 3 19:27:12 UTC 2013 [19:27:20] RECOVERY - Puppet freshness on ms-be1003 is OK: puppet ran at Thu Oct 3 19:27:12 UTC 2013 [19:32:10] RECOVERY - Disk space on ms-be1003 is OK: DISK OK [19:32:11] RECOVERY - RAID on ms-be1001 is OK: OK: State is Optimal, checked 1 logical device(s) [19:32:20] RECOVERY - DPKG on ms-be1002 is OK: All packages OK [19:32:20] RECOVERY - RAID on ms-be1003 is OK: OK: State is Optimal, checked 1 logical device(s) [19:32:20] RECOVERY - RAID on ms-be1002 is OK: OK: State is Optimal, checked 1 logical device(s) [19:32:30] RECOVERY - DPKG on ms-be1001 is OK: All packages OK [19:32:40] RECOVERY - swift-container-replicator on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [19:32:49] !log reedy synchronized php-1.22wmf20/extensions/MassMessage/ [19:32:50] RECOVERY - swift-object-updater on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [19:32:50] RECOVERY - swift-object-replicator on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [19:33:00] RECOVERY - swift-container-auditor on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:33:02] Logged the message, Master [19:33:10] RECOVERY - Disk space on ms-be1001 is OK: DISK OK [19:33:10] RECOVERY - Disk space on ms-be1002 is OK: DISK OK [19:33:10] RECOVERY - swift-container-server on ms-be1001 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [19:33:10] RECOVERY - swift-object-auditor on ms-be1001 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [19:33:10] RECOVERY - DPKG on ms-be1003 is OK: All packages OK [19:33:10] RECOVERY - swift-account-server on ms-be1001 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [19:33:20] RECOVERY - swift-container-updater on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [19:33:30] RECOVERY - swift-account-auditor on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [19:33:40] RECOVERY - swift-object-server on ms-be1001 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [19:34:30] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 19:34:25 UTC 2013 [19:34:30] RECOVERY - Puppet freshness on ms-be1005 is OK: puppet ran at Thu Oct 3 19:34:26 UTC 2013 [19:34:30] RECOVERY - Puppet freshness on ms-be1004 is OK: puppet ran at Thu Oct 3 19:34:26 UTC 2013 [19:35:30] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [19:35:39] Reedy: thanks :) [19:39:00] RECOVERY - DPKG on ms-be1004 is OK: All packages OK [19:39:10] RECOVERY - RAID on ms-be1004 is OK: OK: State is Optimal, checked 1 logical device(s) [19:39:10] RECOVERY - Disk space on ms-be1005 is OK: DISK OK [19:39:40] RECOVERY - Puppet freshness on ms-be1006 is OK: puppet ran at Thu Oct 3 19:39:34 UTC 2013 [19:39:40] RECOVERY - Disk space on ms-be1004 is OK: DISK OK [19:39:41] RECOVERY - DPKG on ms-be1005 is OK: All packages 
OK [19:39:50] RECOVERY - RAID on ms-be1005 is OK: OK: State is Optimal, checked 1 logical device(s) [19:42:43] RECOVERY - Puppet freshness on ms-be1009 is OK: puppet ran at Thu Oct 3 19:42:36 UTC 2013 [19:42:53] RECOVERY - Puppet freshness on ms-be1008 is OK: puppet ran at Thu Oct 3 19:42:46 UTC 2013 [19:43:13] RECOVERY - Puppet freshness on ms-be1007 is OK: puppet ran at Thu Oct 3 19:43:11 UTC 2013 [19:43:17] checked 1 logical devices out of 14, I love this :) [19:43:53] RECOVERY - DPKG on ms-be1006 is OK: All packages OK [19:44:03] RECOVERY - RAID on ms-be1006 is OK: OK: State is Optimal, checked 1 logical device(s) [19:44:03] RECOVERY - Disk space on ms-be1006 is OK: DISK OK [19:45:33] RECOVERY - Puppet freshness on ms-be1010 is OK: puppet ran at Thu Oct 3 19:45:27 UTC 2013 [19:46:13] RECOVERY - swift-account-reaper on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [19:46:43] RECOVERY - swift-account-replicator on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [19:47:03] RECOVERY - DPKG on ms-be1008 is OK: All packages OK [19:47:04] RECOVERY - RAID on ms-be1008 is OK: OK: State is Optimal, checked 1 logical device(s) [19:47:04] RECOVERY - RAID on ms-be1009 is OK: OK: State is Optimal, checked 1 logical device(s) [19:47:33] RECOVERY - NTP on ms-be1009 is OK: NTP OK: Offset -0.0874838829 secs [19:47:33] RECOVERY - Disk space on ms-be1008 is OK: DISK OK [19:47:33] RECOVERY - NTP on ms-be1003 is OK: NTP OK: Offset -0.02688324451 secs [19:47:43] RECOVERY - Disk space on ms-be1009 is OK: DISK OK [19:47:53] RECOVERY - DPKG on ms-be1009 is OK: All packages OK [19:47:53] RECOVERY - NTP on ms-be1001 is OK: NTP OK: Offset -0.02662801743 secs [19:47:53] RECOVERY - Disk space on ms-be1007 is OK: DISK OK [19:48:13] RECOVERY - RAID on ms-be1007 is OK: OK: State is Optimal, checked 1 logical device(s) [19:48:13] RECOVERY - NTP on ms-be1002 is OK: NTP OK: Offset -0.02849555016 secs [19:48:23] RECOVERY - DPKG on ms-be1007 is OK: All packages OK [19:48:43] RECOVERY - NTP on ms-be1007 is OK: NTP OK: Offset 0.05657565594 secs [19:50:33] RECOVERY - DPKG on ms-be1010 is OK: All packages OK [19:50:43] RECOVERY - Disk space on ms-be1010 is OK: DISK OK [19:50:43] RECOVERY - RAID on ms-be1010 is OK: OK: State is Optimal, checked 1 logical device(s) [19:54:43] RECOVERY - NTP on ms-be1005 is OK: NTP OK: Offset -0.02548146248 secs [19:55:04] RECOVERY - NTP on ms-be1004 is OK: NTP OK: Offset -0.02514827251 secs [19:56:43] PROBLEM - Host ms-be1011 is DOWN: PING CRITICAL - Packet loss = 100% [19:56:43] PROBLEM - Host ms-be1012 is DOWN: PING CRITICAL - Packet loss = 100% [19:57:13] PROBLEM - Host ms-be1008 is DOWN: PING CRITICAL - Packet loss = 100% [19:57:23] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100% [19:57:23] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:57:23] PROBLEM - Host ms-be1009 is DOWN: PING CRITICAL - Packet loss = 100% [19:57:23] PROBLEM - Host ms-be1002 is DOWN: PING CRITICAL - Packet loss = 100% [19:57:33] PROBLEM - Host ms-be1010 is DOWN: PING CRITICAL - Packet loss = 100% [19:57:34] PROBLEM - Host ms-be1007 is DOWN: PING CRITICAL - Packet loss = 100% [19:57:34] PROBLEM - Host ms-be1004 is DOWN: PING CRITICAL - Packet loss = 100% [19:57:34] PROBLEM - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100% [19:57:43] PROBLEM - Host ms-be1006 is DOWN: PING CRITICAL - Packet loss = 100% [19:57:53] PROBLEM - Swift HTTP on ms-fe1001 is CRITICAL: 
Connection timed out [19:58:14] RECOVERY - Host ms-be1011 is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms [19:58:14] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [19:58:14] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [19:58:14] RECOVERY - Host ms-be1002 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [19:58:14] RECOVERY - Host ms-be1007 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [19:58:14] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [19:58:14] RECOVERY - Host ms-be1012 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [19:58:22] nooooooooooooooo [19:58:23] RECOVERY - Host ms-be1006 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [19:58:23] RECOVERY - Host ms-be1005 is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms [19:58:23] RECOVERY - Host ms-be1004 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [19:58:27] it's fine :) [19:58:32] Oh Lol [19:58:33] PROBLEM - Host ms-fe1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:58:35] ok [19:58:43] RECOVERY - Host ms-be1010 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [19:58:48] I just rebooted the whole cluster after reinstalling everything [19:58:53] RECOVERY - Host ms-be1009 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [19:59:03] PROBLEM - Host ms-fe1004 is DOWN: PING CRITICAL - Packet loss = 100% [19:59:03] PROBLEM - Host ms-fe1002 is DOWN: PING CRITICAL - Packet loss = 100% [19:59:04] PROBLEM - Host ms-fe1003 is DOWN: PING CRITICAL - Packet loss = 100% [20:00:03] RECOVERY - Host ms-fe1002 is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms [20:00:13] RECOVERY - Host ms-fe1001 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [20:00:13] RECOVERY - Host ms-fe1003 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [20:00:14] RECOVERY - Host ms-fe1004 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [20:05:27] ottomata: hey, did you give up on the meeting before I got there? [20:05:43] MEETING [20:05:44] NO [20:05:46] hi! [20:05:48] let's meet! [20:06:21] hm i have no reminder email [20:06:24] need hangout url [20:06:25] manybubbles: [20:11:13] PROBLEM - MySQL Processlist on db1021 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 0 copy to table, 724 statistics [20:12:13] RECOVERY - MySQL Processlist on db1021 is OK: OK 0 unauthenticated, 0 locked, 1 copy to table, 2 statistics [20:13:42] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [20:14:12] PROBLEM - NTP on ms-fe1004 is CRITICAL: NTP CRITICAL: Offset unknown [20:14:42] PROBLEM - NTP on ms-fe1002 is CRITICAL: NTP CRITICAL: Offset unknown [20:16:13] PROBLEM - MySQL Idle Transactions on db1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
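The wall of host-down/host-up alerts above is just the storage cluster coming back from a coordinated reboot after reinstall. A quick way to confirm from another box that every node is reachable again is a simple TCP sweep against the SSH port; a minimal sketch with the host list taken from the alerts (the domain suffix is an assumption):

```python
import socket

# Host names from the alerts above; the .eqiad.wmnet suffix is an assumption.
HOSTS = ["ms-be10%02d.eqiad.wmnet" % n for n in range(1, 13)] + \
        ["ms-fe10%02d.eqiad.wmnet" % n for n in range(1, 5)]

def ssh_reachable(host, timeout=3):
    """Return True if a TCP connection to port 22 succeeds within the timeout."""
    try:
        with socket.create_connection((host, 22), timeout=timeout):
            return True
    except OSError:
        return False

for host in HOSTS:
    print("%-25s %s" % (host, "up" if ssh_reachable(host) else "DOWN"))
```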
[20:16:42] PROBLEM - Swift HTTP on ms-fe1001 is CRITICAL: Connection refused [20:17:02] RECOVERY - NTP on ms-be1008 is OK: NTP OK: Offset -0.02320730686 secs [20:17:12] RECOVERY - MySQL Idle Transactions on db1021 is OK: OK longest blocking idle transaction sleeps for 0 seconds [20:17:12] RECOVERY - NTP on ms-be1010 is OK: NTP OK: Offset -0.01921904087 secs [20:18:12] PROBLEM - MySQL Processlist on db1021 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 11 copy to table, 302 statistics [20:18:42] RECOVERY - NTP on ms-fe1002 is OK: NTP OK: Offset 0.003898143768 secs [20:19:12] RECOVERY - MySQL Processlist on db1021 is OK: OK 0 unauthenticated, 0 locked, 7 copy to table, 3 statistics [20:19:13] RECOVERY - NTP on ms-fe1004 is OK: NTP OK: Offset -0.003332138062 secs [20:24:12] PROBLEM - MySQL Processlist on db1021 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 7 copy to table, 219 statistics [20:25:12] RECOVERY - MySQL Processlist on db1021 is OK: OK 0 unauthenticated, 0 locked, 4 copy to table, 1 statistics [20:27:10] !log krinkle synchronized php-1.22wmf19/resources/startup.js 'Attempt to fix bug 54935' [20:27:21] Logged the message, Master [20:33:18] is there going to be a west coast DC? [20:34:42] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 20:34:31 UTC 2013 [20:34:42] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [20:37:44] matanya_, there's going to be a caching centre in SF if that's what you mean# [20:37:47] . [20:38:55] that is Krenair. in Seattle, or LA? [20:38:57] or TBD? [20:39:34] um [20:39:52] I said SF [20:40:34] oh, missed that :( [20:40:43] thanks [20:42:07] !log krinkle synchronized php-1.22wmf19/extensions/VisualEditor/modules/ve/ve.Element.js 'touch for bug 54935' [20:42:19] Logged the message, Master [20:42:44] matanya, there's some pics on https://commons.wikimedia.org/wiki/Special:ListFiles/Mutante [20:42:47] ctrl+f 'ULSFO' [20:43:12] so that what ulsfo stands for! i was wondering [20:43:35] haha, I thoguht it was ULS related (language selector) first time I read it [20:43:35] https://commons.wikimedia.org/w/index.php?title=Special:Search&search=ULSFO&profile=images [20:43:54] yeas, me too :) [20:44:34] What DC is it? equinix? [20:44:40] matanya, wmf's datacentre names are the initials of the provider and nearest airport [20:44:50] https://commons.wikimedia.org/wiki/Category:Wikimedia_servers_in_San_Francisco says UnitedLayer [20:45:14] now i'm a bit smarter [20:45:24] legoktm: reminding you pywikibot :) [20:45:37] pmtpa = so UnitedLayer San FransiscO [20:45:46] woops, ignore the 'pmpta' bit of that message [20:45:56] Krenair: actually, SFO stands for San Francisco-Oakland [20:45:57] was typing out something else originally >_> [20:46:00] matanya: Anatomy of a data center name: ul = United Layer, sfo = nearest airport [20:46:18] legoktm, oh, okay. TIL. [20:46:22] Similarly, eqiad is an Equinix location near IAD (= Dulles International Airport near Washington DC) [20:46:24] :) [20:47:17] and pmpta is tampa, florida. pm is what provider? [20:47:38] and esams is something in amsterdam with some provider [20:47:45] evoswitch [20:47:57] it's even on [[wmf:Benefactors]] or something iirc [20:48:14] pmtpa = PowerMedium in TPA = Tampa, FL [20:48:20] esams is EvoSwitch in AMS [20:48:31] knams is the old Kennisnet location in AMS [20:48:35] And I forget what sdtpa is [20:49:03] used to be a 'lopar' cluster according to wikitech. 
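The naming scheme being explained here — provider initials plus nearest-airport code — is regular enough to capture in a couple of lines. A small illustrative decoder built only from the sites mentioned in this conversation; it is a toy lookup, not an authoritative registry:

```python
# Toy decoder for the "<provider><airport>" site-name convention discussed above.
PROVIDERS = {
    "eq": "Equinix",
    "ul": "UnitedLayer",
    "pm": "PowerMedium",
    "es": "EvoSwitch",
    "kn": "Kennisnet",
}
AIRPORTS = {
    "iad": "Washington Dulles",
    "sfo": "San Francisco",
    "tpa": "Tampa",
    "ams": "Amsterdam",
}

def describe(site):
    """Split a site name into its provider prefix and three-letter airport code."""
    provider, airport = site[:-3], site[-3:]
    return "%s near %s" % (PROVIDERS.get(provider, provider),
                           AIRPORTS.get(airport, airport.upper()))

for site in ("eqiad", "ulsfo", "pmtpa", "esams", "knams"):
    print(site, "=", describe(site))
```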
[20:49:04] legoktm: I doubt it, as Oakland has its own airport with its own airport code (OAK) [20:49:26] RoanKattouw: it's a historical thing, let me see if i can find anything talking about it [20:49:38] so regarding ulsfo, it will be serving only cache? [20:49:44] Although if SFO's airport code was assigned before OAK existed, that could make sense [20:50:15] <^d> RoanKattouw: sd was switch & data. [20:50:39] Aha [20:50:54] so what was lopar? [20:51:03] <^d> nobody remembers :p [20:51:12] <^d> I remember yaseo though. [20:51:16] I'm guessing Paris? [20:51:21] yaseo was Yahoo in Seoul [20:51:26] <^d> Yep, and it was bad. [20:51:32] I guess this was before we started obeying the airport code convention properly [20:51:35] http://wiki.answers.com/Q/What_does_the_O_in_SFO_stand_for is mixed [20:51:42] there's still documenation mentioning yaseo [20:51:44] Because that should have been yaicn probably [20:52:00] <^d> greg-g: Until recently-ish, we had plenty of yaseo references lingering around wmf-config. [20:52:06] <^d> I *think* we've finally purged those. [20:52:18] hah [20:52:21] 'great' [20:52:43] there's still plenty on the wikis [20:53:03] <^d> {{sofixit}} :) [20:54:05] ugh, so, speaking of hurting wrists, mine are right now [20:54:11] ^d: I already did https://meta.wikimedia.org/wiki/Wikimedia_servers ;) [20:54:14] * greg-g goes to take a shower as a break [20:54:15] is there any a bird fly view of all infra from DC level to application level exect puppet config? [20:54:33] yeah, can't find 'yaseo' in wmf-config [20:54:34] matanya: nothing that's accurate [20:54:42] I suspect this doesn't work that well https://wikitech.wikimedia.org/wiki/Squid_checker [20:54:43] that i found out [20:55:19] lopar :) [20:56:42] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 342 seconds [20:56:58] presumably par is for paris, what does lo mean? [20:57:42] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay -0 seconds [20:58:08] -0 seconds. huh. [20:58:20] lelo https://wikitech.wikimedia.org/wiki/Lopar [20:58:40] or Lost Oasis [21:04:52] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 21:04:43 UTC 2013 [21:05:32] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [21:08:40] Ryan_Lane: mind reviewing https://gerrit.wikimedia.org/r/#/c/87332/ please ? [21:32:18] (03CR) 10Ryan Lane: [C: 032] drac: convert into a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/87332 (owner: 10Matanya) [21:32:21] (03PS7) 10Ryan Lane: drac: convert into a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/87332 (owner: 10Matanya) [21:34:19] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 21:34:15 UTC 2013 [21:34:39] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [21:35:12] Ryan_Lane: what is that drac thing? [21:35:16] (03CR) 10Ryan Lane: [C: 032] drac: convert into a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/87332 (owner: 10Matanya) [21:35:40] (03PS1) 10Faidon Liambotis: swift: switch eqiad to tempauth backend [operations/puppet] - 10https://gerrit.wikimedia.org/r/87452 [21:39:35] Ryan_Lane: I'm planning to convert some more lone .pp files into modules. any hints, suggestions or comments before I start this effort? 
[21:39:58] see the puppet todo effort [21:40:50] (03CR) 10Faidon Liambotis: [C: 032] swift: switch eqiad to tempauth backend [operations/puppet] - 10https://gerrit.wikimedia.org/r/87452 (owner: 10Faidon Liambotis) [21:41:17] thanks paravoid [21:44:20] oh puppet... [21:44:30] mutante: can you please review : https://gerrit.wikimedia.org/r/#/c/86818/ [21:48:49] ^d - lvs love ? [21:49:04] <^d> Lez do ittttt :) [21:50:32] paravoid: can we please please convert the roles into modules now? [21:50:46] why are you asking me? :) [21:50:46] puppetd -tv --modulepath=/etc/puppet/modules [21:50:51] guess why that fails? [21:50:56] why? [21:50:59] when trying not to use puppetmaster::self [21:51:07] use --manifestdir ? :) [21:51:09] because the damn role that needs to be called isn't a module [21:51:13] tried that [21:51:29] it won't be just the role that fails though... [21:51:37] * Ryan_Lane sighs [21:51:38] right [21:51:45] but sure, no objection from me [21:51:50] why roles aren't modules too ? [21:51:50] modules/role ? [21:51:52] well, in this case it would onlt be the roles [21:52:02] someone objected to moving the roles now [21:52:05] I thought it was you :D [21:52:14] could be [21:52:20] I think it's going to be a bit messy [21:52:23] won't it be better if all roles were modules? [21:52:25] because the role I'm trying to call is specifically only referencing a module [21:52:25] but if you're up for it :) [21:52:50] roles are on the top of the chain, so all classes need to be defined first [21:52:53] how are we going to do that? [21:53:12] maybe site.pp's import on the top will do it though [21:53:13] not sure [21:53:28] heh [21:53:30] this is a pain [21:54:09] sorry to interapt, I don't get the chain logic [21:54:32] <^d> LeslieCarr: What do you need me to do? [21:54:35] we go site.pp -> role::[...] -> module or manifest [21:54:48] (usually) [21:54:54] why would roles be above modules? [21:56:22] matanya: because roles are meant to be small chunks of reusable code [21:56:23] matanya: roles can be used to collect serveral modules together [21:56:27] and roles are meant to tie them together [21:56:28] well we just need to monitor [21:56:29] so… let's do it :) [21:56:48] won't it be better: site.pp -> role-module (e.g. mysql-role) -> setup modules (e.g mysql contains : install, monitor, config) -> manifests [21:56:49] paravoid: oh well, for now I'll just use puppetmaster-self [21:56:50] <^d> I don't know what to monitor, but ok :) [21:57:01] puppet-master self actually fucks up what I'm trying to test, though [21:58:27] Well, i get that, but not the reason behind [21:59:00] !log restarting pybal on lvs1006 [21:59:12] Logged the message, Mistress of the network gear. [21:59:13] so far looks good [21:59:26] !log restarting pybal on lvs1006 [21:59:37] Logged the message, Mistress of the network gear. [21:59:50] !log restarting pybal on lvs1003 (not 1006) [21:59:59] RECOVERY - Host search.svc.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [22:00:00] Logged the message, Mistress of the network gear. [22:00:19] and look at that :) [22:00:21] huzzah [22:00:25] ^d how's it look to you ? 
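The complaint in the exchange above is that the site.pp -> role::... -> module chain breaks as soon as puppet is restricted to --modulepath, because the role classes do not live under modules/. A small sketch of a sanity check for that situation: list the role classes referenced in site.pp and report which ones have no corresponding file under a modules/role tree. The paths and the regex are assumptions for illustration:

```python
import os
import re
import sys

SITE_PP = "/etc/puppet/manifests/site.pp"               # assumed location
ROLE_MODULE_DIR = "/etc/puppet/modules/role/manifests"  # where roles would live as a module

# Find "role::foo::bar" style class references in site.pp.
with open(SITE_PP) as fh:
    roles = sorted(set(re.findall(r"\brole::([a-z0-9_:]+)", fh.read())))

missing = []
for role in roles:
    # role::foo::bar -> modules/role/manifests/foo/bar.pp under the autoloader layout.
    rel = role.replace("::", "/") + ".pp"
    if not os.path.exists(os.path.join(ROLE_MODULE_DIR, rel)):
        missing.append(role)

print("%d role classes referenced, %d without a module file" % (len(roles), len(missing)))
for role in missing:
    print("  role::%s" % role)
sys.exit(1 if missing else 0)
```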
[22:00:58] LeslieCarr: 60s later and I wouldn't have gotten the page :) [22:01:02] sorry [22:01:03] hehe [22:01:10] i guess muting doesn't mute the up report [22:01:12] same applies to our friend from the netherlands :) [22:01:27] well, i did want to ask hima question [22:01:27] http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=stafford.pmtpa.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2&st=1380837607&g=cpu_report&z=large&c=Miscellaneous%20pmtpa [22:01:31] fun [22:01:34] ;) [22:01:40] haha [22:01:46] we really need to get our puppetmaster moved [22:01:47] "why is puppet slow" [22:01:48] moved/split [22:02:48] <^d> LeslieCarr: Looking good! [22:03:08] I think ulsfo might have thrown it over the edge [22:03:33] <^d> LeslieCarr: `curl -XGET 'http://search.svc.eqiad.wmnet:9200/_cluster/health/?pretty=true'` from tin is returning proper json :) [22:03:49] (03PS1) 10Lcarr: giving search-pool5 proper dns [operations/dns] - 10https://gerrit.wikimedia.org/r/87462 [22:04:09] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 22:03:59 UTC 2013 [22:04:35] (03PS1) 10Lcarr: search-pool5 was in twice [operations/puppet] - 10https://gerrit.wikimedia.org/r/87464 [22:04:38] ^d success! [22:04:39] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [22:04:43] (03CR) 10Lcarr: [C: 032] giving search-pool5 proper dns [operations/dns] - 10https://gerrit.wikimedia.org/r/87462 (owner: 10Lcarr) [22:07:13] ^d: there's a user who may propose it.wiki to ask to be among the CirrusSearch "pioneers", would such a request be considered? (sooner or later) [22:08:12] <^d> Sooner rather than later, but couldn't give a firm date just yet. [22:12:29] (03CR) 10Lcarr: [C: 032] search-pool5 was in twice [operations/puppet] - 10https://gerrit.wikimedia.org/r/87464 (owner: 10Lcarr) [22:13:36] (03PS1) 10Ryan Lane: Initial commit of labs_vmbuilder module [operations/puppet] - 10https://gerrit.wikimedia.org/r/87465 [22:14:06] <^demon> Stupid IRC. [22:14:09] <^demon> LeslieCarr: Do we have to start the other lvs machines, or was the one enough? [22:14:15] <^demon> (Sorry if you got that twice, I got bounced) [22:14:18] You always do two [22:14:25] lvs100N and lvs100{N+3} [22:14:36] yep [22:14:38] got the both [22:14:43] the primary and backup [22:14:45] and now, profit! [22:14:58] heh i guess could have done that before lunch, just when shit goes wrong, it can go so wrong [22:15:04] <^demon> :) [22:15:11] <^demon> Well I think we're all set. Thanks for your help! [22:15:35] yw [22:16:01] TimStarling: The race condition in our deployment script hit us again today, until until after I wrote https://bugzilla.wikimedia.org/show_bug.cgi?id=54935#c4 I realised how similar it was to what we discussed a few months back. [22:16:13] s/until until/not until/ [22:20:13] (03PS2) 10Ryan Lane: Initial commit of labs_vmbuilder module [operations/puppet] - 10https://gerrit.wikimedia.org/r/87465 [22:20:48] (03PS3) 10Faidon Liambotis: Shell access for Bryan Davis. [operations/puppet] - 10https://gerrit.wikimedia.org/r/86764 (owner: 10BryanDavis) [22:20:55] (03CR) 10Faidon Liambotis: [C: 032] Shell access for Bryan Davis. [operations/puppet] - 10https://gerrit.wikimedia.org/r/86764 (owner: 10BryanDavis) [22:21:06] bd808, congrats!:P [22:21:23] w00t! [22:21:29] not yet :) [22:21:37] * paravoid waits for jenkins [22:21:49] Things are moving in the right direction [22:22:36] then you'll need to wait up to 30' for puppet runs :) [22:22:50] 30 feet! 
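The curl check above against the new search service endpoint can be wrapped into something a script can act on: fetch the cluster health document and fail unless the status is green (treating yellow as a warning). A minimal sketch using the same internal URL, which is of course only reachable from inside the cluster:

```python
import json
import sys
import urllib.request

HEALTH_URL = "http://search.svc.eqiad.wmnet:9200/_cluster/health/?pretty=true"

with urllib.request.urlopen(HEALTH_URL, timeout=10) as resp:
    health = json.load(resp)

status = health.get("status", "unknown")
print("cluster %s: status=%s, nodes=%s, unassigned_shards=%s" % (
    health.get("cluster_name"), status,
    health.get("number_of_nodes"), health.get("unassigned_shards")))

# Exit 0 on green, 1 on yellow, 2 on red or anything unexpected.
sys.exit({"green": 0, "yellow": 1}.get(status, 2))
```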
[22:22:52] but, no humans involved anymore! [22:22:52] that's a long ways! [22:22:58] blame the machines now [22:23:08] bd808: stand back 30 feet, please [22:23:10] paravoid: and I'll need to find somebody to show me where to go and what to do [22:23:21] who needs a deployment lesson?:P [22:23:23] * bd808 walks to the other side of his office [22:23:33] @$$ [22:23:40] * greg-g self-censors, but not really [22:23:41] <^demon> bd808: So rule number one of deploying is type scap and press enter. [22:23:43] <^demon> Many times. [22:23:45] <^demon> ;-) [22:24:02] it will work as slow as puppet [22:24:10] * bd808 takes notes [22:24:18] well, it depends on what you are deploying, of course [22:24:27] if you are deploying things like parsoid, it's saner ;) [22:24:29] scap is always slow [22:24:33] bd808: also, our deployment system teaches you to ignore all warnings and errors, get a head start on that one. [22:24:55] MaxSem: I was talking about repos deployed using git-deploy ;) [22:25:16] we just have to fix git first, you know, easy stuff [22:25:20] greg-g, sometimes it even teaches you the names of rsync source files [22:25:56] greg-g: it works fine for repos that don't generate 800MB of data that needs to get deployed first ;) [22:26:11] Ryan_Lane: details details [22:26:13] :D [22:26:16] :) [22:27:01] Ryan_Lane, no wanna hear about yer vaporware as long as I can't use it myself for MW;) [22:28:21] it's not vaporware for at least 4 repos [22:28:50] but yeah, hopefully MW soon too :) [22:29:14] AaronSchulz: around? [22:29:27] yes [22:29:35] swift @ eqiad is ready [22:29:41] the setup that is, not the contents :) [22:29:51] I'm about to start repl [22:30:24] I'll start with enwiki originals I think, unless you have another preference :) [22:31:13] sounds fine [22:31:41] I switched to mw:media [22:31:50] I also switched to tempauth, so the account will now be AUTH_mw [22:34:41] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 22:34:37 UTC 2013 [22:35:41] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [22:38:35] icinga-wm: deliver out of order much? [22:39:11] !log mflaschen Started syncing Wikimedia installation... : Deploy GettingStarted for growth team [22:39:23] Logged the message, Master [22:50:26] !log mflaschen Finished syncing Wikimedia installation... : Deploy GettingStarted for growth team [22:50:41] Logged the message, Master [22:50:42] !log starting swiftrepl process for pmtpa->eqiad (originals) [22:50:55] Logged the message, Master [22:51:10] !log swiftrepl running in a screen on carbon [22:51:21] James_F, all yours [22:51:24] Logged the message, Master [22:51:43] RoanKattouw: superm401 is done [22:52:01] superm401: Thanks! [22:52:17] almost filled up carbon's gigabit [22:52:18] cool [22:52:19] No problem, thanks for letting us go over, greg-g. [22:52:48] I love originals, they're performing so well [22:53:43] (03CR) 10Andrew Bogott: [C: 031] "(2 comments)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/87465 (owner: 10Ryan Lane) [22:54:32] greg-g: Thanks. I'm now ready yet (and it's not time yet) so anyone else who has anything to deploy in this window can go before me [22:54:37] *not ready yet [22:54:46] andrewbogott: Exec { path => '/bin' } [22:54:52] er [22:54:55] that was copper, not carbon [22:54:56] dammit [22:55:05] that's only "global" in the same scope, right? [22:55:09] I really need sleep I guess [22:55:22] can I just edit SAL? [22:55:28] Ryan_Lane: Not sure. 
If it only applies class-wide then it's cool... [22:55:55] paravoid: do you know? [22:56:06] know what? [22:56:09] (03PS1) 10Edenhill: Updated for new librdkafka API. [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/87472 [22:56:38] paravoid: http://www.puppetcookbook.com/posts/set-global-exec-path.html <- we're wondering how 'global' the setting is. Just for the class it's set in? [22:57:02] I don't know [22:57:36] (03PS1) 10Matanya: bastion: convert into a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/87473 [22:58:06] sorry [22:58:41] * Ryan_Lane nods [22:59:05] Could test it! But just now I have to run :( [22:59:25] AaronSchulz: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=copper.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=network_report&c=Miscellaneous+eqiad [22:59:53] andrewbogott: http://docs.puppetlabs.com/puppet/latest/reference/lang_defaults.html [23:00:06] it's only within the area of effect [23:00:27] * paravoid is happy [23:00:36] ugh [23:01:16] paravoid: \o/ for copying! [23:01:32] Ah, cool. [23:01:55] dynamic scope is used for resource defaults [23:02:03] * bd808 is still waiting for puppet to run on bast1001 [23:02:04] bd808: you are now second on my immediate TODO list [23:02:05] but it should be fine, since it doesn't include anything [23:02:21] your request is important to us [23:02:22] puppet is such a gigantic piece of shit [23:02:28] paravoid: excellent. [23:02:30] anyone wants to review that ^ ? [23:02:40] * andrewbogott is off, and will be out tomorrow [23:03:14] (03CR) 10Ryan Lane: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/87465 (owner: 10Ryan Lane) [23:03:26] (03PS3) 10Ryan Lane: Initial commit of labs_vmbuilder module [operations/puppet] - 10https://gerrit.wikimedia.org/r/87465 [23:03:37] * paravoid uses a robotic voice [23:03:47] (03CR) 10Ryan Lane: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/87465 (owner: 10Ryan Lane) [23:04:11] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 23:04:06 UTC 2013 [23:04:28] paravoid: copper? heh, the old test server name :) [23:04:41] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [23:05:01] yeah, I found that ironic too :) [23:10:34] greg-g: OK, back from battery disaster now, working on deployment branch merges for my LD, deploying in a bit [23:11:20] eek? 
[23:11:34] oh, I see aaron's note [23:11:36] yuck [23:11:36] I had loaner my charger to Erik B, and was confident I could make it through the day [23:11:43] heh [23:11:44] But I wasn't watching the battery indicator thing [23:11:52] shouldn't been recompiling things [23:11:53] So I have a 4pm LD, and then at 4:03pm my laptop just shuts down [23:12:14] I guess hooking up an external monitor reduces battery life, who knew :) [23:24:06] !log catrope synchronized php-1.22wmf19/extensions/VisualEditor 'Update VisualEditor for cherry-picks' [23:24:16] Logged the message, Master [23:24:21] !log catrope synchronized php-1.22wmf20/extensions/VisualEditor 'Update VisualEditor for cherry-picks' [23:24:30] Logged the message, Master [23:34:21] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 23:34:16 UTC 2013 [23:34:27] http://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&c=Swift+pmtpa&h=ms-be1.pmtpa.wmnet&jr=&js=&v=67.0&m=part_max_used&vl=%25&ti=Maximum+Disk+Space+Used [23:34:30] what the hell [23:34:41] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [23:35:11] 4% in a month [23:35:31] well, it's WLM month [23:35:45] that is true [23:35:53] it seems to stabilize a bit now [23:36:06] is 4 % reasonable for 1287894.7 MB? https://toolserver.org/~emijrp/wlm/stats.php [23:36:29] paravoid: I was looking at the swift code at home last night, just thinking about sqlite :) [23:36:57] I forget that the log->container flushing was triggered on any container listing query [23:37:24] which means there must be lots of flushes to containers (even with sharding), meaning more fsyncs since less stuff is batched in a trx [23:38:38] * AaronSchulz pondered making that logic do a non-locking check to see if the log is either big or X seconds old and only flush then...or maybe ignore the flush if the lock on the log file directory could not be acquired without waiting [23:39:06] I mean listings are eventually consistent anyway...but then I though that would all be too evil [23:40:09] if only you could have some magic data structure that is both like a tree and a log and the log changes would be seen in queries...almost like...an LSM tree....like that thing leveldb uses...that ceph rgw uses [/trolling] [23:40:34] paravoid: anyway, getting thumbs mostly cdn only would help I guess [23:40:54] the only listings we do are (a) scripts, (b) captcha, and (c) thumbnail purge listings [23:41:21] yep [23:41:24] AaronSchulz: troll [23:41:25] and (b) doesn't matter since there are no updates (though swift still locks/unlocks the dir to check instead of doing an optimistic check first...cough) [23:41:25] I was pointing that out to bd808 [23:42:31] https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=copper.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=network_report&c=Miscellaneous+eqiad [23:43:49] copy all the things! [23:44:59] paravoid: bast1001 still doesn't like my public key. :( [23:45:38] lemme have a look [23:46:00] ESC[0;36mnotice: /Stage[main]/Accounts::Bd808/Unixaccount[Bryan Davis]/User[bd808]/ensure: createdESC[0m [23:46:03] ESC[0;36mnotice: /Stage[main]/Accounts::Bd808/Ssh_authorized_key[bd808+cluster8@wmf-bd808-mbp01.local]/ensure: createdESC[0m [23:46:31] error: key_read: uudecode [23:46:36] key is invalid [23:46:37] frack [23:46:54] First char of my pub key is missing form commit [23:47:14] diff? 
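The `key_read: uudecode` failure here is the classic symptom of a public key whose base64 blob got truncated on its way into the repo. A quick local check that would catch this before a puppet run: decode the blob and make sure the algorithm name embedded in it matches the `ssh-rsa`/`ssh-ed25519` prefix on the line. A sketch, not a full validator:

```python
import base64
import struct
import sys

def check_pubkey_line(line):
    """Return None if the key line looks sane, else a description of the problem."""
    parts = line.strip().split()
    if len(parts) < 2:
        return "expected '<type> <base64-blob> [comment]'"
    keytype, blob = parts[0], parts[1]
    try:
        raw = base64.b64decode(blob, validate=True)
    except Exception as exc:
        return "base64 decode failed: %s" % exc
    if len(raw) < 4:
        return "decoded blob is too short"
    # The blob starts with a length-prefixed string naming the algorithm;
    # it must match the declared key type on the line.
    (length,) = struct.unpack(">I", raw[:4])
    embedded = raw[4:4 + length].decode("ascii", "replace")
    if embedded != keytype:
        return "embedded type %r does not match declared %r" % (embedded, keytype)
    return None

if __name__ == "__main__":
    with open(sys.argv[1]) as fh:
        for lineno, line in enumerate(fh, 1):
            if not line.strip() or line.startswith("#"):
                continue
            problem = check_pubkey_line(line)
            print("line %d: %s" % (lineno, problem or "looks ok"))
```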
[23:47:30] * bd808 goes to fix it [23:47:31] we need to ensure => absent the old one too now [23:47:36] hm, maybe not [23:50:32] paravoid: The key in puppet is missing a single 'A' at the start. Should have 4, not 3. [23:50:51] are you submitting or should I? [23:50:55] Do I need to ensure absent the corrupt one and add a correct one? [23:51:08] or just fix in place? [23:51:21] fix in place?!? [23:51:48] I think fix in place because of the comment [23:51:53] but we'll see [23:52:04] paravoid: ok. patch comming [23:52:11] thanks :) [23:53:19] (03PS1) 10BryanDavis: Correct invalid public key for bd808. [operations/puppet] - 10https://gerrit.wikimedia.org/r/87484 [23:53:20] (03CR) 10Ryan Lane: [C: 032] Initial commit of labs_vmbuilder module [operations/puppet] - 10https://gerrit.wikimedia.org/r/87465 (owner: 10Ryan Lane) [23:53:42] Coren: ^^ [23:53:56] Coren: if you want to change the images we use, that's the module to do it in, now [23:54:04] I'm going to add some docs on this soon, too [23:54:08] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Correct invalid public key for bd808. [operations/puppet] - 10https://gerrit.wikimedia.org/r/87484 (owner: 10BryanDavis) [23:55:35] I started commons [23:55:36] it's 33T [23:55:39] it'll take a while :) [23:57:01] Ironically the key is correct on my office User page. If only code review had been more thorough. :) [23:57:55] * AaronSchulz stabs http://tracker.ceph.com/issues/6462 ... ruining my unit tests [23:58:54] I guess I should be happy we ditched swift ;) [23:58:55] er [23:58:56] ceph [23:58:57] * AaronSchulz maybe should actually set up swift in some vms [23:59:17] paravoid: but ceph is so much sexier [23:59:19] AaronSchulz: you should add a swift role to vagrant! [23:59:53] like 'ceph pg dump' shows all kind of cool stuff...does swift have that?
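"It'll take a while" for the 33T of commons originals can be put in rough numbers: at the near-line-rate gigabit seen on the copying host earlier in the evening, the raw transfer alone is several days of wall clock, before counting per-object request overhead. A throwaway back-of-the-envelope calculation (the 80% sustained-utilisation figure is an assumption):

```python
# Back-of-the-envelope: how long does 33 TB take over a single gigabit link?
TOTAL_BYTES = 33 * 1000**4          # "33T", read as 33 terabytes
LINK_BITS_PER_S = 1 * 1000**3       # one gigabit per second
UTILISATION = 0.8                   # assumption: ~80% of line rate sustained

bytes_per_s = LINK_BITS_PER_S / 8 * UTILISATION
seconds = TOTAL_BYTES / bytes_per_s
print("~%.1f days at %.0f MB/s sustained" % (seconds / 86400, bytes_per_s / 1e6))
# Ignores per-object request overhead, which for millions of small originals
# can easily matter more than raw bandwidth.
```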