[00:07:35] (03CR) 10Chad: [C: 032] Cirrus: Remove commented officewiki, cawiki to primary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87157 (owner: 10Chad)
[00:07:44] (03Merged) 10jenkins-bot: Cirrus: Remove commented officewiki, cawiki to primary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87157 (owner: 10Chad)
[00:08:17] (03PS1) 10Chad: Revert "Cirrus: Remove commented officewiki, cawiki to primary" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87295
[00:08:26] (03CR) 10Chad: [C: 032 V: 032] Revert "Cirrus: Remove commented officewiki, cawiki to primary" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87295 (owner: 10Chad)
[00:11:16] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[00:11:37] !log rebuilding search indices on english wikis after CirrusSearch deploy that updated English configuration
[00:11:52] Logged the message, Master
[00:34:46] !log all english indices rebuilt except enwikisource. that one will take a while.
[00:34:56] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 00:34:53 UTC 2013
[00:35:00] Logged the message, Master
[00:35:16] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[01:03:46] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 01:03:42 UTC 2013
[01:04:16] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[01:33:57] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 01:33:53 UTC 2013
[01:34:07] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[01:39:14] PROBLEM - RAID on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:39:24] PROBLEM - twemproxy process on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:40:04] RECOVERY - RAID on mw1125 is OK: OK: no RAID installed
[01:41:14] RECOVERY - twemproxy process on mw1125 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker
[01:42:34] PROBLEM - Disk space on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:42:44] PROBLEM - DPKG on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:43:34] RECOVERY - DPKG on mw1125 is OK: All packages OK
[01:46:44] PROBLEM - DPKG on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:47:24] PROBLEM - twemproxy process on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:47:25] RECOVERY - Disk space on mw1125 is OK: DISK OK
[01:47:34] RECOVERY - DPKG on mw1125 is OK: All packages OK
[01:48:14] PROBLEM - RAID on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:50:04] RECOVERY - RAID on mw1125 is OK: OK: no RAID installed
[01:50:14] RECOVERY - twemproxy process on mw1125 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker
[01:52:34] PROBLEM - Disk space on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:52:44] PROBLEM - DPKG on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:53:14] PROBLEM - RAID on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:53:25] RECOVERY - Disk space on mw1125 is OK: DISK OK
[01:53:34] RECOVERY - DPKG on mw1125 is OK: All packages OK
[01:54:14] RECOVERY - RAID on mw1125 is OK: OK: no RAID installed
[01:56:44] PROBLEM - DPKG on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:57:34] RECOVERY - DPKG on mw1125 is OK: All packages OK
[01:58:15] PROBLEM - RAID on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:59:04] RECOVERY - RAID on mw1125 is OK: OK: no RAID installed
[01:59:24] PROBLEM - twemproxy process on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:00:34] PROBLEM - Disk space on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:00:44] PROBLEM - DPKG on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:01:14] RECOVERY - twemproxy process on mw1125 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker
[02:01:25] RECOVERY - Disk space on mw1125 is OK: DISK OK
[02:01:34] RECOVERY - DPKG on mw1125 is OK: All packages OK
[02:04:14] PROBLEM - RAID on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:05:04] RECOVERY - RAID on mw1125 is OK: OK: no RAID installed
[02:05:24] PROBLEM - twemproxy process on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:07:34] PROBLEM - Disk space on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:08:44] PROBLEM - DPKG on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:10:14] RECOVERY - twemproxy process on mw1125 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker
[02:10:44] RECOVERY - DPKG on mw1125 is OK: All packages OK
[02:11:06] !log on mw1125: server very slow due to disk issues, stopped apache, will shut down
[02:11:14] PROBLEM - RAID on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:11:25] Logged the message, Master
[02:12:44] PROBLEM - Apache HTTP on mw1125 is CRITICAL: Connection refused
[02:13:24] PROBLEM - twemproxy process on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:14:15] RECOVERY - twemproxy process on mw1125 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker
[02:15:44] PROBLEM - DPKG on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:16:04] RECOVERY - RAID on mw1125 is OK: OK: no RAID installed
[02:16:25] RECOVERY - Disk space on mw1125 is OK: DISK OK
[02:16:34] RECOVERY - DPKG on mw1125 is OK: All packages OK
[02:20:48] !log LocalisationUpdate completed (1.22wmf19) at Thu Oct 3 02:20:48 UTC 2013
[02:21:01] Logged the message, Master
[02:23:44] PROBLEM - DPKG on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:25:34] RECOVERY - DPKG on mw1125 is OK: All packages OK
[02:30:14] PROBLEM - RAID on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:32:24] PROBLEM - twemproxy process on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:33:04] RECOVERY - RAID on mw1125 is OK: OK: no RAID installed
[02:34:14] RECOVERY - twemproxy process on mw1125 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker
[02:35:56] !log LocalisationUpdate completed (1.22wmf18) at Thu Oct 3 02:35:56 UTC 2013
[02:36:08] PROBLEM - Host mw1125 is DOWN: PING CRITICAL - Packet loss = 100%
[02:36:09] Logged the message, Master
[02:59:41] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Oct 3 02:59:41 UTC 2013
[02:59:57] Logged the message, Master
[03:06:22] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[03:20:23] <^d> TimStarling: Have you seen an error like http://paste.tstarling.com/p/dchlft.html before? Google seems to suggest it might happen if session.save_path isn't writeable, but it should most certainly be. It's happening to me at the end of running the full MW phpunit suite, no other times that I see.
[03:20:32] <^d> (This is happening on zend, not hhvm)
[03:24:49] ^d: I haven't seen it before
[03:25:19] you could check the source that generates those error messages
[03:25:58] <^d> I'll do that.
[03:27:32] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000
[03:30:42] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:35:02] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 03:34:54 UTC 2013
[03:35:22] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[04:06:40] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[04:34:30] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 04:34:23 UTC 2013
[04:34:40] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[04:57:14] !log csteipp Started syncing Wikimedia installation... :
[04:57:32] Logged the message, Master
[05:06:19] PROBLEM - search indices - check lucene status page on search20 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 60051 bytes in 0.121 second response time
[05:07:43] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[05:12:13] PROBLEM - search indices - check lucene status page on search19 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 60051 bytes in 0.112 second response time
[05:34:13] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 05:34:03 UTC 2013
[05:34:43] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[05:36:42] !log on searchidx2: switched java procs to "idle" ionice class, to improve scap time
[05:36:59] Logged the message, Master
[05:40:16] !log olivneh synchronized wmf-config/Bug54847.php
[05:40:26] Logged the message, Master
[05:41:48] !log csteipp Finished syncing Wikimedia installation... :
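The searchidx2 !log above mentions switching the Java processes to the "idle" I/O scheduling class so that scap traffic is not starved for disk. A minimal sketch of that kind of change, using the util-linux `ionice` tool (class 3 = idle) from Python; matching processes by the string "java" is an assumption for illustration, not the exact command that was run on searchidx2.

```python
"""Hedged sketch: move matching processes into the 'idle' I/O class
(ionice -c 3), as described in the searchidx2 !log above. Run as root;
the "java" pattern is an assumed example, not the actual invocation."""
import subprocess


def pids_of(pattern):
    # pgrep -f prints one PID per line; exit status 1 / empty output means no match
    out = subprocess.run(["pgrep", "-f", pattern],
                         capture_output=True, text=True).stdout
    return [int(p) for p in out.split()]


def set_idle_ionice(pattern="java"):
    for pid in pids_of(pattern):
        # class 3 (idle): the process only gets disk time when nothing else wants it
        subprocess.run(["ionice", "-c", "3", "-p", str(pid)], check=True)


if __name__ == "__main__":
    set_idle_ionice()
```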
[05:41:57] Logged the message, Master
[05:47:36] !log olivneh synchronized wmf-config/InitialiseSettings.php 'wmgBug54847: default => true, private => false'
[05:47:49] Logged the message, Master
[05:49:07] !log olivneh synchronized wmf-config/CommonSettings.php 'if ( $wmgBug54847 && $wmgUseCentralAuth )'
[05:49:19] Logged the message, Master
[05:51:56] PROBLEM - Puppet freshness on virt1000 is CRITICAL: No successful Puppet run in the last 10 hours
[06:06:33] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[06:14:06] !log olivneh synchronized wmf-config/Bug54847.php
[06:14:17] Logged the message, Master
[06:26:42] springle: is there someone else who should look at https://gerrit.wikimedia.org/r/#/c/87168/2 ?
[06:29:55] Aaron|home: don't know? should I be +2'ing mediawiki core stuff?
[06:30:19] you did with the last one :)
[06:30:36] lol did I
[06:30:39] ok
[06:31:37] springle: did you get a chance to look at that UNION query?
[06:32:28] Aaron|home: not yet. will respond shortly
[06:33:01] no rush
[06:35:43] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 06:35:40 UTC 2013
[06:36:33] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[06:41:06] (03PS10) 10Ori.livneh: Hooks to force password reset for some users [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87285 (owner: 10CSteipp)
[06:41:07] (03PS1) 10Ori.livneh: Enable $wmgBug54847 for default & private, but gate also on $wmgUseCentralAuth [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87318
[06:41:08] (03PS1) 10Ori.livneh: Do not show password reset interface unless password is correct [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87319
[06:41:16] ^ TimStarling
[06:41:20] FYI
[06:41:33] all of these are already on tin & synced, so I just self-merge them, yeah?
[06:42:08] (03CR) 10Tim Starling: [C: 031] Do not show password reset interface unless password is correct [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87319 (owner: 10Ori.livneh)
[06:42:33] i'll do that from home
[06:47:12] ori-l: yes
[06:47:25] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: No successful Puppet run in the last 10 hours
[06:47:25] PROBLEM - Puppet freshness on ms-be1012 is CRITICAL: No successful Puppet run in the last 10 hours
[06:54:25] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: No successful Puppet run in the last 10 hours
[06:54:25] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: No successful Puppet run in the last 10 hours
[06:59:25] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: No successful Puppet run in the last 10 hours
[06:59:25] PROBLEM - Puppet freshness on ms-be1004 is CRITICAL: No successful Puppet run in the last 10 hours
[07:00:25] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: No successful Puppet run in the last 10 hours
[07:03:25] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: No successful Puppet run in the last 10 hours
[07:05:25] PROBLEM - Puppet freshness on ms-be1008 is CRITICAL: No successful Puppet run in the last 10 hours
[07:06:25] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: No successful Puppet run in the last 10 hours
[07:10:13] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[07:10:23] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: No successful Puppet run in the last 10 hours
[07:34:33] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 07:34:30 UTC 2013
[07:35:13] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[08:02:06] !log Bug 54847: e-mail script completed successfully
[08:02:23] Logged the message, Master
[08:09:52] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[08:34:52] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 08:34:50 UTC 2013
[08:35:52] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[10:07:41] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[10:10:31] mark ping
[10:12:56] mark, i didn't have a good connection yesterday until late, let me know if you want to try ESI in the next few hrs
[10:14:08] (03PS1) 10Mark Bergsma: Add ulsfo IPv6 transfer nets [operations/dns] - 10https://gerrit.wikimedia.org/r/87326
[10:14:47] (03CR) 10Mark Bergsma: [C: 032] Add ulsfo IPv6 transfer nets [operations/dns] - 10https://gerrit.wikimedia.org/r/87326 (owner: 10Mark Bergsma)
[10:15:01] yeah we can do that
[10:15:25] in about 30 mins I think
[10:16:31] mark, ok, ping me here
[10:21:41] PROBLEM - MySQL Idle Transactions on db1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:23:31] RECOVERY - MySQL Idle Transactions on db1016 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[10:26:56] (03PS2) 10Mark Bergsma: Enable ESI processing for the Testing carrier range [operations/puppet] - 10https://gerrit.wikimedia.org/r/86258
[10:29:34] (03PS3) 10Mark Bergsma: Enable ESI processing for the Testing carrier range [operations/puppet] - 10https://gerrit.wikimedia.org/r/86258
[10:30:20] yurik_: is PS3 to your liking?
[10:30:35] mark, link?
[10:30:40] we can put the FORCE-ESI header in zero, but there's no if clause for -TEST there now, and I wasn't sure how it would intermingle with your script
[10:30:43] we can also add it later
[10:30:54] 2 lines above :P
[10:30:56] checking...
[10:34:01] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 10:33:56 UTC 2013
[10:34:41] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[10:35:20] mark, yes, looks ok
[10:35:23] go ahead
[10:35:41] (03CR) 10Mark Bergsma: [C: 032] Enable ESI processing for the Testing carrier range [operations/puppet] - 10https://gerrit.wikimedia.org/r/86258 (owner: 10Mark Bergsma)
[10:39:35] mark, any way to force puppet run?
[10:39:41] or should i check in a half an hour?
[10:39:50] i'm waiting on puppet as we speak
[10:40:06] if it works we should change make more carriers work with it
[10:42:34] it ran on cp1046
[10:42:42] of course now we need requests from the office ;)
[10:43:45] mark, sec
[10:47:21] (03PS1) 10Mark Bergsma: Disable X-FORCE-ESI for now [operations/puppet] - 10https://gerrit.wikimedia.org/r/87327
[10:48:13] (03CR) 10Mark Bergsma: [C: 032] Disable X-FORCE-ESI for now [operations/puppet] - 10https://gerrit.wikimedia.org/r/87327 (owner: 10Mark Bergsma)
[10:48:53] i've disabled it again
[10:49:07] I realised that we have a problem if ESI is not enabled on all frontends yet
[10:57:07] mark, not sure why we would have a problem?
[10:57:30] especially if we only do it for TEST
[10:57:31] one frontend sets X-FORCE-ESI, requests through a backend
[10:57:49] another frontend without ESI gets it and passes it on unprocessed
[10:59:28] ok, but that's fine as long as we use it for testing - I will see it as either tag or a proper banner, and in both cases i will at least know that the php gave the right result
[10:59:42] and in the mean time you can get it working on all frontends?
[11:00:03] if something has ESI without a proper Vary on X-CS, that's bad
[11:00:23] why wouldn't it have vary on X-CS?
[11:00:32] i haven't removed that part yet
[11:00:34] perhaps you have a bug somewhere?
[11:00:57] well, its always possible of course, but in the worst case we simply won't show a banner
[11:01:13] has anyone reported seeing anything like that?
[11:01:20] no
[11:01:22] i mean - lack of vary on X-CS?
[11:01:24] but waiting half an hour prevents it
[11:01:34] oh, got it
[11:01:37] i'm in no rush with this
[11:01:45] i thought you wanted to postpone entirely
[11:01:57] i propose we reenable it within an hour
[11:02:02] then we can test it today in the office
[11:02:09] if all is well, we can open it up some tomorrow or so
[11:02:16] btw, if we have no vary on X-CS, we have a much bigger problem - showing banners from one carrier to another carrier's users
[11:02:24] yes
[11:02:48] that's why the problem with the mistagging happened wasn't it
[11:03:02] some stuff got cached missing a vary header
[11:03:12] weird... that really ought to be fixed
[11:03:29] although the best solution would be to get rid of vary on X-CS entirely
[11:03:37] which is what we've been working on
[11:03:52] sure
[11:07:48] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000
[11:10:58] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:16:01] puppet is terribly slow
[11:17:33] isn't it always...
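A small sanity check for the ESI discussion above, sketching the two things being debated: that zero-rated mobile responses keep `Vary: X-CS` (so one carrier's banner can never be served from cache to another carrier's users), and whether the backend emits ESI markup at all when the X-Force-ESI trigger header is set. Only the header names (X-CS, X-Force-ESI, Vary) come from the conversation; the URL and the spoofed values are placeholders, not the real test-carrier range.

```python
"""Hedged sketch of the cache-safety checks discussed above.
The URL and spoofed X-Forwarded-For / X-CS values are placeholders."""
import requests

URL = "https://en.m.wikipedia.org/wiki/Special:Random"  # placeholder target
SPOOFED_XFF = "192.0.2.1"                               # placeholder "test carrier" address (TEST-NET)


def check_vary_and_esi():
    resp = requests.get(URL, headers={
        "X-Forwarded-For": SPOOFED_XFF,   # pretend to come from the test range
        "X-CS": "TEST",                   # placeholder carrier code
        "X-Force-ESI": "1",               # the trigger header named in the discussion
    }, timeout=10)

    vary = [v.strip().lower() for v in resp.headers.get("Vary", "").split(",")]
    if "x-cs" not in vary:
        print("BAD: no Vary: X-CS - the response could be cached across carriers")

    # If ESI markup appears, every frontend in the path must be able to process it,
    # otherwise it gets passed through to clients unprocessed.
    if "<esi:" in resp.text:
        print("response contains ESI markup; all frontends must have ESI enabled")


if __name__ == "__main__":
    check_vary_and_esi()
```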
[11:20:42] (03PS1) 10Mark Bergsma: Reenable ESI now all frontends are prepared for it [operations/puppet] - 10https://gerrit.wikimedia.org/r/87328
[11:21:06] (03CR) 10Mark Bergsma: [C: 032] Reenable ESI now all frontends are prepared for it [operations/puppet] - 10https://gerrit.wikimedia.org/r/87328 (owner: 10Mark Bergsma)
[11:30:04] bblack: around?
[11:41:55] mark or paravoid, I think there is a bug in VCL code in zero.inc file -- in case I am browsing from the office, but don't have spoofed XFF hdr, it does not identify me as a carrier
[11:42:13] do you know if the XFF is always present?
[11:42:18] looking at line 24
[11:45:13] also, mark, I don't have any way right now to see if ESI actually worked or not - the banner is there, but no internal headers were kept
[11:56:11] Request: GET http://en.wikipedia.org/wiki/Special:Random, from 93.187.161.106 via cp3012 frontend ([91.198.174.236]:80), Varnish XID 788340603
[11:56:12] Forwarded for: 67.243.51.60, 93.187.161.106
[11:56:15] Error: 503, Service Unavailable at Thu, 03 Oct 2013 11:53:03 GMT
[11:58:14] mark, something weird is going on
[11:58:42] the moment i spoof XFF
[11:59:44] yep, it seems ESI has been deployed and its a complete failure :(
[12:10:42] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[12:14:07] !log enwikisource's search index finished building late last night. all english indices have been updated to include kstem.
[12:14:17] Logged the message, Master
[12:19:18] yurik_: so only for the test range
[12:19:20] not affecting others, right?
[12:22:31] mark correct
[12:22:42] so no real need to rollback I guess
[12:23:40] mark: current results - total site failure when setting XFF from test range
[12:23:44] right
[12:24:08] i either get 503 message, or "no data received" browser page in chrome
[12:24:35] i need to go, so I will just roll it back anyway
[12:24:46] no no
[12:24:49] well,
[12:25:10] i am hoping someone with root will be able to debug it
[12:25:20] without waiting for an hour for the puppet run
[12:25:31] let it sit there for a bit, maybe its some caching issue
[12:25:38] as its not affecting anyone
[12:25:39] yeah right
[12:26:37] mark, current status: spoofing XFF or X-CS from a test range works fine unless the XFF is set to the IP of the test range
[12:28:02] (03PS1) 10Mark Bergsma: Revert "Reenable ESI now all frontends are prepared for it" [operations/puppet] - 10https://gerrit.wikimedia.org/r/87331
[12:28:30] (03CR) 10Mark Bergsma: [C: 032 V: 032] Revert "Reenable ESI now all frontends are prepared for it" [operations/puppet] - 10https://gerrit.wikimedia.org/r/87331 (owner: 10Mark Bergsma)
[12:30:17] i didn't see an Enable-ESI header in a quick test against mediawiki
[12:30:53] what does mediawiki check to use esi?
[12:30:57] just that header or other things as well?
[12:31:08] x-force-esi
[12:31:18] yeah
[12:31:19] that's the header that triggers it to output tag
[12:31:19] just that?
[12:31:22] yep
[12:31:34] i mean - there is a global setting also, but its false atm
[12:32:08] btw, i'm not sure how you tested, i will send an email describing everything
[12:32:15] ok
[12:32:20] then I'll check later
[12:32:26] i need to go, bbl
[12:34:32] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 12:34:28 UTC 2013
[12:34:42] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[12:37:42] PROBLEM - MySQL Replication Heartbeat on db1002 is CRITICAL: Timeout while attempting connection
[12:38:32] RECOVERY - MySQL Replication Heartbeat on db1002 is OK: OK replication delay 0 seconds
[12:49:53] (03PS1) 10Matanya: Drac: Convert into a module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/87332
[12:50:19] (03CR) 10jenkins-bot: [V: 04-1] Drac: Convert into a module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/87332 (owner: 10Matanya)
[12:52:05] (03PS2) 10Matanya: Drac: Convert into a module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/87332
[12:52:28] (03CR) 10jenkins-bot: [V: 04-1] Drac: Convert into a module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/87332 (owner: 10Matanya)
[12:52:58] moorooronging, mark, you around? if so, would appreciate a look at that puppet error
[12:53:03] i haven't looked at it in a day or two
[12:59:43] (03PS3) 10Matanya: Drac: Convert into a module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/87332
[13:07:34] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[13:10:54] paravoid: mind continuing to review my change? I learned a lot from yesterday, and really appreciate your comments
[13:24:01] (03CR) 10Faidon Liambotis: [C: 04-1] "(7 comments)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/87332 (owner: 10Matanya)
[13:27:16] (03PS4) 10Matanya: Drac: Convert into a module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/87332
[13:29:53] you didn't fix the commit message :)
[13:30:03] also puppet:///modules/drac/files/drac/drac.py'
[13:30:10] there's an extra "drac/" there
[13:30:23] (the file is "modules/drac/files/drac.py")
[13:32:50] paravoid: what is wrong with the commit message?
[13:33:02] I left a review
[13:33:19] thanks for comments :)
[13:33:29] no trailing dot and no initial capitalization; the "convert" capitalization is not wrong per se, but I wouldn't do that either
[13:33:51] miredo was similar but I fixed it myself, it was too trivial to comment about it
[13:33:54] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 13:33:52 UTC 2013
[13:34:13] but since you're proceeding with more changes I thought to tell you this time so that you'll get it right for the next ones too :)
[13:34:31] (03PS1) 10Cmjohnson: removing mw1125 from dsh files [operations/puppet] - 10https://gerrit.wikimedia.org/r/87333
[13:34:34] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[13:35:48] (03PS5) 10Matanya: drac: convert into a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/87332
[13:36:33] paravoid: I just hope my learning curve isn't too sharp :)
[13:40:30] (03CR) 10Cmjohnson: [C: 032] removing mw1125 from dsh files [operations/puppet] - 10https://gerrit.wikimedia.org/r/87333 (owner: 10Cmjohnson)
[13:45:26] (03CR) 10Hashar: [C: 031] "Change is good but I have no idea what are the impacts of changing the timezone :-/" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86379 (owner: 10Wizardist)
[13:45:39] PROBLEM - Host barium is DOWN: PING CRITICAL - Packet loss = 100%
[13:47:21] !log c state disabled on barium, ref. RT 5555
[13:47:36] Logged the message, Master
[13:50:19] RECOVERY - Host barium is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms
[13:57:16] (03CR) 10Faidon Liambotis: [C: 031] "But let's wait for Ryan/RobH." [operations/puppet] - 10https://gerrit.wikimedia.org/r/87332 (owner: 10Matanya)
[14:06:53] manybubbles: oh definitely not because some blog said it
[14:06:58] (03CR) 10Jgreen: "Please just remove all jenkins-related puppet config from fundraising.pp. We're migrating off of aluminium/grosley onto hosts in frack, an" [operations/puppet] - 10https://gerrit.wikimedia.org/r/86818 (owner: 10Matanya)
[14:07:23] paravoid: a lot of blogs say it, but they don't give good reasoning. the ES docs mention it as something you can do.
[14:07:40] but mostly talk about the cpu benefits
[14:07:40] I'm mentioning it so we can evaluate it and understand it, as you pointed out
[14:07:45] yeah
[14:07:57] I really wish I had more information.
[14:08:18] I think I'd prefer to do this in two phases if possible
[14:08:31] so we can do some more juggling as we learn more
[14:08:44] that makes sense to me
[14:09:27] I'm writing another update about disk saturation. I was able to do it last night.
[14:11:05] in a semi-unrelated question, have you seen the health monitoring api at all?
[14:11:21] so, ^d submitted some patches that were subsequently merged
[14:11:25] to put ES behind LVS
[14:11:43] we have a monitor for automatically depooling bad servers in LVS, it's called "pybal"
[14:12:05] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[14:12:11] that has multiple different modules for checking health, the one we currently use is fetch http://localhost:9200/ and expect a 200
[14:12:30] I'm worried there might be situations where a node may be unhealthy but ES would respond fine on /
[14:12:48] and if the node is also a data node, this increases the chances, doesn't it?
[14:14:15] ottomata: puppet error where?
[14:14:32] on lvs4001, its the same one from before, even after we thought we fixed it
[14:14:33] paravoid: I wouldn't be surprised if there were some situations, no. let me add it to my list of things to research
[14:14:38] the pybal.conf :undef one
[14:14:40] ok
[14:14:52] paravoid: the health monitoring that we have is mostly nagios level stuff
[14:15:14] manybubbles: I remember reading about an ES health API, maybe we can use that in pybal
[14:15:17] and we don't want to depool a machine when it reports red health - that'd depool everything at the same time
[14:15:26] paravoid: the health api is a cluster health api, mostly
[14:15:32] I was fearing that...
[14:15:38] paravoid: but there might be some gem in there about node health
[14:18:45] (03PS2) 10Matanya: fundrising: remove jenkins. [operations/puppet] - 10https://gerrit.wikimedia.org/r/86818
[14:20:03] s/rising/raising/
[14:20:50] (03PS1) 10Mark Bergsma: Reverse order of class / sites check [operations/puppet] - 10https://gerrit.wikimedia.org/r/87337
[14:20:54] (03PS3) 10Matanya: fundraising: remove jenkins. [operations/puppet] - 10https://gerrit.wikimedia.org/r/86818
[14:21:02] Jeff_Green: ^
[14:21:06] looking
[14:21:24] (03CR) 10Mark Bergsma: [C: 032] Reverse order of class / sites check [operations/puppet] - 10https://gerrit.wikimedia.org/r/87337 (owner: 10Mark Bergsma)
[14:22:30] (03CR) 10Jgreen: [C: 031] fundraising: remove jenkins. [operations/puppet] - 10https://gerrit.wikimedia.org/r/86818 (owner: 10Matanya)
[14:22:49] (03CR) 10Jgreen: [V: 031] fundraising: remove jenkins. [operations/puppet] - 10https://gerrit.wikimedia.org/r/86818 (owner: 10Matanya)
[14:23:09] matanya: +1's applied
[14:23:41] thanks Jeff_Green. maybe it was better if I set some stuff to absent before, but meh
[14:24:08] no, this is far better
[14:24:27] paravoid: found a webinar I should have watched before I started this: http://www.elasticsearch.org/webinars/elasticsearch-pre-flight-checklist
[14:24:32] I don't want jenkins ripped off of aluminium by any means!
[14:24:51] ok, then
[14:25:04] ottomata: fixed
[14:26:16] heh
[14:28:30] (03PS1) 10Jgreen: adjust SA score for DEAR_SOMETHING test to 1.500 [operations/puppet] - 10https://gerrit.wikimedia.org/r/87339
[14:29:25] (03CR) 10Jgreen: [C: 032 V: 031] adjust SA score for DEAR_SOMETHING test to 1.500 [operations/puppet] - 10https://gerrit.wikimedia.org/r/87339 (owner: 10Jgreen)
[14:33:55] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 14:33:53 UTC 2013
[14:34:05] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[14:48:50] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100%
[14:49:30] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms
[14:53:35] (03CR) 10Akosiaris: [C: 04-1] "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/87332 (owner: 10Matanya)
[14:54:57] akosiaris1: hiiii
[14:55:02] what's up with the snappy stuff?
[14:55:05] libsnappyjava?
[14:56:09] ottomata: not much. It's on my TODO list. I can probably context switch tomorrow if you want it soon.
[14:56:25] i think we'll want it fairly soon, within the next week or two
[14:56:33] cool. Will do then
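Returning to the pybal/Elasticsearch exchange above: a sketch of a slightly stronger per-node probe than "GET / and expect a 200", using Elasticsearch's cluster health API but, as manybubbles cautions, refusing to depool on a red cluster status alone, since that is cluster-wide and would take every node out of rotation at once. Host, port, timeout and the depooling policy are assumptions for illustration, not what pybal was actually configured to do.

```python
"""Hedged sketch of a pybal-style health probe for an Elasticsearch node.
/_cluster/health is the documented cluster health endpoint; a node is
treated as poolable if it answers quickly and believes it is in a cluster.
'red' status alone does NOT depool it, per the caveat in the discussion."""
import requests

ES = "http://localhost:9200"   # assumption: probe the node locally on the default port


def node_is_poolable(timeout=2.0):
    try:
        health = requests.get(ES + "/_cluster/health", timeout=timeout).json()
    except requests.RequestException:
        return False                      # unreachable or too slow: depool
    if health.get("number_of_nodes", 0) < 1:
        return False                      # node thinks it is alone / split off
    if health.get("status") == "red":
        # red means unassigned shards somewhere in the cluster; depooling on it
        # would drop every node simultaneously, so only report it
        print("cluster status is red; not depooling on that alone")
    return True


if __name__ == "__main__":
    print("pool" if node_is_poolable() else "depool")
```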
[14:56:35] the varnishkafka and test broker setup is looking good
[14:56:40] (03PS6) 10Matanya: drac: convert into a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/87332
[14:56:51] (03PS1) 10BBlack: assert(ip) is stupid, fails on 0.0.0.0 [operations/software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/87342
[14:56:58] would like to start using official packages and puppetization soon
[14:57:04] akosiaris1: ^
[14:57:05] in order to soon actually install this on mobile hosts
[14:57:34] matanya: cool thanks!!!
[14:57:44] ottomata: official packages ?
[14:57:48] :)
[14:57:53] uhhh
[14:57:57] 'official' meaning from apt :p
[14:57:58] please tell me you mean our kafka package :-D
[14:58:01] our apt
[14:58:11] rather than me dpkg -i ing stuff
[14:58:36] ok ok ... I finally have in my mind what needs to be done so I will have a look tomorrow
[14:58:59] k danke
[15:00:33] (03CR) 10BBlack: [C: 032] assert(ip) is stupid, fails on 0.0.0.0 [operations/software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/87342 (owner: 10BBlack)
[15:00:41] (03CR) 10BBlack: [V: 032] assert(ip) is stupid, fails on 0.0.0.0 [operations/software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/87342 (owner: 10BBlack)
[15:01:26] ottomata: I exchanged a bunch of mails with the jq maintainer
[15:01:37] he asked for sponsorship, I did a thorough review on his package
[15:01:42] I'm about to upload 1.3-1 to Debian
[15:01:47] awesome!
[15:01:50] thanks
[15:01:53] jq ?
[15:02:08] http://stedolan.github.io/jq/
[15:02:49] paravoid: did I hear something from magnus about you playing with varnishkafka bson? :p
[15:02:58] yes
[15:03:08] that was a while ago
[15:11:53] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[15:20:43] (03PS1) 10BBlack: bump netmapper patch to netmapper:b62b6c7a for assert(ip) fix [operations/debs/varnish] (testing/3.0.3plus-rc1) - 10https://gerrit.wikimedia.org/r/87347
[15:20:44] (03PS1) 10BBlack: varnish (3.0.3plus~rc1-wm17) precise; urgency=low [operations/debs/varnish] (testing/3.0.3plus-rc1) - 10https://gerrit.wikimedia.org/r/87348
[15:21:08] (03CR) 10BBlack: [C: 032 V: 032] bump netmapper patch to netmapper:b62b6c7a for assert(ip) fix [operations/debs/varnish] (testing/3.0.3plus-rc1) - 10https://gerrit.wikimedia.org/r/87347 (owner: 10BBlack)
[15:21:27] (03CR) 10BBlack: [C: 032 V: 032] varnish (3.0.3plus~rc1-wm17) precise; urgency=low [operations/debs/varnish] (testing/3.0.3plus-rc1) - 10https://gerrit.wikimedia.org/r/87348 (owner: 10BBlack)
[15:34:53] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 15:34:44 UTC 2013
[15:35:53] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[15:38:53] (03CR) 10Andrew Bogott: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/86762 (owner: 10Ryan Lane)
[15:52:53] PROBLEM - Puppet freshness on virt1000 is CRITICAL: No successful Puppet run in the last 10 hours
[15:53:49] ROFL
[15:53:52] Invalid language code "//bits.wikimedia.org/static-1.22wmf18/extensions/TimedMediaHandler/MwEmbedModules/EmbedPlayer"
[16:05:04] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 16:05:01 UTC 2013
[16:05:53] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[16:14:33] (03PS1) 10Reedy: Add symlink stuff [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87350
[16:14:34] !log updated varnish to -wm17 on active mobile caches (cp1046, cp1047, cp1059, cp1060, cp3011, cp3012), fixes 0.0.0.0 assertfail in netmapper
[16:14:45] Logged the message, Master
[16:17:22] (03CR) 10Ryan Lane: [C: 04-1] "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/86762 (owner: 10Ryan Lane)
[16:17:47] PROBLEM - Host ms-be1011 is DOWN: PING CRITICAL - Packet loss = 100%
[16:18:03] (03PS2) 10Reedy: Do not show password reset interface unless password is correct [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87319 (owner: 10Ori.livneh)
[16:18:18] (03CR) 10Reedy: [C: 032] Do not show password reset interface unless password is correct [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87319 (owner: 10Ori.livneh)
[16:19:35] hi – https://bugzilla.wikimedia.org/show_bug.cgi?id=54847 is the leak bug and it's fixed already, right? can it be made public?
[16:19:50] (03PS11) 10Reedy: Hooks to force password reset for some users [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87285 (owner: 10CSteipp)
[16:19:57] (03CR) 10Reedy: [C: 032] Hooks to force password reset for some users [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87285 (owner: 10CSteipp)
[16:20:08] (03Merged) 10jenkins-bot: Hooks to force password reset for some users [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87285 (owner: 10CSteipp)
[16:20:41] (03PS2) 10Reedy: Enable $wmgBug54847 for default & private, but gate also on $wmgUseCentralAuth [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87318 (owner: 10Ori.livneh)
[16:20:58] (03CR) 10Reedy: [C: 032] Enable $wmgBug54847 for default & private, but gate also on $wmgUseCentralAuth [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87318 (owner: 10Ori.livneh)
[16:21:01] (03PS8) 10Ottomata: gerrit: linkify references to Analytics Mingle projects. [operations/puppet] - 10https://gerrit.wikimedia.org/r/84338 (owner: 10Diederik)
[16:21:05] (03PS3) 10Reedy: Do not show password reset interface unless password is correct [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87319 (owner: 10Ori.livneh)
[16:21:06] okay, judging by the activity, i assume it's not fully fixed yet. :P
[16:21:10] thanks mark, what was the puppet problem?
[16:21:12] (03Merged) 10jenkins-bot: Enable $wmgBug54847 for default & private, but gate also on $wmgUseCentralAuth [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87318 (owner: 10Ori.livneh)
[16:21:22] (03CR) 10Ottomata: [C: 032 V: 032] gerrit: linkify references to Analytics Mingle projects. [operations/puppet] - 10https://gerrit.wikimedia.org/r/84338 (owner: 10Diederik)
[16:21:38] (03CR) 10Reedy: [C: 032] Do not show password reset interface unless password is correct [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87319 (owner: 10Ori.livneh)
[16:21:45] (03PS2) 10Reedy: Add symlink stuff [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87350
[16:21:50] (03CR) 10Reedy: [C: 032] Add symlink stuff [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87350 (owner: 10Reedy)
[16:22:57] RECOVERY - Host ms-be1011 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[16:24:57] PROBLEM - SSH on ms-be1011 is CRITICAL: Connection refused
[16:29:55] ottomata: the fact that not all lvs classes handled ulsfo
[16:30:02] by reversing the check in the template, that's no longer needed
[16:30:12] hm, k, i see the commit, thank you
[16:30:13] so
[16:30:15] now, what?
[16:30:35] bits varnishes should be up, pybal is up on lvs*,
[16:30:35] pybal bgp setup...
[16:30:37] ssl
[16:30:47] the bgp is outside of puppet?
[16:30:50] i assume?
[16:30:57] yes, it's router config
[16:30:59] aye
[16:30:59] k
[16:31:24] so i'm back to working on some analytics stuff, can I leave that with you then? or is there more I can/should do?
[16:31:25] although it looks like I'm going to esams tomorrow, for urgent dc work
[16:31:33] i can take care of it yes
[16:31:39] but it'll be next week
[16:31:40] ok cool, great, danke
[16:31:42] that's fine,
[16:31:44] (fine with me though)
[16:31:49] i might get around to puppetizing more of the varnishes
[16:31:53] i think i just did bits
[16:31:58] now I understand the layout more so I can do that
[16:32:02] yes, if you want to proceed you can go on with upload/mobile
[16:32:05] k
[16:32:08] let's see why we don't have enough boxes btw
[16:32:13] aye
[16:32:16] did we assume 2 boxes for bits and mobile perhaps?
[16:32:17] * mark checks the ticket
[16:33:27] oh damn
[16:33:38] Quote 651613720 is for 20 new varnish servers with dual Intel S3700 SSDs.
[16:33:45] Quote 651618910 is for 8 bits varnish servers (no SSDs).
[16:33:50] those 8 means: 4 lvs, 4 bits
[16:33:56] 8? this sounds excessive
[16:33:56] oh
[16:34:03] but anyway, if you picked 4 out of those 20 with SSDs for bits, that's the wrong boxes
[16:34:07] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 16:33:57 UTC 2013
[16:34:12] bits doesn't need SSDs, it's all in-memory
[16:34:47] PROBLEM - Host ms-be1011 is DOWN: PING CRITICAL - Packet loss = 100%
[16:34:47] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[16:34:50] although...
[16:34:54] later in the ticket it says:
[16:35:04] With more memory we can probably reduce the amount of servers. I think 4 would be fine, but to be safe, let's do 6 upload servers, and 6 text servers. We probably don't need 4 mobile servers either.
[16:35:04] So let's reduce 20 to 16 - that will also cover for the increased cost for the memory.
[16:35:07] argh, I totally forgot that
[16:35:39] anyway, if those bits servers have the same hw configuration as lvs, that's good
[16:36:46] it seems we bought 16 servers with SSDs
[16:36:51] 4 mobile, 6 text, 6 upload
[16:39:55] those bits boxes seem to have hard drives, good
[16:40:22] !log reedy synchronized php-1.22wmf20 'Staging'
[16:40:35] Logged the message, Master
[16:41:00] !log reedy synchronized docroot and w
[16:41:02] ottomata: so to proceed, I'd go forward with upload (6 boxes) and mobile (4 boxes)
[16:41:09] but it can wait also
[16:41:12] Logged the message, Master
[16:42:43] ahhh rats ok mark
[16:42:48] still readding...
[16:43:05] RECOVERY - SSH on ms-be1011 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[16:43:15] RECOVERY - Host ms-be1011 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms
[16:43:21] mark, but what to do with existing bits? redo them?
[16:43:28] which nodes should be for bits?
[16:43:31] i believe they're fine
[16:43:33] oh
[16:43:38] i believe they are the ones with hard drives
[16:43:43] perhaps not all(?) I checked two
[16:43:49] oh you say that they have hdds
[16:43:49] ok
[16:44:15] the SSDs identify themselves as intel and are 400 GB
[16:44:19] these HDs are 250GB
[16:44:34] so I think bits is fine
[16:44:54] ja 250
[16:44:55] k
[16:44:56] good
[16:44:57] :)
[16:45:19] so there are 8 nodes with ssds?
[16:45:23] and 4 have been used for lvs?
[16:45:24] so far?
[16:45:29] 16 with ssds, 8 without
[16:45:35] oh sorry
[16:45:35] no
[16:45:36] backwards
[16:45:38] got it
[16:45:41] the 8 without are 4 for lvs, 4 for bits
[16:45:44] 8 with hdds, 4 for bits, 4 for lvs
[16:45:52] k cool
[16:45:56] great
[16:48:03] !log reedy Started syncing Wikimedia installation... : testwiki to 1.22wmf20 and build l10n cache
[16:48:05] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: No successful Puppet run in the last 10 hours
[16:48:05] PROBLEM - Puppet freshness on ms-be1012 is CRITICAL: No successful Puppet run in the last 10 hours
[16:48:15] Logged the message, Master
[16:48:33] ok, more qs for you mark
[16:48:46] yesterday i was asking about ganglia and you mentioned that the aggregators in the same cluster should report the same data, right?
[16:49:19] yes
[16:49:26] root@nickel:~# netcat analytics1009.eqiad.wmnet 8649 | grep kafka | wc -l
[16:49:26] 0
[16:49:26] root@nickel:~# netcat analytics1011.eqiad.wmnet 8649 | grep kafka | wc -l
[16:49:26] 4751
[16:49:40] previously, analytics1003 was an aggregator, instead of 1009
[16:49:53] and then it worked?
[16:49:54] (03PS1) 10Faidon Liambotis: partman: add a new layout for ms-be @ eqiad [operations/puppet] - 10https://gerrit.wikimedia.org/r/87354
[16:49:55] (03PS1) 10Faidon Liambotis: partman: cleanup external store profiles [operations/puppet] - 10https://gerrit.wikimedia.org/r/87355
[16:50:00] i changed this on sept 24 (i think), and since then I haven't had any of the non base ganglia data from the analytics cluster
[16:50:04] yes, it worked before that
[16:50:06] is gerrit down for everyone?
[16:50:25] nope
[16:50:43] ottomata: so are 1009 and 1011 in different subnets?
[16:50:46] yes
[16:51:00] sounds like multicast routing is broken
[16:51:50] ah
[16:51:52] row b and c I bet
[16:52:03] i have a feeling if I change vrrp prio of row c back to cr2-eqiad it will work again
[16:52:06] fscking multicast
[16:52:49] ha, did that change recently too? is 1009 in a different row than 1003?
[16:52:58] yes, like last week
[16:52:59] Coren: ok, give your script a shot. the sanitarium instances that hold the affected databases in your original list have finished filtering. I'm running the same scripts on the other instances too, but that's enwiki/frwiki, etc which may take a day or so to finish the table scans on `revision` which has one redacted field
[16:54:43] YuviPanda: should I be able to wget http://proxy-dammit:5000/v1/testproject/mapping ?
[16:54:57] andrewbogott: shouldn't
[16:55:02] andrewbogott: well, internally - yes
[16:55:05] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: No successful Puppet run in the last 10 hours
[16:55:05] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: No successful Puppet run in the last 10 hours
[16:55:06] andrewbogott: externally no
[16:55:11] oh, nevermind. that's an internal url
[16:55:13] yes you should
[16:55:25] RECOVERY - Puppet freshness on ms-be1011 is OK: puppet ran at Thu Oct 3 16:55:18 UTC 2013
[16:55:44] 'Connection refused'
[16:55:46] (03CR) 10Faidon Liambotis: [C: 032] partman: add a new layout for ms-be @ eqiad [operations/puppet] - 10https://gerrit.wikimedia.org/r/87354 (owner: 10Faidon Liambotis)
[16:56:03] looking
[16:56:31] andrewbogott: bah, just found out that it is listening on 127.0.0.1
[16:56:36] (03CR) 10Faidon Liambotis: [C: 032] "Diff is dirty, it's basically "git rm db.cfg; git mv es.cfg db.cfg"." [operations/puppet] - 10https://gerrit.wikimedia.org/r/87355 (owner: 10Faidon Liambotis)
[16:56:41] That would explain it!
[16:56:48] andrewbogott: hmm, i need to have it listen on only the internal IP
[16:56:51] (03PS2) 10Faidon Liambotis: Remove jfsutils from base::standard-packages [operations/puppet] - 10https://gerrit.wikimedia.org/r/86252 (owner: 10Akosiaris)
[16:58:32] andrewbogott: try now?
[16:58:53] 400 BAD REQUEST
[16:59:04] Which means that I'm talking to the API now, but… would expect it to respond with something?
[16:59:24] andrewbogott: what was your request?
[16:59:37] andrewbogott: try visualeditor
[16:59:39] as a project
[16:59:53] (03CR) 10Faidon Liambotis: [C: 032] "Removed apache.cfg, I give up on snapshot :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/86252 (owner: 10Akosiaris)
[16:59:59] Ah, that gets me something.
[17:00:05] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: No successful Puppet run in the last 10 hours
[17:00:05] PROBLEM - Puppet freshness on ms-be1004 is CRITICAL: No successful Puppet run in the last 10 hours
[17:00:07] So, a 400 if I pass in a project with no proxies?
[17:00:10] Maybe that's OK
[17:00:27] i should ideally make that a 404
[17:00:57] (03PS1) 10Faidon Liambotis: swift: change eqiad to the new partition layout [operations/puppet] - 10https://gerrit.wikimedia.org/r/87357
[17:01:05] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: No successful Puppet run in the last 10 hours
[17:01:54] (03PS2) 10Faidon Liambotis: swift: change eqiad to the new partition layout [operations/puppet] - 10https://gerrit.wikimedia.org/r/87357
[17:02:02] (03CR) 10Faidon Liambotis: [C: 032 V: 032] swift: change eqiad to the new partition layout [operations/puppet] - 10https://gerrit.wikimedia.org/r/87357 (owner: 10Faidon Liambotis)
[17:03:40] YuviPanda: I don't really understand the json I'm getting back from that query… is it serving up dummy responses?
[17:03:51] andrewbogott: pastebin?
[17:04:02] andrewbogott: no, not dummy responses
[17:04:05] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: No successful Puppet run in the last 10 hours
[17:04:08] https://dpaste.de/Ee8M
[17:04:33] no backends
[17:04:54] andrewbogott: ugh, okay that's weird
[17:05:11] andrewbogott: oh wait
[17:05:13] andrewbogott: what was the URL?
[17:05:17] so mark, anything I can do or do you just have to press some buttons somewhere?
[17:05:17] you used?
[17:05:21] for ganglia problems
[17:05:28] http://proxy-dammit:5000/v1/visualeditor/mapping
[17:05:39] ottomata: not really
[17:05:41] YuviPanda: are you expecting that I'll re-query each individual domain for the backends?
[17:05:44] it needs some network architecture changes
[17:05:49] That's OK, just unclear from the docs.
[17:05:51] andrewbogott: at the moment, yeah
[17:05:53] which i'm planning to do soon but obviously can't do right now
[17:05:58] (And maybe inefficient)
[17:05:59] andrewbogott: I can change that if you want
[17:06:04] andrewbogott: yeah, definitely sounds inefficient
[17:06:05] PROBLEM - Puppet freshness on ms-be1008 is CRITICAL: No successful Puppet run in the last 10 hours
[17:06:06] andrewbogott: let me change that
[17:06:13] Yeah, I think it's better if I can get it all in a lump.
[17:06:20] thx
[17:06:26] oof, ok, so bigger than just some buttons?
[17:06:35] should I just move the aggregator elsewhere for now?
[17:06:55] well
[17:07:05] basically there's no reliable multicast routing between the subnets now
[17:07:05] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: No successful Puppet run in the last 10 hours
[17:07:07] (03PS1) 10Reedy: Only enable Bug 54847 code if on production [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87359
[17:07:11] so they're two islands
[17:07:25] moving the aggregators around doesn't help much
[17:09:27] !log reedy Finished syncing Wikimedia installation... : testwiki to 1.22wmf20 and build l10n cache
[17:09:37] Logged the message, Master
[17:11:05] RECOVERY - Disk space on ms-be1011 is OK: DISK OK
[17:11:10] (03PS1) 10Reedy: Disable wmgBug54847 on labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87361
[17:11:15] RECOVERY - DPKG on ms-be1011 is OK: All packages OK
[17:11:22] (03Abandoned) 10Reedy: Only enable Bug 54847 code if on production [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87359 (owner: 10Reedy)
[17:11:45] RECOVERY - RAID on ms-be1011 is OK: OK: State is Optimal, checked 1 logical device(s)
[17:13:51] !log reedy synchronized php-1.22wmf20/extensions/VisualEditor
[17:13:55] PROBLEM - Host ms-be1002 is DOWN: PING CRITICAL - Packet loss = 100%
[17:13:55] PROBLEM - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100%
[17:13:55] PROBLEM - Host ms-be1007 is DOWN: PING CRITICAL - Packet loss = 100%
[17:13:55] PROBLEM - Host ms-be1006 is DOWN: PING CRITICAL - Packet loss = 100%
[17:13:55] PROBLEM - Host ms-be1009 is DOWN: PING CRITICAL - Packet loss = 100%
[17:13:56] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100%
[17:13:56] PROBLEM - Host ms-be1008 is DOWN: PING CRITICAL - Packet loss = 100%
[17:14:04] (03CR) 10Reedy: [C: 032] Disable wmgBug54847 on labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87361 (owner: 10Reedy)
[17:14:04] Logged the message, Master
[17:14:12] (03Merged) 10jenkins-bot: Disable wmgBug54847 on labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87361 (owner: 10Reedy)
[17:14:15] PROBLEM - Host ms-be1004 is DOWN: PING CRITICAL - Packet loss = 100%
[17:14:35] PROBLEM - Host ms-be1010 is DOWN: PING CRITICAL - Packet loss = 100%
[17:16:05] RECOVERY - NTP on ms-be1011 is OK: NTP OK: Offset -0.01930582523 secs
[17:18:32] YuviPanda: https://github.com/google/lmctfy/
[17:18:51] let me container that for you?
[17:18:58] yes, i laughed
[17:19:05] :D
[17:19:05] RECOVERY - Host ms-be1002 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[17:19:06] RECOVERY - Host ms-be1006 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms
[17:19:06] RECOVERY - Host ms-be1005 is UP: PING OK - Packet loss = 0%, RTA = 1.39 ms
[17:19:06] RECOVERY - Host ms-be1007 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms
[17:19:06] RECOVERY - Host ms-be1009 is UP: PING OK - Packet loss = 0%, RTA = 2.62 ms
[17:19:06] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[17:19:06] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms
[17:19:08] me too!
[17:19:25] RECOVERY - Host ms-be1004 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[17:19:31] there's a C++ library, so i guess we can write bindings
[17:19:34] Oh, contain
[17:19:38] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: testwiki back to 1.22wmf19 till deploy time
[17:19:55] Logged the message, Master
[17:21:05] PROBLEM - SSH on ms-be1008 is CRITICAL: Connection refused
[17:21:15] PROBLEM - SSH on ms-be1005 is CRITICAL: Connection refused
[17:21:25] PROBLEM - SSH on ms-be1009 is CRITICAL: Connection refused
[17:21:26] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100%
[17:21:35] PROBLEM - SSH on ms-be1006 is CRITICAL: Connection refused
[17:21:35] PROBLEM - SSH on ms-be1003 is CRITICAL: Connection refused
[17:21:45] PROBLEM - SSH on ms-be1002 is CRITICAL: Connection refused
[17:22:05] PROBLEM - SSH on ms-be1007 is CRITICAL: Connection refused
[17:22:15] PROBLEM - SSH on ms-be1004 is CRITICAL: Connection refused
[17:22:16] these are all me, obviously
[17:23:12] paravoid: Don't worry -- when things break we always think of you first. :-)
[17:23:25] they didn't break, I'm just reformatting en masse
[17:23:31] megacli en masse too
[17:23:44] for a moment I was thinking of hooking megacli to d-i
[17:23:55] PROBLEM - Host ms-be1012 is DOWN: PING CRITICAL - Packet loss = 100%
[17:24:03] but they're both completely undebuggable, so the combination sounds very scary
[17:24:58] (03PS1) 10Reedy: Wikipedias to 1.22wmf19 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87363
[17:24:59] (03PS1) 10Reedy: testwiki, test2wiki, testwikidatawiki, loginwiki and mediawikiwiki to 1.22wmf20 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87364
[17:26:35] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[17:26:51] ok, more for you mark :)
[17:26:53] https://gerrit.wikimedia.org/r/#/c/86894/4
[17:26:54] no hurry
[17:28:15] RECOVERY - SSH on ms-be1005 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[17:28:35] RECOVERY - SSH on ms-be1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[17:28:35] RECOVERY - SSH on ms-be1006 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[17:28:35] PROBLEM - swift-object-server on ms-be1001 is CRITICAL: Connection refused by host
[17:28:45] RECOVERY - SSH on ms-be1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[17:28:45] PROBLEM - swift-container-replicator on ms-be1001 is CRITICAL: Connection refused by host
[17:28:55] PROBLEM - SSH on ms-be1001 is CRITICAL: Connection refused
[17:28:55] PROBLEM - swift-object-updater on ms-be1001 is CRITICAL: Connection refused by host
[17:28:55] PROBLEM - swift-object-replicator on ms-be1001 is CRITICAL: Connection refused by host
[17:29:06] RECOVERY - SSH on ms-be1007 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[17:29:06] PROBLEM - Disk space on ms-be1001 is CRITICAL: Connection refused by host
[17:29:06] PROBLEM - swift-object-auditor on ms-be1001 is CRITICAL: Connection refused by host
[17:29:06] RECOVERY - SSH on ms-be1008 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[17:29:15] PROBLEM - swift-account-server on ms-be1001 is CRITICAL: Connection refused by host
[17:29:15] RECOVERY - SSH on ms-be1004 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[17:29:15] PROBLEM - RAID on ms-be1001 is CRITICAL: Connection refused by host
[17:29:15] PROBLEM - swift-account-reaper on ms-be1001 is CRITICAL: Connection refused by host
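The `netcat <aggregator> 8649 | grep kafka | wc -l` check from the ganglia discussion above, as a small Python equivalent that can be pointed at each aggregator of a cluster: gmond serves its full metric tree as XML on TCP port 8649, so two aggregators for the same cluster should report roughly the same kafka metric count, and a zero on one of them is a hint that the multicast traffic is not reaching it (the "two islands" problem described earlier). The hostnames are the ones from the conversation and are only examples.

```python
"""Python equivalent of `netcat <aggregator> 8649 | grep kafka | wc -l`:
read gmond's XML dump from each aggregator and count kafka metric lines."""
import socket

AGGREGATORS = ["analytics1009.eqiad.wmnet", "analytics1011.eqiad.wmnet"]  # examples from the log


def kafka_metric_count(host, port=8649):
    chunks = []
    # gmond dumps the whole cluster XML and then closes the connection
    with socket.create_connection((host, port), timeout=5) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    xml = b"".join(chunks).decode("utf-8", "replace")
    return sum(1 for line in xml.splitlines() if "kafka" in line)


if __name__ == "__main__":
    for host in AGGREGATORS:
        print(host, kafka_metric_count(host))
```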
[17:29:16] PROBLEM - swift-container-updater on ms-be1001 is CRITICAL: Connection refused by host
[17:29:25] PROBLEM - swift-container-auditor on ms-be1001 is CRITICAL: Connection refused by host
[17:29:25] RECOVERY - SSH on ms-be1009 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[17:29:25] PROBLEM - swift-account-replicator on ms-be1001 is CRITICAL: Connection refused by host
[17:29:25] PROBLEM - swift-container-server on ms-be1001 is CRITICAL: Connection refused by host
[17:29:25] PROBLEM - swift-account-auditor on ms-be1001 is CRITICAL: Connection refused by host
[17:29:35] PROBLEM - DPKG on ms-be1001 is CRITICAL: Connection refused by host
[17:40:56] PROBLEM - NTP on ms-be1001 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:41:08] i see multicast packets from anl1011 and others on 1019
[17:41:11] er on 1009
[17:41:18] so it's probably not multicast routing that is broken
[17:42:16] PROBLEM - swift-account-reaper on ms-be1011 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[17:44:16] (03PS1) 10Chad: Cirrus to default cawiki, remove commented officewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87369
[17:44:16] RECOVERY - swift-account-reaper on ms-be1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[17:44:33] (03CR) 10Chad: [C: 032] Cirrus to default cawiki, remove commented officewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87369 (owner: 10Chad)
[17:44:42] (03Merged) 10jenkins-bot: Cirrus to default cawiki, remove commented officewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87369 (owner: 10Chad)
[17:45:19] !log demon synchronized wmf-config/InitialiseSettings.php
[17:45:30] Logged the message, Master
[17:45:46] PROBLEM - SSH on maerlant is CRITICAL: Server answer:
[18:02:03] can someone touch fluorine:/a/mw-log/Bug54847.log ?
[18:03:08] Coren: ^
[18:03:50] ori-l: Touched.
[18:03:56] thanks
[18:04:37] Coren: erm, and chown it to udp2log.udp2log?
[18:05:03] ori-l: Done.
[18:05:04] thanks
[18:09:46] RECOVERY - SSH on ms-be1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[18:11:46] RECOVERY - Host ms-be1012 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[18:13:00] RECOVERY - search indices - check lucene status page on search20 is OK: HTTP OK: HTTP/1.1 200 OK - 60075 bytes in 0.111 second response time
[18:13:30] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[18:14:10] PROBLEM - SSH on ms-be1012 is CRITICAL: Connection refused
[18:25:22] (03PS2) 10Reedy: Wikipedias to 1.22wmf19 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87363
[18:25:28] (03CR) 10Reedy: [C: 032] Wikipedias to 1.22wmf19 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87363 (owner: 10Reedy)
[18:27:19] (03CR) 10Reedy: "recheck" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87363 (owner: 10Reedy)
[18:28:01] (03CR) 10Reedy: [V: 032] Wikipedias to 1.22wmf19 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87363 (owner: 10Reedy)
[18:28:40] PROBLEM - Host ms-be1012 is DOWN: PING CRITICAL - Packet loss = 100%
[18:28:43] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikipedias to 1.22wmf19
[18:28:57] Logged the message, Master
[18:31:36] akosiaris: etherpad.wikimedia.org is having troubles - HaeB has been bugging me because there's an event in 20 minutes that's using it
[18:31:37] (03PS2) 10Reedy: testwiki, test2wiki, testwikidatawiki, loginwiki and mediawikiwiki to 1.22wmf20 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87364
[18:31:42] (03CR) 10Reedy: [C: 032] testwiki, test2wiki, testwikidatawiki, loginwiki and mediawikiwiki to 1.22wmf20 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87364 (owner: 10Reedy)
[18:32:30] marktraceur: what does having troubles mean ?
[18:32:57] akosiaris: Apparently they can edit a pad, but it doesn't get saved to the server
[18:33:10] ??????
[18:33:14] I'll let him explain
[18:33:27] (03CR) 10Reedy: [V: 032] testwiki, test2wiki, testwikidatawiki, loginwiki and mediawikiwiki to 1.22wmf20 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87364 (owner: 10Reedy)
[18:33:50] RECOVERY - Host ms-be1012 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[18:33:53] Maybe not - he's not used to dvorak
[18:34:22] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: testwiki, test2wiki, mediawikiwiki, loginwiki and testwikidatawiki to 1.22wmf20
[18:34:30] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 18:34:28 UTC 2013
[18:34:32] Logged the message, Master
[18:34:47] akosiaris: He's sending you an email explaining the issue now
[18:34:57] ok
[18:35:30] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[18:37:10] if a log file on fluorine was not writable by udp2log on creation, is it possible that the udp2log daemon needs to be SIGHUPed to start streaming log data into it?
[18:37:38] ori-l: I've never needed to HUP it
[18:37:53] * Reedy kicks APC
[18:37:53] Perhaps you could try dropping and recreating the file?
[18:39:02] !log reedy synchronized php-1.22wmf20/includes/DefaultSettings.php
[18:39:08] RoanKattouw: can't
[18:39:16] Logged the message, Master
[18:40:04] ori-l: Would you like me to?
[18:40:43] RoanKattouw: that would be great -- thank you
[18:40:52] Path?
[18:41:04] fluorine:/a/mw-log/Bug54847.log [18:42:02] Done [18:42:14] I don't know if that'll work, udp2log is sort of a mystery to me [18:42:34] I suppose you could try asking someone who's familiar with the code, but I don't know who that would be [18:42:43] hi akosiaris [18:42:52] (i'm tilman ;) [18:42:59] highly-performant, bug-free but mostly inscrutable C++ code? [18:43:10] RECOVERY - SSH on ms-be1012 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [18:43:11] hey HaeB I just read your email and I am able to reproduce... [18:43:12] i think that narrows it down to gwicke and tim :P [18:43:26] haha [18:44:00] RoanKattouw: ottomata might be able to help you, he knows a lot about using it [18:44:09] that being said... no logs or anything.... looking into it [18:44:42] thanks ;) [18:47:54] whaasssup? [18:48:54] ottomata: ori-l is having some problems with getting udp2log to write to a new log file [18:50:05] oh, I haven't done anything that would have definitively generated a log message in the last 5-10 mins [18:50:09] if you recreated the file, RoanKattouw [18:50:15] I did [18:50:43] kk [18:52:13] nope [18:52:23] did wfDebugLog( "Bug54847", 'ori-l testing' ); on eval.php on fenari [18:54:30] RECOVERY - Puppet freshness on ms-be1012 is OK: puppet ran at Thu Oct 3 18:54:25 UTC 2013 [18:54:34] * ori-l moves on to another bug [18:55:50] RECOVERY - Host ms-be1010 is UP: PING WARNING - Packet loss = 37%, RTA = 0.34 ms [18:56:07] (03PS1) 10Odder: (bug 54922) Add an accountcreator user group on svwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/87432 [18:58:03] im trying to import https://en.wikipedia.org/wiki/San_Francisco and associated templates to test2wiki since mobile web's automated tests use that article - however, i get the error 'import failed: could not open import file' [18:58:22] (via test2.wikipedia.org/wiki/Special:Import) [18:58:42] i tried a much smaller article, which resulted in ERR_READ_TIMEOUT :| [18:58:50] PROBLEM - SSH on ms-be1010 is CRITICAL: Connection refused [18:58:57] akosiaris: any insights yet? [18:58:59] (just say it's complicated ;) [18:59:17] its facebooky-like complicated [18:59:31] seems to be something either in the content [18:59:31] anybody know what might be going on and/or able to help get the San Francisco article imported? ^^ [18:59:35] or the length... [18:59:45] because i can reproduce with just c/p to another pad [18:59:52] but etherpad logs no error [19:00:00] RECOVERY - Disk space on ms-be1012 is OK: DISK OK [19:00:01] RECOVERY - DPKG on ms-be1012 is OK: All packages OK [19:00:20] RECOVERY - RAID on ms-be1012 is OK: OK: State is Optimal, checked 1 logical device(s) [19:00:50] RECOVERY - NTP on ms-be1012 is OK: NTP OK: Offset -0.06885778904 secs [19:02:07] i also noted that when i tried to paste in a larger chunk of text this morning (10kb), the pasted text showed up in a differen font (in a serif font like in the text editor i was pasting front, rathter than the standard sans serif font etherpad is using) [19:04:15] what text editor did you copy it from? [19:04:28] Reedy: any idea? ^^^ [19:04:35] awjr: are you importing all revisions? 
[19:04:40] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 19:04:32 UTC 2013 [19:04:51] and are you using transwiki import or importupload [19:05:12] Nemo_bis: yeah, all revisions but probably don't need to - and transwiki [19:05:23] SF must have thousands revisions, if you also imported transcluded templates that can amount to GB of stuff [19:05:24] lemme try w/o revisions [19:05:30] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [19:05:43] not that the API would even let it do it [19:06:57] Nemo_bis: w/o revisions, im getting ERR_READ_TIMEOUT [19:07:03] Request: POST http://test2.wikipedia.org/w/index.php?title=Special:Import&action=submit, from 208.80.154.75 via cp1007.eqiad.wmnet (squid/2.7.STABLE9) to 10.64.0.141 (10.64.0.141) [19:07:03] Error: ERR_READ_TIMEOUT, errno [No Error] at Thu, 03 Oct 2013 19:06:19 GMT [19:07:30] HaeB_: seems like it is fixed [19:07:45] awjr: with or without templates? [19:07:50] with templates [19:07:58] do you really need them? [19:08:31] Nemo_bis: seems to work w/o them - we might for the automated tests [19:08:32] lemme dig [19:08:39] that being said I did not find the reason it happened and I bet it will show up again. [19:08:50] RECOVERY - SSH on ms-be1010 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [19:08:50] cause all i did was restart etherpad [19:08:59] Nemo_bis: is there a way to use importupload on test2wiki? [19:09:20] sure [19:09:22] btw https://bugzilla.wikimedia.org/show_bug.cgi?id=15000#c17 [19:09:42] yeah, we need templates :( [19:10:19] Nemo_bis: thanks - i'd like to try importupload - how do i do that on test2wiki? [19:10:25] you can ask anyone on #wikimedia-stewards [19:11:13] akosiaris: oh.,.. just started to implement plan b.. [19:11:52] HaeB_: you may want to have a plan B ... [19:11:56] how sure are we that it's fixed? [19:12:09] we are not [19:12:22] !log CentralAuth: UPDATE `bug_54847_password_resets` SET `r_logged_out` = 1 WHERE `r_reset` IS NOT NULL; [19:12:37] Logged the message, Master [19:13:23] I expect it will show up again. If it does ping me. I wanna gather more info next time. [19:13:58] ok, thanks! 
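Since the interactive transwiki import keeps timing out behind the proxy, one workaround in the spirit of the importupload suggestion above is to pull an XML dump locally with Special:Export (current revision only, templates included, to keep it small) and upload that instead. A rough sketch, assuming Special:Export still accepts the pages/templates/curonly parameters; the output file name is arbitrary:

```python
import urllib.parse
import urllib.request

EXPORT_URL = "https://en.wikipedia.org/wiki/Special:Export"

# Current revision only, with transcluded templates. Parameter names are an
# assumption about what Special:Export accepts.
params = {
    "pages": "San Francisco",
    "templates": "1",
    "curonly": "1",
}
data = urllib.parse.urlencode(params).encode("utf-8")

req = urllib.request.Request(EXPORT_URL, data=data,
                             headers={"User-Agent": "import-helper-sketch/0.1"})
with urllib.request.urlopen(req, timeout=120) as resp:
    xml = resp.read()

with open("San_Francisco.xml", "wb") as out:
    out.write(xml)
print("wrote %d bytes of export XML" % len(xml))
```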
[19:25:40] PROBLEM - SSH on stat1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:26:30] RECOVERY - SSH on stat1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [19:27:20] RECOVERY - Puppet freshness on ms-be1002 is OK: puppet ran at Thu Oct 3 19:27:12 UTC 2013 [19:27:20] RECOVERY - Puppet freshness on ms-be1003 is OK: puppet ran at Thu Oct 3 19:27:12 UTC 2013 [19:32:10] RECOVERY - Disk space on ms-be1003 is OK: DISK OK [19:32:11] RECOVERY - RAID on ms-be1001 is OK: OK: State is Optimal, checked 1 logical device(s) [19:32:20] RECOVERY - DPKG on ms-be1002 is OK: All packages OK [19:32:20] RECOVERY - RAID on ms-be1003 is OK: OK: State is Optimal, checked 1 logical device(s) [19:32:20] RECOVERY - RAID on ms-be1002 is OK: OK: State is Optimal, checked 1 logical device(s) [19:32:30] RECOVERY - DPKG on ms-be1001 is OK: All packages OK [19:32:40] RECOVERY - swift-container-replicator on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [19:32:49] !log reedy synchronized php-1.22wmf20/extensions/MassMessage/ [19:32:50] RECOVERY - swift-object-updater on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [19:32:50] RECOVERY - swift-object-replicator on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [19:33:00] RECOVERY - swift-container-auditor on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:33:02] Logged the message, Master [19:33:10] RECOVERY - Disk space on ms-be1001 is OK: DISK OK [19:33:10] RECOVERY - Disk space on ms-be1002 is OK: DISK OK [19:33:10] RECOVERY - swift-container-server on ms-be1001 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [19:33:10] RECOVERY - swift-object-auditor on ms-be1001 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [19:33:10] RECOVERY - DPKG on ms-be1003 is OK: All packages OK [19:33:10] RECOVERY - swift-account-server on ms-be1001 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [19:33:20] RECOVERY - swift-container-updater on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [19:33:30] RECOVERY - swift-account-auditor on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [19:33:40] RECOVERY - swift-object-server on ms-be1001 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [19:34:30] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 19:34:25 UTC 2013 [19:34:30] RECOVERY - Puppet freshness on ms-be1005 is OK: puppet ran at Thu Oct 3 19:34:26 UTC 2013 [19:34:30] RECOVERY - Puppet freshness on ms-be1004 is OK: puppet ran at Thu Oct 3 19:34:26 UTC 2013 [19:35:30] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [19:35:39] Reedy: thanks :) [19:39:00] RECOVERY - DPKG on ms-be1004 is OK: All packages OK [19:39:10] RECOVERY - RAID on ms-be1004 is OK: OK: State is Optimal, checked 1 logical device(s) [19:39:10] RECOVERY - Disk space on ms-be1005 is OK: DISK OK [19:39:40] RECOVERY - Puppet freshness on ms-be1006 is OK: puppet ran at Thu Oct 3 19:39:34 UTC 2013 [19:39:40] RECOVERY - Disk space on ms-be1004 is OK: DISK OK [19:39:41] RECOVERY - DPKG on ms-be1005 is OK: All packages 
OK [19:39:50] RECOVERY - RAID on ms-be1005 is OK: OK: State is Optimal, checked 1 logical device(s) [19:42:43] RECOVERY - Puppet freshness on ms-be1009 is OK: puppet ran at Thu Oct 3 19:42:36 UTC 2013 [19:42:53] RECOVERY - Puppet freshness on ms-be1008 is OK: puppet ran at Thu Oct 3 19:42:46 UTC 2013 [19:43:13] RECOVERY - Puppet freshness on ms-be1007 is OK: puppet ran at Thu Oct 3 19:43:11 UTC 2013 [19:43:17] checked 1 logical devices out of 14, I love this :) [19:43:53] RECOVERY - DPKG on ms-be1006 is OK: All packages OK [19:44:03] RECOVERY - RAID on ms-be1006 is OK: OK: State is Optimal, checked 1 logical device(s) [19:44:03] RECOVERY - Disk space on ms-be1006 is OK: DISK OK [19:45:33] RECOVERY - Puppet freshness on ms-be1010 is OK: puppet ran at Thu Oct 3 19:45:27 UTC 2013 [19:46:13] RECOVERY - swift-account-reaper on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [19:46:43] RECOVERY - swift-account-replicator on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [19:47:03] RECOVERY - DPKG on ms-be1008 is OK: All packages OK [19:47:04] RECOVERY - RAID on ms-be1008 is OK: OK: State is Optimal, checked 1 logical device(s) [19:47:04] RECOVERY - RAID on ms-be1009 is OK: OK: State is Optimal, checked 1 logical device(s) [19:47:33] RECOVERY - NTP on ms-be1009 is OK: NTP OK: Offset -0.0874838829 secs [19:47:33] RECOVERY - Disk space on ms-be1008 is OK: DISK OK [19:47:33] RECOVERY - NTP on ms-be1003 is OK: NTP OK: Offset -0.02688324451 secs [19:47:43] RECOVERY - Disk space on ms-be1009 is OK: DISK OK [19:47:53] RECOVERY - DPKG on ms-be1009 is OK: All packages OK [19:47:53] RECOVERY - NTP on ms-be1001 is OK: NTP OK: Offset -0.02662801743 secs [19:47:53] RECOVERY - Disk space on ms-be1007 is OK: DISK OK [19:48:13] RECOVERY - RAID on ms-be1007 is OK: OK: State is Optimal, checked 1 logical device(s) [19:48:13] RECOVERY - NTP on ms-be1002 is OK: NTP OK: Offset -0.02849555016 secs [19:48:23] RECOVERY - DPKG on ms-be1007 is OK: All packages OK [19:48:43] RECOVERY - NTP on ms-be1007 is OK: NTP OK: Offset 0.05657565594 secs [19:50:33] RECOVERY - DPKG on ms-be1010 is OK: All packages OK [19:50:43] RECOVERY - Disk space on ms-be1010 is OK: DISK OK [19:50:43] RECOVERY - RAID on ms-be1010 is OK: OK: State is Optimal, checked 1 logical device(s) [19:54:43] RECOVERY - NTP on ms-be1005 is OK: NTP OK: Offset -0.02548146248 secs [19:55:04] RECOVERY - NTP on ms-be1004 is OK: NTP OK: Offset -0.02514827251 secs [19:56:43] PROBLEM - Host ms-be1011 is DOWN: PING CRITICAL - Packet loss = 100% [19:56:43] PROBLEM - Host ms-be1012 is DOWN: PING CRITICAL - Packet loss = 100% [19:57:13] PROBLEM - Host ms-be1008 is DOWN: PING CRITICAL - Packet loss = 100% [19:57:23] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100% [19:57:23] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:57:23] PROBLEM - Host ms-be1009 is DOWN: PING CRITICAL - Packet loss = 100% [19:57:23] PROBLEM - Host ms-be1002 is DOWN: PING CRITICAL - Packet loss = 100% [19:57:33] PROBLEM - Host ms-be1010 is DOWN: PING CRITICAL - Packet loss = 100% [19:57:34] PROBLEM - Host ms-be1007 is DOWN: PING CRITICAL - Packet loss = 100% [19:57:34] PROBLEM - Host ms-be1004 is DOWN: PING CRITICAL - Packet loss = 100% [19:57:34] PROBLEM - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100% [19:57:43] PROBLEM - Host ms-be1006 is DOWN: PING CRITICAL - Packet loss = 100% [19:57:53] PROBLEM - Swift HTTP on ms-fe1001 is CRITICAL: 
Connection timed out [19:58:14] RECOVERY - Host ms-be1011 is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms [19:58:14] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [19:58:14] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [19:58:14] RECOVERY - Host ms-be1002 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [19:58:14] RECOVERY - Host ms-be1007 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [19:58:14] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [19:58:14] RECOVERY - Host ms-be1012 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [19:58:22] nooooooooooooooo [19:58:23] RECOVERY - Host ms-be1006 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [19:58:23] RECOVERY - Host ms-be1005 is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms [19:58:23] RECOVERY - Host ms-be1004 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [19:58:27] it's fine :) [19:58:32] Oh Lol [19:58:33] PROBLEM - Host ms-fe1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:58:35] ok [19:58:43] RECOVERY - Host ms-be1010 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [19:58:48] I just rebooted the whole cluster after reinstalling everything [19:58:53] RECOVERY - Host ms-be1009 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [19:59:03] PROBLEM - Host ms-fe1004 is DOWN: PING CRITICAL - Packet loss = 100% [19:59:03] PROBLEM - Host ms-fe1002 is DOWN: PING CRITICAL - Packet loss = 100% [19:59:04] PROBLEM - Host ms-fe1003 is DOWN: PING CRITICAL - Packet loss = 100% [20:00:03] RECOVERY - Host ms-fe1002 is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms [20:00:13] RECOVERY - Host ms-fe1001 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [20:00:13] RECOVERY - Host ms-fe1003 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [20:00:14] RECOVERY - Host ms-fe1004 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [20:05:27] ottomata: hey, did you give up on the meeting before I got there? [20:05:43] MEETING [20:05:44] NO [20:05:46] hi! [20:05:48] let's meet! [20:06:21] hm i have no reminder email [20:06:24] need hangout url [20:06:25] manybubbles: [20:11:13] PROBLEM - MySQL Processlist on db1021 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 0 copy to table, 724 statistics [20:12:13] RECOVERY - MySQL Processlist on db1021 is OK: OK 0 unauthenticated, 0 locked, 1 copy to table, 2 statistics [20:13:42] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [20:14:12] PROBLEM - NTP on ms-fe1004 is CRITICAL: NTP CRITICAL: Offset unknown [20:14:42] PROBLEM - NTP on ms-fe1002 is CRITICAL: NTP CRITICAL: Offset unknown [20:16:13] PROBLEM - MySQL Idle Transactions on db1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
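The wall of host-down/host-up alerts above is just the storage cluster coming back from a coordinated reboot after reinstall. A quick way to confirm from another box that every node is reachable again is a simple TCP sweep against the SSH port; a minimal sketch with the host list taken from the alerts (the domain suffix is an assumption):

```python
import socket

# Host names from the alerts above; the .eqiad.wmnet suffix is an assumption.
HOSTS = ["ms-be10%02d.eqiad.wmnet" % n for n in range(1, 13)] + \
        ["ms-fe10%02d.eqiad.wmnet" % n for n in range(1, 5)]

def ssh_reachable(host, timeout=3):
    """Return True if a TCP connection to port 22 succeeds within the timeout."""
    try:
        with socket.create_connection((host, 22), timeout=timeout):
            return True
    except OSError:
        return False

for host in HOSTS:
    print("%-25s %s" % (host, "up" if ssh_reachable(host) else "DOWN"))
```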
[20:16:42] PROBLEM - Swift HTTP on ms-fe1001 is CRITICAL: Connection refused [20:17:02] RECOVERY - NTP on ms-be1008 is OK: NTP OK: Offset -0.02320730686 secs [20:17:12] RECOVERY - MySQL Idle Transactions on db1021 is OK: OK longest blocking idle transaction sleeps for 0 seconds [20:17:12] RECOVERY - NTP on ms-be1010 is OK: NTP OK: Offset -0.01921904087 secs [20:18:12] PROBLEM - MySQL Processlist on db1021 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 11 copy to table, 302 statistics [20:18:42] RECOVERY - NTP on ms-fe1002 is OK: NTP OK: Offset 0.003898143768 secs [20:19:12] RECOVERY - MySQL Processlist on db1021 is OK: OK 0 unauthenticated, 0 locked, 7 copy to table, 3 statistics [20:19:13] RECOVERY - NTP on ms-fe1004 is OK: NTP OK: Offset -0.003332138062 secs [20:24:12] PROBLEM - MySQL Processlist on db1021 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 7 copy to table, 219 statistics [20:25:12] RECOVERY - MySQL Processlist on db1021 is OK: OK 0 unauthenticated, 0 locked, 4 copy to table, 1 statistics [20:27:10] !log krinkle synchronized php-1.22wmf19/resources/startup.js 'Attempt to fix bug 54935' [20:27:21] Logged the message, Master [20:33:18] is there going to be a west coast DC? [20:34:42] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 20:34:31 UTC 2013 [20:34:42] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [20:37:44] matanya_, there's going to be a caching centre in SF if that's what you mean# [20:37:47] . [20:38:55] that is Krenair. in Seattle, or LA? [20:38:57] or TBD? [20:39:34] um [20:39:52] I said SF [20:40:34] oh, missed that :( [20:40:43] thanks [20:42:07] !log krinkle synchronized php-1.22wmf19/extensions/VisualEditor/modules/ve/ve.Element.js 'touch for bug 54935' [20:42:19] Logged the message, Master [20:42:44] matanya, there's some pics on https://commons.wikimedia.org/wiki/Special:ListFiles/Mutante [20:42:47] ctrl+f 'ULSFO' [20:43:12] so that what ulsfo stands for! i was wondering [20:43:35] haha, I thoguht it was ULS related (language selector) first time I read it [20:43:35] https://commons.wikimedia.org/w/index.php?title=Special:Search&search=ULSFO&profile=images [20:43:54] yeas, me too :) [20:44:34] What DC is it? equinix? [20:44:40] matanya, wmf's datacentre names are the initials of the provider and nearest airport [20:44:50] https://commons.wikimedia.org/wiki/Category:Wikimedia_servers_in_San_Francisco says UnitedLayer [20:45:14] now i'm a bit smarter [20:45:24] legoktm: reminding you pywikibot :) [20:45:37] pmtpa = so UnitedLayer San FransiscO [20:45:46] woops, ignore the 'pmpta' bit of that message [20:45:56] Krenair: actually, SFO stands for San Francisco-Oakland [20:45:57] was typing out something else originally >_> [20:46:00] matanya: Anatomy of a data center name: ul = United Layer, sfo = nearest airport [20:46:18] legoktm, oh, okay. TIL. [20:46:22] Similarly, eqiad is an Equinix location near IAD (= Dulles International Airport near Washington DC) [20:46:24] :) [20:47:17] and pmpta is tampa, florida. pm is what provider? [20:47:38] and esams is something in amsterdam with some provider [20:47:45] evoswitch [20:47:57] it's even on [[wmf:Benefactors]] or something iirc [20:48:14] pmtpa = PowerMedium in TPA = Tampa, FL [20:48:20] esams is EvoSwitch in AMS [20:48:31] knams is the old Kennisnet location in AMS [20:48:35] And I forget what sdtpa is [20:49:03] used to be a 'lopar' cluster according to wikitech. 
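The naming scheme being explained here — provider initials plus nearest-airport code — is regular enough to capture in a couple of lines. A small illustrative decoder built only from the sites mentioned in this conversation; it is a toy lookup, not an authoritative registry:

```python
# Toy decoder for the "<provider><airport>" site-name convention discussed above.
PROVIDERS = {
    "eq": "Equinix",
    "ul": "UnitedLayer",
    "pm": "PowerMedium",
    "es": "EvoSwitch",
    "kn": "Kennisnet",
}
AIRPORTS = {
    "iad": "Washington Dulles",
    "sfo": "San Francisco",
    "tpa": "Tampa",
    "ams": "Amsterdam",
}

def describe(site):
    """Split a site name into its provider prefix and three-letter airport code."""
    provider, airport = site[:-3], site[-3:]
    return "%s near %s" % (PROVIDERS.get(provider, provider),
                           AIRPORTS.get(airport, airport.upper()))

for site in ("eqiad", "ulsfo", "pmtpa", "esams", "knams"):
    print(site, "=", describe(site))
```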
[20:49:04] legoktm: I doubt it, as Oakland has its own airport with its own airport code (OAK) [20:49:26] RoanKattouw: it's a historical thing, let me see if i can find anything talking about it [20:49:38] so regarding ulsfo, it will be serving only cache? [20:49:44] Although if SFO's airport code was assigned before OAK existed, that could make sense [20:50:15] <^d> RoanKattouw: sd was switch & data. [20:50:39] Aha [20:50:54] so what was lopar? [20:51:03] <^d> nobody remembers :p [20:51:12] <^d> I remember yaseo though. [20:51:16] I'm guessing Paris? [20:51:21] yaseo was Yahoo in Seoul [20:51:26] <^d> Yep, and it was bad. [20:51:32] I guess this was before we started obeying the airport code convention properly [20:51:35] http://wiki.answers.com/Q/What_does_the_O_in_SFO_stand_for is mixed [20:51:42] there's still documenation mentioning yaseo [20:51:44] Because that should have been yaicn probably [20:52:00] <^d> greg-g: Until recently-ish, we had plenty of yaseo references lingering around wmf-config. [20:52:06] <^d> I *think* we've finally purged those. [20:52:18] hah [20:52:21] 'great' [20:52:43] there's still plenty on the wikis [20:53:03] <^d> {{sofixit}} :) [20:54:05] ugh, so, speaking of hurting wrists, mine are right now [20:54:11] ^d: I already did https://meta.wikimedia.org/wiki/Wikimedia_servers ;) [20:54:14] * greg-g goes to take a shower as a break [20:54:15] is there any a bird fly view of all infra from DC level to application level exect puppet config? [20:54:33] yeah, can't find 'yaseo' in wmf-config [20:54:34] matanya: nothing that's accurate [20:54:42] I suspect this doesn't work that well https://wikitech.wikimedia.org/wiki/Squid_checker [20:54:43] that i found out [20:55:19] lopar :) [20:56:42] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 342 seconds [20:56:58] presumably par is for paris, what does lo mean? [20:57:42] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay -0 seconds [20:58:08] -0 seconds. huh. [20:58:20] lelo https://wikitech.wikimedia.org/wiki/Lopar [20:58:40] or Lost Oasis [21:04:52] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 21:04:43 UTC 2013 [21:05:32] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [21:08:40] Ryan_Lane: mind reviewing https://gerrit.wikimedia.org/r/#/c/87332/ please ? [21:32:18] (03CR) 10Ryan Lane: [C: 032] drac: convert into a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/87332 (owner: 10Matanya) [21:32:21] (03PS7) 10Ryan Lane: drac: convert into a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/87332 (owner: 10Matanya) [21:34:19] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 21:34:15 UTC 2013 [21:34:39] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [21:35:12] Ryan_Lane: what is that drac thing? [21:35:16] (03CR) 10Ryan Lane: [C: 032] drac: convert into a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/87332 (owner: 10Matanya) [21:35:40] (03PS1) 10Faidon Liambotis: swift: switch eqiad to tempauth backend [operations/puppet] - 10https://gerrit.wikimedia.org/r/87452 [21:39:35] Ryan_Lane: I'm planning to convert some more lone .pp files into modules. any hints, suggestions or comments before I start this effort? 
[21:39:58] see the puppet todo effort [21:40:50] (03CR) 10Faidon Liambotis: [C: 032] swift: switch eqiad to tempauth backend [operations/puppet] - 10https://gerrit.wikimedia.org/r/87452 (owner: 10Faidon Liambotis) [21:41:17] thanks paravoid [21:44:20] oh puppet... [21:44:30] mutante: can you please review : https://gerrit.wikimedia.org/r/#/c/86818/ [21:48:49] ^d - lvs love ? [21:49:04] <^d> Lez do ittttt :) [21:50:32] paravoid: can we please please convert the roles into modules now? [21:50:46] why are you asking me? :) [21:50:46] puppetd -tv --modulepath=/etc/puppet/modules [21:50:51] guess why that fails? [21:50:56] why? [21:50:59] when trying not to use puppetmaster::self [21:51:07] use --manifestdir ? :) [21:51:09] because the damn role that needs to be called isn't a module [21:51:13] tried that [21:51:29] it won't be just the role that fails though... [21:51:37] * Ryan_Lane sighs [21:51:38] right [21:51:45] but sure, no objection from me [21:51:50] why roles aren't modules too ? [21:51:50] modules/role ? [21:51:52] well, in this case it would onlt be the roles [21:52:02] someone objected to moving the roles now [21:52:05] I thought it was you :D [21:52:14] could be [21:52:20] I think it's going to be a bit messy [21:52:23] won't it be better if all roles were modules? [21:52:25] because the role I'm trying to call is specifically only referencing a module [21:52:25] but if you're up for it :) [21:52:50] roles are on the top of the chain, so all classes need to be defined first [21:52:53] how are we going to do that? [21:53:12] maybe site.pp's import on the top will do it though [21:53:13] not sure [21:53:28] heh [21:53:30] this is a pain [21:54:09] sorry to interapt, I don't get the chain logic [21:54:32] <^d> LeslieCarr: What do you need me to do? [21:54:35] we go site.pp -> role::[...] -> module or manifest [21:54:48] (usually) [21:54:54] why would roles be above modules? [21:56:22] matanya: because roles are meant to be small chunks of reusable code [21:56:23] matanya: roles can be used to collect serveral modules together [21:56:27] and roles are meant to tie them together [21:56:28] well we just need to monitor [21:56:29] so… let's do it :) [21:56:48] won't it be better: site.pp -> role-module (e.g. mysql-role) -> setup modules (e.g mysql contains : install, monitor, config) -> manifests [21:56:49] paravoid: oh well, for now I'll just use puppetmaster-self [21:56:50] <^d> I don't know what to monitor, but ok :) [21:57:01] puppet-master self actually fucks up what I'm trying to test, though [21:58:27] Well, i get that, but not the reason behind [21:59:00] !log restarting pybal on lvs1006 [21:59:12] Logged the message, Mistress of the network gear. [21:59:13] so far looks good [21:59:26] !log restarting pybal on lvs1006 [21:59:37] Logged the message, Mistress of the network gear. [21:59:50] !log restarting pybal on lvs1003 (not 1006) [21:59:59] RECOVERY - Host search.svc.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [22:00:00] Logged the message, Mistress of the network gear. [22:00:19] and look at that :) [22:00:21] huzzah [22:00:25] ^d how's it look to you ? 
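The complaint in the exchange above is that the site.pp -> role::... -> module chain breaks as soon as puppet is restricted to --modulepath, because the role classes do not live under modules/. A small sketch of a sanity check for that situation: list the role classes referenced in site.pp and report which ones have no corresponding file under a modules/role tree. The paths and the regex are assumptions for illustration:

```python
import os
import re
import sys

SITE_PP = "/etc/puppet/manifests/site.pp"               # assumed location
ROLE_MODULE_DIR = "/etc/puppet/modules/role/manifests"  # where roles would live as a module

# Find "role::foo::bar" style class references in site.pp.
with open(SITE_PP) as fh:
    roles = sorted(set(re.findall(r"\brole::([a-z0-9_:]+)", fh.read())))

missing = []
for role in roles:
    # role::foo::bar -> modules/role/manifests/foo/bar.pp under the autoloader layout.
    rel = role.replace("::", "/") + ".pp"
    if not os.path.exists(os.path.join(ROLE_MODULE_DIR, rel)):
        missing.append(role)

print("%d role classes referenced, %d without a module file" % (len(roles), len(missing)))
for role in missing:
    print("  role::%s" % role)
sys.exit(1 if missing else 0)
```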
[22:00:58] LeslieCarr: 60s later and I wouldn't have gotten the page :) [22:01:02] sorry [22:01:03] hehe [22:01:10] i guess muting doesn't mute the up report [22:01:12] same applies to our friend from the netherlands :) [22:01:27] well, i did want to ask hima question [22:01:27] http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=stafford.pmtpa.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2&st=1380837607&g=cpu_report&z=large&c=Miscellaneous%20pmtpa [22:01:31] fun [22:01:34] ;) [22:01:40] haha [22:01:46] we really need to get our puppetmaster moved [22:01:47] "why is puppet slow" [22:01:48] moved/split [22:02:48] <^d> LeslieCarr: Looking good! [22:03:08] I think ulsfo might have thrown it over the edge [22:03:33] <^d> LeslieCarr: `curl -XGET 'http://search.svc.eqiad.wmnet:9200/_cluster/health/?pretty=true'` from tin is returning proper json :) [22:03:49] (03PS1) 10Lcarr: giving search-pool5 proper dns [operations/dns] - 10https://gerrit.wikimedia.org/r/87462 [22:04:09] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 22:03:59 UTC 2013 [22:04:35] (03PS1) 10Lcarr: search-pool5 was in twice [operations/puppet] - 10https://gerrit.wikimedia.org/r/87464 [22:04:38] ^d success! [22:04:39] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [22:04:43] (03CR) 10Lcarr: [C: 032] giving search-pool5 proper dns [operations/dns] - 10https://gerrit.wikimedia.org/r/87462 (owner: 10Lcarr) [22:07:13] ^d: there's a user who may propose it.wiki to ask to be among the CirrusSearch "pioneers", would such a request be considered? (sooner or later) [22:08:12] <^d> Sooner rather than later, but couldn't give a firm date just yet. [22:12:29] (03CR) 10Lcarr: [C: 032] search-pool5 was in twice [operations/puppet] - 10https://gerrit.wikimedia.org/r/87464 (owner: 10Lcarr) [22:13:36] (03PS1) 10Ryan Lane: Initial commit of labs_vmbuilder module [operations/puppet] - 10https://gerrit.wikimedia.org/r/87465 [22:14:06] <^demon> Stupid IRC. [22:14:09] <^demon> LeslieCarr: Do we have to start the other lvs machines, or was the one enough? [22:14:15] <^demon> (Sorry if you got that twice, I got bounced) [22:14:18] You always do two [22:14:25] lvs100N and lvs100{N+3} [22:14:36] yep [22:14:38] got the both [22:14:43] the primary and backup [22:14:45] and now, profit! [22:14:58] heh i guess could have done that before lunch, just when shit goes wrong, it can go so wrong [22:15:04] <^demon> :) [22:15:11] <^demon> Well I think we're all set. Thanks for your help! [22:15:35] yw [22:16:01] TimStarling: The race condition in our deployment script hit us again today, until until after I wrote https://bugzilla.wikimedia.org/show_bug.cgi?id=54935#c4 I realised how similar it was to what we discussed a few months back. [22:16:13] s/until until/not until/ [22:20:13] (03PS2) 10Ryan Lane: Initial commit of labs_vmbuilder module [operations/puppet] - 10https://gerrit.wikimedia.org/r/87465 [22:20:48] (03PS3) 10Faidon Liambotis: Shell access for Bryan Davis. [operations/puppet] - 10https://gerrit.wikimedia.org/r/86764 (owner: 10BryanDavis) [22:20:55] (03CR) 10Faidon Liambotis: [C: 032] Shell access for Bryan Davis. [operations/puppet] - 10https://gerrit.wikimedia.org/r/86764 (owner: 10BryanDavis) [22:21:06] bd808, congrats!:P [22:21:23] w00t! [22:21:29] not yet :) [22:21:37] * paravoid waits for jenkins [22:21:49] Things are moving in the right direction [22:22:36] then you'll need to wait up to 30' for puppet runs :) [22:22:50] 30 feet! 
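The curl check above against the new search service endpoint can be wrapped into something a script can act on: fetch the cluster health document and fail unless the status is green (treating yellow as a warning). A minimal sketch using the same internal URL, which is of course only reachable from inside the cluster:

```python
import json
import sys
import urllib.request

HEALTH_URL = "http://search.svc.eqiad.wmnet:9200/_cluster/health/?pretty=true"

with urllib.request.urlopen(HEALTH_URL, timeout=10) as resp:
    health = json.load(resp)

status = health.get("status", "unknown")
print("cluster %s: status=%s, nodes=%s, unassigned_shards=%s" % (
    health.get("cluster_name"), status,
    health.get("number_of_nodes"), health.get("unassigned_shards")))

# Exit 0 on green, 1 on yellow, 2 on red or anything unexpected.
sys.exit({"green": 0, "yellow": 1}.get(status, 2))
```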
[22:22:52] but, no humans involved anymore! [22:22:52] that's a long ways! [22:22:58] blame the machines now [22:23:08] bd808: stand back 30 feet, please [22:23:10] paravoid: and I'll need to find somebody to show me where to go and what to do [22:23:21] who needs a deployment lesson?:P [22:23:23] * bd808 walks to the other side of his office [22:23:33] @$$ [22:23:40] * greg-g self-censors, but not really [22:23:41] <^demon> bd808: So rule number one of deploying is type scap and press enter. [22:23:43] <^demon> Many times. [22:23:45] <^demon> ;-) [22:24:02] it will work as slow as puppet [22:24:10] * bd808 takes notes [22:24:18] well, it depends on what you are deploying, of course [22:24:27] if you are deploying things like parsoid, it's saner ;) [22:24:29] scap is always slow [22:24:33] bd808: also, our deployment system teaches you to ignore all warnings and errors, get a head start on that one. [22:24:55] MaxSem: I was talking about repos deployed using git-deploy ;) [22:25:16] we just have to fix git first, you know, easy stuff [22:25:20] greg-g, sometimes it even teaches you the names of rsync source files [22:25:56] greg-g: it works fine for repos that don't generate 800MB of data that needs to get deployed first ;) [22:26:11] Ryan_Lane: details details [22:26:13] :D [22:26:16] :) [22:27:01] Ryan_Lane, no wanna hear about yer vaporware as long as I can't use it myself for MW;) [22:28:21] it's not vaporware for at least 4 repos [22:28:50] but yeah, hopefully MW soon too :) [22:29:14] AaronSchulz: around? [22:29:27] yes [22:29:35] swift @ eqiad is ready [22:29:41] the setup that is, not the contents :) [22:29:51] I'm about to start repl [22:30:24] I'll start with enwiki originals I think, unless you have another preference :) [22:31:13] sounds fine [22:31:41] I switched to mw:media [22:31:50] I also switched to tempauth, so the account will now be AUTH_mw [22:34:41] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 22:34:37 UTC 2013 [22:35:41] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [22:38:35] icinga-wm: deliver out of order much? [22:39:11] !log mflaschen Started syncing Wikimedia installation... : Deploy GettingStarted for growth team [22:39:23] Logged the message, Master [22:50:26] !log mflaschen Finished syncing Wikimedia installation... : Deploy GettingStarted for growth team [22:50:41] Logged the message, Master [22:50:42] !log starting swiftrepl process for pmtpa->eqiad (originals) [22:50:55] Logged the message, Master [22:51:10] !log swiftrepl running in a screen on carbon [22:51:21] James_F, all yours [22:51:24] Logged the message, Master [22:51:43] RoanKattouw: superm401 is done [22:52:01] superm401: Thanks! [22:52:17] almost filled up carbon's gigabit [22:52:18] cool [22:52:19] No problem, thanks for letting us go over, greg-g. [22:52:48] I love originals, they're performing so well [22:53:43] (03CR) 10Andrew Bogott: [C: 031] "(2 comments)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/87465 (owner: 10Ryan Lane) [22:54:32] greg-g: Thanks. I'm now ready yet (and it's not time yet) so anyone else who has anything to deploy in this window can go before me [22:54:37] *not ready yet [22:54:46] andrewbogott: Exec { path => '/bin' } [22:54:52] er [22:54:55] that was copper, not carbon [22:54:56] dammit [22:55:05] that's only "global" in the same scope, right? [22:55:09] I really need sleep I guess [22:55:22] can I just edit SAL? [22:55:28] Ryan_Lane: Not sure. 
If it only applies class-wide then it's cool... [22:55:55] paravoid: do you know? [22:56:06] know what? [22:56:09] (03PS1) 10Edenhill: Updated for new librdkafka API. [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/87472 [22:56:38] paravoid: http://www.puppetcookbook.com/posts/set-global-exec-path.html <- we're wondering how 'global' the setting is. Just for the class it's set in? [22:57:02] I don't know [22:57:36] (03PS1) 10Matanya: bastion: convert into a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/87473 [22:58:06] sorry [22:58:41] * Ryan_Lane nods [22:59:05] Could test it! But just now I have to run :( [22:59:25] AaronSchulz: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=copper.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=network_report&c=Miscellaneous+eqiad [22:59:53] andrewbogott: http://docs.puppetlabs.com/puppet/latest/reference/lang_defaults.html [23:00:06] it's only within the area of effect [23:00:27] * paravoid is happy [23:00:36] ugh [23:01:16] paravoid: \o/ for copying! [23:01:32] Ah, cool. [23:01:55] dynamic scope is used for resource defaults [23:02:03] * bd808 is still waiting for puppet to run on bast1001 [23:02:04] bd808: you are now second on my immediate TODO list [23:02:05] but it should be fine, since it doesn't include anything [23:02:21] your request is important to us [23:02:22] puppet is such a gigantic piece of shit [23:02:28] paravoid: excellent. [23:02:30] anyone wants to review that ^ ? [23:02:40] * andrewbogott is off, and will be out tomorrow [23:03:14] (03CR) 10Ryan Lane: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/87465 (owner: 10Ryan Lane) [23:03:26] (03PS3) 10Ryan Lane: Initial commit of labs_vmbuilder module [operations/puppet] - 10https://gerrit.wikimedia.org/r/87465 [23:03:37] * paravoid uses a robotic voice [23:03:47] (03CR) 10Ryan Lane: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/87465 (owner: 10Ryan Lane) [23:04:11] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 23:04:06 UTC 2013 [23:04:28] paravoid: copper? heh, the old test server name :) [23:04:41] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [23:05:01] yeah, I found that ironic too :) [23:10:34] greg-g: OK, back from battery disaster now, working on deployment branch merges for my LD, deploying in a bit [23:11:20] eek? 
[23:11:34] oh, I see aaron's note [23:11:36] yuck [23:11:36] I had loaner my charger to Erik B, and was confident I could make it through the day [23:11:43] heh [23:11:44] But I wasn't watching the battery indicator thing [23:11:52] shouldn't been recompiling things [23:11:53] So I have a 4pm LD, and then at 4:03pm my laptop just shuts down [23:12:14] I guess hooking up an external monitor reduces battery life, who knew :) [23:24:06] !log catrope synchronized php-1.22wmf19/extensions/VisualEditor 'Update VisualEditor for cherry-picks' [23:24:16] Logged the message, Master [23:24:21] !log catrope synchronized php-1.22wmf20/extensions/VisualEditor 'Update VisualEditor for cherry-picks' [23:24:30] Logged the message, Master [23:34:21] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Thu Oct 3 23:34:16 UTC 2013 [23:34:27] http://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&c=Swift+pmtpa&h=ms-be1.pmtpa.wmnet&jr=&js=&v=67.0&m=part_max_used&vl=%25&ti=Maximum+Disk+Space+Used [23:34:30] what the hell [23:34:41] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [23:35:11] 4% in a month [23:35:31] well, it's WLM month [23:35:45] that is true [23:35:53] it seems to stabilize a bit now [23:36:06] is 4 % reasonable for 1287894.7 MB? https://toolserver.org/~emijrp/wlm/stats.php [23:36:29] paravoid: I was looking at the swift code at home last night, just thinking about sqlite :) [23:36:57] I forget that the log->container flushing was triggered on any container listing query [23:37:24] which means there must be lots of flushes to containers (even with sharding), meaning more fsyncs since less stuff is batched in a trx [23:38:38] * AaronSchulz pondered making that logic do a non-locking check to see if the log is either big or X seconds old and only flush then...or maybe ignore the flush if the lock on the log file directory could not be acquired without waiting [23:39:06] I mean listings are eventually consistent anyway...but then I though that would all be too evil [23:40:09] if only you could have some magic data structure that is both like a tree and a log and the log changes would be seen in queries...almost like...an LSM tree....like that thing leveldb uses...that ceph rgw uses [/trolling] [23:40:34] paravoid: anyway, getting thumbs mostly cdn only would help I guess [23:40:54] the only listings we do are (a) scripts, (b) captcha, and (c) thumbnail purge listings [23:41:21] yep [23:41:24] AaronSchulz: troll [23:41:25] and (b) doesn't matter since there are no updates (though swift still locks/unlocks the dir to check instead of doing an optimistic check first...cough) [23:41:25] I was pointing that out to bd808 [23:42:31] https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=copper.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=network_report&c=Miscellaneous+eqiad [23:43:49] copy all the things! [23:44:59] paravoid: bast1001 still doesn't like my public key. :( [23:45:38] lemme have a look [23:46:00] ESC[0;36mnotice: /Stage[main]/Accounts::Bd808/Unixaccount[Bryan Davis]/User[bd808]/ensure: createdESC[0m [23:46:03] ESC[0;36mnotice: /Stage[main]/Accounts::Bd808/Ssh_authorized_key[bd808+cluster8@wmf-bd808-mbp01.local]/ensure: createdESC[0m [23:46:31] error: key_read: uudecode [23:46:36] key is invalid [23:46:37] frack [23:46:54] First char of my pub key is missing form commit [23:47:14] diff? 
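The `key_read: uudecode` failure here is the classic symptom of a public key whose base64 blob got truncated on its way into the repo. A quick local check that would catch this before a puppet run: decode the blob and make sure the algorithm name embedded in it matches the `ssh-rsa`/`ssh-ed25519` prefix on the line. A sketch, not a full validator:

```python
import base64
import struct
import sys

def check_pubkey_line(line):
    """Return None if the key line looks sane, else a description of the problem."""
    parts = line.strip().split()
    if len(parts) < 2:
        return "expected '<type> <base64-blob> [comment]'"
    keytype, blob = parts[0], parts[1]
    try:
        raw = base64.b64decode(blob, validate=True)
    except Exception as exc:
        return "base64 decode failed: %s" % exc
    if len(raw) < 4:
        return "decoded blob is too short"
    # The blob starts with a length-prefixed string naming the algorithm;
    # it must match the declared key type on the line.
    (length,) = struct.unpack(">I", raw[:4])
    embedded = raw[4:4 + length].decode("ascii", "replace")
    if embedded != keytype:
        return "embedded type %r does not match declared %r" % (embedded, keytype)
    return None

if __name__ == "__main__":
    with open(sys.argv[1]) as fh:
        for lineno, line in enumerate(fh, 1):
            if not line.strip() or line.startswith("#"):
                continue
            problem = check_pubkey_line(line)
            print("line %d: %s" % (lineno, problem or "looks ok"))
```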
[23:47:30] * bd808 goes to fix it [23:47:31] we need to ensure => absent the old one too now [23:47:36] hm, maybe not [23:50:32] paravoid: The key in puppet is missing a single 'A' at the start. Should have 4, not 3. [23:50:51] are you submitting or should I? [23:50:55] Do I need to ensure absent the corrupt one and add a correct one? [23:51:08] or just fix in place? [23:51:21] fix in place?!? [23:51:48] I think fix in place because of the comment [23:51:53] but we'll see [23:52:04] paravoid: ok. patch comming [23:52:11] thanks :) [23:53:19] (03PS1) 10BryanDavis: Correct invalid public key for bd808. [operations/puppet] - 10https://gerrit.wikimedia.org/r/87484 [23:53:20] (03CR) 10Ryan Lane: [C: 032] Initial commit of labs_vmbuilder module [operations/puppet] - 10https://gerrit.wikimedia.org/r/87465 (owner: 10Ryan Lane) [23:53:42] Coren: ^^ [23:53:56] Coren: if you want to change the images we use, that's the module to do it in, now [23:54:04] I'm going to add some docs on this soon, too [23:54:08] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Correct invalid public key for bd808. [operations/puppet] - 10https://gerrit.wikimedia.org/r/87484 (owner: 10BryanDavis) [23:55:35] I started commons [23:55:36] it's 33T [23:55:39] it'll take a while :) [23:57:01] Ironically the key is correct on my office User page. If only code review had been more thorough. :) [23:57:55] * AaronSchulz stabs http://tracker.ceph.com/issues/6462 ... ruining my unit tests [23:58:54] I guess I should be happy we ditched swift ;) [23:58:55] er [23:58:56] ceph [23:58:57] * AaronSchulz maybe should actually set up swift in some vms [23:59:17] paravoid: but ceph is so much sexier [23:59:19] AaronSchulz: you should add a swift role to vagrant! [23:59:53] like 'ceph pg dump' shows all kind of cool stuff...does swift have that?
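"It'll take a while" for the 33T of commons originals can be put in rough numbers: at the near-line-rate gigabit seen on the copying host earlier in the evening, the raw transfer alone is several days of wall clock, before counting per-object request overhead. A throwaway back-of-the-envelope calculation (the 80% sustained-utilisation figure is an assumption):

```python
# Back-of-the-envelope: how long does 33 TB take over a single gigabit link?
TOTAL_BYTES = 33 * 1000**4          # "33T", read as 33 terabytes
LINK_BITS_PER_S = 1 * 1000**3       # one gigabit per second
UTILISATION = 0.8                   # assumption: ~80% of line rate sustained

bytes_per_s = LINK_BITS_PER_S / 8 * UTILISATION
seconds = TOTAL_BYTES / bytes_per_s
print("~%.1f days at %.0f MB/s sustained" % (seconds / 86400, bytes_per_s / 1e6))
# Ignores per-object request overhead, which for millions of small originals
# can easily matter more than raw bandwidth.
```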