[00:12:19] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[00:12:19] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[00:12:19] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[00:12:19] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[00:12:19] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[00:12:20] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[00:30:10] (CR) Ori.livneh: "Hey Andrew, I think it'd make sense for the two of us to spend a bit of time figuring out how we can merge your work on mediawiki_singleno" [operations/puppet] - https://gerrit.wikimedia.org/r/53989 (owner: Andrew Bogott)
[01:09:22] (CR) Andrew Bogott: "I'm on holiday for another couple of weeks -- if you want to recruit someone else to merge this that's fine, otherwise I will take care of" [operations/puppet] - https://gerrit.wikimedia.org/r/75087 (owner: Ori.livneh)
[01:19:57] * Romaine points at https://bugzilla.wikimedia.org/show_bug.cgi?id=52853
[01:20:38] (CR) Ori.livneh: "Oh, I forgot. No worries then. This can wait." [operations/puppet] - https://gerrit.wikimedia.org/r/75087 (owner: Ori.livneh)
[01:29:19] PROBLEM - Puppet freshness on hooft is CRITICAL: No successful Puppet run in the last 10 hours
[01:57:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:58:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[02:08:17] !log LocalisationUpdate completed (1.22wmf12) at Thu Aug 15 02:08:17 UTC 2013
[02:08:29] Logged the message, Master
[02:08:49] PROBLEM - Host mw16 is DOWN: PING CRITICAL - Packet loss = 100%
[02:09:39] RECOVERY - Host mw16 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms
[02:20:40] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Aug 15 02:20:39 UTC 2013
[02:20:51] Logged the message, Master
[02:22:19] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[02:25:29] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:31:20] !log rebooting fenari, kernel upgrade
[02:31:31] Logged the message, Master
[02:34:59] PROBLEM - Host fenari is DOWN: PING CRITICAL - Packet loss = 100%
[02:35:39] RECOVERY - Host fenari is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms
[02:43:23] !log rebooting bast1001, kernel upgrade
[02:43:34] Logged the message, Master
[02:54:49] !log rebooting iron, kernel upgrade
[02:55:01] Logged the message, Master
[04:25:19] PROBLEM - Puppet freshness on mchenry is CRITICAL: No successful Puppet run in the last 10 hours
[04:58:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:59:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.158 second response time
[05:10:38] greg-g: interesting isn't it?
[05:13:03] apergos: which?
[05:13:27] the nl wiki revision flipflops
[05:14:11] where by "interesting" I mean "kind of a bummer and I wonder what else isn't right on that db"
[05:14:15] oh yeah
[05:14:25] yeah, totally a bummer
[05:15:55] I suppose since there was some sort of replication failure over there the solution will be to rebuild it somehow, I wonder if they can do that from some recent snapshot or something
[05:16:14] yeah, just read your comments. thanks for that. ugh
[05:16:18] anyways that was a good theory which turned out right
[05:16:22] yes ugh
[05:21:56] go brian! :)
[05:26:23] eh?
[05:26:57] Bawolff
[06:12:03] TimStarling: hmm, seems the new public key doesn't work
[06:14:08] ah, well this is where it gets more difficult
[06:14:27] are you the real Aaron Schulz or just someone who has stolen his laptop? ;)
[06:15:21] heh, I don't see anything wrong with the key in the puppet repo
[06:16:25] * Aaron|home looking at a30ec4c8d31633a8153f89deb0e4ec64b088d4b9
[06:18:27] (PS1) Springle: remove db1009 from action while investigating bug 52853 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79178
[06:18:52] you are using fenari?
[06:19:54] yes
[06:20:15] try now
[06:21:38] no luck
[06:22:13] mostly I just wanted the auth.log lines without having to search for them ;)
[06:22:22] Aug 15 06:21:46 fenari sshd[28025]: input_userauth_request: invalid user Aaron [preauth]
[06:22:27] (CR) Springle: [C: 2 V: 2] remove db1009 from action while investigating bug 52853 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79178 (owner: Springle)
[06:22:30] maybe best to stick with lower case
[06:23:19] that worked ;)
[06:24:34] well for gitbash ssh, still faffing around with putty
[06:31:28] ah, I had the old key path set in the putty session config
[06:31:38] there we go
[06:40:19] PROBLEM - Puppet freshness on zirconium is CRITICAL: No successful Puppet run in the last 10 hours
[06:41:36] !log deploying CirrusSearch configuration changes (fa2991591a2eb5c8c461e471af49c62d428bc530)
[06:41:47] Logged the message, Master
[06:42:39] !log tstarling synchronized wmf-config/InitialiseSettings-labs.php
[06:42:41] (CR) Tim Starling: "Please do not merge things into mediawiki-config and then leave them undeployed for days." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/78083 (owner: Demon)
[06:42:50] Logged the message, Master
[06:43:25] !log tstarling synchronized wmf-config/InitialiseSettings.php
[06:43:36] Logged the message, Master
[06:44:26] !log tstarling synchronized wmf-config/CirrusSearch-common.php
[06:44:37] Logged the message, Master
[06:45:11] !log tstarling synchronized wmf-config/CirrusSearch-production.php
[06:45:21] Logged the message, Master
[06:45:53] !log tstarling synchronized wmf-config/CommonSettings.php
[06:46:04] Logged the message, Master
[06:47:31] !log tstarling synchronized wmf-config/db-eqiad.php 'depooling db1009 at springle's request'
[06:47:42] Logged the message, Master
[06:59:59] (PS1) ArielGlenn: fix up private aliases class name for mchenry [operations/puppet] - https://gerrit.wikimedia.org/r/79179
[07:01:20] (CR) ArielGlenn: [C: 2] fix up private aliases class name for mchenry [operations/puppet] - https://gerrit.wikimedia.org/r/79179 (owner: ArielGlenn)
[07:02:16] !log delaying slave db52 during bug 52853 investigation
[07:02:28] Logged the message, Master
[07:03:39] RECOVERY - Puppet freshness on mchenry is OK: puppet ran at Thu Aug 15 07:03:35 UTC 2013
[08:13:33] !log starting pt-table-checksum on db1034 bug 52853
[08:13:44] Logged the message, Master
[08:36:32] springle: looks like fun...
[08:43:19] PROBLEM - Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours
[08:53:29] paravoid, it is ;-) compared the old pre-percona-toolkit options...
[09:01:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:02:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.158 second response time
[09:33:19] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 10 hours
[09:42:19] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[09:42:19] PROBLEM - Puppet freshness on holmium is CRITICAL: No successful Puppet run in the last 10 hours
[09:42:19] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours
[09:42:19] PROBLEM - Puppet freshness on pdf3 is CRITICAL: No successful Puppet run in the last 10 hours
[09:42:19] PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours
[09:42:20] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours
[09:46:26] (PS4) Faidon: (WIP) Initial Debian version [operations/software/varnish/varnishkafka] (debian) - https://gerrit.wikimedia.org/r/78782
[10:13:19] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[10:13:19] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[10:13:19] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[10:13:19] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[10:13:19] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[10:13:20] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[11:23:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:24:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.137 second response time
[11:30:19] PROBLEM - Puppet freshness on hooft is CRITICAL: No successful Puppet run in the last 10 hours
[11:30:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:32:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[11:40:20] (PS1) TTO: Add several additional user groups for ckbwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79197
[11:48:02] (PS1) Faidon: Re-enable multiwrites for Ceph [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79198
[11:48:04] (CR) Faidon: [C: -1] "Not to be deployed yet." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79198 (owner: Faidon)
[14:20:18] paravoid: I have a critical firmware update for all WD and Seagate H/D's. Since the disk is not showing failed this may correct the errors. Of course this would require me taking the server down. Would like to schedule this, lmk when the best time and day.
[14:26:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:28:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.085 second response time
[15:13:52] (PS1) Petr Onderka: Fixed bug when reading empty string [operations/dumps/incremental] (gsoc) - https://gerrit.wikimedia.org/r/79203
[15:14:14] (CR) Petr Onderka: [C: 2 V: 2] Fixed bug when reading empty string [operations/dumps/incremental] (gsoc) - https://gerrit.wikimedia.org/r/79203 (owner: Petr Onderka)
[15:15:36] anybody here who could run a simple DB query on en.wp for me?
[15:15:45] select count(*) from user_properties where up_property = 'gadget-oldeditor';
[15:20:18] MatmaRex, 145
[15:22:42] heh
[15:23:13] this is people who opted out using the gadget?
[15:23:25] that's people who bothered to disable visual editor with the gadget, but didn't visit wikipedia for the last two weeks
[15:23:32] since the gadget now disables itself when ran
[15:23:41] (and enables the pref)
[15:23:47] https://en.wikipedia.org/w/index.php?title=MediaWiki:Gadget-oldeditor.js&diff=566875800&oldid=565485013
[15:23:53] ah
[15:24:11] i was wondering if it could be killed already
[15:24:13] so the rest might have been back already
[15:24:50] summer vacation? tsk tsk
[15:25:02] is there any chance to get one of you guys to run a little UPDATE there? ;)
[15:30:01] I am having troubles to make sense of the request logs I get from the eqiad mobile caches (e.g.: cp1046.eqiad.wmnet). The total # of requests for the sampled-1000 dropped from ~60K to ~36K for yesterday.
[15:30:07] Were there any mobile caches added recently (could not find something in the puppet repo) or some load balancer updates?
[15:30:59] (PS1) Mark Bergsma: Add support for gerrit.wikimedia.org to the misc cluster [operations/puppet] - https://gerrit.wikimedia.org/r/79204
[15:31:46] (right, i'll take that as a 'no' ;) )
[15:31:57] and another thing, how about select count(*) from valid_tag;?
[15:32:04] looking at the code, nothing ever writes to that table
[15:32:12] and only special:tags reads from it
[15:32:21] (CR) Mark Bergsma: [C: 2] Add support for gerrit.wikimedia.org to the misc cluster [operations/puppet] - https://gerrit.wikimedia.org/r/79204 (owner: Mark Bergsma)
[15:33:07] bblack: heya
[15:33:23] paravoid: hi
[15:34:08] RT #5614
[15:34:37] start:1376505554 uptime:75300 inpkts_recvd:18804368 inpkts_sane:18804368 inpkts_enqueued:18804368 inpkts_dequeued:10532275 queue_overflows:4 queue_size:426732 queue_max_size:426732
[15:34:48] yup
[15:35:01] and yet I did confirm that content didn't get purged
[15:35:50] well, on that host something's clearly wrong, given the large queue_size and equal queue_max_size, and 4 overflows since yesterday's update
[15:35:58] for reference, cp1040 working ok:
[15:35:59] Aug 15 15:32:21 cp1040 vhtcpd[17521]: Stats: start: 1376506941 uptime: 73800 inpkts_recvd: 18965554 inpkts_sane: 18965554 inpkts_enqueued: 18965554 inpkts_dequeued: 18965554 queue_overflows: 0 queue_size: 0 queue_max_size: 1654
[15:36:54] !log reedy synchronized php-1.22wmf13/
[15:37:05] Logged the message, Master
[15:38:32] !log reedy synchronized docroot and w
[15:38:43] Logged the message, Master
[15:38:53] aha, he is alive!
[15:38:59] (Reedy, that is) ;)
[15:39:37] welcome back, Reedy.
[15:39:47] paravoid: the basic issue seems to be that requests are dequeueing fine to :3128, but not to :80 (which is why the queue keeps them and eventually overflows). and it predates the latest release of the code, so it's not new
[15:39:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:40:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time
[15:41:12] paravoid: oops, backwards, they succeed on 80 and seem to be stalled out on 3128
[15:43:49] paravoid: and the cause there seems to be that varnish never responds to the PURGE request...
[15:44:18] (CR) Demon: "This should actually be really nice for some things. It won't be able to cache the initial html sent to the browser (that's no-cache), but" [operations/puppet] - https://gerrit.wikimedia.org/r/79204 (owner: Mark Bergsma)
[15:44:57] paravoid: the traffic to :80 flows like normal with responses, but the activity for the fd for port 3128 is a very slow loop of:
[15:45:04] 15:43:15.071951 shutdown(6, 2 /* send and receive */) = 0
[15:45:04] 15:43:15.072309 connect(6, {sa_family=AF_INET, sin_port=htons(3128), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
[15:45:06] 15:43:15.072608 getsockopt(6, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
[15:45:09] 15:43:15.072673 sendto(6, "PURGE /wikipedia/commons/thumb/1/16/DNA_orbit_animated.gif/200px-DNA_orbit_animated.gif HTTP/1.1\r\nHost: upload.wikimedia.org\r\nUser-Agent: vhtcpd\r\n\r\n", 148, 0, NULL, 0) = 148
[15:45:13] 15:44:12.091475 shutdown(6, 2 /* send and receive */) = 0
[15:45:22] (timeout after 1m)
[15:48:38] hm
[15:53:19] I tried some other random URLs and it does respond to e.g. a PURGE on "/wikipedia/commons/thumb/1/16/xxx.gif/200px-xxx.gif"
[15:53:33] but not that DNA_orbit_one, just hangs the connection forever on a manual telnet
[15:57:27] I'm of two minds on how to handle that: one says that varnish should always give a response in a reasonable amount of time and we find bugs better by requiring that. the other says the code should just give up on a queue entry after a few timeouts and toss it out and move on (perhaps incrementing some failure stat)
[15:57:36] but then these issues would just get ignored more
[16:01:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:02:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time
[16:08:21] !log reedy synchronized php-1.22wmf13/
[16:14:40] !log reedy Started syncing Wikimedia installation... : test2wiki to 1.22wmf13 and build l10n cache
[16:14:51] Logged the message, Master
[16:19:11] (PS1) Jeremyb: tweak gitblit header text [operations/puppet] - https://gerrit.wikimedia.org/r/79208
[16:27:57] !log reedy Finished syncing Wikimedia installation... : test2wiki to 1.22wmf13 and build l10n cache
[16:28:08] Logged the message, Master
[16:28:22] Aaron|home: around?
[16:28:51] (CR) MaxSem: "Ping!" [operations/puppet] - https://gerrit.wikimedia.org/r/73342 (owner: MaxSem)
[16:29:04] That seems awfully quick
[16:29:15] * Aaron|home is looking at oauth stuff atm
[16:29:21] Aaron|home, greg-g: I'd like to schedule the reenablement of ceph in production, basically https://gerrit.wikimedia.org/r/#/c/79198/
[16:29:36] greg-g: any time next week would be fine with me
[16:30:40] (PS1) Reedy: Add wmf13 symlinks [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79210
[16:30:41] (PS1) Reedy: Rebuild IW cache after viwikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79211
[16:31:02] (CR) Reedy: [C: 2] Add wmf13 symlinks [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79210 (owner: Reedy)
[16:31:09] (CR) Reedy: [C: 2] Rebuild IW cache after viwikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79211 (owner: Reedy)
[16:31:12] (CR) Greg Grossmeier: [C: 1] "I like it. Better than what we have." [operations/puppet] - https://gerrit.wikimedia.org/r/79208 (owner: Jeremyb)
[16:31:13] (Merged) jenkins-bot: Add wmf13 symlinks [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79210 (owner: Reedy)
[16:31:19] (Merged) jenkins-bot: Rebuild IW cache after viwikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79211 (owner: Reedy)
[16:31:57] paravoid: aha
[16:32:16] * greg-g looks at calendar
[16:32:20] paravoid: how much time do you want?
[16:32:43] half an hour tops?
[16:32:49] cool
[16:33:06] wow, busy week next week...
[16:33:07] I mean, it's a single file, plus monitoring nothing broke
[16:33:19] right
[16:33:20] (PS1) Reedy: 1.22wmf13 phase one wikis to 1.22wmf13 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79212
[16:33:31] I can always deploy it on a european morning ;)
[16:33:39] why wouldn't you
[16:33:45] lemme update the Deployments page real quick with next week's schedule just so it is easy to find a slot
[16:33:50] yeah, is that a good time?
[16:34:30] that pretty much guarantees that you can do it whenever you want, as long as it isn't Tuesday morning (when Language team has their window)
[16:34:39] works for me
[16:34:43] excellent
[16:34:55] so, pencil you in for Mon or Wed?
[16:34:56] (PS2) Reedy: 1.22wmf13 phase one wikis to 1.22wmf13 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79212
[16:34:57] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: revert test2wiki to 1.22wmf12 till window
[16:35:00] mon is fine I think
[16:35:03] coolio
[16:35:07] Logged the message, Master
[16:35:09] what time?
[16:35:13] (just so I can list it)
[16:35:33] say 12:00 UTC?
[16:35:56] that's 3pm not morning! :P
[16:36:27] er, I was thinking 12 my time, but yeah 3pm works too
[16:37:29] yep, put it down for 12:00 UTC
[16:41:19] PROBLEM - Puppet freshness on zirconium is CRITICAL: No successful Puppet run in the last 10 hours
[16:44:10] paravoid: (consider it) done (almost)
[16:44:22] * greg-g updates gcal and wikipage
[16:45:30] thanks!
[16:48:31] * Aaron|home checks random <1 month old files
[16:48:38] for?
[16:49:17] paravoid: you saw rt 5600?
[16:49:40] I did
[16:49:54] seeing if they are synced
[16:49:59] Aaron|home: speaking of logs, fluorine doesn't have a filebackend-ops
[16:51:09] used to at least
[17:00:08] maybe logrotate doesn't leave 0 byte .log files, though I thought it did
[17:00:20] if there is no recent activity that would explain it not being there
[17:00:40] the swift log has no write operation logs, so that would make sense
[17:02:17] perfect
[17:02:22] sync is done btw, just looping now
[17:03:34] "1500 OK out of 1500 checks"
[17:11:30] greg-g: where/how should we link to https://www.mediawiki.org/wiki/How_to_contribute ?
[17:12:24] (CR) Faidon: [C: 2] tweak gitblit header text [operations/puppet] - https://gerrit.wikimedia.org/r/79208 (owner: Jeremyb)
[17:13:19] jeremyb: dunno, ask quim :)
[17:13:28] greg-g: i mean from the header?
[17:13:33] no quim here
[17:14:00] and not in #wikimedia-deg
[17:14:02] dev*
[17:14:15] so i guess not on
[17:15:02] wtf, grrrit-wm doesn't report jenkins fail?
[17:15:43] oh, that's docs. i guess maybe that fails every time
[17:16:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:17:02] has anyone already looked into https://rt.wikimedia.org/Ticket/Display.html?id=5614 ?
[17:17:34] ottomata: only as far as I could with a web browser...
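Note on the two vhtcpd stats lines pasted at 15:34 and 15:35: the signs bblack points at (queue_size pinned at queue_max_size, nonzero queue_overflows, inpkts_dequeued lagging inpkts_enqueued) can be checked mechanically. What follows is a minimal Python sketch, assuming only the `key:value` field layout visible in those two pastes. `parse_stats` and `looks_stalled` are hypothetical helper names and are not part of vhtcpd.

```python
import re

# Matches fields such as "queue_size:426732" or "queue_size: 0";
# both spacings appear in the pasted lines. Keys must start on a word
# boundary and be lowercase, so the syslog prefix is ignored.
FIELD_RE = re.compile(r"\b([a-z_]+):\s*(\d+)")


def parse_stats(line: str) -> dict:
    """Return the numeric key/value fields found in a vhtcpd stats line."""
    return {key: int(value) for key, value in FIELD_RE.findall(line)}


def looks_stalled(stats: dict) -> bool:
    """Heuristic (an assumption, not vhtcpd logic): flag a host whose
    queue has overflowed, or whose backlog sits at the recorded maximum."""
    backlog = stats["inpkts_enqueued"] - stats["inpkts_dequeued"]
    pinned = (stats["queue_size"] > 0
              and stats["queue_size"] == stats["queue_max_size"])
    return stats["queue_overflows"] > 0 or (backlog > 0 and pinned)


# The two lines exactly as pasted in the channel.
STALLED = ("start:1376505554 uptime:75300 inpkts_recvd:18804368 "
           "inpkts_sane:18804368 inpkts_enqueued:18804368 "
           "inpkts_dequeued:10532275 queue_overflows:4 "
           "queue_size:426732 queue_max_size:426732")
HEALTHY = ("Aug 15 15:32:21 cp1040 vhtcpd[17521]: Stats: start: 1376506941 "
           "uptime: 73800 inpkts_recvd: 18965554 inpkts_sane: 18965554 "
           "inpkts_enqueued: 18965554 inpkts_dequeued: 18965554 "
           "queue_overflows: 0 queue_size: 0 queue_max_size: 1654")

for name, line in (("stalled host", STALLED), ("cp1040", HEALTHY)):
    print(name, looks_stalled(parse_stats(line)))
```

The thresholds are deliberately crude; a real check would probably compare successive samples rather than a single line.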
[17:17:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.068 second response time
[17:17:40] (i don't think anyone else did)
[17:17:54] ottomata: it was in this channel a day or two ago
[17:18:46] it seems to not be the "exists in varnish but not swift so it doesn't get purged" thing i'd seen before
[17:20:25] do you know who knows about this kind of thing?
[17:21:27] ottomata: ma rk, para void, aper gos, all should
[17:21:31] maybe bb lack?
[17:21:38] yeah I was looking at it a bit
[17:21:41] thinking
[17:21:42] and then bblack was
[17:22:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:41:00] "5000 OK out of 5000 checks" \o/
[17:43:52] (PS1) RobH: ytterbium install troubleshooting [operations/puppet] - https://gerrit.wikimedia.org/r/79216
[17:46:41] (PS1) Faidon: ceph: more detailed status output for nagios check [operations/puppet] - https://gerrit.wikimedia.org/r/79217
[17:47:06] (CR) Faidon: [C: 2 V: 2] ceph: more detailed status output for nagios check [operations/puppet] - https://gerrit.wikimedia.org/r/79217 (owner: Faidon)
[17:51:59] PROBLEM - Ceph on ms-fe1001 is CRITICAL: Ceph HEALTH_WARN noout flag(s) set
[17:52:19] PROBLEM - Ceph on ms-fe1004 is CRITICAL: Ceph HEALTH_WARN noout flag(s) set
[17:52:26] (ignore those)
[17:52:39] PROBLEM - Ceph on ms-fe1003 is CRITICAL: Ceph HEALTH_WARN noout flag(s) set
[17:54:33] (CR) RobH: [C: 2] ytterbium install troubleshooting [operations/puppet] - https://gerrit.wikimedia.org/r/79216 (owner: RobH)
[17:57:56] !log reedy synchronized php-1.22wmf13/includes/specials/SpecialNewimages.php
[17:57:59] PROBLEM - Ceph on ms-fe1001 is CRITICAL: Ceph HEALTH_WARN noout flag(s) set
[17:58:06] Logged the message, Master
[17:58:14] paravoid: i am ready if you are to update the disk firmware on ms-be1008
[18:01:37] cmjohnson1: go ahead
[18:01:45] okay..cool
[18:03:29] are you going to poweroff the box or should I?
[18:03:39] is there anything to stop first?
[18:03:48] no, a proper shutdown is enough
[18:03:54] okay..then i will do it
[18:04:41] !log powering down ms-be1008 to update hard drive firmware
[18:04:52] Logged the message, Master
[18:04:53] (PS1) Ori.livneh: declare hafnium node in site.pp & grant sudo to myself [operations/puppet] - https://gerrit.wikimedia.org/r/79218
[18:05:38] ^ cmjohnson1 / paravoid -- not sure i did that right (i don't know if there are other steps to get puppet running on a host)
[18:06:28] ori-l: since this has no role classes included
[18:06:43] could you add perhaps a comment (and/or system_role) indicating what that box does?
[18:07:09] PROBLEM - Host ms-be1008 is DOWN: PING CRITICAL - Packet loss = 100%
[18:07:14] yes, good idea
[18:07:14] I can imagine e.g. someone seeing it go down, opening site.pp to find out what that is and getting all confused
[18:14:32] (PS2) Ori.livneh: declare hafnium node in site.pp & grant sudo to myself [operations/puppet] - https://gerrit.wikimedia.org/r/79218
[18:15:16] (CR) Faidon: [C: 2] declare hafnium node in site.pp & grant sudo to myself [operations/puppet] - https://gerrit.wikimedia.org/r/79218 (owner: Ori.livneh)
[18:15:19] Damn you gerrit
[18:15:27] Oh, just slow
[18:15:42] gerrit's slow, puppet's slow, icinga's slow
[18:15:50] so annoying
[18:15:51] <^d> Gerrit's slow because I had to flush all the caches.
[18:16:06] (CR) Reedy: [C: 2] 1.22wmf13 phase one wikis to 1.22wmf13 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79212 (owner: Reedy)
[18:17:21] cmjohnson1, paravoid, RobH: thank youuuuuuuuuuuuuuuuuuuuuuu :)
[18:17:51] (Merged) jenkins-bot: 1.22wmf13 phase one wikis to 1.22wmf13 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79212 (owner: Reedy)
[18:20:10] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: 1.22wmf13 phase1
[18:20:23] Logged the message, Master
[18:21:38] dat wikiversions
[18:24:08] * bawolff attempts to get attention for https://bugzilla.wikimedia.org/show_bug.cgi?id=52864 (cp1063 doesn't respond to htcp purges)
[18:26:15] bawolff: ottomata was also poking about that
[18:26:18] see logs
[18:26:23] ah ok
[18:26:45] Sorry, I should have looked before asking
[18:27:00] np, just fyi
[18:27:12] well, i didn't have much success (also was in meeting when I posted that, scrolling back now...)
[18:27:31] bawolff: also, https://rt.wikimedia.org/Ticket/Display.html?id=5614 is set up so that you will get a copy of any replies on it
[18:27:47] ok, sounds like paravoid and bblack were looking at it, sooooo
[18:28:02] i think passwd reset may be broken but you could try doing one for bawolff+wn@
[18:28:21] I stopped looking at it once bblack took over
[18:28:23] if that works then you should be able to read https://rt.wikimedia.org/Ticket/Display.html?id=5614 in the web UI
[18:28:32] bawolff: ^
[18:29:03] yeah, I don't have access to that
[18:29:08] password reset didn't do anything
[18:29:13] bawolff: latest was, it's not that it doesn't do purges at all
[18:29:32] varnish is being slow processing some of them and vhtcpd times out
[18:29:50] that's according to what brandon said here, I don't know more
[18:30:50] That's interesting. It's odd it would only be that server
[18:31:04] In any case, I'm just happy people are looking into it
[18:31:41] how does it work? are they bans?
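Note on the timeout behaviour summarised above: the second option bblack floated at 15:57 (give up on a queue entry after a few timeouts, count the failure, move on) can be sketched in a few lines of Python. This is not vhtcpd's code and says nothing about how varnish handles the request internally. `build_purge`, `drain`, `MAX_TIMEOUTS` and the injected `send` callable are all hypothetical; only the request format is taken from the strace paste at 15:45.

```python
from collections import deque
from typing import Callable, Deque, Dict

MAX_TIMEOUTS = 3  # assumed threshold; the log only says "a few timeouts"


def build_purge(path: str, host: str) -> bytes:
    """Build a PURGE request in the shape seen in the strace output."""
    return (
        f"PURGE {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "User-Agent: vhtcpd\r\n\r\n"
    ).encode("ascii")


def drain(queue: Deque[str], host: str,
          send: Callable[[bytes], bool]) -> Dict[str, int]:
    """Send each queued path; drop an entry after MAX_TIMEOUTS failures.

    `send` returns True when a response arrived and False on a timeout.
    """
    stats = {"purged": 0, "dropped": 0}
    while queue:
        request = build_purge(queue.popleft(), host)
        for _attempt in range(MAX_TIMEOUTS):
            if send(request):
                stats["purged"] += 1
                break
        else:
            # Every attempt timed out: count it and move on so the rest
            # of the queue is not blocked behind this one entry.
            stats["dropped"] += 1
    return stats


# Stand-in for the network: the one URL that hung in the log never answers.
def fake_send(request: bytes) -> bool:
    return b"DNA_orbit_animated" not in request


DNA_PATH = ("/wikipedia/commons/thumb/1/16/DNA_orbit_animated.gif/"
            "200px-DNA_orbit_animated.gif")
OTHER_PATH = "/wikipedia/commons/thumb/1/16/xxx.gif/200px-xxx.gif"

result = drain(deque([DNA_PATH, OTHER_PATH]), "upload.wikimedia.org", fake_send)
print(result)
```

The `dropped` counter is the "failure stat" from the discussion; without alerting on it, this approach has exactly the downside raised at 15:57:36, that hung purges get ignored.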
[18:31:43] it is odd
[18:32:39] hm, should i be able to ssh into hafnium by now?
[18:33:09] puppet says no
[18:36:32] info: Applying configuration version '1376591719'
[18:36:37] notice: /Stage[main]/No: No.
[18:36:40] notice: Finished catalog run in 8.33 seconds
[18:36:52] no?
[18:36:57] ah,
[18:36:58] i made that up
[18:36:58] hah
[18:37:01] yeah :)
[18:37:14] I hadn't seen the backscroll :)
[18:38:09] forcing a puppet run
[18:39:02] thanks
[18:39:43] err: Could not retrieve catalog from remote server: Connection timed out - connect(2)
[18:39:49] puppet isn't happy
[18:39:55] 15 17:22:49 <+icinga-wm> PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:40:20] * ori-l un-/ignores icinga-wm
[18:42:13] (CR) Aaron Schulz: [C: 1] "(1 comment)" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79198 (owner: Faidon)
[18:42:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.542 second response time
[18:42:51] * Aaron|home keeps getting pre-IPO stuff in linkedin
[18:43:10] Aaron|home: my intention was to keep it off, run sync jobs one last time, then enable it back again
[18:43:13] sounds reasonable?
[18:43:41] that's what we did last time wasn't it?
[18:43:41] meh, that works :)
[18:44:19] PROBLEM - Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours
[18:44:34] and I'm thinking of swapping masters e.g. on Wed
[18:45:20] autoresync matters the most when changing masters or if the current one wasn't current that long and one is still paranoid
[18:46:06] though it's less worrisome with 'conservative' which didn't exist the first times we went through this
[18:46:28] ok
[18:47:02] but, yeah, doing what you were going to do is fine
[18:47:18] ori-l: hafnium should be ready
[18:47:40] !log rebooting stafford & sockpuppet for kernel upgrades
[18:47:48] dell sends me nothing but shit...paravoid this may have to be done another day
[18:47:52] Logged the message, Master
[18:47:58] cmjohnson1: what happened?
[18:48:19] PROBLEM - RAID on stafford is CRITICAL: Timeout while attempting connection
[18:48:20] the firmware iso can't find any of the disks
[18:49:04] and works nothing like the instructions say
[18:49:52] paravoid: another thing worth mentioning is ms-be1008 shows a foreign config..i left it alone but you may wanna look at that later
[18:49:59] PROBLEM - Host stafford is DOWN: PING CRITICAL - Packet loss = 100%
[18:50:10] that's weird
[18:50:29] RECOVERY - Host stafford is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms
[18:50:44] cmjohnson1: are you going to boot the box now?
[18:51:08] yes..if you want to console in
[18:51:34] yeah I might just as well
[18:51:55] actually it looks like the fw upgrade did go through...just super fast i guess
[18:52:54] paravoid: nope. do i need to add accounts::olivneh? it's not set on vanadium but maybe it was at one point
[18:53:29] duh, right
[18:53:37] ^ that reboot was you?
[18:53:40] yes
[18:53:43] k
[18:55:51] ││00:00:09 SAS 1862.50 GB Foreign - TOSHIBA
[18:56:13] all seagate disks but the 10th, which is what it was failed
[18:56:51] that toshiba was the last replacement...must not have put it back in properly
[18:57:22] (PS1) Ori.livneh: Add accounts::olivneh to vanadium & hafnium [operations/puppet] - https://gerrit.wikimedia.org/r/79225
[18:58:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:59:23] and disk 00:00:11 (i.e. bay 12) says SMART Error
[18:59:52] want to take the console and try to make sense of all that?
[19:00:00] bay 12 is the one blinking green/amber
[19:01:16] we want to add the disk back but where?
[19:01:28] I'm confused :)
[19:04:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.231 second response time
[19:07:35] How come apparently we only monitor "Apache HTTP" in PMTPA and not EQIAD for mw app servers?
[19:07:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:13:30] Oh.. Seemingly just on jobrunners
[19:13:39] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[19:16:19] RECOVERY - Ceph on ms-fe1004 is OK: Ceph HEALTH_OK
[19:16:39] RECOVERY - Ceph on ms-fe1003 is OK: Ceph HEALTH_OK
[19:16:59] RECOVERY - Ceph on ms-fe1001 is OK: Ceph HEALTH_OK
[19:28:29] PROBLEM - NTP on ms-be1008 is CRITICAL: NTP CRITICAL: Offset unknown
[19:29:24] (PS1) MaxSem: Rebuild localisation cache in several threads [operations/puppet] - https://gerrit.wikimedia.org/r/79231
[19:33:29] RECOVERY - NTP on ms-be1008 is OK: NTP OK: Offset -0.0007193088531 secs
[19:34:19] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 10 hours
[19:40:37] !log Added wb_property_info table to wikidatawiki and testwikidatawiki
[19:40:48] Logged the message, Master
[19:43:19] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[19:43:19] PROBLEM - Puppet freshness on holmium is CRITICAL: No successful Puppet run in the last 10 hours
[19:43:19] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours
[19:43:19] PROBLEM - Puppet freshness on pdf3 is CRITICAL: No successful Puppet run in the last 10 hours
[19:43:19] PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours
[19:43:20] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours
[19:44:57] paravoid: can i has https://gerrit.wikimedia.org/r/#/c/79225/?
[19:45:34] oh sorry
[19:45:46]