[00:04:13] !log catrope Finished syncing Wikimedia installation... : Updating VisualEditor to master [00:04:18] Logged the message, Master [00:06:22] (03PS2) 10BryanDavis: Add *_delta stats for vhtcpd ganglia. [operations/puppet] - 10https://gerrit.wikimedia.org/r/80151 [00:11:21] !log catrope synchronized php-1.22wmf13/extensions/VisualEditor/modules/ve/ui/tools/buttons/ve.ui.UnderlineButtonTool.js 'touch' [00:11:26] Logged the message, Master [00:13:40] PROBLEM - Puppet freshness on cp1063 is CRITICAL: No successful Puppet run in the last 10 hours [00:22:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:23:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [00:26:21] hi Risker :) [00:26:26] (03PS1) 10Ori.livneh: Tee centralauth wfDebug()s to vanadium for Gangliafication [operations/puppet] - 10https://gerrit.wikimedia.org/r/80164 [00:27:01] hi ori-l! we were supposed to talk about something last week but we both wound up getting on planes instead. Was it performance? [00:27:20] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000 [00:27:38] Eep -- I may have forgotten, too. [00:28:19] StevenW was playing matchmaker there. We can talk about it later this week, I'm kind of overloaded for the next day or two [00:29:23] Yes, performance! [00:29:26] Thanks for the reminder. [00:30:30] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:31:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:32:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.141 second response time [00:33:04] TimStarling: https://gerrit.wikimedia.org/r/80164 is almost an exact duplicate of a change you reviewed in the past which configured the udp2log instance on fluorine to relay exceptions/fatals to vanadium. This time it's centralauth, as an expedient hack to get a metric going to accompany the coming deployments. Could you review? [00:37:28] (03CR) 10Greg Grossmeier: [C: 031] Tee centralauth wfDebug()s to vanadium for Gangliafication [operations/puppet] - 10https://gerrit.wikimedia.org/r/80164 (owner: 10Ori.livneh) [00:39:20] (03CR) 10Tim Starling: [C: 04-1] "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/80164 (owner: 10Ori.livneh) [00:41:29] (03PS2) 10Ori.livneh: Tee centralauth wfDebug()s to vanadium for Gangliafication [operations/puppet] - 10https://gerrit.wikimedia.org/r/80164 [00:51:13] ori-l: do you want me to deploy it? [00:51:39] TimStarling: yes, I'd appreciate that. Thanks. [00:51:47] TimStarling: that'd be great (not sure if he's still here). 
getting more data the better at this point [00:52:00] well then [00:52:01] ;) [00:58:02] !log allow engineering and ops group members in RT to create new saved searches [00:58:07] Logged the message, Master [01:02:23] !log mwalker synchronized php-1.22wmf13/extensions/CentralNotice/special 'Applying CentralNotice fix for bug 53032' [01:02:28] Logged the message, Master [01:02:58] !log mwalker synchronized php-1.22wmf12/extensions/CentralNotice/special 'Applying CentralNotice fix for bug 53032' [01:03:03] Logged the message, Master [01:22:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:23:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [02:06:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:07:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [02:13:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:14:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [02:15:12] !log LocalisationUpdate completed (1.22wmf13) at Wed Aug 21 02:15:12 UTC 2013 [02:15:18] Logged the message, Master [02:22:02] AaronSchulz: it's a tribute? 
[02:28:33] !log LocalisationUpdate completed (1.22wmf12) at Wed Aug 21 02:28:33 UTC 2013 [02:28:39] Logged the message, Master [02:34:50] !log [02:35:19] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Aug 21 02:35:19 UTC 2013 [02:35:24] Logged the message, Master [02:46:02] !log RT - grant global right to see system dashboards to privileged users [02:46:07] Logged the message, Master [02:46:40] !log RT - create new system dashboard 'wikimedia default' that lists open ops-requests, quick ticket creation, reminders and bookmarked tickets [02:46:45] Logged the message, Master [02:50:03] PROBLEM - Puppet freshness on zirconium is CRITICAL: No successful Puppet run in the last 10 hours [02:56:03] PROBLEM - Puppet freshness on lvs1001 is CRITICAL: No successful Puppet run in the last 10 hours [03:16:29] (03CR) 10Dzahn: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79955 (owner: 10Mattflaschen) [03:22:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:23:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.136 second response time [03:28:49] Coren: I'm around now. my jet lag caused me to pass out for most of the day [03:30:29] just "jet lag"? [03:32:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:33:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [03:38:06] (03PS3) 10Jalexander: Replace public key for jamesofur [operations/puppet] - 10https://gerrit.wikimedia.org/r/79304 [03:47:42] Ryan_Lane: Yeay jet lag. Go to sleep, I'll talk to you tomorrow. 
:-) (I'm off to bed to rest that effing flu away) [03:48:58] heh, well now I'm not tired :) [03:50:05] Ryan_Lane: work through the night, it'll be fun ;) [03:50:12] :D [03:51:28] Ryan_Lane and greg-g (and ori-l and TimStarling and the others who have been working on it), thank you for your work on the HTTPS issue. [03:51:41] yw [03:51:50] Risker: you're welcome. [03:52:03] Tim's doing the hard part, I'm just writing emails and blog posts and wiki pages ;) [03:52:35] yeah, that wikipage needs a good copy edit and comb-out, but at least it's there :) [03:52:44] please do! [03:52:47] I'm doing peice meal right now [03:53:12] also, if you haven't seen, the Blog post draft has more up to date info that I'll copy over to the HTTPS metawiki page in the morning [03:53:16] * Risker tries to find it again....darn watchlist.... [03:53:23] https://meta.wikimedia.org/wiki/Wikimedia_Blog/Drafts/HTTPS_by_default_for_logged_in_users [03:53:57] * greg-g goes to check in on family [03:53:59] see ya later [04:22:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:23:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [05:14:05] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [05:22:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:23:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [05:30:54] greg-g, did some copy editing for you, also left some CE suggestions on the talk page [05:36:26] yes, that's a good point [05:52:13] PROBLEM - Puppet freshness on srv281 is CRITICAL: No successful Puppet run in the last 10 hours [05:52:13] PROBLEM - Puppet freshness on virt2 is CRITICAL: No successful Puppet run in the last 10 hours [05:55:13] PROBLEM - Puppet freshness on 
virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [06:00:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:02:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [06:14:09] PROBLEM - Puppet freshness on mw1046 is CRITICAL: No successful Puppet run in the last 10 hours [06:49:24] PROBLEM - search indices - check lucene status page on search32 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 480 bytes in 0.057 second response time [07:02:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:03:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [07:05:43] (03PS1) 10Tim Starling: Proposed configuration for wgSecureLogin [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80175 [07:05:59] (03CR) 10Tim Starling: "Untested." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80175 (owner: 10Tim Starling) [07:15:36] (03PS1) 10ArielGlenn: fix canonical ip addr for snmptrap [operations/puppet] - 10https://gerrit.wikimedia.org/r/80176 [07:16:46] (03PS2) 10ArielGlenn: fix canonical ip addr for snmptrap [operations/puppet] - 10https://gerrit.wikimedia.org/r/80176 [07:18:41] (03CR) 10ArielGlenn: [C: 032] fix canonical ip addr for snmptrap [operations/puppet] - 10https://gerrit.wikimedia.org/r/80176 (owner: 10ArielGlenn) [07:22:38] RECOVERY - Puppet freshness on virt2 is OK: puppet ran at Wed Aug 21 07:22:37 UTC 2013 [07:27:31] (03CR) 10MaxSem: [C: 031] Proposed configuration for wgSecureLogin [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80175 (owner: 10Tim Starling) [07:31:38] apergos: that's wrong [07:32:03] please explain [07:32:05] you're assuming eth0 has an IP which can easily not be the case [07:32:15] e.g. 
all machines with bonds [07:32:39] what woudl you propose? [07:34:00] set it to ipaddress_bond0 if that variable has a value, otherwise to eth0? [07:34:04] or are there more exceptions? [07:34:56] who knows really :) [07:37:18] well I could do that for now [07:47:32] can I update ULS to master on 1.22wmf13 now? [07:51:07] I'm the only one having deployment windows today [07:51:18] and I'm saying it's okay :) [07:51:19] go ahead [07:51:26] thanks [07:51:34] yw [07:54:07] RECOVERY - Puppet freshness on virt1005 is OK: puppet ran at Wed Aug 21 07:54:00 UTC 2013 [07:57:19] (03CR) 10Faidon: [C: 032] "Having compatibility with what's in precise is a good thing, so thanks for that." [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/80127 (owner: 10Edenhill) [07:57:25] (03CR) 10Faidon: [V: 032] "Having compatibility with what's in precise is a good thing, so thanks for that." [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/80127 (owner: 10Edenhill) [08:01:08] !log nikerabbit synchronized php-1.22wmf13/extensions/UniversalLanguageSelector/ 'ULS to master' [08:01:20] Logged the message, Master [08:01:23] I'm done [08:12:12] Nischayn22|Away: thanks [08:12:26] (03CR) 10Faidon: [C: 04-1] "(4 comments)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79329 (owner: 10Akosiaris) [08:19:24] (03PS5) 10Faidon: exim: add DKIM for wikimedia.org domains [operations/puppet] - 10https://gerrit.wikimedia.org/r/79754 [08:19:25] (03PS1) 10Faidon: mailman: remove DKIM headers [operations/puppet] - 10https://gerrit.wikimedia.org/r/80181 [08:20:16] (03CR) 10Faidon: [C: 032] mailman: remove DKIM headers [operations/puppet] - 10https://gerrit.wikimedia.org/r/80181 (owner: 10Faidon) [08:20:23] (03CR) 10Faidon: [C: 032] exim: add DKIM for wikimedia.org domains [operations/puppet] - 10https://gerrit.wikimedia.org/r/79754 (owner: 10Faidon) [08:40:05] (03PS1) 10ArielGlenn: one more try for snmp trap ip client ip address 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/80182 [08:41:51] (03CR) 10ArielGlenn: [C: 032] one more try for snmp trap ip client ip address [operations/puppet] - 10https://gerrit.wikimedia.org/r/80182 (owner: 10ArielGlenn) [08:44:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:46:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [09:20:50] (03PS1) 10Faidon: exim: s/content/source/ on DKIM keys [operations/puppet] - 10https://gerrit.wikimedia.org/r/80184 [09:21:04] (03CR) 10Faidon: [C: 032 V: 032] exim: s/content/source/ on DKIM keys [operations/puppet] - 10https://gerrit.wikimedia.org/r/80184 (owner: 10Faidon) [09:21:07] ffs [09:21:10] SO MUCH TIME [09:27:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:29:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.145 second response time [09:31:57] 2013-08-21 12:31:25 1VC4lA-0007tq-Pn DKIM: d=wikimedia.org s=wikimedia c=relaxed/relaxed a=rsa-sha256 [verification succeeded] [09:32:00] 2013-08-21 12:31:42 1VC4lR-0007tx-UT DKIM: d=lists.wikimedia.org s=wikimedia c=relaxed/relaxed a=rsa-sha256 [verification succeeded] [09:32:03] finally [10:13:51] PROBLEM - Puppet freshness on cp1063 is CRITICAL: No successful Puppet run in the last 10 hours [10:23:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:24:25] (03PS1) 10Faidon: exim: revert DKIM signing for wikimedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/80189 [10:24:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.535 second response time [10:26:19] (03CR) 10Faidon: [C: 032] exim: revert DKIM signing for wikimedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/80189 
(owner: 10Faidon) [10:58:44] and finally starting my windows [10:58:44] (03PS1) 10Faidon: filebackend: switch master to Ceph [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80195 [11:00:52] (03CR) 10Faidon: [C: 032] filebackend: switch master to Ceph [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80195 (owner: 10Faidon) [11:02:08] !log faidon synchronized wmf-config/filebackend.php 'promote ceph to master' [11:02:13] Logged the message, Master [11:54:46] good luck! [12:03:07] (03PS1) 10Yuvipanda: Add appropriate support for websocket proxying [operations/puppet] - 10https://gerrit.wikimedia.org/r/80201 [12:03:21] it's fiiine [12:08:18] !log Running extensions/CentralAuth/maintenance/populateHomeDB.php in a screen on tin [12:08:23] Logged the message, Master [12:08:54] :-) [12:12:32] (03PS9) 10Yuvipanda: Route requests based on data from Redis [operations/puppet] - 10https://gerrit.wikimedia.org/r/78025 [12:12:34] (03PS9) 10Yuvipanda: Add redis lua library to labsproxy [operations/puppet] - 10https://gerrit.wikimedia.org/r/78002 [12:12:34] (03PS2) 10Yuvipanda: Add appropriate support for websocket proxying [operations/puppet] - 10https://gerrit.wikimedia.org/r/80201 [12:12:36] (03PS1) 10Yuvipanda: Remove useless proxy_redirect directive [operations/puppet] - 10https://gerrit.wikimedia.org/r/80203 [12:23:17] apergos: so, what did you do with ::ipaddress after all? 
[12:23:52] if bond0 is set use that, if not see if eth0 is set and use that, otherwise fall back to ipaddress [12:24:05] I tried checking with salt which hss didn't have eth0 [12:24:10] those all seemed to have bond0 [12:24:28] *which hosts [12:45:25] I think we should just use $::ipaddress [12:45:36] virt2 must be the only real exception [12:45:46] but anyway, let's talk over a gerrit changeset :) [12:50:56] PROBLEM - Puppet freshness on zirconium is CRITICAL: No successful Puppet run in the last 10 hours [12:56:56] PROBLEM - Puppet freshness on lvs1001 is CRITICAL: No successful Puppet run in the last 10 hours [13:08:40] (03PS1) 10Faidon: Prepare rubidium/mexia for authdns migration [operations/puppet] - 10https://gerrit.wikimedia.org/r/80211 [13:10:06] (03CR) 10Faidon: [C: 032] Prepare rubidium/mexia for authdns migration [operations/puppet] - 10https://gerrit.wikimedia.org/r/80211 (owner: 10Faidon) [13:31:04] okay, here goes [13:38:16] !log switching ns1 traffic to mexia (new authdns) [13:38:21] Logged the message, Master [13:40:00] (03CR) 10Akosiaris: "(4 comments)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79329 (owner: 10Akosiaris) [13:46:25] PROBLEM - HTTP radosgw on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:46:29] when is the move to HTTPS happening? [13:46:48] (03PS5) 10Akosiaris: Refactoring nrpe module (round 2/??) 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/79329 [13:47:55] PROBLEM - HTTP radosgw on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:49:15] RECOVERY - HTTP radosgw on ms-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.005 second response time [13:49:45] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:49:47] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:49:47] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:49:47] RECOVERY - HTTP radosgw on ms-fe1002 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 2.950 second response time [13:49:50] aww crap [13:50:35] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 60995 bytes in 0.230 second response time [13:50:37] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.063 second response time [13:50:37] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.064 second response time [13:50:37] PROBLEM - HTTP radosgw on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:51:10] wth [13:51:13] there's nothing wrong [13:51:26] malafaya: Answered in -tech. 
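The address fallback described at 12:23:52 (use bond0 if it is set, otherwise eth0 if it is set, otherwise fall back to ipaddress) can be sketched roughly as follows. This is only an illustration in Python, not the Puppet change that was merged: the function name `pick_trap_address` is hypothetical, the facts are modelled as a plain dictionary, and the key `ipaddress_eth0` is an assumed fact name by analogy with the `ipaddress_bond0` variable mentioned at 07:34:00.

```python
from typing import Mapping, Optional

# Order of preference taken from the discussion: bond0, then eth0,
# then the generic ipaddress fact. "ipaddress_eth0" is an assumed name.
PREFERRED_FACTS = ("ipaddress_bond0", "ipaddress_eth0", "ipaddress")


def pick_trap_address(facts: Mapping[str, str]) -> Optional[str]:
    """Return the first non-empty address among the preferred facts."""
    for name in PREFERRED_FACTS:
        value = facts.get(name)
        if value:  # skips missing keys and empty strings alike
            return value
    return None  # no usable address fact at all


# Hypothetical hosts, using documentation-range addresses.
bonded_host = {"ipaddress_bond0": "192.0.2.10", "ipaddress": "192.0.2.10"}
plain_host = {"ipaddress_eth0": "192.0.2.20", "ipaddress": "192.0.2.20"}
odd_host = {"ipaddress": "192.0.2.30"}

for host in (bonded_host, plain_host, odd_host):
    print(pick_trap_address(host))
```

The later suggestion at 12:45:25 to rely on `$::ipaddress` alone corresponds to shrinking the preference tuple to its last entry.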
[13:54:25] RECOVERY - HTTP radosgw on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.004 second response time [13:55:38] ganglia says ms-be1008 was unhappy there for a bit [13:56:06] it's been unhappy in general [13:56:12] it was like that before [13:56:32] there is a spike at that time though [13:56:35] hmmm [13:56:36] weird [13:56:43] thanks, that was a nice hint [13:56:46] sure [13:56:59] I had it on my todo to ask #ceph on how to debug this high cpu usage [13:57:02] look at 2/4hr [13:57:26] look at day, it's even worse [13:57:28] pretty interesting [13:58:31] !log stopping ceph-osd 88 & 95 (ms-be1008), evidence of unexplainable high cpu usage [13:58:36] Logged the message, Master [14:00:05] PROBLEM - Ceph on ms-fe1003 is CRITICAL: Ceph HEALTH_WARN 708 pgs degraded: 395 pgs stuck unclean: recovery 12860568/908502459 degraded (1.416%): 2/143 in osds are down [14:00:15] PROBLEM - Ceph on ms-fe1001 is CRITICAL: Ceph HEALTH_WARN 708 pgs degraded: 396 pgs stuck unclean: recovery 12860572/908502633 degraded (1.416%): 2/143 in osds are down [14:00:25] PROBLEM - Ceph on ms-fe1004 is CRITICAL: Ceph HEALTH_WARN 708 pgs degraded: 406 pgs stuck unclean: recovery 12860574/908502678 degraded (1.416%): 2/143 in osds are down [14:03:14] CPU is now increased for normal reasons [14:03:37] the OSDs were out'ed, so now other OSDs on that box are getting all of its data [14:05:48] holy crap [14:05:52] ? [14:05:58] http://ganglia.wikimedia.org/latest/?c=Ceph%20eqiad&h=ms-be1001.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [14:06:01] gigabit's full [14:06:09] yeow [14:06:22] take that swift [14:08:52] so why is there a spike in traffic and load like that in the last ten mins? that seems pretty sharp [14:09:00] (only if you are not busy) [14:09:01] because I stopped osd.88 & osd.95 [14:09:10] two osds have that much impact? 
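The "degraded (1.416%)" figure in the Ceph alerts at 14:00 is the ratio of the two numbers printed just before it. A minimal sketch for checking that arithmetic, assuming the alert text has exactly the shape seen in this log; the regular expression and the function name `degraded_percent` are illustrative and not part of any Ceph or Nagios tooling:

```python
import re

# Matches the "recovery X/Y degraded (Z%)" fragment as it appears in the alerts.
RECOVERY_RE = re.compile(r"recovery (\d+)/(\d+) degraded \(([\d.]+)%\)")


def degraded_percent(status_line):
    """Recompute the degraded percentage from the X/Y counts in a status line.

    Returns None when the line carries no recovery fragment.
    """
    match = RECOVERY_RE.search(status_line)
    if match is None:
        return None
    degraded = int(match.group(1))
    total = int(match.group(2))
    return round(100.0 * degraded / total, 3)


# Status text copied from the 14:00:05 alert for ms-fe1003.
line = (
    "Ceph HEALTH_WARN 708 pgs degraded: 395 pgs stuck unclean: "
    "recovery 12860568/908502459 degraded (1.416%): 2/143 in osds are down"
)
print(degraded_percent(line))
```

Recomputing from the raw counts rather than trusting the printed percentage makes it easy to compare successive alerts, since the three alerts here differ only slightly in their counts.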
[14:09:14] ms-be1001/2 have the other two copies [14:09:25] so now it's reconstructing the third copy out of the other two [14:09:33] for all of 88/95's data [14:09:37] that's good! [14:09:43] as long as I can limit it somehow [14:09:47] as fast as it can go you mean [14:09:48] it's nice that it can go at it full speed [14:09:58] unlike say swift [14:10:09] where it took weeks for you to depool machines :) [14:10:16] it might be nice to be able to throttle it some [14:10:26] (03PS1) 10Akosiaris: Tab cleanup in site.pp. Also fix vim modeline [operations/puppet] - 10https://gerrit.wikimedia.org/r/80213 [14:14:12] amazing [14:16:18] PROBLEM - Apache HTTP on mw1155 is CRITICAL: Connection timed out [14:16:18] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection timed out [14:16:27] PROBLEM - HTTP radosgw on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:28] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:28] PROBLEM - Apache HTTP on mw1154 is CRITICAL: Connection timed out [14:16:34] fuck [14:16:37] PROBLEM - HTTP radosgw on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:37] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection timed out [14:16:57] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:17:17] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection timed out [14:17:28] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:17:28] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection timed out [14:17:37] grumble [14:20:17] RECOVERY - HTTP radosgw on ms-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.004 second response time [14:20:18] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.601 second response time [14:20:18] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 
bytes in 0.156 second response time [14:20:18] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.751 second response time [14:20:27] RECOVERY - HTTP radosgw on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.018 second response time [14:20:37] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.638 second response time [14:21:07] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second response time [14:21:18] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.112 second response time [14:21:27] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 60995 bytes in 0.146 second response time [14:21:30] PROBLEM - HTTP radosgw on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:21:47] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.055 second response time [14:22:27] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.068 second response time [14:24:17] RECOVERY - HTTP radosgw on ms-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.008 second response time [14:38:40] (03PS1) 10Ottomata: Preparing to remove private webrequest logs from stat1. [operations/puppet] - 10https://gerrit.wikimedia.org/r/80214 [14:42:41] (03CR) 10Ottomata: [C: 032 V: 032] Preparing to remove private webrequest logs from stat1. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/80214 (owner: 10Ottomata) [14:49:40] Haha 28.6M rows in globalnames [14:58:34] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:59:22] (03PS1) 10Faidon: Ceph: tune config knobs in response to mini-outage [operations/puppet] - 10https://gerrit.wikimedia.org/r/80219 [14:59:25] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [15:00:44] (03CR) 10Faidon: [C: 032] Ceph: tune config knobs in response to mini-outage [operations/puppet] - 10https://gerrit.wikimedia.org/r/80219 (owner: 10Faidon) [15:14:09] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [15:20:09] PROBLEM - HTTP radosgw on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:21:49] PROBLEM - HTTP radosgw on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:22:49] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection timed out [15:22:59] RECOVERY - HTTP radosgw on ms-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.003 second response time [15:22:59] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection timed out [15:23:19] PROBLEM - Apache HTTP on mw1153 is CRITICAL: Connection timed out [15:23:19] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection timed out [15:23:19] PROBLEM - Apache HTTP on mw1155 is CRITICAL: Connection timed out [15:23:30] PROBLEM - Apache HTTP on mw1154 is CRITICAL: Connection timed out [15:23:39] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:24:39] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection timed out [15:24:42] PROBLEM - Apache HTTP on mw1160 is CRITICAL: Connection timed out [15:24:42] RECOVERY - HTTP radosgw on ms-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.003 second response time [15:25:24] (03PS2) 10Reedy: Remove long-buried $wgLogAutocreatedAccounts 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79059 (owner: 10Nemo bis) [15:25:28] (03CR) 10Reedy: [C: 032] Remove long-buried $wgLogAutocreatedAccounts [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79059 (owner: 10Nemo bis) [15:25:40] (03Merged) 10jenkins-bot: Remove long-buried $wgLogAutocreatedAccounts [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79059 (owner: 10Nemo bis) [15:25:59] PROBLEM - HTTP radosgw on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:26:04] (03PS2) 10Reedy: Set Wikibase sort order to alphabetic for ilowiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79990 (owner: 10TTO) [15:26:09] (03CR) 10Reedy: [C: 032] Set Wikibase sort order to alphabetic for ilowiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79990 (owner: 10TTO) [15:26:19] (03Merged) 10jenkins-bot: Set Wikibase sort order to alphabetic for ilowiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79990 (owner: 10TTO) [15:26:30] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 60992 bytes in 5.537 second response time [15:26:33] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.420 second response time [15:26:37] (03CR) 10Raimond Spekking: "Thanks for merging. But https://de.planet.wikimedia.org/ still shows the old URL. Anyhting more to do? 
Or is it related to the failure abo" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79466 (owner: 10Raimond Spekking) [15:27:09] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.058 second response time [15:27:09] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.047 second response time [15:27:49] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.176 second response time [15:28:12] (03CR) 10Reedy: [C: 04-1] "Needs rebasing" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78502 (owner: 10TTO) [15:28:19] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.987 second response time [15:28:20] (03CR) 10Akosiaris: [C: 032] Tab cleanup in site.pp. Also fix vim modeline [operations/puppet] - 10https://gerrit.wikimedia.org/r/80213 (owner: 10Akosiaris) [15:28:29] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.857 second response time [15:28:46] (03PS2) 10Reedy: Add WikiProject namespace for ckbwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78648 (owner: 10TTO) [15:28:49] RECOVERY - HTTP radosgw on ms-fe1002 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.004 second response time [15:28:53] (03CR) 10Reedy: [C: 032] Add WikiProject namespace for ckbwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78648 (owner: 10TTO) [15:28:59] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.176 second response time [15:29:08] (03Merged) 10jenkins-bot: Add WikiProject namespace for ckbwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78648 (owner: 10TTO) [15:29:20] (03PS3) 10Reedy: Create five additional namespaces for pflwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78624 (owner: 10TTO) [15:29:26] 
(03CR) 10Reedy: [C: 032] Create five additional namespaces for pflwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78624 (owner: 10TTO) [15:29:38] (03Merged) 10jenkins-bot: Create five additional namespaces for pflwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78624 (owner: 10TTO) [15:30:14] (03PS2) 10Reedy: NewUserMessage extension configuration on ckb.wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78662 (owner: 10Dereckson) [15:30:17] (03CR) 10Reedy: [C: 032] NewUserMessage extension configuration on ckb.wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78662 (owner: 10Dereckson) [15:30:29] (03Merged) 10jenkins-bot: NewUserMessage extension configuration on ckb.wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78662 (owner: 10Dereckson) [15:31:19] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:31:30] (03PS2) 10Reedy: skwikisource: Project name localization [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79016 (owner: 10Danny B.) [15:31:30] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection timed out [15:31:39] (03CR) 10Reedy: [C: 032] skwikisource: Project name localization [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79016 (owner: 10Danny B.) [15:31:39] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:31:39] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:31:49] (03Merged) 10jenkins-bot: skwikisource: Project name localization [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79016 (owner: 10Danny B.) [15:31:49] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:31:51] paravoid: flapping a bit? 
[15:31:59] PROBLEM - HTTP radosgw on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:31:59] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:32:04] (03PS1) 10Faidon: Revert "filebackend: switch master to Ceph" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80221 [15:32:29] (03CR) 10Faidon: [C: 032] Revert "filebackend: switch master to Ceph" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80221 (owner: 10Faidon) [15:32:49] (03PS3) 10Reedy: Add several additional user groups for ckbwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79197 (owner: 10TTO) [15:32:59] (03CR) 10Reedy: [C: 032] Add several additional user groups for ckbwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79197 (owner: 10TTO) [15:33:08] (03Merged) 10jenkins-bot: Add several additional user groups for ckbwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79197 (owner: 10TTO) [15:33:19] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:33:19] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:33:30] (03PS2) 10Reedy: Add namespace aliases (shortcuts) for dewikivoyage [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79550 (owner: 10TTO) [15:33:37] (03CR) 10Reedy: [C: 032] Add namespace aliases (shortcuts) for dewikivoyage [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79550 (owner: 10TTO) [15:33:47] (03Merged) 10jenkins-bot: Add namespace aliases (shortcuts) for dewikivoyage [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79550 (owner: 10TTO) [15:33:58] Reedy: are you deploying? [15:34:15] I need to revert this but diff is dirty [15:34:20] has some wikisource stuff [15:34:25] Was going to when I'd finished merging stuff... 
[15:34:32] I've not pulled anything onto tin yet
[15:34:41] I need to though
[15:34:44] I fetched
[15:34:49] RECOVERY - HTTP radosgw on ms-fe1002 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.032 second response time
[15:35:06] Right, I'll stop merging while you do yours
[15:35:15] can I merge? I'm just going to sync one file
[15:35:31] yeah, that's fine
[15:36:19] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.721 second response time
[15:36:28] !log faidon synchronized wmf-config/filebackend.php 'revert ceph promotion to master'
[15:36:30] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.659 second response time
[15:36:33] Logged the message, Master
[15:37:09] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.062 second response time
[15:37:09] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.064 second response time
[15:37:29] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.091 second response time
[15:37:29] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 60992 bytes in 0.220 second response time
[15:37:33] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.043 second response time
[15:37:40] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.060 second response time
[15:37:59] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.232 second response time
[15:38:04] (PS2) Reedy: Set up flood flag on shwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79324 (owner: TTO)
[15:38:09] (CR) Reedy: [C: 2] Set up flood flag on shwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79324 (owner: TTO)
[15:38:18] (Merged) jenkins-bot: Set up flood flag on shwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79324 (owner: TTO)
[15:38:52] (CR) Reedy: [C: 2] Enable "block" action for AbuseFilter on meta [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79458 (owner: TTO)
[15:39:17] (PS2) Reedy: Clean up abusefilter.php [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/78951 (owner: TTO)
[15:40:04] (CR) Reedy: [C: -1] "Needs rebasing" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/76277 (owner: TTO)
[15:40:29] (PS2) Reedy: Dereference unused category from ArticleFeedbackToolv5 en.wiki config [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79832 (owner: Nemo bis)
[15:40:34] (CR) Reedy: [C: 2] Dereference unused category from ArticleFeedbackToolv5 en.wiki config [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79832 (owner: Nemo bis)
[15:40:45] (Merged) jenkins-bot: Dereference unused category from ArticleFeedbackToolv5 en.wiki config [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79832 (owner: Nemo bis)
[15:41:23] (PS3) Reedy: Add missing HTTP error pages on bits.wm.o [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/78507 (owner: TTO)
[15:42:17] (CR) Reedy: [C: -1] "Needs rebasing" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/78637 (owner: TTO)
[15:42:48] (CR) Reedy: [C: -1] (bug 52997) $wgCategoryCollation to 'uca-ru' on all Russian-language [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79770 (owner: Andrey Kiselev)
[15:43:47] (PS3) Reedy: Clean up abusefilter.php [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/78951 (owner: TTO)
[15:43:51] (CR) Reedy: [C: 2] Clean up abusefilter.php [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/78951 (owner: TTO)
[15:44:08] (Merged) jenkins-bot: Clean up abusefilter.php [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/78951 (owner: TTO)
[15:44:29] (PS2) Reedy: Enable "block" action for AbuseFilter on meta [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79458 (owner: TTO)
[15:44:34] (CR) Akosiaris: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/79329 (owner: Akosiaris)
[15:44:38] (CR) Reedy: [C: 2] Enable "block" action for AbuseFilter on meta [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79458 (owner: TTO)
[15:44:47] (Merged) jenkins-bot: Enable "block" action for AbuseFilter on meta [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79458 (owner: TTO)
[15:46:01] !log reedy synchronized wmf-config/
[15:46:06] Logged the message, Master
[15:46:12] That was very quick
[15:46:38] !log reedy synchronized wmf-config/
[15:47:08] what's wrong?
[15:47:33] Nothing
[15:47:38] It just seemed to be very quick
[15:47:44] real 0m23.814s
[15:48:40] I suspect the dsh list being updated helped
[15:51:39] !log switching ns0 to rubidium
[15:51:44] Logged the message, Master
[15:53:10] PROBLEM - Puppet freshness on srv281 is CRITICAL: No successful Puppet run in the last 10 hours
[15:59:52] (PS4) Reedy: Add missing HTTP error pages on bits.wm.o [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/78507 (owner: TTO)
[16:00:06] (CR) Reedy: [C: 2] Add missing HTTP error pages on bits.wm.o [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/78507 (owner: TTO)
[16:00:18] (Merged) jenkins-bot: Add missing HTTP error pages on bits.wm.o [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/78507 (owner: TTO)
[16:04:29] !log reedy synchronized docroot and w
[16:04:34] Logged the message, Master
[16:15:08] PROBLEM - Puppet freshness on mw1046 is CRITICAL: No successful Puppet run in the last 10 hours
[16:17:42] (PS1) Akosiaris: Adding brewster to new backup system [operations/puppet] - https://gerrit.wikimedia.org/r/80223
[16:17:59] (CR) jenkins-bot: [V: -1] Adding brewster to new backup system [operations/puppet] - https://gerrit.wikimedia.org/r/80223 (owner: Akosiaris)
[16:19:27] (PS1) Faidon: authdns: migrate ns2 to eeden [operations/puppet] - https://gerrit.wikimedia.org/r/80224
[16:19:28] (PS1) Faidon: authdns: remove IP/dns::authserver from old NS [operations/puppet] - https://gerrit.wikimedia.org/r/80225
[16:20:21] (CR) Faidon: [C: 2] authdns: migrate ns2 to eeden [operations/puppet] - https://gerrit.wikimedia.org/r/80224 (owner: Faidon)
[16:21:02] !log switching ns2 to eeden
[16:21:07] Logged the message, Master
[16:23:18] PROBLEM - Host ns2.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[16:23:30] that's expected
[16:23:32] damn you puppet
[16:25:38] RECOVERY - Host ns2.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 89.19 ms
[16:27:16]