[00:07:32] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[00:09:02] operations, Community-Advocacy, Traffic, HTTPS, Patch-For-Review: Decom old multiple-subdomain wikis in wikipedia.org - https://phabricator.wikimedia.org/T102814#1482912 (lfaraone) The English Wikipedia Arbitration Committee moved **from** `arbcom.en.wikipedia.org` to `arbcom-en.wikipedia.org` (s...
[00:14:50] operations, Community-Advocacy, Traffic, HTTPS, Patch-For-Review: Decom old multiple-subdomain wikis in wikipedia.org - https://phabricator.wikimedia.org/T102814#1482913 (Jalexander) >>! In T102814#1482912, @lfaraone wrote: > The English Wikipedia Arbitration Committee moved **from** `arbcom.en.w...
[00:21:33] Jamesofur: what SSL cert? :P https://arbcom.en.wikipedia.org/
[00:22:08] lfaraone: https://arbcom-en.wikipedia.org/ works fine, as well you know.
[00:22:14] I imagine that was part of the thing :P we weren't going to rebuy it :)
[00:22:21] as in 'did not rebuy it'
[00:22:39] Jamesofur: It randomly stopped working, which is why I requested a move in the first place :P
[00:22:49] it was in no way random
[00:22:52] lfaraone: Repeatedly stopped, not just random.
[00:23:07] sorry, nobody on the committee at the time seemed to know that it happened or why.
[00:23:34] lfaraone: Attackers ... from arbcom.en.wikipedia.org (for example, ... credit cards)" are ArbCom still not happy with their enwiki control they need our credit cards? :p
[00:23:35] oh I'm sure that no one communicated it (i doubt we knew) but just saying it wasn't random :)
[00:23:54] I'm not saying it was an unannounced change; it just was something we were told as new arbs "yeah there's an SSL error, ignore it" which makes me sad.
[00:24:05] * Jamesofur nods
[00:24:12] but not surprising :)
[00:24:19] Jamesofurrrrrr
[00:24:22] Hey
[00:24:26] \o
[00:24:30] JohnFLewis: ?
[00:24:43] ah yes.
[00:25:25] how else would I get my Platinum dental insurance from English Wikipedia Arbitration Committee, Inc.
[00:26:07] lfaraone: arbocom.en.wikipedia.org never worked over HTTPS
[00:26:14] jeez.
[00:26:15] s/arbo/arb/
[00:26:27] well, I'm not surprised arbocom.en never worked.
[00:26:34] ;)
[00:29:17] the backstory is: we're trying to get rid of third-level subdomains altogether, as we move towards HSTS includeSubdomains
[00:29:51] http://arbcom.en works now and redirects to https://arbcom-en, but once we get HSTS for the whole domain, it won't
[00:41:34] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[00:47:24] operations, Performance-Team, Traffic, Performance: Optimize prod's resource domains for SPDY/HTTP2 - https://phabricator.wikimedia.org/T94896#1482939 (Krinkle)
[00:51:34] PROBLEM - puppet last run on cp4014 is CRITICAL puppet fail
[00:52:50] (CR) Krinkle: "If you view the source of any article you'll still see a reference to /w/static, namely for $wgLocalStylePath. view-source:https://en.wiki" [puppet] - https://gerrit.wikimedia.org/r/219107 (https://phabricator.wikimedia.org/T95448) (owner: BBlack)
[00:53:32] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[00:55:32] good catch, Krinkle_
[01:07:32] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[01:19:33] RECOVERY - puppet last run on cp4014 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures
[01:20:32] (PS1) Faidon Liambotis: Depool ulsfo, network troubles [dns] - https://gerrit.wikimedia.org/r/227067
[01:20:44] d'oh
[01:20:52] (CR) Faidon Liambotis: [C: 2] Depool ulsfo, network troubles [dns] - https://gerrit.wikimedia.org/r/227067 (owner: Faidon Liambotis)
[01:21:22] I was hoping it'd pass
[01:21:38] looks more serious than I initially thought, so meh
[01:25:23] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[01:28:04] (PS5) BryanDavis: [WIP] Update configuration for logstash 1.5.3 [puppet] - https://gerrit.wikimedia.org/r/226991 (https://phabricator.wikimedia.org/T99735)
[01:29:12] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[01:35:22] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[01:39:33] RECOVERY - BGP status on cr2-ulsfo is OK host 198.35.26.193, sessions up: 45, down: 0, shutdown: 0
[01:43:04] PROBLEM - puppet last run on cp4008 is CRITICAL Puppet has 1 failures
[01:46:32] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193 for 1.3.6.1.2.1.2.2.1.3 with snmp version 2
[01:50:23] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2
[01:56:22] RECOVERY - Router interfaces on cr2-ulsfo is OK host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0
[02:02:51] !log LocalisationUpdate failed (1.26wmf15) at 2015-07-26 02:02:51+00:00
[02:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:03:23] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[02:06:32] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2
[02:07:01] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun Jul 26 02:07:01 UTC 2015 (duration 7m 0s)
[02:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:07:23] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[02:08:53] RECOVERY - puppet last run on cp4008 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures
[02:09:52] PROBLEM - puppet last run on cp4003 is CRITICAL puppet fail
[02:10:23] RECOVERY - Router interfaces on cr2-ulsfo is OK host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0
[02:15:02] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[02:20:53] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[02:22:44] !log l10nupdate Synchronized php-1.26wmf15/cache/l10n: (no message) (duration: 07m 12s)
[02:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:24:52] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[02:25:33] PROBLEM - RAID on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
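The arbcom.en vs. arbcom-en exchange earlier comes down to how TLS certificate wildcards work: per RFC 6125, a name like `*.wikipedia.org` matches exactly one DNS label, so the hyphenated `arbcom-en.wikipedia.org` is covered by the wildcard while the third-level `arbcom.en.wikipedia.org` is not (it would need its own certificate, which was not rebought). A minimal sketch of that matching rule, purely for illustration (this is not Wikimedia's actual TLS code):

```python
# Toy illustration of the RFC 6125 single-label wildcard rule that made
# arbcom-en.wikipedia.org work under *.wikipedia.org while
# arbcom.en.wikipedia.org did not.

def wildcard_matches(pattern: str, hostname: str) -> bool:
    """Return True if a certificate name like '*.wikipedia.org' covers hostname.

    A '*' may only stand in for a single leftmost DNS label, so the label
    counts must match exactly.
    """
    p_labels = pattern.lower().split(".")
    h_labels = hostname.lower().split(".")
    if len(p_labels) != len(h_labels):
        return False  # wildcard never spans multiple labels
    head, *rest = p_labels
    if head != "*":
        return p_labels == h_labels  # plain name: exact match only
    return rest == h_labels[1:]

print(wildcard_matches("*.wikipedia.org", "arbcom-en.wikipedia.org"))  # True
print(wildcard_matches("*.wikipedia.org", "arbcom.en.wikipedia.org"))  # False
```

The same one-level structure is why the HSTS `includeSubDomains` move mentioned above forces the third-level subdomains out: once HSTS covers the whole domain, the plain-HTTP redirect from http://arbcom.en stops being reachable.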
[02:26:47] !log LocalisationUpdate completed (1.26wmf15) at 2015-07-26 02:26:47+00:00
[02:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:27:32] RECOVERY - RAID on terbium is OK optimal, 1 logical, 2 physical
[02:31:02] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[02:37:02] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[02:37:43] RECOVERY - puppet last run on cp4003 is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures
[02:44:57] (CR) Tim Landscheidt: "The "url" property in https://en.wikipedia.org/w/api.php?action=sitematrix&smsiteprop=dbname|url&smlangprop=site is now null for (all?) Wi" [mediawiki-config] - https://gerrit.wikimedia.org/r/225287 (https://phabricator.wikimedia.org/T104088) (owner: Alex Monk)
[02:46:53] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[02:50:53] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[02:52:53] (PS1) Yuvipanda: base: Fix spacing / indent for service_unit [puppet] - https://gerrit.wikimedia.org/r/227069
[02:53:52] (PS2) Yuvipanda: base: Fix spacing / indent for service_unit [puppet] - https://gerrit.wikimedia.org/r/227069
[02:54:04] (CR) Yuvipanda: [C: 2 V: 2] base: Fix spacing / indent for service_unit [puppet] - https://gerrit.wikimedia.org/r/227069 (owner: Yuvipanda)
[03:05:12] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[03:11:12] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[03:15:03] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[03:17:03] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 20606 bytes in 3.131 second response time
[03:17:07] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[03:17:25] urgh, so the ipv6 alarm is false, but wtf is up with the ulsfo stuff...
[03:18:35] single bgp route is showing critical in icinga...
[03:18:47] robh: faidon depooled it earlier
[03:18:56] ahh, cool, yayyyy
[03:19:04] on a positive note
[03:19:09] pagerduty pages worked =]
[03:19:22] nice :)
[03:19:24] and smsglobal pages did not =[
[03:19:36] so in the ops meeting on monday I think i'll be proposing we roll over.
[03:19:44] * robh is setup to get paged from both vendors.
[03:22:42] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2
[03:22:53] PROBLEM - BGP status on cr1-ulsfo is CRITICAL No response from remote host 198.35.26.192
[03:24:42] RECOVERY - Router interfaces on cr2-ulsfo is OK host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0
[03:25:13] RECOVERY - BGP status on cr1-ulsfo is OK host 198.35.26.192, sessions up: 11, down: 0, shutdown: 0
[03:25:13] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[03:28:03] operations, Monitoring: Switch Icinga from smsglobal to pagerduty - https://phabricator.wikimedia.org/T106589#1483009 (RobH)
[03:31:13] PROBLEM - BGP status on cr1-ulsfo is CRITICAL No response from remote host 198.35.26.192
[03:31:23] meh, if it was a work day i'd think about putting ulsfo into downtime in icinga, cuz its noisy... but its not and its easier for folks to see if its settled out in non downtime mode.
[03:31:46] (just not that much chatter in here compared to a weekday, seems safe enough to leave ulsfo in noisy icinga mode)
[03:32:54] operations: Investigate smsglobal delivery failures from 2015-06-13 weekend - https://phabricator.wikimedia.org/T102396#1483010 (RobH) We just had another failure to page to me, as the lvs ipv6 false alarm went off again. pagerduty test worked fine, but smsglobal failed.
[03:33:23] RECOVERY - BGP status on cr1-ulsfo is OK host 198.35.26.192, sessions up: 11, down: 0, shutdown: 0
[03:35:33] PROBLEM - puppet last run on mw1212 is CRITICAL Puppet has 1 failures
[03:36:33] PROBLEM - puppet last run on mw1163 is CRITICAL Puppet has 1 failures
[03:37:04] PROBLEM - puppet last run on mw1244 is CRITICAL Puppet has 1 failures
[03:37:23] PROBLEM - puppet last run on mw2051 is CRITICAL Puppet has 1 failures
[03:38:31] !log ulsfo network issues, faidon depooled via https://gerrit.wikimedia.org/r/#/c/227067/
[03:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:39:23] PROBLEM - BGP status on cr1-ulsfo is CRITICAL No response from remote host 198.35.26.192
[03:43:34] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[03:45:14] PROBLEM - BGP status on cr1-ulsfo is CRITICAL No response from remote host 198.35.26.192
[03:49:42] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[03:51:14] PROBLEM - BGP status on cr1-ulsfo is CRITICAL No response from remote host 198.35.26.192
[03:53:33] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[03:55:32] RECOVERY - BGP status on cr1-ulsfo is OK host 198.35.26.192, sessions up: 11, down: 0, shutdown: 0
[03:59:43] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[04:01:42] RECOVERY - puppet last run on mw1212 is OK Puppet is currently enabled, last run 16 seconds ago with 0 failures
[04:02:42] RECOVERY - puppet last run on mw1163 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:03:12] RECOVERY - puppet last run on mw1244 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:03:23] PROBLEM - puppet last run on cp4018 is CRITICAL puppet fail
[04:03:33] RECOVERY - puppet last run on mw2051 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:05:43] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[04:07:23] PROBLEM - puppet last run on cp4012 is CRITICAL puppet fail
[04:08:04] PROBLEM - puppet last run on cp4020 is CRITICAL puppet fail
[04:08:22] PROBLEM - puppet last run on cp4003 is CRITICAL puppet fail
[04:09:22] PROBLEM - puppet last run on cp4011 is CRITICAL puppet fail
[04:09:23] PROBLEM - puppet last run on cp4008 is CRITICAL puppet fail
[04:09:42] PROBLEM - puppet last run on cp4004 is CRITICAL puppet fail
[04:11:43] PROBLEM - puppet last run on cp4001 is CRITICAL puppet fail
[04:17:33] PROBLEM - puppet last run on cp4002 is CRITICAL puppet fail
[04:19:44] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[04:21:23] PROBLEM - puppet last run on cp4017 is CRITICAL puppet fail
[04:21:32] PROBLEM - BGP status on cr1-ulsfo is CRITICAL No response from remote host 198.35.26.192
[04:21:42] PROBLEM - puppet last run on cp4019 is CRITICAL puppet fail
[04:25:42] RECOVERY - BGP status on cr1-ulsfo is OK host 198.35.26.192, sessions up: 11, down: 0, shutdown: 0
[04:27:22] PROBLEM - puppet last run on cp4009 is CRITICAL puppet fail
[04:27:44] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[04:31:03] PROBLEM - puppet last run on cp4010 is CRITICAL puppet fail
[04:31:13] PROBLEM - puppet last run on cp4016 is CRITICAL puppet fail
[04:34:02] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[04:43:14] RECOVERY - puppet last run on cp4011 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures
[04:43:22] RECOVERY - puppet last run on cp4008 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures
[04:43:23] RECOVERY - puppet last run on cp4012 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures
[04:43:33] RECOVERY - puppet last run on cp4004 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures
[04:43:42] RECOVERY - BGP status on cr2-ulsfo is OK host 198.35.26.193, sessions up: 45, down: 0, shutdown: 0
[04:43:43] RECOVERY - puppet last run on cp4001 is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures
[04:43:54] RECOVERY - puppet last run on cp4020 is OK Puppet is currently enabled, last run 58 seconds ago with 0 failures
[04:44:13] RECOVERY - puppet last run on cp4003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:45:23] RECOVERY - puppet last run on cp4018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:45:32] RECOVERY - puppet last run on cp4002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:53:32] PROBLEM - BGP status on cr1-ulsfo is CRITICAL No response from remote host 198.35.26.192
[04:53:32] PROBLEM - puppet last run on cp4005 is CRITICAL puppet fail
[04:53:34] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[04:55:23] RECOVERY - BGP status on cr1-ulsfo is OK host 198.35.26.192, sessions up: 11, down: 0, shutdown: 0
[04:56:13] PROBLEM - puppet last run on mw1233 is CRITICAL Puppet has 1 failures
[04:56:43] RECOVERY - puppet last run on cp4016 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures
[04:57:03] RECOVERY - puppet last run on cp4009 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures
[04:57:23] RECOVERY - puppet last run on cp4019 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures
[04:57:32] RECOVERY - BGP status on cr2-ulsfo is OK host 198.35.26.193, sessions up: 45, down: 0, shutdown: 0
[04:58:43] RECOVERY - puppet last run on cp4010 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:59:03] RECOVERY - puppet last run on cp4017 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:05:42] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[05:09:34] RECOVERY - BGP status on cr2-ulsfo is OK host 198.35.26.193, sessions up: 45, down: 0, shutdown: 0
[05:19:13] RECOVERY - puppet last run on cp4005 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures
[05:21:54] RECOVERY - puppet last run on mw1233 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures
[05:22:34] (PS6) BryanDavis: [WIP] Update configuration for logstash 1.5.3 [puppet] - https://gerrit.wikimedia.org/r/226991 (https://phabricator.wikimedia.org/T99735)
[05:30:10] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun Jul 26 05:30:10 UTC 2015 (duration 30m 9s)
[05:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:37:38] (PS7) BryanDavis: [WIP] Update configuration for logstash 1.5.3 [puppet] - https://gerrit.wikimedia.org/r/226991 (https://phabricator.wikimedia.org/T99735)
[05:48:13] PROBLEM - Host text-lb.ulsfo.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100%
[05:48:16] PROBLEM - Host cp4013 is DOWN: PING CRITICAL - Packet loss = 100%
[05:48:23] RECOVERY - Host cp4013 is UP: PING OK - Packet loss = 0%, RTA = 76.90 ms
[05:49:44] RECOVERY - Host text-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 16%, RTA = 73.50 ms
[05:55:34] (PS8) BryanDavis: [WIP] Update configuration for logstash 1.5.3 [puppet] - https://gerrit.wikimedia.org/r/226991 (https://phabricator.wikimedia.org/T99735)
[06:03:13] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193
[06:05:24] RECOVERY - BGP status on cr2-ulsfo is OK host 198.35.26.193, sessions up: 45, down: 0, shutdown: 0
[06:30:03] PROBLEM - puppet last run on mw1073 is CRITICAL puppet fail
[06:30:52] PROBLEM - puppet last run on cp2013 is CRITICAL Puppet has 1 failures
[06:31:32] PROBLEM - puppet last run on cp1054 is CRITICAL Puppet has 1 failures
[06:31:42] PROBLEM - puppet last run on mw1061 is CRITICAL Puppet has 1 failures
[06:32:24] PROBLEM - puppet last run on mw2126 is CRITICAL Puppet has 1 failures
[06:32:32] PROBLEM - puppet last run on mw2073 is CRITICAL Puppet has 1 failures
[06:32:33] PROBLEM - puppet last run on mw2045 is CRITICAL Puppet has 1 failures
[06:33:32] PROBLEM - puppet last run on mw2018 is CRITICAL Puppet has 1 failures
[06:55:14] RECOVERY - puppet last run on cp1054 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:56:13] RECOVERY - puppet last run on mw2126 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:56:43] RECOVERY - puppet last run on cp2013 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:57:23] RECOVERY - puppet last run on mw2018 is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures
[06:57:23] RECOVERY - puppet last run on mw1061 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:02] RECOVERY - puppet last run on mw1073 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:14] RECOVERY - puppet last run on mw2073 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:22] RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:28:46] (PS1) Faidon Liambotis: Revert "Depool ulsfo, network troubles" [dns] - https://gerrit.wikimedia.org/r/227070
[07:33:38] (PS2) Faidon Liambotis: Revert "Depool ulsfo, network troubles" [dns] - https://gerrit.wikimedia.org/r/227070
[07:33:45] (CR) Faidon Liambotis: [C: 2] Revert "Depool ulsfo, network troubles" [dns] - https://gerrit.wikimedia.org/r/227070 (owner: Faidon Liambotis)
[08:21:23] PROBLEM - puppet last run on mw2062 is CRITICAL puppet fail
[08:49:14] RECOVERY - puppet last run on mw2062 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures
[09:11:22] PROBLEM - puppet last run on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:13:13] RECOVERY - puppet last run on terbium is OK Puppet is currently enabled, last run 14 minutes ago with 0 failures
[09:21:33] PROBLEM - puppet last run on ms-be3002 is CRITICAL puppet fail
[09:49:43] RECOVERY - puppet last run on ms-be3002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:59:23] PROBLEM - Apache HTTP on mw1160 is CRITICAL - Socket timeout after 10 seconds
[10:01:24] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.099 second response time
[14:18:50] YuviPanda: around?
[14:26:48] all set, thanks YuviPanda
[14:51:44] PROBLEM - RAID on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:53:33] RECOVERY - RAID on terbium is OK optimal, 1 logical, 2 physical
[15:40:35] operations, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1483413 (Tau) I have tried purging but with no success. Can turning ImageMagick on/off affect this issue? I will try some m...
[15:48:09] Krenair: if a wiki wanted to turn on second-factor auth, could it be enabled, people given time to transition, then made mandatory?
[15:50:27] assuming the wiki is not hosted by WMF, probably
[15:56:44] Nemo_bis: private WMF-hosted wiki.
[15:58:55] (PS1) Nemo bis: Redirect wiki.toolserver.org to www.mediawiki.org [puppet] - https://gerrit.wikimedia.org/r/227079 (https://phabricator.wikimedia.org/T62220)
[16:00:38] (CR) Nemo bis: "Not sure what's the recommended way; this sort of "partial" redirect does not seem supported by modules/mediawiki/files/apache/sites/redir" [puppet] - https://gerrit.wikimedia.org/r/227079 (https://phabricator.wikimedia.org/T62220) (owner: Nemo bis)
[16:09:46] grrr
[16:09:58] who is running populateContentModel on terbium for enwiki?
[16:10:26] legoktm: ^
[16:10:28] :S
[16:11:19] lfaraone, I think you'd need to talk to Chris Steipp about that
[16:11:57] kk.
[16:12:22] Wikitech does require it for highly privileged accounts
[16:13:12] IIRC the recovery process was the thing needing improvement?
[16:26:35] Anyone with racktables access around?
[16:26:56] Reedy maybe?
[16:49:23] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[16:50:04] operations: High number of (session) redis connection failures - https://phabricator.wikimedia.org/T106986#1483459 (hoo) NEW
[17:31:11] hoo|away: Wassup?
[17:32:54] hoo|away: also, yes legoktm
[17:32:55] legoktm 4764 0.0 0.0 12320 692 pts/23 S<+ Jul24 0:00 /bin/bash /usr/local/bin/mwscript populateContentModel.php --wiki=enwiki --ns=all --table=page
[17:40:22] PROBLEM - RAID on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:42:13] RECOVERY - RAID on terbium is OK optimal, 1 logical, 2 physical
[17:54:43] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61462 bytes in 2.263 second response time
[17:57:54] (PS9) BryanDavis: [WIP] Update configuration for logstash 1.5.3 [puppet] - https://gerrit.wikimedia.org/r/226991 (https://phabricator.wikimedia.org/T99735)
[18:02:22] Reedy: I was wondering whether the redis connection failures coincide with servers in a certain row... but I guess that's unlikely
[18:40:02] hoo: I can have a look if you want me to
[18:42:18] Reedy: mw1124 mw1129 mw1120 mw1135 mw1121
[18:42:28] These have most of the connection time outs
[18:42:56] ok, give me a few
[18:45:10] mw1124 is B7
[18:45:30] mw1129 is B8
[18:46:03] mw1120 is B7
[18:46:15] mw1121 is B7
[18:46:29] mw1135 is B8
[18:46:46] both racks have loads of other app servers too
[18:47:32] Ok... all these are api apaches
[18:53:15] were api apaches generally higher loaded from memory?
[18:53:32] possible
[19:56:09] Bored
[19:56:15] Anything I can do?
[19:56:46] fix bugs?
[19:56:46] :p
[19:57:15] https://phabricator.wikimedia.org/T106963 is pretty bad
[19:59:01] Isn't that an old one?
[19:59:06] wiki != wikipedia
[20:01:15] nope
[20:03:11] It "worked" until we got rid of the 'default' => 'https://$lang.wikipedia.org' line from the config
[20:03:25] lol
[20:11:08] Reedy, I checked on testwiki, a hack to change $major in SiteMatrix::getUrl from 'wiki' to 'wikipedia' does work
[20:11:23] We could add a proper way to do that
[20:14:28] although.... hm
[20:15:51] wgConf->siteFromDB returns wikipedia as you'd expect
[20:33:06] sigh.
[20:33:14] > var_dump( $wgConf->siteFromDB( 'metawiki' ) );
[20:33:14] array(2) {
[20:33:15]   [0]=>
[20:33:15]   string(9) "wikipedia"
[20:33:15]   [1]=>
[20:33:16]   string(4) "meta"
[20:33:17] }
[20:36:12] (CR) BryanDavis: "Needs to be tested in beta cluster. I'll do that in the coming week." [puppet] - https://gerrit.wikimedia.org/r/226991 (https://phabricator.wikimedia.org/T99735) (owner: BryanDavis)
[20:49:01] (PS1) Ori.livneh: Move WikipediaMobileFirefoxOS from bits to wikimedia.org docroot [mediawiki-config] - https://gerrit.wikimedia.org/r/227156 (https://phabricator.wikimedia.org/T98373)
[20:49:22] (PS2) Ori.livneh: Move WikipediaMobileFirefoxOS from bits to wikimedia.org docroot [mediawiki-config] - https://gerrit.wikimedia.org/r/227156 (https://phabricator.wikimedia.org/T98373)
[20:49:35] (CR) Ori.livneh: [C: 2] Move WikipediaMobileFirefoxOS from bits to wikimedia.org docroot [mediawiki-config] - https://gerrit.wikimedia.org/r/227156 (https://phabricator.wikimedia.org/T98373) (owner: Ori.livneh)
[20:49:41] (Merged) jenkins-bot: Move WikipediaMobileFirefoxOS from bits to wikimedia.org docroot [mediawiki-config] - https://gerrit.wikimedia.org/r/227156 (https://phabricator.wikimedia.org/T98373) (owner: Ori.livneh)
[20:51:49] !log ori Synchronized docroot: I5f8b8b54a: Move WikipediaMobileFirefoxOS from bits to wikimedia.org docroot (Bug: T98373) (duration: 00m 17s)
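The "wiki != wikipedia" confusion in the T106963 discussion above is about how MediaWiki's `SiteConfiguration::siteFromDB()` splits a database name into a (site, lang) pair: the bare `wiki` suffix is mapped to the `wikipedia` family, which is why the pasted var_dump shows `metawiki` coming back as `("wikipedia", "meta")` even though meta.wikimedia.org is not a Wikipedia. A toy Python re-implementation of that suffix-splitting behaviour, purely as an illustration (the suffix list and mapping here are assumptions, not the real configuration):

```python
# Toy sketch of the dbname suffix splitting shown in the var_dump above.
# The real logic lives in PHP (SiteConfiguration::siteFromDB); this only
# illustrates why 'metawiki' resolves to the 'wikipedia' family.

SUFFIXES = ["wiki", "wiktionary", "wikisource", "wikibooks", "wikinews"]  # illustrative subset

def site_from_db(dbname):
    """Split a dbname like 'enwiki' into (site, lang)."""
    for suffix in SUFFIXES:
        if dbname.endswith(suffix):
            # the bare 'wiki' suffix historically means the Wikipedia family
            site = "wikipedia" if suffix == "wiki" else suffix
            lang = dbname[: -len(suffix)] or None
            return site, lang
    return None, None

print(site_from_db("enwiki"))    # ('wikipedia', 'en')
print(site_from_db("metawiki"))  # ('wikipedia', 'meta')
```

Because `metawiki` lands in the `wikipedia` bucket, code that builds URLs from the family name (as SiteMatrix::getUrl apparently did once the `'default' => 'https://$lang.wikipedia.org'` fallback was removed) produces null or wrong URLs for the wikis whose family is not literally Wikipedia.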
[21:01:09] (PS1) Ori.livneh: Update WikipediaMobileFirefoxOS submodule for URL changes [mediawiki-config] - https://gerrit.wikimedia.org/r/227159 (https://phabricator.wikimedia.org/T98373)
[21:01:29] (CR) Ori.livneh: [C: 2] Update WikipediaMobileFirefoxOS submodule for URL changes [mediawiki-config] - https://gerrit.wikimedia.org/r/227159 (https://phabricator.wikimedia.org/T98373) (owner: Ori.livneh)
[21:01:33] (Merged) jenkins-bot: Update WikipediaMobileFirefoxOS submodule for URL changes [mediawiki-config] - https://gerrit.wikimedia.org/r/227159 (https://phabricator.wikimedia.org/T98373) (owner: Ori.livneh)
[21:02:32] !log ori Synchronized docroot/wikimedia.org/WikipediaMobileFirefoxOS: Update WikipediaMobileFirefoxOS submodule for URL changes (duration: 00m 16s)
[21:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:06:10] operations, Traffic, Mobile, Patch-For-Review: Replace bits URL in Firefox app, if possible - https://phabricator.wikimedia.org/T98373#1483672 (ori) a: ori>brion I moved the submodule from docroot/bits to docroot/wikimedia.org, leaving a symlink behind to ensure existing references don't break....
[21:08:33] operations: High number of (session) redis connection failures - https://phabricator.wikimedia.org/T106986#1483677 (ori) p: Triage>Unbreak!
[21:08:56] operations, Traffic, Mobile, Patch-For-Review: Replace bits URL in Firefox app, if possible - https://phabricator.wikimedia.org/T98373#1483680 (brion) >>! In T98373#1483672, @ori wrote: > I moved the submodule from docroot/bits to docroot/wikimedia.org, leaving a symlink behind to ensure existing re...
[21:09:12] brion: thanks :)
[21:09:17] woohoo :)
[21:09:26] operations, Traffic, Mobile, Patch-For-Review: Replace bits URL in Firefox app, if possible - https://phabricator.wikimedia.org/T98373#1483681 (ori) Open>Resolved
[21:10:01] we'll see if i can get anybody to help replace that firefox os app with more modern code :D hehe
[21:10:07] it still gets good reviews somehow though
[21:10:43] i haven't been following FirefoxOS very closely. Has it been gaining traction? I had an early developer build and it was awful, even for a preview.
[21:12:29] (The OS; not the app)
[21:14:17] ori: not terribly great; they're still selling in latin america but i have no idea how well
[21:14:44] if they refocus on non-shit hardware that might help, but they gotta improve the developer experience a lot
[21:29:22] (PS1) Ori.livneh: Re-introduce ProxyPass rule for thumb_handler.php, with corrected docroots [puppet] - https://gerrit.wikimedia.org/r/227160 (https://phabricator.wikimedia.org/T84842)
[21:52:24] PROBLEM - RAID on terbium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:54:13] RECOVERY - RAID on terbium is OK optimal, 1 logical, 2 physical
[22:03:43] hoo: Reedy: is populateContentModel causing issues?
[22:10:55] legoktm: Made terbium alter twice today
[22:11:03] alter?
[22:11:35] * alert
[22:12:09] o.O
[22:12:25] the RAID checks?
[22:12:46] yeah
[22:13:08] ok, killed it
[22:13:22] why would it affect that though?
[22:13:45] !log killed populateContentModel.php for enwiki on terbium due to alerts
[22:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:14:32] legoktm: I guess the machine just didn't manage to respond in time, thus the socket time out
[22:14:53] so...it was causing too much load?
[22:14:55] Has been swapping quite badly (see iowait in ganglia)
[22:15:09] https://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=terbium.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2
[22:16:12] https://ganglia.wikimedia.org/latest/graph_all_periods.php?h=terbium.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2&st=1437948896&g=mem_report&z=large&c=Miscellaneous%20eqiad wtf
[22:17:20] why in the world would it use so much memory?
[22:21:17] I think I know why
[22:21:21] could be something else
[22:21:24] but probably that
[22:21:44] why? :P
[22:21:51] $toSave
[22:21:58] Let me hack up a patch
[22:22:38] nope, probably not it
[22:22:45] unset( $toSave[$model] );
[22:22:53] it shouldn't grow indefinitely
[22:25:34] yeah
[22:25:45] maybe you hit a php 5.3 gc bug, but I'd not bet on that
[22:25:53] more likely to be a self made memleak
[22:29:23] LinkCache doesn't look like it ever evicts...
[22:30:37] mh... but why should so much stuff end up in LinkCache?
[22:31:15] no idea.
[22:31:26] I tracked all the ContentHandlerDefaultModelFor hook handlers and didn't see anything obvious
[22:31:48] doing the same :P
[22:32:04] Scribunto is creating titles using Title::makeTitleSafe
[22:32:15] but probably way too few to cause for so much trouble
[22:32:19] only if it is in NS_MODULE
[22:32:31] yeah
[22:32:31] I couldn't follow what JsonConfig was doing though
[22:32:47] JsonConfig should not be on enwiki
[22:32:49] Or is it?
[22:33:04] it is because of Graph and Zero stuff
[22:40:58] gj legoktm
[22:41:12] :<
[22:42:16] ofc, php 5.3 sucks etc
[22:42:32] Report all the bugs
[22:42:37] legoktm: Reedy: Place your bets
[22:42:40] which extension
[22:42:59] JsonConfig or Flow probably
[22:43:08] it's flow
[22:43:11] lolol
[22:43:19] I was running this script for Flow too
[22:43:30] ohhhh
[22:43:31] if ( $title->isRedirect() ) {
[22:43:36] gj Flow
[22:43:37] They do Title::getContentModel which adds stuff to the LinkCache
[22:43:58] operations, ops-eqiad, Analytics-Cluster, Patch-For-Review: rack new hadoop worker nodes - https://phabricator.wikimedia.org/T104463#1483718 (kevinator)
[22:44:07] hoo: no, that code path shouldn't be called since $checkContentModel = false
[22:44:17] but we check if its a redirect or if it exists
[22:45:24] https://gerrit.wikimedia.org/r/#/c/104745/ heh
[22:46:14] Can we just make LinkCache fifo or so?
[22:46:24] or /dev/null
[22:46:59] Oh nice... that "cache" has database interaction in it
[22:47:31] Of course it does
[22:47:35] [15:47:14] MediaWiki-Cache: LinkCache should be LRU based to avoid indefinitely growing and causing memory issues - https://phabricator.wikimedia.org/T106998#1483727 (Legoktm) NEW
[22:47:40] * Reedy finds a brick wall to bang his head against it
[22:48:01] legoktm: Are you just going to hack in a ->clear() to your script for now?
[22:48:08] every N iterations
[22:48:36] We should make it even hackier and only clear() once the machine starts swapping
[22:49:10] if ( swap() ) { /* oh shit */ }
[22:49:36] I was going to actually fix LinkCache :|
[22:49:52] boring legoktm
[22:50:07] hacking gets your script running again quicker
[22:50:13] well, so does cabal cr
[22:50:29] ;)
[22:50:44] along with hurrydeploy
[22:51:02] hurryweekenddeploy*
[23:01:51] (CR) MZMcBride: "Nice!" [software] - https://gerrit.wikimedia.org/r/226939 (https://phabricator.wikimedia.org/T106897) (owner: Alex Monk)
[23:04:15] Katie, it would be more helpful if you convinced ops to review and run the fix :p
[23:04:31] That's Labs shit, right?
[23:04:32] Krenair: Katie is just hitting on you.
[23:04:36] You'd need Yuvi or Coren?
[23:04:42] Or Tim L.?
[23:04:45] Who seems to be aroundish.
[23:05:06] I don't know if Tim L. has the sort of access needed for this
[23:06:07] YuviPanda: Do you have time to review and deploy https://gerrit.wikimedia.org/r/226939 ?
[23:12:52] (CR) Ori.livneh: [C: 2] "I'll watch the URLs cited in T106895 for trouble and revert if I see any." [puppet] - https://gerrit.wikimedia.org/r/227160 (https://phabricator.wikimedia.org/T84842) (owner: Ori.livneh)
[23:41:22] PROBLEM - puppet last run on lvs2001 is CRITICAL puppet fail
[23:41:59] (PS1) Reedy: Remove multiple subdomain wiki rewrites [puppet] - https://gerrit.wikimedia.org/r/227172 (https://phabricator.wikimedia.org/T102814)
[23:43:46] operations, Community-Advocacy, Traffic, HTTPS, Patch-For-Review: Decom old multiple-subdomain wikis in wikipedia.org - https://phabricator.wikimedia.org/T102814#1483795 (Reedy) ``` # renamed chapter wiki - T40763 rewrite pa.us.wikimedia.org //pa-us.wikimedia.org # arbcom rewrite arbcom.de.w...
[23:45:29] (PS1) Reedy: Kill pa.us.wikimedia.org from dns [dns] - https://gerrit.wikimedia.org/r/227173 (https://phabricator.wikimedia.org/T102814)
[23:45:39] More shit to die in a fire
[23:50:07] diediedie
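The terbium incident above comes down to a cache without eviction: a long-running batch script (populateContentModel.php) kept adding titles to LinkCache, which never drops entries, until the box started swapping. The fix proposed in T106998 is to bound the cache with LRU eviction instead of relying on ad-hoc `->clear()` calls every N iterations. A minimal Python sketch of the idea (not MediaWiki's actual LinkCache, just the eviction policy under discussion):

```python
# Sketch of the LRU-bounded cache proposed in T106998 for LinkCache:
# once the cache holds max_size entries, inserting a new one evicts the
# least recently used entry, so a batch job over millions of pages keeps
# memory flat instead of growing without bound.
from collections import OrderedDict


class LRUCache:
    def __init__(self, max_size=10000):
        self.max_size = max_size
        self._data = OrderedDict()  # insertion order == recency order

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def set(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict least recently used


cache = LRUCache(max_size=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")        # touch 'a', so 'b' is now the least recently used
cache.set("c", 3)     # exceeds max_size: evicts 'b'
print(cache.get("b"))  # None
print(cache.get("a"))  # 1
```

Compared with the `clear()`-every-N-iterations hack discussed above, an LRU bound keeps the hot entries (recently looked-up titles) while discarding cold ones, so the hit rate survives the eviction.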