[00:36:47] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: puppet fail
[01:00:07] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:14:19] PROBLEM - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.24
[01:16:08] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms
[01:30:26] Operations, MediaWiki-Logging: Missing move log of the target page in dewiki - https://phabricator.wikimedia.org/T142923#2551442 (doctaxon)
[02:00:58] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: puppet fail
[02:09:20] PROBLEM - puppet last run on mw1187 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:24:42] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.14) (duration: 11m 15s)
[02:24:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:28:07] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:30:34] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Aug 14 02:30:34 UTC 2016 (duration 5m 53s)
[02:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:34:39] RECOVERY - puppet last run on mw1187 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[03:26:38] PROBLEM - puppet last run on mw2186 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:53:57] RECOVERY - puppet last run on mw2186 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[05:06:38] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0 xe-5/2/3: down - Core: cr2-codfw:xe-5/0/1 (Zayo, OGYX/120003//ZYO) 36ms {#2909} [10Gbps wave]
[05:12:28] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0
[06:08:57] (PS1) TTO: Don't prepend protocol in missing.php [mediawiki-config] - https://gerrit.wikimedia.org/r/304689 (https://phabricator.wikimedia.org/T141208)
[06:15:01] (PS2) TTO: Don't prepend protocol in missing.php [mediawiki-config] - https://gerrit.wikimedia.org/r/304689 (https://phabricator.wikimedia.org/T141208)
[06:34:38] PROBLEM - puppet last run on db1039 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:59:57] RECOVERY - puppet last run on db1039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:01:11] Operations, Reading-Infrastructure-Team, Services, Services-next, Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2551627 (Smalyshev) I think there's a question of how the storage is indexed and...
[10:22:07] PROBLEM - puppet last run on cp2003 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:47:29] RECOVERY - puppet last run on cp2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:59:06] Operations, Labs, Release-Engineering-Team, wikitech.wikimedia.org: Rename specific account in LDAP, Wikitech and Gerrit - https://phabricator.wikimedia.org/T133968#2551726 (lfschenone)
[13:00:23] Operations, Labs, Release-Engineering-Team, wikitech.wikimedia.org: Rename specific account in LDAP, Wikitech and Gerrit - https://phabricator.wikimedia.org/T133968#2250492 (lfschenone) I modified my rename request. The requested username change would now be: - At Gerrit, from lfs to Sophivorus -...
[13:34:47] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[13:36:38] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[14:25:19] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[14:27:18] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[14:39:09] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[14:42:59] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[15:20:38] PROBLEM - puppet last run on mw1253 is CRITICAL: CRITICAL: Puppet has 1 failures
[15:37:49] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[15:46:07] RECOVERY - puppet last run on mw1253 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[15:51:19] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[15:57:09] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[16:03:19] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[16:30:47] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[16:34:47] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[16:35:23] (CR) jenkins-bot: [V: -1] Monthly update of the "slowest" querypages on the English Wikipedia [puppet] - https://gerrit.wikimedia.org/r/304696 (https://phabricator.wikimedia.org/T142936) (owner: Nemo bis)
[16:37:02] (PS2) Nemo bis: Monthly update of the "slowest" querypages on the English Wikipedia [puppet] - https://gerrit.wikimedia.org/r/304696 (https://phabricator.wikimedia.org/T142936)
[17:06:36] (PS3) Faidon Liambotis: Switch India & BIOT to esams (4) [dns] - https://gerrit.wikimedia.org/r/257843
[17:08:48] Operations, netops: Turn up new eqiad-esams wave (Level3) - https://phabricator.wikimedia.org/T136717#2551905 (faidon)
[17:12:45] !log bumping cr2-knams<->cr1-eqiad OSPF/OSPF3 metric to 1820 (thus activating the new cr2-esams<->cr2-eqiad link which has a metric of 840)
[17:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:15:52] Operations, netops: Turn up new eqiad-esams wave (Level3) - https://phabricator.wikimedia.org/T136717#2551907 (faidon) Open>Resolved This took a while, with a lot of back and forths with Level3. They finally managed to patch their equipment in Ashburn on Aug 12th and in the Netherlands in Aug 13t...
[17:31:37] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[17:33:29] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[18:04:32] Operations, ArticlePlaceholder, Wikidata: Performance and caching considerations for article placeholders accesses - https://phabricator.wikimedia.org/T142944#2551996 (hoo)
[18:15:56] (CR) Faidon Liambotis: "LGTM" [puppet] - https://gerrit.wikimedia.org/r/304050 (owner: BBlack)
[18:38:08] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[18:40:07] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[18:44:23] (CR) Alex Monk: "We're still getting this issue:" [puppet] - https://gerrit.wikimedia.org/r/247587 (https://phabricator.wikimedia.org/T50501) (owner: Alex Monk)
[19:05:09] Hi, is MediaWiki 1.28 still considered an alpha version? I do not think an alpha version (from my point of view an untested version with a lot of bugs) can be used in enwiki and dewiki (very big projects viewed by a huge amount of people). Or what is our terminology related to version names?
[19:08:14] What's the current development/deployment model we use? Commit then flee!
[19:08:41] In MediaWiki, usually the stable releases are the least tested ones.
[19:09:12] "Stable" means they stay broken for longer, so that extensions can rely on a boring brokenness rather than an ever-changing one.
[19:09:56] alpha/beta/rc/stable is just the standard naming of release processes, there is no implied statement on quality or reliability or anything.
[19:16:59] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 422 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map
[19:22:58] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 2 probes of 422 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map
[19:38:58] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[19:39:16] umm
[19:39:17] Platonides
[19:39:30] I think that user may have been legit
[19:42:20] Platonides: and although I'm not 100% sure of this, I think IP bans on freenode also ban users behind a cloak
[19:42:48] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[19:43:03] (CR) Halfak: ores: Enable uwsgi-specific statsd setup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/304678 (https://phabricator.wikimedia.org/T141543) (owner: Ladsgroup)
[19:43:20] Platonides, yeah, user is logged in but not cloaked
[19:44:16] "Bans set on IP addresses will apply even if the affected user joins with a resolved or cloaked hostname." (https://freenode.net/kb/answer/channelmodes)
[20:04:28] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[20:07:24] Platonides hi could you unblock 104.236.178.119 please
[20:07:30] since it is definitely a legit user
[20:07:48] since he uses an ip, instead of a cloak.
[20:12:17] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[20:23:21] paladox, you're just repeating me now..
[20:23:33] oh sorry
[20:23:48] but you said it may be legit
[20:23:54] i am just saying it is legit
[20:25:02] :(
[20:25:43] Thanks
[20:26:19] Platonides im wondering if he could be added to the whitelist please? unlikely his ip will change.
[20:26:36] added
[20:26:41] Thanks :)
[20:26:44] I was on it :)
[20:26:50] oh, thanks :)
[20:28:30] Platonides, are you banning automatically?
[20:28:51] okay seriously now
[20:28:57] wtf
[20:29:15] you can't keep op anywhere if this keeps up
[20:29:18] why didn't it skip it?
[20:30:16] o_O mukunda banned
[20:30:43] if ($address eq "104.236.178.119") {
[20:30:43] return 1;
[20:30:43] }
[20:30:58] ok
[20:31:12] now, let's see why it wasn't skipped 5 mins ago
[20:32:30] that whitelist affected nicks but not ips :(
[20:34:24] Nemo_bis: Thanks a lot for your explanation.
[20:41:37] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[20:43:29] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[21:06:25] JEM +++ ASIMOVBOT === TERNURA
[21:06:26] JEM +++ ASIMOVBOT === TERNURA
[21:06:32] LALALALALALA
[21:07:50] (PS2) Ladsgroup: ores: Enable uwsgi-specific statsd setup [puppet] - https://gerrit.wikimedia.org/r/304678 (https://phabricator.wikimedia.org/T141543)
[21:08:59] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0]
[21:10:24] " DB connection was already closed or the connection dropped."
[21:10:49] where are these errors?
[21:11:14] load on arwiki
[21:12:58] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[21:13:47] mostly from mw1169
[21:14:18] Reedy, what?
[21:14:34] looking at logstash
[21:14:46] all those errors are from the one apache, against arwiki
[21:21:34] Operations, Beta-Cluster-Infrastructure: Check status of under_NDA group - https://phabricator.wikimedia.org/T142822#2547328 (hashar) The `under_NDA` group was meant to maintain yet another list of people under NDA. It has been created ages ago with the idea of using real SSL/TLS certificates on the beta...
[21:21:43] Operations, Beta-Cluster-Infrastructure: Check status of under_NDA group - https://phabricator.wikimedia.org/T142822#2552159 (hashar) p:Triage>Normal
[21:29:00] Operations, Beta-Cluster-Infrastructure: Check status of under_NDA group - https://phabricator.wikimedia.org/T142822#2552172 (AlexMonk-WMF) >>! In T142822#2552157, @hashar wrote: > One will want to review whether the sudo policy in wikitech is still of any use. I have seen mails notifications stating th...
[21:36:32] Operations, Phabricator: Renew phab.wmfusercontent.org https certificate - https://phabricator.wikimedia.org/T142951#2552191 (AlexMonk-WMF) It's not currently a letsencrypt cert, @Paladox. Though I imagine operations are already tracking this.
[21:37:52] Operations, Phabricator: Renew phab.wmfusercontent.org https certificate - https://phabricator.wikimedia.org/T142951#2552193 (Paladox) p:Triage>High Oh thanks. It will never need renewing. Maybe we can make this letsencrypt? Setting this as high priority since it expiring will make it difficult t...
[22:12:20] (PS1) Dzahn: add script to check if live instance is in sync with repo [debs/wikistats] - https://gerrit.wikimedia.org/r/304742
[22:16:58] (CR) Dzahn: [C: 2 V: 2] add script to check if live instance is in sync with repo [debs/wikistats] - https://gerrit.wikimedia.org/r/304742 (owner: Dzahn)
[22:21:20] (PS1) Dzahn: sync instance with repo [debs/wikistats] - https://gerrit.wikimedia.org/r/304743
[22:24:12] (CR) Dzahn: [C: 2 V: 2] sync instance with repo [debs/wikistats] - https://gerrit.wikimedia.org/r/304743 (owner: Dzahn)
[22:24:54] (Merged) jenkins-bot: sync instance with repo [debs/wikistats] - https://gerrit.wikimedia.org/r/304743 (owner: Dzahn)
[22:27:23] (CR) Paladox: "recheck" [debs/wikistats] - https://gerrit.wikimedia.org/r/304743 (owner: Dzahn)
[22:33:18] (PS1) Dzahn: add bootstrap-3.3.5 to repo [debs/wikistats] - https://gerrit.wikimedia.org/r/304744
[22:35:32] (CR) Dzahn: [C: 2] add bootstrap-3.3.5 to repo [debs/wikistats] - https://gerrit.wikimedia.org/r/304744 (owner: Dzahn)
[22:39:22] (CR) Paladox: "recheck" [debs/gerrit] - https://gerrit.wikimedia.org/r/304486 (owner: Chad)
[22:42:36] (CR) Paladox: "recheck" [puppet] - https://gerrit.wikimedia.org/r/304678 (https://phabricator.wikimedia.org/T141543) (owner: Ladsgroup)
[22:53:08] (CR) Paladox: "recheck" [debs/gerrit] - https://gerrit.wikimedia.org/r/304486 (owner: Chad)
[23:01:48] Operations, Labs, Labs-Infrastructure: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2552287 (AlexMonk-WMF) >>! In T115194#2244874, @Andrew wrote: > there are relatively many ldap connection failures in the sink log. That fits with the fact that...
[23:11:32] (PS1) Alex Monk: labsprojectfrommetadata: Pull project_id from new field [puppet] - https://gerrit.wikimedia.org/r/304748 (https://phabricator.wikimedia.org/T105891)
[23:20:11] JEM
[23:20:17] LOVES ASIMOVBOT
[23:20:30] THEY HAD SEX IN DORS'S BED
[23:20:41] HAHAHAHAHHAHAHAHAHAHAA
[23:20:46] SHIT
[23:22:00] that sociopath not again :|
[23:25:17] (PS1) Alex Monk: No longer set up config for our old project-id metadata creation [puppet] - https://gerrit.wikimedia.org/r/304750 (https://phabricator.wikimedia.org/T105891)
[23:27:30] (PS1) Alex Monk: openstack: Delete old juno files from the repository [puppet] - https://gerrit.wikimedia.org/r/304751