[00:00:28] PROBLEM - High lag on wdqs1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0]
[00:11:29] (03CR) 10Alex Monk: "Faidon: I4b9b1857 & I349653a0 convert it to a simple non-template script using a config file set up by puppet, without ports hardcoded or " [puppet] - 10https://gerrit.wikimedia.org/r/286683 (owner: 10Alex Monk)
[00:14:29] 06Operations, 06WMF-Legal, 07Privacy: Consider moving policy.wikimedia.org away from WordPress.com - https://phabricator.wikimedia.org/T132104#2271454 (10DanielFriesen) @ashley They just want to use MW to host the policy pages rather than use it like an actual wiki. So omitting things like content_actions an...
[00:19:45] RECOVERY - High lag on wdqs1002 is OK: OK: Less than 30.00% above the threshold [600.0]
[00:23:23] (03PS1) 10Yurik: Removed obsolete graph ext settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287299
[00:24:14] RECOVERY - High lag on wdqs1001 is OK: OK: Less than 30.00% above the threshold [600.0]
[00:24:16] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Allow RelEng nova log access - https://phabricator.wikimedia.org/T133992#2271484 (10Dzahn) re: "the releng team". Are we talking about the "contint-admins" , "contint-users" groups? These are the ones...
[00:30:30] 06Operations, 10Wikimedia-IRC-RC-Server: udpmxircecho should write stats of messages processed and we should alert when that drops to zero - https://phabricator.wikimedia.org/T134326#2262099 (10AlexMonk-WMF) or if it stops writing them I guess, maybe based on the file's timestamp?
[00:49:34] (03PS7) 10Aaron Schulz: Switched to pt-heartbeat lag detection on s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266)
[00:49:53] (03CR) 10Aaron Schulz: Switched to pt-heartbeat lag detection on s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266) (owner: 10Aaron Schulz)
[01:01:38] !log ori@tin Synchronized php-1.27.0-wmf.22/includes/api/ApiStashEdit.php: If56084466: Bump PRESUME_FRESH_TTL_SEC to improve hit rate and avoid link queries (duration: 00m 34s)
[01:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:06:44] 06Operations, 06Labs, 06Project-Admins: Archive old Incident-* projects - https://phabricator.wikimedia.org/T134624#2271591 (10Danny_B)
[02:19:03] Request from 90.180.83.194 via cp1045 cp1045, Varnish XID 2072679218
[02:19:06] Error: 503, Service Unavailable at Sat, 07 May 2016 02:18:37 GMT
[02:19:18] >>> UNRECOVERABLE FATAL ERROR <<<
[02:19:18] Maximum execution time of 10 seconds exceeded
[02:19:18] /srv/deployment/phabricator/deployment-cache/revs/7dd45143c333b8fb854b8f40bd96c46ea56a0970/phabricator/src/applications/maniphest/storage/ManiphestTask.php:152
[02:19:22] ┻━┻ ︵ ¯\_(ツ)_/¯ ︵ ┻━┻
[02:24:06] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.22) (duration: 09m 34s)
[02:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:31:05] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:45:18] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.23) (duration: 05m 38s)
[02:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:54:33] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat May 7 02:54:32 UTC 2016 (duration 9m 14s)
[02:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
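The [00:30:30] comment on T134326 suggests alerting not only when the udpmxircecho stats drop to zero but also when the stats file stops being written at all, based on the file's timestamp. A minimal sketch of that idea follows; the stats path and the 300-second staleness threshold are assumptions for illustration, not values from the task.

```python
#!/usr/bin/env python
"""Nagios-style check: alert if a stats file has not been written recently.

The path and threshold are illustrative assumptions; T134326 does not say
where udpmxircecho would write its stats or how stale is too stale.
"""
import os
import sys
import time

STATS_FILE = "/var/run/udpmxircecho/stats"  # hypothetical location
MAX_AGE_SECONDS = 300                       # hypothetical staleness threshold


def main():
    try:
        age = time.time() - os.stat(STATS_FILE).st_mtime
    except OSError as exc:
        print("CRITICAL: cannot stat %s: %s" % (STATS_FILE, exc))
        return 2  # Nagios CRITICAL
    if age > MAX_AGE_SECONDS:
        print("CRITICAL: %s last written %.0fs ago (threshold %ds)"
              % (STATS_FILE, age, MAX_AGE_SECONDS))
        return 2
    print("OK: %s written %.0fs ago" % (STATS_FILE, age))
    return 0  # Nagios OK


if __name__ == "__main__":
    sys.exit(main())
```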
[02:57:44] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:05:10] gerrit seems very slow to pull from
[03:06:57] ottomata, you created a git repository without making the github mirror
[03:08:38] 4e707beb 03:05:14.637 03:04:14.637 (retry 370) push git@github.com:wikimedia/operations-debs-druid
[03:09:40] fixed
[03:13:40] not sure what's up with this one: 4b114dea 03:13:52.211 03:12:52.211 (retry 3024) push gerritslave@antimony.wikimedia.org:/var/lib/git/mediawiki/extensions/WikiShare.git
[03:22:34] can pull again now
[03:24:11] based on show-queue it seems several instances of "gerrit gc --all" (gerrit2) are gone, as are several instances of "git-upload-pack '/labs/private.git'" (labs-puppet)
[03:54:00] PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:15:48] 07Puppet, 10Beta-Cluster-Infrastructure: /etc/puppet/puppet.conf sometimes gets the old puppetmaster FQDN and breaks puppet - https://phabricator.wikimedia.org/T134631#2271750 (10Krenair)
[04:16:52] 07Puppet, 10Beta-Cluster-Infrastructure: /etc/puppet/puppet.conf sometimes gets the old puppetmaster FQDN and breaks puppet - https://phabricator.wikimedia.org/T134631#2271764 (10Krenair) This appears to have happened on deployment-memc02 since 13th April, I also found -logstash2 with the issue and -mathoid wi...
[04:19:52] RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:26:53] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR
[04:28:12] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]BR
[05:01:22] PROBLEM - puppet last run on es2013 is CRITICAL: CRITICAL: puppet fail
[05:18:08] PROBLEM - puppet last run on mw2075 is CRITICAL: CRITICAL: puppet fail
[05:26:37] RECOVERY - puppet last run on es2013 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[05:38:34] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: puppet fail
[05:45:52] RECOVERY - puppet last run on mw2075 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:05:32] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[06:15:54] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 646 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5920498 keys - replication_delay is 646
[06:17:52] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5910570 keys - replication_delay is 0
[06:30:42] PROBLEM - puppet last run on mw1260 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:04] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:13] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:13] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:03] PROBLEM - puppet last run on ms-be1010 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:13] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:43] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:02] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:03] PROBLEM - puppet last run on mw2043 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:34] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:43] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:35:02] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:13] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:55:33] RECOVERY - puppet last run on mw1260 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:55:53] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[06:56:32] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:56:52] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:56:53] RECOVERY - puppet last run on mw2043 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:57:02] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[06:57:23] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[06:57:33] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:57:52] RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:52] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:57:58] 07Puppet, 10Beta-Cluster-Infrastructure: /etc/puppet/puppet.conf sometimes gets the old puppetmaster FQDN and breaks puppet - https://phabricator.wikimedia.org/T134631#2271750 (10hashar) Supposed to be fixed by https://gerrit.wikimedia.org/r/#/c/284852/ for T132689
[06:58:02] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:03] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:03] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:31] 07Puppet, 10Beta-Cluster-Infrastructure: /etc/puppet/puppet.conf sometimes gets the old puppetmaster FQDN and breaks puppet - https://phabricator.wikimedia.org/T134631#2271819 (10hashar)
[06:58:35] 07Puppet, 10Beta-Cluster-Infrastructure, 06Labs, 13Patch-For-Review: /etc/puppet/puppet.conf keeps getting double content - first for labs-wide puppetmaster, then for the correct puppetmaster - https://phabricator.wikimedia.org/T132689#2206880 (10hashar)
[07:05:08] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[07:06:58] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5911746 keys - replication_delay is 0
[07:55:44] (03PS1) 10Nikerabbit: Use wfLoadExtension for LocalisationUpdate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287316
[08:27:35] (03PS2) 10Nikerabbit: Use wfLoadExtension for LocalisationUpdate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287316
[08:34:01] PROBLEM - puppet last run on mw1168 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:46:31] RECOVERY - MariaDB Slave Lag: s5 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89958.83 seconds
[09:01:11] RECOVERY - puppet last run on mw1168 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:18:39] 06Operations, 10DNS, 10Mail, 10Phabricator, and 2 others: phabricator.wikimedia.org has no SPF record - https://phabricator.wikimedia.org/T116806#2271904 (10Nemo_bis) > IMVHO using such dumb forwarders is a behaviour that should be discouraged How common is such a forwarding mechanism? Email aliases are c...
[09:19:43] (03CR) 10Faidon Liambotis: [C: 031] "\o/ (Assuming this works... :)" [puppet] - 10https://gerrit.wikimedia.org/r/287246 (owner: 10Alex Monk)
[09:20:13] (03CR) 10Faidon Liambotis: [C: 031] udpmxircecho: Move from template to file [puppet] - 10https://gerrit.wikimedia.org/r/287247 (owner: 10Alex Monk)
[10:34:06] PROBLEM - puppet last run on mw2012 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:51:05] (03PS1) 10Volans: MariaDB: tune thread-pool to avoid Aborted_connects [puppet] - 10https://gerrit.wikimedia.org/r/287394 (https://phabricator.wikimedia.org/T133333)
[10:53:33] (03CR) 10Volans: "Jaime: it's ok do it on all core DBs or you prefer to have it only on the masters given that we encountered the issue only on masters?" [puppet] - 10https://gerrit.wikimedia.org/r/287394 (https://phabricator.wikimedia.org/T133333) (owner: 10Volans)
[11:01:29] RECOVERY - puppet last run on mw2012 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[11:55:37] !log ori@tin Synchronized php-1.27.0-wmf.22/includes/api/ApiStashEdit.php: If56084466: Make stashEditFromPreview() call setCacheTime() (duration: 00m 33s)
[11:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:05:08] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 633 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5939600 keys - replication_delay is 633
[12:10:58] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5915914 keys - replication_delay is 0
[12:32:44] !log Restarted blazegraph on wdqs1002 (Unresponsive, even locally: java.io.IOException: Too many open files) T134238
[12:32:46] T134238: Query service fails with "Too many open files" - https://phabricator.wikimedia.org/T134238
[12:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:52:32] PROBLEM - High lag on wdqs1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0]
[13:37:47] RECOVERY - High lag on wdqs1002 is OK: OK: Less than 30.00% above the threshold [600.0]
[14:07:08] (03CR) 10Jcrespo: "I would definitely do it only on the masters- this issue never happened in the last 1 year for other slaves, and we still not have a 100% " [puppet] - 10https://gerrit.wikimedia.org/r/287394 (https://phabricator.wikimedia.org/T133333) (owner: 10Volans)
[14:32:02] PROBLEM - HHVM rendering on mw1195 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.007 second response time
[14:34:02] RECOVERY - HHVM rendering on mw1195 is OK: HTTP OK: HTTP/1.1 200 OK - 67081 bytes in 0.204 second response time
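The [12:32:44] restart of Blazegraph on wdqs1002 followed a java.io.IOException: Too many open files (T134238). One quick way to see how close a process is to that failure is to compare its open descriptors with its "Max open files" soft limit from /proc. The sketch below is only an illustration of that check on Linux, not how the wdqs incident was actually diagnosed.

```python
#!/usr/bin/env python
"""Rough sketch: report a process's open file descriptors vs. its soft limit.

Linux-specific; reads /proc/<pid>/fd and /proc/<pid>/limits. Must be run as a
user allowed to read the target process's /proc entries (e.g. root).
"""
import os
import sys


def fd_usage(pid):
    """Return (open_fd_count, soft_limit_or_None) for the given pid."""
    open_fds = len(os.listdir("/proc/%d/fd" % pid))
    soft_limit = None
    with open("/proc/%d/limits" % pid) as limits:
        for line in limits:
            if line.startswith("Max open files"):
                value = line.split()[3]  # fourth column is the soft limit
                if value != "unlimited":
                    soft_limit = int(value)
                break
    return open_fds, soft_limit


if __name__ == "__main__":
    pid = int(sys.argv[1])
    used, limit = fd_usage(pid)
    if limit:
        print("pid %d: %d open fds, soft limit %d (%.1f%% used)"
              % (pid, used, limit, 100.0 * used / limit))
    else:
        print("pid %d: %d open fds, soft limit unlimited" % (pid, used))
```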
[14:35:41] 06Operations, 10Analytics-EventLogging, 06Performance-Team, 13Patch-For-Review: "Throughput of EventLogging NavigationTiming events" UNKNOWN - https://phabricator.wikimedia.org/T132770#2272198 (10Krinkle) 05Open>03Resolved No, let's close this.
[14:38:50] !log inplace precise upgrade to trusty on db1069 before labs explodes
[14:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:40:21] PROBLEM - Host db1069 is DOWN: PING CRITICAL - Packet loss = 100%
[14:40:52] RECOVERY - Host db1069 is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms
[14:44:30] 06Operations, 10DNS, 10Mail, 10Phabricator, and 2 others: phabricator.wikimedia.org has no SPF record - https://phabricator.wikimedia.org/T116806#2272203 (10BBlack) Even if we can't come to a firm consensus on which of `?all` or `-all` is the most-appropriate setting, I think both sides of that debate woul...
[14:47:42] RECOVERY - cassandra-b CQL 10.192.48.55:9042 on restbase2009 is OK: TCP OK - 0.037 second response time on port 9042
[15:05:21] 06Operations, 10Traffic, 10Wikidata: Varnish seems to sometimes mangle uncompressed API results - https://phabricator.wikimedia.org/T133866#2272246 (10BBlack) Thanks for merging in the probably-related tasks. I had somehow missed really noticing T123159 earlier... So probably digging into gunzip itself isn...
[15:08:52] 06Operations, 10Traffic, 05Security: Varnish corruption on certain response lengths - https://phabricator.wikimedia.org/T134649#2272249 (10BBlack)
[15:15:33] 06Operations, 10DBA, 13Patch-For-Review: reimage or decom db servers on precise - https://phabricator.wikimedia.org/T125028#2272279 (10jcrespo)
[15:15:35] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Upgrade db1069 - https://phabricator.wikimedia.org/T134349#2272277 (10jcrespo) 05Open>03Resolved I upgraded it this weekend because if not I was going to become crazy. Hopefully that will fix all tokudb-replication issues for once.
[15:19:31] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[15:27:02] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[15:37:06] 06Operations, 10Traffic, 10Wikidata: Varnish seems to sometimes mangle uncompressed API results - https://phabricator.wikimedia.org/T133866#2272333 (10BBlack)
[15:49:07] 06Operations, 10Traffic, 10Wikidata: Varnish seems to sometimes mangle uncompressed API results - https://phabricator.wikimedia.org/T133866#2272368 (10BBlack) p:05Triage>03High
[15:59:06] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[16:01:16] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5920461 keys - replication_delay is 0
[16:45:03] 07Puppet, 10Beta-Cluster-Infrastructure: /etc/puppet/puppet.conf sometimes gets the old puppetmaster FQDN and breaks puppet - https://phabricator.wikimedia.org/T134631#2272446 (10Krenair) I'm not sure how this is a duplicate?
[17:40:21] 06Operations, 10ops-codfw: rack/setup/deploy maps200[1-4] - https://phabricator.wikimedia.org/T134406#2272522 (10Papaul) The OS Installation is done on maps2001, maps2003 and maps2004. On maps2002 after the installation and after the system reboots to load the OS I get the error message below. /bin/sh: E: not...
[18:00:15] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:03:34] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[18:07:04] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:08:33] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy
[18:10:02] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:10:53] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy
[18:14:33] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:15:13] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[18:15:52] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy
[18:16:24] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy
[18:24:52] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0
[18:26:13] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[18:30:42] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:30:53] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:54:35] 06Operations, 10media-storage: API request failed (internal_api_error_LocalFileLockError): [Vy45gApAMCEAAGZ5po8AAAAR] Exception Caught: Could not acquire lock for '(1)_Piano_Recital_at_12.jpg.' - https://phabricator.wikimedia.org/T134662#2272562 (10Steinsplitter)
[19:13:42] 06Operations, 10media-storage: API request failed (internal_api_error_LocalFileLockError): [Vy45gApAMCEAAGZ5po8AAAAR] Exception Caught: Could not acquire lock for '(1)_Piano_Recital_at_12.jpg.' - https://phabricator.wikimedia.org/T134662#2272583 (10matmarex)
[19:13:46] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10media-storage, and 3 others: Unable to delete file pages on commons: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2272585 (10matmarex)
[19:14:58] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10media-storage, and 3 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2214186 (10matmarex)
[19:26:20] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10media-storage, and 3 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2272594 (10Steinsplitter) Yet a other one: https://commons.wikimedia.org/wiki/File...
[19:31:11] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10media-storage, and 3 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2272615 (10Steinsplitter) And a other one: https://commons.wikimedia.org/wiki/File...
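The "Text HTTP 5xx reqs/min", "Esams HTTP 5xx reqs/min" and "High lag" alerts above all follow the same pattern: the check looks at recent graphite datapoints and fires when more than a given percentage of them exceed a critical threshold, recovering once almost none exceed the lower threshold. A simplified sketch of that evaluation follows; the thresholds are taken from the alert text, but the function itself is only an illustration, not the production check_graphite plugin.

```python
"""Simplified percent-over-threshold evaluation, in the spirit of alerts like
"CRITICAL: 10.00% of data above the critical threshold [1000.0]".
This mimics the idea behind the production check; it is not its real code.
"""


def evaluate(datapoints, critical=1000.0, warning=250.0, percentage=10.0):
    """Return (status, message) for a list of recent metric values."""
    valid = [v for v in datapoints if v is not None]  # graphite gaps arrive as None
    if not valid:
        return "UNKNOWN", "no valid datapoints"
    over_crit = 100.0 * sum(1 for v in valid if v > critical) / len(valid)
    over_warn = 100.0 * sum(1 for v in valid if v > warning) / len(valid)
    if over_crit >= percentage:
        return "CRITICAL", "%.2f%% of data above the critical threshold [%.1f]" % (over_crit, critical)
    if over_warn >= percentage:
        return "WARNING", "%.2f%% of data above the warning threshold [%.1f]" % (over_warn, warning)
    return "OK", "Less than %.2f%% above the threshold [%.1f]" % (percentage, warning)


# Example: 2 of 20 samples exceed 1000 req/min -> exactly 10% -> CRITICAL
print(evaluate([120, 80, 1500, 90, 60, 2100] + [100] * 14))
```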
[19:49:50] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10media-storage, and 2 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2272625 (10TerraCodes)
[19:52:05] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 619 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5924057 keys - replication_delay is 619
[19:57:46] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5923521 keys - replication_delay is 0
[20:09:23] PROBLEM - High lag on wdqs1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0]
[20:09:34] is there any restriction on setting arbitrary servers on X-Wikimedia-Debug: ?
[20:13:54] !log restarted blazegraph on wdqs1001 and wdqs1002
[20:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:17:50] thanks SMalyshev
[20:28:56] !log deploying updated Blazegraph version for WDQS to mitigate deadlock issue
[20:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:43:09] PROBLEM - High lag on wdqs1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0]
[20:48:22] 06Operations, 10Traffic, 10Wikidata: Varnish seems to sometimes mangle uncompressed API results - https://phabricator.wikimedia.org/T133866#2272654 (10MZMcBride) Using the exact URL from the task description, I'm consistently getting a bad response: ``` >>> from subprocess import check_output >>> args = ['c...
[21:10:16] RECOVERY - High lag on wdqs1002 is OK: OK: Less than 30.00% above the threshold [600.0]
[21:14:37] 06Operations, 10Traffic, 10Wikidata: Varnish seems to sometimes mangle uncompressed API results - https://phabricator.wikimedia.org/T133866#2272673 (10Yurivict) It is 100% reproducible.
[21:18:39] 06Operations, 10Traffic, 10Wikidata: Varnish seems to sometimes mangle uncompressed API results - https://phabricator.wikimedia.org/T133866#2272674 (10Yurivict) It also seems to be related to the size of the response. 32kB is the critical size. The respone is 32766=0x7FFE bytes.
[22:18:25] (03CR) 10Dereckson: [C: 031] "InitialiseSettings.php is wrapped in the following callback method:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287095 (owner: 10Alexandros Kosiaris)
[22:28:41] PROBLEM - puppet last run on db2034 is CRITICAL: CRITICAL: puppet fail
[22:28:42] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: puppet fail
[22:36:02] RECOVERY - High lag on wdqs1001 is OK: OK: Less than 30.00% above the threshold [600.0]
[22:38:45] 06Operations, 10ops-codfw: rack/setup/deploy maps200[1-4] - https://phabricator.wikimedia.org/T134406#2272723 (10Papaul)
[22:43:57] huh, what kind of project is ops-eqdfw? About which datacenter?
[22:55:42] RECOVERY - puppet last run on db2034 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:57:42] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
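Yurivict's observation at [21:18:39] that the mangled T133866 response comes back as exactly 32766 (0x7FFE) bytes suggests a simple probe: fetch the same API URL with and without gzip and compare the decoded body sizes. The sketch below follows the spirit of MZMcBride's curl-based test; the URL is a placeholder, not the one from the task description, and curl must be installed.

```python
"""Compare gzip-decoded vs. identity responses for the same URL to spot the
truncation around the 32766 (0x7FFE) byte mark discussed in T133866.
Sketch only: the URL is a placeholder and curl must be on $PATH.
"""
from subprocess import check_output

URL = "https://www.wikidata.org/w/api.php?action=EXAMPLE"  # placeholder URL


def body_length(extra_args):
    """Fetch URL via curl with extra_args and return the body length in bytes."""
    return len(check_output(["curl", "-s"] + extra_args + [URL]))


# Force the uncompressed (identity) path through the caches.
plain = body_length(["-H", "Accept-Encoding: identity"])
# Let curl negotiate gzip and decode it locally.
gzipped = body_length(["--compressed"])

print("identity: %d bytes, gzip-decoded: %d bytes" % (plain, gzipped))
if plain != gzipped:
    print("mismatch - possible truncation (note the 32766 = 0x7FFE boundary)")
```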