[00:00:28] PROBLEM - High lag on wdqs1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0]
[00:11:29] (03CR) 10Alex Monk: "Faidon: I4b9b1857 & I349653a0 convert it to a simple non-template script using a config file set up by puppet, without ports hardcoded or " [puppet] - 10https://gerrit.wikimedia.org/r/286683 (owner: 10Alex Monk)
[00:14:29] 06Operations, 06WMF-Legal, 07Privacy: Consider moving policy.wikimedia.org away from WordPress.com - https://phabricator.wikimedia.org/T132104#2271454 (10DanielFriesen) @ashley They just want to use MW to host the policy pages rather than use it like an actual wiki. So omitting things like content_actions an...
[00:19:45] RECOVERY - High lag on wdqs1002 is OK: OK: Less than 30.00% above the threshold [600.0]
[00:23:23] (03PS1) 10Yurik: Removed obsolete graph ext settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287299
[00:24:14] RECOVERY - High lag on wdqs1001 is OK: OK: Less than 30.00% above the threshold [600.0]
[00:24:16] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Allow RelEng nova log access - https://phabricator.wikimedia.org/T133992#2271484 (10Dzahn) re: "the releng team". Are we talking about the "contint-admins" , "contint-users" groups? These are the ones...
[00:30:30] 06Operations, 10Wikimedia-IRC-RC-Server: udpmxircecho should write stats of messages processed and we should alert when that drops to zero - https://phabricator.wikimedia.org/T134326#2262099 (10AlexMonk-WMF) or if it stops writing them I guess, maybe based on the file's timestamp?
[00:49:34] (03PS7) 10Aaron Schulz: Switched to pt-heartbeat lag detection on s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266)
[00:49:53] (03CR) 10Aaron Schulz: Switched to pt-heartbeat lag detection on s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266) (owner: 10Aaron Schulz)
[01:01:38] !log ori@tin Synchronized php-1.27.0-wmf.22/includes/api/ApiStashEdit.php: If56084466: Bump PRESUME_FRESH_TTL_SEC to improve hit rate and avoid link queries (duration: 00m 34s)
[01:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:06:44] 06Operations, 06Labs, 06Project-Admins: Archive old Incident-* projects - https://phabricator.wikimedia.org/T134624#2271591 (10Danny_B)
[02:19:03] Request from 90.180.83.194 via cp1045 cp1045, Varnish XID 2072679218
[02:19:06] Error: 503, Service Unavailable at Sat, 07 May 2016 02:18:37 GMT
[02:19:18] >>> UNRECOVERABLE FATAL ERROR <<<
[02:19:18] Maximum execution time of 10 seconds exceeded
[02:19:18] /srv/deployment/phabricator/deployment-cache/revs/7dd45143c333b8fb854b8f40bd96c46ea56a0970/phabricator/src/applications/maniphest/storage/ManiphestTask.php:152
[02:19:22] ┻━┻ ︵ ¯\_(ツ)_/¯ ︵ ┻━┻
[02:24:06] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.22) (duration: 09m 34s)
[02:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:31:05] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:45:18] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.23) (duration: 05m 38s)
[02:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:54:33] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat May 7 02:54:32 UTC 2016 (duration 9m 14s)
[02:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
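The [00:30:30] comment on T134326 suggests alerting not only when the udpmxircecho stats drop to zero but also when the stats file stops being written at all, based on the file's timestamp. A minimal sketch of that idea follows; the stats path and the 300-second staleness threshold are assumptions for illustration, not values from the task.

```python
#!/usr/bin/env python
"""Nagios-style check: alert if a stats file has not been written recently.

The path and threshold are illustrative assumptions; T134326 does not say
where udpmxircecho would write its stats or how stale is too stale.
"""
import os
import sys
import time

STATS_FILE = "/var/run/udpmxircecho/stats"  # hypothetical location
MAX_AGE_SECONDS = 300                       # hypothetical staleness threshold


def main():
    try:
        age = time.time() - os.stat(STATS_FILE).st_mtime
    except OSError as exc:
        print("CRITICAL: cannot stat %s: %s" % (STATS_FILE, exc))
        return 2  # Nagios CRITICAL
    if age > MAX_AGE_SECONDS:
        print("CRITICAL: %s last written %.0fs ago (threshold %ds)"
              % (STATS_FILE, age, MAX_AGE_SECONDS))
        return 2
    print("OK: %s written %.0fs ago" % (STATS_FILE, age))
    return 0  # Nagios OK


if __name__ == "__main__":
    sys.exit(main())
```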
[02:57:44] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:05:10] gerrit seems very slow to pull from
[03:06:57] ottomata, you created a git repository without making the github mirror
[03:08:38] 4e707beb 03:05:14.637 03:04:14.637 (retry 370) push git@github.com:wikimedia/operations-debs-druid
[03:09:40] fixed
[03:13:40] not sure what's up with this one: 4b114dea 03:13:52.211 03:12:52.211 (retry 3024) push gerritslave@antimony.wikimedia.org:/var/lib/git/mediawiki/extensions/WikiShare.git
[03:22:34] can pull again now
[03:24:11] based on show-queue it seems several instances of "gerrit gc --all" (gerrit2) are gone, as are several instances of "git-upload-pack '/labs/private.git'" (labs-puppet)
[03:54:00] PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:15:48] 07Puppet, 10Beta-Cluster-Infrastructure: /etc/puppet/puppet.conf sometimes gets the old puppetmaster FQDN and breaks puppet - https://phabricator.wikimedia.org/T134631#2271750 (10Krenair)
[04:16:52] 07Puppet, 10Beta-Cluster-Infrastructure: /etc/puppet/puppet.conf sometimes gets the old puppetmaster FQDN and breaks puppet - https://phabricator.wikimedia.org/T134631#2271764 (10Krenair) This appears to have happened on deployment-memc02 since 13th April, I also found -logstash2 with the issue and -mathoid wi...
[04:19:52] RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:26:53] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR
[04:28:12] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]BR
[05:01:22] PROBLEM - puppet last run on es2013 is CRITICAL: CRITICAL: puppet fail
[05:18:08] PROBLEM - puppet last run on mw2075 is CRITICAL: CRITICAL: puppet fail
[05:26:37] RECOVERY - puppet last run on es2013 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[05:38:34] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: puppet fail
[05:45:52] RECOVERY - puppet last run on mw2075 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:05:32] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[06:15:54] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 646 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5920498 keys - replication_delay is 646
[06:17:52] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5910570 keys - replication_delay is 0
[06:30:42] PROBLEM - puppet last run on mw1260 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:04] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:13] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:13] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:03] PROBLEM - puppet last run on ms-be1010 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:13] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:43] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:02] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:03] PROBLEM - puppet last run on mw2043 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:34] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:43] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:35:02] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:13] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:55:33] RECOVERY - puppet last run on mw1260 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:55:53] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[06:56:32] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:56:52] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:56:53] RECOVERY - puppet last run on mw2043 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:57:02] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[06:57:23] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[06:57:33] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:57:52] RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:52] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:57:58] 07Puppet, 10Beta-Cluster-Infrastructure: /etc/puppet/puppet.conf sometimes gets the old puppetmaster FQDN and breaks puppet - https://phabricator.wikimedia.org/T134631#2271750 (10hashar) Supposed to be fixed by https://gerrit.wikimedia.org/r/#/c/284852/ for T132689
[06:58:02] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:03] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:03] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:31] 07Puppet, 10Beta-Cluster-Infrastructure: /etc/puppet/puppet.conf sometimes gets the old puppetmaster FQDN and breaks puppet - https://phabricator.wikimedia.org/T134631#2271819 (10hashar)
[06:58:35] 07Puppet, 10Beta-Cluster-Infrastructure, 06Labs, 13Patch-For-Review: /etc/puppet/puppet.conf keeps getting double content - first for labs-wide puppetmaster, then for the correct puppetmaster - https://phabricator.wikimedia.org/T132689#2206880 (10hashar)
[07:05:08] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[07:06:58] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5911746 keys - replication_delay is 0
[07:55:44] (03PS1) 10Nikerabbit: Use wfLoadExtension for LocalisationUpdate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287316
[08:27:35] (03PS2) 10Nikerabbit: Use wfLoadExtension for LocalisationUpdate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287316
[08:34:01] PROBLEM - puppet last run on mw1168 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:46:31] RECOVERY - MariaDB Slave Lag: s5 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89958.83 seconds
[09:01:11] RECOVERY - puppet last run on mw1168 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:18:39] 06Operations, 10DNS, 10Mail, 10Phabricator, and 2 others: phabricator.wikimedia.org has no SPF record - https://phabricator.wikimedia.org/T116806#2271904 (10Nemo_bis) > IMVHO using such dumb forwarders is a behaviour that should be discouraged How common is such a forwarding mechanism? Email aliases are c...
[09:19:43] (03CR) 10Faidon Liambotis: [C: 031] "\o/ (Assuming this works... :)" [puppet] - 10https://gerrit.wikimedia.org/r/287246 (owner: 10Alex Monk)
[09:20:13] (03CR) 10Faidon Liambotis: [C: 031] udpmxircecho: Move from template to file [puppet] - 10https://gerrit.wikimedia.org/r/287247 (owner: 10Alex Monk)
[10:34:06] PROBLEM - puppet last run on mw2012 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:51:05] (03PS1) 10Volans: MariaDB: tune thread-pool to avoid Aborted_connects [puppet] - 10https://gerrit.wikimedia.org/r/287394 (https://phabricator.wikimedia.org/T133333)
[10:53:33] (03CR) 10Volans: "Jaime: it's ok do it on all core DBs or you prefer to have it only on the masters given that we encountered the issue only on masters?" [puppet] - 10https://gerrit.wikimedia.org/r/287394 (https://phabricator.wikimedia.org/T133333) (owner: 10Volans)
[11:01:29] RECOVERY - puppet last run on mw2012 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[11:55:37] !log ori@tin Synchronized php-1.27.0-wmf.22/includes/api/ApiStashEdit.php: If56084466: Make stashEditFromPreview() call setCacheTime() (duration: 00m 33s)
[11:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:05:08] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 633 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5939600 keys - replication_delay is 633
[12:10:58] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5915914 keys - replication_delay is 0
[12:32:44] !log Restarted blazegraph on wdqs1002 (Unresponsive, even locally: java.io.IOException: Too many open files) T134238
[12:32:46] T134238: Query service fails with "Too many open files" - https://phabricator.wikimedia.org/T134238
[12:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:52:32] PROBLEM - High lag on wdqs1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0]
[13:37:47] RECOVERY - High lag on wdqs1002 is OK: OK: Less than 30.00% above the threshold [600.0]
[14:07:08] (03CR) 10Jcrespo: "I would definitely do it only on the masters- this issue never happened in the last 1 year for other slaves, and we still not have a 100% " [puppet] - 10https://gerrit.wikimedia.org/r/287394 (https://phabricator.wikimedia.org/T133333) (owner: 10Volans)
[14:32:02] PROBLEM - HHVM rendering on mw1195 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.007 second response time
[14:34:02] RECOVERY - HHVM rendering on mw1195 is OK: HTTP OK: HTTP/1.1 200 OK - 67081 bytes in 0.204 second response time
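The [12:32:44] restart of Blazegraph on wdqs1002 followed a java.io.IOException: Too many open files (T134238). One quick way to see how close a process is to that failure is to compare its open descriptors with its "Max open files" soft limit from /proc. The sketch below is only an illustration of that check on Linux, not how the wdqs incident was actually diagnosed.

```python
#!/usr/bin/env python
"""Rough sketch: report a process's open file descriptors vs. its soft limit.

Linux-specific; reads /proc/<pid>/fd and /proc/<pid>/limits. Must be run as a
user allowed to read the target process's /proc entries (e.g. root).
"""
import os
import sys


def fd_usage(pid):
    """Return (open_fd_count, soft_limit_or_None) for the given pid."""
    open_fds = len(os.listdir("/proc/%d/fd" % pid))
    soft_limit = None
    with open("/proc/%d/limits" % pid) as limits:
        for line in limits:
            if line.startswith("Max open files"):
                value = line.split()[3]  # fourth column is the soft limit
                if value != "unlimited":
                    soft_limit = int(value)
                break
    return open_fds, soft_limit


if __name__ == "__main__":
    pid = int(sys.argv[1])
    used, limit = fd_usage(pid)
    if limit:
        print("pid %d: %d open fds, soft limit %d (%.1f%% used)"
              % (pid, used, limit, 100.0 * used / limit))
    else:
        print("pid %d: %d open fds, soft limit unlimited" % (pid, used))
```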
[14:35:41] 06Operations, 10Analytics-EventLogging, 06Performance-Team, 13Patch-For-Review: "Throughput of EventLogging NavigationTiming events" UNKNOWN - https://phabricator.wikimedia.org/T132770#2272198 (10Krinkle) 05Open>03Resolved No, let's close this.
[14:38:50] !log inplace precise upgrade to trusty on db1069 before labs explodes
[14:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:40:21] PROBLEM - Host db1069 is DOWN: PING CRITICAL - Packet loss = 100%
[14:40:52] RECOVERY - Host db1069 is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms
[14:44:30] 06Operations, 10DNS, 10Mail, 10Phabricator, and 2 others: phabricator.wikimedia.org has no SPF record - https://phabricator.wikimedia.org/T116806#2272203 (10BBlack) Even if we can't come to a firm consensus on which of `?all` or `-all` is the most-appropriate setting, I think both sides of that debate woul...
[14:47:42] RECOVERY - cassandra-b CQL 10.192.48.55:9042 on restbase2009 is OK: TCP OK - 0.037 second response time on port 9042
[15:05:21] 06Operations, 10Traffic, 10Wikidata: Varnish seems to sometimes mangle uncompressed API results - https://phabricator.wikimedia.org/T133866#2272246 (10BBlack) Thanks for merging in the probably-related tasks. I had somehow missed really noticing T123159 earlier... So probably digging into gunzip itself isn...
[15:08:52] 06Operations, 10Traffic, 05Security: Varnish corruption on certain response lengths - https://phabricator.wikimedia.org/T134649#2272249 (10BBlack)
[15:15:33] 06Operations, 10DBA, 13Patch-For-Review: reimage or decom db servers on precise - https://phabricator.wikimedia.org/T125028#2272279 (10jcrespo)
[15:15:35] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Upgrade db1069 - https://phabricator.wikimedia.org/T134349#2272277 (10jcrespo) 05Open>03Resolved I upgraded it this weekend because if not I was going to become crazy. Hopefully that will fix all tokudb-replication issues for once.
[15:19:31] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[15:27:02] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[15:37:06] 06Operations, 10Traffic, 10Wikidata: Varnish seems to sometimes mangle uncompressed API results - https://phabricator.wikimedia.org/T133866#2272333 (10BBlack)
[15:49:07] 06Operations, 10Traffic, 10Wikidata: Varnish seems to sometimes mangle uncompressed API results - https://phabricator.wikimedia.org/T133866#2272368 (10BBlack) p:05Triage>03High
[15:59:06] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[16:01:16] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5920461 keys - replication_delay is 0
[16:45:03] 07Puppet, 10Beta-Cluster-Infrastructure: /etc/puppet/puppet.conf sometimes gets the old puppetmaster FQDN and breaks puppet - https://phabricator.wikimedia.org/T134631#2272446 (10Krenair) I'm not sure how this is a duplicate?
[17:40:21] 06Operations, 10ops-codfw: rack/setup/deploy maps200[1-4] - https://phabricator.wikimedia.org/T134406#2272522 (10Papaul) The OS Installation is done on maps2001, maps2003 and maps2004. On maps2002 after the installation and after the system reboots to load the OS I get the error message below. /bin/sh: E: not...
[18:00:15] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:03:34] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[18:07:04] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:08:33] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy
[18:10:02] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:10:53] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy
[18:14:33] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:15:13] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[18:15:52] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy
[18:16:24] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy
[18:24:52] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0
[18:26:13] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[18:30:42] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:30:53] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:54:35] 06Operations, 10media-storage: API request failed (internal_api_error_LocalFileLockError): [Vy45gApAMCEAAGZ5po8AAAAR] Exception Caught: Could not acquire lock for '(1)_Piano_Recital_at_12.jpg.' - https://phabricator.wikimedia.org/T134662#2272562 (10Steinsplitter)
[19:13:42] 06Operations, 10media-storage: API request failed (internal_api_error_LocalFileLockError): [Vy45gApAMCEAAGZ5po8AAAAR] Exception Caught: Could not acquire lock for '(1)_Piano_Recital_at_12.jpg.' - https://phabricator.wikimedia.org/T134662#2272583 (10matmarex)
[19:13:46] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10media-storage, and 3 others: Unable to delete file pages on commons: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2272585 (10matmarex)
[19:14:58] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10media-storage, and 3 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2214186 (10matmarex)
[19:26:20] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10media-storage, and 3 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2272594 (10Steinsplitter) Yet a other one: https://commons.wikimedia.org/wiki/File...
[19:31:11] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10media-storage, and 3 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2272615 (10Steinsplitter) And a other one: https://commons.wikimedia.org/wiki/File...
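The "Text HTTP 5xx reqs/min", "Esams HTTP 5xx reqs/min" and "High lag" alerts above all follow the same pattern: the check looks at recent graphite datapoints and fires when more than a given percentage of them exceed a critical threshold, recovering once almost none exceed the lower threshold. A simplified sketch of that evaluation follows; the thresholds are taken from the alert text, but the function itself is only an illustration, not the production check_graphite plugin.

```python
"""Simplified percent-over-threshold evaluation, in the spirit of alerts like
"CRITICAL: 10.00% of data above the critical threshold [1000.0]".
This mimics the idea behind the production check; it is not its real code.
"""


def evaluate(datapoints, critical=1000.0, warning=250.0, percentage=10.0):
    """Return (status, message) for a list of recent metric values."""
    valid = [v for v in datapoints if v is not None]  # graphite gaps arrive as None
    if not valid:
        return "UNKNOWN", "no valid datapoints"
    over_crit = 100.0 * sum(1 for v in valid if v > critical) / len(valid)
    over_warn = 100.0 * sum(1 for v in valid if v > warning) / len(valid)
    if over_crit >= percentage:
        return "CRITICAL", "%.2f%% of data above the critical threshold [%.1f]" % (over_crit, critical)
    if over_warn >= percentage:
        return "WARNING", "%.2f%% of data above the warning threshold [%.1f]" % (over_warn, warning)
    return "OK", "Less than %.2f%% above the threshold [%.1f]" % (percentage, warning)


# Example: 2 of 20 samples exceed 1000 req/min -> exactly 10% -> CRITICAL
print(evaluate([120, 80, 1500, 90, 60, 2100] + [100] * 14))
```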
[19:49:50] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10media-storage, and 2 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2272625 (10TerraCodes)
[19:52:05] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 619 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5924057 keys - replication_delay is 619
[19:57:46] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5923521 keys - replication_delay is 0
[20:09:23] PROBLEM - High lag on wdqs1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0]
[20:09:34] is there any restriction on setting arbitrary servers on X-Wikimedia-Debug: ?
[20:13:54] !log restarted blazegraph on wdqs1001 and wdqs1002
[20:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:17:50] thanks SMalyshev
[20:28:56] !log deploying updated Blazegraph version for WDQS to mitigate deadlock issue
[20:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:43:09] PROBLEM - High lag on wdqs1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0]
[20:48:22] 06Operations, 10Traffic, 10Wikidata: Varnish seems to sometimes mangle uncompressed API results - https://phabricator.wikimedia.org/T133866#2272654 (10MZMcBride) Using the exact URL from the task description, I'm consistently getting a bad response: ``` >>> from subprocess import check_output >>> args = ['c...
[21:10:16] RECOVERY - High lag on wdqs1002 is OK: OK: Less than 30.00% above the threshold [600.0]
[21:14:37] 06Operations, 10Traffic, 10Wikidata: Varnish seems to sometimes mangle uncompressed API results - https://phabricator.wikimedia.org/T133866#2272673 (10Yurivict) It is 100% reproducible.
[21:18:39] 06Operations, 10Traffic, 10Wikidata: Varnish seems to sometimes mangle uncompressed API results - https://phabricator.wikimedia.org/T133866#2272674 (10Yurivict) It also seems to be related to the size of the response. 32kB is the critical size. The respone is 32766=0x7FFE bytes.
[22:18:25] (03CR) 10Dereckson: [C: 031] "InitialiseSettings.php is wrapped in the following callback method:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287095 (owner: 10Alexandros Kosiaris)
[22:28:41] PROBLEM - puppet last run on db2034 is CRITICAL: CRITICAL: puppet fail
[22:28:42] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: puppet fail
[22:36:02] RECOVERY - High lag on wdqs1001 is OK: OK: Less than 30.00% above the threshold [600.0]
[22:38:45] 06Operations, 10ops-codfw: rack/setup/deploy maps200[1-4] - https://phabricator.wikimedia.org/T134406#2272723 (10Papaul)
[22:43:57] huh, what kind of project is ops-eqdfw? About which datacenter?
[22:55:42] RECOVERY - puppet last run on db2034 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:57:42] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
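Yurivict's observation at [21:18:39] that the mangled T133866 response comes back as exactly 32766 (0x7FFE) bytes suggests a simple probe: fetch the same API URL with and without gzip and compare the decoded body sizes. The sketch below follows the spirit of MZMcBride's curl-based test; the URL is a placeholder, not the one from the task description, and curl must be installed.

```python
"""Compare gzip-decoded vs. identity responses for the same URL to spot the
truncation around the 32766 (0x7FFE) byte mark discussed in T133866.
Sketch only: the URL is a placeholder and curl must be on $PATH.
"""
from subprocess import check_output

URL = "https://www.wikidata.org/w/api.php?action=EXAMPLE"  # placeholder URL


def body_length(extra_args):
    """Fetch URL via curl with extra_args and return the body length in bytes."""
    return len(check_output(["curl", "-s"] + extra_args + [URL]))


# Force the uncompressed (identity) path through the caches.
plain = body_length(["-H", "Accept-Encoding: identity"])
# Let curl negotiate gzip and decode it locally.
gzipped = body_length(["--compressed"])

print("identity: %d bytes, gzip-decoded: %d bytes" % (plain, gzipped))
if plain != gzipped:
    print("mismatch - possible truncation (note the 32766 = 0x7FFE boundary)")
```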