[00:01:31] (PS1) BryanDavis: toolsdb: Remove stale accounts if present in maintain-dbusers [puppet] - https://gerrit.wikimedia.org/r/418709 (https://phabricator.wikimedia.org/T188680)
[00:06:44] (PS2) BryanDavis: toolsdb: Remove stale accounts if present in maintain-dbusers [puppet] - https://gerrit.wikimedia.org/r/418709 (https://phabricator.wikimedia.org/T188680)
[00:22:05] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy
[00:22:14] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy
[00:22:23] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy
[00:22:53] RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1052 bytes in 0.002 second response time
[00:23:23] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy
[00:23:33] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy
[00:24:24] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy
[00:28:33] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received
[00:30:13] PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:30:24] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received
[00:31:43] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received
[00:32:13] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy
[00:33:33] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy
[00:34:04] RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1052 bytes in 0.005 second response time
[00:34:33] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy
[00:35:43] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received
[00:36:23] PROBLEM - HHVM rendering on mw2113 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:36:34] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy
[00:36:43] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received
[00:37:14] RECOVERY - HHVM rendering on mw2113 is OK: HTTP OK: HTTP/1.1 200 OK - 75700 bytes in 0.312 second response time
[00:37:29] elukey ^^ (sorry for ping again :))
[00:38:23] PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:38:44] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received
[00:40:43] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy
[00:40:43] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy
[00:41:13] RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1052 bytes in 0.002 second response time
[00:43:53] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received
[00:46:43] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy
[00:48:03] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received
[00:49:03] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy
[00:49:43] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received
[00:50:04] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received
[00:50:43] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy
[00:51:03] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy
[00:52:03] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received
[00:53:03] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy
[00:53:04] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received
[00:53:33] PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:55:03] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy
[00:55:04] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received
[00:56:54] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy
[00:58:33] RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1052 bytes in 0.003 second response time
[00:59:23] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/references/{title} (retrieve structured reference data for the Cat article on English Wikipedia) is WARNING: Test retrieve structured reference data for the Cat article on English Wikipedia responds with unexpected value at path /reference_lists[1]/id =
[01:00:13] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received
[01:02:13] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy
[01:02:43] PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:03:23] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received
[01:03:34] RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1052 bytes in 0.002 second response time
[01:04:23] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy
[01:05:14] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-id}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received
[01:07:13] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy
[01:07:50] (PS1) BryanDavis: wiki replicas: Add spamblacklist to allowed log types [puppet] - https://gerrit.wikimedia.org/r/418710 (https://phabricator.wikimedia.org/T184483)
[01:09:44] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy
[01:09:54] RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal
[01:09:54] RECOVERY - PyBal IPVS diff check on lvs1003 is OK: OK: no difference between hosts in IPVS/PyBal
[01:10:13] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
[01:10:14] RECOVERY - PyBal backends health check on lvs1010 is OK: PYBAL OK - All pools are healthy
[01:13:43] RECOVERY - PyBal IPVS diff check on lvs1010 is OK: OK: no difference between hosts in IPVS/PyBal
[01:20:13] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/references/{title} (retrieve structured reference data for the Cat article on English Wikipedia) is WARNING: Test retrieve structured reference data for the Cat article on English Wikipedia responds with unexpected value at path /reference_lists[1]/id =
[01:41:16] I'd have a patch for this ^. Should I deploy it or do we just ack it? https://gerrit.wikimedia.org/r/418711
[02:01:23] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 35 probes of 297 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[02:02:03] RECOVERY - Host wdqs2006.mgmt is UP: PING WARNING - Packet loss = 64%, RTA = 36.69 ms
[02:06:23] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 9 probes of 297 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[02:39:44] PROBLEM - Host wdqs2006.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[03:00:53] RECOVERY - Host wdqs2006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.75 ms
[03:01:03] PROBLEM - HHVM rendering on mw2182 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:01:53] RECOVERY - HHVM rendering on mw2182 is OK: HTTP OK: HTTP/1.1 200 OK - 74856 bytes in 0.313 second response time
[03:07:34] PROBLEM - Host wdqs2006.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[03:23:23] RECOVERY - Host wdqs2006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.12 ms
[03:25:15] (CR) Brian Wolff: [C: 1] wiki replicas: Add spamblacklist to allowed log types [puppet] - https://gerrit.wikimedia.org/r/418710 (https://phabricator.wikimedia.org/T184483) (owner: BryanDavis)
[03:27:03] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 755.39 seconds
[03:36:55] PROBLEM - puppet last run on mw2234 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[04:00:23] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 294.84 seconds
[04:00:43] PROBLEM - puppet last run on db1094 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:00:43] PROBLEM - puppet last run on conf1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:00:43] PROBLEM - puppet last run on analytics1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:00:43] PROBLEM - puppet last run on logstash1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:00:43] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:00:54] PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:00:54] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:01:24] PROBLEM - puppet last run on analytics1065 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:01:33] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:01:53] RECOVERY - puppet last run on mw2234 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:01:56] PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:01:56] PROBLEM - puppet last run on labvirt1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:01:56] PROBLEM - puppet last run on conf1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:01:56] PROBLEM - puppet last run on mw1312 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:02:13] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:02:13] PROBLEM - puppet last run on cp4025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:03:23] PROBLEM - puppet last run on mw1282 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:04:13] PROBLEM - puppet last run on mw1275 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:04:23] PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:04:24] PROBLEM - puppet last run on baham is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:04:43] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:29:13] RECOVERY - puppet last run on mw1275 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[04:29:33] RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[04:29:33] RECOVERY - puppet last run on baham is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:29:53] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:30:43] RECOVERY - puppet last run on db1094 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[04:30:43] RECOVERY - puppet last run on conf1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[04:30:43] RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:30:43] RECOVERY - puppet last run on logstash1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[04:30:43] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[04:31:03] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[04:31:03] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[04:31:24] RECOVERY - puppet last run on analytics1065 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[04:31:33] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[04:31:53] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[04:31:54] RECOVERY - puppet last run on labvirt1012 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[04:31:54] RECOVERY - puppet last run on conf1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[04:31:54] RECOVERY - puppet last run on mw1312 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[04:32:13] RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[04:32:13] RECOVERY - puppet last run on cp4025 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[04:33:23] RECOVERY - puppet last run on mw1282 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[05:12:23] PROBLEM - HHVM rendering on mw1294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:13:13] RECOVERY - HHVM rendering on mw1294 is OK: HTTP OK: HTTP/1.1 200 OK - 74747 bytes in 0.144 second response time
[05:22:53] PROBLEM - MegaRAID on db1073 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[05:22:54] ACKNOWLEDGEMENT - MegaRAID on db1073 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T189403
[05:22:58] Operations, ops-eqiad: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T189403#4040910 (ops-monitoring-bot)
[05:58:53] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/references/{title} (retrieve structured reference data for the Cat article on English Wikipedia) is WARNING: Test retrieve structured reference data for the Cat article on English Wikipedia responds with unexpected value at path /reference_lists[1]/id =
[06:11:23] Operations, ops-eqiad, DBA: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T189403#4040923 (Marostegui) p:Triage>High a:Cmjohnson This is m5 master @cmjohnson do you have an used disk somewhere to replace this one? Thanks!
[06:20:34] PROBLEM - Host wdqs2006.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[06:36:23] RECOVERY - Host wdqs2006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.71 ms
[06:36:33] PROBLEM - HHVM rendering on mw2129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:37:23] RECOVERY - HHVM rendering on mw2129 is OK: HTTP OK: HTTP/1.1 200 OK - 74809 bytes in 0.296 second response time
[06:45:53] PROBLEM - Host wdqs2006.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:48:13] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/references/{title} (retrieve structured reference data for the Cat article on English Wikipedia) is WARNING: Test retrieve structured reference data for the Cat article on English Wikipedia responds with unexpected value at path /reference_lists[1]/id =
[07:58:34] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[07:58:53] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[08:08:43] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[08:09:03] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[08:11:15] (PS1) Elukey: Fix eventlog1002's ipv6 address [dns] - https://gerrit.wikimedia.org/r/418714 (https://phabricator.wikimedia.org/T185667)
[08:16:23] RECOVERY - Host wdqs2006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.73 ms
[08:29:45] aqs/druid failures happened at midnight UTC are due to big queries again, my team is aware and we'll work on it starting tomorrow :)
[08:29:49] thanks paladox for the ping!
[08:32:16] Operations, Analytics: Replace eventlog1001's IP with eventlog1002's in analytics-in4 - https://phabricator.wikimedia.org/T189408#4040987 (Peachey88)
[08:32:42] Operations, Analytics, netops: Replace eventlog1001's IP with eventlog1002's in analytics-in4 - https://phabricator.wikimedia.org/T189408#4040988 (elukey)
[08:50:38] !log executed sudo rm /etc/logrotate.d/kafkatee-webrequest-analytics on oxygen/rhenium to stop daily cronspam
[08:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:31] Operations, LuaSandbox, MediaWiki-extensions-WikibaseClient, Wikidata, hardware-requests: Strong reduction of computing time at Wikivoyage needed - https://phabricator.wikimedia.org/T189409#4040993 (RolandUnger)
[09:35:53] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[09:36:43] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[09:45:43] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[09:46:03] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[10:36:03] (PS3) Zoranzoki21: Revert "Restrict FlaggedRevs to only operated on NS_MAIN on arwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/418700 (https://phabricator.wikimedia.org/T148603) (owner: Ahmed123)
[11:14:54] PROBLEM - Host wdqs2006.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[12:02:43] RECOVERY - Host wdqs2006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 42.21 ms
[12:33:04] Operations, Ops-Access-Requests: Requesting deployment access for samwilson - https://phabricator.wikimedia.org/T189414#4041118 (Samwilson)
[12:38:46] Operations, Ops-Access-Requests: Requesting access to terbium.eqiad.wmnet for bmansurov - https://phabricator.wikimedia.org/T189285#4037542 (MarcoAurelio) I guess this is what's called `restricted` in the puppet config.
[12:54:23] PROBLEM - HHVM rendering on mw2192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:55:14] RECOVERY - HHVM rendering on mw2192 is OK: HTTP OK: HTTP/1.1 200 OK - 74749 bytes in 0.301 second response time
[13:12:03] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/references/{title} (retrieve structured reference data for the Cat article on English Wikipedia) is WARNING: Test retrieve structured reference data for the Cat article on English Wikipedia responds with unexpected value at path /reference_lists[1]/id =
[15:09:02] (CR) Reedy: Disable abusefilter from collecting private data on Beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/416346 (https://phabricator.wikimedia.org/T188862) (owner: MarcoAurelio)
[15:15:34] PROBLEM - Host wdqs2006.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:20:23] PROBLEM - HHVM rendering on mw2204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:21:13] RECOVERY - HHVM rendering on mw2204 is OK: HTTP OK: HTTP/1.1 200 OK - 74763 bytes in 0.302 second response time
[15:42:03] RECOVERY - Host wdqs2006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 41.80 ms
[20:33:23] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/references/{title} (retrieve structured reference data for the Cat article on English Wikipedia) is WARNING: Test retrieve structured reference data for the Cat article on English Wikipedia responds with unexpected value at path /reference_lists[1]/id =
[22:59:36] PROBLEM - Host db1069 is DOWN: PING CRITICAL - Packet loss = 100%
[23:28:38] RECOVERY - Host db1069 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms