[01:12:26] PROBLEM - HHVM rendering on mw2180 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:13:16] RECOVERY - HHVM rendering on mw2180 is OK: HTTP OK: HTTP/1.1 200 OK - 73557 bytes in 0.291 second response time
[03:24:16] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 768.21 seconds
[03:36:17] PROBLEM - puppet last run on analytics1053 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:45:25] Operations, MediaWiki-Vagrant, Patch-For-Review: Import kibana package from jessie into stretch - https://phabricator.wikimedia.org/T183071#3859447 (bd808) The ELK packages are fix now. The cirrussearch role is still failing (T183306) in part due to: ``` ==> default: Error: Could not update: Executio...
[03:46:26] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 117.29 seconds
[04:01:26] RECOVERY - puppet last run on analytics1053 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[04:10:26] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 10.34% of data above the critical threshold [106250000.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[04:40:36] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [93750000.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[05:08:29] (PS1) BryanDavis: labsdb: add missing CNAMEs for centralauth and meta [puppet] - https://gerrit.wikimedia.org/r/400088 (https://phabricator.wikimedia.org/T183651)
[05:10:49] (PS2) BryanDavis: labsdb: add missing CNAMEs for centralauth and meta [puppet] - https://gerrit.wikimedia.org/r/400088 (https://phabricator.wikimedia.org/T183651)
[05:12:46] PROBLEM - HHVM rendering on mw1294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:13:36] RECOVERY - HHVM rendering on mw1294 is OK: HTTP OK: HTTP/1.1 200 OK - 73762 bytes in 0.144 second response time
[05:21:02] (CR) Andrew Bogott: [C: 2] labsdb: add missing CNAMEs for centralauth and meta [puppet] - https://gerrit.wikimedia.org/r/400088 (https://phabricator.wikimedia.org/T183651) (owner: BryanDavis)
[07:01:26] PROBLEM - Check Varnish expiry mailbox lag on cp4026 is CRITICAL: CRITICAL: expiry mailbox lag is 2022860
[08:07:16] PROBLEM - Host chlorine is DOWN: PING CRITICAL - Packet loss = 100%
[08:07:16] PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100%
[08:08:06] PROBLEM - Host logstash1007 is DOWN: PING CRITICAL - Packet loss = 100%
[08:08:06] PROBLEM - Host planet1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:08:07] PROBLEM - Host mwdebug1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:08:07] PROBLEM - Host releases1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:08:07] PROBLEM - Host dubnium is DOWN: PING CRITICAL - Packet loss = 100%
[08:08:26] PROBLEM - Host hassium is DOWN: PING CRITICAL - Packet loss = 100%
[08:08:36] PROBLEM - Host logstash1008 is DOWN: PING CRITICAL - Packet loss = 100%
[08:08:56] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Internal Server Error
[08:08:56] RECOVERY - Host chlorine is UP: PING OK - Packet loss = 0%, RTA = 0.97 ms
[08:09:06] RECOVERY - Host planet1001 is UP: PING OK - Packet loss = 0%, RTA = 2.45 ms
[08:09:06] RECOVERY - Host hassium is UP: PING OK - Packet loss = 0%, RTA = 2.08 ms
[08:09:06] RECOVERY - Host logstash1007 is UP: PING OK - Packet loss = 0%, RTA = 2.04 ms
[08:09:06] RECOVERY - Host releases1001 is UP: PING OK - Packet loss = 0%, RTA = 2.12 ms
[08:09:06] RECOVERY - Host logstash1008 is UP: PING OK - Packet loss = 0%, RTA = 2.62 ms
[08:09:07] RECOVERY - Host mwdebug1002 is UP: PING OK - Packet loss = 0%, RTA = 2.39 ms
[08:09:07] RECOVERY - Host dubnium is UP: PING OK - Packet loss = 0%, RTA = 2.24 ms
[08:09:36] PROBLEM - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Internal Server Error
[08:09:56] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
[08:10:36] RECOVERY - Host bohrium is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms
[08:10:36] RECOVERY - PyBal backends health check on lvs1010 is OK: PYBAL OK - All pools are healthy
[08:13:06] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=miscvar-status_type=5
[08:24:06] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=Allvar-cache_type=miscvar-status_type=5
[08:29:16] PROBLEM - etc request latencies on chlorine is CRITICAL: CRITICAL - etcd_request_latencies is 6154888 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:29:56] PROBLEM - Request latencies on chlorine is CRITICAL: CRITICAL - apiserver_request_latencies is 3400034 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:30:57] RECOVERY - Request latencies on chlorine is OK: OK - apiserver_request_latencies is 2062 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:31:16] RECOVERY - etc request latencies on chlorine is OK: OK - etcd_request_latencies is 1665 https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[10:20:16] Puppet, MediaWiki-Vagrant, MediaWiki-extensions-SendGrid, Patch-For-Review: Create a MW-Vagrant role for SendGrid extension - https://phabricator.wikimedia.org/T183571#3859558 (D3r1ck01) Resolved>Open Follow up patch needed, will close this once it's done.
[10:41:27] RECOVERY - Check Varnish expiry mailbox lag on cp4026 is OK: OK: expiry mailbox lag is 5
[11:21:09] (CR) Jayprakash12345: "@Dzahn Can you merge so that We can go ahead?" [dns] - https://gerrit.wikimedia.org/r/399831 (https://phabricator.wikimedia.org/T183561) (owner: Jayprakash12345)
[11:23:16] (PS1) Urbanecm: Create rollbacker user group for ruwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/400093 (https://phabricator.wikimedia.org/T183655)
[11:27:44] (PS3) Urbanecm: Switch Wikipedias from $wgLogoHD to direct using of a SVG [mediawiki-config] - https://gerrit.wikimedia.org/r/399805 (https://phabricator.wikimedia.org/T178942)
[11:28:11] (PS3) Urbanecm: Update chrwiki logo [mediawiki-config] - https://gerrit.wikimedia.org/r/399806 (https://phabricator.wikimedia.org/T180553)
[11:28:42] (PS4) Urbanecm: Switch Wikipedias from $wgLogoHD to direct using of a SVG [mediawiki-config] - https://gerrit.wikimedia.org/r/399805 (https://phabricator.wikimedia.org/T178942)
[11:29:01] (PS4) Urbanecm: Update chrwiki logo [mediawiki-config] - https://gerrit.wikimedia.org/r/399806 (https://phabricator.wikimedia.org/T180553)
[13:42:49] (PS1) Urbanecm: Enable mapframe on lvwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/400096 (https://phabricator.wikimedia.org/T183661)
[15:37:18] Operations, Ops-Access-Requests, AICaptcha, WMF-NDA-Requests: Requesting access to EventLogging data for Vinitha - https://phabricator.wikimedia.org/T181952#3859805 (Groovier) I have changed my SSH private key because my hard disk crashed. Please update my public key as below: ``` ssh-rsa AAAAB...
[16:06:18] (PS1) Giuseppe Lavagetto: apache: add httpd module as a replacement [puppet] - https://gerrit.wikimedia.org/r/400100
[16:09:47] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time
[16:10:05] <_joe_> !log restarted pdfrenderer on scb1002
[16:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:19:56] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:44:56] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[17:52:34] thnx _joe_ for pdfrender
[22:59:32] Puppet, MediaWiki-Vagrant, MediaWiki-extensions-SendGrid, Patch-For-Review: Create a MW-Vagrant role for SendGrid extension - https://phabricator.wikimedia.org/T183571#3860076 (D3r1ck01) Well tested on Cloud VPS and composer enabled for the role. Thanks @bd808 for merging :)
[23:03:18] Puppet, MediaWiki-Vagrant, MediaWiki-extensions-SendGrid, Patch-For-Review: Create a MW-Vagrant role for SendGrid extension - https://phabricator.wikimedia.org/T183571#3860083 (D3r1ck01) Open>Resolved
[23:45:02] (PS33) TerraCodes: $wmf* -> $wmg* [mediawiki-config] - https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956)