[01:04:06] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:04:25] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:04:26] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb={GET,LIST,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:04:36] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb={LIST,PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:04:36] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb={PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:05:55] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:16:09] (CR) Alex Monk: "This is looking fairly good, I'll give it a go later on" (1 comment) [software/certcentral] - https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) (owner: Vgutierrez)
[02:06:46] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:06:46] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:07:35] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:07:45] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:07:55] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:07:56] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:34:27] PROBLEM - Host labservices1001 is DOWN: CRITICAL - Host Unreachable (208.80.155.117)
[02:41:06] !log rebooting labservices1001 from mgmt
[02:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:44:07] RECOVERY - Host labservices1001 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms
[02:47:21] Operations, ops-eqiad, cloud-services-team, Patch-For-Review: Labservices1001 crashed - https://phabricator.wikimedia.org/T196252 (Andrew) This just happened again -- thanks to better paging I caught it sooner :) There's nothing of interest in the syslog, just a sudden stop: ``` Aug 5 02:32:0...
[02:48:26] PROBLEM - toolschecker: Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.749 second response time
[02:48:45] Operations, ops-eqiad, cloud-services-team, Patch-For-Review: Labservices1001 crashed - https://phabricator.wikimedia.org/T196252 (Andrew) Ah, there were some temp warnings a few minutes earlier: ``` Aug 5 02:29:02 labservices1001 kernel: [3025868.972351] CPU3: Core temperature above threshold...
[02:49:08] Operations, ops-eqiad, cloud-services-team, Patch-For-Review: Labservices1001 crashing, probable overheating - https://phabricator.wikimedia.org/T196252 (Andrew)
[02:53:36] RECOVERY - toolschecker: Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.407 second response time
[03:02:42] Operations, ops-eqiad, cloud-services-team, Patch-For-Review: Labservices1001 crashing, probable overheating - https://phabricator.wikimedia.org/T196252 (Andrew) p:Normal>High
[03:21:46] RECOVERY - Memory correctable errors -EDAC- on mw2157 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw2157&var-datasource=codfw%2520prometheus%252Fops
[03:27:25] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 866.64 seconds
[03:33:25] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 155.53 seconds
[04:55:36] PROBLEM - HHVM rendering on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:56:58] (PS3) Nehajha: Removing gridengine as default and encouraging the use of Kubernetes [software/tools-webservice] - https://gerrit.wikimedia.org/r/443190 (https://phabricator.wikimedia.org/T154504)
[04:57:55] RECOVERY - HHVM rendering on mw1278 is OK: HTTP OK: HTTP/1.1 200 OK - 74762 bytes in 0.570 second response time
[06:28:16] PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ferm/ferm.conf]
[06:30:06] PROBLEM - puppet last run on ms-be1030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats]
[06:31:46] PROBLEM - puppet last run on mw1289 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean]
[06:33:15] PROBLEM - puppet last run on analytics1061 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/R/biocLite.R]
[06:55:56] RECOVERY - puppet last run on ms-be1030 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[06:57:46] RECOVERY - puppet last run on mw1289 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:59:15] RECOVERY - puppet last run on analytics1061 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:59:26] RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:47:39] (CR) Vgutierrez: provide ACMEv2 support based on certbot/acme library (1 comment) [software/certcentral] - https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) (owner: Vgutierrez)
[10:24:55] (PS1) Gergő Tisza: Give hewiki interface-admins the rights interface-editors have [mediawiki-config] - https://gerrit.wikimedia.org/r/450441 (https://phabricator.wikimedia.org/T200698)
[10:24:57] (PS1) Gergő Tisza: Remove hewiki interface-editor group [mediawiki-config] - https://gerrit.wikimedia.org/r/450442 (https://phabricator.wikimedia.org/T200698)
[10:41:05] (CR) Merlijn van Deen: [C: 1] "lgtm; you could consider also setting the logo_width" [puppet] - https://gerrit.wikimedia.org/r/448999 (owner: Alex Monk)
[10:42:12] (CR) Merlijn van Deen: [C: 1] "...and we should probably update the logo and text to mention Toolforge rather than tool labs ;-)" [puppet] - https://gerrit.wikimedia.org/r/448999 (owner: Alex Monk)
[11:41:33] (PS1) Gergő Tisza: Revert "Configure group management for interface-admin" [mediawiki-config] - https://gerrit.wikimedia.org/r/450450
[12:14:08] (CR) Alex Monk: provide ACMEv2 support based on certbot/acme library (1 comment) [software/certcentral] - https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) (owner: Vgutierrez)
[13:12:55] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 311 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[13:18:06] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 14 probes of 311 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[13:21:34] (PS1) Nehajha: Following pep8 coding conventions [software/tools-webservice] - https://gerrit.wikimedia.org/r/450458
[13:22:06] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[13:29:34] (CR) Zhuyifei1999: [C: 1] "Is there any other changes you want to do in this patch? Or shall I merge?" [software/tools-webservice] - https://gerrit.wikimedia.org/r/450458 (owner: Nehajha)
[13:30:27] (CR) Nehajha: "> Is there any other changes you want to do in this patch? Or shall I" [software/tools-webservice] - https://gerrit.wikimedia.org/r/450458 (owner: Nehajha)
[13:39:05] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[13:40:26] (PS2) Nehajha: Following pep8 coding conventions [software/tools-webservice] - https://gerrit.wikimedia.org/r/450458
[13:42:14] (CR) Nehajha: "> Is there any other changes you want to do in this patch? Or shall I" [software/tools-webservice] - https://gerrit.wikimedia.org/r/450458 (owner: Nehajha)
[15:00:05] (PS1) Ori.livneh: Declare and manage a /var/cache/coal_web dir [puppet] - https://gerrit.wikimedia.org/r/450468
[16:07:16] (PS1) Urbanecm: Add correct sitename for satwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/450469 (https://phabricator.wikimedia.org/T198400)
[16:26:03] Doesn't happen to be anyone around who could tell me what OSes /^(actinium|alcyone|alsafi|aluminium)\.wikimedia\.org$/ run?
[16:26:06] it's looking like trusty to me
[16:34:49] (CR) Andrew Bogott: [C: 2] Following pep8 coding conventions [software/tools-webservice] - https://gerrit.wikimedia.org/r/450458 (owner: Nehajha)
[17:17:11] Operations, ops-codfw, DBA: Degraded RAID on db2054 - https://phabricator.wikimedia.org/T201245 (Marostegui) a:Papaul Can we get this disk replaced? Thanks!
[17:24:02] (PS2) Gergő Tisza: Allow all bureaucrats to remove interface-admin [mediawiki-config] - https://gerrit.wikimedia.org/r/450450
[20:09:26] PROBLEM - Host scb2006 is DOWN: PING CRITICAL - Packet loss = 100%
[20:10:05] RECOVERY - Host scb2006 is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms