[00:00:57] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:01:31] <icinga-wm>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is OK: (C)10 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004
[00:02:42] <wikibugs>	 Operations, ops-codfw, decommission, cloud-services-team (Kanban): decommission: labtestvirt200[12].codfw.wmnet - https://phabricator.wikimedia.org/T218023 (Papaul) ` papaul@asw-b-codfw# run show interfaces ge-5/0/8 descriptions      Interface       Admin Link Description ge-5/0/8        down  do...
[00:03:07] <wikibugs>	 Operations, ops-codfw, decommission, cloud-services-team (Kanban): decommission: labtestvirt200[12].codfw.wmnet - https://phabricator.wikimedia.org/T218023 (Papaul)
[00:04:17] <icinga-wm>	 PROBLEM - puppet last run on mw1295 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:04:31] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is CRITICAL: 70 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001
[00:04:51] <icinga-wm>	 RECOVERY - Check systemd state on kafka-jumbo1001 is OK: OK - running: The system is fully operational
[00:04:51] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1001 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties
[00:06:05] <XioNoX>	 !log jnt push to msw switches
[00:06:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:07:01] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1003 is CRITICAL: 5.955e+06 ge 5e+06 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003
[00:07:42] <wikibugs>	 (PS1) Papaul: DNS: Remove mgmt and production DNS entries for labtestvirt200[1-2] [dns] - https://gerrit.wikimedia.org/r/500637
[00:08:41] <icinga-wm>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1006 is OK: (C)10 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006
[00:08:57] <icinga-wm>	 RECOVERY - Check systemd state on kafka-jumbo1003 is OK: OK - running: The system is fully operational
[00:09:13] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:09:22] <wikibugs>	 Operations, ops-codfw, decommission, Patch-For-Review, cloud-services-team (Kanban): decommission: labtestvirt200[12].codfw.wmnet - https://phabricator.wikimedia.org/T218023 (Papaul)
[00:09:39] <icinga-wm>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is OK: (C)10 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001
[00:09:57] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1003 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties
[00:10:51] <icinga-wm>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is OK: (C)10 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005
[00:11:13] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:12:59] <wikibugs>	 (CR) BryanDavis: [C: +1] cloudstore: add py extension to nfs-exportd and apply nfsd-ldap everywhere [puppet] - https://gerrit.wikimedia.org/r/500635 (https://phabricator.wikimedia.org/T209527) (owner: Bstorm)
[00:13:23] <icinga-wm>	 RECOVERY - Check systemd state on kafka-jumbo1004 is OK: OK - running: The system is fully operational
[00:13:49] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1004 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties
[00:16:55] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:16:57] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:17:41] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:18:53] <icinga-wm>	 PROBLEM - tilerator on maps2001 is CRITICAL: connect to address 10.192.0.144 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[00:19:14] <XioNoX>	 !log replacing accepted-prefix-limit with prefix-limit on one ulsfo peer - T211730
[00:19:17] <icinga-wm>	 PROBLEM - tilerator on maps2002 is CRITICAL: connect to address 10.192.16.179 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[00:19:19] <icinga-wm>	 PROBLEM - tilerator on maps2003 is CRITICAL: connect to address 10.192.32.146 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[00:19:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:22] <stashbot>	 T211730: Replace accepted-prefix-limit with prefix-limit - https://phabricator.wikimedia.org/T211730
[00:20:39] <icinga-wm>	 RECOVERY - Varnishkafka Eventlogging Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=eventlogging&var-host=All
[00:20:47] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:21:52] <wikibugs>	 Operations, netops: Replace accepted-prefix-limit with prefix-limit - https://phabricator.wikimedia.org/T211730 (ayounsi) Confirmed that replacing accepted-prefix-limit with prefix-limit does NOT cause the peer to bounce.
[00:22:27] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1003 is OK: (C)5e+06 ge (W)1e+06 ge 9.916e+05 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003
[00:24:05] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[00:25:23] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:25:35] <XioNoX>	 !log replacing accepted-prefix-limit with prefix-limit on all ulsfo peers - T211730
[00:25:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:25:39] <stashbot>	 T211730: Replace accepted-prefix-limit with prefix-limit - https://phabricator.wikimedia.org/T211730
[00:27:15] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:30:33] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:35:59] <icinga-wm>	 RECOVERY - puppet last run on mw1295 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[00:36:59] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:40:34] <XioNoX>	 !log replacing accepted-prefix-limit with prefix-limit in [co|eq]dfw - T211730
[00:40:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:40:41] <stashbot>	 T211730: Replace accepted-prefix-limit with prefix-limit - https://phabricator.wikimedia.org/T211730
[00:42:43] <wikibugs>	 (CR) Bstorm: [C: +2] cloudstore: add py extension to nfs-exportd and apply nfsd-ldap everywhere [puppet] - https://gerrit.wikimedia.org/r/500635 (https://phabricator.wikimedia.org/T209527) (owner: Bstorm)
[00:43:59] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:49:07] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:51:34] <wikibugs>	 (PS1) Ayounsi: Depooling eqsin because of eqsin-codfw link outage [dns] - https://gerrit.wikimedia.org/r/500638
[00:51:51] <wikibugs>	 (CR) Ayounsi: [C: +2] Depooling eqsin because of eqsin-codfw link outage [dns] - https://gerrit.wikimedia.org/r/500638 (owner: Ayounsi)
[00:52:19] <XioNoX>	 !log depool eqsin due to Telia eqsin-codfw link outage
[00:52:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:57:03] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:00:01] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[01:01:00] <wikibugs>	 Operations, Traffic, netops: Outage on the primary codfw-eqsin link - https://phabricator.wikimedia.org/T219847 (ayounsi) p:Triage→Normal
[01:02:25] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 49.81 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[01:05:09] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[01:10:53] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[01:11:59] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:14:40] <XioNoX>	 !log replacing accepted-prefix-limit with prefix-limit in eqord - T211730
[01:14:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:14:44] <stashbot>	 T211730: Replace accepted-prefix-limit with prefix-limit - https://phabricator.wikimedia.org/T211730
[01:14:44] <wikibugs>	 Operations, Phabricator, Release-Engineering-Team, Wikimedia-Incident: Analyze and amend (if necessary) workflow of user reporting and detecting large regressions/outages - https://phabricator.wikimedia.org/T219589 (Aklapper) @Yann: No, because the "Priority" field is not for users to express how...
[01:15:31] <wikibugs>	 Operations, Traffic, netops: Outage on the primary codfw-eqsin link - https://phabricator.wikimedia.org/T219847 (ayounsi)
[01:17:59] <XioNoX>	 !log replacing accepted-prefix-limit with prefix-limit on cr1-eqiad - T211730
[01:18:00] <wikibugs>	 (CR) Pppery: [C: +1] Add editcontentmodel right to the templateeditor group on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/494016 (https://phabricator.wikimedia.org/T217499) (owner: Ammarpad)
[01:18:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:37] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[01:21:55] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[01:22:19] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2020 is OK: OK - running: The system is fully operational
[01:32:13] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[01:34:37] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[01:34:47] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[01:34:57] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[01:39:55] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[01:46:11] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[01:46:21] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[01:47:45] <icinga-wm>	 RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 96.93 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[01:48:47] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 70.38 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[01:52:29] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[01:56:13] <icinga-wm>	 PROBLEM - puppet last run on ms-be2026 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[02:08:09] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[02:09:27] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[02:12:17] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:13:23] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[02:14:39] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[02:15:43] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:18:17] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:19:27] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:21:05] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[02:24:39] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:24:59] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[02:25:11] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:28:17] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:36:37] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[02:37:55] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[02:46:59] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[02:48:17] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[02:49:17] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2018 is OK: OK - running: The system is fully operational
[02:51:27] <wikibugs>	 (PS22) BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] - https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243)
[02:53:23] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 58.44 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[02:58:37] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[03:01:13] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[03:10:13] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[03:10:33] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:12:05] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:12:05] <wikibugs>	 Operations, Wikimedia-General-or-Unknown: Figure out why HHVM isn't using error_document404 setting - https://phabricator.wikimedia.org/T187754 (Krinkle)
[03:14:05] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[03:17:57] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[03:18:33] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:23:07] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[03:24:47] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:27:03] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 70.84 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[03:30:53] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[03:33:29] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[03:37:23] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[03:38:39] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[03:39:10] <wikibugs>	 (PS23) BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] - https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243)
[03:48:59] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[03:52:53] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[03:53:19] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:01:55] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[04:03:17] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:03:37] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:05:45] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[04:08:33] <icinga-wm>	 PROBLEM - Disk space on cloudcontrol2001-dev is CRITICAL: DISK CRITICAL - free space: / 343 MB (3% inode=86%)
[04:12:36] <wikibugs>	 (PS24) BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] - https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243)
[04:13:45] <icinga-wm>	 RECOVERY - tilerator on maps2002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[04:13:47] <icinga-wm>	 RECOVERY - tilerator on maps2003 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[04:14:37] <icinga-wm>	 RECOVERY - tilerator on maps2001 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.112 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[04:14:50] <onimisionipe>	 !log restarted tilerator on maps200[1-3] - connection refused
[04:14:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:16:05] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[04:19:57] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:23:13] <wikibugs>	 Operations, Maps: Tilerator crashed on maps200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T219849 (Mathew.onipe)
[04:23:13] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:23:24] <wikibugs>	 (CR) BryanDavis: "All tests currently passing. Testable at https://tools-checker-03.wmflabs.org or using `curl localhost/...` on tools-checker-03.tools.eqia" [puppet] - https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) (owner: BryanDavis)
[04:26:25] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[04:29:07] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:32:05] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:32:35] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:34:17] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:35:33] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:35:57] <icinga-wm>	 PROBLEM - puppet last run on ores1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:36:04] <wikibugs>	 (PS1) Andrew Bogott: labpuppetmaster: move from 'main' region to 'eqiad1' region [puppet] - https://gerrit.wikimedia.org/r/500640
[04:41:53] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[04:41:54] <wikibugs>	 (03PS1) 10Andrew Bogott: cloud enc: duplicate a placeholder password from 'main' to 'eqiad1' [labs/private] - 10https://gerrit.wikimedia.org/r/500641
[04:42:35] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:42:38] <wikibugs>	 (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] cloud enc: duplicate a placeholder password from 'main' to 'eqiad1' [labs/private] - 10https://gerrit.wikimedia.org/r/500641 (owner: 10Andrew Bogott)
[04:44:13] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:47:03] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[04:47:11] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:49:03] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:49:45] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2018 is OK: OK - running: The system is fully operational
[04:50:57] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[04:50:58] <wikibugs>	 (03PS4) 10Andrew Bogott: labweb: move from 'main' region to 'eqiad1' region [puppet] - 10https://gerrit.wikimedia.org/r/500622
[04:51:00] <wikibugs>	 (03PS2) 10Andrew Bogott: labpuppetmaster: move from 'main' region to 'eqiad1' region [puppet] - 10https://gerrit.wikimedia.org/r/500640
[04:51:59] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:52:13] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[04:52:27] <icinga-wm>	 PROBLEM - puppet last run on ganeti1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:56:49] <wikibugs>	 (03PS5) 10Andrew Bogott: labweb: move from 'main' region to 'eqiad1' region [puppet] - 10https://gerrit.wikimedia.org/r/500622
[04:56:51] <wikibugs>	 (03PS3) 10Andrew Bogott: labpuppetmaster: move from 'main' region to 'eqiad1' region [puppet] - 10https://gerrit.wikimedia.org/r/500640
[04:56:53] <wikibugs>	 (03PS1) 10Andrew Bogott: admin_scripts: add a case for Trusty [puppet] - 10https://gerrit.wikimedia.org/r/500642 (https://phabricator.wikimedia.org/T215407)
[04:58:03] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[05:01:10] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] admin_scripts: add a case for Trusty [puppet] - 10https://gerrit.wikimedia.org/r/500642 (https://phabricator.wikimedia.org/T215407) (owner: 10Andrew Bogott)
[05:02:13] <wikibugs>	 (03Abandoned) 10Giuseppe Lavagetto: Edit Project Config [docker-images/production-images] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/500386 (owner: 10Giuseppe Lavagetto)
[05:02:19] <icinga-wm>	 RECOVERY - puppet last run on ores1008 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[05:02:20] <wikibugs>	 (03Abandoned) 10Giuseppe Lavagetto: Edit Project Config [docker-images] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/500385 (owner: 10Giuseppe Lavagetto)
[05:03:59] <icinga-wm>	 RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:04:27] <icinga-wm>	 RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:05:37] <wikibugs>	 (03PS4) 10Andrew Bogott: labpuppetmaster: move from 'main' region to 'eqiad1' region [puppet] - 10https://gerrit.wikimedia.org/r/500640
[05:06:23] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[05:08:57] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[05:08:59] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[05:09:19] <wikibugs>	 (03PS5) 10Andrew Bogott: labpuppetmaster: move from 'main' region to 'eqiad1' region [puppet] - 10https://gerrit.wikimedia.org/r/500640
[05:10:15] <ema>	 elukey: I'm sure you'll find out reading the backlog this morning, but still: it seems that something might have gone (still is going?) slightly wrong with kafka during the night
[05:12:26] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 04-1] "this needs a bit more work; the diff is still bigger than it should be.  I also need to double-check that this applies cleanly on VMs." [puppet] - 10https://gerrit.wikimedia.org/r/500640 (owner: 10Andrew Bogott)
[05:14:40] <wikibugs>	 (03PS2) 10Marostegui: realm.pp: Add urlshortcodes to private table [puppet] - 10https://gerrit.wikimedia.org/r/500470 (https://phabricator.wikimedia.org/T219777)
[05:16:45] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[05:17:33] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] realm.pp: Add urlshortcodes to private table [puppet] - 10https://gerrit.wikimedia.org/r/500470 (https://phabricator.wikimedia.org/T219777) (owner: 10Marostegui)
[05:18:03] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[05:18:51] <icinga-wm>	 RECOVERY - puppet last run on ganeti1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:19:23] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[05:23:13] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[05:24:31] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[05:25:14] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui)
[05:30:59] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[05:32:23] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[05:37:29] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[05:37:52] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: New version release [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/500643
[05:38:38] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] New version release [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/500643 (owner: 10Giuseppe Lavagetto)
[05:40:27] <wikibugs>	 (03CR) 10jenkins-bot: New version release [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/500643 (owner: 10Giuseppe Lavagetto)
[05:48:01] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Release 1.1.4 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/500644
[05:50:52] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool pc1008 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500645
[05:52:01] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool pc1008 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500645 (owner: 10Marostegui)
[05:53:48] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool pc1008 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500645 (owner: 10Marostegui)
[05:54:55] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[05:55:06] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool pc1008 (duration: 00m 56s)
[05:55:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:09] <marostegui>	 !log Upgrade pc1008
[05:55:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:44] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Release 1.1.4 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/500644 (owner: 10Giuseppe Lavagetto)
[05:56:28] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 (10Marostegui)
[05:58:19] <logmsgbot>	 !log oblivian@deploy1001 Started deploy [docker-pkg/deploy@2a090ef]: New version for T219778
[05:58:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:22] <stashbot>	 T219778: docker-pkg is unhappy on contint1001 - https://phabricator.wikimedia.org/T219778
[05:58:38] <logmsgbot>	 !log oblivian@deploy1001 Finished deploy [docker-pkg/deploy@2a090ef]: New version for T219778 (duration: 00m 19s)
[05:58:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:59:05] <icinga-wm>	 PROBLEM - MariaDB Slave IO: pc2 on pc2008 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@pc1008.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on pc1008.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[05:59:10] <marostegui>	 ^ me
[05:59:34] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool pc1008" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500648
[05:59:57] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool pc1008 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500645 (owner: 10Marostegui)
[06:00:51] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool pc1008" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500648 (owner: 10Marostegui)
[06:01:57] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool pc1008" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500648 (owner: 10Marostegui)
[06:02:58] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool pc1008 (duration: 00m 53s)
[06:02:59] <icinga-wm>	 RECOVERY - MariaDB Slave IO: pc2 on pc2008 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[06:03:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:37] <icinga-wm>	 PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:07:06] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500650
[06:09:47] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[06:11:19] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool pc1008" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500648 (owner: 10Marostegui)
[06:11:20] <elukey>	 lovely
[06:11:54] <elukey>	 weird I don't see anything in the graphs?
[06:12:37] <marostegui>	 yeah, there is nothing for this time
[06:14:18] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500650 (owner: 10Marostegui)
[06:15:24] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500650 (owner: 10Marostegui)
[06:16:27] <elukey>	 even kafka looks good
[06:16:30] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1064 (duration: 00m 54s)
[06:16:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:53] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[06:16:53] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500651
[06:22:45] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500650 (owner: 10Marostegui)
[06:23:57] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[06:23:59] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[06:24:22] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500651 (owner: 10Marostegui)
[06:25:15] <icinga-wm>	 RECOVERY - ElasticSearch unassigned shard check - 9200- on logstash1007 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:25:17] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500651 (owner: 10Marostegui)
[06:26:20] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1064 (duration: 00m 52s)
[06:26:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:25] <icinga-wm>	 ACKNOWLEDGEMENT - HP RAID on db2070 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Failed: 1I:1:6 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T219852
[06:27:30] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on db2070 - https://phabricator.wikimedia.org/T219852 (10ops-monitoring-bot)
[06:27:36] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1120 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500652
[06:28:14] <wikibugs>	 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2070 - https://phabricator.wikimedia.org/T219852 (10Marostegui) p:05Triage→03Normal a:03Papaul Can we get this disk replaced? Thanks!
[06:28:45] <wikibugs>	 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui)
[06:28:55] <wikibugs>	 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui)
[06:29:17] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1120 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500652 (owner: 10Marostegui)
[06:30:03] <icinga-wm>	 RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:30:23] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1120 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500652 (owner: 10Marostegui)
[06:31:40] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1120 (duration: 00m 52s)
[06:31:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:53] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1120" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500653
[06:33:03] <icinga-wm>	 PROBLEM - puppet last run on mw1323 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/bash_autologout.sh]
[06:33:42] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500651 (owner: 10Marostegui)
[06:33:44] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1120 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500652 (owner: 10Marostegui)
[06:34:25] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[06:34:27] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[06:36:08] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1120" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500653 (owner: 10Marostegui)
[06:36:17] <elukey>	 I am not getting why it is still alarming
[06:36:30] <elukey>	 sumSeries(perSecond(varnishkafka.*.webrequest.upload.varnishkafka.kafka_drerr)) on graphite looks flat zero (that is the metric used in the alarm)
[06:37:01] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[06:37:06] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1120" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500653 (owner: 10Marostegui)
[06:38:26] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1120 (duration: 00m 50s)
[06:38:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:34] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool all slaves in x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500654 (https://phabricator.wikimedia.org/T143763)
[06:44:45] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: Expose rsyslog_udp_port to services configs. [puppet] - 10https://gerrit.wikimedia.org/r/498872 (https://phabricator.wikimedia.org/T211125) (owner: 10Ppchelko)
[06:44:47] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1120" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500653 (owner: 10Marostegui)
[06:45:59] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[06:46:15] <wikibugs>	 (03PS1) 10Ema: ATS: strip PKP headers [puppet] - 10https://gerrit.wikimedia.org/r/500655
[06:46:17] <wikibugs>	 (03PS1) 10Ema: ATS: test unsetting Accept-Encoding [puppet] - 10https://gerrit.wikimedia.org/r/500656 (https://phabricator.wikimedia.org/T213263)
[06:47:17] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[06:47:59] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] Expose rsyslog_udp_port to services configs. [puppet] - 10https://gerrit.wikimedia.org/r/498872 (https://phabricator.wikimedia.org/T211125) (owner: 10Ppchelko)
[06:48:35] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[06:57:21] <wikibugs>	 (03CR) 10Marostegui: "@jcrespo does this look ok to you?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500654 (https://phabricator.wikimedia.org/T143763) (owner: 10Marostegui)
[06:58:54] <wikibugs>	 (03PS2) 10Ema: ATS: strip PKP headers [puppet] - 10https://gerrit.wikimedia.org/r/500655 (https://phabricator.wikimedia.org/T213263)
[06:59:27] <icinga-wm>	 RECOVERY - puppet last run on mw1323 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:05:47] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove labs_vmbuilder [puppet] - 10https://gerrit.wikimedia.org/r/500407
[07:09:42] <wikibugs>	 (03PS3) 10Ema: ATS: strip PKP headers [puppet] - 10https://gerrit.wikimedia.org/r/500655 (https://phabricator.wikimedia.org/T213263)
[07:11:18] <wikibugs>	 (03CR) 10Ema: [C: 03+2] ATS: strip PKP headers [puppet] - 10https://gerrit.wikimedia.org/r/500655 (https://phabricator.wikimedia.org/T213263) (owner: 10Ema)
[07:13:06] <wikibugs>	 (03PS3) 10Muehlenhoff: Remove labs_vmbuilder [puppet] - 10https://gerrit.wikimedia.org/r/500407
[07:14:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove labs_vmbuilder [puppet] - 10https://gerrit.wikimedia.org/r/500407 (owner: 10Muehlenhoff)
[07:15:21] <icinga-wm>	 PROBLEM - puppet last run on lvs5003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:19:55] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[07:21:14] <wikibugs>	 (03PS2) 10Ema: ATS: test unsetting Accept-Encoding [puppet] - 10https://gerrit.wikimedia.org/r/500656 (https://phabricator.wikimedia.org/T213263)
[07:22:34] <wikibugs>	 (03CR) 10Ema: [C: 03+2] ATS: test unsetting Accept-Encoding [puppet] - 10https://gerrit.wikimedia.org/r/500656 (https://phabricator.wikimedia.org/T213263) (owner: 10Ema)
[07:23:14] <ema>	 moritzm: ok to puppet-merge 1292208e09?
[07:23:49] <wikibugs>	 (03PS1) 10Marostegui: site.pp: Clarify labsdb1004 and 1005 status [puppet] - 10https://gerrit.wikimedia.org/r/500657 (https://phabricator.wikimedia.org/T216749)
[07:24:08] <wikibugs>	 (03PS2) 10Marostegui: site.pp: Clarify labsdb1004 and 1005 status [puppet] - 10https://gerrit.wikimedia.org/r/500657 (https://phabricator.wikimedia.org/T216749)
[07:24:34] <ema>	 moritzm: I'll leave it up to you, my change is minimal and can be merged
[07:25:05] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] site.pp: Clarify labsdb1004 and 1005 status [puppet] - 10https://gerrit.wikimedia.org/r/500657 (https://phabricator.wikimedia.org/T216749) (owner: 10Marostegui)
[07:25:12] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] redirects.dat: Get rid of domains non controlled by WMF [puppet] - 10https://gerrit.wikimedia.org/r/499239 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez)
[07:25:20] <wikibugs>	 (03PS3) 10Vgutierrez: redirects.dat: Get rid of domains non controlled by WMF [puppet] - 10https://gerrit.wikimedia.org/r/499239 (https://phabricator.wikimedia.org/T213705)
[07:25:42] <marostegui>	 ema: got it, moritzm can your change be merged?
[07:26:57] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 3 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[07:27:05] <marostegui>	 My change can also be merged anytime, it is just a comment to clarify the status of two hosts
[07:27:56] <vgutierrez>	 feel free to merge mine as well
[07:28:10] <marostegui>	 nice queue XD
[07:30:31] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[07:30:56] <moritzm>	 ema: sorry, got distracted by ms-be2026, just merged it
[07:32:04] <marostegui>	 moritzm: I think all the others (ema's, vgutierrez's and mine) can also be merged
[07:32:38] <vgutierrez>	 +1
[07:32:55] * vgutierrez merging
[07:33:14] <marostegui>	 thanks!
[07:33:27] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge.
[07:33:33] <vgutierrez>	 (done)
[07:34:24] <moritzm>	 thx
[07:34:47] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: send varnish syslogs via kafka logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/498467 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron)
[07:46:03] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[07:47:05] <icinga-wm>	 RECOVERY - puppet last run on lvs5003 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[07:48:32] <wikibugs>	 (03PS10) 10Urbanecm: Test rules reference only existing wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494188 (https://phabricator.wikimedia.org/T217541)
[07:49:50] <wikibugs>	 10Operations, 10ops-codfw: Broken disk on ms-be2026 - https://phabricator.wikimedia.org/T219854 (10MoritzMuehlenhoff)
[07:52:27] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[07:52:57] <moritzm>	 !log removed labvirt1008 from debmonitor (T216661)
[07:53:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:01] <stashbot>	 T216661: cloudVPS: drain and rebuild labvirt1008 as cloudvirt1008 - https://phabricator.wikimedia.org/T216661
[07:54:19] <icinga-wm>	 PROBLEM - Disk space on notebook1003 is CRITICAL: DISK CRITICAL - free space: /srv 4952 MB (3% inode=85%)
[07:54:41] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[08:01:01] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[08:01:31] <wikibugs>	 10Operations, 10Phabricator, 10Release-Engineering-Team, 10Wikimedia-Incident: Analyze and amend (if necessary) workflow of user reporting and detecting large regressions/outages - https://phabricator.wikimedia.org/T219589 (10jcrespo) @Yann and @Aklapper please stop discussing that (or at least discussing...
[08:02:47] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[08:03:04] <wikibugs>	 10Operations, 10Phabricator, 10Release-Engineering-Team, 10Wikimedia-Incident: Analyze and amend (if necessary) workflow of user reporting and detecting large regressions/outages - https://phabricator.wikimedia.org/T219589 (10jcrespo)
[08:03:19] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[08:05:15] <moritzm>	 !log installing openssl1.0 security updates
[08:05:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:20] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: monitoring: use internal graphite url [puppet] - 10https://gerrit.wikimedia.org/r/500665
[08:06:13] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[08:06:41] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[08:07:25] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[08:07:40] <wikibugs>	 (03CR) 10Elukey: "LGTM, added also Filippo! Thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/500665 (owner: 10Giuseppe Lavagetto)
[08:09:59] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[08:10:53] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[08:11:25] <wikibugs>	 (03CR) 10Vgutierrez: "some of the listed SNIs have several DNS issues: https://phabricator.wikimedia.org/P8325" [puppet] - 10https://gerrit.wikimedia.org/r/499201 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez)
[08:13:13] <moritzm>	 !log installing debdeploy updates on remaining hosts in eqiad/codfw
[08:13:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:32] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] db-eqiad.php: Depool all slaves in x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500654 (https://phabricator.wikimedia.org/T143763) (owner: 10Marostegui)
[08:13:36] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] Allow LE issue the non-canonical redirects service certificate [dns] - 10https://gerrit.wikimedia.org/r/499156 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez)
[08:13:37] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[08:13:38] <marostegui>	 \o/
[08:13:48] <wikibugs>	 (03PS4) 10Vgutierrez: Allow LE issue the non-canonical redirects service certificate [dns] - 10https://gerrit.wikimedia.org/r/499156 (https://phabricator.wikimedia.org/T213705)
[08:13:49] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[08:14:03] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool all slaves in x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500654 (https://phabricator.wikimedia.org/T143763) (owner: 10Marostegui)
[08:15:06] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool all slaves in x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500654 (https://phabricator.wikimedia.org/T143763) (owner: 10Marostegui)
[08:16:03] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[08:16:25] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[08:16:40] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool all slaves in x1 T219777 T143763 (duration: 00m 53s)
[08:16:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:45] <stashbot>	 T143763: Remove unused bundling DB fields - https://phabricator.wikimedia.org/T143763
[08:16:45] <stashbot>	 T219777: DBA review of UrlShortener - https://phabricator.wikimedia.org/T219777
[08:18:24] <wikibugs>	 (03CR) 10Vgutierrez: "After merging Ib064d25b82cdc1fcf9372a7881d8caece2433507 looks way better: https://phabricator.wikimedia.org/P8326" [puppet] - 10https://gerrit.wikimedia.org/r/499201 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez)
[08:20:31] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "no PCC but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/500665 (owner: 10Giuseppe Lavagetto)
[08:20:36] <marostegui>	 !log Compress wikishared.urlshortcodes table on x1, directly on the master with replication (table has 1 row) - T219777
[08:20:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:55] <icinga-wm>	 RECOVERY - Disk space on notebook1003 is OK: DISK OK
[08:24:30] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool all slaves in x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500654 (https://phabricator.wikimedia.org/T143763) (owner: 10Marostegui)
[08:24:35] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[08:31:43] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[08:32:39] <wikibugs>	 (03PS13) 10Gehel: elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375
[08:34:12] <wikibugs>	 (03Abandoned) 10Gehel: WIP: experimentation with type hints [software/spicerack] - 10https://gerrit.wikimedia.org/r/491812 (owner: 10Gehel)
[08:34:27] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[08:34:43] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, I've cc'd service ops and ores folks for notification / heads up" [puppet] - 10https://gerrit.wikimedia.org/r/497614 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron)
[08:36:55] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[08:40:07] <icinga-wm>	 PROBLEM - puppet last run on scb2006 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): Package[rsyslog-kafka]
[08:41:09] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[08:45:37] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[08:47:22] <wikibugs>	 10Operations, 10Traffic: wicipediacymraeg.org is on clientHold - https://phabricator.wikimedia.org/T219856 (10Vgutierrez)
[08:48:04] <wikibugs>	 Operations, Traffic: wicipediacymraeg.org is on clientHold - https://phabricator.wikimedia.org/T219856 (Vgutierrez) p:Triage→Normal
[08:50:00] <marostegui>	 !log Execute schema change on db1069 x1 master with replication enabled on the following small wikis: aawiki aawikibooks aawiktionary abwiki abwiktionary acewiki advisorswiki advisorywiki adywiki afwiki T143763
[08:50:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:05] <stashbot>	 T143763: Remove unused bundling DB fields - https://phabricator.wikimedia.org/T143763
[08:50:39] <icinga-wm>	 RECOVERY - puppet last run on scb2006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[08:50:45] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[08:52:08] <wikibugs>	 (03PS2) 10Dzahn: k8s:proxy: remove upstart support [puppet] - 10https://gerrit.wikimedia.org/r/499769
[08:52:37] <icinga-wm>	 RECOVERY - MegaRAID on sodium is OK: OK: optimal, 1 logical, 4 physical
[08:52:50] <wikibugs>	 (CR) Vgutierrez: acme_chief: Issue the non-canonical redirect certificates (2 comments) [puppet] - https://gerrit.wikimedia.org/r/499201 (https://phabricator.wikimedia.org/T213705) (owner: Vgutierrez)
[08:52:53] <mutante>	 :) sodium
[08:53:14] <wikibugs>	 10Operations, 10Puppet: Some jessie instances upset about rsyslog package - https://phabricator.wikimedia.org/T219764 (10Joe) I encountered the same problem, and I think the problem lies elsewhere, specifically in the prerm script from the current `rsyslog` package:  ` Unpacking rsyslog-gnutls (8.1901.0-1~bpo8...
[08:53:44] <wikibugs>	 (03PS3) 10Gehel: elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/500525 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe)
[08:54:33] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool all slaves in x1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500670
[08:56:33] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/500525 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe)
[08:56:45] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool all slaves in x1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500670 (owner: 10Marostegui)
[08:57:52] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool all slaves in x1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500670 (owner: 10Marostegui)
[08:58:05] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool all slaves in x1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500670 (owner: 10Marostegui)
[08:58:56] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool all slaves in x1 T219777 T143763 (duration: 00m 53s)
[08:59:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:00] <stashbot>	 T143763: Remove unused bundling DB fields - https://phabricator.wikimedia.org/T143763
[08:59:01] <stashbot>	 T219777: DBA review of UrlShortener - https://phabricator.wikimedia.org/T219777
[08:59:17] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1337 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[09:00:23] <wikibugs>	 (03PS4) 10Vgutierrez: acme_chief: Issue the non-canonical redirect certificates [puppet] - 10https://gerrit.wikimedia.org/r/499201 (https://phabricator.wikimedia.org/T213705)
[09:00:35] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1337 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[09:00:37] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/deploy codfw dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10jcrespo) Thanks, it installed with no issues.
[09:01:36] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/500622 (owner: Andrew Bogott)
[09:02:59] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[09:03:05] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: DNS: Remove mgmt and production DNS entries for labtestvirt200[1-2] [dns] - 10https://gerrit.wikimedia.org/r/500637 (owner: 10Papaul)
[09:03:14] <_joe_>	 !log uploaded patched version of bootstrap-vz to account for jessie-updates vanishing (T219683)
[09:03:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:17] <stashbot>	 T219683: Rebuild docker-registry.wikimedia.org/wikimedia-jessie to drop jessie-update/jessie-backports - https://phabricator.wikimedia.org/T219683
[09:04:01] <wikibugs>	 (CR) Vgutierrez: "everything looking good now: https://phabricator.wikimedia.org/P8327" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/499201 (https://phabricator.wikimedia.org/T213705) (owner: Vgutierrez)
[09:04:05] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] DNS: Remove mgmt and production DNS entries for labtestvirt200[1-2] [dns] - 10https://gerrit.wikimedia.org/r/500637 (owner: 10Papaul)
[09:04:49] <wikibugs>	 (03PS1) 10Strainu: Enable extension SandboxLink for rowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500672 (https://phabricator.wikimedia.org/T210325)
[09:05:01] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[09:06:03] <wikibugs>	 (03PS2) 10Strainu: Enable extension SandboxLink for rowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500672 (https://phabricator.wikimedia.org/T219855)
[09:08:05] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[09:09:15] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: DNS: Remove mgmt and production DNS for cloudnet2001-dev [dns] - 10https://gerrit.wikimedia.org/r/500634 (owner: 10Papaul)
[09:10:27] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: `...
[09:10:52] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] acme_chief: Issue the non-canonical redirect certificates [puppet] - 10https://gerrit.wikimedia.org/r/499201 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez)
[09:11:12] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] DNS: Remove mgmt and production DNS for cloudnet2001-dev [dns] - 10https://gerrit.wikimedia.org/r/500634 (owner: 10Papaul)
[09:11:43] <icinga-wm>	 PROBLEM - Host labtestnet2003 is DOWN: PING CRITICAL - Packet loss = 100%
[09:12:19] <icinga-wm>	 RECOVERY - Host labtestnet2003 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms
[09:14:13] <icinga-wm>	 PROBLEM - puppet last run on labtestnet2003 is CRITICAL: connect to address 10.192.20.12 port 5666: Connection refused
[09:14:16] <wikibugs>	 (03PS1) 10Ema: ATS: add ats-backend-restart [puppet] - 10https://gerrit.wikimedia.org/r/500675 (https://phabricator.wikimedia.org/T213263)
[09:14:31] <icinga-wm>	 PROBLEM - DPKG on labtestnet2003 is CRITICAL: connect to address 10.192.20.12 port 5666: Connection refused
[09:14:36] <arturo>	 !log T219776 finally reimaging cloudnet2003-dev.codfw.wmnet (was labtestnet2003)
[09:14:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:39] <stashbot>	 T219776: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776
[09:14:59] <icinga-wm>	 PROBLEM - configured eth on labtestnet2003 is CRITICAL: connect to address 10.192.20.12 port 5666: Connection refused
[09:15:09] <icinga-wm>	 PROBLEM - MD RAID on labtestnet2003 is CRITICAL: connect to address 10.192.20.12 port 5666: Connection refused
[09:15:25] <icinga-wm>	 PROBLEM - Disk space on labtestnet2003 is CRITICAL: connect to address 10.192.20.12 port 5666: Connection refused
[09:15:43] <icinga-wm>	 PROBLEM - dhclient process on labtestnet2003 is CRITICAL: connect to address 10.192.20.12 port 5666: Connection refused
[09:16:45] <icinga-wm>	 ACKNOWLEDGEMENT - DPKG on labtestnet2003 is CRITICAL: connect to address 10.192.20.12 port 5666: Connection refused Arturo Borrero Gonzalez T219776
[09:16:45] <icinga-wm>	 ACKNOWLEDGEMENT - Disk space on labtestnet2003 is CRITICAL: connect to address 10.192.20.12 port 5666: Connection refused Arturo Borrero Gonzalez T219776
[09:16:45] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on labtestnet2003 is CRITICAL: connect to address 10.192.20.12 port 5666: Connection refused Arturo Borrero Gonzalez T219776
[09:16:45] <icinga-wm>	 ACKNOWLEDGEMENT - configured eth on labtestnet2003 is CRITICAL: connect to address 10.192.20.12 port 5666: Connection refused Arturo Borrero Gonzalez T219776
[09:16:45] <icinga-wm>	 ACKNOWLEDGEMENT - dhclient process on labtestnet2003 is CRITICAL: connect to address 10.192.20.12 port 5666: Connection refused Arturo Borrero Gonzalez T219776
[09:16:46] <icinga-wm>	 ACKNOWLEDGEMENT - puppet last run on labtestnet2003 is CRITICAL: connect to address 10.192.20.12 port 5666: Connection refused Arturo Borrero Gonzalez T219776
[09:16:46] <icinga-wm>	 ACKNOWLEDGEMENT - DNS labtestnet2003.mgmt on labtestnet2003.mgmt is CRITICAL: Domain labtestnet2003.mgmt.codfw.wmnet was not found by the server Arturo Borrero Gonzalez T219776 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:17:11] <wikibugs>	 10Operations, 10ops-codfw: Broken disk on ms-be2026 - https://phabricator.wikimedia.org/T219854 (10Volans) I can ssh into it via cumin.  The MD raid status is this: `lang=bash $ cat /proc/mdstat Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] md0 : active raid1 sda1[0](F)...
[09:18:25] <wikibugs>	 (03PS1) 10Mathew.onipe: icinga: remove unwanted character from elastic check [puppet] - 10https://gerrit.wikimedia.org/r/500678
[09:19:25] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: install_server: fix typo in partman recipe selector for cloudnet2003-dev [puppet] - 10https://gerrit.wikimedia.org/r/500679 (https://phabricator.wikimedia.org/T219776)
[09:21:02] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] install_server: fix typo in partman recipe selector for cloudnet2003-dev [puppet] - 10https://gerrit.wikimedia.org/r/500679 (https://phabricator.wikimedia.org/T219776) (owner: 10Arturo Borrero Gonzalez)
[09:22:29] <wikibugs>	 (03PS2) 10Gehel: icinga: remove unwanted character from elastic check [puppet] - 10https://gerrit.wikimedia.org/r/500678 (owner: 10Mathew.onipe)
[09:23:09] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[09:23:50] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] icinga: remove unwanted character from elastic check [puppet] - 10https://gerrit.wikimedia.org/r/500678 (owner: 10Mathew.onipe)
[09:24:03] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudnet2003-dev.codfw.wmnet'] `  Of which those...
[09:24:18] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: `...
[09:25:27] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[09:25:55] <wikibugs>	 Operations, Acme-chief, Traffic, Goal: Deploy managed LetsEncrypt certs for all public use-cases - https://phabricator.wikimedia.org/T213705 (Vgutierrez) Open→Resolved The non-canonical certs have been issued successfully:  `root@acmechief1001:~# for i in {1..4}; do openssl x509 -text -no...
[09:26:03] <wikibugs>	 10Operations, 10Acme-chief, 10Traffic, 10Goal: Deploy managed LetsEncrypt certs for all public use-cases - https://phabricator.wikimedia.org/T213705 (10Vgutierrez)
[09:26:25] <icinga-wm>	 PROBLEM - SSH on labtestnet2003 is CRITICAL: connect to address 10.192.20.12 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:28:47] <icinga-wm>	 RECOVERY - SSH on labtestnet2003 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:37:51] <wikibugs>	 10Operations, 10DBA, 10MediaWiki-Configuration, 10Patch-For-Review, and 2 others: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126 (10Joe)
[09:38:36] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "sorry for that" [puppet] - 10https://gerrit.wikimedia.org/r/500642 (https://phabricator.wikimedia.org/T215407) (owner: 10Andrew Bogott)
[09:39:33] <wikibugs>	 10Operations, 10Puppet: Some jessie instances upset about rsyslog package - https://phabricator.wikimedia.org/T219764 (10MoritzMuehlenhoff) Running the steps from the prerm on a jessie system with 8.38 works fine:   ` jmm@alsafi:~$ sudo systemctl stop syslog.socket jmm@alsafi:~$ sudo invoke-rc.d rsyslog stop j...
[09:41:59] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: wikitech: Lock LDAP accounts when users are blocked [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497866 (https://phabricator.wikimedia.org/T168692) (owner: 10BryanDavis)
[09:45:03] <wikibugs>	 (03PS6) 10Jcrespo: mariadb-backups: Allow remote dumps from cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/499997 (https://phabricator.wikimedia.org/T206203)
[09:45:05] <wikibugs>	 (03PS1) 10Jcrespo: mariadb-backups: Setup dbprov2002 [puppet] - 10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336)
[09:46:51] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (10aborrero)
[09:47:06] <icinga-wm>	 PROBLEM - Long running screen/tmux on labtestnet2003 is CRITICAL: connect to address 10.192.20.12 port 5666: Connection refused
[09:48:09] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (10aborrero)
[09:49:10] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:49:35] <elukey>	 mmmmm
[09:50:03] <elukey>	 today is not the best :D
[09:50:08] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:50:24] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): Server rename: labtestnet2003 to cloudnet2003-dev, update label and switch ports descriptions, etc - https://phabricator.wikimedia.org/T219861 (10aborrero)
[09:50:52] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (10aborrero)
[09:50:54] <wikibugs>	 10Operations, 10ops-codfw: Broken disk on ms-be2026 - https://phabricator.wikimedia.org/T219854 (10Volans) So the `dsa-check-hpssacli` check is happily returning `0` exit code and this output: ` OK: Slot 0: no logical drives --- Slot 0: no drives ` Given that IIRC we add the HP raid check only on the hosts tha...
[09:51:17] <elukey>	 ah ok a burst of requests waiting to be cached
[09:52:31] <wikibugs>	 (03PS2) 10Ema: ATS: add ats-backend-restart [puppet] - 10https://gerrit.wikimedia.org/r/500675 (https://phabricator.wikimedia.org/T213263)
[09:53:48] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[09:54:00] <icinga-wm>	 RECOVERY - Disk space on cloudcontrol2001-dev is OK: DISK OK
[09:55:32] <icinga-wm>	 PROBLEM - IPMI Sensor Status on labtestnet2003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.20.12: Connection reset by peer
[09:57:58] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[09:59:32] <icinga-wm>	 PROBLEM - NTP on labtestnet2003 is CRITICAL: NTP CRITICAL: No response from NTP server https://wikitech.wikimedia.org/wiki/NTP
[09:59:33] <wikibugs>	 (03PS3) 10Ema: ATS: add ats-backend-restart [puppet] - 10https://gerrit.wikimedia.org/r/500675 (https://phabricator.wikimedia.org/T213263)
[10:01:36] <icinga-wm>	 RECOVERY - dhclient process on labtestnet2003 is OK: PROCS OK: 0 processes with command name dhclient
[10:01:38] <icinga-wm>	 RECOVERY - MD RAID on labtestnet2003 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[10:02:32] <icinga-wm>	 RECOVERY - configured eth on labtestnet2003 is OK: OK - interfaces up
[10:02:34] <icinga-wm>	 RECOVERY - Disk space on labtestnet2003 is OK: DISK OK
[10:03:12] <icinga-wm>	 PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:03:14] <wikibugs>	 (03PS1) 10Volans: RAID: hpssacli exit with correct code [puppet] - 10https://gerrit.wikimedia.org/r/500684 (https://phabricator.wikimedia.org/T219854)
[10:03:58] <wikibugs>	 (CR) Marostegui: mariadb-backups: Setup dbprov2002 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336) (owner: Jcrespo)
[10:04:32] <wikibugs>	 (03PS3) 10Dzahn: k8s:proxy: remove upstart support [puppet] - 10https://gerrit.wikimedia.org/r/499769
[10:04:36] <icinga-wm>	 RECOVERY - DPKG on labtestnet2003 is OK: All packages OK
[10:04:46] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[10:06:59] <wikibugs>	 10Operations, 10Puppet, 10Packaging: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10MoritzMuehlenhoff) I had a look at the missing packages:  catch: It's self-contained and has minimal build deps and isn't used anywhere in our fleet, I think we can simply import the...
[10:08:22] <elukey>	 !log manually purge varnishkafka graphite alert's URL as an attempt to avoid a flapping alert - T219842
[10:08:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:25] <stashbot>	 T219842: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842
[10:09:18] <wikibugs>	 (03PS1) 10Mathew.onipe: icinga: align elastic base and nrpe check titles [puppet] - 10https://gerrit.wikimedia.org/r/500686
[10:09:38] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[10:10:00] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudnet2003-dev.codfw.wmnet'] `  and were **ALL*...
[10:10:26] <icinga-wm>	 RECOVERY - puppet last run on labtestnet2003 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[10:10:28] <wikibugs>	 (03PS2) 10Mathew.onipe: icinga: align elastic base and nrpe check titles [puppet] - 10https://gerrit.wikimedia.org/r/500686
[10:14:33] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] ATS: add ats-backend-restart [puppet] - 10https://gerrit.wikimedia.org/r/500675 (https://phabricator.wikimedia.org/T213263) (owner: 10Ema)
[10:15:58] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[10:18:32] <wikibugs>	 (03PS2) 10Volans: RAID: hpssacli exit with correct code [puppet] - 10https://gerrit.wikimedia.org/r/500684 (https://phabricator.wikimedia.org/T219854)
[10:19:28] <wikibugs>	 10Operations, 10Traffic, 10Wikidata, 10serviceops, and 3 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (10Dzahn)
[10:20:34] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] k8s:proxy: remove upstart support [puppet] - 10https://gerrit.wikimedia.org/r/499769 (owner: 10Dzahn)
[10:25:30] <wikibugs>	 (03Abandoned) 10Muehlenhoff: kube-proxy: Remove support for Ubuntu/trusty [puppet] - 10https://gerrit.wikimedia.org/r/500404 (owner: 10Muehlenhoff)
[10:25:36] <icinga-wm>	 RECOVERY - IPMI Sensor Status on labtestnet2003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[10:25:38] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: serverpackages: mitaka: stretch: additional pinning fixes [puppet] - 10https://gerrit.wikimedia.org/r/500691 (https://phabricator.wikimedia.org/T215407)
[10:28:55] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "pcc OK https://puppet-compiler.wmflabs.org/compiler1002/15489/" [puppet] - 10https://gerrit.wikimedia.org/r/500691 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez)
[10:29:18] <wikibugs>	 (03PS1) 10Greta WMDE: Increase musical notation datatype string length limit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500692 (https://phabricator.wikimedia.org/T218767)
[10:30:30] <jbond42>	 !log add debhelper 10.2.5 and dh-systemd 10.2.5 to jessie-wikimedia/backports
[10:30:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:58] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[10:33:46] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: serverpackages: drop file cleanup declarations [puppet] - 10https://gerrit.wikimedia.org/r/500694
[10:35:14] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: serverpackages: drop file cleanup declarations [puppet] - 10https://gerrit.wikimedia.org/r/500694 (owner: 10Arturo Borrero Gonzalez)
[10:35:54] <icinga-wm>	 PROBLEM - puppet last run on cloudvirt1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:36:52] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "Nice, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/500615 (owner: 10Alex Monk)
[10:36:59] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: service::node: Only try to define node10 repository if it is not already defined [puppet] - 10https://gerrit.wikimedia.org/r/500615 (owner: 10Alex Monk)
[10:37:44] <icinga-wm>	 PROBLEM - puppet last run on cloudnet2002-dev is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:38:14] <icinga-wm>	 PROBLEM - puppet last run on cloudnet1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:39:06] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:39:08] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:39:10] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:39:14] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:39:18] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:39:25] <jbond42>	 !log add dh-autoreconf 12 to jessie-wikimedia/backports
[10:39:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:34] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1006.eqiad.wmnet, druid1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:39:44] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:39:51] <_joe_>	 uh what's up with aqs?
[10:39:55] <_joe_>	 elukey: ^^
[10:40:00] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:40:05] <marostegui>	 oh that wikitech page is empty :(
[10:40:12] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[10:40:20] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[10:40:21] <elukey>	 yeah I think it is somebody using AQS -> Druid to gather edit data
[10:40:44] <wikibugs>	 (03PS1) 10Dzahn: wikiba.se: add Apache rewrites for www to naked domain [puppet] - 10https://gerrit.wikimedia.org/r/500695 (https://phabricator.wikimedia.org/T99531)
[10:40:47] <elukey>	 yes see broker metrics in https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&orgId=1
[10:40:52] <mutante>	 marostegui: should be https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS  will fix url
[10:40:56] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1005.eqiad.wmnet, druid1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[10:41:02] <marostegui>	 mutante: thanks!
[10:41:58] <elukey>	 might be an expensive query, going to check
[10:42:16] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: neutron-common: fix relationship with sqlite3 [puppet] - 10https://gerrit.wikimedia.org/r/500696
[10:42:38] <jbond42>	 !log add strip-nondeterminism 0.034 to jessie-wikimedia/backports
[10:42:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:44] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1423 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[10:43:03] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: neutron-common: fix relationship with sqlite3 [puppet] - 10https://gerrit.wikimedia.org/r/500696 (owner: 10Arturo Borrero Gonzalez)
[10:43:10] <mutante>	 eh..  cant find the check in puppet.. odd.. keep looking though
[10:43:18] <icinga-wm>	 PROBLEM - puppet last run on cloudvirt1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:43:36] <icinga-wm>	 PROBLEM - puppet last run on cloudnet1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:44:06] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:44:06] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1005.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[10:44:38] <mutante>	 hosts unknown to pybal ?
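(editor's note) The alert wording "Hosts in IPVS but unknown to PyBal: set([...])" reads like a set difference between two host lists. A minimal sketch of that idea follows, assuming the check simply compares the hosts present in the IPVS table with the hosts PyBal knows about. The function name and the sample host lists are hypothetical and are not taken from PyBal or the Icinga check itself.

```python
def hosts_unknown_to_pybal(ipvs_hosts, pybal_hosts):
    """Return hosts present in IPVS but absent from PyBal's view, sorted."""
    # Set difference: everything in IPVS that PyBal does not list.
    return sorted(set(ipvs_hosts) - set(pybal_hosts))


# Illustrative inputs only (assumed, not read from a real load balancer).
ipvs = [
    "druid1004.eqiad.wmnet",
    "druid1005.eqiad.wmnet",
    "druid1006.eqiad.wmnet",
]
pybal = ["druid1006.eqiad.wmnet"]

print(hosts_unknown_to_pybal(ipvs, pybal))
```

A non-empty result would correspond to the CRITICAL state seen above; an empty list would correspond to "no difference between hosts in IPVS/PyBal".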
[10:45:04] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:45:10] <mutante>	 here it is:  https://config-master.wikimedia.org/pybal/eqiad/aqs
[10:45:22] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:45:35] <mutante>	 no, wrong one.. but here anyways: https://config-master.wikimedia.org/pybal/eqiad/druid-public-broker
[10:45:35] <_joe_>	 druid, not aqs
[10:46:20] <icinga-wm>	 PROBLEM - puppet last run on cloudvirt1025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:46:46] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[10:46:54] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:46:55] <elukey>	 https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&panelId=43&fullscreen&orgId=1&from=now-24h&to=now
[10:46:58] <elukey>	 this is the issue
[10:47:04] <icinga-wm>	 PROBLEM - puppet last run on cloudvirt2001-dev is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:47:34] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[10:47:37] <jbond42>	 !log add catch 1.10 to jessie-wikimedia/backports
[10:47:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:08] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:48:08] <icinga-wm>	 PROBLEM - puppet last run on cloudvirt2003-dev is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:48:58] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:49:11] <mutante>	 the noise from cloudvirt* .. like cloudvirt1025.. seems already over, puppet run fine
[10:49:42] <icinga-wm>	 PROBLEM - puppet last run on cloudvirt2002-dev is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:50:00] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:50:59] <elukey>	 yes I can confirm, big queries coming from probably a bot
[10:51:36] <icinga-wm>	 RECOVERY - puppet last run on cloudvirt1025 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[10:51:44] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1423 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[10:51:58] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:52:36] <wikibugs>	 (03PS3) 10Alaa Sarhan: Add wgScoreLineWidthInches to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498661 (https://phabricator.wikimedia.org/T218191)
[10:52:54] <mutante>	 marostegui: now i see what happened.. that check is a generic endpoints check for all services (as in 'scb') so the URL is built from Services/Monitoring/$name  . will add a redirect in wiki
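(editor's note) The URL construction mutante describes can be sketched as below. This is only an illustration, assuming the documentation link is the wikitech base URL plus a fixed path prefix plus the service name; the constant and function names are hypothetical and do not come from the puppet code.

```python
# Assumed base, matching the links shown in the alerts above.
WIKITECH_BASE = "https://wikitech.wikimedia.org/wiki"


def monitoring_doc_url(service_name: str) -> str:
    """Build the generic per-service documentation link for an endpoints check."""
    return f"{WIKITECH_BASE}/Services/Monitoring/{service_name}"


print(monitoring_doc_url("aqs"))
```

Because the link is derived from the service name rather than chosen per service, a page has to exist (or redirect) at that generated title, which is why a wiki redirect fixes the empty page.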
[10:53:26] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[10:53:36] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:53:44] <wikibugs>	 (03PS7) 10Alaa Sarhan: Add wgMusicalNotationLineWidthInches to labs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498660 (https://phabricator.wikimedia.org/T218191)
[10:53:58] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:54:02] <icinga-wm>	 RECOVERY - puppet last run on cloudnet1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[10:54:24] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:54:30] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:54:38] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[10:56:46] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[10:57:40] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[10:57:41] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: profile::graphite::base: prevent caching of metrics [puppet] - 10https://gerrit.wikimedia.org/r/500700
[10:58:04] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "ah. it was the other way around. naked -> www right now" [puppet] - 10https://gerrit.wikimedia.org/r/500695 (https://phabricator.wikimedia.org/T99531) (owner: 10Dzahn)
[10:58:28] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[10:59:09] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: monitoring: use internal graphite url [puppet] - 10https://gerrit.wikimedia.org/r/500665
[10:59:18] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[11:00:05] <jouncebot>	 addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do European Mid-day SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190402T1100).
[11:00:05] <jouncebot>	 Tulsi, Urbanecm, and Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:10] <Urbanecm>	 Here
[11:00:22] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[11:00:50] <zeljkof>	 Amir1: want to take the entire swat today? (since there are 3 patches, 1 of them yours) :)
[11:01:22] <zeljkof>	 if Amir1 is not around, I can SWAT
[11:01:32] <zeljkof>	 Tulsi: around for swat?
[11:01:39] <mutante>	 @seen Amir1 
[11:01:39] <wm-bot>	 mutante: Amir1 is in here, right now
[11:01:39] <zeljkof>	 Tulsi|Away: around for swat?
[11:01:49] <Amir1>	 sure
[11:01:52] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[11:02:02] <zeljkof>	 Amir1: great, swat is yours then :)
[11:02:20] <icinga-wm>	 RECOVERY - puppet last run on cloudvirt1019 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[11:02:45] <jbond42>	 !log add rapidjson 1.1.0 to jessie-wikimedia/backports
[11:02:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:17] <Amir1>	 Urbanecm: around?
[11:03:24] <Amir1>	 you are
[11:03:24] <Urbanecm>	 Yes Amir1 
[11:03:37] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494188 (https://phabricator.wikimedia.org/T217541) (owner: 10Urbanecm)
[11:03:52] <Amir1>	 does your patch need syncing? it's test, right?
[11:04:02] <icinga-wm>	 RECOVERY - puppet last run on cloudnet2002-dev is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[11:04:17] <Urbanecm>	 yes, but I think if you don't sync it, other devs will complain about unsynced changes, right?
[11:04:23] <Urbanecm>	 (yes, it is a test)
[11:04:31] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: profile::graphite::base: prevent caching of metrics [puppet] - 10https://gerrit.wikimedia.org/r/500700
[11:04:34] <Urbanecm>	 not fully sure about the procedure for getting this merged, though
[11:04:43] <Amir1>	 Urbanecm: we just need to rebase it
[11:04:46] <wikibugs>	 (03Merged) 10jenkins-bot: Test rules reference only existing wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494188 (https://phabricator.wikimedia.org/T217541) (owner: 10Urbanecm)
[11:05:01] <Amir1>	 it's fine, done it a million times before :P
[11:05:17] <Urbanecm>	 Okay then :)
[11:05:25] <Amir1>	 now it's done
[11:05:28] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: serverpackages: stretch: don't install python-sphinx from our repo [puppet] - 10https://gerrit.wikimedia.org/r/500701 (https://phabricator.wikimedia.org/T215407)
[11:05:56] <Urbanecm>	 thanks
[11:06:18] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: serverpackages: stretch: don't install python-sphinx from our repo [puppet] - 10https://gerrit.wikimedia.org/r/500701 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez)
[11:06:20] <wikibugs>	 (03CR) 10jenkins-bot: Test rules reference only existing wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494188 (https://phabricator.wikimedia.org/T217541) (owner: 10Urbanecm)
[11:06:24] <Amir1>	 Tulsi, Tulsi|Away Please ping me when you're around
[11:06:34] <wikibugs>	 (03PS2) 10Ladsgroup: Add the 'urlshortener-manage-url' right and enable it for stewards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499777 (https://phabricator.wikimedia.org/T133109)
[11:07:18] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499777 (https://phabricator.wikimedia.org/T133109) (owner: 10Ladsgroup)
[11:08:14] <wikibugs>	 (03Merged) 10jenkins-bot: Add the 'urlshortener-manage-url' right and enable it for stewards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499777 (https://phabricator.wikimedia.org/T133109) (owner: 10Ladsgroup)
[11:08:42] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1002 is OK: No changes to merge.
[11:09:14] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1001 is OK: No changes to merge.
[11:09:54] <icinga-wm>	 RECOVERY - puppet last run on cloudnet1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:10:32] <mutante>	 Amir1: could i ask you about wikiba.se or the best contact at WMDE
[11:10:35] <wikibugs>	 (03CR) 10Ema: [C: 03+1] profile::graphite::base: prevent caching of metrics [puppet] - 10https://gerrit.wikimedia.org/r/500700 (owner: 10Giuseppe Lavagetto)
[11:10:53] <mutante>	 Amir1: also added some more test URLs for shortener using Unicode chars
[11:11:28] <wikibugs>	 (03CR) 10Mathew.onipe: "PCC output is expected: https://puppet-compiler.wmflabs.org/compiler1002/15494/" [puppet] - 10https://gerrit.wikimedia.org/r/500686 (owner: 10Mathew.onipe)
[11:11:38] <Amir1>	 mutante: sure, if I can't answer your question, I will tell you who can
[11:11:39] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] profile::graphite::base: prevent caching of metrics [puppet] - 10https://gerrit.wikimedia.org/r/500700 (owner: 10Giuseppe Lavagetto)
[11:12:18] <Amir1>	 mutante: yeah but the accepted char set doesn't have those https://github.com/wikimedia/mediawiki-extensions-UrlShortener/blob/master/extension.json#L122 :(((
[11:12:37] <Amir1>	 mutante: btw. Regarding V for Vendetta, that's capital V :P
[11:13:09] <mutante>	 Amir1: currently both wikiba.se and www.wikiba.se work equally, but there are no redirects/rewrites between them. i think we should avoid serving same content from different URLs, so i would rewrite them. but which way around.. do we want to make the www canonical?
[11:13:12] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[11:13:30] <icinga-wm>	 RECOVERY - puppet last run on cloudvirt2001-dev is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[11:13:33] <mutante>	 Amir1: re: accepted char set:  ooh.. ok, well.. it was a test:)   and capital V makes sense, heh
[11:14:07] <Amir1>	 hmm, I can send the lowercase v to wikivoyage 
[11:14:21] <jbond42>	 !log add cmake 3.6.2 to jessie-wikimedia/backports
[11:14:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:24] <wikibugs>	 (03PS19) 10Mathew.onipe: cloudelastic: Add cloudelastic configs [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921)
[11:14:26] <Amir1>	 mutante: hmm, that's a question for our PM I guess
[11:14:31] <Amir1>	 let me ask her
[11:14:33] <wikibugs>	 (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/15496/" [puppet] - 10https://gerrit.wikimedia.org/r/487895 (owner: 10Muehlenhoff)
[11:14:36] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[11:14:39] <akosiaris>	 !log T217715 Update mathoid, citoid, cxserver, eventgate grafana dashboards to use the new recording rules for the quantiles
[11:14:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:43] <stashbot>	 T217715: INCIDENT: k8s@codfw prometheus queries disabled -- very slow to execute some queries - https://phabricator.wikimedia.org/T217715
[11:14:52] <icinga-wm>	 RECOVERY - puppet last run on cloudvirt1020 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[11:15:22] <mutante>	 Amir1: thank you! tell her that is part of moving it to WMF prod finally. we are now unblocked. we have a certificate :)
[11:15:29] <mutante>	 Amir1: or https://en.wiktionary.org/wiki/v  if you dont have Wiktionary yet
[11:15:40] <Amir1>	 yay
[11:15:55] <Amir1>	 we have some wiktionary already :D
[11:15:59] <mutante>	 Amir1: if you hack your /etc/hosts you can already see wikiba.se in wmf prod
[11:16:02] <icinga-wm>	 RECOVERY - puppet last run on cloudvirt2002-dev is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[11:16:15] <mutante>	 Amir1: now just missing the rewrite stuff and HSTS
[11:16:51] <Amir1>	 YESS, I've been waiting for this for years
[11:16:52] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1311 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[11:17:08] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[11:17:21] <mutante>	 Amir1: :) https://phabricator.wikimedia.org/T155359#5077009
[11:17:26] <wikibugs>	 (03CR) 10jenkins-bot: Add the 'urlshortener-manage-url' right and enable it for stewards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499777 (https://phabricator.wikimedia.org/T133109) (owner: 10Ladsgroup)
[11:18:08] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1311 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[11:18:11] <Amir1>	 Just tried it
[11:18:26] <Amir1>	 it's awesome, we should document how to do deployment because I completely forgot it
[11:18:28] <mutante>	 or put "91.198.174.192 wikiba.se" in /etc/hosts
[11:19:44] <icinga-wm>	 RECOVERY - puppet last run on cloudvirt2003-dev is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[11:21:02] <logmsgbot>	 !log ladsgroup@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:499777|Add the urlshortener-manage-url right and enable it for stewards (T133109)]], Part I (duration: 00m 53s)
[11:21:02] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:21:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:16] <stashbot>	 T133109: Add basic abuse prevention to UrlShortener - https://phabricator.wikimedia.org/T133109
[11:22:22] <mutante>	 Amir1: done https://wikitech.wikimedia.org/wiki/Microsites#How_to_deploy
[11:22:26] <logmsgbot>	 !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:499777|Add the urlshortener-manage-url right and enable it for stewards (T133109)]], Part I (duration: 00m 51s)
[11:22:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:05] <Amir1>	 \o/ You're awesome
[11:23:24] <Amir1>	 !log EU SWAT is done
[11:23:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:28] <mutante>	 we could create a new admin group for it to be able to run puppet themselves if needed
[11:30:11] <wikibugs>	 (03CR) 10Jcrespo: "I have not changed the original source hosts, probably the more suitable hosts were not available when this was first deployed." [puppet] - 10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336) (owner: 10Jcrespo)
[11:31:02] <wikibugs>	 10Operations, 10Traffic, 10Wikidata, 10serviceops, and 4 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (10Dzahn) wikiba.se can now be viewed in WMF production by editing the local `/etc/hosts` file with f.e. `91.198.174.192 wikiba.se`  Open issues...
[11:31:24] <wikibugs>	 (03CR) 10Jcrespo: "Of course, before deployment we will need some grant changes, too." [puppet] - 10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336) (owner: 10Jcrespo)
[11:32:11] <Amir1>	 nah, it's fine
[11:32:21] <mutante>	 ok
[11:32:56] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 (10hashar)
[11:33:17] <hashar>	 !log contint1001: cleaning Docker containers #T219850
[11:33:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:21] <stashbot>	 T219850: contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850
[11:34:49] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407)
[11:35:11] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] gerrit: admins: ops -> gerritadmin [puppet] - 10https://gerrit.wikimedia.org/r/498431 (owner: 10Hashar)
[11:35:45] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez)
[11:37:16] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407)
[11:37:36] <wikibugs>	 (03PS2) 10Dzahn: wikiba.se: add Apache rewrites for www to naked domain [puppet] - 10https://gerrit.wikimedia.org/r/500695 (https://phabricator.wikimedia.org/T99531)
[11:37:42] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez)
[11:38:02] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez)
[11:38:54] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407)
[11:39:01] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] "Lemme know when it's ready to go (what's blocking it?) and I 'll merge" [puppet] - 10https://gerrit.wikimedia.org/r/496872 (https://phabricator.wikimedia.org/T204245) (owner: 10Mobrovac)
[11:41:40] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban): contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 (10hashar) 05duplicate→03Open p:05Triage→03Unbreak! a:03hashar That task is valid it is for...
[11:42:05] <moritzm>	 !log restarting parsoid on wtp1025 to pick up openssl update
[11:42:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:44] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[11:42:50] <icinga-wm>	 RECOVERY - keystone admin endpoint port 35357 on cloudcontrol2001-dev is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 0.084 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[11:42:56] <icinga-wm>	 RECOVERY - keystone public endoint port 5000 on cloudcontrol2001-dev is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 757 bytes in 0.084 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[11:43:20] <wikibugs>	 (03PS3) 10Mathew.onipe: icinga: add mediawiki cirrus update lag check [puppet] - 10https://gerrit.wikimedia.org/r/500422 (https://phabricator.wikimedia.org/T219601)
[11:43:22] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/15493/ looks good in the compiler. Clearly much more needs to be done." [puppet] - 10https://gerrit.wikimedia.org/r/500665 (owner: 10Giuseppe Lavagetto)
[11:43:24] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[11:44:04] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[11:44:37] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::graphite::base: prevent caching of metrics [puppet] - 10https://gerrit.wikimedia.org/r/500700 (owner: 10Giuseppe Lavagetto)
[11:44:52] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: profile::graphite::base: prevent caching of metrics [puppet] - 10https://gerrit.wikimedia.org/r/500700
[11:49:22] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2018 is OK: OK - running: The system is fully operational
[11:49:42] <icinga-wm>	 PROBLEM - puppet last run on mc1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:49:52] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[11:50:56] <wikibugs>	 10Operations, 10Puppet, 10Packaging: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10jbond) Ok I hit another road block.   leatherman depends on debhelper 11.  I manually updated debian/compat and debian/control to try and build with debhelper 10 .   The first build l...
[11:51:48] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[11:54:20] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[11:55:29] <wikibugs>	 (03PS3) 10Dzahn: wikiba.se: add Apache rewrites for www to naked domain [puppet] - 10https://gerrit.wikimedia.org/r/500695 (https://phabricator.wikimedia.org/T99531)
[11:55:31] <wikibugs>	 (03PS1) 10Dzahn: wikiba.se: add HSTS header with low max_age [puppet] - 10https://gerrit.wikimedia.org/r/500711 (https://phabricator.wikimedia.org/T99531)
[11:55:44] <Krenair>	 jbond42, hey, is T219803 a part of T184564?
[11:55:47] <stashbot>	 T219803: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803
[11:55:55] <Krenair>	 looks like a puppet 5 thing
[11:56:10] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407)
[11:56:32] <jbond42>	 Krenair: checking
[11:56:50] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez)
[11:57:50] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] "LGTM, assuming that there is nothing that prevents monitoring hosts (like firewall rules etc..) to contact graphite-in.eqiad.wmnet (don't " [puppet] - 10https://gerrit.wikimedia.org/r/500665 (owner: 10Giuseppe Lavagetto)
[11:58:39] <jbond42>	 Krenair: i think they are related, however T184564 is talking more about the server side, i am only concentrating on the client side for now.  puppet 5 is already running on buster systems T219803
[11:58:43] <jbond42>	 T219803
[11:58:55] <wikibugs>	 (03PS5) 10Arturo Borrero Gonzalez: openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407)
[11:58:58] <jbond42>	 ^^ that ticket is about backporting the packages to stretch and jessie
[11:59:37] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez)
[12:00:05] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190402T1200)
[12:00:32] <icinga-wm>	 PROBLEM - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 312 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Help:Toolforge/Monitoring
[12:01:15] <wikibugs>	 (03PS6) 10Arturo Borrero Gonzalez: openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407)
[12:01:25] <wikibugs>	 10Operations, 10Puppet, 10Packaging: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10jbond)
[12:01:26] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[12:01:28] <wikibugs>	 10Operations, 10Puppet: Plan Puppet 5 upgrade - https://phabricator.wikimedia.org/T184564 (10jbond)
[12:01:29] <jynus>	 arturo^ bstorm_ andrewbogott
[12:01:34] <arturo>	 looking
[12:02:27] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez)
[12:02:36] <wikibugs>	 (03CR) 10BBlack: wikiba.se: add Apache rewrites for www to naked domain (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500695 (https://phabricator.wikimedia.org/T99531) (owner: 10Dzahn)
[12:03:05] <arturo>	 weird, I can r/w from NFS in toolforge
[12:03:22] <Zppix>	 arturo: it looks like it's an error from codfw
[12:03:22] <bstorm_>	 The checker must be screwed
[12:03:26] <icinga-wm>	 PROBLEM - puppet last run on db1094 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:03:55] <bstorm_>	 Did the downtime on toolchecker run out, arturo?
[12:03:59] <_joe_>	 bstorm_: look at http://checker.tools.wmflabs.org/nfs/home
[12:04:14] <_joe_>	 there is an error in the response
[12:04:16] <arturo>	 bstorm_: it was downtimed until today?
[12:04:25] <arturo>	 I know bd808 is reworking toolschecker
[12:04:30] <_joe_>	 it's looking for a file that doesn't exists
[12:04:32] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[12:04:32] <bstorm_>	 Yes
[12:04:34] <arturo>	 but not sure what status it's in right now
[12:04:34] <_joe_>	 *exist
[12:04:44] <_joe_>	 I'd go touch that file on NFS
[12:05:01] <_joe_>	 and make it writable only by root
[12:05:14] <icinga-wm>	 ACKNOWLEDGEMENT - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 312 bytes in 0.006 second response time Arturo Borrero Gonzalez looking https://wikitech.wikimedia.org/wiki/Help:Toolforge/Monitoring
[12:05:14] <icinga-wm>	 ACKNOWLEDGEMENT - toolschecker: Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.404 second response time Arturo Borrero Gonzalez looking https://wikitech.wikimedia.org/wiki/Help:Toolforge/Monitoring
[12:05:47] <bstorm_>	 _joe_ toolcheckers are undergoing rewrite tho. They needed trusty 
[12:05:50] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[12:06:01] <Zppix>	 arturo: could it be the fact it's using trusty?
[12:06:09] <bstorm_>	 So they may not be reliable
[12:07:23] <hashar>	 !log contint1001: compressing some MediaWiki debugging logs under /srv/jenkins/builds # T219850
[12:07:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:30] <stashbot>	 T219850: contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850
[12:07:41] <wikibugs>	 (03PS4) 10Dzahn: wikiba.se: add Apache rewrites for www to naked domain [puppet] - 10https://gerrit.wikimedia.org/r/500695 (https://phabricator.wikimedia.org/T99531)
[12:07:58] <icinga-wm>	 PROBLEM - puppet last run on wtp1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:09:01] <bstorm_>	 arturo: I know the downtime was set for this week. I suspect it just ended. trusty grid is down intentionally 
[12:09:27] <arturo>	 ok, downtiming again for... 1 month
[12:09:52] <wikibugs>	 (03PS8) 10Alaa Sarhan: Add wgWikibaseMusicalNotationLineWidthInches to labs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498660 (https://phabricator.wikimedia.org/T218191)
[12:09:55] <bstorm_>	 lol...should be done by then
[12:10:02] <Zppix>	 arturo: hehe now if it would only fix itself right :P
[12:10:20] <arturo>	 Zppix: :-P
[12:10:21] <wikibugs>	 (03PS4) 10Alaa Sarhan: Add wgWikibaseMusicalNotationLineWidthInches to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498661 (https://phabricator.wikimedia.org/T218191)
[12:11:10] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[12:11:17] * bstorm_ falls asleep
[12:11:42] <arturo>	 !log icinga downtime toolschecker for 1 month T219243
[12:11:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:46] <stashbot>	 T219243: Migrate tools-checker system to Stretch - https://phabricator.wikimedia.org/T219243
[12:12:27] <wikibugs>	 (03CR) 10Dzahn: wikiba.se: add Apache rewrites for www to naked domain (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500695 (https://phabricator.wikimedia.org/T99531) (owner: 10Dzahn)
[12:13:12] <Zppix>	 arturo: now 1004 for k8s is failing...
[12:13:39] <Zppix>	 Actually nevermind i misread
[12:13:42] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[12:13:49] <arturo>	 Zppix: ok
[12:14:13] <wikibugs>	 (03PS7) 10Arturo Borrero Gonzalez: openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407)
[12:14:20] <wikibugs>	 (03CR) 10Ema: [C: 03+1] monitoring: use internal graphite url [puppet] - 10https://gerrit.wikimedia.org/r/500665 (owner: 10Giuseppe Lavagetto)
[12:14:56] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez)
[12:15:52] <wikibugs>	 (03PS8) 10Arturo Borrero Gonzalez: openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407)
[12:16:04] <icinga-wm>	 RECOVERY - puppet last run on mc1019 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[12:16:35] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez)
[12:17:18] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[12:17:54] <wikibugs>	 (03PS9) 10Arturo Borrero Gonzalez: openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407)
[12:19:41] <wikibugs>	 (03PS2) 10Jcrespo: mariadb-backups: Setup dbprov2002 [puppet] - 10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336)
[12:19:54] <icinga-wm>	 PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[12:20:36] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[12:20:43] <wikibugs>	 (03CR) 10Gehel: elasticsearch: Retrieve hostname and fqdn from node attributes (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel)
[12:21:10] <icinga-wm>	 RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[12:22:14] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] "Let's deploy and adjust if necessary whatever we might face and we haven't thought of :)" [puppet] - 10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336) (owner: 10Jcrespo)
[12:22:34] <wikibugs>	 (03PS10) 10Arturo Borrero Gonzalez: openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407)
[12:23:46] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[12:25:16] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[12:25:53] <wikibugs>	 (03PS1) 10Dzahn: varnish/trafficserver: add regex to cover www.wikiba.se as well [puppet] - 10https://gerrit.wikimedia.org/r/500715 (https://phabricator.wikimedia.org/T99531)
[12:26:37] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] varnish/trafficserver: add regex to cover www.wikiba.se as well [puppet] - 10https://gerrit.wikimedia.org/r/500715 (https://phabricator.wikimedia.org/T99531) (owner: 10Dzahn)
[12:28:23] <wikibugs>	 (03PS1) 10Vgutierrez: redirects.dat: Remove wikisource.gr [puppet] - 10https://gerrit.wikimedia.org/r/500716 (https://phabricator.wikimedia.org/T213705)
[12:28:28] <wikibugs>	 (03CR) 10BBlack: varnish/trafficserver: add regex to cover www.wikiba.se as well (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500715 (https://phabricator.wikimedia.org/T99531) (owner: 10Dzahn)
[12:29:39] <wikibugs>	 (03PS11) 10Arturo Borrero Gonzalez: openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407)
[12:29:52] <icinga-wm>	 RECOVERY - puppet last run on db1094 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:30:08] <wikibugs>	 (03PS2) 10Dzahn: varnish/trafficserver: add regex to cover www.wikiba.se as well [puppet] - 10https://gerrit.wikimedia.org/r/500715 (https://phabricator.wikimedia.org/T99531)
[12:30:37] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "finally, PCC ok https://puppet-compiler.wmflabs.org/compiler1002/15501/" [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez)
[12:31:18] <wikibugs>	 (03CR) 10Vgutierrez: varnish/trafficserver: add regex to cover www.wikiba.se as well (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500715 (https://phabricator.wikimedia.org/T99531) (owner: 10Dzahn)
[12:31:28] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[12:33:56] <wikibugs>	 (03PS2) 10Vgutierrez: redirects.dat: Remove wikisource.gr [puppet] - 10https://gerrit.wikimedia.org/r/500716 (https://phabricator.wikimedia.org/T213705)
[12:35:18] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[12:37:09] <wikibugs>	 (03CR) 10Mathew.onipe: icinga: add mediawiki cirrus update lag check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500422 (https://phabricator.wikimedia.org/T219601) (owner: 10Mathew.onipe)
[12:38:59] <wikibugs>	 10Operations, 10Domains, 10Traffic, 10serviceops: contact Wikivoyage e. V. and figure out status of wikivoyage-old.org / fix or park broken domain - https://phabricator.wikimedia.org/T219867 (10Dzahn)
[12:39:34] <icinga-wm>	 RECOVERY - puppet last run on wtp1029 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[12:40:53] <wikibugs>	 (03PS3) 10Gehel: icinga: align elastic base and nrpe check titles [puppet] - 10https://gerrit.wikimedia.org/r/500686 (owner: 10Mathew.onipe)
[12:41:11] <wikibugs>	 10Operations, 10Domains, 10Traffic: wicipediacymraeg.org is on clientHold - https://phabricator.wikimedia.org/T219856 (10Dzahn)
[12:41:48] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] icinga: align elastic base and nrpe check titles [puppet] - 10https://gerrit.wikimedia.org/r/500686 (owner: 10Mathew.onipe)
[12:42:05] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/500665 (owner: 10Giuseppe Lavagetto)
[12:42:09] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] monitoring: use internal graphite url [puppet] - 10https://gerrit.wikimedia.org/r/500665 (owner: 10Giuseppe Lavagetto)
[12:42:20] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: monitoring: use internal graphite url [puppet] - 10https://gerrit.wikimedia.org/r/500665
[12:43:53] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudcontrol2001-dev: fix partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/500718 (https://phabricator.wikimedia.org/T219626)
[12:44:24] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: cloudcontrol2001-dev: fix partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/500718 (https://phabricator.wikimedia.org/T219626)
[12:45:42] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudcontrol2001-dev: fix partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/500718 (https://phabricator.wikimedia.org/T219626) (owner: 10Arturo Borrero Gonzalez)
[12:45:51] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[12:46:01] <wikibugs>	 10Operations, 10Domains, 10Traffic: wicipediacymraeg.org is on clientHold - https://phabricator.wikimedia.org/T219856 (10Dzahn) Sent a mail about it to Chuck in legal who handles domain registrations.
[12:46:01] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[12:46:07] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "I will add the appropriate grants to the affected hosts (dbstore2*, db1115* -statistics- and misc hosts) then deploy this." [puppet] - 10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336) (owner: 10Jcrespo)
[12:46:25] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:46:27] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:46:31] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:46:33] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:46:37] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1006.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[12:47:13] <mutante>	 elukey: should we depool them ?^
[12:48:39] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[12:48:51] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1423 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[12:49:05] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 (10hashar) 05Open→03Resolved The jobs running MediaWiki tests no gzip the hu...
[12:49:19] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1005.eqiad.wmnet, druid1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[12:49:23] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (10aborrero) 05Open→03Resolved p:05Triage→03Normal
[12:49:33] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:50:09] <_joe_>	 mutante: no please
[12:50:46] <wikibugs>	 10Operations, 10Puppet, 10Packaging: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10MoritzMuehlenhoff) The later cmake version in combination with the debian/rules file tries to enable position independent ELF files, which doesn't work with libcurl-openssl from stan...
[12:51:03] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[12:51:05] <mutante>	 _joe_: ok
[12:51:15] <wikibugs>	 (03CR) 10Jcrespo: "There is a problem somewhere: https://puppet-compiler.wmflabs.org/compiler1002/15503/" [puppet] - 10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336) (owner: 10Jcrespo)
[12:52:18] <wikibugs>	 (03CR) 10Dzahn: [C: 04-2] "not now, maybe later" [puppet] - 10https://gerrit.wikimedia.org/r/500711 (https://phabricator.wikimedia.org/T99531) (owner: 10Dzahn)
[12:52:35] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:52:37] <wikibugs>	 (03PS20) 10Gehel: cloudelastic: Add cloudelastic configs [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe)
[12:52:45] <wikibugs>	 (03CR) 10Jcrespo: "[ 2019-04-02T12:49:56 ] ERROR: Unable to find facts for host dbprov2001.codfw.wmnet, skipping" [puppet] - 10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336) (owner: 10Jcrespo)
[12:52:51] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:53:12] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] "> [ 2019-04-02T12:49:56 ] ERROR: Unable to find facts for host" [puppet] - 10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336) (owner: 10Jcrespo)
[12:53:37] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:53:40] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel)
[12:53:45] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:54:06] <wikibugs>	 10Operations, 10Puppet, 10Packaging: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10MoritzMuehlenhoff) >>! In T219803#5077313, @MoritzMuehlenhoff wrote: > This makes the build phase work fine (but it's failing in test suite now, but unrelated).  That seems to be a k...
[12:54:17] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[12:54:51] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:54:53] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] cloudelastic: Add cloudelastic configs [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe)
[12:54:58] <wikibugs>	 (03CR) 10Dzahn: "the problem is the host names are new and the compiler does not know them yet, hence the 404s. syncing facts should fix it (https://wikite" [puppet] - 10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336) (owner: 10Jcrespo)
[12:54:59] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[12:55:07] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1423 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[12:55:47] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:56:47] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:56:51] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:57:51] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:58:14] <wikibugs>	 10Operations, 10Domains, 10Traffic, 10serviceops: contact Wikivoyage e. V. and figure out status of wikivoyage-old.org / fix or park broken domain - https://phabricator.wikimedia.org/T219867 (10Peachey88)
[12:59:07] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[12:59:09] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[12:59:29] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[12:59:31] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:00:04] <jouncebot>	 Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190402T1300)
[13:01:09] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:01:16] <wikibugs>	 10Operations, 10puppet-compiler: puppet compiler - puppet facts need refreshing - https://phabricator.wikimedia.org/T110546 (10jcrespo) Don't use the above procedure, I was pointed instead to https://wikitech.wikimedia.org/wiki/Nova_Resource:Puppet-diffs
[13:01:25] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[13:01:29] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1423 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[13:02:45] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:02:57] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] "> the problem is the host names are new and the compiler does not" [puppet] - 10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336) (owner: 10Jcrespo)
[13:03:07] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: uwsgi: allow setting routing rules [puppet] - 10https://gerrit.wikimedia.org/r/500729
[13:03:09] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: graphite: correctly set Cache-control: no-store [puppet] - 10https://gerrit.wikimedia.org/r/500730
[13:03:19] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:03:27] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:03:41] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:03:53] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[13:03:57] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:04:31] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[13:05:58] <wikibugs>	 (03PS1) 10BBlack: Non-chaining CNAMEs experimental option [dns] - 10https://gerrit.wikimedia.org/r/500731
[13:06:16] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Non-chaining CNAMEs experimental option [dns] - 10https://gerrit.wikimedia.org/r/500731 (owner: 10BBlack)
[13:08:16] <wikibugs>	 (03PS14) 10Gehel: elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375
[13:09:41] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[13:12:08] <wikibugs>	 10Operations, 10Domains, 10Traffic, 10serviceops: contact Wikivoyage e. V. and figure out status of wikivoyage-old.org / fix or park broken domain - https://phabricator.wikimedia.org/T219867 (10Dzahn) Sent email to Roland Unger  (http://wikivoyage-ev.org/wiki/Kontakt)
[13:13:20] <wikibugs>	 (03PS15) 10Gehel: elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375
[13:14:43] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[13:20:42] <jynus>	 !log updating puppet compiler facts
[13:20:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:50] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.hosts.downtime
[13:20:50] <logmsgbot>	 !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[13:20:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:09] <mutante>	 Amir1: who's the wikibase PM? Lea?
[13:21:22] <Amir1>	 no Lydia
[13:21:36] <Amir1>	 I think she's afk for meetings and lunch
[13:22:00] <mutante>	 ok! no rush, i was just going to email
[13:23:03] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.hosts.downtime
[13:23:03] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[13:23:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:25] <wikibugs>	 (03PS1) 10Gehel: [WIP] build with maven instead of bazel [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/500733
[13:24:37] <volans>	 !log reboot ms-be2026 to see if that fixes the controller - T219854
[13:24:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:43] <stashbot>	 T219854: Broken disk on ms-be2026 - https://phabricator.wikimedia.org/T219854
[13:26:03] <icinga-wm>	 PROBLEM - Check correctness of the icinga configuration on icinga1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga
[13:26:43] <wikibugs>	 (03PS16) 10Volans: elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel)
[13:26:45] <wikibugs>	 (03PS2) 10Gehel: [WIP] build with maven instead of bazel [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/500733
[13:29:22] <wikibugs>	 (03PS7) 10Jcrespo: mariadb-backups: Allow remote dumps from cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/499997 (https://phabricator.wikimedia.org/T206203)
[13:29:32] <wikibugs>	 (03PS3) 10Jcrespo: mariadb-backups: Setup dbprov2002 [puppet] - 10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336)
[13:29:41] <icinga-wm>	 RECOVERY - swift-container-updater on ms-be2026 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater https://wikitech.wikimedia.org/wiki/Swift
[13:29:57] <wikibugs>	 (03PS1) 10Gehel: Cleanup a few warnings. [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/500737
[13:30:17] <icinga-wm>	 RECOVERY - MD RAID on ms-be2026 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[13:30:23] <icinga-wm>	 RECOVERY - Disk space on ms-be2026 is OK: DISK OK
[13:30:33] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2026 is OK: OK - running: The system is fully operational
[13:31:32] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/15504/" [puppet] - 10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336) (owner: 10Jcrespo)
[13:33:58] <wikibugs>	 (03CR) 10Volans: [C: 03+2] elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel)
[13:34:09] <wikibugs>	 (03CR) 10Gehel: [WIP] build with maven instead of bazel (033 comments) [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/500733 (owner: 10Gehel)
[13:34:31] <icinga-wm>	 RECOVERY - puppet last run on ms-be2026 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[13:39:17] <wikibugs>	 (03Merged) 10jenkins-bot: elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel)
[13:40:28] <wikibugs>	 (03CR) 10jenkins-bot: elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel)
[13:44:16] <volans>	 gehel: Error: Could not find any hostgroup matching 'cloudelastic_eqiad' (config file '/etc/icinga/objects/puppet_hosts.cfg', starting on line 3670)
[13:44:27] <volans>	 icinga config is not happy
[13:44:27] <logmsgbot>	 !log anomie@mwmaint1002 Fixing empty values for 'target_author_actor' in log_search on test wikis and mediawikiwiki for T215525
[13:44:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:33] <stashbot>	 T215525: log_search rows with ls_field='target_author_actor' and empty ls_value are created during actor migration - https://phabricator.wikimedia.org/T215525
[13:45:00] <cdanis>	 how hard would it be to have jenkins test icinga config validity? :D
[13:45:22] <volans>	 impossible given the exported resources :D
[13:45:33] <volans>	 s/impossible/quite hard/
[13:47:11] <wikibugs>	 10Operations, 10ops-codfw, 10Patch-For-Review: Broken disk on ms-be2026 - https://phabricator.wikimedia.org/T219854 (10Volans) After the reboot the host is back up and running, all seems good so far. Keeping open for a bit to see if it holds.
[13:47:18] <wikibugs>	 (03PS1) 10Jbond: jbond home: add user files [puppet] - 10https://gerrit.wikimedia.org/r/500739
[13:47:23] <icinga-wm>	 RECOVERY - Long running screen/tmux on labtestnet2003 is OK: OK: No SCREEN or tmux processes detected.
[13:48:25] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] jbond home: add user files [puppet] - 10https://gerrit.wikimedia.org/r/500739 (owner: 10Jbond)
[13:48:37] <wikibugs>	 (03PS1) 10Andrew Bogott: bootstrap_vz firstboot: run apt-get upgrade before anything else [puppet] - 10https://gerrit.wikimedia.org/r/500740
[13:49:07] <icinga-wm>	 RECOVERY - Check the NTP synchronisation status of timesyncd on ms-be2026 is OK: OK: synced at Tue 2019-04-02 13:49:05 UTC.
[13:51:02] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] bootstrap_vz firstboot: run apt-get upgrade before anything else [puppet] - 10https://gerrit.wikimedia.org/r/500740 (owner: 10Andrew Bogott)
[13:52:40] <wikibugs>	 (03PS1) 10Volans: cloudelastic: add missing monitoring clusters [puppet] - 10https://gerrit.wikimedia.org/r/500742 (https://phabricator.wikimedia.org/T214921)
[13:52:42] <volans>	 gehel: ^^^
[13:54:18] <volans>	 cdanis: but what we could do is to add a check that when a hiera value 'cluster:' is modified it checks that the matching definition exists in the monitoring.yaml file
[13:54:35] <volans>	 not sure how to gather which DC to add there tbh though and adding all of them seems redundant and useless in most cases
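A minimal sketch of the check volans describes, assuming the `cluster:` values from the modified hiera files and the contents of monitoring.yaml have already been parsed into plain Python objects. The function name, the sample data, and the shape of the monitoring mapping (cluster name as top-level key) are all hypothetical; the real file layout may differ, and the open question about which DCs to list is not addressed here.

```python
from typing import Iterable, List, Mapping


def missing_monitoring_clusters(
    hiera_clusters: Iterable[str],
    monitoring: Mapping[str, object],
) -> List[str]:
    """Return the cluster names that have no entry in the monitoring data.

    hiera_clusters: 'cluster:' values found in the modified hiera files.
    monitoring: parsed monitoring.yaml, assumed to be keyed by cluster name.
    """
    # De-duplicate and sort so the report is stable between runs.
    return sorted({name for name in hiera_clusters if name not in monitoring})


# Hypothetical data: in practice both inputs would be loaded from YAML.
changed = ["cloudelastic", "elasticsearch"]
monitoring = {"elasticsearch": {"sites": ["eqiad", "codfw"]}}

missing = missing_monitoring_clusters(changed, monitoring)
if missing:
    print("clusters missing from monitoring data:", ", ".join(missing))
```

A check like this only confirms that a cluster name is known to the monitoring data; it would not by itself catch a missing per-site hostgroup such as 'cloudelastic_eqiad'.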
[13:55:16] <wikibugs>	 (03CR) 10Mathew.onipe: [C: 03+1] cloudelastic: add missing monitoring clusters [puppet] - 10https://gerrit.wikimedia.org/r/500742 (https://phabricator.wikimedia.org/T214921) (owner: 10Volans)
[13:55:31] <wikibugs>	 (03CR) 10Volans: [C: 03+2] cloudelastic: add missing monitoring clusters [puppet] - 10https://gerrit.wikimedia.org/r/500742 (https://phabricator.wikimedia.org/T214921) (owner: 10Volans)
[13:56:31] <onimisionipe>	 volans: sorry, that was related to merging the cloudelastic patch
[13:57:26] <volans>	 yeah I know :)
[13:58:52] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] uwsgi::app: Handle ensure absent [puppet] - 10https://gerrit.wikimedia.org/r/498641 (owner: 10Alex Monk)
[13:58:59] <wikibugs>	 (03PS7) 10Alexandros Kosiaris: uwsgi::app: Handle ensure absent [puppet] - 10https://gerrit.wikimedia.org/r/498641 (owner: 10Alex Monk)
[13:59:01] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] uwsgi::app: Handle ensure absent [puppet] - 10https://gerrit.wikimedia.org/r/498641 (owner: 10Alex Monk)
[13:59:28] <wikibugs>	 (03PS2) 10Zoranzoki21: Add three domains at wgCopyUploadDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500154 (https://phabricator.wikimedia.org/T216886)
[13:59:36] <wikibugs>	 (03PS4) 10Zoranzoki21: Remove namespace 104 from FlaggedRevs configuration for arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500153 (https://phabricator.wikimedia.org/T217507)
[13:59:59] <volans>	 icinga config back to happy (cc gehel, onimisionipe )
[14:00:57] <thib>	 Request from X via cp1082 cp1082, Varnish XID 108983720 Error: 429, Too Many Requests at Tue, 02 Apr 2019 14:00:25 GMT
[14:01:07] <thib>	 when trying to open https://upload.wikimedia.org/wikipedia/commons/thumb/2/20/Coluber_plicatilis_-_1734-1765_-_Print_-_Iconographia_Zoologica_-_Special_Collections_University_of_Amsterdam_-_UBA01_IZ12000206.tif/lossy-page1-1280px-Coluber_plicatilis_-_1734-1765_-_Print_-_Iconographia_Zoologica_-_Special_Collections_University_of_Amsterdam_-_UBA01_IZ12000206.tif.jpg
[14:01:50] <thib>	 or any jpg thumbnail for this tiff file
[14:04:24] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[14:05:58] <wikibugs>	 (03CR) 10Alex Monk: Allow ensure absent in monitoring classes without description/nrpe_command (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/498655 (owner: 10Alex Monk)
[14:06:54] <icinga-wm>	 RECOVERY - Check correctness of the icinga configuration on icinga1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga
[14:09:15] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-2] "Ensure absent in this resource isn't really needed as the equivalent is achieved otherwise." [puppet] - 10https://gerrit.wikimedia.org/r/498645 (owner: 10Alex Monk)
[14:11:01] <wikibugs>	 (03CR) 10Alex Monk: "good point. I'll make a new change to get rid of the existing use of this resource with ensure absent" [puppet] - 10https://gerrit.wikimedia.org/r/498645 (owner: 10Alex Monk)
[14:12:56] <wikibugs>	 (03PS1) 10Alex Monk: profile::puppetdb: Remove ensure absent ferm::service [puppet] - 10https://gerrit.wikimedia.org/r/500747
[14:12:59] <wikibugs>	 (03PS1) 10Andrew Bogott: labs_bootstrapvz firstboot:  do dist-upgrade instead of just apt-get upgrade [puppet] - 10https://gerrit.wikimedia.org/r/500748
[14:13:26] <wikibugs>	 (03Abandoned) 10Alex Monk: ferm::service: Allow ensure absent without proto/port [puppet] - 10https://gerrit.wikimedia.org/r/498645 (owner: 10Alex Monk)
[14:14:51] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] labs_bootstrapvz firstboot:  do dist-upgrade instead of just apt-get upgrade [puppet] - 10https://gerrit.wikimedia.org/r/500748 (owner: 10Andrew Bogott)
[14:21:51] <wikibugs>	 10Operations: add wdoran@wikimedia.org to cpt-leads@wikimedia.org alias - https://phabricator.wikimedia.org/T219875 (10kchapman)
[14:22:58] <wikibugs>	 (03PS2) 10Andrew Bogott: labs_bootstrapvz firstboot:  do dist-upgrade instead of just apt-get upgrade [puppet] - 10https://gerrit.wikimedia.org/r/500748
[14:23:44] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] "works well after applying next patch (I'd put the cleanup before this one tho)" [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/500733 (owner: 10Gehel)
[14:24:02] <wikibugs>	 (03CR) 10Ladsgroup: [C: 04-1] Increase musical notation datatype string length limit (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500692 (https://phabricator.wikimedia.org/T218767) (owner: 10Greta WMDE)
[14:24:08] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] "LGTM, I don't have +2 on this repo" [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/500737 (owner: 10Gehel)
[14:24:46] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] labs_bootstrapvz firstboot:  do dist-upgrade instead of just apt-get upgrade [puppet] - 10https://gerrit.wikimedia.org/r/500748 (owner: 10Andrew Bogott)
[14:24:59] <wikibugs>	 (03PS2) 10Gehel: Cleanup a few warnings. [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/500737
[14:25:01] <wikibugs>	 (03PS3) 10Gehel: [WIP] build with maven instead of bazel [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/500733
[14:25:21] <wikibugs>	 (03CR) 10Gehel: "> Patch Set 2: Code-Review+1" [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/500733 (owner: 10Gehel)
[14:25:33] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] Allow ensure absent in monitoring classes without description/nrpe_command (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/498655 (owner: 10Alex Monk)
[14:26:55] <wikibugs>	 10Operations, 10Maps (Tilerator), 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): investigate tilerator crash on maps eqiad - https://phabricator.wikimedia.org/T204047 (10MSantos)
[14:27:06] <wikibugs>	 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Patch-For-Review: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (10MSantos)
[14:27:09] <wikibugs>	 10Operations, 10Maps: Tilerator crashed on maps200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T219849 (10MSantos)
[14:27:11] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] profile::puppetdb: Remove ensure absent ferm::service [puppet] - 10https://gerrit.wikimedia.org/r/500747 (owner: 10Alex Monk)
[14:27:17] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: profile::puppetdb: Remove ensure absent ferm::service [puppet] - 10https://gerrit.wikimedia.org/r/500747 (owner: 10Alex Monk)
[14:27:28] <wikibugs>	 (03PS2) 10Jbond: jbond home: add user files [puppet] - 10https://gerrit.wikimedia.org/r/500739
[14:29:35] <wikibugs>	 10Operations, 10Maps: Tilerator crashed on maps200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T219849 (10MSantos) @Mathew.onipe this is solved and will be fixed when the stretch migration finishes. It's a known issue with the populate_admin script.
[14:29:56] <wikibugs>	 10Operations, 10Maps (Tilerator): Tilerator crashed on maps200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T219849 (10MSantos)
[14:30:10] <wikibugs>	 10Operations, 10Maps (Tilerator): Tilerator crashed on maps200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T219849 (10MSantos) p:05Triage→03High
[14:30:47] <wikibugs>	 (03PS6) 10Andrew Bogott: labweb: move from 'main' region to 'eqiad1' region [puppet] - 10https://gerrit.wikimedia.org/r/500622
[14:30:56] <wikibugs>	 (03PS6) 10Alex Monk: Allow ensure absent in monitoring classes without description/nrpe_command [puppet] - 10https://gerrit.wikimedia.org/r/498655
[14:31:18] <wikibugs>	 (03PS2) 10Greta WMDE: Increase musical notation datatype string length limit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500692 (https://phabricator.wikimedia.org/T218767)
[14:32:05] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] labweb: move from 'main' region to 'eqiad1' region [puppet] - 10https://gerrit.wikimedia.org/r/500622 (owner: 10Andrew Bogott)
[14:32:51] <wikibugs>	 (03CR) 10Greta WMDE: Increase musical notation datatype string length limit (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500692 (https://phabricator.wikimedia.org/T218767) (owner: 10Greta WMDE)
[14:35:35] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): Server rename: labtestnet2003 to cloudnet2003-dev, update label and switch ports descriptions, etc - https://phabricator.wikimedia.org/T219861 (10Papaul) p:05Triage→03Normal
[14:36:59] <wikibugs>	 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission: cloudnet2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T218025 (10Papaul) a:05Papaul→03RobH @robh there is 1 check box left for this. You can take a look and resolve the task once do...
[14:37:37] <wikibugs>	 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission: labtestvirt200[12].codfw.wmnet - https://phabricator.wikimedia.org/T218023 (10Papaul) 05Open→03Resolved This is complete.
[14:38:48] <wikibugs>	 (03CR) 10Volans: Allow ensure absent in monitoring classes without description/nrpe_command (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/498655 (owner: 10Alex Monk)
[14:40:58] <wikibugs>	 10Operations, 10Prod-Kubernetes, 10Kubernetes: Alert "kubelet operational latencies" - https://phabricator.wikimedia.org/T219696 (10crusnov) Minor suggestion, perhaps we could increase the alert threshold if operation isn't actually affected at these levels. Quite often kubelet will sit on the alert threshol...
[14:42:07] <mutante>	 elukey: this ok to go ahead?  just adding more of those notes URLs   https://gerrit.wikimedia.org/r/c/operations/puppet/+/497273
[14:43:38] <wikibugs>	 (03PS8) 10Jcrespo: mariadb-backups: Allow remote dumps from cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/499997 (https://phabricator.wikimedia.org/T206203)
[14:43:45] <wikibugs>	 (03PS9) 10Jcrespo: mariadb-backups: Allow remote dumps from cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/499997 (https://phabricator.wikimedia.org/T206203)
[14:44:07] <wikibugs>	 (03CR) 10Elukey: hadoop/hue/systemd: add Icinga notes URLs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/497273 (owner: 10Dzahn)
[14:44:23] <elukey>	 mutante: all the analytics ones are good, the systemd one might not be.. since it is a generic class
[14:44:49] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] "Let's go then!" [puppet] - 10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336) (owner: 10Jcrespo)
[14:45:22] <mutante>	 elukey: oh yea, you are absolutely right. let me just remove that one from this patch and think about it in a later one. for other generic ones i used URLs containing $name or something
[14:45:52] <elukey>	 super
[14:46:14] <wikibugs>	 (03CR) 10Dzahn: hadoop/hue/systemd: add Icinga notes URLs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/497273 (owner: 10Dzahn)
[14:47:06] <wikibugs>	 (03PS3) 10Dzahn: hadoop/hue/systemd: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/497273
[14:47:50] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "Minor details, almost there." (0311 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/499032 (owner: 10CRusnov)
[14:48:23] <elukey>	 "Minor" ==> 11 comments
[14:48:32] <elukey>	 :D
[14:48:34] * elukey runs away
[14:48:52] <wikibugs>	 (03PS1) 10Muehlenhoff: Add qemu processes/Ganeti instances to filter_services list of debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/500756 (https://phabricator.wikimedia.org/T135991)
[14:49:25] <wikibugs>	 (03PS2) 10Dzahn: varnish: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/497512
[14:49:55] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] varnish: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/497512 (owner: 10Dzahn)
[14:50:08] <wikibugs>	 (03PS6) 10Andrew Bogott: labpuppetmaster: move from 'main' region to 'eqiad1' region [puppet] - 10https://gerrit.wikimedia.org/r/500640
[14:50:10] <wikibugs>	 (03PS1) 10Andrew Bogott: cloud instance hiera: prepare for removal of 'main' region [puppet] - 10https://gerrit.wikimedia.org/r/500758
[14:50:24] <wikibugs>	 (03PS3) 10Dzahn: varnish: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/497512
[14:50:29] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] varnish: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/497512 (owner: 10Dzahn)
[14:52:50] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] varnish: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/497512 (owner: 10Dzahn)
[14:53:58] <wikibugs>	 (03PS2) 10Andrew Bogott: cloud instance hiera: prepare for removal of 'main' region [puppet] - 10https://gerrit.wikimedia.org/r/500758
[14:54:00] <wikibugs>	 (03PS7) 10Andrew Bogott: labpuppetmaster: move from 'main' region to 'eqiad1' region [puppet] - 10https://gerrit.wikimedia.org/r/500640
[14:54:04] <jbond42>	 !log add leatherman 1.4 to jessie-wikimedia/backports
[14:54:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:59] <wikibugs>	 (03CR) 10Herron: [C: 03+1] profile: do not mutate level for mjolnir (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500099 (https://phabricator.wikimedia.org/T213899) (owner: 10Cwhite)
[14:56:11] <wikibugs>	 (03PS3) 10Andrew Bogott: cloud instance hiera: prepare for removal of 'main' region [puppet] - 10https://gerrit.wikimedia.org/r/500758
[14:56:23] <wikibugs>	 (03PS8) 10Andrew Bogott: labpuppetmaster: move from 'main' region to 'eqiad1' region [puppet] - 10https://gerrit.wikimedia.org/r/500640
[14:57:00] <wikibugs>	 (03PS1) 10Esanders: VE section editing: Enable mobile AB test on all remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500759 (https://phabricator.wikimedia.org/T219564)
[14:57:07] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cloud instance hiera: prepare for removal of 'main' region [puppet] - 10https://gerrit.wikimedia.org/r/500758 (owner: 10Andrew Bogott)
[14:57:12] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[14:58:34] <wikibugs>	 (03PS10) 10Jcrespo: mariadb-backups: Allow remote dumps from cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/499997 (https://phabricator.wikimedia.org/T206203)
[14:58:49] <wikibugs>	 (03PS4) 10Dzahn: hadoop/hue/systemd: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/497273
[15:00:21] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] hadoop/hue/systemd: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/497273 (owner: 10Dzahn)
[15:00:45] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Allow remote dumps from cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/499997 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo)
[15:01:11] <wikibugs>	 (03PS4) 10Jcrespo: mariadb-backups: Setup dbprov2002 [puppet] - 10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336)
[15:01:13] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] labpuppetmaster: move from 'main' region to 'eqiad1' region [puppet] - 10https://gerrit.wikimedia.org/r/500640 (owner: 10Andrew Bogott)
[15:01:16] <wikibugs>	 (03CR) 10Andrew Bogott: "compiler diffs:" [puppet] - 10https://gerrit.wikimedia.org/r/500640 (owner: 10Andrew Bogott)
[15:03:26] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Setup dbprov2002 [puppet] - 10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336) (owner: 10Jcrespo)
[15:03:27] <wikibugs>	 (03PS5) 10Dzahn: hadoop/hue/systemd: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/497273
[15:03:30] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/500756 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[15:03:40] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[15:05:05] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] hadoop/hue/systemd: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/497273 (owner: 10Dzahn)
[15:05:27] <wikibugs>	 (03CR) 10Volans: Netbox module for Spicerack (035 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov)
[15:07:02] <wikibugs>	 (03PS1) 10Zoranzoki21: Enable Draft namespace on srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500761 (https://phabricator.wikimedia.org/T214428)
[15:07:20] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[15:07:55] <moritzm>	 !log stopped/disabled ipmievd on cumin2001
[15:07:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:20] <wikibugs>	 (03Abandoned) 10Arturo Borrero Gonzalez: Revert "network: Allow customisation of cumin list on a per-project basis" [puppet] - 10https://gerrit.wikimedia.org/r/498797 (owner: 10Arturo Borrero Gonzalez)
[15:10:51] <wikibugs>	 (03Abandoned) 10Alex Monk: network::constants: Move hiera calls to the parameters [puppet] - 10https://gerrit.wikimedia.org/r/498796 (owner: 10Alex Monk)
[15:11:17] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Why not call the hiera keys `cumin_master` or something? LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/499355 (owner: 10Alex Monk)
[15:11:30] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[15:13:00] <wikibugs>	 (03PS11) 10Volans: Make the puppetdb backend process primitive types for queries [software/cumin] - 10https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) (owner: 10CRusnov)
[15:13:46] <wikibugs>	 10Operations, 10Traffic, 10Goal, 10HTTPS, 10Patch-For-Review: Create a secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548 (10Vgutierrez) a:03Vgutierrez
[15:14:47] <wikibugs>	 (03CR) 10Jforrester: [C: 03+2] VE section editing: Enable mobile AB test on all remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500759 (https://phabricator.wikimedia.org/T219564) (owner: 10Esanders)
[15:16:23] <wikibugs>	 (03Merged) 10jenkins-bot: VE section editing: Enable mobile AB test on all remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500759 (https://phabricator.wikimedia.org/T219564) (owner: 10Esanders)
[15:16:55] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM. Thanks for working on this." [puppet] - 10https://gerrit.wikimedia.org/r/500640 (owner: 10Andrew Bogott)
[15:21:32] <wikibugs>	 (03CR) 10jenkins-bot: VE section editing: Enable mobile AB test on all remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500759 (https://phabricator.wikimedia.org/T219564) (owner: 10Esanders)
[15:21:39] <logmsgbot>	 !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: VE: Enable mobile section editing A/B test on all remaining wikis T219564 (duration: 00m 51s)
[15:21:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:45] <stashbot>	 T219564: Deploy Section Editing to all wikis - https://phabricator.wikimedia.org/T219564
[15:23:59] <wikibugs>	 10Operations: add wdoran@wikimedia.org to cpt-leads@wikimedia.org alias - https://phabricator.wikimedia.org/T219875 (10Dzahn) 05Open→03Resolved a:03Dzahn done! wdoran@ has been added to cpt-leads@  [master f1c100f] (dzahn) add wdoran@ to cpt-leads@ mail alias (T219875)
[15:28:50] <wikibugs>	 10Operations, 10Office-IT, 10Research, 10Wikimedia-Mailing-lists: Create research-alerts mailing list - https://phabricator.wikimedia.org/T219309 (10Dzahn) Do you want the full features of mailman with user subscription and archives and your own list admins, listinfo page etc?  Or do you just want a simple...
[15:30:15] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Cmjohnson) Network ports have been set up for the servers below and added to cloud-hosts1 vlan.  I need cables cloudvirt1015 and 1024 and will...
[15:31:23] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Cmjohnson)
[15:32:29] <wikibugs>	 (03PS3) 10Jbond: jbond home: add user files [puppet] - 10https://gerrit.wikimedia.org/r/500739
[15:32:52] <jbond42>	 !log add cpp-hocon 0.1.6 to jessie-wikimedia/backports
[15:32:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:57] <wikibugs>	 (03PS1) 10Elukey: admin: remove tbayer from analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/500765 (https://phabricator.wikimedia.org/T178802)
[15:34:55] <wikibugs>	 (03PS1) 10Ayounsi: Revert "Depooling eqsin because of eqsin-codfw link outage" [dns] - 10https://gerrit.wikimedia.org/r/500766
[15:36:05] <wikibugs>	 10Operations, 10Analytics-Kanban, 10SRE-Access-Requests, 10Patch-For-Review: Add Tilman to analytics-admins - https://phabricator.wikimedia.org/T178802 (10elukey) >>! In T178802#5076008, @Tbayer wrote: > @elukey Sure, that totally makes sense! The end of January estimate from T178802#4647106 turned out a b...
[15:36:22] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Revert "Depooling eqsin because of eqsin-codfw link outage" [dns] - 10https://gerrit.wikimedia.org/r/500766 (owner: 10Ayounsi)
[15:36:42] <wikibugs>	 (03PS2) 10Ayounsi: Revert "Depooling eqsin because of eqsin-codfw link outage" [dns] - 10https://gerrit.wikimedia.org/r/500766
[15:36:45] <icinga-wm>	 PROBLEM - BGP status on cr1-eqsin is CRITICAL: Use of uninitialized value duration in numeric gt () at /usr/lib/nagios/plugins/check_bgp line 316. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:37:06] <wikibugs>	 (03CR) 10Nuria: [C: 03+1] "Tbayer to file a ticket if he needs these permits again." [puppet] - 10https://gerrit.wikimedia.org/r/500765 (https://phabricator.wikimedia.org/T178802) (owner: 10Elukey)
[15:37:56] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] admin: remove tbayer from analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/500765 (https://phabricator.wikimedia.org/T178802) (owner: 10Elukey)
[15:39:15] <elukey>	 XioNoX: --^
[15:39:29] <XioNoX>	 seems like that's the check having issue, I just checked the router itself and it's fine
[15:39:29] <elukey>	 there's a weirdness in the check_bgp
[15:39:55] <XioNoX>	 !log repool eqsin - T219847
[15:39:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:59] <stashbot>	 T219847: Outage on the primary codfw-eqsin link - https://phabricator.wikimedia.org/T219847
[15:40:23] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudcontrol2001-dev: use raid1-lvm.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/500768 (https://phabricator.wikimedia.org/T219626)
[15:41:35] <XioNoX>	 great, that script is perl
[15:41:51] <chaomodus>	 yay perl
[15:42:04] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudcontrol2001-dev: use raid1-lvm.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/500768 (https://phabricator.wikimedia.org/T219626) (owner: 10Arturo Borrero Gonzalez)
[15:42:09] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[15:42:32] <mutante>	 perl issues should go to RT :P jk
[15:43:21] <wikibugs>	 10Operations, 10Traffic, 10netops: Outage on the primary codfw-eqsin link - https://phabricator.wikimedia.org/T219847 (10ayounsi) 05Open→03Resolved a:03ayounsi Telia stabilized the situation, " Services should be stable at the moment, hands are off and we are working with the vendor to provide an RFO i...
[15:48:06] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] "Mostly minor style issues. Feel free to disagree!" (037 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov)
[15:48:20] <wikibugs>	 10Operations, 10hardware-requests: Two test hosts for SREs - https://phabricator.wikimedia.org/T214024 (10RobH)
[15:49:13] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[15:49:57] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 56.56 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[15:50:50] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] "minor comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500422 (https://phabricator.wikimedia.org/T219601) (owner: 10Mathew.onipe)
[15:50:52] <icinga-wm>	 ACKNOWLEDGEMENT - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet daniel_zahn https://phabricator.wikimedia.org/T219696 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[15:51:40] <XioNoX>	 the "Varnish traffic drop between 30min ago and now at ulsfo" is normal and due to repooling eqsin
[15:51:57] <jynus>	 I checked, it was so huge, then I saw the SAL :-)
[15:52:58] <mutante>	 !log icinga - re-enabling notifications for scandium. setup task is resolved yet systemd is alerting, should not have been turned off anymore (T201366)
[15:53:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:01] <stashbot>	 T201366: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366
[15:55:25] <herron>	 !log beginning rolling upgrade of codfw ELK cluster to 5.6.15 T219571
[15:55:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:43] <XioNoX>	 seems like that perl error went away, I'll just pretend it never happened
[15:55:46] <mutante>	 !log scandium - systemctl start parsoid-vd  was failed (T201366)
[15:55:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:10] <XioNoX>	 but my guess would be some snmp packets getting lost on the way
[15:56:57] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[15:57:07] <arturo>	 !log T219626 reimaging cloudcontrol2001-dev again
[15:57:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:11] <stashbot>	 T219626: codfw1dev: bootstrap cloudcontrol servers in mitaka/stretch - https://phabricator.wikimedia.org/T219626
[16:00:05] <jouncebot>	 godog and _joe_: How many deployers does it take to do Puppet SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190402T1600).
[16:00:05] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[16:00:23] <mutante>	 !log icinga - schedule (30d) downtime for kubernetes operational latencies alerts (T219696) on kubernetes1004
[16:00:24] <wikibugs>	 (03PS1) 10Mathew.onipe: cloudelastic: allow elastic to bind to public ip [puppet] - 10https://gerrit.wikimedia.org/r/500773 (https://phabricator.wikimedia.org/T214921)
[16:00:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:27] <stashbot>	 T219696: Alert "kubelet operational latencies" - https://phabricator.wikimedia.org/T219696
[16:00:41] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] Allow ensure absent in monitoring classes without description/nrpe_command [puppet] - 10https://gerrit.wikimedia.org/r/498655 (owner: 10Alex Monk)
[16:01:27] <icinga-wm>	 ACKNOWLEDGEMENT - EDAC syslog messages on wtp2013 is CRITICAL: 82.02 ge 4 daniel_zahn still https://phabricator.wikimedia.org/T194174 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2013&var-datasource=codfw+prometheus/ops
[16:01:27] <icinga-wm>	 ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on wtp2013 is CRITICAL: 1155 ge 4 daniel_zahn still https://phabricator.wikimedia.org/T194174 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2013&var-datasource=codfw+prometheus/ops
[16:02:39] <mutante>	 !log T194174 - bump. started alerting again 2 days ago
[16:02:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:46] <stashbot>	 T194174: wtp2013 memory correctable errors - https://phabricator.wikimedia.org/T194174
[16:04:04] <wikibugs>	 (03PS2) 10Gehel: cloudelastic: allow elastic to bind to public ip [puppet] - 10https://gerrit.wikimedia.org/r/500773 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe)
[16:04:06] <wikibugs>	 10Operations, 10ops-eqiad: rack/setup 3 new single cpu spare pool systems - https://phabricator.wikimedia.org/T219890 (10RobH) p:05Triage→03Normal
[16:04:15] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1002/15507/cloudelastic1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/500773 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe)
[16:07:25] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 (10hashar) With compression of mediawiki debug logs, disk usage went down to 287...
[16:09:46] <wikibugs>	 (03CR) 10EBernhardson: [C: 04-1] "This needs to be split into two patches for a clean sync. Likely WikibaseSearchSettings.php needs to be duplicated in the first patch, syn" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954) (owner: 10EBernhardson)
[16:12:05] <XioNoX>	 !log - replacing accepted-prefix-limit with prefix-limit on cr2-eqiad - T211730
[16:12:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:09] <stashbot>	 T211730: Replace accepted-prefix-limit with prefix-limit - https://phabricator.wikimedia.org/T211730
[16:13:48] <icinga-wm>	 PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:17:26] <wikibugs>	 (03PS6) 10EBernhardson: Disable wbcs dispatching query builder on commons (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954)
[16:17:28] <wikibugs>	 (03PS1) 10EBernhardson: Disable wbcs dispatching query builder on commons (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500777 (https://phabricator.wikimedia.org/T218954)
[16:17:30] <wikibugs>	 (03PS1) 10EBernhardson: Disable wbcs dispatching query builder on commons (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500778 (https://phabricator.wikimedia.org/T218954)
[16:18:13] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Disable wbcs dispatching query builder on commons (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954) (owner: 10EBernhardson)
[16:18:22] <wikibugs>	 (03PS7) 10EBernhardson: Disable wbcs dispatching query builder on commons (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954)
[16:18:24] <wikibugs>	 (03PS2) 10EBernhardson: Disable wbcs dispatching query builder on commons (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500777 (https://phabricator.wikimedia.org/T218954)
[16:18:26] <wikibugs>	 (03PS2) 10EBernhardson: Disable wbcs dispatching query builder on commons (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500778 (https://phabricator.wikimedia.org/T218954)
[16:18:30] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Disable wbcs dispatching query builder on commons (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500778 (https://phabricator.wikimedia.org/T218954) (owner: 10EBernhardson)
[16:21:30] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[16:24:14] <icinga-wm>	 PROBLEM - SSH on labservices1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:24:44] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[16:24:46] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: (C)60 le (W)70 le 91.25 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[16:28:50] <icinga-wm>	 RECOVERY - SSH on labservices1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:33:06] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[16:33:42] <icinga-wm>	 PROBLEM - SSH on labservices1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:35:26] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[16:36:55] <XioNoX>	 !log - replacing accepted-prefix-limit with prefix-limit on esams - T211730
[16:37:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:01] <stashbot>	 T211730: Replace accepted-prefix-limit with prefix-limit - https://phabricator.wikimedia.org/T211730
[16:39:14] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[16:39:36] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [restbase/deploy@6026ad1]: Switch to swagger 3 T218218
[16:39:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:39] <stashbot>	 T218218: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218
[16:39:58] <icinga-wm>	 RECOVERY - SSH on labservices1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:40:37] <wikibugs>	 10Operations, 10RESTBase, 10RESTBase-API, 10serviceops, and 3 others: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 (10Pchelolo)
[16:43:14] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[16:43:44] <wikibugs>	 10Operations, 10Patch-For-Review: apt-get update broken on jessie: jessie-updates and jessie-backports removed by Debian - https://phabricator.wikimedia.org/T219333 (10Andrew)
[16:43:54] <icinga-wm>	 PROBLEM - SSH on labservices1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:44:27] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [restbase/deploy@6026ad1]: Switch to swagger 3 T218218 (duration: 04m 52s)
[16:44:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:45] <XioNoX>	 akosiaris: should we downtime those kubernetes alerts or are they useful?
[16:47:03] <mutante>	 i downtimed one of them linking to the ticket ..but there are more hosts
[16:47:22] <XioNoX>	 !log - replacing accepted-prefix-limit with prefix-limit in eqsin - T211730
[16:47:24] <mutante>	 https://phabricator.wikimedia.org/T219696
[16:47:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:47:26] <stashbot>	 T211730: Replace accepted-prefix-limit with prefix-limit - https://phabricator.wikimedia.org/T211730
[16:47:33] <mutante>	 T21969
[16:47:34] <stashbot>	 T21969: Special:WantedCategories shows categories that are "not wanted" or already exist - https://phabricator.wikimedia.org/T21969
[16:47:42] <mutante>	 T219696
[16:47:43] <stashbot>	 T219696: Alert "kubelet operational latencies" - https://phabricator.wikimedia.org/T219696
[16:48:12] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[16:48:35] <chaomodus>	 grump
[16:49:00] <icinga-wm>	 RECOVERY - SSH on labservices1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:50:46] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[16:52:58] <icinga-wm>	 PROBLEM - SSH on labservices1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:56:04] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[16:57:14] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[16:59:16] <wikibugs>	 (03PS2) 10BBlack: Turn on non-chaining CNAMEs experimental option [dns] - 10https://gerrit.wikimedia.org/r/500731 (https://phabricator.wikimedia.org/T208263)
[16:59:38] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Turn on non-chaining CNAMEs experimental option [dns] - 10https://gerrit.wikimedia.org/r/500731 (https://phabricator.wikimedia.org/T208263) (owner: 10BBlack)
[17:00:04] <jouncebot>	 cscott, arlolra, subbu, halfak, and Amir1: Your horoscope predicts another unfortunate Services – Graphoid / Parsoid / Citoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190402T1700).
[17:10:16] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[17:13:19] <wikibugs>	 10Operations, 10netops: Replace accepted-prefix-limit with prefix-limit - https://phabricator.wikimedia.org/T211730 (10ayounsi) 05Open→03Resolved All set, no down or bouncing peers, no mentions of `accepted-prefix-limit` in Rancid
[17:16:06] <wikibugs>	 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10elukey)
[17:16:21] <wikibugs>	 10Operations, 10Analytics, 10hardware-requests: GPU upgrade for stat1005 - https://phabricator.wikimedia.org/T216226 (10elukey) 05Stalled→03Resolved
[17:18:06] <wikibugs>	 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10elukey) 05Stalled→03Open Vega GPU mounted on stat1005, it looks go...
[17:19:22] <wikibugs>	 10Operations, 10monitoring: Stop using public (cached) endpoints for checks on graphite - https://phabricator.wikimedia.org/T219902 (10Joe)
[17:21:17] <wikibugs>	 10Operations, 10Discovery-Search, 10Operations-Software-Development, 10Wikidata, and 2 others: Create a cookbook to copy data between WDQS servers - https://phabricator.wikimedia.org/T213401 (10Mathew.onipe)
[17:22:39] <wikibugs>	 (03PS5) 10Alex Monk: Move cumin_masters out of network::constants into hieradata [puppet] - 10https://gerrit.wikimedia.org/r/499355
[17:27:36] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [restbase/deploy@3dcf328] (dev-cluster): Upgrade swagger to v3, attempt 2, T218218
[17:27:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:40] <stashbot>	 T218218: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218
[17:28:00] <wikibugs>	 10Operations, 10monitoring: Stop using public (cached) endpoints for checks on graphite - https://phabricator.wikimedia.org/T219902 (10Joe) Fun finding: if we eliminate either  the `until=Xmin` or the `from=Xmin` we have in the request url for `check_graphite` we get back `Cache-Control: max-age=120`.  If we d...
[17:29:58] <icinga-wm>	 RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:30:38] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [restbase/deploy@3dcf328] (dev-cluster): Upgrade swagger to v3, attempt 2, T218218 (duration: 03m 02s)
[17:30:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:49] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [restbase/deploy@3dcf328]: Upgrade swagger to v3, attempt 2, T218218
[17:30:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:06] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: host 103.102.166.128, interfaces up: 36, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:35:30] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100%
[17:35:30] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[17:37:02] <icinga-wm>	 RECOVERY - SSH on labservices1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:37:52] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[17:37:59] <volans>	 XioNoX: FYI mr1-eqsin ^^^
[17:38:42] <XioNoX>	 volans: seems like equinix either outage or planned maintenance, no big deal for now, will keep an eye
[17:40:00] <XioNoX>	 volans: yup, planned maintenance - REMINDER - Scheduled Equinix Connect Software Upgrade-SG Metro Area Network Maintenance (SERVICE IMPACTING)-03-APR-2019 [5-185702129870]
[17:40:09] <XioNoX>	 it's april 3rd Singapore time
[17:40:26] <volans>	 ack
[17:40:33] <XioNoX>	 volans: so I guess best thing to do now is to downtime it for the duration of their window
[17:40:43] <XioNoX>	 so it can alert if it's still down afterwards
[17:42:05] <volans>	 given it's already down it will not re-alert after the downtime
[17:42:06] <volans>	 expires
[17:42:12] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 38, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:42:20] <volans>	 doh
[17:42:32] <wikibugs>	 10Operations, 10monitoring: Stop using public (cached) endpoints for checks on graphite - https://phabricator.wikimedia.org/T219902 (10CDanis) For Prometheus, there is just a LVS service IP that goes to local Apache, which on a quick glance does not seem to have any caching modules enabled. Looking at a curl,...
[17:43:02] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[17:43:56] <wikibugs>	 (03PS1) 10Ayounsi: LLDP fact - return correct port information [puppet] - 10https://gerrit.wikimedia.org/r/500795
[17:44:52] <icinga-wm>	 PROBLEM - SSH on labservices1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:45:34] <icinga-wm>	 PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:46:26] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 224.06 ms
[17:46:26] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob is UP: PING OK - Packet loss = 0%, RTA = 232.24 ms
[17:50:04] <icinga-wm>	 RECOVERY - SSH on labservices1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:50:44] <icinga-wm>	 RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:51:36] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [restbase/deploy@3dcf328]: Upgrade swagger to v3, attempt 2, T218218 (duration: 20m 47s)
[17:51:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:51:40] <stashbot>	 T218218: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218
[17:54:02] <icinga-wm>	 PROBLEM - SSH on labservices1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:54:40] <icinga-wm>	 PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:55:56] <wikibugs>	 (03Abandoned) 10Andrew Bogott: Rename labvirt1008 to cloudvirt1008 [puppet] - 10https://gerrit.wikimedia.org/r/499316 (owner: 10Andrew Bogott)
[17:56:04] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [restbase/deploy@2cb53a7] (dev-cluster): Kafka logging pipeline, dev cluster only T211125
[17:56:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:56:08] <stashbot>	 T211125: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125
[17:56:31] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: clientpackages: fix missing deb repo installation [puppet] - 10https://gerrit.wikimedia.org/r/500797
[17:57:39] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] openstack: clientpackages: fix missing deb repo installation [puppet] - 10https://gerrit.wikimedia.org/r/500797 (owner: 10Arturo Borrero Gonzalez)
[17:59:10] <icinga-wm>	 RECOVERY - SSH on labservices1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:59:29] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [restbase/deploy@2cb53a7] (dev-cluster): Kafka logging pipeline, dev cluster only T211125 (duration: 03m 25s)
[17:59:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:00:04] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190402T1800)
[18:00:07] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudvirts: update six servers to use 10Gb nics [puppet] - 10https://gerrit.wikimedia.org/r/500799 (https://phabricator.wikimedia.org/T216195)
[18:00:12] <icinga-wm>	 PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[18:02:08] <wikibugs>	 (03PS2) 10Andrew Bogott: cloudvirts: update six servers to use 10Gb nics [puppet] - 10https://gerrit.wikimedia.org/r/500799 (https://phabricator.wikimedia.org/T216195)
[18:02:28] <icinga-wm>	 RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:02:47] <wikibugs>	 (03CR) 10Bstorm: [C: 03+1] "It seems like a more sensible and explicit approach, once the linter is happy.   As long as a sane default ends up somewhere for cloud VPS" [puppet] - 10https://gerrit.wikimedia.org/r/500797 (owner: 10Arturo Borrero Gonzalez)
[18:03:06] <icinga-wm>	 PROBLEM - SSH on labservices1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:04:08] <wikibugs>	 10Operations, 10RESTBase, 10RESTBase-API, 10serviceops, and 3 others: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 (10Pchelolo) The new UI has been deployed. Next step here - explore the new features in openAPI 3.0, see what we can start using,...
[18:04:12] <wikibugs>	 (03CR) 10Andrew Bogott: openstack: clientpackages: fix missing deb repo installation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500797 (owner: 10Arturo Borrero Gonzalez)
[18:04:31] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cloudvirts: update six servers to use 10Gb nics [puppet] - 10https://gerrit.wikimedia.org/r/500799 (https://phabricator.wikimedia.org/T216195) (owner: 10Andrew Bogott)
[18:05:12] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "PCC fails https://puppet-compiler.wmflabs.org/compiler1002/15508/" [puppet] - 10https://gerrit.wikimedia.org/r/500797 (owner: 10Arturo Borrero Gonzalez)
[18:06:23] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [restbase/deploy@2cb53a7]: Kafka logging pipeline, canary on restbase2010 T211125
[18:06:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:06:26] <stashbot>	 T211125: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125
[18:08:45] <wikibugs>	 (03PS1) 10Andrew Bogott: Rename labvirt1008 to cloudvirt1008 [dns] - 10https://gerrit.wikimedia.org/r/500800 (https://phabricator.wikimedia.org/T216661)
[18:08:56] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [restbase/deploy@2cb53a7]: Kafka logging pipeline, canary on restbase2010 T211125 (duration: 02m 33s)
[18:08:58] <icinga-wm>	 PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:08:59] <wikibugs>	 (03PS2) 10Andrew Bogott: Rename labvirt1008 to cloudvirt1008 [dns] - 10https://gerrit.wikimedia.org/r/500800 (https://phabricator.wikimedia.org/T216661)
[18:09:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:09:40] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Rename labvirt1008 to cloudvirt1008 [dns] - 10https://gerrit.wikimedia.org/r/500800 (https://phabricator.wikimedia.org/T216661) (owner: 10Andrew Bogott)
[18:10:22] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: openstack: clientpackages: fix missing deb repo installation [puppet] - 10https://gerrit.wikimedia.org/r/500797
[18:10:28] <wikibugs>	 (03PS1) 10Bstorm: cloudstore: start refactor for role switch up around the labstores [puppet] - 10https://gerrit.wikimedia.org/r/500801 (https://phabricator.wikimedia.org/T209527)
[18:11:44] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cloudstore: start refactor for role switch up around the labstores [puppet] - 10https://gerrit.wikimedia.org/r/500801 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm)
[18:18:04] <icinga-wm>	 RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:20:09] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [restbase/deploy@2cb53a7]: Kafka logging pipeline, full deploy T211125
[18:20:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:12] <logmsgbot>	 !log ppchelko@deploy1001 deploy aborted: Kafka logging pipeline, full deploy T211125 (duration: 00m 03s)
[18:20:13] <stashbot>	 T211125: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125
[18:20:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:22] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [restbase/deploy@2cb53a7]: Kafka logging pipeline, full deploy T211125
[18:20:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:58] <icinga-wm>	 PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:22:01] <marxarelli>	 !log cutting mediawiki branch 1.33.0-wmf.24
[18:22:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:22:28] <marxarelli>	 !log cutting mediawiki branch 1.33.0-wmf.24 (T206678)
[18:22:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:22:31] <stashbot>	 T206678: 1.33.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T206678
[18:23:45] <wikibugs>	 (03CR) 10Bstorm: "Looks legit, though I think the compiler one should go and one other change." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) (owner: 10BryanDavis)
[18:25:01] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review, 10Wikimedia-Incident: INCIDENT: k8s@codfw prometheus queries disabled -- very slow to execute some queries - https://phabricator.wikimedia.org/T217715 (10CDanis) Filippo, did you decide r494685 wasn't necessary?
[18:26:02] <icinga-wm>	 RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[18:33:40] <icinga-wm>	 RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:35:23] <wikibugs>	 10Operations, 10Puppet, 10Packaging: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10jbond) thanks to moritz all dependencies have been built but now getting the following error while building facter  ` root@boron:/tmp/buildd/facter-3.11.0# /usr/bin/c++  -DBOOST_ALL_...
[18:38:54] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1283 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[18:38:58] <icinga-wm>	 PROBLEM - Apache HTTP on mw1283 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[18:40:14] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1283 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[18:40:16] <icinga-wm>	 RECOVERY - Apache HTTP on mw1283 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[18:40:50] <icinga-wm>	 RECOVERY - SSH on labservices1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:41:10] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [restbase/deploy@2cb53a7]: Kafka logging pipeline, full deploy T211125 (duration: 20m 49s)
[18:41:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:41:14] <stashbot>	 T211125: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125
[18:44:50] <icinga-wm>	 PROBLEM - SSH on labservices1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:45:28] <icinga-wm>	 PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:46:44] <icinga-wm>	 RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:50:05] <wikibugs>	 (03PS2) 10Bstorm: cloudstore: start refactor for role switch up around the labstores [puppet] - 10https://gerrit.wikimedia.org/r/500801 (https://phabricator.wikimedia.org/T209527)
[18:51:54] <wikibugs>	 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10EBernhardson) hacks abound, but basically:  * Added `deb [arch=amd64]...
[18:51:58] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cloudstore: start refactor for role switch up around the labstores [puppet] - 10https://gerrit.wikimedia.org/r/500801 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm)
[18:55:54] <icinga-wm>	 PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:00:04] <jouncebot>	 marxarelli: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train - Americas version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190402T1900).
[19:03:59] <wikibugs>	 10Operations, 10Phabricator, 10Release-Engineering-Team, 10Wikimedia-Incident: Analyze and amend (if necessary) workflow of user reporting and detecting large regressions/outages - https://phabricator.wikimedia.org/T219589 (10greg) This is a great example of an almost-worst case scenario, sadly.  Things tha...
[19:05:00] <icinga-wm>	 RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:05:20] <chaomodus>	 ^ can confirm that ssh is really slow on that host, however i don't see any particular reason for it
[19:09:24] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10service-runner, 10Core Platform Team Backlog (Later), and 2 others: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10Pchelolo) I have deployed a new pipeline for RESTBase in production and it all looks great. Next step -...
[19:11:21] <wikibugs>	 (03PS1) 10Dduvall: Group0 to 1.33.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500812
[19:12:57] <logmsgbot>	 !log dduvall@deploy1001 Started scap: testwiki to php-1.33.0-wmf.24 and rebuild l10n cache
[19:12:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:15:30] <icinga-wm>	 PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:17:26] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10service-runner, 10Core Platform Team Kanban (Done with CPT), 10Services (done): Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10Pchelolo) a:05holger.knust→03Pchelolo
[19:19:33] <wikibugs>	 10Operations, 10Citoid, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move citoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219919 (10Pchelolo)
[19:19:54] <wikibugs>	 10Operations, 10Citoid, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move citoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219919 (10Pchelolo) a:05Pchelolo→03Mvolz
[19:20:44] <Krinkle>	 marxarelli: Can I push out a patch after you're done to fix a Vector regression?
[19:20:57] <Krinkle>	 Wikipedia article "0" no longer has a <h1> title :)
[19:21:13] <wikibugs>	 10Operations, 10CX-cxserver, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move cxserver logging to new logging pipeline - https://phabricator.wikimedia.org/T219921 (10Pchelolo)
[19:22:04] <icinga-wm>	 RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:22:40] <wikibugs>	 10Operations, 10CX-cxserver, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move cxserver logging to new logging pipeline - https://phabricator.wikimedia.org/T219921 (10Pchelolo)
[19:23:40] <wikibugs>	 10Operations, 10Analytics, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move eventstreams logging to new logging pipeline - https://phabricator.wikimedia.org/T219922 (10Pchelolo)
[19:25:02] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10service-runner, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Move graphoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219923 (10Pchelolo)
[19:26:00] <icinga-wm>	 PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:26:35] <wikibugs>	 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move mobile apps logging to new logging pipeline - https://phabricator.wikimedia.org/T219924 (10Pchelolo)
[19:27:20] <icinga-wm>	 RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:27:53] <wikibugs>	 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move proton logging to new logging pipeline - https://phabricator.wikimedia.org/T219925 (10Pchelolo)
[19:28:59] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10service-runner, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Move recommendation-api logging to new logging pipeline - https://phabricator.wikimedia.org/T219926 (10Pchelolo)
[19:29:40] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:30:15] <wikibugs>	 10Operations, 10Parsoid, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move parsoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219927 (10Pchelolo)
[19:31:18] <icinga-wm>	 PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:32:01] <wikibugs>	 10Operations, 10Analytics, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move AQS logging to new logging pipeline - https://phabricator.wikimedia.org/T219928 (10Pchelolo)
[19:32:19] <marxarelli>	 Krinkle: gah. of course. i'm running a bit late on the train, but the full sync is happening now
[19:32:40] <Krinkle>	 marxarelli: no worries, you're perfectly within the window. :)
[19:33:09] <marxarelli>	 i'll ping ya when i'm done!
[19:33:11] <Krinkle>	 marxarelli: is it alright if I start the gate tests?
[19:33:42] <marxarelli>	 yeah, go for it
[19:33:52] <icinga-wm>	 RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:39:50] <icinga-wm>	 RECOVERY - SSH on labservices1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:42:00] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[19:45:06] <icinga-wm>	 PROBLEM - SSH on labservices1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:51:40] <wikibugs>	 (03CR) 10Hashar: "iirc that followed a discussion with Alexandros, Giuseppe and Faidon on IRC.  I am not sure whose approval we need?" [puppet] - 10https://gerrit.wikimedia.org/r/498431 (owner: 10Hashar)
[19:52:46] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 53.34, 25.00, 16.15
[19:53:02] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 56.82, 27.56, 18.15
[19:53:29] <marxarelli>	 ^ current scap-cdb-rebuild most likely
[19:53:34] <icinga-wm>	 PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:55:06] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[19:55:22] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 19.22, 23.65, 17.19
[19:55:36] <icinga-wm>	 PROBLEM - Disk space on mwdebug2001 is CRITICAL: DISK CRITICAL - free space: / 1464 MB (3% inode=67%)
[19:55:36] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 19.72, 25.18, 18.86
[19:56:56] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[19:57:17] <logmsgbot>	 !log dduvall@deploy1001 Finished scap: testwiki to php-1.33.0-wmf.24 and rebuild l10n cache (duration: 44m 20s)
[19:57:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:58:06] <icinga-wm>	 RECOVERY - SSH on labservices1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:59:51] <wikibugs>	 (03CR) 10Dduvall: [C: 03+2] Group0 to 1.33.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500812 (owner: 10Dduvall)
[20:00:00] <icinga-wm>	 RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:01:02] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudvirt1008/1009: updated hiera settings for stretch and 10Gb [puppet] - 10https://gerrit.wikimedia.org/r/500820
[20:01:04] <wikibugs>	 (03Merged) 10jenkins-bot: Group0 to 1.33.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500812 (owner: 10Dduvall)
[20:01:59] <wikibugs>	 (03CR) 10jenkins-bot: Group0 to 1.33.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500812 (owner: 10Dduvall)
[20:02:04] <icinga-wm>	 PROBLEM - SSH on labservices1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:02:25] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt1008/1009: updated hiera settings for stretch and 10Gb [puppet] - 10https://gerrit.wikimedia.org/r/500820 (owner: 10Andrew Bogott)
[20:03:12] <wikibugs>	 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10Nuria) ping @Miriam @Gilles so they know the status of this.
[20:03:18] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on cloudcontrol2001-dev is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[20:03:36] <icinga-wm>	 PROBLEM - Check systemd state on cloudcontrol2001-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[20:03:44] <icinga-wm>	 PROBLEM - keystone public endoint port 5000 on cloudcontrol2001-dev is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 804 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:03:47] <wikibugs>	 (03PS9) 10Andrew Bogott: labpuppetmaster: move from 'main' region to 'eqiad1' region [puppet] - 10https://gerrit.wikimedia.org/r/500640
[20:04:06] <icinga-wm>	 PROBLEM - keystone admin endpoint port 35357 on cloudcontrol2001-dev is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 805 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:04:30] <icinga-wm>	 PROBLEM - puppet last run on cloudcontrol2001-dev is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 22 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[keystone]
[20:07:40] <logmsgbot>	 !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: Group0 to 1.33.0-wmf.24
[20:07:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:07:56] <icinga-wm>	 PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:08:17] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] labpuppetmaster: move from 'main' region to 'eqiad1' region [puppet] - 10https://gerrit.wikimedia.org/r/500640 (owner: 10Andrew Bogott)
[20:11:50] <icinga-wm>	 RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:12:16] <wikibugs>	 (03PS1) 10Andrew Bogott: cloud cumin: move extra cumin modules to eqiad1 profile [puppet] - 10https://gerrit.wikimedia.org/r/500822
[20:12:16] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10RobH) Please note I've been going through and updating the firmware of the ilom and bios for the following systems:  [x] - cloudvirt1008 [x] - cloudvirt1009 [] -...
[20:13:53] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cloud cumin: move extra cumin modules to eqiad1 profile [puppet] - 10https://gerrit.wikimedia.org/r/500822 (owner: 10Andrew Bogott)
[20:15:10] <wikibugs>	 (03PS1) 10Alex Monk: openstack::monitor::spreadcheck: Use a list of projects [puppet] - 10https://gerrit.wikimedia.org/r/500823
[20:15:12] <wikibugs>	 (03PS1) 10Alex Monk: openstack::monitor::spreadcheck: rm old renaming absent file resources [puppet] - 10https://gerrit.wikimedia.org/r/500824
[20:16:31] <marxarelli>	 !log 1.33.0-wmf.24 successfully deployed to group0. error rates look normal (T206678)
[20:16:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:16:35] <stashbot>	 T206678: 1.33.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T206678
[20:16:47] <marxarelli>	 Krinkle: all done :)
[20:17:09] <Krinkle>	 marxarelli: thx
[20:17:14] * Krinkle stages on mwdebug1002
[20:19:06] <wikibugs>	 (03PS1) 10Alex Monk: openstack::monitor::spreadcheck: add cloudinfra config [puppet] - 10https://gerrit.wikimedia.org/r/500825
[20:19:59] <wikibugs>	 (03PS6) 10CRusnov: Add basic Ganeti RAPI module and tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/499032
[20:20:09] <wikibugs>	 (03CR) 10CRusnov: "minor updates, fixing issues raised. thanks!" (039 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/499032 (owner: 10CRusnov)
[20:20:42] <logmsgbot>	 !log krinkle@deploy1001 Synchronized php-1.33.0-wmf.23/skins/Vector/includes/: I6e04b512d / T219864 (duration: 01m 00s)
[20:20:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:20:45] <stashbot>	 T219864: Articles about zero (0) not displaying title in Vector skin - https://phabricator.wikimedia.org/T219864
[20:21:06] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[20:22:36] <icinga-wm>	 PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[20:23:32] <logmsgbot>	 !log krinkle@deploy1001 Synchronized php-1.33.0-wmf.24/skins/Vector/includes/: I6e04b512d / T219864 (duration: 00m 59s)
[20:23:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:24:23] <wikibugs>	 10Operations, 10Phabricator, 10Release-Engineering-Team, 10Wikimedia-Incident: Analyze and amend (if necessary) workflow of user reporting and detecting large regressions/outages - https://phabricator.wikimedia.org/T219589 (10jcrespo) For example https://www.mediawiki.org/wiki/How_to_report_a_bug is a very...
[20:24:38] * Krinkle done staging on mwdebug1002
[20:24:50] <wikibugs>	 (03PS3) 10Bstorm: cloudstore: start refactor for role switch up around the labstores [puppet] - 10https://gerrit.wikimedia.org/r/500801 (https://phabricator.wikimedia.org/T209527)
[20:24:58] <icinga-wm>	 PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:25:42] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[20:26:23] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cloudstore: start refactor for role switch up around the labstores [puppet] - 10https://gerrit.wikimedia.org/r/500801 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm)
[20:29:26] <icinga-wm>	 PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[20:30:16] <icinga-wm>	 RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 75489 bytes in 0.171 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[20:30:34] <wikibugs>	 (03PS4) 10Bstorm: cloudstore: start refactor for role switch up around the labstores [puppet] - 10https://gerrit.wikimedia.org/r/500801 (https://phabricator.wikimedia.org/T209527)
[20:31:26] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[20:31:32] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[20:32:12] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[20:32:27] <wikibugs>	 (03CR) 10Bstorm: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/500801 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm)
[20:32:46] <icinga-wm>	 RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:34:12] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[20:34:22] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[20:34:40] <icinga-wm>	 RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[20:34:42] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[20:37:33] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] "I can merge this when I have time to watch it apply :)" [puppet] - 10https://gerrit.wikimedia.org/r/500823 (owner: 10Alex Monk)
[20:37:50] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] openstack::monitor::spreadcheck: rm old renaming absent file resources [puppet] - 10https://gerrit.wikimedia.org/r/500824 (owner: 10Alex Monk)
[20:38:05] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] openstack::monitor::spreadcheck: add cloudinfra config [puppet] - 10https://gerrit.wikimedia.org/r/500825 (owner: 10Alex Monk)
[20:38:21] <wikibugs>	 (03PS1) 10Kosta Harlan: (wip) Enable ORES RCFilters for eswikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500831 (https://phabricator.wikimedia.org/T219160)
[20:40:04] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[20:42:00] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[20:42:10] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[20:43:58] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[20:45:58] <icinga-wm>	 PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:47:50] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[20:49:00] <icinga-wm>	 PROBLEM - Disk space on mwdebug2001 is CRITICAL: DISK CRITICAL - free space: / 1463 MB (3% inode=67%)
[20:50:16] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[20:52:47] <wikibugs>	 (03PS5) 10Bstorm: cloudstore: start refactor for role switch up around the labstores [puppet] - 10https://gerrit.wikimedia.org/r/500801 (https://phabricator.wikimedia.org/T209527)
[20:54:41] <andrewbogott>	 !log restarting pdns and pdns-recursor on labservices1001 and 1002 in hopes of getting those machines to act a bit less sluggish
[20:54:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:56:14] <wikibugs>	 (03PS1) 10CRusnov: profile kubernetes node: Adjust latency alert thresholds [puppet] - 10https://gerrit.wikimedia.org/r/500839
[20:57:26] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[20:58:28] <icinga-wm>	 PROBLEM - Recursive DNS on 208.80.154.20 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/DNS
[21:00:06] <icinga-wm>	 RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:00:46] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[21:01:56] <wikibugs>	 (03PS1) 10Alex Monk: labs puppetmaster migration: Puppet role for encapi/labspuppet DB hosts [puppet] - 10https://gerrit.wikimedia.org/r/500844
[21:01:56] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[21:02:14] <icinga-wm>	 PROBLEM - Auth DNS on labservices1002 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:02:58] <icinga-wm>	 PROBLEM - Check for gridmaster host resolution UDP on labservices1002 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:02:58] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] labs puppetmaster migration: Puppet role for encapi/labspuppet DB hosts [puppet] - 10https://gerrit.wikimedia.org/r/500844 (owner: 10Alex Monk)
[21:03:40] <icinga-wm>	 PROBLEM - Check for gridmaster host resolution TCP on labservices1002 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:04:10] <icinga-wm>	 PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:04:40] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[21:04:50] <icinga-wm>	 RECOVERY - Check for gridmaster host resolution TCP on labservices1002 is OK: DNS OK - 0.778 seconds response time (tools-sgegrid-master.tools.eqiad.wmflabs. 60 IN A 172.16.4.197) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:06:36] <wikibugs>	 (03PS6) 10Bstorm: cloudstore: start refactor for role switch up around the labstores [puppet] - 10https://gerrit.wikimedia.org/r/500801 (https://phabricator.wikimedia.org/T209527)
[21:06:47] <wikibugs>	 (03PS2) 10Alex Monk: labs puppetmaster migration: Puppet role for encapi/labspuppet DB hosts [puppet] - 10https://gerrit.wikimedia.org/r/500844
[21:07:54] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] labs puppetmaster migration: Puppet role for encapi/labspuppet DB hosts [puppet] - 10https://gerrit.wikimedia.org/r/500844 (owner: 10Alex Monk)
[21:10:00] <icinga-wm>	 RECOVERY - Recursive DNS on 208.80.154.20 is OK: DNS OK: 0.061 seconds response time. www.wikipedia.org returns 208.80.154.224 https://wikitech.wikimedia.org/wiki/DNS
[21:10:21] <wikibugs>	 (03CR) 10Catrope: [C: 03+1] (wip) Enable ORES RCFilters for eswikiquote (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500831 (https://phabricator.wikimedia.org/T219160) (owner: 10Kosta Harlan)
[21:11:38] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[21:14:06] <wikibugs>	 10Operations, 10ORES, 10Scoring-platform-team: [Epic] Deploy ORES in kubernetes cluster - https://phabricator.wikimedia.org/T182331 (10Halfak)
[21:14:22] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[21:14:32] <icinga-wm>	 RECOVERY - Check for gridmaster host resolution UDP on labservices1002 is OK: DNS OK - 0.019 seconds response time (tools-sgegrid-master.tools.eqiad.wmflabs. 60 IN A 172.16.4.197) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:14:34] <icinga-wm>	 RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:15:08] <icinga-wm>	 RECOVERY - Auth DNS on labservices1002 is OK: DNS OK: 0.014 seconds response time. labs-ns1.wikimedia.org returns https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:15:16] <wikibugs>	 10Operations, 10CX-cxserver, 10Wikimedia-Logstash, 10service-runner, and 3 others: Move cxserver logging to new logging pipeline - https://phabricator.wikimedia.org/T219921 (10KartikMistry) @Pchelolo Added patch. Feel free to fix :)
[21:16:11] <andrewbogott>	 !log rebooting labservices1002
[21:16:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:16:41] <wikibugs>	 (03PS3) 10Alex Monk: labs puppetmaster migration: Puppet role for encapi/labspuppet DB hosts [puppet] - 10https://gerrit.wikimedia.org/r/500844
[21:17:48] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] labs puppetmaster migration: Puppet role for encapi/labspuppet DB hosts [puppet] - 10https://gerrit.wikimedia.org/r/500844 (owner: 10Alex Monk)
[21:19:58] <wikibugs>	 (03PS4) 10Alex Monk: labs puppetmaster migration: Puppet role for encapi/labspuppet DB hosts [puppet] - 10https://gerrit.wikimedia.org/r/500844
[21:22:15] <wikibugs>	 10Operations, 10Core Platform Team, 10MediaWiki-General-or-Unknown, 10PHP 7.2 support: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Jdforrester-WMF) >>! In T219279#5068956, @Joe wrote: > @Anomie so you're suggesting we...
[21:22:45] <wikibugs>	 10Puppet, 10Community-Tech, 10MediaWiki-extensions-PageAssessments, 10Beta-Cluster-reproducible, 10Wikimedia-production-error: extensions/PageAssessments/maintenance/purgeUnusedProjects.php is causing fatals on Beta - https://phabricator.wikimedia.org/T219935 (10MarcoAurelio)
[21:23:17] <wikibugs>	 10Puppet, 10Community-Tech, 10MediaWiki-extensions-PageAssessments, 10Beta-Cluster-reproducible, 10Wikimedia-production-error: extensions/PageAssessments/maintenance/purgeUnusedProjects.php is causing fatals on Beta - https://phabricator.wikimedia.org/T219935 (10MarcoAurelio)
[21:23:44] <wikibugs>	 10Operations, 10Community-Tech, 10MediaWiki-extensions-PageAssessments, 10Performance, 10User-Banyek: Issues with purgeUnusedProjects.php cron job on mwmaint1002  (Fri Oct 26) - https://phabricator.wikimedia.org/T208231 (10MarcoAurelio) This has evolved into fatals: T219935.
[21:24:28] <wikibugs>	 10Operations, 10ORES, 10Scap, 10Scoring-platform-team, and 2 others: [Epic] ORES should use a git large file plugin for storing serialized binaries - https://phabricator.wikimedia.org/T171619 (10Halfak)
[21:25:47] <wikibugs>	 10Puppet, 10Community-Tech, 10MediaWiki-extensions-PageAssessments, 10Beta-Cluster-reproducible, 10Wikimedia-production-error: extensions/PageAssessments/maintenance/purgeUnusedProjects.php is causing fatals on Beta - https://phabricator.wikimedia.org/T219935 (10Niharika) @MarcoAurelio I don't think this...
[21:26:40] <wikibugs>	 (03PS5) 10Alex Monk: labs puppetmaster migration: Puppet role for encapi/labspuppet DB hosts [puppet] - 10https://gerrit.wikimedia.org/r/500844
[21:27:18] <wikibugs>	 10Operations, 10Community-Tech, 10MediaWiki-extensions-PageAssessments, 10Performance, 10User-Banyek: Issues with purgeUnusedProjects.php cron job on mwmaint1002  (Fri Oct 26) - https://phabricator.wikimedia.org/T208231 (10aezell) This is interesting. T219935 seems to indicate that the query is now poorl...
[21:27:25] <andrewbogott>	 !log rebooting labservices1001
[21:27:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:28:12] <wikibugs>	 10Puppet, 10Community-Tech, 10MediaWiki-extensions-PageAssessments, 10Beta-Cluster-reproducible, 10Wikimedia-production-error: extensions/PageAssessments/maintenance/purgeUnusedProjects.php is causing fatals on Beta - https://phabricator.wikimedia.org/T219935 (10MarcoAurelio) https://phabricator.wikimedi...
[21:28:30] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[21:29:24] <wikibugs>	 10Puppet, 10Community-Tech, 10MediaWiki-extensions-PageAssessments, 10Beta-Cluster-reproducible, 10Wikimedia-production-error: extensions/PageAssessments/maintenance/purgeUnusedProjects.php is causing fatals on Beta - https://phabricator.wikimedia.org/T219935 (10MarcoAurelio) >>! In T219935#5079562, @Nih...
[21:29:54] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[21:30:52] <icinga-wm>	 RECOVERY - SSH on labservices1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:33:46] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[21:34:04] <wikibugs>	 10Operations, 10ORES, 10Scoring-platform-team: Investigate memory usage of ORES in kubernetes - https://phabricator.wikimedia.org/T210264 (10Halfak)
[21:36:22] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[21:36:54] <icinga-wm>	 PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[21:37:44] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[21:38:35] <wikibugs>	 Puppet, Community-Tech, MediaWiki-extensions-PageAssessments, Beta-Cluster-reproducible, Wikimedia-production-error: extensions/PageAssessments/maintenance/purgeUnusedProjects.php is causing fatals on Beta - https://phabricator.wikimedia.org/T219935 (aezell) >>! In T219935#5079587, @MarcoAure...
[21:39:37] <wikibugs>	 (PS7) Bstorm: cloudstore: start refactor for role switch up around the labstores [puppet] - https://gerrit.wikimedia.org/r/500801 (https://phabricator.wikimedia.org/T209527)
[21:39:51] <wikibugs>	 Puppet, Community-Tech, MediaWiki-extensions-PageAssessments, Beta-Cluster-reproducible, Wikimedia-production-error: extensions/PageAssessments/maintenance/purgeUnusedProjects.php is causing fatals on Beta - https://phabricator.wikimedia.org/T219935 (Niharika) @aezell Possibly because beta do...
[21:42:32] <wikibugs>	 Puppet, Community-Tech, MediaWiki-extensions-PageAssessments, Beta-Cluster-reproducible, Wikimedia-production-error: extensions/PageAssessments/maintenance/purgeUnusedProjects.php is causing fatals on Beta - https://phabricator.wikimedia.org/T219935 (aezell) I should have a patch shortly.
[21:42:54] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[21:46:12] <wikibugs>	 (PS8) Bstorm: cloudstore: start refactor for role switch up around the labstores [puppet] - https://gerrit.wikimedia.org/r/500801 (https://phabricator.wikimedia.org/T209527)
[21:48:12] <wikibugs>	 (CR) Bstorm: "Ok, now I think this is about ready.  The thing to watch out for is the systemd stuff around maintain-dbusers.  It looks like it's a funct" [puppet] - https://gerrit.wikimedia.org/r/500801 (https://phabricator.wikimedia.org/T209527) (owner: Bstorm)
[21:49:26] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[21:56:22] <icinga-wm>	 RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[22:12:32] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:26:29] <wikibugs>	 Puppet, Community-Tech, MediaWiki-extensions-PageAssessments, Beta-Cluster-reproducible, and 2 others: extensions/PageAssessments/maintenance/purgeUnusedProjects.php is causing fatals on Beta - https://phabricator.wikimedia.org/T219935 (MarcoAurelio) Open→Resolved a:aezell ` maurelio@...
[22:28:06] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:32:12] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:32:19] <wikibugs>	 (PS1) Bstorm: clouddb: add DNS alias for wikilabels.db.svc.eqiad.wmflabs [puppet] - https://gerrit.wikimedia.org/r/500855 (https://phabricator.wikimedia.org/T219563)
[22:33:18] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:35:26] <wikibugs>	 Operations, ops-eqiad, DC-Ops, Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (RobH) Ok, so attempting to load the following on cloudvirt1012 didn't work, when it worked just fine for cloudvirt100[89].  All are the same DL360 gen8 systems.  S...
[22:41:06] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:41:28] <wikibugs>	 (CR) Bstorm: [C: +2] clouddb: add DNS alias for wikilabels.db.svc.eqiad.wmflabs [puppet] - https://gerrit.wikimedia.org/r/500855 (https://phabricator.wikimedia.org/T219563) (owner: Bstorm)
[22:51:02] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:57:14] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:57:30] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[23:00:04] <jouncebot>	 addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190402T2300).
[23:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[23:00:06] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[23:07:50] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[23:13:20] <icinga-wm>	 PROBLEM - SSH on labcontrol1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:15:48] <icinga-wm>	 RECOVERY - SSH on labcontrol1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:17:33] <wikibugs>	 (PS4) Jforrester: Add 'depicts' statements to search index on testcommons [mediawiki-config] - https://gerrit.wikimedia.org/r/500080 (owner: Cparle)
[23:17:38] <wikibugs>	 (CR) Jforrester: [C: +2] Add 'depicts' statements to search index on testcommons [mediawiki-config] - https://gerrit.wikimedia.org/r/500080 (owner: Cparle)
[23:19:18] <wikibugs>	 (Merged) jenkins-bot: Add 'depicts' statements to search index on testcommons [mediawiki-config] - https://gerrit.wikimedia.org/r/500080 (owner: Cparle)
[23:22:08] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[23:22:25] <logmsgbot>	 !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SDC: Add 'depicts' statements to search index on testcommons (duration: 00m 59s)
[23:22:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:23:47] <wikibugs>	 (PS3) Jforrester: Enable extension SandboxLink for rowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/500672 (https://phabricator.wikimedia.org/T219855) (owner: Strainu)
[23:23:50] <wikibugs>	 (CR) Jforrester: [C: +2] Enable extension SandboxLink for rowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/500672 (https://phabricator.wikimedia.org/T219855) (owner: Strainu)
[23:24:56] <wikibugs>	 (Merged) jenkins-bot: Enable extension SandboxLink for rowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/500672 (https://phabricator.wikimedia.org/T219855) (owner: Strainu)
[23:25:41] <wikibugs>	 (CR) jenkins-bot: Add 'depicts' statements to search index on testcommons [mediawiki-config] - https://gerrit.wikimedia.org/r/500080 (owner: Cparle)
[23:25:43] <wikibugs>	 (CR) jenkins-bot: Enable extension SandboxLink for rowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/500672 (https://phabricator.wikimedia.org/T219855) (owner: Strainu)
[23:27:01] <logmsgbot>	 !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT Enable SandboxLink for rowiki T219855 (duration: 00m 56s)
[23:27:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:04] <stashbot>	 T219855: Activate the extension "SandboxLink" for rowiki - https://phabricator.wikimedia.org/T219855
[23:27:20] <wikibugs>	 (PS4) Jforrester: Add new WMCS IP range to $wgRateLimitsExcludedIps [mediawiki-config] - https://gerrit.wikimedia.org/r/482640 (https://phabricator.wikimedia.org/T167432) (owner: Hashar)
[23:27:34] <icinga-wm>	 PROBLEM - SSH on labservices1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:28:04] <wikibugs>	 (CR) Jforrester: [C: +2] Add new WMCS IP range to $wgRateLimitsExcludedIps [mediawiki-config] - https://gerrit.wikimedia.org/r/482640 (https://phabricator.wikimedia.org/T167432) (owner: Hashar)
[23:29:03] <wikibugs>	 (Merged) jenkins-bot: Add new WMCS IP range to $wgRateLimitsExcludedIps [mediawiki-config] - https://gerrit.wikimedia.org/r/482640 (https://phabricator.wikimedia.org/T167432) (owner: Hashar)
[23:30:10] <wikibugs>	 (CR) Jforrester: [C: -2] "Blocked on SRE." [mediawiki-config] - https://gerrit.wikimedia.org/r/482099 (https://phabricator.wikimedia.org/T212865) (owner: Jforrester)
[23:30:15] <wikibugs>	 (CR) Jforrester: [C: -2] "Blocked on SRE." [mediawiki-config] - https://gerrit.wikimedia.org/r/482100 (https://phabricator.wikimedia.org/T212865) (owner: Jforrester)
[23:30:25] <wikibugs>	 (CR) Jforrester: [C: -2] "Blocked on below." [mediawiki-config] - https://gerrit.wikimedia.org/r/482102 (https://phabricator.wikimedia.org/T212865) (owner: Jforrester)
[23:30:26] <logmsgbot>	 !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT Add new WMCS IP range to wgRateLimitsExcludedIps T167432 (duration: 00m 57s)
[23:30:29] <wikibugs>	 (CR) Jforrester: [C: -2] "Blocked on below." [mediawiki-config] - https://gerrit.wikimedia.org/r/482103 (https://phabricator.wikimedia.org/T212865) (owner: Jforrester)
[23:30:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:30:34] <stashbot>	 T167432: Run Wikibase daily browser tests on Jenkins - https://phabricator.wikimedia.org/T167432
[23:30:36] <wikibugs>	 (CR) Jforrester: [C: -2] "Blocked on below." [mediawiki-config] - https://gerrit.wikimedia.org/r/482104 (https://phabricator.wikimedia.org/T212865) (owner: Jforrester)
[23:33:15] <wikibugs>	 (PS6) Jforrester: enwiki: Restrict move-categorypages to +extendedmover/+sysop/+bot [mediawiki-config] - https://gerrit.wikimedia.org/r/499042 (https://phabricator.wikimedia.org/T219261) (owner: DannyS712)
[23:33:59] <wikibugs>	 (CR) jerkins-bot: [V: -1] enwiki: Restrict move-categorypages to +extendedmover/+sysop/+bot [mediawiki-config] - https://gerrit.wikimedia.org/r/499042 (https://phabricator.wikimedia.org/T219261) (owner: DannyS712)
[23:36:44] <wikibugs>	 (CR) jenkins-bot: Add new WMCS IP range to $wgRateLimitsExcludedIps [mediawiki-config] - https://gerrit.wikimedia.org/r/482640 (https://phabricator.wikimedia.org/T167432) (owner: Hashar)
[23:36:55] <wikibugs>	 (PS7) Jforrester: enwiki: Restrict move-categorypages to +extendedmover/+sysop/+bot [mediawiki-config] - https://gerrit.wikimedia.org/r/499042 (https://phabricator.wikimedia.org/T219261) (owner: DannyS712)
[23:37:24] <icinga-wm>	 PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[23:38:21] <wikibugs>	 (PS8) Jforrester: enwiki: Restrict move-categorypages to +extendedmover/+sysop/+bot [mediawiki-config] - https://gerrit.wikimedia.org/r/499042 (https://phabricator.wikimedia.org/T219261) (owner: DannyS712)
[23:40:50] <wikibugs>	 (CR) Jforrester: [C: +2] enwiki: Restrict move-categorypages to +extendedmover/+sysop/+bot [mediawiki-config] - https://gerrit.wikimedia.org/r/499042 (https://phabricator.wikimedia.org/T219261) (owner: DannyS712)
[23:41:55] <wikibugs>	 (Merged) jenkins-bot: enwiki: Restrict move-categorypages to +extendedmover/+sysop/+bot [mediawiki-config] - https://gerrit.wikimedia.org/r/499042 (https://phabricator.wikimedia.org/T219261) (owner: DannyS712)
[23:44:17] <logmsgbot>	 !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT enwiki: Restrict move-categorypages to +extendedmover/+sysop/+bot T219261 (duration: 00m 58s)
[23:44:19] <wikibugs>	 (PS6) Alex Monk: labs puppetmaster migration: Puppet role for encapi/labspuppet DB hosts [puppet] - https://gerrit.wikimedia.org/r/500844
[23:44:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:21] <stashbot>	 T219261: Enwiki configuration: remove move-categorypages from 'user' group - https://phabricator.wikimedia.org/T219261
[23:47:45] <wikibugs>	 (CR) jenkins-bot: enwiki: Restrict move-categorypages to +extendedmover/+sysop/+bot [mediawiki-config] - https://gerrit.wikimedia.org/r/499042 (https://phabricator.wikimedia.org/T219261) (owner: DannyS712)
[23:47:53] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: Broken disk on ms-be2026 - https://phabricator.wikimedia.org/T219854 (Volans) Forgot to mention that during the reboot it printed: ` Slot 3 Port 1 : Smart Array P840 Controller - (4096 MB, V3.56) 14 Logical Drive(s) - Operation Failed  - 1719-Slot 3 Drive Arra...
[23:48:24] <wikibugs>	 (CR) Smalyshev: [C: +1] Disable wbcs dispatching query builder on commons (2/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/500777 (https://phabricator.wikimedia.org/T218954) (owner: EBernhardson)
[23:48:31] <wikibugs>	 (CR) Smalyshev: [C: +1] Disable wbcs dispatching query builder on commons (1/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954) (owner: EBernhardson)
[23:48:51] <wikibugs>	 (CR) Smalyshev: [C: +1] Disable wbcs dispatching query builder on commons (3/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/500778 (https://phabricator.wikimedia.org/T218954) (owner: EBernhardson)
[23:51:16] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[23:51:56] <wikibugs>	 (CR) Smalyshev: [C: -1] Disable wbcs dispatching query builder on commons (1/3) (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954) (owner: EBernhardson)
[23:56:48] <icinga-wm>	 RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational